PERF: unroll take_1d/take_2d inner loops with contig fast path#65295

Open
jbrockmendel wants to merge 2 commits into pandas-dev:main from jbrockmendel:perf-take

Conversation

@jbrockmendel
Member

Summary

Adds a contiguous-stride fast path to take_1d, take_2d_axis0, take_2d_axis1, and take_2d_multi in pandas/_libs/algos_take_helper.pxi.in. When all input arrays have strides == sizeof(dtype) along the relevant axes (the common case), the kernel walks raw typed pointers instead of memoryview byte-offsets — this lets the C compiler drop the runtime stride multiply. The inner gather loop is then manually unrolled 4× so the out-of-order pipeline can overlap independent indexed loads (aarch64 has no SIMD gather; this is the main available lever).

The existing same-dtype 2D axis0 memcpy fast path is preserved.

Motivation

Inspecting the pre-patch disassembly of take_1d_int64_int64 showed the hot loop was 7 scalar instructions per element with a per-iteration mul for the stride multiply — the memoryview indexing expansion *(data + i * strides[0]) defeats auto-vectorization and auto-unrolling. The patched loop is 14 instructions per 4 elements (3.5/elem) with 4 independent indexed loads exposed to the pipeline.

Benchmarks

Kernel microbenchmarks (aarch64, M-series Mac, n=1M unless noted):

| Kernel | Case | Speedup |
|---|---|---|
| take_1d_i64_i64 | no-fill, monotonic | 1.24× |
| take_1d_i64_i64 | fill, monotonic | 1.58× |
| take_1d_i64_i64 | random | 1.27–1.35× |
| take_2d_axis0 | f64→f64, narrow k (< 32 elem) | 3.0× |
| take_2d_axis0 | different dtype (e.g. int32→int64, float32→float64) | 1.9–2.0× |
| take_2d_axis1 | f64→f64, random | 1.2× |
| take_2d_multi | f64→f64 | 1.2–1.3× |

End-to-end (back-to-back against main):

| Operation | Baseline | Patched | Speedup |
|---|---|---|---|
| Series[float64].take, mono, n=1M | 1255 µs | 982 µs | +22% |
| DataFrame[100k×10].take, mono | 656 µs | 439 µs | +33% |
| DataFrame[10k×100].take, mono | 537 µs | 374 µs | +30% |
| DataFrame[1k×1k].take, random | 523 µs | 389 µs | +25% |
| Series[bool_mask], n=1M | 4581 µs | 4083 µs | +11% |
| DataFrame[bool_mask], n=1M | 4271 µs | 3902 µs | +9% |
| Series.reindex, n=1M | 3215 µs | 2729 µs | +15% |

Test plan

  • pandas/tests/test_take.py (86 tests)
  • pandas/tests/series/indexing/test_take.py + pandas/tests/frame/indexing/test_take.py
  • pandas/tests/series/methods/test_reindex.py + pandas/tests/frame/methods/test_reindex.py
  • pandas/tests/indexing/ full sweep (4950 passed)
  • x86 CI — aarch64 has no SIMD gather; on x86 with AVX2+ the same pattern could additionally emit vpgatherdq. Worth confirming wins translate.
  • Wheel size delta (unroll inflates per-dtype instantiation by ~40 lines × ~20 type pairs)

TODO

  • whatsnew entry in v3.1.0.rst (will add after PR number is known)

🤖 Generated with Claude Code

Add a contiguous-stride fast path to take_1d, take_2d_axis0,
take_2d_axis1, and take_2d_multi that walks raw typed pointers
instead of memoryview byte-offsets, dropping the runtime stride
multiply. Manually unroll the inner gather loop 4x so the CPU
can overlap independent indexed loads.

Speeds up Series.take, DataFrame.take, boolean-mask indexing,
and reindex on NumPy-backed dtypes by roughly 10-30% end-to-end
for typical sizes; isolated 1D kernel benchmarks show 1.2-1.6x
and 2D axis0 with narrow columns up to 3x.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label Apr 19, 2026
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jbrockmendel jbrockmendel marked this pull request as ready for review April 19, 2026 22:28
