PERF: unroll take_1d/take_2d inner loops with contig fast path#65295

Open
jbrockmendel wants to merge 2 commits into pandas-dev:main from jbrockmendel:perf-take

Conversation

@jbrockmendel
Member

Summary

Adds a contiguous-stride fast path to take_1d, take_2d_axis0, take_2d_axis1, and take_2d_multi in pandas/_libs/algos_take_helper.pxi.in. When all input arrays have strides == sizeof(dtype) along the relevant axes (the common case), the kernel walks raw typed pointers instead of memoryview byte-offsets — this lets the C compiler drop the runtime stride multiply. The inner gather loop is then manually unrolled 4× so the out-of-order pipeline can overlap independent indexed loads (aarch64 has no SIMD gather; this is the main available lever).

The existing same-dtype 2D axis0 memcpy fast path is preserved.

Motivation

Inspecting the pre-patch disassembly of take_1d_int64_int64 showed the hot loop was 7 scalar instructions per element with a per-iteration mul for the stride multiply — the memoryview indexing expansion *(data + i * strides[0]) defeats auto-vectorization and auto-unrolling. The patched loop is 14 instructions per 4 elements (3.5/elem) with 4 independent indexed loads exposed to the pipeline.

Benchmarks

Kernel microbenchmarks (aarch64, M-series Mac, n=1M unless noted):

| Kernel | Case | Speedup |
|---|---|---|
| take_1d_i64_i64 | no-fill, monotonic | 1.24× |
| take_1d_i64_i64 | fill, monotonic | 1.58× |
| take_1d_i64_i64 | random | 1.27–1.35× |
| take_2d_axis0 | f64→f64, narrow k (< 32 elem) | 3.0× |
| take_2d_axis0 | different dtype (e.g. int32→int64, float32→float64) | 1.9–2.0× |
| take_2d_axis1 | f64→f64, random | 1.2× |
| take_2d_multi | f64→f64 | 1.2–1.3× |

End-to-end (back-to-back against main):

| Operation | Baseline | Patched | Speedup |
|---|---|---|---|
| Series[float64].take, mono, n=1M | 1255 µs | 982 µs | +22% |
| DataFrame[100k×10].take, mono | 656 µs | 439 µs | +33% |
| DataFrame[10k×100].take, mono | 537 µs | 374 µs | +30% |
| DataFrame[1k×1k].take, random | 523 µs | 389 µs | +25% |
| Series[bool_mask], n=1M | 4581 µs | 4083 µs | +11% |
| DataFrame[bool_mask], n=1M | 4271 µs | 3902 µs | +9% |
| Series.reindex, n=1M | 3215 µs | 2729 µs | +15% |

Test plan

  • pandas/tests/test_take.py (86 tests)
  • pandas/tests/series/indexing/test_take.py + pandas/tests/frame/indexing/test_take.py
  • pandas/tests/series/methods/test_reindex.py + pandas/tests/frame/methods/test_reindex.py
  • pandas/tests/indexing/ full sweep (4950 passed)
  • x86 CI — aarch64 has no SIMD gather; on x86 with AVX2+ the same pattern could additionally emit vpgatherdq. Worth confirming wins translate.
  • Wheel size delta (unroll inflates per-dtype instantiation by ~40 lines × ~20 type pairs)

TODO

  • whatsnew entry in v3.1.0.rst (will add after PR number is known)

🤖 Generated with Claude Code

Add a contiguous-stride fast path to take_1d, take_2d_axis0,
take_2d_axis1, and take_2d_multi that walks raw typed pointers
instead of memoryview byte-offsets, dropping the runtime stride
multiply. Manually unroll the inner gather loop 4x so the CPU
can overlap independent indexed loads.

Speeds up Series.take, DataFrame.take, boolean-mask indexing,
and reindex on NumPy-backed dtypes by roughly 10-30% end-to-end
for typical sizes; isolated 1D kernel benchmarks show 1.2-1.6x
and 2D axis0 with narrow columns up to 3x.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label Apr 19, 2026
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jbrockmendel jbrockmendel marked this pull request as ready for review April 19, 2026 22:28
