PERF: unroll take_1d/take_2d inner loops with contig fast path #65295
Open
jbrockmendel wants to merge 2 commits into pandas-dev:main
Add a contiguous-stride fast path to take_1d, take_2d_axis0, take_2d_axis1, and take_2d_multi that walks raw typed pointers instead of memoryview byte-offsets, dropping the runtime stride multiply. Manually unroll the inner gather loop 4x so the CPU can overlap independent indexed loads.
Speeds up Series.take, DataFrame.take, boolean-mask indexing, and reindex on NumPy-backed dtypes by roughly 10-30% end-to-end for typical sizes; isolated 1D kernel benchmarks show 1.2-1.6x and 2D axis0 with narrow columns up to 3x.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds a contiguous-stride fast path to take_1d, take_2d_axis0, take_2d_axis1, and take_2d_multi in pandas/_libs/algos_take_helper.pxi.in. When all input arrays have strides == sizeof(dtype) along the relevant axes (the common case), the kernel walks raw typed pointers instead of memoryview byte-offsets, which lets the C compiler drop the runtime stride multiply. The inner gather loop is then manually unrolled 4× so the out-of-order pipeline can overlap independent indexed loads (aarch64 NEON has no SIMD gather; this is the main available lever).
The existing same-dtype 2D axis0 memcpy fast path is preserved.
Motivation
Inspecting the pre-patch disassembly of take_1d_int64_int64 showed the hot loop was 7 scalar instructions per element, with a per-iteration mul for the stride multiply: the memoryview indexing expansion *(data + i * strides[0]) defeats auto-vectorization and auto-unrolling. The patched loop is 14 instructions per 4 elements (3.5/elem) with 4 independent indexed loads exposed to the pipeline.
Benchmarks
Kernel microbenchmarks (aarch64, M-series Mac, n=1M unless noted):
take_1d_i64_i64
take_1d_i64_i64
take_1d_i64_i64
take_2d_axis0 f64→f64
take_2d_axis0 different-dtype (e.g. int32→int64, float32→float64)
take_2d_axis1 f64→f64
take_2d_multi f64→f64
End-to-end (back-to-back against main):
Series[float64].take mono n=1M
DataFrame[100k×10].take mono
DataFrame[10k×100].take mono
DataFrame[1k×1k].take random
Series[bool_mask] n=1M
DataFrame[bool_mask] n=1M
Series.reindex n=1M
Test plan
pandas/tests/test_take.py (86 tests)
pandas/tests/series/indexing/test_take.py + pandas/tests/frame/indexing/test_take.py
pandas/tests/series/methods/test_reindex.py + pandas/tests/frame/methods/test_reindex.py
pandas/tests/indexing/ full sweep (4950 passed)
vpgatherdq. Worth confirming wins translate.
TODO
v3.1.0.rst (will add after PR number is known)
🤖 Generated with Claude Code
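For reviewers who want a concrete picture of the two code paths described in the Summary, here is a minimal C sketch. It is not the code in algos_take_helper.pxi.in. The function names (sketch_take_1d_i64 and its two helpers) are hypothetical, the dtype is fixed to int64 for brevity, strides are taken in bytes, and an indexer value of -1 is assumed to select the fill value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generic path: every access is a byte-offset computation, mirroring the
   memoryview expansion *(data + i * strides[0]). Strides are in bytes. */
static void sketch_take_1d_strided(const char *values, ptrdiff_t values_stride,
                                   const char *indexer, ptrdiff_t indexer_stride,
                                   char *out, ptrdiff_t out_stride,
                                   ptrdiff_t n, int64_t fill_value)
{
    for (ptrdiff_t i = 0; i < n; i++) {
        int64_t idx = *(const int64_t *)(indexer + i * indexer_stride);
        int64_t *dst = (int64_t *)(out + i * out_stride);
        *dst = (idx == -1) ? fill_value
                           : *(const int64_t *)(values + idx * values_stride);
    }
}

/* Contiguous path: typed pointers, no runtime stride multiply, and a 4x
   unrolled body whose four indexed loads do not depend on each other. */
static void sketch_take_1d_contig(const int64_t *values, const int64_t *indexer,
                                  int64_t *out, ptrdiff_t n, int64_t fill_value)
{
    ptrdiff_t i = 0;
    for (; i + 4 <= n; i += 4) {
        int64_t i0 = indexer[i];
        int64_t i1 = indexer[i + 1];
        int64_t i2 = indexer[i + 2];
        int64_t i3 = indexer[i + 3];
        out[i]     = (i0 == -1) ? fill_value : values[i0];
        out[i + 1] = (i1 == -1) ? fill_value : values[i1];
        out[i + 2] = (i2 == -1) ? fill_value : values[i2];
        out[i + 3] = (i3 == -1) ? fill_value : values[i3];
    }
    /* Scalar tail for the remaining n % 4 elements. */
    for (; i < n; i++) {
        int64_t idx = indexer[i];
        out[i] = (idx == -1) ? fill_value : values[idx];
    }
}

/* Dispatch: take the fast path only when every array has
   stride == sizeof(dtype); otherwise fall back to the strided loop. */
void sketch_take_1d_i64(const int64_t *values, ptrdiff_t values_stride,
                        const int64_t *indexer, ptrdiff_t indexer_stride,
                        int64_t *out, ptrdiff_t out_stride,
                        ptrdiff_t n, int64_t fill_value)
{
    const ptrdiff_t item = (ptrdiff_t)sizeof(int64_t);
    assert(n >= 0);
    if (values_stride == item && indexer_stride == item && out_stride == item) {
        sketch_take_1d_contig(values, indexer, out, n, fill_value);
    } else {
        sketch_take_1d_strided((const char *)values, values_stride,
                               (const char *)indexer, indexer_stride,
                               (char *)out, out_stride, n, fill_value);
    }
}
```

Both paths produce the same output; the only difference is how addresses are formed and how many independent loads each loop iteration issues. Whether a given compiler would unroll the contiguous loop on its own is not something this sketch establishes.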