PERF: avoid materializing set/list in isin for common dtype cases #65284
Draft
jbrockmendel wants to merge 1 commit into pandas-dev:main from
- Add a set fast path in isin for integer/bool comps that tests membership against the set directly, avoiding O(len(values)) materialization (GH#25507).
- Narrow the GH#46485 object-cast trigger to the precision-losing int+uint case; other safe common dtypes now keep the numeric path.
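The first bullet can be illustrated with a minimal pure-Python sketch. It is not the pandas implementation: the function names `isin_materialize` and `isin_set_fast_path` are hypothetical, and plain lists stand in for arrays. The sketch only shows why probing the caller's set directly avoids work proportional to len(values).

```python
from typing import Iterable, List, Set


def isin_materialize(comps: Iterable[int], values: Set[int]) -> List[bool]:
    # Stand-in for the slow route: copy the set into a list and rebuild a
    # lookup table from the copy. Both steps cost O(len(values)) before a
    # single membership test runs.
    materialized = list(values)
    table = set(materialized)
    return [c in table for c in comps]


def isin_set_fast_path(comps: Iterable[int], values: Set[int]) -> List[bool]:
    # Fast-path idea: the caller already passed a hash-based container, so
    # probe it directly. Total cost is O(len(comps)) hash lookups.
    return [c in values for c in comps]


comps = range(100)
values = {i * i for i in range(1000)}

slow = isin_materialize(comps, values)
fast = isin_set_fast_path(comps, values)
```

Both functions return the same mask; only the amount of work done on `values` differs, which matters most when comps is much shorter than values.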
Summary
- Set fast path in algorithms.isin for integer/bool comps: membership is tested directly against the set, skipping an O(len(values)) materialization that previously dominated when comps << values (closes #25507, "pandas.Series.isin() is slow on large sets due to conversion of set to list").
Benchmarks
- GH#25507 repro (Series(range(100)).isin(set_of_1M_squares)): 142 ms → 0.03 ms.
- Series(1M uint8).isin([3 ints]) (item 2): ~100 ms → ~11 ms (numeric path retained instead of falling back to object hashtable).
- GH#46485 correctness regression verified: the original precision-loss case still returns the expected empty result.
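Why the int+uint case still needs the object path can be sketched in plain Python. NumPy's common dtype for int64 and uint64 is float64, which cannot represent every integer near 2**63. The helpers `isin_via_float` and `isin_exact` below are hypothetical stand-ins for "compare after casting to the common float dtype" and "compare exact integers"; they are not pandas functions, and the values are chosen for illustration rather than taken from GH#46485.

```python
def isin_via_float(comps, values):
    # Simulates comparing after a cast to float64 (Python floats are float64).
    table = {float(v) for v in values}
    return [float(c) in table for c in comps]


def isin_exact(comps, values):
    # Simulates the object path: Python ints compare exactly.
    table = set(values)
    return [c in table for c in comps]


int64_max = 2**63 - 1      # largest int64 value
uint64_only = 2**63        # representable as uint64 but not as int64

lossy = isin_via_float([int64_max], [uint64_only])
exact = isin_exact([int64_max], [uint64_only])
```

Here `lossy` reports a match because both integers round to the same float64, while `exact` correctly reports none. Casts between smaller or same-signedness integer types have no such rounding, which is consistent with keeping them on the numeric path.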
Test plan
- pandas/tests/series/methods/test_isin.py: tests covering set/frozenset across int64/int32/uint8/uint64/bool and the empty set.
- pytest pandas/tests/test_algos.py pandas/tests/series/ pandas/tests/frame/methods/test_isin.py pandas/tests/dtypes/ pandas/tests/indexes/multi/test_isin.py: 19651 passed.
- mypy pandas/core/algorithms.py: clean (only a pre-existing unrelated error in frame.py).
- pre-commit run on changed files: clean.