PERF: avoid materializing set/list in isin for common dtype cases #65284

Draft
jbrockmendel wants to merge 1 commit into pandas-dev:main from jbrockmendel:perf-isin-set

@jbrockmendel
Member

Summary

  • Add a set/frozenset fast path in algorithms.isin for integer/bool comps: membership is tested directly against the set, skipping an O(len(values)) materialization that previously dominated the runtime when len(comps) << len(values) (closes #25507, "pandas.Series.isin() is slow on large sets due to conversion of set to list").
  • Narrow the existing GH#46485 object-cast trigger to the actually precision-losing int+uint case. Other safe common dtypes (e.g. uint8 + int, float32 + float, bool + int) now keep the numeric ndarray and hit the fast htable path.
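
The first item can be illustrated with a minimal pure-Python sketch. The function names `isin_direct` and `isin_materialized` are hypothetical and are not taken from the PR. The sketch only contrasts testing membership against the set itself with first copying the set into a list, which stands in for the old conversion step.

```python
from collections.abc import Iterable, Set


def isin_direct(comps: Iterable[int], values: Set[int]) -> list[bool]:
    """Test each element of comps against the set itself: O(len(comps)) lookups."""
    return [item in values for item in comps]


def isin_materialized(comps: Iterable[int], values: Set[int]) -> list[bool]:
    """Copy the set into a list first, as a stand-in for the old conversion."""
    as_list = list(values)   # O(len(values)) copy, paid on every call
    lookup = set(as_list)    # stand-in for building a hash table from the copy
    return [item in lookup for item in comps]


# Both strategies agree on the result; only the setup cost differs.
squares = {i * i for i in range(100_000)}
comps = range(100)
assert isin_direct(comps, squares) == isin_materialized(comps, squares)
```

When `values` is much larger than `comps`, the copy in `isin_materialized` is the dominant cost, which is the situation the fast path is meant to avoid.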

Benchmarks

GH#25507 repro (Series(range(100)).isin(set_of_1M_squares)): 142 ms → 0.03 ms.

Series(1M uint8).isin([3 ints]) (second Summary item): ~100 ms → ~11 ms (numeric path retained instead of falling back to object hashtable).

GH#46485 correctness verified: the original precision-loss case still returns the expected empty result.
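
Why int+uint is the precision-losing combination: int64 and uint64 share no common integer dtype, so NumPy's promotion rules fall back to float64, which cannot represent every 64-bit integer exactly. The sketch below shows the effect with plain Python integers and floats. The constants and helper names are illustrative assumptions and are not the values from the original GH#46485 report.

```python
INT64_MAX = 2**63 - 1    # largest value an int64 can hold
UINT64_ONLY = 2**63      # fits in uint64 but not in int64


def collide_as_float64(a: int, b: int) -> bool:
    """True if two distinct integers become equal once cast to float64."""
    return a != b and float(a) == float(b)


def exact_match(a: int, b: int) -> bool:
    """Comparison on Python ints (what an object-dtype lookup sees) stays exact."""
    return a == b


# The two values differ, yet a float64 comparison reports a false match.
assert collide_as_float64(INT64_MAX, UINT64_ONLY)
assert not exact_match(INT64_MAX, UINT64_ONLY)

# Small integers, such as any uint8 value, are exact in float64 and do not collide.
assert not collide_as_float64(200, 201)
```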

Test plan

  • New module-level tests in pandas/tests/series/methods/test_isin.py covering set/frozenset across int64/int32/uint8/uint64/bool and empty set.
  • pytest pandas/tests/test_algos.py pandas/tests/series/ pandas/tests/frame/methods/test_isin.py pandas/tests/dtypes/ pandas/tests/indexes/multi/test_isin.py — 19651 passed.
  • mypy pandas/core/algorithms.py — clean (only pre-existing unrelated error in frame.py).
  • pre-commit run on changed files — clean.
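
A rough sketch of the kind of check the first test-plan item describes, assuming pandas is installed. The helper `check_isin_set_like` is hypothetical; the actual tests in pandas/tests/series/methods/test_isin.py are not shown on this page and may be structured differently.

```python
import pandas as pd

DTYPES = ["int64", "int32", "uint8", "uint64", "bool"]


def check_isin_set_like(dtype: str, container: type) -> None:
    """Check Series.isin against a set-like container for one dtype."""
    ser = pd.Series([0, 1, 0, 1], dtype=dtype)
    present = ser.tolist()[1]  # Python int 1, or True for the bool dtype

    # A one-element set/frozenset should match exactly the equal entries.
    result = ser.isin(container([present]))
    assert result.tolist() == [False, True, False, True]

    # An empty set/frozenset should match nothing.
    assert not ser.isin(container()).any()


for dtype in DTYPES:
    for container in (set, frozenset):
        check_isin_set_like(dtype, container)
```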

- Add a set fast path in isin for integer/bool comps that tests
  membership against the set directly, avoiding O(len(values))
  materialization (GH#25507).
- Narrow the GH#46485 object-cast trigger to the precision-losing
  int+uint case; other safe common dtypes now keep the numeric path.
@jbrockmendel added the Performance (Memory or execution speed performance) label on Apr 18, 2026

Development

Successfully merging this pull request may close these issues.

pandas.Series.isin() is slow on large sets due to conversion of set to list
