Fix performance regression in Coordinates.to_index#11306
thodson-usgs wants to merge 1 commit into pydata:main
The codes passed to pd.MultiIndex were being converted from cache-friendly ndarrays into Python lists to silence a mypy arg-type error introduced in pydata#10694. The extra per-element conversion dominates runtime for large indexes (~13s on a 100x2000x300 array).

Pass the ndarrays directly and suppress the type error the same way as for `levels` just above.

Fixes pydata#11305

Co-authored-by: Claude <noreply@anthropic.com>
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
cc @max-sixty (introduced the [This is Claude Code on behalf of Tim Hodson]
@ianhi, I ran the script to verify the performance improvement. Since this is a one-line triage fix, I'm going to open the PR. Thanks for your feedback.
Summary
Fixes #11305.
The `codes` passed to `pd.MultiIndex` in `Coordinates.to_index` were being converted from the cache-friendly ndarrays produced by `np.tile`/`np.repeat` into Python `list`s. That conversion was introduced in #10694 to silence a mypy `arg-type` error (pandas-stubs declares `codes: Sequence[Sequence[int]]`). The per-element materialisation dominates runtime for large indexes.

This PR passes `code_list` directly and suppresses the mypy complaint the same way we already do for `levels` on the line above (`# type: ignore[arg-type,unused-ignore]`).

Verified that `pd.MultiIndex` with ndarray `codes` produces a result that is `.identical()` to the list-of-int version (same internal code dtype, same values), so the change is purely a performance fix.

MVCE (from #11305)
Run with `uv run script.py`. Local results on my machine for `to_dataframe`, relative to `main` (before): ~40× speed-up, matching the ~10s the reporter saw.
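For readers who want to see the effect in isolation, here is a minimal sketch (not the MVCE script from #11305, and not the xarray source). It assumes that the pre-fix code turned each code array into a Python list before calling `pd.MultiIndex`; the level sizes, names, and the `build` helper are hypothetical. It builds the codes with `np.repeat`/`np.tile`, constructs the index both ways, and checks that the results are identical.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical level sizes, much smaller than the 100x2000x300 case in the report.
sizes = (10, 20, 30)
names = ["a", "b", "c"]
levels = [pd.Index(np.arange(n)) for n in sizes]
total = int(np.prod(sizes))

# One ndarray of integer codes per level, in row-major order
# (the last level varies fastest).
code_list = []
repeat = total
for n in sizes:
    repeat //= n                  # product of the sizes of the levels to the right
    tile = total // (n * repeat)  # product of the sizes of the levels to the left
    code_list.append(np.tile(np.repeat(np.arange(n), repeat), tile))


def build(codes):
    """Hypothetical helper: construct the MultiIndex from the given codes."""
    return pd.MultiIndex(levels=levels, codes=codes, names=names)


# ndarray codes passed straight through (the approach taken in this PR).
t0 = time.perf_counter()
from_arrays = build(code_list)
t_arrays = time.perf_counter() - t0

# Codes materialised element by element into Python lists
# (assumed shape of the pre-fix conversion).
t0 = time.perf_counter()
from_lists = build([list(c) for c in code_list])
t_lists = time.perf_counter() - t0

same = from_arrays.identical(from_lists)
print(f"ndarray codes: {t_arrays:.4f}s, list codes: {t_lists:.4f}s, identical: {same}")
```

Timings depend on the machine and on the pandas version, so the sketch prints them rather than asserting a ratio; increasing `sizes` should make the gap between the two paths easier to see.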
Test plan
Ran locally with `uv run pytest -n auto`:
- `xarray/tests/test_coordinates.py`: 27 passed
- `xarray/tests/test_variable.py`, `test_concat.py`, `test_missing.py`: 827 passed, 69 skipped, 9 xfailed, 3 xpassed
- `xarray/tests/test_dataset.py`, `test_dataarray.py`, `test_groupby.py`: 1458 passed, 91 skipped, 5 xfailed, 2 xpassed
- `pre-commit` clean on touched files, no new mypy errors introduced on `xarray/core/coordinates.py`

I grepped for callers of `Coordinates.to_index` (`xarray/core/{dataset,dataarray,variable,indexes,missing,accessor_dt}.py`, `xarray/plot/facetgrid.py`, `xarray/groupers.py`). All paths produce a `pd.MultiIndex` whose consumers rely on `.values` / `.get_loc` / iteration, which are unaffected by whether `codes` was built from ndarrays or lists.

[This is Claude Code on behalf of Tim Hodson]
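The claim about consumers can be spot-checked with a small sketch like the one below. It is an illustration only: the two-level index, its labels, and the variable names are hypothetical and not taken from the xarray call sites. It compares `.values`, `.get_loc`, and iteration for an index built from ndarray codes against one built from list codes.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level index: 2 letters x 3 numbers.
levels = [pd.Index(["x", "y"]), pd.Index([10, 20, 30])]
names = ["letter", "number"]

codes_nd = [np.repeat(np.arange(2), 3), np.tile(np.arange(3), 2)]  # ndarrays
codes_py = [c.tolist() for c in codes_nd]                          # Python lists

idx_nd = pd.MultiIndex(levels=levels, codes=codes_nd, names=names)
idx_py = pd.MultiIndex(levels=levels, codes=codes_py, names=names)

# The three access patterns named in the test plan.
values_match = list(idx_nd.values) == list(idx_py.values)
loc_match = idx_nd.get_loc(("y", 20)) == idx_py.get_loc(("y", 20))
iter_match = list(idx_nd) == list(idx_py)

print(values_match, loc_match, iter_match)
```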