Skip to content

Fix performance regression in Coordinates.to_index#11306

Open
thodson-usgs wants to merge 1 commit intopydata:mainfrom
thodson-usgs:fix/11305-to-index-perf-regression
Open

Fix performance regression in Coordinates.to_index#11306
thodson-usgs wants to merge 1 commit intopydata:mainfrom
thodson-usgs:fix/11305-to-index-perf-regression

Conversation

@thodson-usgs
Copy link
Copy Markdown

@thodson-usgs thodson-usgs commented Apr 18, 2026

Summary

Fixes #11305.

The codes passed to pd.MultiIndex in Coordinates.to_index were being converted from the cache-friendly ndarrays produced by np.tile/np.repeat into Python lists. That conversion was introduced in #10694 to silence a mypy arg-type error (pandas-stubs declares codes: Sequence[Sequence[int]]). The per-element materialisation dominates runtime for large indexes.

This PR passes code_list directly and suppresses the mypy complaint the same way we already do for levels on the line above (# type: ignore[arg-type,unused-ignore]).

Verified that pd.MultiIndex with ndarray codes produces a result that is .identical() to the list-of-int version (same internal code dtype, same values), so the change is purely a performance fix.

MVCE (from #11305)

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "xarray[complete]@git+https://github.com/pydata/xarray.git@refs/pull/11306/head",
# ]
# ///
import time
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(100 * 2000 * 300).reshape(100, 2000, 300), name="foo")
t = time.perf_counter()
da.to_dataframe()
print(f"to_dataframe: {time.perf_counter() - t:.3f}s")

Run with uv run script.py. Local results on my machine:

to_dataframe
main (before) 10.33s
this PR (after) 0.26s

~40× speed-up, matching the ~10s reporter saw.

Test plan

Ran locally with uv run pytest -n auto:

  • xarray/tests/test_coordinates.py — 27 passed
  • xarray/tests/test_variable.py, test_concat.py, test_missing.py — 827 passed, 69 skipped, 9 xfailed, 3 xpassed
  • xarray/tests/test_dataset.py, test_dataarray.py, test_groupby.py — 1458 passed, 91 skipped, 5 xfailed, 2 xpassed
  • pre-commit clean on touched files, no new mypy errors introduced on xarray/core/coordinates.py

I grepped for callers of Coordinates.to_index (xarray/core/{dataset,dataarray,variable,indexes,missing,accessor_dt}.py, xarray/plot/facetgrid.py, xarray/groupers.py) — all paths produce a pd.MultiIndex whose consumers rely on .values / .get_loc / iteration, which are unaffected by whether codes was built from ndarrays or lists.


[This is Claude Code on behalf of Tim Hodson]

The codes passed to pd.MultiIndex were being converted from cache-friendly
ndarrays into Python lists to silence a mypy arg-type error introduced in
pydata#10694. The extra per-element conversion dominates runtime for large indexes
(~13s on a 100x2000x300 array). Pass the ndarrays directly and suppress the
type error the same way as for `levels` just above.

Fixes pydata#11305

Co-authored-by: Claude <noreply@anthropic.com>
@welcome
Copy link
Copy Markdown

welcome bot commented Apr 18, 2026

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@thodson-usgs
Copy link
Copy Markdown
Author

cc @max-sixty (introduced the codes=[list(c) for c in code_list] conversion in #10694 to silence a mypy arg-type error) and @ianhi (reverted the analogous levels conversion in #10705 for dtype reasons, left the codes one in place) — curious if either of you has context on why the list(...) conversion was needed for codes specifically beyond the mypy stub, or any concern with the suppress-via-type-ignore approach used here.

[This is Claude Code on behalf of Tim Hodson]

@thodson-usgs thodson-usgs marked this pull request as ready for review April 18, 2026 15:12
@thodson-usgs thodson-usgs marked this pull request as draft April 18, 2026 16:03
@ianhi
Copy link
Copy Markdown
Collaborator

ianhi commented Apr 20, 2026

and @ianhi (reverted the analogous levels conversion in #10705 for dtype reasons, left the codes one in place) —

I strongly suspect this was an oversight on my part - or rather that I wasn't seeing an immediate error from it so I didn't fix it

@thodson-usgs
Copy link
Copy Markdown
Author

@ianhi , I ran the script to verify the performance improvement. Since this is a triage one-liner, I'm going to open the PR. Thanks for your feedback.

@thodson-usgs thodson-usgs marked this pull request as ready for review April 20, 2026 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Major performance regression in Coordinates.to_index()

3 participants