PERF: read_csv emits pyarrow-backed string arrays directly from the C parser#65283
Draft

jbrockmendel wants to merge 3 commits into pandas-dev:main from
Conversation
… parser

For string columns under the default ``future.infer_string`` or ``dtype_backend="pyarrow"``, the C parser previously materialized an ``ndarray[object]`` of Python strs, and downstream ``ArrowStringArray._from_sequence`` / ``_box_pa_array`` ran a per-chunk ``isna`` scan and a second pass through the data to build the pyarrow buffers. This change adds a new ``_string_pyarrow_utf8`` cdef that fills pyarrow's offsets / data / validity buffers directly from the tokenizer and returns a pyarrow-backed ExtensionArray, skipping both the intermediate object array and the downstream scan. ``parse_dates`` columns continue to return object ndarrays so ``to_datetime`` can take its fast tslibs path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
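The buffer layout the commit message describes (offsets / data / validity filled directly, no intermediate Python strings) can be sketched in pure numpy. This is an illustrative model of Arrow's `large_string` memory layout, not the actual `parsers.pyx` code; the function names are hypothetical.

```python
# Hypothetical sketch (not the PR's Cython code): the Arrow large_string
# layout — int64 offsets, one concatenated UTF-8 data buffer, and an
# LSB-first validity bitmap — built and read back with numpy only.
import numpy as np

def build_large_string_buffers(values):
    """values: list of str or None -> (validity_bitmap, offsets, data)."""
    n = len(values)
    offsets = np.zeros(n + 1, dtype=np.int64)      # int64 = large_string
    validity = np.zeros((n + 7) // 8, dtype=np.uint8)
    chunks = []
    pos = 0
    for i, v in enumerate(values):
        if v is not None:
            b = v.encode("utf-8")
            chunks.append(b)
            pos += len(b)
            validity[i // 8] |= 1 << (i % 8)       # mark slot i valid
        offsets[i + 1] = pos                        # nulls repeat prev offset
    return validity, offsets, b"".join(chunks)

def get(validity, offsets, data, i):
    """Read value i back out of the buffers (None if the bit is unset)."""
    if not (validity[i // 8] >> (i % 8)) & 1:
        return None
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")

validity, offsets, data = build_large_string_buffers(["ab", None, "cde"])
print(get(validity, offsets, data, 0))  # ab
print(get(validity, offsets, data, 1))  # None
```

The point of filling these three buffers in one tokenizer pass is that no `PyUnicode` objects are created and no separate null scan is needed afterwards.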
Fixes ``ModuleNotFoundError`` in CI environments without pyarrow when ``using_string_dtype()`` is on; falls back to the object-ndarray path so a python-backed StringArray is produced as before.
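The fallback described in this commit follows a common optional-dependency pattern. A minimal sketch, assuming a hypothetical `finalize_string_column` helper (the real code lives in the parser internals):

```python
# Hypothetical sketch of the fallback: if pyarrow is not importable, keep
# the object-ndarray result so a python-backed StringArray can be built
# downstream as before; otherwise take the pyarrow-backed fast path.
def finalize_string_column(objects):
    try:
        import pyarrow as pa  # optional dependency; may be absent in CI
    except ImportError:       # ImportError covers ModuleNotFoundError
        return objects        # object path: python-backed StringArray later
    return pa.array(objects, type=pa.large_string())
```

Catching `ImportError` (the base class of `ModuleNotFoundError`) at the call site keeps environments without pyarrow on the exact pre-existing code path.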
Summary
For string columns under the default `future.infer_string` or `dtype_backend="pyarrow"`, the C parser previously built an `ndarray[object]` of Python strs, and downstream `ArrowStringArray._from_sequence` ran a per-chunk `isna` scan plus a second pass to construct the pyarrow buffers. This PR adds a new `_string_pyarrow_utf8` cdef in `parsers.pyx` that fills pyarrow's offsets / data / validity buffers directly from the tokenizer and returns a pyarrow-backed ExtensionArray, skipping both the intermediate object array and the downstream scan.

- `pa.large_string()` wrapped in `ArrowStringArray` for the default (`StringDtype(na_value=np.nan)`) path.
- `pa.string()` wrapped in `ArrowExtensionArray` for `dtype_backend="pyarrow"`.
- Fast path only under `encoding_errors="strict"` (other modes keep the object path for codec fidelity).
- `parse_dates` columns are left as object ndarrays so `to_datetime` can take its fast tslibs path.

Benchmarks
1M rows, GC-controlled, best of 6, default `pd.read_csv("file.csv")`:

*(timing table not preserved in this capture)*

Peak RSS (sampled during the read, baseline-subtracted):

*(table not preserved in this capture)*
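The "GC-controlled, best of 6" protocol can be reproduced with a small stdlib harness. This is an illustrative sketch, not the PR's actual benchmark script:

```python
# Illustrative timing harness for "GC-controlled, best of 6": collect,
# then disable the collector around each run so GC pauses stay out of
# the measurement, and keep the minimum wall-clock time.
import gc
import time

def best_of(fn, repeats=6):
    best = float("inf")
    for _ in range(repeats):
        gc.collect()            # start each run from a clean heap
        gc.disable()            # no collector pauses during timing
        try:
            t0 = time.perf_counter()
            fn()
            best = min(best, time.perf_counter() - t0)
        finally:
            gc.enable()
    return best

# usage (hypothetical): best_of(lambda: pd.read_csv("file.csv"))
```

Taking the minimum rather than the mean filters out scheduler and cache noise, which matters when comparing parser paths that differ by tens of milliseconds.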
The savings scale with non-repeating strings because the old path allocated one `PyUnicode` per unique value; the new path stores bytes directly in pyarrow's data buffer with no Python-object tax.

An adversarial sweep confirmed no regressions after a fix for `parse_dates` (the initial version routed `noconvert` columns through the fast path, which made `to_datetime` fall back to the `Series.map` path, a ~13% regression; now `noconvert` stays on the object path). Bypassed cases (`dtype=object`, `encoding_errors != "strict"`, explicit ExtensionDtype) correctly no-op and tie main.

Notes / design choices
- `pa.string()` (int32 offsets) for `dtype_backend="pyarrow"` matches the prior convention. Chunks exceeding 2 GiB raise `OverflowError`; unreachable in practice with default chunk sizes, but worth flagging.
- `pa.large_string()` (int64 offsets) for the default path matches `StringDtype`'s pyarrow storage exactly (avoids a downstream cast).
- `TextReader` callers now get `ArrowStringArray` under `future.infer_string`. Three tests in `test_textreader.py` were updated to normalize via `np.asarray` before comparison.

Test plan
- `pandas/tests/io/parser/` suite passes (4406 passed, 174 skipped, 230 xfailed)
- `pre-commit run` clean
- Edge cases exercised: `na_filter=False`, iterator/chunksize, `parse_dates`, explicit `dtype=object`, `encoding_errors="replace"`, 100% NA strings

🤖 Generated with Claude Code
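The 2 GiB ceiling flagged under design choices falls out of `pa.string()` storing int32 offsets: the concatenated UTF-8 data buffer of one chunk must fit in `2**31 - 1` bytes, while `pa.large_string()`'s int64 offsets have no practical limit. A minimal sketch of such a guard (the function name is hypothetical, not the PR's code):

```python
# Hypothetical guard illustrating the int32-offset limit: pa.string()
# offsets are int32, so one chunk's UTF-8 data buffer must stay under
# 2 GiB; pa.large_string() uses int64 offsets and is effectively unbounded.
INT32_MAX = 2**31 - 1

def check_offset_width(total_utf8_bytes: int, large_offsets: bool) -> None:
    if not large_offsets and total_utf8_bytes > INT32_MAX:
        raise OverflowError(
            "chunk string data exceeds int32 offset range (2 GiB)"
        )

check_offset_width(5, large_offsets=False)          # fine
check_offset_width(2**31, large_offsets=True)       # fine: int64 offsets
```

With pandas' default chunking a single parsed chunk never approaches 2 GiB of string data, which is why the PR treats the `OverflowError` as a defensive check rather than a reachable error path.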