
PERF: read_csv emits pyarrow-backed string arrays directly from the C parser #65283

Draft
jbrockmendel wants to merge 3 commits into pandas-dev:main from jbrockmendel:perf-read_csv-strings

Conversation

@jbrockmendel
Member

Summary

For string columns under the default future.infer_string or dtype_backend="pyarrow", the C parser previously built an ndarray[object] of Python strs and downstream ArrowStringArray._from_sequence ran a per-chunk isna scan plus a second pass to construct the pyarrow buffers. This PR adds a new _string_pyarrow_utf8 cdef in parsers.pyx that fills pyarrow's offsets / data / validity buffers directly from the tokenizer and returns a pyarrow-backed ExtensionArray, skipping both the intermediate object array and the downstream scan.

  • Targets pa.large_string() wrapped in ArrowStringArray for the default (StringDtype(na_value=np.nan)) path.
  • Targets pa.string() wrapped in ArrowExtensionArray for dtype_backend="pyarrow".
  • Gated on encoding_errors="strict" (other modes keep the object path for codec fidelity).
  • parse_dates columns are left as object ndarrays so to_datetime can take its fast tslibs path.

Benchmarks

1M rows, GC-controlled, best of 6, pd.read_csv("file.csv") default:

| Workload | main | this PR | Speedup |
|---|---|---|---|
| 10 string cols, 1000-string pool | 810 ms | 415 ms | 1.95x |
| 5 string cols, all unique | 1144 ms | 236 ms | 4.85x |
| 50k rows x 500 string cols | 3270 ms | 1074 ms | 3.04x |
| 1 string col x 1M rows | 81 ms | 47 ms | 1.72x |
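A minimal sketch of the timing methodology (GC-controlled, best of 6). The pool size and shapes here are illustrative, not the exact harness that produced the numbers above:

```python
import gc
import io
import time

import numpy as np
import pandas as pd

# Build a CSV in memory: 10 string columns drawn from a 1000-string pool.
rng = np.random.default_rng(0)
pool = [f"s{i:04d}" for i in range(1000)]
df = pd.DataFrame({f"c{j}": rng.choice(pool, 100_000) for j in range(10)})
csv_bytes = df.to_csv(index=False).encode()

def best_of(n: int = 6) -> float:
    """Time read_csv with GC disabled during the read; return the best run."""
    times = []
    for _ in range(n):
        gc.collect()
        gc.disable()
        t0 = time.perf_counter()
        pd.read_csv(io.BytesIO(csv_bytes))
        times.append(time.perf_counter() - t0)
        gc.enable()
    return min(times)

print(f"best of 6: {best_of() * 1e3:.1f} ms")
```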

Peak RSS (sampled during the read, baseline-subtracted):

| Workload | main | this PR | Reduction |
|---|---|---|---|
| 1M x 10 dedup | 403 MB | 198 MB | -51% |
| 1M x 5 all-unique | 560 MB | 110 MB | -80% |
| 50k x 500 | 998 MB | 441 MB | -56% |

The savings scale with non-repeating strings because the old path allocated one PyUnicode per unique value; the new path stores bytes directly in pyarrow's data buffer with no Python-object tax.

An adversarial sweep confirmed no regressions after a parse_dates fix: the initial version routed noconvert columns through the fast path, which pushed to_datetime onto its Series.map fallback (a ~13% regression); noconvert columns now stay on the object path. Bypassed cases (dtype=object, encoding_errors!="strict", explicit ExtensionDtype) correctly no-op and tie main.

Notes / design choices

  • pa.string() (int32 offsets) for dtype_backend="pyarrow" matches the prior convention. Chunks exceeding 2 GiB raise OverflowError — unreachable in practice with default chunk sizes, but worth flagging.
  • pa.large_string() (int64 offsets) for the default path to match StringDtype's pyarrow storage exactly (avoids a downstream cast).
  • Direct TextReader callers now get ArrowStringArray under future.infer_string. Three tests in test_textreader.py were updated to normalize via np.asarray before comparison.

Test plan

  • Full pandas/tests/io/parser/ suite passes (4406 passed, 174 skipped, 230 xfailed)
  • pre-commit run clean
  • Benchmarks above reproduced
  • Adversarial sweep (short strings, long strings, heavy dedup, wide/shallow, usecols, na_filter=False, iterator/chunksize, parse_dates, explicit dtype=object, encoding_errors="replace", 100% NA strings)
  • CI green

🤖 Generated with Claude Code

… parser

For string columns under the default ``future.infer_string`` or
``dtype_backend="pyarrow"``, the C parser previously materialized an
``ndarray[object]`` of Python strs and then downstream called
``ArrowStringArray._from_sequence`` / ``_box_pa_array``, which ran a
per-chunk ``isna`` scan and a second pass through the data to build the
pyarrow buffers. This change adds a new ``_string_pyarrow_utf8`` cdef that
fills pyarrow's offsets / data / validity buffers directly from the
tokenizer and returns a pyarrow-backed ExtensionArray, skipping both the
intermediate object array and the downstream scan. ``parse_dates`` columns
continue to return object ndarrays so ``to_datetime`` can take its fast
tslibs path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label Apr 18, 2026
jbrockmendel and others added 2 commits April 18, 2026 15:10
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes ModuleNotFoundError in CI environments without pyarrow when
using_string_dtype() is on; falls back to the object-ndarray path so
python-backed StringArray is produced as before.