
PERF: read_csv emits pyarrow-backed string arrays directly from the C parser #65283

Draft
jbrockmendel wants to merge 3 commits into pandas-dev:main from jbrockmendel:perf-read_csv-strings

Conversation

@jbrockmendel
Member

Summary

For string columns under the default future.infer_string or dtype_backend="pyarrow", the C parser previously built an ndarray[object] of Python strs and downstream ArrowStringArray._from_sequence ran a per-chunk isna scan plus a second pass to construct the pyarrow buffers. This PR adds a new _string_pyarrow_utf8 cdef in parsers.pyx that fills pyarrow's offsets / data / validity buffers directly from the tokenizer and returns a pyarrow-backed ExtensionArray, skipping both the intermediate object array and the downstream scan.

  • Targets pa.large_string() wrapped in ArrowStringArray for the default (StringDtype(na_value=np.nan)) path.
  • Targets pa.string() wrapped in ArrowExtensionArray for dtype_backend="pyarrow".
  • Gated on encoding_errors="strict" (other modes keep the object path for codec fidelity).
  • parse_dates columns are left as object ndarrays so to_datetime can take its fast tslibs path.

Benchmarks

1M rows, GC-controlled, best of 6, pd.read_csv("file.csv") default:

| Workload | main | this PR | Speedup |
|---|---|---|---|
| 10 string cols, 1000-string pool | 810 ms | 415 ms | 1.95x |
| 5 string cols, all unique | 1144 ms | 236 ms | 4.85x |
| 50k rows x 500 string cols | 3270 ms | 1074 ms | 3.04x |
| 1 string col x 1M rows | 81 ms | 47 ms | 1.72x |
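A minimal sketch of the timing methodology (GC-controlled, best of 6). The pool size and shapes here are illustrative, not the exact harness that produced the numbers above:

```python
import gc
import io
import time

import numpy as np
import pandas as pd

# Build a CSV in memory: 10 string columns drawn from a 1000-string pool.
rng = np.random.default_rng(0)
pool = [f"s{i:04d}" for i in range(1000)]
df = pd.DataFrame({f"c{j}": rng.choice(pool, 100_000) for j in range(10)})
csv_bytes = df.to_csv(index=False).encode()

def best_of(n: int = 6) -> float:
    """Time read_csv with GC disabled during the read; return the best run."""
    times = []
    for _ in range(n):
        gc.collect()
        gc.disable()
        t0 = time.perf_counter()
        pd.read_csv(io.BytesIO(csv_bytes))
        times.append(time.perf_counter() - t0)
        gc.enable()
    return min(times)

print(f"best of 6: {best_of() * 1e3:.1f} ms")
```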

Peak RSS (sampled during the read, baseline-subtracted):

| Workload | main | this PR | Reduction |
|---|---|---|---|
| 1M x 10 dedup | 403 MB | 198 MB | -51% |
| 1M x 5 all-unique | 560 MB | 110 MB | -80% |
| 50k x 500 | 998 MB | 441 MB | -56% |

The savings scale with non-repeating strings because the old path allocated one PyUnicode per unique value; the new path stores bytes directly in pyarrow's data buffer with no Python-object tax.

An adversarial sweep confirmed no regressions after a parse_dates fix: the initial version routed noconvert columns through the fast path, which pushed to_datetime onto its Series.map fallback (a ~13% regression); noconvert columns now stay on the object path. Bypassed cases (dtype=object, encoding_errors!="strict", explicit ExtensionDtype) correctly no-op and tie main.

Notes / design choices

  • pa.string() (int32 offsets) for dtype_backend="pyarrow" matches the prior convention. Chunks exceeding 2 GiB raise OverflowError — unreachable in practice with default chunk sizes, but worth flagging.
  • pa.large_string() (int64 offsets) for the default path to match StringDtype's pyarrow storage exactly (avoids a downstream cast).
  • Direct TextReader callers now get ArrowStringArray under future.infer_string. Three tests in test_textreader.py were updated to normalize via np.asarray before comparison.

Test plan

  • Full pandas/tests/io/parser/ suite passes (4406 passed, 174 skipped, 230 xfailed)
  • pre-commit run clean
  • Benchmarks above reproduced
  • Adversarial sweep (short strings, long strings, heavy dedup, wide/shallow, usecols, na_filter=False, iterator/chunksize, parse_dates, explicit dtype=object, encoding_errors="replace", 100% NA strings)
  • CI green

🤖 Generated with Claude Code

… parser

For string columns under the default ``future.infer_string`` or
``dtype_backend="pyarrow"``, the C parser previously materialized an
``ndarray[object]`` of Python strs and then downstream called
``ArrowStringArray._from_sequence`` / ``_box_pa_array``, which ran a
per-chunk ``isna`` scan and a second pass through the data to build the
pyarrow buffers. This change adds a new ``_string_pyarrow_utf8`` cdef that
fills pyarrow's offsets / data / validity buffers directly from the
tokenizer and returns a pyarrow-backed ExtensionArray, skipping both the
intermediate object array and the downstream scan. ``parse_dates`` columns
continue to return object ndarrays so ``to_datetime`` can take its fast
tslibs path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label Apr 18, 2026
jbrockmendel and others added 2 commits April 18, 2026 15:10
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes ModuleNotFoundError in CI environments without pyarrow when
using_string_dtype() is on; falls back to the object-ndarray path so
python-backed StringArray is produced as before.