
bench: compare pyarrow and native arrow scans #7

Draft
abnobdoss wants to merge 1 commit into aba-167-native-arrow-read-integration-tests from aba-160-native-scan-benchmarks

abnobdoss commented May 25, 2026

Stack position: Python PR after #6 (ABA-160).

Expands the manual benchmark harness that compares the PyArrow scan path with the opt-in Rust-backed native Arrow scan path.

What changed:

  • Provisions realistic benchmark tables through Spark Connect instead of relying on tiny integration fixtures.
  • Default stress profile:
    • 5M rows / 2k files for many-file scans
    • 5M rows / 2k files for partitioned scans
    • 1M rows / 500 files with merge-on-read positional deletes
  • Runs each measured scan in a fresh child process so RSS includes native Arrow/Rust allocations in the Python client process.
  • Reports latency, rows/sec, batches, RSS delta, sampled peak RSS, process max RSS, and Python-only tracemalloc peak.
  • Validates native vs PyArrow parity before timing via row counts, column names, and numeric checksum aggregates.
  • Supports custom REST/Spark/S3 endpoints, refresh/skip-provision modes, JSON output, and Markdown output.
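The parity step in the list above (row counts, column names, numeric checksum aggregates) could look roughly like the following. This is a minimal sketch, not the code in dev/bench_arrow_scan.py: `summarize` and `assert_parity` are hypothetical names, and the input is assumed to be a plain mapping of column name to values (the shape `pyarrow.Table.to_pydict()` returns) so that the example needs only the standard library.

```python
import math
from typing import Dict, Mapping, Sequence


def summarize(columns: Mapping[str, Sequence]) -> Dict[str, object]:
    """Collect row count, column names, and sums of the numeric columns."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError(f"ragged columns: {sorted(lengths)}")
    checksums: Dict[str, float] = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]  # ignore nulls
        is_numeric = bool(present) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
        if is_numeric:
            # fsum is exactly rounded, so the sum does not depend on row order.
            checksums[name] = math.fsum(present)
    return {
        "num_rows": lengths.pop() if lengths else 0,
        "column_names": list(columns),
        "checksums": checksums,
    }


def assert_parity(native: Mapping[str, Sequence],
                  reference: Mapping[str, Sequence],
                  rel_tol: float = 1e-9) -> None:
    """Raise AssertionError if the two scan results disagree on any summary."""
    got, want = summarize(native), summarize(reference)
    for key in ("num_rows", "column_names"):
        if got[key] != want[key]:
            raise AssertionError(f"{key} mismatch: {got[key]!r} != {want[key]!r}")
    got_sums, want_sums = got["checksums"], want["checksums"]
    if got_sums.keys() != want_sums.keys():
        raise AssertionError("numeric column sets differ")
    for name, expected in want_sums.items():
        if not math.isclose(got_sums[name], expected, rel_tol=rel_tol):
            raise AssertionError(f"checksum mismatch in {name!r}")
```

Because the checksums are sums, two scans that return the same rows in a different order still pass. The trade-off is that a sum cannot catch every difference (offsetting errors cancel), so this is a cheap guard before timing rather than a full equality check.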

Validation:

  • uv run prek run --files dev/bench_arrow_scan.py
  • Small smoke run with --refresh --rows 10000 --files 20 --delete-rows 5000 --delete-files 10 --runs 1 --warmups 0 --table-prefix bench_native_scan_smoke --s3-endpoint http://localhost:19000
  • Full stress rerun with existing provisioned tables:
    • uv run python dev/bench_arrow_scan.py --skip-provision --runs 1 --warmups 0 --s3-endpoint http://localhost:19000 --json-out /tmp/native_scan_bench_stress.json --markdown-out /tmp/native_scan_bench_stress.md

Headline full-stress results:

| Scenario | Native median | PyArrow median | Native peak RSS | PyArrow peak RSS |
| --- | --- | --- | --- | --- |
| many_files_full | 859.8 ms | 2112.1 ms | 215.5 MB | 395.2 MB |
| many_files_project_id | 610.8 ms | 1783.8 ms | 206.5 MB | 232.4 MB |
| many_files_filter_id | 871.5 ms | 2502.2 ms | 205.4 MB | 241.2 MB |
| many_files_limit_1000 | 235.5 ms | 253.6 ms | 197.4 MB | 200.6 MB |
| partition_pruned_part_7 | 40.4 ms | 33.7 ms | 176.6 MB | 164.7 MB |
| partitioned_project | 113.9 ms | 128.1 ms | 201.9 MB | 226.8 MB |
| pos_deletes_full | 302.2 ms | 1618.6 ms | 206.6 MB | 232.1 MB |
| pos_deletes_filter | 306.2 ms | 1751.7 ms | 201.6 MB | 197.0 MB |
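
To read the latency columns as ratios, the medians can be divided directly. The snippet below is only a convenience sketch: the numbers are copied from the table above, and `MEDIANS_MS` and `speedups` are illustrative names, not part of the harness.

```python
# (scenario, native median ms, PyArrow median ms), copied from the table above.
MEDIANS_MS = [
    ("many_files_full", 859.8, 2112.1),
    ("many_files_project_id", 610.8, 1783.8),
    ("many_files_filter_id", 871.5, 2502.2),
    ("many_files_limit_1000", 235.5, 253.6),
    ("partition_pruned_part_7", 40.4, 33.7),
    ("partitioned_project", 113.9, 128.1),
    ("pos_deletes_full", 302.2, 1618.6),
    ("pos_deletes_filter", 306.2, 1751.7),
]


def speedups(rows):
    """Return PyArrow median / native median; values above 1 favor native."""
    return {name: pyarrow_ms / native_ms for name, native_ms, pyarrow_ms in rows}


if __name__ == "__main__":
    for name, ratio in speedups(MEDIANS_MS).items():
        print(f"{name:<26} {ratio:5.2f}x")
```

A ratio below 1 means PyArrow was faster, which in this table happens only for partition_pruned_part_7. The full-stress command above uses --runs 1 --warmups 0, so each median is a single measurement and small differences should be read with that in mind.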

Caveat: memory metrics are for the Python benchmark/client process only, not Spark/REST/MinIO containers. Max RSS includes interpreter/import peaks; Peak RSS is parent-sampled during the measured scan and is the more useful comparison column.
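
For readers unfamiliar with the distinction in the caveat, here is a rough sketch of running a workload in a fresh child process while the parent samples its RSS. It is an assumption-laden illustration, not the harness code: `read_rss_bytes` and `run_in_child` are hypothetical names, sampling reads /proc/<pid>/statm and therefore works only on Linux, and it samples the child's whole lifetime rather than only the measured scan.

```python
import resource
import subprocess
import sys
import time
from pathlib import Path
from typing import Optional


def read_rss_bytes(pid: int) -> Optional[int]:
    """Return the current RSS of `pid` from /proc (Linux only), else None."""
    try:
        fields = Path(f"/proc/{pid}/statm").read_text().split()
        # Second field of statm is the resident set size in pages.
        return int(fields[1]) * resource.getpagesize()
    except (OSError, IndexError, ValueError):
        return None


def run_in_child(code: str, interval_s: float = 0.005) -> dict:
    """Run `code` in a fresh interpreter and sample its RSS from the parent."""
    start = time.perf_counter()
    child = subprocess.Popen([sys.executable, "-c", code])
    peak = 0
    while child.poll() is None:  # poll() also reaps the child once it exits
        rss = read_rss_bytes(child.pid)
        if rss is not None:
            peak = max(peak, rss)
        time.sleep(interval_s)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # ru_maxrss is in KiB on Linux and bytes on macOS, and it covers every
    # child this parent has reaped so far, not only the one started above.
    children_max_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return {
        "returncode": child.returncode,
        "elapsed_ms": elapsed_ms,
        "sampled_peak_rss_bytes": peak or None,
        "children_max_rss_raw": children_max_rss,
    }
```

Sampling from outside the process sees native allocations that tracemalloc cannot, which is why a fresh child per scan keeps one run's high-water mark from leaking into the next. A polling loop can miss short spikes between samples, so the sampled peak is a lower bound on the true peak.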

abnobdoss force-pushed the aba-167-native-arrow-read-integration-tests branch 2 times, most recently from 4ae56ff to 443819e (May 25, 2026 02:22)
abnobdoss force-pushed the aba-160-native-scan-benchmarks branch from 08c21db to e3ce3e4 (May 25, 2026 02:23)
abnobdoss force-pushed the aba-167-native-arrow-read-integration-tests branch from 443819e to abe8da0 (May 25, 2026 02:44)
abnobdoss force-pushed the aba-160-native-scan-benchmarks branch from e3ce3e4 to 33818f4 (May 25, 2026 02:44)
abnobdoss force-pushed the aba-167-native-arrow-read-integration-tests branch from abe8da0 to bb205f9 (May 25, 2026 02:59)
abnobdoss force-pushed the aba-160-native-scan-benchmarks branch from 33818f4 to ed3dee6 (May 25, 2026 02:59)
abnobdoss force-pushed the aba-160-native-scan-benchmarks branch from ed3dee6 to 4cc8409 (May 25, 2026 03:44)