bench: compare pyarrow and native arrow scans by abnobdoss · Pull Request #7 · abnobdoss/iceberg-python

abnobdoss · 2026-05-25T01:57:21Z

Stack position: Python PR after #6 (ABA-160).

Expands the manual benchmark harness for comparing PyArrow and opt-in Rust-backed Arrow scan paths.

What changed:

- Provisions realistic benchmark tables through Spark Connect instead of relying on tiny integration fixtures.
- Default stress profile:
  - 5M rows / 2k files for many-file scans
  - 5M rows / 2k files for partitioned scans
  - 1M rows / 500 files with merge-on-read positional deletes
- Runs each measured scan in a fresh child process so RSS includes native Arrow/Rust allocations in the Python client process.
- Reports latency, rows/sec, batches, RSS delta, sampled peak RSS, process max RSS, and Python-only tracemalloc peak.
- Validates native vs PyArrow parity before timing via row counts, column names, and numeric checksum aggregates.
- Supports custom REST/Spark/S3 endpoints, refresh/skip-provision modes, JSON output, and Markdown output.
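For illustration, the parity check described above (row counts, column names, numeric checksum aggregates) could look roughly like the sketch below. This is not the code in `dev/bench_arrow_scan.py`: the `Columns` type, `numeric_checksums`, and `assert_parity` are hypothetical names, and a scan result is modelled as a plain mapping of column name to values so the example runs without PyArrow. Sums and non-null counts are used as checksums because they do not depend on row order, which may differ between the two scan paths.

```python
import math
from typing import Mapping, Sequence

# Hypothetical stand-in for a scan result: column name -> list of values.
Columns = Mapping[str, Sequence[object]]


def numeric_checksums(table: Columns) -> dict:
    """Return (non-null count, sum) for each column whose non-null values are all numeric."""
    out = {}
    for name, values in table.items():
        present = [v for v in values if v is not None]
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
        if present and is_numeric:
            out[name] = (len(present), math.fsum(present))
    return out


def assert_parity(native: Columns, reference: Columns, rel_tol: float = 1e-9) -> None:
    """Raise AssertionError if the two results disagree on shape or checksums."""
    assert list(native) == list(reference), "column names differ"
    native_rows = {len(v) for v in native.values()}
    reference_rows = {len(v) for v in reference.values()}
    assert native_rows == reference_rows, "row counts differ"
    a, b = numeric_checksums(native), numeric_checksums(reference)
    assert a.keys() == b.keys(), "numeric columns differ"
    for name in a:
        assert a[name][0] == b[name][0], f"non-null count differs for {name}"
        assert math.isclose(a[name][1], b[name][1], rel_tol=rel_tol), f"sum differs for {name}"
```

Running a check like this before timing means a fast but wrong scan fails loudly instead of producing a flattering number.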

Validation:

- `uv run prek run --files dev/bench_arrow_scan.py`
- Small smoke with `--refresh --rows 10000 --files 20 --delete-rows 5000 --delete-files 10 --runs 1 --warmups 0 --table-prefix bench_native_scan_smoke --s3-endpoint http://localhost:19000`
- Full stress rerun with existing provisioned tables:
  - `uv run python dev/bench_arrow_scan.py --skip-provision --runs 1 --warmups 0 --s3-endpoint http://localhost:19000 --json-out /tmp/native_scan_bench_stress.json --markdown-out /tmp/native_scan_bench_stress.md`

Headline full-stress results:

| Scenario | Native median | PyArrow median | Native peak RSS | PyArrow peak RSS |
| --- | --- | --- | --- | --- |
| many_files_full | 859.8 ms | 2112.1 ms | 215.5 MB | 395.2 MB |
| many_files_project_id | 610.8 ms | 1783.8 ms | 206.5 MB | 232.4 MB |
| many_files_filter_id | 871.5 ms | 2502.2 ms | 205.4 MB | 241.2 MB |
| many_files_limit_1000 | 235.5 ms | 253.6 ms | 197.4 MB | 200.6 MB |
| partition_pruned_part_7 | 40.4 ms | 33.7 ms | 176.6 MB | 164.7 MB |
| partitioned_project | 113.9 ms | 128.1 ms | 201.9 MB | 226.8 MB |
| pos_deletes_full | 302.2 ms | 1618.6 ms | 206.6 MB | 232.1 MB |
| pos_deletes_filter | 306.2 ms | 1751.7 ms | 201.6 MB | 197.0 MB |
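
To make the latency columns easier to compare, the short script below derives a speedup ratio (PyArrow median divided by native median) for each scenario. The medians are copied from the table above; `MEDIANS_MS`, `speedup`, and `SPEEDUPS` are names introduced only for this illustration. If these figures come from the `--runs 1 --warmups 0` command listed under Validation, each median is a single sample, so small differences should be read with caution.

```python
# (native median ms, PyArrow median ms), copied from the results table.
MEDIANS_MS = {
    "many_files_full": (859.8, 2112.1),
    "many_files_project_id": (610.8, 1783.8),
    "many_files_filter_id": (871.5, 2502.2),
    "many_files_limit_1000": (235.5, 253.6),
    "partition_pruned_part_7": (40.4, 33.7),
    "partitioned_project": (113.9, 128.1),
    "pos_deletes_full": (302.2, 1618.6),
    "pos_deletes_filter": (306.2, 1751.7),
}


def speedup(native_ms: float, pyarrow_ms: float) -> float:
    """PyArrow median divided by native median; above 1 means native is faster."""
    return pyarrow_ms / native_ms


SPEEDUPS = {name: speedup(*pair) for name, pair in MEDIANS_MS.items()}

for name, ratio in SPEEDUPS.items():
    print(f"{name:26s} {ratio:5.2f}x")
```

By this measure the native path is roughly 2.5x to 2.9x faster on the full, projected, and filtered many-file scans and more than 5x faster on both positional-delete scenarios. It is close to parity on the limit and partitioned-projection scans, and `partition_pruned_part_7` is the one scenario where PyArrow is faster (ratio of about 0.83).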

Caveat: memory metrics cover the Python benchmark/client process only, not the Spark, REST, or MinIO containers. Process max RSS includes interpreter and import peaks; sampled peak RSS is sampled by the parent process during the measured scan and is the more useful comparison column.
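
As a rough illustration of why the process-level and Python-only numbers differ, the sketch below runs a placeholder workload in a fresh interpreter and has the child report both its tracemalloc peak and its max RSS. It is a minimal sketch, not the harness itself: `CHILD_CODE` and `measure_in_child` are hypothetical names, the list allocation stands in for a real scan, and the `resource` module is Unix-only. Parent-side sampling of peak RSS during the scan is not shown.

```python
import json
import subprocess
import sys

# Code executed in the fresh child. The list allocation is a hypothetical
# stand-in for a scan; a real harness would run the scan here instead.
CHILD_CODE = r"""
import json, resource, sys, tracemalloc

tracemalloc.start()
data = [bytes(1024) for _ in range(20_000)]  # placeholder workload, about 20 MB
_, traced_peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

max_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
max_rss_bytes = max_rss if sys.platform == "darwin" else max_rss * 1024
print(json.dumps({
    "rows": len(data),
    "tracemalloc_peak_bytes": traced_peak,
    "max_rss_bytes": max_rss_bytes,
}))
"""


def measure_in_child() -> dict:
    """Run the workload in a new interpreter and return its self-reported metrics."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(proc.stdout)


if __name__ == "__main__":
    print(measure_in_child())
```

tracemalloc only sees allocations made through Python's allocators, so memory allocated directly by native code would be missing from `tracemalloc_peak_bytes` but would still count toward RSS. Max RSS, in turn, is a high-water mark for the whole process lifetime, which is why it also reflects interpreter startup and imports.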

abnobdoss force-pushed the aba-167-native-arrow-read-integration-tests branch 2 times, most recently from 4ae56ff to 443819e Compare May 25, 2026 02:22

abnobdoss force-pushed the aba-160-native-scan-benchmarks branch from 08c21db to e3ce3e4 Compare May 25, 2026 02:23

abnobdoss force-pushed the aba-167-native-arrow-read-integration-tests branch from 443819e to abe8da0 Compare May 25, 2026 02:44

abnobdoss force-pushed the aba-160-native-scan-benchmarks branch from e3ce3e4 to 33818f4 Compare May 25, 2026 02:44

abnobdoss force-pushed the aba-167-native-arrow-read-integration-tests branch from abe8da0 to bb205f9 Compare May 25, 2026 02:59

abnobdoss force-pushed the aba-160-native-scan-benchmarks branch from 33818f4 to ed3dee6 Compare May 25, 2026 02:59

bench: compare pyarrow and native arrow scans

4cc8409

abnobdoss force-pushed the aba-160-native-scan-benchmarks branch from ed3dee6 to 4cc8409 Compare May 25, 2026 03:44

abnobdoss wants to merge 1 commit into aba-167-native-arrow-read-integration-tests from aba-160-native-scan-benchmarks
