feat(format-parquet): configurable codec with ZSTD support#84
Conversation
The parquet sink encoder hardcoded SNAPPY. For ai_gateway_messages, the `tools`/`system_text` columns are large JSON blobs repeated per message row; hyparquet-writer abandons dictionary encoding once the dictionary exceeds `pageSize` (default 1 MiB), so the ~13 distinct ~80KB `tools` values fell back to PLAIN and were stored on every row. SNAPPY's ~64KB window can't dedupe blobs that large, producing a ~1.85 GB claude partition export (377k rows). Make the codec configurable via the plugin's config slice (`codec`, plus `page_size` and `zstd_level`). ZSTD is wired through Node's built-in `zlib.zstdCompressSync`, feature-detected so a runtime without zstd (Node < 22.15) degrades to SNAPPY with a warning rather than hard-failing. Reads are already covered: the query path wires `hyparquet-compressors` (decompressZstd). Measured on the live claude partition: 1.85 GB -> 176 MiB (~10.5x), and faster to encode. Read-back verified through the same decompressors the query path uses. Adds test/core/format-parquet-codec.test.js covering default/SNAPPY, ZSTD round-trip, unavailable/unknown-codec fallback, and page_size parsing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Dual-agent review —
|
| Source | Finding (severity, evidence) | Intersects |
|---|---|---|
| Claude | Missing ZSTD encode/read integration test (minor, format-parquet-codec.test.js; index.js:160-164) | Risks (missing test coverage); Targets (encodePartition, writeOpts wiring) |
| Codex | Test Evidence Quality — no test that encoder emits a readable ZSTD parquet (cat 9, minor, index.js:159-164) | Risks (missing test coverage); same surface as Claude finding |
| Claude | zstd_level floored but not range-clamped / no config schema (nit, index.js:56-59) |
Config field chain (unbounded new key); Targets (resolveEncodeSettings) |
Codex review
Fix Validations
Hardcoded SNAPPY parquet encoder
- Status: correct
- Evidence: hypaware-core/plugins-workspace/format-parquet/src/index.js:38, hypaware-core/plugins-workspace/format-parquet/src/index.js:59, hypaware-core/plugins-workspace/format-parquet/src/index.js:161
- Assessment: The codec is now resolved from plugin config, defaults to
SNAPPY, andZSTDsupplies a writer compressor when available.
Old-Node ZSTD fallback
- Status: correct
- Evidence: hypaware-core/plugins-workspace/format-parquet/src/index.js:23, hypaware-core/plugins-workspace/format-parquet/src/index.js:46, hypaware-core/plugins-workspace/format-parquet/src/index.js:54
- Assessment: Runtime feature detection happens before encoder registration, and unavailable ZSTD degrades to SNAPPY with a warning rather than failing activation.
Findings
9) Test Evidence Quality
- Severity: minor
- Confidence: high
- Evidence: hypaware-core/plugins-workspace/format-parquet/src/index.js:159, hypaware-core/plugins-workspace/format-parquet/src/index.js:164, test/core/format-parquet-codec.test.js:35, test/core/format-parquet-codec.test.js:46
- Why it matters: The new tests prove config resolution and raw zstd compression, but not that the encoder emits a ZSTD Parquet file that the repo’s Parquet query path can read.
- Suggested fix: Add one integration-style test that activates or calls the real parquet encoder with
{ codec: 'ZSTD' }, writes viahyparquet-writer, then reads the resulting buffer through the same compressor-enabled query path used bysrc/core/query/parquet-source.js.
No Finding
- Behavioral Correctness
- Contract & Interface Fidelity
- Change Impact / Blast Radius
- Concurrency, Ordering & State Safety
- Error Handling & Resilience
- Security Surface
- Resource Lifecycle & Cleanup
- Release Safety
- Architectural Consistency
- Debuggability & Operability
Evidence Bundle
- Changed hot paths:
resolveEncodeSettingsconfig parsing at hypaware-core/plugins-workspace/format-parquet/src/index.js:38; encoder registration at hypaware-core/plugins-workspace/format-parquet/src/index.js:99; writer options at hypaware-core/plugins-workspace/format-parquet/src/index.js:161. - Impacted callers: local-fs export calls the encoder at hypaware-core/plugins-workspace/local-fs/src/index.js:105; S3 export calls the encoder at hypaware-core/plugins-workspace/s3/src/index.js:268; core wrapper calls
encoder.encodePartitionat src/core/sinks/encoder.js:46. - Impacted tests: new codec unit tests at test/core/format-parquet-codec.test.js:25, test/core/format-parquet-codec.test.js:35, test/core/format-parquet-codec.test.js:49, test/core/format-parquet-codec.test.js:58, test/core/format-parquet-codec.test.js:67; existing default-SNAPPY smoke assertions at hypaware-core/smoke/flows/blob_sink_parquet_local_fs.js:304 and hypaware-core/smoke/flows/local_parquet_export.js:307.
- Unresolved uncertainty: I ran
node --test test/core/format-parquet-codec.test.jsonly; full suite, lint, typecheck, and old-Node fallback were not independently run here.
Claude review
Claude review
Five independent review subagents (guidance compliance, shallow bug scan,
historical context, contract & callers, comments & tests) examined the PR at
its head. Guidance compliance, bug scan, and contract/caller lenses each
returned clean (the encode path was empirically write-then-read verified on
Node 24 — a 200-row ZSTD parquet round-tripped byte-identical through
hyparquet-compressors). Two findings survived the >=80 confidence filter and
adversarial cross-check; both are non-blocking.
One candidate finding was refuted and dropped: an out-of-range zstd_level
was hypothesized to throw at encode time, but a reviewer empirically tested
levels 1000/-1000/0/23 and none threw — libzstd clamps internally. The residual
(unbounded config value) is recorded below as a nit only.
Missing end-to-end ZSTD encode/read regression test on the load-bearing path
- Severity: minor
- Confidence: 82
- Evidence: test/core/format-parquet-codec.test.js (whole file); hypaware-core/plugins-workspace/format-parquet/src/index.js:160-164
- Why it matters: The PR's entire purpose (ZSTD makes S3 parquet exports ~10x smaller and stays queryable) is unproven by tests — they only exercise
resolveEncodeSettings, neverencodePartition/parquetWriteBuffer. The writer applies compression viacompressors[codec]?.(bytes) ?? bytes, so if theif (settings.compressors) writeOpts.compressors = ...wiring (index.js:162) is ever dropped, it silently emits uncompressed pages labelled ZSTD — both oversized and corrupt to readers — and every existing test still passes. (Codex independently flagged the same gap as category 9, minor.) - Suggested fix: Add one integration test: call
encodePartition(orparquetWriteBufferwith the resolved ZSTD settings) on a small fixture, read it back withparquetReadObjects+{ compressors }fromhyparquet-compressors(the same wiringsrc/core/query/parquet-source.js:4,100uses), and assert rows match and the ZSTD buffer is materially smaller than the uncompressed/SNAPPY buffer for compressible input.
zstd_level is floored but not range-clamped (unbounded config value)
- Severity: nit
- Confidence: 70
- Evidence: hypaware-core/plugins-workspace/format-parquet/src/index.js:56-59
- Why it matters:
levelaccepts any finite number and is onlyMath.floor'd; there is no plugin config schema validating it (the manifest declares none, andvalidatePluginConfigis a no-op when a plugin registers no section). In practice this is harmless — libzstd silently clamps to its supported range, so no runtime throw was observed — but the "validated config slice" wording in the docstring (index.js:26) overclaims a validation step that does not exist for these keys. - Suggested fix: Clamp
levelto zstd's supported range inresolveEncodeSettings(consistent with the "never hard-fail on bad config" promise the docstring already makes), and/or soften the comment to drop "validated".
Reports: /Users/phil/workspace/hypaware/.git/worktrees/dual-review-pr84/dual-review/pr-84
Problem
The parquet sink encoder hardcoded SNAPPY. For
ai_gateway_messages, thetoolsandsystem_textcolumns are large JSON blobs repeated on every message row.hyparquet-writerabandons dictionary encoding once the dictionary exceedspageSize(default 1 MiB) — thetoolscolumn has ~13 distinct ~80KB values (>1 MiB), so it falls back to PLAIN and stores the full blob on every one of 377k rows. SNAPPY's ~64KB window can't dedupe blobs that large, so a single claude-partition export was ~1.85 GB.Column breakdown of the old 1.8 GB export:
toolssystem_textChange
Make the encoder codec configurable via the plugin's config slice:
{ "name": "@hypaware/format-parquet", "config": { "codec": "ZSTD" } }codec:SNAPPY(default) orZSTD; alsopage_sizeandzstd_level.zlib.zstdCompressSync, feature-detected — on a runtime without zstd (Node < 22.15) it logs a warning and falls back to SNAPPY, soengines: ">=20"stays valid and no dependency is added.query/parquet-source.js) wireshyparquet-compressors, which includesdecompressZstd.Results
Measured on the live claude partition (377k rows):
Encode is also faster, and read-back was verified through the same decompressors the query path uses. Query cost: local full-scan decode is ~+17% slower than SNAPPY, but S3-backed queries are net faster (transfers ~10x fewer bytes).
Tests
test/core/format-parquet-codec.test.js: default/SNAPPY, ZSTD round-trip, unavailable-ZSTD and unknown-codec fallback,page_sizeparsing.npm run lint,npm run typecheckclean.Notes / follow-ups (not in this PR)
blob_sink_parquet_local_fs,s3_sink_export_fixture) fail identically onmaster(fixture rows materialize asrow_count=0in the tempHYP_HOME); unrelated to this change.🤖 Generated with Claude Code