feat(format-parquet): configurable codec with ZSTD support by philcunliffe · Pull Request #84 · hyparam/hypaware

philcunliffe · 2026-06-04T19:17:04Z

Problem

The parquet sink encoder hardcoded SNAPPY. For ai_gateway_messages, the tools and system_text columns are large JSON blobs repeated on every message row. hyparquet-writer abandons dictionary encoding once the dictionary exceeds pageSize (default 1 MiB) — the tools column has ~13 distinct ~80KB values (>1 MiB), so it falls back to PLAIN and stores the full blob on every one of 377k rows. SNAPPY's ~64KB window can't dedupe blobs that large, so a single claude-partition export was ~1.85 GB.

Column breakdown of the old 1.8 GB export:

column	size	share
`tools`	1,087 MiB	57%
`system_text`	471 MiB	25%
all 48 others	~336 MiB	18%

Change

Make the encoder codec configurable via the plugin's config slice:

{ "name": "@hypaware/format-parquet", "config": { "codec": "ZSTD" } }

codec: SNAPPY (default) or ZSTD; also page_size and zstd_level.
ZSTD is wired through Node's built-in zlib.zstdCompressSync, feature-detected — on a runtime without zstd (Node < 22.15) it logs a warning and falls back to SNAPPY, so engines: ">=20" stays valid and no dependency is added.
Reads already work: the query path (query/parquet-source.js) wires hyparquet-compressors, which includes decompressZstd.

Results

Measured on the live claude partition (377k rows):

	codec	size
before	SNAPPY	1.85 GB
after	ZSTD	176 MiB (~10.5x)

Encode is also faster, and read-back was verified through the same decompressors the query path uses. Query cost: local full-scan decode is ~+17% slower than SNAPPY, but S3-backed queries are net faster (transfers ~10x fewer bytes).

Tests

test/core/format-parquet-codec.test.js: default/SNAPPY, ZSTD round-trip, unavailable-ZSTD and unknown-codec fallback, page_size parsing.
Full suite: 752 pass. npm run lint, npm run typecheck clean.

Notes / follow-ups (not in this PR)

ZSTD shrinks the output, not the encoder's in-memory footprint — the encoder still materializes all rows before writing, so large-partition exports still need a raised heap and should not auto-run inside the capturing daemon. Streaming encode + S3 multipart is the proper follow-up.
Two pre-existing smoke failures (blob_sink_parquet_local_fs, s3_sink_export_fixture) fail identically on master (fixture rows materialize as row_count=0 in the temp HYP_HOME); unrelated to this change.

🤖 Generated with Claude Code

The parquet sink encoder hardcoded SNAPPY. For ai_gateway_messages, the `tools`/`system_text` columns are large JSON blobs repeated per message row; hyparquet-writer abandons dictionary encoding once the dictionary exceeds `pageSize` (default 1 MiB), so the ~13 distinct ~80KB `tools` values fell back to PLAIN and were stored on every row. SNAPPY's ~64KB window can't dedupe blobs that large, producing a ~1.85 GB claude partition export (377k rows). Make the codec configurable via the plugin's config slice (`codec`, plus `page_size` and `zstd_level`). ZSTD is wired through Node's built-in `zlib.zstdCompressSync`, feature-detected so a runtime without zstd (Node < 22.15) degrades to SNAPPY with a warning rather than hard-failing. Reads are already covered: the query path wires `hyparquet-compressors` (decompressZstd). Measured on the live claude partition: 1.85 GB -> 176 MiB (~10.5x), and faster to encode. Read-back verified through the same decompressors the query path uses. Adds test/core/format-parquet-codec.test.js covering default/SNAPPY, ZSTD round-trip, unavailable/unknown-codec fallback, and page_size parsing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

philcunliffe · 2026-06-04T21:13:03Z

Dual-agent review — `approve`

Verdict: approve
Risk class: low
Auto-merge advisory: 👎 thumbs down — requires human merge gate

Advisory only: no merge was attempted.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Source	Finding (severity, evidence)	Intersects
Claude	Missing ZSTD encode/read integration test (minor, format-parquet-codec.test.js; index.js:160-164)	Risks (missing test coverage); Targets (`encodePartition`, writeOpts wiring)
Codex	Test Evidence Quality — no test that encoder emits a readable ZSTD parquet (cat 9, minor, index.js:159-164)	Risks (missing test coverage); same surface as Claude finding
Claude	`zstd_level` floored but not range-clamped / no config schema (nit, index.js:56-59)	Config field chain (unbounded new key); Targets (`resolveEncodeSettings`)

Codex review

Fix Validations

Hardcoded SNAPPY parquet encoder

Status: correct
Evidence: hypaware-core/plugins-workspace/format-parquet/src/index.js:38, hypaware-core/plugins-workspace/format-parquet/src/index.js:59, hypaware-core/plugins-workspace/format-parquet/src/index.js:161
Assessment: The codec is now resolved from plugin config, defaults to SNAPPY, and ZSTD supplies a writer compressor when available.

Old-Node ZSTD fallback

Status: correct
Evidence: hypaware-core/plugins-workspace/format-parquet/src/index.js:23, hypaware-core/plugins-workspace/format-parquet/src/index.js:46, hypaware-core/plugins-workspace/format-parquet/src/index.js:54
Assessment: Runtime feature detection happens before encoder registration, and unavailable ZSTD degrades to SNAPPY with a warning rather than failing activation.

Findings

9) Test Evidence Quality

Severity: minor
Confidence: high
Evidence: hypaware-core/plugins-workspace/format-parquet/src/index.js:159, hypaware-core/plugins-workspace/format-parquet/src/index.js:164, test/core/format-parquet-codec.test.js:35, test/core/format-parquet-codec.test.js:46
Why it matters: The new tests prove config resolution and raw zstd compression, but not that the encoder emits a ZSTD Parquet file that the repo’s Parquet query path can read.
Suggested fix: Add one integration-style test that activates or calls the real parquet encoder with { codec: 'ZSTD' }, writes via hyparquet-writer, then reads the resulting buffer through the same compressor-enabled query path used by src/core/query/parquet-source.js.

No Finding

Behavioral Correctness
Contract & Interface Fidelity
Change Impact / Blast Radius
Concurrency, Ordering & State Safety
Error Handling & Resilience
Security Surface
Resource Lifecycle & Cleanup
Release Safety
Architectural Consistency
Debuggability & Operability

Evidence Bundle

Changed hot paths: resolveEncodeSettings config parsing at hypaware-core/plugins-workspace/format-parquet/src/index.js:38; encoder registration at hypaware-core/plugins-workspace/format-parquet/src/index.js:99; writer options at hypaware-core/plugins-workspace/format-parquet/src/index.js:161.
Impacted callers: local-fs export calls the encoder at hypaware-core/plugins-workspace/local-fs/src/index.js:105; S3 export calls the encoder at hypaware-core/plugins-workspace/s3/src/index.js:268; core wrapper calls encoder.encodePartition at src/core/sinks/encoder.js:46.
Impacted tests: new codec unit tests at test/core/format-parquet-codec.test.js:25, test/core/format-parquet-codec.test.js:35, test/core/format-parquet-codec.test.js:49, test/core/format-parquet-codec.test.js:58, test/core/format-parquet-codec.test.js:67; existing default-SNAPPY smoke assertions at hypaware-core/smoke/flows/blob_sink_parquet_local_fs.js:304 and hypaware-core/smoke/flows/local_parquet_export.js:307.
Unresolved uncertainty: I ran node --test test/core/format-parquet-codec.test.js only; full suite, lint, typecheck, and old-Node fallback were not independently run here.

Claude review

Claude review

Five independent review subagents (guidance compliance, shallow bug scan,
historical context, contract & callers, comments & tests) examined the PR at
its head. Guidance compliance, bug scan, and contract/caller lenses each
returned clean (the encode path was empirically write-then-read verified on
Node 24 — a 200-row ZSTD parquet round-tripped byte-identical through
hyparquet-compressors). Two findings survived the >=80 confidence filter and
adversarial cross-check; both are non-blocking.

One candidate finding was refuted and dropped: an out-of-range zstd_level
was hypothesized to throw at encode time, but a reviewer empirically tested
levels 1000/-1000/0/23 and none threw — libzstd clamps internally. The residual
(unbounded config value) is recorded below as a nit only.

Missing end-to-end ZSTD encode/read regression test on the load-bearing path

Severity: minor
Confidence: 82
Evidence: test/core/format-parquet-codec.test.js (whole file); hypaware-core/plugins-workspace/format-parquet/src/index.js:160-164
Why it matters: The PR's entire purpose (ZSTD makes S3 parquet exports ~10x smaller and stays queryable) is unproven by tests — they only exercise resolveEncodeSettings, never encodePartition/parquetWriteBuffer. The writer applies compression via compressors[codec]?.(bytes) ?? bytes, so if the if (settings.compressors) writeOpts.compressors = ... wiring (index.js:162) is ever dropped, it silently emits uncompressed pages labelled ZSTD — both oversized and corrupt to readers — and every existing test still passes. (Codex independently flagged the same gap as category 9, minor.)
Suggested fix: Add one integration test: call encodePartition (or parquetWriteBuffer with the resolved ZSTD settings) on a small fixture, read it back with parquetReadObjects + { compressors } from hyparquet-compressors (the same wiring src/core/query/parquet-source.js:4,100 uses), and assert rows match and the ZSTD buffer is materially smaller than the uncompressed/SNAPPY buffer for compressible input.

`zstd_level` is floored but not range-clamped (unbounded config value)

Severity: nit
Confidence: 70
Evidence: hypaware-core/plugins-workspace/format-parquet/src/index.js:56-59
Why it matters: level accepts any finite number and is only Math.floor'd; there is no plugin config schema validating it (the manifest declares none, and validatePluginConfig is a no-op when a plugin registers no section). In practice this is harmless — libzstd silently clamps to its supported range, so no runtime throw was observed — but the "validated config slice" wording in the docstring (index.js:26) overclaims a validation step that does not exist for these keys.
Suggested fix: Clamp level to zstd's supported range in resolveEncodeSettings (consistent with the "never hard-fail on bad config" promise the docstring already makes), and/or soften the comment to drop "validated".

Reports: /Users/phil/workspace/hypaware/.git/worktrees/dual-review-pr84/dual-review/pr-84

philcunliffe merged commit 72da9e8 into master Jun 4, 2026
6 checks passed

philcunliffe deleted the feat/parquet-export-zstd-codec branch June 4, 2026 21:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(format-parquet): configurable codec with ZSTD support#84

feat(format-parquet): configurable codec with ZSTD support#84
philcunliffe merged 1 commit into
masterfrom
feat/parquet-export-zstd-codec

philcunliffe commented Jun 4, 2026

Uh oh!

philcunliffe commented Jun 4, 2026

Fix Validations

Hardcoded SNAPPY parquet encoder

Old-Node ZSTD fallback

Findings

9) Test Evidence Quality

No Finding

Evidence Bundle

Claude review

Missing end-to-end ZSTD encode/read regression test on the load-bearing path

`zstd_level` is floored but not range-clamped (unbounded config value)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

philcunliffe commented Jun 4, 2026

Problem

Change

Results

Tests

Notes / follow-ups (not in this PR)

Uh oh!

philcunliffe commented Jun 4, 2026

Dual-agent review — approve

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Fix Validations

Hardcoded SNAPPY parquet encoder

Old-Node ZSTD fallback

Findings

9) Test Evidence Quality

No Finding

Evidence Bundle

Claude review

Missing end-to-end ZSTD encode/read regression test on the load-bearing path

zstd_level is floored but not range-clamped (unbounded config value)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Dual-agent review — `approve`

`zstd_level` is floored but not range-clamped (unbounded config value)