Skip to content

feat(format-parquet): configurable codec with ZSTD support#84

Merged
philcunliffe merged 1 commit into
masterfrom
feat/parquet-export-zstd-codec
Jun 4, 2026
Merged

feat(format-parquet): configurable codec with ZSTD support#84
philcunliffe merged 1 commit into
masterfrom
feat/parquet-export-zstd-codec

Conversation

@philcunliffe
Copy link
Copy Markdown
Contributor

Problem

The parquet sink encoder hardcoded SNAPPY. For ai_gateway_messages, the tools and system_text columns are large JSON blobs repeated on every message row. hyparquet-writer abandons dictionary encoding once the dictionary exceeds pageSize (default 1 MiB) — the tools column has ~13 distinct ~80KB values (>1 MiB), so it falls back to PLAIN and stores the full blob on every one of 377k rows. SNAPPY's ~64KB window can't dedupe blobs that large, so a single claude-partition export was ~1.85 GB.

Column breakdown of the old 1.8 GB export:

column size share
tools 1,087 MiB 57%
system_text 471 MiB 25%
all 48 others ~336 MiB 18%

Change

Make the encoder codec configurable via the plugin's config slice:

{ "name": "@hypaware/format-parquet", "config": { "codec": "ZSTD" } }
  • codec: SNAPPY (default) or ZSTD; also page_size and zstd_level.
  • ZSTD is wired through Node's built-in zlib.zstdCompressSync, feature-detected — on a runtime without zstd (Node < 22.15) it logs a warning and falls back to SNAPPY, so engines: ">=20" stays valid and no dependency is added.
  • Reads already work: the query path (query/parquet-source.js) wires hyparquet-compressors, which includes decompressZstd.

Results

Measured on the live claude partition (377k rows):

codec size
before SNAPPY 1.85 GB
after ZSTD 176 MiB (~10.5x)

Encode is also faster, and read-back was verified through the same decompressors the query path uses. Query cost: local full-scan decode is ~+17% slower than SNAPPY, but S3-backed queries are net faster (transfers ~10x fewer bytes).

Tests

  • test/core/format-parquet-codec.test.js: default/SNAPPY, ZSTD round-trip, unavailable-ZSTD and unknown-codec fallback, page_size parsing.
  • Full suite: 752 pass. npm run lint, npm run typecheck clean.

Notes / follow-ups (not in this PR)

  • ZSTD shrinks the output, not the encoder's in-memory footprint — the encoder still materializes all rows before writing, so large-partition exports still need a raised heap and should not auto-run inside the capturing daemon. Streaming encode + S3 multipart is the proper follow-up.
  • Two pre-existing smoke failures (blob_sink_parquet_local_fs, s3_sink_export_fixture) fail identically on master (fixture rows materialize as row_count=0 in the temp HYP_HOME); unrelated to this change.

🤖 Generated with Claude Code

The parquet sink encoder hardcoded SNAPPY. For ai_gateway_messages, the
`tools`/`system_text` columns are large JSON blobs repeated per message
row; hyparquet-writer abandons dictionary encoding once the dictionary
exceeds `pageSize` (default 1 MiB), so the ~13 distinct ~80KB `tools`
values fell back to PLAIN and were stored on every row. SNAPPY's ~64KB
window can't dedupe blobs that large, producing a ~1.85 GB claude
partition export (377k rows).

Make the codec configurable via the plugin's config slice
(`codec`, plus `page_size` and `zstd_level`). ZSTD is wired through
Node's built-in `zlib.zstdCompressSync`, feature-detected so a runtime
without zstd (Node < 22.15) degrades to SNAPPY with a warning rather
than hard-failing. Reads are already covered: the query path wires
`hyparquet-compressors` (decompressZstd).

Measured on the live claude partition: 1.85 GB -> 176 MiB (~10.5x), and
faster to encode. Read-back verified through the same decompressors the
query path uses.

Adds test/core/format-parquet-codec.test.js covering default/SNAPPY,
ZSTD round-trip, unavailable/unknown-codec fallback, and page_size
parsing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@philcunliffe
Copy link
Copy Markdown
Contributor Author

Dual-agent review — approve

  • Verdict: approve
  • Risk class: low
  • Auto-merge advisory: 👎 thumbs down — requires human merge gate

Advisory only: no merge was attempted.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Source Finding (severity, evidence) Intersects
Claude Missing ZSTD encode/read integration test (minor, format-parquet-codec.test.js; index.js:160-164) Risks (missing test coverage); Targets (encodePartition, writeOpts wiring)
Codex Test Evidence Quality — no test that encoder emits a readable ZSTD parquet (cat 9, minor, index.js:159-164) Risks (missing test coverage); same surface as Claude finding
Claude zstd_level floored but not range-clamped / no config schema (nit, index.js:56-59) Config field chain (unbounded new key); Targets (resolveEncodeSettings)
Codex review

Fix Validations

Hardcoded SNAPPY parquet encoder

  • Status: correct
  • Evidence: hypaware-core/plugins-workspace/format-parquet/src/index.js:38, hypaware-core/plugins-workspace/format-parquet/src/index.js:59, hypaware-core/plugins-workspace/format-parquet/src/index.js:161
  • Assessment: The codec is now resolved from plugin config, defaults to SNAPPY, and ZSTD supplies a writer compressor when available.

Old-Node ZSTD fallback

  • Status: correct
  • Evidence: hypaware-core/plugins-workspace/format-parquet/src/index.js:23, hypaware-core/plugins-workspace/format-parquet/src/index.js:46, hypaware-core/plugins-workspace/format-parquet/src/index.js:54
  • Assessment: Runtime feature detection happens before encoder registration, and unavailable ZSTD degrades to SNAPPY with a warning rather than failing activation.

Findings

9) Test Evidence Quality

  • Severity: minor
  • Confidence: high
  • Evidence: hypaware-core/plugins-workspace/format-parquet/src/index.js:159, hypaware-core/plugins-workspace/format-parquet/src/index.js:164, test/core/format-parquet-codec.test.js:35, test/core/format-parquet-codec.test.js:46
  • Why it matters: The new tests prove config resolution and raw zstd compression, but not that the encoder emits a ZSTD Parquet file that the repo’s Parquet query path can read.
  • Suggested fix: Add one integration-style test that activates or calls the real parquet encoder with { codec: 'ZSTD' }, writes via hyparquet-writer, then reads the resulting buffer through the same compressor-enabled query path used by src/core/query/parquet-source.js.

No Finding

  1. Behavioral Correctness
  2. Contract & Interface Fidelity
  3. Change Impact / Blast Radius
  4. Concurrency, Ordering & State Safety
  5. Error Handling & Resilience
  6. Security Surface
  7. Resource Lifecycle & Cleanup
  8. Release Safety
  9. Architectural Consistency
  10. Debuggability & Operability

Evidence Bundle

  • Changed hot paths: resolveEncodeSettings config parsing at hypaware-core/plugins-workspace/format-parquet/src/index.js:38; encoder registration at hypaware-core/plugins-workspace/format-parquet/src/index.js:99; writer options at hypaware-core/plugins-workspace/format-parquet/src/index.js:161.
  • Impacted callers: local-fs export calls the encoder at hypaware-core/plugins-workspace/local-fs/src/index.js:105; S3 export calls the encoder at hypaware-core/plugins-workspace/s3/src/index.js:268; core wrapper calls encoder.encodePartition at src/core/sinks/encoder.js:46.
  • Impacted tests: new codec unit tests at test/core/format-parquet-codec.test.js:25, test/core/format-parquet-codec.test.js:35, test/core/format-parquet-codec.test.js:49, test/core/format-parquet-codec.test.js:58, test/core/format-parquet-codec.test.js:67; existing default-SNAPPY smoke assertions at hypaware-core/smoke/flows/blob_sink_parquet_local_fs.js:304 and hypaware-core/smoke/flows/local_parquet_export.js:307.
  • Unresolved uncertainty: I ran node --test test/core/format-parquet-codec.test.js only; full suite, lint, typecheck, and old-Node fallback were not independently run here.
Claude review

Claude review

Five independent review subagents (guidance compliance, shallow bug scan,
historical context, contract & callers, comments & tests) examined the PR at
its head. Guidance compliance, bug scan, and contract/caller lenses each
returned clean (the encode path was empirically write-then-read verified on
Node 24 — a 200-row ZSTD parquet round-tripped byte-identical through
hyparquet-compressors). Two findings survived the >=80 confidence filter and
adversarial cross-check; both are non-blocking.

One candidate finding was refuted and dropped: an out-of-range zstd_level
was hypothesized to throw at encode time, but a reviewer empirically tested
levels 1000/-1000/0/23 and none threw — libzstd clamps internally. The residual
(unbounded config value) is recorded below as a nit only.

Missing end-to-end ZSTD encode/read regression test on the load-bearing path

  • Severity: minor
  • Confidence: 82
  • Evidence: test/core/format-parquet-codec.test.js (whole file); hypaware-core/plugins-workspace/format-parquet/src/index.js:160-164
  • Why it matters: The PR's entire purpose (ZSTD makes S3 parquet exports ~10x smaller and stays queryable) is unproven by tests — they only exercise resolveEncodeSettings, never encodePartition/parquetWriteBuffer. The writer applies compression via compressors[codec]?.(bytes) ?? bytes, so if the if (settings.compressors) writeOpts.compressors = ... wiring (index.js:162) is ever dropped, it silently emits uncompressed pages labelled ZSTD — both oversized and corrupt to readers — and every existing test still passes. (Codex independently flagged the same gap as category 9, minor.)
  • Suggested fix: Add one integration test: call encodePartition (or parquetWriteBuffer with the resolved ZSTD settings) on a small fixture, read it back with parquetReadObjects + { compressors } from hyparquet-compressors (the same wiring src/core/query/parquet-source.js:4,100 uses), and assert rows match and the ZSTD buffer is materially smaller than the uncompressed/SNAPPY buffer for compressible input.

zstd_level is floored but not range-clamped (unbounded config value)

  • Severity: nit
  • Confidence: 70
  • Evidence: hypaware-core/plugins-workspace/format-parquet/src/index.js:56-59
  • Why it matters: level accepts any finite number and is only Math.floor'd; there is no plugin config schema validating it (the manifest declares none, and validatePluginConfig is a no-op when a plugin registers no section). In practice this is harmless — libzstd silently clamps to its supported range, so no runtime throw was observed — but the "validated config slice" wording in the docstring (index.js:26) overclaims a validation step that does not exist for these keys.
  • Suggested fix: Clamp level to zstd's supported range in resolveEncodeSettings (consistent with the "never hard-fail on bad config" promise the docstring already makes), and/or soften the comment to drop "validated".

Reports: /Users/phil/workspace/hypaware/.git/worktrees/dual-review-pr84/dual-review/pr-84

@philcunliffe philcunliffe merged commit 72da9e8 into master Jun 4, 2026
6 checks passed
@philcunliffe philcunliffe deleted the feat/parquet-export-zstd-codec branch June 4, 2026 21:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant