fix: make cache compaction memory-safe and reclaim orphaned generations#82
Merged
Conversation
The daemon was OOM-crash-looping: every gateway exchange appends a separate tiny Iceberg data file, so a large per-row column (notably the denormalized `tools` definitions, ~128KB) is physically re-stored in every file. Compaction is meant to merge these, but it batched a fixed 10,000 rows regardless of size — ~1.3GB per batch on real data — and blew Node's ~4GB heap before finishing. Each crash severed the in-flight proxied request, surfacing to clients as "the socket connection was closed unexpectedly". Crashes mid-compaction also left `table-<seq>` generations with no `.retired` marker, which the old sweep skipped forever, leaking ~9.5GB. Changes: - Byte-aware compaction batching: flush at `compact_batch_bytes` (new config, default 32MB) or 10k rows, whichever comes first. Caps peak heap regardless of per-row payload size. Applied to both the active source-table and legacy epoch compaction paths. - Orphan-generation cleanup: the retired-dir sweep now also reclaims any cursor-unreferenced `table-*`/`epoch=*` generation older than a 1h grace, even without a `.retired` marker, while protecting the live generation and any freshly written (possibly in-flight) one. - Tests: byte-budget batching splits a fat column across files while preserving all rows; a generous budget compacts to one file; the orphan sweep reclaims a stale unreferenced dir and keeps live + fresh. Validated against a real 1.3GB / 3,529-file table under a 4GB cap: compacts to 299MB / 117 files, all 358,892 rows preserved, peak RSS 2.8GB, no OOM. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
Author
Dual-agent review —
|
| Source | Finding (severity, evidence) | Intersects |
|---|---|---|
| Claude | Orphan sweep deletes live table on unreadable cursor (blocker, maintenance.js walkForRetired / partition.js:47-49) | Risks#1, Concurrency surface, Direct callers (cleanRetiredEpochs every tick) |
| Claude + Codex | compact_batch_bytes stripped by validator (major, schema.js:493-496) | Risks#2, Config field chain |
| Claude + Codex | Public QueryCacheMaintenanceConfig omits field (minor, kernel-types.d.ts:460) | Config field chain |
| Claude | No epoch-layout coverage (minor, maintenance.js compactPartition) | Risks#3 |
Codex review
Fix Validations
Byte-aware compaction batching
- Status: correct
- Evidence: src/core/cache/maintenance.js:484, src/core/cache/maintenance.js:502, src/core/cache/maintenance.js:599, src/core/cache/maintenance.js:617
- Assessment: Both source-table and legacy epoch compaction now flush on
COMPACT_BATCH_SIZEorcompact_batch_bytes, so the fixed 10k-row batch failure mode is addressed in the compaction code paths.
Orphaned generation cleanup
- Status: correct
- Evidence: src/core/cache/maintenance.js:697, src/core/cache/maintenance.js:704, src/core/cache/maintenance.js:721
- Assessment: The sweep now determines the live generation from
cursor.json, skips it, and reclaims stale unreferencedtable*/epoch=*directories after the orphan grace window.
Findings
2) Contract & Interface Fidelity
- Severity: major
- Confidence: high
- Evidence: src/core/cache/maintenance.js:80, src/core/config/schema.js:493, src/core/config/schema.js:495, collectivus-plugin-kernel-types.d.ts:460
- Why it matters:
compact_batch_bytesis normalized internally, but user config parsing and the publicQueryCacheMaintenanceConfigomit it, so daemon/CLI config cannot actually set the new memory cap. - Suggested fix: Add
compact_batch_bytes?: numbertoQueryCacheMaintenanceConfig, include it in thequery.cache.maintenancenumeric schema allowlist, and add a config-schema test proving the field survives parsing.
No Finding
- Behavioral Correctness
- Change Impact / Blast Radius
- Concurrency, Ordering & State Safety
- Error Handling & Resilience
- Security Surface
- Resource Lifecycle & Cleanup
- Release Safety
- Test Evidence Quality
- Architectural Consistency
- Debuggability & Operability
Evidence Bundle
- Changed hot paths:
normalizeMaintenanceConfigat src/core/cache/maintenance.js:71; source-table compaction batching at src/core/cache/maintenance.js:488; legacy epoch batching at src/core/cache/maintenance.js:603; retired/orphan sweep at src/core/cache/maintenance.js:688. - Impacted callers: daemon maintenance config flow at src/core/daemon/runtime.js:292 and src/core/daemon/runtime.js:296; CLI maintain config pass-through at src/core/cli/core_commands.js:864 and src/core/cli/core_commands.js:871; direct smoke caller at hypaware-core/smoke/flows/cache_lifecycle_maintenance.js:98.
- Impacted tests: default normalization at test/core/cache-retention-maintenance.test.js:586; explicit normalizer field at test/core/cache-retention-maintenance.test.js:591; byte-budget source-table split at test/core/cache-retention-maintenance.test.js:596; generous-budget single file at test/core/cache-retention-maintenance.test.js:633; stale orphan reclaim at test/core/cache-retention-maintenance.test.js:659.
- Unresolved uncertainty: I did not run tests; this was a diff-focused review under the prompt’s file-read limit.
Claude review
Claude review
Orphan sweep deletes the live table generation when cursor.json is unreadable
- Severity: blocker
- Confidence: 88
- Evidence: src/core/cache/maintenance.js (PR-added
walkForRetiredorphan branch +liveGenerationDir); root cause src/core/cache/partition.js:47-49 (readCursorSyncswallows parse errors and returns{ epoch: 0, rowCount: 0, compaction: null }) - Why it matters: When a partition's
cursor.jsonexists as a dirent butJSON.parsefails (partial/corrupt write — plausible precisely in this daemon's crash scenarios),hasCursoris still true, soliveDirName = liveGenerationDir(readCursorSync(dir))resolves to the synthetic"epoch=0". A source-table partition's real live dir (table/table-<seq>) then fails theentry.name === liveDirNameguard, has no.retiredmarker, and — once its mtime ages past the 1hORPHAN_GRACE_MS(normal for idle/low-traffic sources) — isfs.rmSync'd, destroying the only live cached data. The pre-PR design deleted ONLY dirs carrying an explicit.retiredmarker, so an unreadable cursor could never delete live data; this PR's mtime branch removes that protection. - Suggested fix: Make
readCursorSyncdistinguish "file absent" from "file present but unparseable" (return null / throw on parse failure of an existing file). InwalkForRetired, run the orphan branch ONLY when the live generation is positively known — never fall back to a synthetic"epoch=0". Add a regression test: a partition with a corruptcursor.jsonmust preserve its livetabledir.
compact_batch_bytes is stripped by config validation, so the knob is unsettable from config
- Severity: major
- Confidence: 92
- Evidence: src/core/config/schema.js:493-496
- Why it matters:
parseQueryConfigrebuildsquery.cache.maintenancefrom a fixed numeric-key allow-list that omitscompact_batch_bytes, so a user-supplied value is silently dropped before reachingnormalizeMaintenanceConfig; the new memory cap can never be tuned from a config file despite the interface JSDoc presenting it as configurable (it always falls back to the 32 MiB default). Independently flagged by Codex (category 2, major). - Suggested fix: Add
'compact_batch_bytes'to the allow-list tuple at src/core/config/schema.js:494-495 and add a config-schema test proving the field survives parsing.
User-facing QueryCacheMaintenanceConfig omits compact_batch_bytes
- Severity: minor
- Confidence: 85
- Evidence: collectivus-plugin-kernel-types.d.ts:460-469
- Why it matters: The internal
MaintenanceConfiggained the field but the public config type that flows fromquery.cache.maintenancedid not, so even after fixing the validator a typed config author has no declaration for the knob. - Suggested fix: Add
compact_batch_bytes?: numbertoQueryCacheMaintenanceConfigalongsidecompact_avg_file_bytes.
No test coverage for the legacy epoch-layout byte batching or epoch orphan sweep
- Severity: minor
- Confidence: 80
- Evidence: src/core/cache/maintenance.js (compactPartition byte-batching;
liveGenerationDirreturningepoch=<n>); test/core/cache-retention-maintenance.test.js (no epoch-layout byte-cap or orphan test) - Why it matters: Both new behaviors are duplicated for the legacy epoch path but only exercised for source-table layout, so a regression in the epoch path would pass CI.
- Suggested fix: Add one epoch-layout orphan-sweep test (stale
epoch=<old>reclaimed, liveepoch=<n>preserved) and an epoch byte-budget assertion.
Reports: /Users/phil/workspace/hypaware/.git/worktrees/memory-safe-compaction/dual-review/pr-82
…_batch_bytes through config
Blocker (orphan sweep could delete live data): the sweep determined the
live generation via readCursorSync, which collapses "cursor missing" and
"cursor present but unparseable" into the same default {epoch:0}. A
corrupt/partial cursor.json (plausible exactly in the crash scenarios
this PR targets) made liveGenerationDir synthesize "epoch=0", so a
source-table partition's live `table` dir was misclassified as an orphan
and reclaimed after the grace window.
- Add partition.tryReadCursorSync: returns null when the cursor file is
missing OR unreadable/unparseable (readCursorSync now delegates to it).
- walkForRetired uses tryReadCursorSync; the orphan branch runs only when
a live generation is positively known — never against a synthetic name.
- liveGenerationDir is layout-aware: a source-table cursor without an
explicit tableDir resolves to `table`, never an `epoch=*` name.
- Regression test: a corrupt cursor.json must preserve the live table
(verified to fail without the guard).
Major (compact_batch_bytes was non-functional via config): the field was
stripped by the config validator allow-list and absent from the public
type, so the new memory cap could never be set from a config file.
- Add compact_batch_bytes to the query.cache.maintenance validator
allow-list (non-negative number check) and to QueryCacheMaintenanceConfig.
- Tests: the field survives parsing; a negative value is rejected.
Minor: add legacy epoch-layout orphan-sweep coverage.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The local daemon was in an OOM crash-loop, surfacing to clients (e.g. Claude Code routing through the gateway) as:
Each crash severed the in-flight proxied request. Observed locally: 62 OOMs / 84 restarts in a day.
Root cause
Every gateway exchange appends a separate tiny Iceberg data file, so dedup/dictionary encoding can't span files and a large per-row column — notably the denormalized
toolsdefinitions (~128 KB) — is physically re-stored in every file (3,529 files in one real table). Compaction is the intended cure, but it batched a fixed 10,000 rows regardless of size (~1.3 GB/batch on real data) and blew Node's ~4 GB heap before finishing. So small files never merged and the daemon kept dying.Crashes mid-compaction also left
table-<seq>generations with no.retiredmarker, which the old sweep skipped forever — leaking ~9.5 GB of orphaned generations.Changes
compact_batch_bytes(newMaintenanceConfigfield, default 32 MB) or 10k rows, whichever comes first. Peak heap is now bounded regardless of per-row payload size. Applied to both the active source-table and legacy epoch paths.walkForRetirednow also reclaims any cursor-unreferencedtable-*/epoch=*generation older than a 1 h grace, even with no.retiredmarker, while protecting the live generation and any freshly written (possibly in-flight) one.Validation
npm test: 688 pass (683 + 5 new).npm run typecheck+npm run lint: clean.Smokes:
cache_lifecycle_maintenance(the changed path),cache_roundtrip,cache_spool_batching,status_diagnostics,core_boot_noopall pass. (gateway_claude_captureandlocal_parquet_exportfail identically on the base branch — pre-existing, unrelated to this change.)Real data, under a 4 GB heap cap (the ceiling that was crashing the daemon):
Follow-ups (not in this PR)
tools/system_textrepresentation would remove it at the source — worth a separate design.hyp query maintainafter upgrading to recompact existing tables.🤖 Generated with Claude Code