Skip to content

fix: make cache compaction memory-safe and reclaim orphaned generations#82

Merged
philcunliffe merged 2 commits into
masterfrom
worktree-memory-safe-compaction
Jun 2, 2026
Merged

fix: make cache compaction memory-safe and reclaim orphaned generations#82
philcunliffe merged 2 commits into
masterfrom
worktree-memory-safe-compaction

Conversation

@philcunliffe
Copy link
Copy Markdown
Contributor

Problem

The local daemon was in an OOM crash-loop, surfacing to clients (e.g. Claude Code routing through the gateway) as:

API Error: The socket connection was closed unexpectedly

Each crash severed the in-flight proxied request. Observed locally: 62 OOMs / 84 restarts in a day.

Root cause

Every gateway exchange appends a separate tiny Iceberg data file, so dedup/dictionary encoding can't span files and a large per-row column — notably the denormalized tools definitions (~128 KB) — is physically re-stored in every file (3,529 files in one real table). Compaction is the intended cure, but it batched a fixed 10,000 rows regardless of size (~1.3 GB/batch on real data) and blew Node's ~4 GB heap before finishing. So small files never merged and the daemon kept dying.

Crashes mid-compaction also left table-<seq> generations with no .retired marker, which the old sweep skipped forever — leaking ~9.5 GB of orphaned generations.

Changes

  • Byte-aware compaction batching — flush at compact_batch_bytes (new MaintenanceConfig field, default 32 MB) or 10k rows, whichever comes first. Peak heap is now bounded regardless of per-row payload size. Applied to both the active source-table and legacy epoch paths.
  • Orphan-generation cleanupwalkForRetired now also reclaims any cursor-unreferenced table-*/epoch=* generation older than a 1 h grace, even with no .retired marker, while protecting the live generation and any freshly written (possibly in-flight) one.
  • Tests (5 new) — byte-budget batching splits a fat column across files while preserving all rows; a generous budget compacts to a single file; the orphan sweep reclaims a stale unreferenced dir and keeps live + fresh.

Validation

  • npm test: 688 pass (683 + 5 new). npm run typecheck + npm run lint: clean.

  • Smokes: cache_lifecycle_maintenance (the changed path), cache_roundtrip, cache_spool_batching, status_diagnostics, core_boot_noop all pass. (gateway_claude_capture and local_parquet_export fail identically on the base branch — pre-existing, unrelated to this change.)

  • Real data, under a 4 GB heap cap (the ceiling that was crashing the daemon):

    Before After
    Size 1,321 MB 299 MB
    Files 3,529 117
    Rows 358,892 (all preserved)
    Largest batch ~1.3 GB → OOM 32–64 MB
    Peak RSS >4 GB → crash 2.8 GB

Follow-ups (not in this PR)

  • The underlying redundancy (large conversation-level blobs stored per message-part row across thousands of incremental files) is mitigated here, not eliminated. A normalized/hash-referenced tools/system_text representation would remove it at the source — worth a separate design.
  • Operators on an already-bloated cache should run hyp query maintain after upgrading to recompact existing tables.

🤖 Generated with Claude Code

The daemon was OOM-crash-looping: every gateway exchange appends a
separate tiny Iceberg data file, so a large per-row column (notably the
denormalized `tools` definitions, ~128KB) is physically re-stored in
every file. Compaction is meant to merge these, but it batched a fixed
10,000 rows regardless of size — ~1.3GB per batch on real data — and
blew Node's ~4GB heap before finishing. Each crash severed the in-flight
proxied request, surfacing to clients as "the socket connection was
closed unexpectedly". Crashes mid-compaction also left `table-<seq>`
generations with no `.retired` marker, which the old sweep skipped
forever, leaking ~9.5GB.

Changes:
- Byte-aware compaction batching: flush at `compact_batch_bytes` (new
  config, default 32MB) or 10k rows, whichever comes first. Caps peak
  heap regardless of per-row payload size. Applied to both the active
  source-table and legacy epoch compaction paths.
- Orphan-generation cleanup: the retired-dir sweep now also reclaims any
  cursor-unreferenced `table-*`/`epoch=*` generation older than a 1h
  grace, even without a `.retired` marker, while protecting the live
  generation and any freshly written (possibly in-flight) one.
- Tests: byte-budget batching splits a fat column across files while
  preserving all rows; a generous budget compacts to one file; the
  orphan sweep reclaims a stale unreferenced dir and keeps live + fresh.

Validated against a real 1.3GB / 3,529-file table under a 4GB cap:
compacts to 299MB / 117 files, all 358,892 rows preserved, peak RSS
2.8GB, no OOM.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@philcunliffe
Copy link
Copy Markdown
Contributor Author

Dual-agent review — block

  • Verdict: block
  • Risk class: medium
  • Auto-merge advisory: 👎 thumbs down — verdict is block; needs human-gated follow-up

Advisory only: no merge was attempted.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Source Finding (severity, evidence) Intersects
Claude Orphan sweep deletes live table on unreadable cursor (blocker, maintenance.js walkForRetired / partition.js:47-49) Risks#1, Concurrency surface, Direct callers (cleanRetiredEpochs every tick)
Claude + Codex compact_batch_bytes stripped by validator (major, schema.js:493-496) Risks#2, Config field chain
Claude + Codex Public QueryCacheMaintenanceConfig omits field (minor, kernel-types.d.ts:460) Config field chain
Claude No epoch-layout coverage (minor, maintenance.js compactPartition) Risks#3
Codex review

Fix Validations

Byte-aware compaction batching

  • Status: correct
  • Evidence: src/core/cache/maintenance.js:484, src/core/cache/maintenance.js:502, src/core/cache/maintenance.js:599, src/core/cache/maintenance.js:617
  • Assessment: Both source-table and legacy epoch compaction now flush on COMPACT_BATCH_SIZE or compact_batch_bytes, so the fixed 10k-row batch failure mode is addressed in the compaction code paths.

Orphaned generation cleanup

  • Status: correct
  • Evidence: src/core/cache/maintenance.js:697, src/core/cache/maintenance.js:704, src/core/cache/maintenance.js:721
  • Assessment: The sweep now determines the live generation from cursor.json, skips it, and reclaims stale unreferenced table*/epoch=* directories after the orphan grace window.

Findings

2) Contract & Interface Fidelity

  • Severity: major
  • Confidence: high
  • Evidence: src/core/cache/maintenance.js:80, src/core/config/schema.js:493, src/core/config/schema.js:495, collectivus-plugin-kernel-types.d.ts:460
  • Why it matters: compact_batch_bytes is normalized internally, but user config parsing and the public QueryCacheMaintenanceConfig omit it, so daemon/CLI config cannot actually set the new memory cap.
  • Suggested fix: Add compact_batch_bytes?: number to QueryCacheMaintenanceConfig, include it in the query.cache.maintenance numeric schema allowlist, and add a config-schema test proving the field survives parsing.

No Finding

  1. Behavioral Correctness
  2. Change Impact / Blast Radius
  3. Concurrency, Ordering & State Safety
  4. Error Handling & Resilience
  5. Security Surface
  6. Resource Lifecycle & Cleanup
  7. Release Safety
  8. Test Evidence Quality
  9. Architectural Consistency
  10. Debuggability & Operability

Evidence Bundle

  • Changed hot paths: normalizeMaintenanceConfig at src/core/cache/maintenance.js:71; source-table compaction batching at src/core/cache/maintenance.js:488; legacy epoch batching at src/core/cache/maintenance.js:603; retired/orphan sweep at src/core/cache/maintenance.js:688.
  • Impacted callers: daemon maintenance config flow at src/core/daemon/runtime.js:292 and src/core/daemon/runtime.js:296; CLI maintain config pass-through at src/core/cli/core_commands.js:864 and src/core/cli/core_commands.js:871; direct smoke caller at hypaware-core/smoke/flows/cache_lifecycle_maintenance.js:98.
  • Impacted tests: default normalization at test/core/cache-retention-maintenance.test.js:586; explicit normalizer field at test/core/cache-retention-maintenance.test.js:591; byte-budget source-table split at test/core/cache-retention-maintenance.test.js:596; generous-budget single file at test/core/cache-retention-maintenance.test.js:633; stale orphan reclaim at test/core/cache-retention-maintenance.test.js:659.
  • Unresolved uncertainty: I did not run tests; this was a diff-focused review under the prompt’s file-read limit.
Claude review

Claude review

Orphan sweep deletes the live table generation when cursor.json is unreadable

  • Severity: blocker
  • Confidence: 88
  • Evidence: src/core/cache/maintenance.js (PR-added walkForRetired orphan branch + liveGenerationDir); root cause src/core/cache/partition.js:47-49 (readCursorSync swallows parse errors and returns { epoch: 0, rowCount: 0, compaction: null })
  • Why it matters: When a partition's cursor.json exists as a dirent but JSON.parse fails (partial/corrupt write — plausible precisely in this daemon's crash scenarios), hasCursor is still true, so liveDirName = liveGenerationDir(readCursorSync(dir)) resolves to the synthetic "epoch=0". A source-table partition's real live dir (table / table-<seq>) then fails the entry.name === liveDirName guard, has no .retired marker, and — once its mtime ages past the 1h ORPHAN_GRACE_MS (normal for idle/low-traffic sources) — is fs.rmSync'd, destroying the only live cached data. The pre-PR design deleted ONLY dirs carrying an explicit .retired marker, so an unreadable cursor could never delete live data; this PR's mtime branch removes that protection.
  • Suggested fix: Make readCursorSync distinguish "file absent" from "file present but unparseable" (return null / throw on parse failure of an existing file). In walkForRetired, run the orphan branch ONLY when the live generation is positively known — never fall back to a synthetic "epoch=0". Add a regression test: a partition with a corrupt cursor.json must preserve its live table dir.

compact_batch_bytes is stripped by config validation, so the knob is unsettable from config

  • Severity: major
  • Confidence: 92
  • Evidence: src/core/config/schema.js:493-496
  • Why it matters: parseQueryConfig rebuilds query.cache.maintenance from a fixed numeric-key allow-list that omits compact_batch_bytes, so a user-supplied value is silently dropped before reaching normalizeMaintenanceConfig; the new memory cap can never be tuned from a config file despite the interface JSDoc presenting it as configurable (it always falls back to the 32 MiB default). Independently flagged by Codex (category 2, major).
  • Suggested fix: Add 'compact_batch_bytes' to the allow-list tuple at src/core/config/schema.js:494-495 and add a config-schema test proving the field survives parsing.

User-facing QueryCacheMaintenanceConfig omits compact_batch_bytes

  • Severity: minor
  • Confidence: 85
  • Evidence: collectivus-plugin-kernel-types.d.ts:460-469
  • Why it matters: The internal MaintenanceConfig gained the field but the public config type that flows from query.cache.maintenance did not, so even after fixing the validator a typed config author has no declaration for the knob.
  • Suggested fix: Add compact_batch_bytes?: number to QueryCacheMaintenanceConfig alongside compact_avg_file_bytes.

No test coverage for the legacy epoch-layout byte batching or epoch orphan sweep

  • Severity: minor
  • Confidence: 80
  • Evidence: src/core/cache/maintenance.js (compactPartition byte-batching; liveGenerationDir returning epoch=<n>); test/core/cache-retention-maintenance.test.js (no epoch-layout byte-cap or orphan test)
  • Why it matters: Both new behaviors are duplicated for the legacy epoch path but only exercised for source-table layout, so a regression in the epoch path would pass CI.
  • Suggested fix: Add one epoch-layout orphan-sweep test (stale epoch=<old> reclaimed, live epoch=<n> preserved) and an epoch byte-budget assertion.

Reports: /Users/phil/workspace/hypaware/.git/worktrees/memory-safe-compaction/dual-review/pr-82

…_batch_bytes through config

Blocker (orphan sweep could delete live data): the sweep determined the
live generation via readCursorSync, which collapses "cursor missing" and
"cursor present but unparseable" into the same default {epoch:0}. A
corrupt/partial cursor.json (plausible exactly in the crash scenarios
this PR targets) made liveGenerationDir synthesize "epoch=0", so a
source-table partition's live `table` dir was misclassified as an orphan
and reclaimed after the grace window.
- Add partition.tryReadCursorSync: returns null when the cursor file is
  missing OR unreadable/unparseable (readCursorSync now delegates to it).
- walkForRetired uses tryReadCursorSync; the orphan branch runs only when
  a live generation is positively known — never against a synthetic name.
- liveGenerationDir is layout-aware: a source-table cursor without an
  explicit tableDir resolves to `table`, never an `epoch=*` name.
- Regression test: a corrupt cursor.json must preserve the live table
  (verified to fail without the guard).

Major (compact_batch_bytes was non-functional via config): the field was
stripped by the config validator allow-list and absent from the public
type, so the new memory cap could never be set from a config file.
- Add compact_batch_bytes to the query.cache.maintenance validator
  allow-list (non-negative number check) and to QueryCacheMaintenanceConfig.
- Tests: the field survives parsing; a negative value is rejected.

Minor: add legacy epoch-layout orphan-sweep coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@philcunliffe philcunliffe merged commit 085af71 into master Jun 2, 2026
6 checks passed
@philcunliffe philcunliffe deleted the worktree-memory-safe-compaction branch June 2, 2026 22:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant