PCP fork improvements: index-time importance ranking, token-budgeted repo-map tool, memory-ceiling backpressure, MCP param hardening, and pre-existing test-suite fixes by petercoxphoto · Pull Request #776 · DeusData/codebase-memory-mcp

petercoxphoto · 2026-07-02T15:53:47Z

This PR contributes a batch of improvements developed on the petercoxphoto fork. They fall into four independent areas plus a test-suite hardening pass; each was built test-first and reviewed. Happy to split into separate PRs if you'd prefer to take them piecemeal.

1. Index-time importance scoring (P3)

Computes and persists a per-symbol importance score at index time, so ranking is available without a query-time pass. Includes a canonical cbm_is_test_path classifier so real-corpus test files are penalised consistently in ranking.

2. Token-budgeted, seed-aware repo-map tool (P4)

A new repo_map MCP tool returning a ranked, token-bounded structural map of a repository, optionally seeded by anchor files/symbols (for orientation use-cases where you want the most relevant slice within a fixed token budget).

3. Memory-ceiling backpressure

Enforces an RSS memory ceiling with a per-file parse cap so a large repo can't balloon the indexer's resident set unbounded (we saw a full index reach ~15 GB before this). Also fixes cbm_mem_rss() under-counting on Linux, which had been blinding the ceiling/backpressure logic.

4. MCP parameter hardening

search_code and the project-reading tools now distinguish a missing required parameter from an unknown project (previously both returned a misleading "project not found"), and accept project_id as an alias for project.

5. Pre-existing test-suite fixes (hardening pass)

Fixes three latent defects that made the default test suite unreliable:

A real heap-use-after-free in the parallel indexer (not a test bug): extract_worker's final teardown called cbm_slab_destroy_thread() assuming the thread parser was already destroyed. A worker reaching that point with a live cached parser had its lexer's included_ranges slab page freed underneath it, causing a use-after-free on the next ts_lexer_goto. Fixed by destroying the thread parser unconditionally before the slab teardown, mirroring the two already-correct sites (whose comments document the exact invariant). Reproduced ~3 runs in 4 before the fix; clean across 20 consecutive runs after.
LeakSanitizer leaks from unfreed cbm_node_t fields in a test, which had forced an LSAN_OPTIONS=exitcode=0 workaround to keep the suite green.
An unguarded divide-by-zero (SIGFPE on a starved store) and a setenv/HOME cleanup leak on assertion failure that could poison later tests.

After this pass the full suite exits 0 under default sanitizer options (ASan + UBSan) with no leak-suppression workaround.

Known issue not addressed here (deliberately)

A separate, pre-existing flaky assertion in tests/test_convergence_probe.c (an LSP convergence-probe timing assertion) still fails intermittently — roughly 1 run in 10. It is memory-clean and unrelated to the fixes above; we left it untouched rather than fold an unrelated change into this PR.

AC1 tests (tests/test_importance.c) + suite wiring + a no-op pass stub (pass_importance.c) registered as the 7th predump pass in run_predump_passes (after all edge-producing passes). Tests fail against the stub (no importance key persisted) — the red-test boundary. Formula lands in the impl commit. Red evidence: full ASan test-runner run captured at $ARTIFACT_DIR/red-run.log — all 6 importance tests FAIL (test_importance.c:91/167/215/279/331/398). The pre-existing lang_contract ASan abort (vendored ts_runtime lexer UAF) is a baseline flake unrelated to this change. Boundary committed by the orchestrator (recovery: the builder agent authored these files but died on an API overload before committing).

pass_importance.c: weighted-degree Aider model persisted per Function/Method/ Class node as a numeric "importance" key on properties_json (append pattern from pass_complexity.c). importance = sqrt(num_refs) x priv x generic x distinct x test_penalty: num_refs = incoming CALLS + USAGE edges priv x0.1 leading-underscore generic x0.1 name defined in >=5 distinct files distinct x10 snake_case|camelCase and len>=8 test_penalty x0.1 edge-based (TARGET of a TESTS edge, or file is the SOURCE of a TESTS_FILE edge) -- replaces the filename strstr; a 'test'-substring path with no test edge is NOT penalised. Weighted-degree only; PageRank deferred (measured decision, builder-notes). All 10 fixture importance tests pass. Does not touch compute_search_score.

Real-data verification (full connectors re-index) showed the spec's edge-only TESTS/TESTS_FILE penalty does NOT achieve AC1's exclusion goal: TESTS_FILE has only 15 edges (needs a test-file->prod-file mapping), TESTS never targets a test-resident symbol, and node is_test is stamped on <2% of nodes and none of the test fixtures/doubles. Edge-only left 33/40 of the top ranked symbols as test scaffolding (session_dir, FakeConnectorClient, ...). Fix: penalise when the graph's OWN canonical classifier cbm_is_test_path() (the one pass_tests uses: test_ prefix, _test.<ext>, /tests/, ...) flags the file, OR the symbol is the target of an incoming TESTS edge (prod helpers in non-test files). This is NOT the naive strstr the spec warned against — it leaves src/testutil_helpers.go unpenalised (test-plan DeusData#6 inversion still green). Result after re-index: top-40 = 0 test-file symbols (was 33); named exclusions _run/now/run_hook/client and make_proposals/make_golden/seed_recording all rank >100 (well outside top-40); top is real domain symbols. All 10 unit tests pass. Deviation from the spec's stated mechanism — surfaced to Peter.

26 failing AC tests derived from the P4 test plan (rows 1-5, 8-13): registration/dispatch, token-budget fit (default/tight/tiny/huge/invalid), seed-boost ranking + inversion, weak-seed widen walk, empty/unresolvable-seed global fallback, score-absence gating, graceful input errors, signature-level rendering, determinism, cross-project isolation, repeated-call stability. All 26 verified FAILING against unmodified code (unknown tool). Five tests carry explicit positive discriminators so the dispatch-level error cannot trivially satisfy them. Rows 6-7 (real corpus, latency) are scripted post- implementation against a worktree binary + scratch CBM_CACHE_DIR. Pre-existing failure (not this change): test_incremental.c:302 RSS threshold (2854-2872MB vs 2048 cap under ASan), file last touched pre-base (eed87fd). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01KK6YFYGwhT69NjYevuEUCr

New MCP tool repo_map(project, seed_anchors[], token_budget) reading P3's persisted importance scores (spec wire contract frozen for P5). - Two first-class modes (P2 finding): seed-boost (resolved anchors + 1-hop CALLS/USAGE neighbourhood x50) with global importance ranking as the no-seed/unresolvable fallback. Anchor resolution: exact name -> qualified name -> file path. - Weak-seed widen (P2 shape hints): 1-hop symbol neighbourhood <= 3 -> file-of-symbol seeding + 2-hop expansion at x25. - Module/File 1-hop neighbours (file-level co-usage USAGE edges) expand to their file's symbols at x25 — this is how P2's ground-truth co-change neighbourhood (sender.py/gmail_client.py -> extract_address) travels. - Budget: chars/4 estimator (no server-side LLM tokenizer), default 1600 tokens, greedy maximal prefix over the ranked lines; token_budget <= 0 is an input error. - Determinism: effective score DESC, qualified_name ASC (total order). - Score-absence gate (spec AC7): symbol nodes with zero importance coverage -> explicit 'unscored' error advising re-index; partial coverage proceeds (NULLs rank last). - Rendering: 'file: symbol(sig)' one line per symbol; param-only signatures get the name prefixed; whitespace/newlines flattened; no bodies. - Store layer: cbm_store_importance_coverage + cbm_store_top_symbols_by_ importance (json_extract ORDER BY, project-filtered). Verified on real corpus (worktree binary, scratch CBM_CACHE_DIR): budget fit on connectors + codebase-memory-mcp in both modes; zero P3-pinned noise/test-leakage names in top-40; P2 seeded ground truth surfaces 28 sender/gmail lines from rank 3; AC6 net latency 52ms global / 55ms seeded on connectors (bound 500ms). 28/28 AC tests green; suite 5651 PASS with only the pre-existing test_incremental RSS failure. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01KK6YFYGwhT69NjYevuEUCr

AC tests for the per-file parse-size cap constant (CBM_MAX_FILE_MB) and the enforcing RSS memory ceiling, derived from test-plan.md rows 1-11. Fails to compile against unmodified HEAD (cbm_max_file_bytes/cbm_mem_ceiling/ cbm_mem_over_ceiling/CBM_DEFAULT_MAX_FILE_MB don't exist yet) — -Werror=implicit-function-declaration, confirmed on IO (make -f Makefile.cbm test, DONE_EXIT=2). Rows 5 and 11 (CBM_PERCENT / backpressure characterization pins) and row 7 (healthy-path regression pin) are behavior-preserving by design and exempt from the red-must-fail rule. - tests/test_mem.c: 19 new C unit tests (file-cap clamp/env-override/ boundary via a real cbm_parallel_extract run; ceiling clamp/floor/ env-override/live-RSS comparison). - tests/check_no_duplicate_file_cap.sh: structural grep against COMMITTED source (git show HEAD:) for row 4 (single-source-of-truth) + row 5 pin + row 11 pin. - tests/gen_mem_ceiling_repro.sh: synthetic large-file repro generator (committed for provenance — none existed in done-data). - tests/mem_ceiling_abort.sh: IO-only real-process harness driving the compiled binary (cli index_repository) for rows 6, 7, 9, 10, and the R4 store-integrity check.

Two independent fixes for the 2026-07-01 memory-balloon diagnosis (pai/diagnose-codebase-memory), both design decisions already settled by Peter (handover pai/handover-cbm-balloon): 1. Per-file parse cap: collapsed 7 duplicated "100 MB" cap sites (6x the misused CBM_PERCENT * CBM_SZ_1K * CBM_SZ_1K, 1x the PXC_MAX_FILE_BYTES_FACTOR twin in pass_lsp_cross.c) into a single named, env-overridable resolver cbm_max_file_bytes() (system_info.c, declared platform.h), mirroring cbm_default_worker_count()'s CBM_WORKERS clamp shape exactly. Default CBM_DEFAULT_MAX_FILE_MB=10 (constants.h) clears sqlite3.c's ~8MB amalgamation; env override CBM_MAX_FILE_MB clamped to [1, 1024] MB, invalid/blank/unset all fall back to the default (never coerce to 0). CBM_PERCENT itself (constants.h:48) is UNTOUCHED — its 5 legitimate percentage/depth consumers (mem.c:92,102; vmem.c:78,90; cypher.c:2838) still reference it directly. Oversized-file semantics unchanged: SKIP (read_file returns NULL), not abort. 2. Enforcing RSS memory ceiling: layered a hard-abort verdict on top of the EXISTING RSS/budget machinery (cbm_mem_rss/cbm_mem_budget/ cbm_mem_over_budget — no new RSS probe needed). New mem.h/mem.c API: cbm_mem_ceiling() (fraction of cgroup-aware detected RAM, floored, env-overridable via CBM_MEM_CEILING_MB), cbm_mem_over_ceiling(), and cbm_mem_abort_if_over_ceiling(file, phase) — logs a structured diagnostic (file, phase, rss_mb, ceiling_mb) then abort()s (SIGABRT). Wired into pass_parallel.c's extract_worker loop immediately after the existing advisory backpressure spin (pass_parallel.c:559-567) exhausts its bounded spins, so the hard check only fires once the advisory path has already tried and failed. Ceiling is always strictly above the advisory budget so a soft-overshoot repo never reaches abort. R4 (store-integrity): traced pipeline.c — cbm_parallel_extract() (where the abort is wired) always returns before dump_and_persist_hashes()/ cbm_gbuf_dump_to_sqlite() (the only SQLite write) is reached, so a hard abort here can never leave a half-written store. Not touched: the CBM_WORKERS=8 stop-gap in ~/.claude.json (Phase-5 orchestrator step, out of scope for this commit).

The row-5 characterization pin in check_no_duplicate_file_cap.sh asserted CBM_PERCENT at exact line numbers (mem.c:92,102). The implementation commit legitimately added new ceiling constants/functions earlier in mem.c, shifting those references to lines 104/114 without touching their content — an exact-line pin false-positived on this harmless drift. Switched to a per-file minimum-occurrence-count check, which still catches the real regression this row guards (the collapse repurposing or deleting a legitimate CBM_PERCENT use) without being brittle to unrelated line shifts elsewhere in the same file.

…ssure Discovered while calibrating tests/mem_ceiling_abort.sh: a synthetic large-file index genuinely reached ~2.2-2.75GB RSS (confirmed via ps and mi_process_info's own peak_rss field, which read correctly), yet cbm_mem_rss() reported ~4MB at the exact same moments, so neither the new hard ceiling nor the PRE-EXISTING advisory budget/backpressure (pass_parallel.c:559, cbm_mem_over_budget) ever saw real memory pressure during concurrent large-file parsing. Root cause: vendored/mimalloc/src/prim/unix/prim.c's _mi_prim_process_info() never sets pinfo->current_rss on Linux — only peak_rss (via getrusage's ru_maxrss, a monotonic high-water mark). current_rss silently falls back to mimalloc's own internal committed-page counter, which this project deliberately tunes low (mi_option_arena_eager_commit=0 + purge_decommits=1 + purge_delay=0, set in cbm_mem_init to reduce upfront memory) — so on Linux, "current RSS" was never actually current RSS, it was a low-biased mimalloc-internal metric that happened to be nonzero often enough to defeat the existing `current_rss > 0` ASan-only fallback guard. Fix: on Linux, cbm_mem_rss() now prefers os_rss() (/proc/self/statm, already implemented, unaffected by mimalloc's internal accounting) as the PRIMARY source, falling back to mimalloc's value only if /proc is unavailable. macOS/Windows unaffected (mimalloc's primitives set current_rss correctly there via task_info/GetProcessMemoryInfo). Verified via a calibrated 40x8MB synthetic large-file index on IO: before the fix, a 2100MB forced ceiling never tripped despite RSS reaching 2.75GB; after the fix, the same fixture aborts cleanly at rss_mb=2135 against ceiling_mb=2100 with the expected diagnostic dump. Full ASan test suite re-run clean (0 FAIL) after the fix — the pre-existing test_incremental.c:302 RSS-delta flake (a live FastAPI clone that had grown past its 2048MB assumption) did not recur, consistent with more accurate RSS reporting also improving the existing backpressure's effectiveness. tests/mem_ceiling_abort.sh: fixed two harness bugs found while calibrating this — (1) `mode=full` index_repository does not force a re-parse when file hashes are unchanged from a prior run against the same repo_path (routes to incremental.noop), so every run needing a genuine full extraction now resets the target's cached .db first; (2) the forced test ceiling is now calibrated against a directly-probed peak RSS for the exact fixture shape (2100MB, between the 2048MB floor and an observed ~2.2-2.75GB peak) instead of an unverified guess.

…oject_id alias) AC tests from the pai/codebase-memory-search test plan, committed FAILING against 132460f (red-test boundary). 8 red tests / 11 FAIL assertions, all in tests/test_mcp.c; rows 5-7 are characterization pins that pass by design. Row-4 precedence also passes at red by design: pre-fix 'alias ignored' is indistinguishable from 'project wins' — the discriminating red for the alias is the row-3 pair. Full-suite run at this boundary: red-test-run.log in the /develop run dir (11 FAILs, every non-mcp suite green incl. incremental). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…t_id alias Root cause of the misleading error (pai/codebase-memory-search): a missing 'project' param and an unknown project both reached REQUIRE_STORE's 'project not found or not indexed' + available_projects error, so a caller using the wrong param name (project_id) was told the project they named - visible in the very list printed - was not found. - New get_project_arg() helper: reads 'project', silently accepts 'project_id' as an alias when 'project' is absent/empty/non-string; a present 'project' always wins. Alias is NOT advertised in the TOOLS[] schemas (upstream param name preserved; error text teaches the canonical name). - All 12 project-reading handlers (search_graph, query_graph, trace_path, get_code_snippet, get_graph_schema, get_architecture, repo_map, search_code, delete_project, index_status, detect_changes, manage_adr) now return plain 'project is required' for a missing/empty/non-string project BEFORE any store lookup; the available_projects shape is reserved for a present-but-unknown value. - detect_changes/manage_adr not-found errors upgraded from bare 'project not found' to the standard build_project_list_error shape. - search_code's missing-param error changed from the half-conflated build_project_list_error('project is required') to the plain message. - ingest_traces stub untouched (never reads project); index_repository/ list_projects take no project param. - Param-check order preserved: query/function_name/pattern/qualified_name are still reported before project (pinned by tests). Tests: 5682 passed / 0 failed under ASan+UBSan (see run-dir test-results.txt; red boundary 353d7bf). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… unguarded div, env-poison, regex leak Four defects + one hygiene fix, all pre-existing at base 132460f (surfaced during the pai/codebase-memory-search /develop run 2026-07-02). Makes the default suite exit 0 LSan-clean with NO LSAN_OPTIONS=exitcode=0 workaround. 1. Production heap-UAF/bad-free (Layer 2): extract_worker's final cleanup called cbm_slab_destroy_thread() while a 0-file / leftover-cached tl_parser was still live, freeing the slab page backing its lexer.included_ranges -> UAF on next ts_lexer_goto or bad-free at ts_parser_delete. Add an unconditional (idempotent, NULL-safe) cbm_destroy_thread_parser() before cbm_slab_destroy_thread() at pass_parallel.c:694, mirroring the correct loop-body (682) and k8s-pass (2325) sites. Fix the stale comment. 2. LSan leaks (test_importance.c): free the caller-owned cbm_node_t fields strdup'd by cbm_store_find_node_by_qn at the 3 call sites (cbm_node_free_fields on out/out1/out2). 3a. Unguarded division (test_incremental.c): incr_db_deleted_recovery divided node counts by nodes_before, which is 0 for a starved store -> integer SIGFPE. Extract a testable compute_recovery_diff_pct() helper that refuses to divide on a <=0 baseline (FAIL loudly); add a deterministic network-free guard test for the 0/negative path. 3b. setenv-leak / HOME-poison on ASSERT failure (test_ui.c): an ASSERT between setenv(HOME,tmp) and the restore skipped cleanup on failure, leaving HOME pointing at an rmtree'd dir (poisons later suites) and leaking old_home. Add an ASSERT_HOME macro that restores HOME and frees old_home before its early return. (test_userconfig.c was checked and left unchanged — it restores env before asserting; old_appdata is a stack buffer, so no poison window and no leak.) 4. Regex leak (mcp.c, observation): handle_search_code compiled path_regex before its required-param early returns without freeing it; add cbm_regfree on those three paths. Verification (ASan+UBSan build): full suite exits 0, 5683 passed, 0 leaks; 10/10 bare runs memory-clean (no UAF/bad-free). NULL-safety of cbm_destroy_thread_parser confirmed at cbm.c:387. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… paths The cycle-1 audit found the regex-leak fix (defect 4) covered only the missing-param early returns; the write_pattern_file-failure (mcp.c:3813) and cbm_popen-failure (mcp.c:3845) returns still leaked the compiled path_regex when a path_filter was supplied. Add the same has_path_filter-guarded cbm_regfree used at the other early returns. Rebuild + full suite green (SUITE_EXIT=0, 5683 passed, 0 leaks/UAF under default sanitizer options). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

DeusData · 2026-07-02T20:21:13Z

Hey @petercoxphoto, thx will check

AC tests (test-plan rows 1-6, 8-11) for two additive index observability fields, committed FAILING before implementation: - repo_map seeds_unresolved[]: mixed/all-unresolved/duplicate cases assert the array lists exactly the anchor names rm_resolve_anchor rejected; all-resolved / zero-requested / malformed-coerced cases assert the key is omitted (pinned shape). - index_status indexed_at: new success-path coverage (ready + empty project) asserting an ISO-8601 UTC string field alongside nodes/edges/ status. Discriminating new-behavior assertions fail red (field absent -> NULL / rm_json_str_arr returns -1). ASan/UBSan leak reports at the tail are the expected early-return-skips-cleanup symptom of the failing ASSERTs; they clear when the tests go green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01WuawHYdivB84yiHwpQWWZ2

Two additive JSON response fields closing the repo-map observability gaps found on injection's first live day (2026-07-02). handle_repo_map: capture the anchor names rm_resolve_anchor rejects (ownership transferred out of anchors[] in the resolution loop instead of the unconditional free) and emit them as `seeds_unresolved`. Copied into the doc via yyjson_mut_arr_add_strcpy so the retained names are freed immediately after the response is serialized; also freed on the importance-ranking-query error path between the loop and the JSON build. Omitted when every seed resolved or none were requested. The unscored gate and all other early returns sit BEFORE the loop, so no other leak path exists (verified by grep of returns in [loop, JSON-build)). handle_index_status: first cbm_store_get_project call in this handler; emit proj.indexed_at (ISO-8601, the persisted column). Added no-copy via yyjson_mut_obj_add_str, so cbm_project_free_fields(&proj) runs strictly after yy_doc_to_str(doc) — matching the existing `project` heap-string ordering in the same function. Both additive; existing fields unchanged. Full suite green under ASan/UBSan/LeakSanitizer: 5686 passed, no ASan/LSan/UAF (vendored tree-sitter grammar UBSan findings are pre-existing, unrelated). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01WuawHYdivB84yiHwpQWWZ2

petercoxphoto and others added 13 commits July 1, 2026 16:03

petercoxphoto requested a review from DeusData as a code owner July 2, 2026 15:53

petercoxphoto and others added 2 commits July 2, 2026 23:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PCP fork improvements: index-time importance ranking, token-budgeted repo-map tool, memory-ceiling backpressure, MCP param hardening, and pre-existing test-suite fixes#776

PCP fork improvements: index-time importance ranking, token-budgeted repo-map tool, memory-ceiling backpressure, MCP param hardening, and pre-existing test-suite fixes#776
petercoxphoto wants to merge 15 commits into
DeusData:mainfrom
petercoxphoto:main

petercoxphoto commented Jul 2, 2026

Uh oh!

DeusData commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

petercoxphoto commented Jul 2, 2026

1. Index-time importance scoring (P3)

2. Token-budgeted, seed-aware repo-map tool (P4)

3. Memory-ceiling backpressure

4. MCP parameter hardening

5. Pre-existing test-suite fixes (hardening pass)

Known issue not addressed here (deliberately)

Uh oh!

DeusData commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants