Skip to content

PCP fork improvements: index-time importance ranking, token-budgeted repo-map tool, memory-ceiling backpressure, MCP param hardening, and pre-existing test-suite fixes#776

Open
petercoxphoto wants to merge 15 commits into
DeusData:mainfrom
petercoxphoto:main
Open

PCP fork improvements: index-time importance ranking, token-budgeted repo-map tool, memory-ceiling backpressure, MCP param hardening, and pre-existing test-suite fixes#776
petercoxphoto wants to merge 15 commits into
DeusData:mainfrom
petercoxphoto:main

Conversation

@petercoxphoto

Copy link
Copy Markdown
Contributor

This PR contributes a batch of improvements developed on the petercoxphoto fork. They fall into four independent areas plus a test-suite hardening pass; each was built test-first and reviewed. Happy to split into separate PRs if you'd prefer to take them piecemeal.

1. Index-time importance scoring (P3)

Computes and persists a per-symbol importance score at index time, so ranking is available without a query-time pass. Includes a canonical cbm_is_test_path classifier so real-corpus test files are penalised consistently in ranking.

2. Token-budgeted, seed-aware repo-map tool (P4)

A new repo_map MCP tool returning a ranked, token-bounded structural map of a repository, optionally seeded by anchor files/symbols (for orientation use-cases where you want the most relevant slice within a fixed token budget).

3. Memory-ceiling backpressure

Enforces an RSS memory ceiling with a per-file parse cap so a large repo can't balloon the indexer's resident set unbounded (we saw a full index reach ~15 GB before this). Also fixes cbm_mem_rss() under-counting on Linux, which had been blinding the ceiling/backpressure logic.

4. MCP parameter hardening

search_code and the project-reading tools now distinguish a missing required parameter from an unknown project (previously both returned a misleading "project not found"), and accept project_id as an alias for project.

5. Pre-existing test-suite fixes (hardening pass)

Fixes three latent defects that made the default test suite unreliable:

  • A real heap-use-after-free in the parallel indexer (not a test bug): extract_worker's final teardown called cbm_slab_destroy_thread() assuming the thread parser was already destroyed. A worker reaching that point with a live cached parser had its lexer's included_ranges slab page freed underneath it, causing a use-after-free on the next ts_lexer_goto. Fixed by destroying the thread parser unconditionally before the slab teardown, mirroring the two already-correct sites (whose comments document the exact invariant). Reproduced ~3 runs in 4 before the fix; clean across 20 consecutive runs after.
  • LeakSanitizer leaks from unfreed cbm_node_t fields in a test, which had forced an LSAN_OPTIONS=exitcode=0 workaround to keep the suite green.
  • An unguarded divide-by-zero (SIGFPE on a starved store) and a setenv/HOME cleanup leak on assertion failure that could poison later tests.

After this pass the full suite exits 0 under default sanitizer options (ASan + UBSan) with no leak-suppression workaround.

Known issue not addressed here (deliberately)

A separate, pre-existing flaky assertion in tests/test_convergence_probe.c (an LSP convergence-probe timing assertion) still fails intermittently — roughly 1 run in 10. It is memory-clean and unrelated to the fixes above; we left it untouched rather than fold an unrelated change into this PR.

petercoxphoto and others added 13 commits July 1, 2026 16:03
AC1 tests (tests/test_importance.c) + suite wiring + a no-op pass stub
(pass_importance.c) registered as the 7th predump pass in run_predump_passes
(after all edge-producing passes). Tests fail against the stub (no importance
key persisted) — the red-test boundary. Formula lands in the impl commit.

Red evidence: full ASan test-runner run captured at $ARTIFACT_DIR/red-run.log —
all 6 importance tests FAIL (test_importance.c:91/167/215/279/331/398). The
pre-existing lang_contract ASan abort (vendored ts_runtime lexer UAF) is a
baseline flake unrelated to this change.

Boundary committed by the orchestrator (recovery: the builder agent authored
these files but died on an API overload before committing).
pass_importance.c: weighted-degree Aider model persisted per Function/Method/
Class node as a numeric "importance" key on properties_json (append pattern
from pass_complexity.c). importance = sqrt(num_refs) x priv x generic x
distinct x test_penalty:
  num_refs = incoming CALLS + USAGE edges
  priv     x0.1 leading-underscore
  generic  x0.1 name defined in >=5 distinct files
  distinct x10  snake_case|camelCase and len>=8
  test_penalty x0.1 edge-based (TARGET of a TESTS edge, or file is the SOURCE
                    of a TESTS_FILE edge) -- replaces the filename strstr; a
                    'test'-substring path with no test edge is NOT penalised.
Weighted-degree only; PageRank deferred (measured decision, builder-notes).
All 10 fixture importance tests pass. Does not touch compute_search_score.
Real-data verification (full connectors re-index) showed the spec's edge-only
TESTS/TESTS_FILE penalty does NOT achieve AC1's exclusion goal: TESTS_FILE has
only 15 edges (needs a test-file->prod-file mapping), TESTS never targets a
test-resident symbol, and node is_test is stamped on <2% of nodes and none of
the test fixtures/doubles. Edge-only left 33/40 of the top ranked symbols as
test scaffolding (session_dir, FakeConnectorClient, ...).

Fix: penalise when the graph's OWN canonical classifier cbm_is_test_path()
(the one pass_tests uses: test_ prefix, _test.<ext>, /tests/, ...) flags the
file, OR the symbol is the target of an incoming TESTS edge (prod helpers in
non-test files). This is NOT the naive strstr the spec warned against — it
leaves src/testutil_helpers.go unpenalised (test-plan DeusData#6 inversion still green).

Result after re-index: top-40 = 0 test-file symbols (was 33); named exclusions
_run/now/run_hook/client and make_proposals/make_golden/seed_recording all rank
>100 (well outside top-40); top is real domain symbols. All 10 unit tests pass.
Deviation from the spec's stated mechanism — surfaced to Peter.
26 failing AC tests derived from the P4 test plan (rows 1-5, 8-13):
registration/dispatch, token-budget fit (default/tight/tiny/huge/invalid),
seed-boost ranking + inversion, weak-seed widen walk, empty/unresolvable-seed
global fallback, score-absence gating, graceful input errors, signature-level
rendering, determinism, cross-project isolation, repeated-call stability.

All 26 verified FAILING against unmodified code (unknown tool). Five tests
carry explicit positive discriminators so the dispatch-level error cannot
trivially satisfy them. Rows 6-7 (real corpus, latency) are scripted post-
implementation against a worktree binary + scratch CBM_CACHE_DIR.

Pre-existing failure (not this change): test_incremental.c:302 RSS threshold
(2854-2872MB vs 2048 cap under ASan), file last touched pre-base (eed87fd).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KK6YFYGwhT69NjYevuEUCr
New MCP tool repo_map(project, seed_anchors[], token_budget) reading P3's
persisted importance scores (spec wire contract frozen for P5).

- Two first-class modes (P2 finding): seed-boost (resolved anchors + 1-hop
  CALLS/USAGE neighbourhood x50) with global importance ranking as the
  no-seed/unresolvable fallback. Anchor resolution: exact name -> qualified
  name -> file path.
- Weak-seed widen (P2 shape hints): 1-hop symbol neighbourhood <= 3 ->
  file-of-symbol seeding + 2-hop expansion at x25.
- Module/File 1-hop neighbours (file-level co-usage USAGE edges) expand to
  their file's symbols at x25 — this is how P2's ground-truth co-change
  neighbourhood (sender.py/gmail_client.py -> extract_address) travels.
- Budget: chars/4 estimator (no server-side LLM tokenizer), default 1600
  tokens, greedy maximal prefix over the ranked lines; token_budget <= 0 is
  an input error.
- Determinism: effective score DESC, qualified_name ASC (total order).
- Score-absence gate (spec AC7): symbol nodes with zero importance coverage
  -> explicit 'unscored' error advising re-index; partial coverage proceeds
  (NULLs rank last).
- Rendering: 'file: symbol(sig)' one line per symbol; param-only signatures
  get the name prefixed; whitespace/newlines flattened; no bodies.
- Store layer: cbm_store_importance_coverage + cbm_store_top_symbols_by_
  importance (json_extract ORDER BY, project-filtered).

Verified on real corpus (worktree binary, scratch CBM_CACHE_DIR):
budget fit on connectors + codebase-memory-mcp in both modes; zero
P3-pinned noise/test-leakage names in top-40; P2 seeded ground truth
surfaces 28 sender/gmail lines from rank 3; AC6 net latency 52ms global /
55ms seeded on connectors (bound 500ms). 28/28 AC tests green; suite
5651 PASS with only the pre-existing test_incremental RSS failure.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KK6YFYGwhT69NjYevuEUCr
AC tests for the per-file parse-size cap constant (CBM_MAX_FILE_MB) and the
enforcing RSS memory ceiling, derived from test-plan.md rows 1-11. Fails to
compile against unmodified HEAD (cbm_max_file_bytes/cbm_mem_ceiling/
cbm_mem_over_ceiling/CBM_DEFAULT_MAX_FILE_MB don't exist yet) —
-Werror=implicit-function-declaration, confirmed on IO (make -f
Makefile.cbm test, DONE_EXIT=2). Rows 5 and 11 (CBM_PERCENT / backpressure
characterization pins) and row 7 (healthy-path regression pin) are
behavior-preserving by design and exempt from the red-must-fail rule.

- tests/test_mem.c: 19 new C unit tests (file-cap clamp/env-override/
  boundary via a real cbm_parallel_extract run; ceiling clamp/floor/
  env-override/live-RSS comparison).
- tests/check_no_duplicate_file_cap.sh: structural grep against COMMITTED
  source (git show HEAD:) for row 4 (single-source-of-truth) + row 5 pin +
  row 11 pin.
- tests/gen_mem_ceiling_repro.sh: synthetic large-file repro generator
  (committed for provenance — none existed in done-data).
- tests/mem_ceiling_abort.sh: IO-only real-process harness driving the
  compiled binary (cli index_repository) for rows 6, 7, 9, 10, and the R4
  store-integrity check.
Two independent fixes for the 2026-07-01 memory-balloon diagnosis
(pai/diagnose-codebase-memory), both design decisions already settled by
Peter (handover pai/handover-cbm-balloon):

1. Per-file parse cap: collapsed 7 duplicated "100 MB" cap sites (6x the
   misused CBM_PERCENT * CBM_SZ_1K * CBM_SZ_1K, 1x the PXC_MAX_FILE_BYTES_FACTOR
   twin in pass_lsp_cross.c) into a single named, env-overridable resolver
   cbm_max_file_bytes() (system_info.c, declared platform.h), mirroring
   cbm_default_worker_count()'s CBM_WORKERS clamp shape exactly. Default
   CBM_DEFAULT_MAX_FILE_MB=10 (constants.h) clears sqlite3.c's ~8MB
   amalgamation; env override CBM_MAX_FILE_MB clamped to [1, 1024] MB,
   invalid/blank/unset all fall back to the default (never coerce to 0).
   CBM_PERCENT itself (constants.h:48) is UNTOUCHED — its 5 legitimate
   percentage/depth consumers (mem.c:92,102; vmem.c:78,90; cypher.c:2838)
   still reference it directly. Oversized-file semantics unchanged: SKIP
   (read_file returns NULL), not abort.

2. Enforcing RSS memory ceiling: layered a hard-abort verdict on top of the
   EXISTING RSS/budget machinery (cbm_mem_rss/cbm_mem_budget/
   cbm_mem_over_budget — no new RSS probe needed). New mem.h/mem.c API:
   cbm_mem_ceiling() (fraction of cgroup-aware detected RAM, floored,
   env-overridable via CBM_MEM_CEILING_MB), cbm_mem_over_ceiling(), and
   cbm_mem_abort_if_over_ceiling(file, phase) — logs a structured
   diagnostic (file, phase, rss_mb, ceiling_mb) then abort()s (SIGABRT).
   Wired into pass_parallel.c's extract_worker loop immediately after the
   existing advisory backpressure spin (pass_parallel.c:559-567) exhausts
   its bounded spins, so the hard check only fires once the advisory path
   has already tried and failed. Ceiling is always strictly above the
   advisory budget so a soft-overshoot repo never reaches abort.

R4 (store-integrity): traced pipeline.c — cbm_parallel_extract() (where the
abort is wired) always returns before dump_and_persist_hashes()/
cbm_gbuf_dump_to_sqlite() (the only SQLite write) is reached, so a hard
abort here can never leave a half-written store.

Not touched: the CBM_WORKERS=8 stop-gap in ~/.claude.json (Phase-5
orchestrator step, out of scope for this commit).
The row-5 characterization pin in check_no_duplicate_file_cap.sh asserted
CBM_PERCENT at exact line numbers (mem.c:92,102). The implementation
commit legitimately added new ceiling constants/functions earlier in
mem.c, shifting those references to lines 104/114 without touching their
content — an exact-line pin false-positived on this harmless drift.
Switched to a per-file minimum-occurrence-count check, which still catches
the real regression this row guards (the collapse repurposing or deleting
a legitimate CBM_PERCENT use) without being brittle to unrelated line
shifts elsewhere in the same file.
…ssure

Discovered while calibrating tests/mem_ceiling_abort.sh: a synthetic
large-file index genuinely reached ~2.2-2.75GB RSS (confirmed via ps and
mi_process_info's own peak_rss field, which read correctly), yet
cbm_mem_rss() reported ~4MB at the exact same moments, so neither the new
hard ceiling nor the PRE-EXISTING advisory budget/backpressure
(pass_parallel.c:559, cbm_mem_over_budget) ever saw real memory pressure
during concurrent large-file parsing.

Root cause: vendored/mimalloc/src/prim/unix/prim.c's
_mi_prim_process_info() never sets pinfo->current_rss on Linux — only
peak_rss (via getrusage's ru_maxrss, a monotonic high-water mark).
current_rss silently falls back to mimalloc's own internal committed-page
counter, which this project deliberately tunes low
(mi_option_arena_eager_commit=0 + purge_decommits=1 + purge_delay=0, set
in cbm_mem_init to reduce upfront memory) — so on Linux, "current RSS" was
never actually current RSS, it was a low-biased mimalloc-internal metric
that happened to be nonzero often enough to defeat the existing
`current_rss > 0` ASan-only fallback guard.

Fix: on Linux, cbm_mem_rss() now prefers os_rss() (/proc/self/statm,
already implemented, unaffected by mimalloc's internal accounting) as the
PRIMARY source, falling back to mimalloc's value only if /proc is
unavailable. macOS/Windows unaffected (mimalloc's primitives set
current_rss correctly there via task_info/GetProcessMemoryInfo).

Verified via a calibrated 40x8MB synthetic large-file index on IO:
before the fix, a 2100MB forced ceiling never tripped despite RSS
reaching 2.75GB; after the fix, the same fixture aborts cleanly at
rss_mb=2135 against ceiling_mb=2100 with the expected diagnostic dump.
Full ASan test suite re-run clean (0 FAIL) after the fix — the
pre-existing test_incremental.c:302 RSS-delta flake (a live FastAPI clone
that had grown past its 2048MB assumption) did not recur, consistent with
more accurate RSS reporting also improving the existing backpressure's
effectiveness.

tests/mem_ceiling_abort.sh: fixed two harness bugs found while
calibrating this — (1) `mode=full` index_repository does not force a
re-parse when file hashes are unchanged from a prior run against the same
repo_path (routes to incremental.noop), so every run needing a genuine
full extraction now resets the target's cached .db first; (2) the forced
test ceiling is now calibrated against a directly-probed peak RSS for the
exact fixture shape (2100MB, between the 2048MB floor and an observed
~2.2-2.75GB peak) instead of an unverified guess.
…oject_id alias)

AC tests from the pai/codebase-memory-search test plan, committed FAILING
against 132460f (red-test boundary). 8 red tests / 11 FAIL assertions, all in
tests/test_mcp.c; rows 5-7 are characterization pins that pass by design.
Row-4 precedence also passes at red by design: pre-fix 'alias ignored' is
indistinguishable from 'project wins' — the discriminating red for the alias
is the row-3 pair.

Full-suite run at this boundary: red-test-run.log in the /develop run dir
(11 FAILs, every non-mcp suite green incl. incremental).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…t_id alias

Root cause of the misleading error (pai/codebase-memory-search): a missing
'project' param and an unknown project both reached REQUIRE_STORE's
'project not found or not indexed' + available_projects error, so a caller
using the wrong param name (project_id) was told the project they named -
visible in the very list printed - was not found.

- New get_project_arg() helper: reads 'project', silently accepts
  'project_id' as an alias when 'project' is absent/empty/non-string;
  a present 'project' always wins. Alias is NOT advertised in the TOOLS[]
  schemas (upstream param name preserved; error text teaches the canonical
  name).
- All 12 project-reading handlers (search_graph, query_graph, trace_path,
  get_code_snippet, get_graph_schema, get_architecture, repo_map,
  search_code, delete_project, index_status, detect_changes, manage_adr)
  now return plain 'project is required' for a missing/empty/non-string
  project BEFORE any store lookup; the available_projects shape is
  reserved for a present-but-unknown value.
- detect_changes/manage_adr not-found errors upgraded from bare
  'project not found' to the standard build_project_list_error shape.
- search_code's missing-param error changed from the half-conflated
  build_project_list_error('project is required') to the plain message.
- ingest_traces stub untouched (never reads project); index_repository/
  list_projects take no project param.
- Param-check order preserved: query/function_name/pattern/qualified_name
  are still reported before project (pinned by tests).

Tests: 5682 passed / 0 failed under ASan+UBSan (see run-dir
test-results.txt; red boundary 353d7bf).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… unguarded div, env-poison, regex leak

Four defects + one hygiene fix, all pre-existing at base 132460f (surfaced during the
pai/codebase-memory-search /develop run 2026-07-02). Makes the default suite exit 0
LSan-clean with NO LSAN_OPTIONS=exitcode=0 workaround.

1. Production heap-UAF/bad-free (Layer 2): extract_worker's final cleanup called
   cbm_slab_destroy_thread() while a 0-file / leftover-cached tl_parser was still live,
   freeing the slab page backing its lexer.included_ranges -> UAF on next ts_lexer_goto
   or bad-free at ts_parser_delete. Add an unconditional (idempotent, NULL-safe)
   cbm_destroy_thread_parser() before cbm_slab_destroy_thread() at pass_parallel.c:694,
   mirroring the correct loop-body (682) and k8s-pass (2325) sites. Fix the stale comment.

2. LSan leaks (test_importance.c): free the caller-owned cbm_node_t fields strdup'd by
   cbm_store_find_node_by_qn at the 3 call sites (cbm_node_free_fields on out/out1/out2).

3a. Unguarded division (test_incremental.c): incr_db_deleted_recovery divided node counts
    by nodes_before, which is 0 for a starved store -> integer SIGFPE. Extract a testable
    compute_recovery_diff_pct() helper that refuses to divide on a <=0 baseline (FAIL
    loudly); add a deterministic network-free guard test for the 0/negative path.

3b. setenv-leak / HOME-poison on ASSERT failure (test_ui.c): an ASSERT between
    setenv(HOME,tmp) and the restore skipped cleanup on failure, leaving HOME pointing at
    an rmtree'd dir (poisons later suites) and leaking old_home. Add an ASSERT_HOME macro
    that restores HOME and frees old_home before its early return. (test_userconfig.c was
    checked and left unchanged — it restores env before asserting; old_appdata is a stack
    buffer, so no poison window and no leak.)

4. Regex leak (mcp.c, observation): handle_search_code compiled path_regex before its
   required-param early returns without freeing it; add cbm_regfree on those three paths.

Verification (ASan+UBSan build): full suite exits 0, 5683 passed, 0 leaks; 10/10 bare
runs memory-clean (no UAF/bad-free). NULL-safety of cbm_destroy_thread_parser confirmed
at cbm.c:387.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… paths

The cycle-1 audit found the regex-leak fix (defect 4) covered only the
missing-param early returns; the write_pattern_file-failure (mcp.c:3813) and
cbm_popen-failure (mcp.c:3845) returns still leaked the compiled path_regex
when a path_filter was supplied. Add the same has_path_filter-guarded
cbm_regfree used at the other early returns. Rebuild + full suite green
(SUITE_EXIT=0, 5683 passed, 0 leaks/UAF under default sanitizer options).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@petercoxphoto petercoxphoto requested a review from DeusData as a code owner July 2, 2026 15:53
@DeusData

DeusData commented Jul 2, 2026

Copy link
Copy Markdown
Owner

Hey @petercoxphoto, thx will check

petercoxphoto and others added 2 commits July 2, 2026 23:06
AC tests (test-plan rows 1-6, 8-11) for two additive index observability
fields, committed FAILING before implementation:

- repo_map seeds_unresolved[]: mixed/all-unresolved/duplicate cases assert
  the array lists exactly the anchor names rm_resolve_anchor rejected;
  all-resolved / zero-requested / malformed-coerced cases assert the key
  is omitted (pinned shape).
- index_status indexed_at: new success-path coverage (ready + empty
  project) asserting an ISO-8601 UTC string field alongside nodes/edges/
  status.

Discriminating new-behavior assertions fail red (field absent -> NULL /
rm_json_str_arr returns -1). ASan/UBSan leak reports at the tail are the
expected early-return-skips-cleanup symptom of the failing ASSERTs; they
clear when the tests go green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WuawHYdivB84yiHwpQWWZ2
Two additive JSON response fields closing the repo-map observability gaps
found on injection's first live day (2026-07-02).

handle_repo_map: capture the anchor names rm_resolve_anchor rejects
(ownership transferred out of anchors[] in the resolution loop instead of
the unconditional free) and emit them as `seeds_unresolved`. Copied into
the doc via yyjson_mut_arr_add_strcpy so the retained names are freed
immediately after the response is serialized; also freed on the
importance-ranking-query error path between the loop and the JSON build.
Omitted when every seed resolved or none were requested. The unscored
gate and all other early returns sit BEFORE the loop, so no other leak
path exists (verified by grep of returns in [loop, JSON-build)).

handle_index_status: first cbm_store_get_project call in this handler;
emit proj.indexed_at (ISO-8601, the persisted column). Added no-copy via
yyjson_mut_obj_add_str, so cbm_project_free_fields(&proj) runs strictly
after yy_doc_to_str(doc) — matching the existing `project` heap-string
ordering in the same function.

Both additive; existing fields unchanged. Full suite green under
ASan/UBSan/LeakSanitizer: 5686 passed, no ASan/LSan/UAF (vendored
tree-sitter grammar UBSan findings are pre-existing, unrelated).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WuawHYdivB84yiHwpQWWZ2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants