feat(index): report per-file indexing failures via skipped[] + logfile (Stage 2/B2-B4) #785
Merged
index_repository silently dropped files that failed to index: CBMFileResult.has_error
was set but never read, oversized files (>100 MB) were dropped with no signal, and
read/extract failures only bumped a logged-only counter — a file that couldn't be
indexed just vanished from the graph.
Collect and surface them:
- New retained {path, reason, phase} error list on struct cbm_pipeline (mirrors the
excluded_dirs pattern) + accessor + a back-pointer on cbm_pipeline_ctx_t so both
extraction paths can append. Wired in BOTH the sequential (pass_definitions) and
parallel (per-worker, merged lock-free) paths — small repos take the sequential
path, so wiring only one would leave the guard vacuous.
- Feeds: read-fail, extract-fail, the newly-CONSUMED has_error (parse timeout /
parse failed, with error_msg), and oversized. The cross_lsp phase is reserved for
the crash supervisor (Track C) and not fed here (the cross-LSP passes are
best-effort with no failure return; feeding the no-source case would be false
positives).
- MCP/CLI response gains top-level "skipped_count" (0 on clean) and, when >0, a
capped "skipped":{files[<=50],count,truncated} + "logfile". Status stays "indexed"
— a reported skip is the expected, handled outcome, not a degradation.
- Per-run logfile (full uncapped list) written ONLY when >=1 file is skipped:
$CBM_INDEX_LOG override else <cache_dir>/logs/<project>-<epoch>.log.
- Generous env-configurable caps (src/foundation/limits.c): CBM_MAX_FILE_BYTES,
default raised 100 MB -> 512 MiB; over-cap files are REPORTED (phase oversized) +
WARNed, never silently dropped.
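
The collection scheme described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this change: every name (`skip_entry_t`, `skip_list_t`, `skip_list_append`, `skip_list_merge`, `skip_list_reported`, the `SKIP_*` phase values) is hypothetical, and it assumes the "lock-free" merge works because each worker appends to a private list that is folded into the pipeline's list only after the workers have finished.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical phase names; the real set lives in the pipeline code. */
typedef enum { SKIP_READ, SKIP_EXTRACT, SKIP_PARSE, SKIP_OVERSIZED } skip_phase_t;

typedef struct {
    char *path;   /* owned copy */
    char *reason; /* owned copy */
    skip_phase_t phase;
} skip_entry_t;

typedef struct {
    skip_entry_t *items;
    size_t count;
    size_t cap;
} skip_list_t;

#define SKIP_REPORT_MAX 50 /* response cap taken from the description: files[<=50] */

/* Copy a string into freshly allocated memory; NULL on allocation failure. */
static char *dup_str(const char *s) {
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p) memcpy(p, s, n);
    return p;
}

/* Append one {path, reason, phase} record. Returns 0 on success, -1 on OOM. */
static int skip_list_append(skip_list_t *l, const char *path,
                            const char *reason, skip_phase_t phase) {
    assert(l && path && reason);
    if (l->count == l->cap) {
        size_t ncap = l->cap ? l->cap * 2 : 8;
        skip_entry_t *n = realloc(l->items, ncap * sizeof *n);
        if (!n) return -1;
        l->items = n;
        l->cap = ncap;
    }
    char *p = dup_str(path);
    char *r = dup_str(reason);
    if (!p || !r) { free(p); free(r); return -1; }
    l->items[l->count++] = (skip_entry_t){ p, r, phase };
    return 0;
}

/* Move every entry of a worker-local list into dst and leave src empty.
 * Intended to run on one thread after the workers are done, so no lock
 * is taken. Entry ownership transfers to dst. */
static int skip_list_merge(skip_list_t *dst, skip_list_t *src) {
    assert(dst && src);
    size_t need = dst->count + src->count;
    if (need > dst->cap) {
        skip_entry_t *n = realloc(dst->items, need * sizeof *n);
        if (!n) return -1;
        dst->items = n;
        dst->cap = need;
    }
    if (src->count > 0)
        memcpy(dst->items + dst->count, src->items,
               src->count * sizeof *src->items);
    dst->count = need;
    free(src->items);
    *src = (skip_list_t){0};
    return 0;
}

/* Number of entries to put in the capped response; sets *truncated to 1
 * when the full list is longer than the cap. */
static size_t skip_list_reported(const skip_list_t *l, int *truncated) {
    *truncated = l->count > SKIP_REPORT_MAX;
    return *truncated ? (size_t)SKIP_REPORT_MAX : l->count;
}

/* Release all entries and reset the list. */
static void skip_list_free(skip_list_t *l) {
    for (size_t i = 0; i < l->count; i++) {
        free(l->items[i].path);
        free(l->items[i].reason);
    }
    free(l->items);
    *l = (skip_list_t){0};
}
```

In this sketch the sequential path would append straight to the pipeline's list, while the parallel path would give each worker its own `skip_list_t` and call `skip_list_merge` once per worker. The uncapped list is what a logfile would receive; only the response is limited by `skip_list_reported`.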
Reproduce-first: tests/test_index_resilience.c (gating) — an oversized file (cap
lowered via env) must appear in skipped[] with the 2 good files still indexed and a
logfile written; a clean run has skipped_count 0 and no logfile. Genuine guard:
no-op'ing the recorder makes the oversized file silently vanish (RED). Full suite
5750/0, no ASan/UBSan.
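
The test lowers the size cap through the environment. One way such a cap could be read is sketched below. This is an assumption-laden illustration, not the contents of src/foundation/limits.c: the helper names `parse_file_cap`, `max_file_bytes` and `exceeds_cap` are hypothetical, the value is assumed to be a plain decimal byte count, and invalid or zero values are assumed to fall back to the 512 MiB default.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DEFAULT_MAX_FILE_BYTES (512ULL * 1024 * 1024) /* 512 MiB, per the description */

/* Parse a decimal byte count. Anything that is missing, non-numeric,
 * negative, zero, or out of range yields the default (an assumption of
 * this sketch, not confirmed behavior). */
static uint64_t parse_file_cap(const char *s) {
    if (!s || *s < '0' || *s > '9') return DEFAULT_MAX_FILE_BYTES;
    errno = 0;
    char *end = NULL;
    unsigned long long v = strtoull(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0' || v == 0)
        return DEFAULT_MAX_FILE_BYTES;
    return (uint64_t)v;
}

/* Effective cap: CBM_MAX_FILE_BYTES if set and valid, else the default. */
static uint64_t max_file_bytes(void) {
    return parse_file_cap(getenv("CBM_MAX_FILE_BYTES"));
}

/* A file is over the cap only when strictly larger than it (assumed). */
static int exceeds_cap(uint64_t file_size, uint64_t cap) {
    return file_size > cap;
}
```

A caller would compare each file's size against `max_file_bytes()` and, when `exceeds_cap` is true, record the file with the oversized phase instead of dropping it.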
Part of the resilient-indexing effort (Track B surfacing layer). Refs #668.
Signed-off-by: Martin Vogel <martin.vogel.tech@gmail.com>