Nondeterministic Java USAGE-edge misattribution since #667 — edges attach to wrong source files on every fresh index #787

Description

@spde

Summary

Since the merge of #667, fresh indexing attaches Java USAGE edges to the wrong source files, nondeterministically — every fresh index of the same unchanged repo produces a different set of incorrect edges. Queries then return "users" that contain no reference to the target symbol at all, while real users are missing. Not present in v0.8.1 or at the #528 merge commit; present from the #667 merge through current HEAD (dcf98dc). This is not in any released binary yet — flagging it now so it doesn't ship in the next release.

Repro (spring-petclinic, macOS arm64, source builds via scripts/build.sh)

git clone --depth 1 https://github.com/spring-projects/spring-petclinic.git
# ground truth: 7 files reference OwnerRepository besides its own definition
grep -rl "OwnerRepository" spring-petclinic/src --include="*.java"

Index the repo (stdio MCP index_repository, fresh CBM_CACHE_DIR each run), then:

MATCH (c {name: 'OwnerRepository'})<-[r:USAGE]-(m) RETURN m.file_path

HEAD (dcf98dc), 5 fresh indexes — 5 different wrong answers:

| run | bogus "users" returned | real users missing |
| --- | --- | --- |
| 0 | EntityUtils.java, PetValidatorTests.java | ClinicServiceTests, OwnerControllerTests, PetControllerTests, VisitControllerTests |
| 1 | PetType.java | OwnerControllerTests, PetControllerTests |
| 2 | PetValidator.java | OwnerControllerTests, PetControllerTests |
| 3 | EntityUtils.java, OwnerRepository.java, PetTypeFormatterTests.java | ClinicServiceTests, OwnerControllerTests, PetControllerTests, VisitControllerTests |
| 4 | PetTypeFormatter.java, PetValidatorTests.java | OwnerControllerTests, PetControllerTests, VisitControllerTests |

None of the bogus files contain the string OwnerRepository.
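
For anyone who wants to repeat the comparison, here is a minimal sketch of how the "bogus" and "missing" columns can be derived from the grep ground truth. The helper names (`grep_users`, `classify`) are hypothetical, and it assumes the `m.file_path` values from the query have already been collected into a list and can be compared by file name alone (fine for spring-petclinic, not necessarily for repos with duplicate file names).

```python
from pathlib import Path


def grep_users(src_root, symbol, definition_name):
    """Return names of .java files under src_root that mention symbol.

    The file that defines the symbol is skipped, mirroring the
    "besides its own definition" ground truth in the repro.
    """
    users = set()
    for path in Path(src_root).rglob("*.java"):
        if path.name == definition_name:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        if symbol in text:
            users.add(path.name)
    return users


def classify(returned_paths, real_users):
    """Split a query result into (bogus, missing) relative to real_users.

    Assumption: comparing by file name is enough to identify a file.
    """
    returned = {Path(p).name for p in returned_paths}
    bogus = sorted(returned - real_users)    # returned, but no reference in the file
    missing = sorted(real_users - returned)  # real reference, but not returned
    return bogus, missing
```

Calling `classify(paths_from_query, grep_users("spring-petclinic/src", "OwnerRepository", "OwnerRepository.java"))` once per fresh index gives one row of the table; a correct index yields two empty lists.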

Controls (same machine, same build script, same query): v0.8.1 tag and the #528 merge commit (be3e038) each return exactly the 7 real users with zero bogus entries, deterministically (2/2 runs each).

On a larger private Spring Boot service (~1.3k Java files, 142 @Service/@Component beans) the effect is severe: v0.8.1 resolves 100% of grep-verified bean users (757/757); HEAD drops to ~16% with the remainder misattributed.

Bisect

git bisect --first-parent between be3e038 (good) and dcf98dc (bad), each step = full build + 3 fresh-index probes:

Notes on the suspected area

#667 switches Java/Go module-QN derivation to directory-based (cbm_pipeline_fqn_module_dir; pu_module_is_dir / pp_module_is_dir / pxc_module_is_dir copies that the comments say MUST match cbm_lang_module_is_dir). With Java packages, many files share a directory-derived module QN, so if USAGE source resolution keys on module QN it can pick an arbitrary same-package file — consistent with the observed behavior (misattributed sources; varies run to run; still wrong, though less so, with CBM_WORKERS=1).
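
To make the suspected failure mode concrete, here is a toy illustration in Python. It is not the project's code: `module_qn_from_dir` and `build_source_index` are hypothetical stand-ins, and the sketch only assumes what the paragraph above hypothesizes, namely that a lookup from module QN to source file exists somewhere in USAGE resolution.

```python
from pathlib import PurePosixPath


def module_qn_from_dir(file_path):
    """Hypothetical directory-based module QN: a/b/C.java -> "a.b"."""
    return ".".join(PurePosixPath(file_path).parent.parts)


def build_source_index(files):
    """Map module QN -> source file.

    With a directory-derived QN, every file in a Java package shares
    one key, so each insert overwrites the previous one and the
    surviving file depends purely on processing order.
    """
    index = {}
    for f in files:
        index[module_qn_from_dir(f)] = f
    return index


files = [
    "org/example/owner/OwnerController.java",
    "org/example/owner/PetType.java",
    "org/example/owner/PetValidator.java",
]
qn = module_qn_from_dir(files[0])

# Same inputs, different processing order, different "source file".
winner_forward = build_source_index(files)[qn]
winner_reversed = build_source_index(reversed(files))[qn]
```

In this toy, any ordering that varies between runs (for example, parallel workers finishing in different order) changes which same-package file wins. That would match the run-to-run variation, though it does not by itself explain why results are still wrong with CBM_WORKERS=1.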

Environment: macOS 26.5 arm64 (Darwin 25.5.0), Apple clang, plain scripts/build.sh.
