
feat(semantic): [merge candidate build] FTS5 index and search tools, provider-aware typed embeddings, reranking, diagnostics, and eval harness #87

Open
Zireael wants to merge 130 commits into cortexkit:main from Zireael:semantic-search-enhancement

Zireael (Contributor) commented Jun 2, 2026

Summary

Semantic search in AFT moves from a minimal embedding-and-cosine prototype to a provider-capability-aware retrieval subsystem with typed vectors, optional reranking, background lifecycle management, diagnostics, and evaluation tooling. This is a public preview: the feature is functional and tested (~93 new tests), but it will likely need iteration based on real-world feedback.

The FTS5 index will also make it possible to introduce more advanced symbol operations.

What changed

The upgrade touches the full semantic pipeline — config, indexing, retrieval, diagnostics, and observability — without breaking the default fastembed experience.

Typed vector representations

Vectors are no longer opaque f32 blobs. Every stored vector carries explicit type metadata (DenseF32, Int8SourceDecoded, BinaryPacked) and is paired with its source kind so the correct distance metric is selected automatically. Binary packed vectors use Hamming search (native bitwise XOR + popcount) instead of cosine, which is both faster and semantically correct for quantized embeddings. This unlocks Perplexity's base64_binary and base64_int8 output modes alongside standard dense providers.
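
As a minimal sketch of the Hamming path described above, bit-packed vectors can be compared with XOR plus popcount. The function names `hamming_distance` and `hamming_similarity`, and the normalization to a 0..1 similarity, are assumptions for illustration and are not taken from the PR's `vector_store.rs`.

```rust
/// Hamming distance between two bit-packed vectors of equal byte length.
/// Returns None when the lengths differ (dimension mismatch).
fn hamming_distance(a: &[u8], b: &[u8]) -> Option<u32> {
    if a.len() != b.len() {
        return None;
    }
    // XOR leaves a 1 bit wherever the inputs differ; count_ones is the popcount.
    Some(a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum())
}

/// Map a distance to a similarity in [0.0, 1.0]; 1.0 means identical bits.
/// (Assumed normalization, chosen only so higher scores mean "closer".)
fn hamming_similarity(a: &[u8], b: &[u8]) -> Option<f32> {
    let bits = (a.len() * 8) as f32;
    if bits == 0.0 {
        return None;
    }
    hamming_distance(a, b).map(|d| 1.0 - d as f32 / bits)
}
```

Returning `None` on a length mismatch keeps dimension errors explicit rather than silently comparing a truncated prefix.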

Provider capability profiles

Each embedding backend (fastembed, OpenAI-compatible, Ollama, Perplexity) declares what it supports: output encoding, distance metric, dimension range, max batch size. The config layer validates combinations at configure time — you cannot accidentally request binary vectors through a cosine-only provider. Profiles also carry fingerprint fields so switching providers triggers a clean index rebuild rather than silent corruption.
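
The configure-time check described above could look roughly like the following. This is a sketch under stated assumptions: `ProviderProfile`, `Request`, `Encoding`, `Metric`, and `validate` are hypothetical names, and the real profile type in `config.rs` carries more fields (for example batch size and fingerprint data).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Encoding {
    Float,
    Base64Int8,
    Base64Binary,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Metric {
    Cosine,
    Hamming,
}

/// What a backend declares it can do (hypothetical shape).
struct ProviderProfile {
    name: &'static str,
    encodings: &'static [Encoding],
    metrics: &'static [Metric],
    dimensions: std::ops::RangeInclusive<usize>,
}

/// What the user's config asks for (hypothetical shape).
struct Request {
    encoding: Encoding,
    metric: Metric,
    dimension: usize,
}

/// Reject any combination the provider did not declare support for.
fn validate(profile: &ProviderProfile, req: &Request) -> Result<(), String> {
    if !profile.encodings.contains(&req.encoding) {
        return Err(format!("{} does not support {:?} output", profile.name, req.encoding));
    }
    if !profile.metrics.contains(&req.metric) {
        return Err(format!("{} does not support {:?} distance", profile.name, req.metric));
    }
    if !profile.dimensions.contains(&req.dimension) {
        return Err(format!("{} does not support dimension {}", profile.name, req.dimension));
    }
    Ok(())
}
```

With a profile that lists only `Encoding::Float` and `Metric::Cosine`, a request for `Base64Binary` with `Hamming` fails before any embedding call is made.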

Fingerprint-driven index lifecycle

A SemanticIndexFingerprint captures every dimension that affects index correctness: backend, model, base_url, dimension, chunking_version, output_encoding, storage_strategy, vector kinds, normalization, and prompt hashes. diff() classifies changes as Rebuild (structural — re-embed everything), ClearQueryCache (query prompts changed — invalidate cached results only), or None. This replaces the previous "delete and hope" invalidation with precise, explainable rebuild decisions.
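
To illustrate the classification, here is a reduced sketch with only a handful of the fields listed above. The struct `Fingerprint` and the variant name `NoChange` (standing in for the "None" outcome) are assumptions; the PR's `SemanticIndexFingerprint` has more fields and may be structured differently.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
struct Fingerprint {
    backend: String,
    model: String,
    dimension: usize,
    chunking_version: u32,
    query_prompt_hash: u64,
}

#[derive(Debug, PartialEq, Eq)]
enum FingerprintDiff {
    /// Stored vectors are no longer valid; re-embed everything.
    Rebuild,
    /// Vectors are still valid; only cached query results are stale.
    ClearQueryCache,
    /// Nothing that affects index correctness changed.
    NoChange,
}

fn diff(old: &Fingerprint, new: &Fingerprint) -> FingerprintDiff {
    let structural_change = old.backend != new.backend
        || old.model != new.model
        || old.dimension != new.dimension
        || old.chunking_version != new.chunking_version;
    if structural_change {
        // A structural change wins even if the query prompt changed too.
        FingerprintDiff::Rebuild
    } else if old.query_prompt_hash != new.query_prompt_hash {
        FingerprintDiff::ClearQueryCache
    } else {
        FingerprintDiff::NoChange
    }
}
```

Checking structural fields first means a combined change (new model plus new query prompt) is reported as `Rebuild`, which subsumes the cache invalidation.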

Non-blocking cold start

Index builds run in a background thread with cooperative cancellation (SemanticCancellationToken via AtomicU64 generation counter). The build checks the generation before each embedding batch and exits early when a reconfigure arrives. Priority ordering ensures high-value files (recently edited, high PageRank) get embedded first. Exponential backoff handles transient provider failures without blocking the session.
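
The generation-counter pattern can be sketched as follows. `CancellationToken`, `run_build`, and the `embed` callback are hypothetical names used for illustration; the sketch runs on one thread and leaves out priority ordering and backoff.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Shared generation counter; bumping it invalidates every in-flight build.
#[derive(Clone)]
struct CancellationToken {
    generation: Arc<AtomicU64>,
}

impl CancellationToken {
    fn new() -> Self {
        Self { generation: Arc::new(AtomicU64::new(0)) }
    }

    /// Generation a build should remember when it starts.
    fn current(&self) -> u64 {
        self.generation.load(Ordering::SeqCst)
    }

    /// Called on reconfigure: any build started under an older generation stops.
    fn cancel(&self) {
        self.generation.fetch_add(1, Ordering::SeqCst);
    }

    fn is_cancelled(&self, started_at: u64) -> bool {
        self.current() != started_at
    }
}

/// Embeds batches in order, checking for cancellation before each one.
/// Returns the number of batches that were embedded.
fn run_build(
    token: &CancellationToken,
    batches: &[Vec<&str>],
    mut embed: impl FnMut(&[&str]),
) -> usize {
    let started_at = token.current();
    let mut done = 0;
    for batch in batches {
        if token.is_cancelled(started_at) {
            break; // a newer configuration arrived; abandon this build
        }
        embed(batch.as_slice());
        done += 1;
    }
    done
}
```

Because the check happens once per batch, a cancelled build finishes at most the batch it is currently embedding before it exits.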

Stale-vector pruning

When files are edited, deleted, moved, excluded, or re-included, the index tracks which vectors are stale and prunes them during the next refresh cycle. Every vector record carries file/chunk ownership metadata (file path, version, chunk hash, index fingerprint) so pruning is traceable and deterministic.
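
A simplified view of ownership-based pruning, assuming each vector record is keyed by file path and chunk hash. `VectorRecord`, `prune_stale`, and the `live` map are hypothetical; the PR's records also carry a version and index fingerprint, which this sketch omits.

```rust
use std::collections::{HashMap, HashSet};

/// Ownership metadata stored next to each vector (reduced for illustration).
struct VectorRecord {
    file_path: String,
    chunk_hash: u64,
}

/// Keeps only records whose file still exists in `live` and whose chunk hash
/// is still produced by the current file contents. Returns how many were pruned.
fn prune_stale(
    records: &mut Vec<VectorRecord>,
    live: &HashMap<String, HashSet<u64>>,
) -> usize {
    let before = records.len();
    records.retain(|r| {
        live.get(&r.file_path)
            .map_or(false, |hashes| hashes.contains(&r.chunk_hash))
    });
    before - records.len()
}
```

A deleted or excluded file simply has no entry in `live`, so all of its vectors are dropped; an edited file keeps only the chunks whose hashes did not change.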

File policy and docs chunking

A configurable file policy controls which files enter the index (include globs, exclude globs, max file size, max chunk count). The docs chunker splits Markdown and documentation files into semantic sections before embedding, improving recall for documentation-shaped queries.
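
Heading-based splitting for Markdown might be sketched like this. `DocChunk`, `heading_title`, and `split_markdown_by_heading` are hypothetical names, and the sketch deliberately ignores details a real chunker would need, such as `#` lines inside fenced code blocks and chunk size limits.

```rust
#[derive(Debug, PartialEq, Eq)]
struct DocChunk {
    /// Heading text, or empty for content before the first heading.
    title: String,
    body: String,
}

/// Returns the heading text for ATX headings ("# Title" through "###### Title").
fn heading_title(line: &str) -> Option<&str> {
    let rest = line.trim_start_matches('#');
    let level = line.len() - rest.len();
    if (1..=6).contains(&level) && rest.starts_with(' ') {
        Some(rest.trim())
    } else {
        None
    }
}

/// Splits a Markdown document into one chunk per heading section.
fn split_markdown_by_heading(source: &str) -> Vec<DocChunk> {
    let mut chunks = Vec::new();
    let mut title = String::new();
    let mut body = String::new();
    for line in source.lines() {
        if let Some(next_title) = heading_title(line) {
            // Flush the section collected so far, skipping a fully empty preamble.
            if !title.is_empty() || !body.trim().is_empty() {
                chunks.push(DocChunk { title: title.clone(), body: body.trim().to_string() });
            }
            title = next_title.to_string();
            body.clear();
        } else {
            body.push_str(line);
            body.push('\n');
        }
    }
    if !title.is_empty() || !body.trim().is_empty() {
        chunks.push(DocChunk { title, body: body.trim().to_string() });
    }
    chunks
}
```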

Reranking pipeline

Optional reranking via any OpenAI-compatible /v1/rerank or chat-completion endpoint. The pipeline sends initial retrieval candidates to a reranker, parses the response (supporting multiple JSON shapes), and reorders results with safe fallback — if the reranker fails, the original cosine-similarity order is returned unchanged. Config fields: rerank.enabled, rerank.model, rerank.base_url, rerank.api_key_env, rerank.max_candidates.
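
The safe-fallback behaviour can be illustrated with a small reordering helper. `apply_rerank` is a hypothetical function; in particular, appending candidates the reranker omitted (instead of dropping them) is an assumption of this sketch, not something the description above specifies.

```rust
/// Reorders `candidates` using indices returned by a reranker.
/// On reranker failure the original retrieval order is returned unchanged.
fn apply_rerank<T: Clone>(candidates: &[T], reranked: Result<Vec<usize>, String>) -> Vec<T> {
    let indices = match reranked {
        Ok(indices) => indices,
        Err(_) => return candidates.to_vec(), // safe fallback
    };
    let mut seen = vec![false; candidates.len()];
    let mut out = Vec::with_capacity(candidates.len());
    for i in indices {
        // Ignore out-of-bounds and duplicate indices instead of failing.
        if i < candidates.len() && !seen[i] {
            seen[i] = true;
            out.push(candidates[i].clone());
        }
    }
    // Assumption: keep anything the reranker did not mention, in original order.
    for (i, candidate) in candidates.iter().enumerate() {
        if !seen[i] {
            out.push(candidate.clone());
        }
    }
    out
}
```

Filtering duplicate and out-of-range indices keeps a malformed reranker response from duplicating or losing results.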

Search pipeline metrics and diagnostics

Every aft_search call records timing, cache hits/misses, result counts, and reranker fallback events. Metrics are exposed through the status command and through JSONL diagnostic logs for offline analysis. The DiagnosticsOutputMode config controls verbosity in tool output (off | minimal | verbose).

Semantic doctor

semantic_doctor is a health-check command that reports config summary, index summary, metrics summary, provider summary, and actionable suggestions. Use it to verify that the index is healthy, the provider is reachable, and the configuration is consistent.

Semantic eval harness

semantic_eval runs a JSONL-defined evaluation suite against the semantic index. Each case specifies a query, expected paths, expected symbols, and top-k. The harness computes recall@k and MRR (Mean Reciprocal Rank) for quantifying retrieval quality across config changes.
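
For reference, the two metrics can be computed as in the sketch below, which scores by file path only. The function names are hypothetical, and cutting the reciprocal rank off at k is an assumption of this sketch; the harness in `semantic_eval.rs` may treat ranks beyond k differently.

```rust
/// Fraction of expected paths that appear in the top-k results.
fn recall_at_k(results: &[&str], expected: &[&str], k: usize) -> f64 {
    if expected.is_empty() {
        return 0.0;
    }
    let top = &results[..k.min(results.len())];
    let hits = expected.iter().filter(|e| top.contains(*e)).count();
    hits as f64 / expected.len() as f64
}

/// 1 / rank of the first relevant result within the top k, or 0.0 if none.
fn reciprocal_rank(results: &[&str], expected: &[&str], k: usize) -> f64 {
    results
        .iter()
        .take(k)
        .position(|r| expected.contains(r))
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}

/// Mean of the reciprocal ranks over all (results, expected) cases.
fn mean_reciprocal_rank(cases: &[(Vec<&str>, Vec<&str>)], k: usize) -> f64 {
    if cases.is_empty() {
        return 0.0;
    }
    let total: f64 = cases
        .iter()
        .map(|(results, expected)| reciprocal_rank(results, expected, k))
        .sum();
    total / cases.len() as f64
}
```

For example, if one query finds its expected file at rank 1 and another at rank 2, the reciprocal ranks are 1.0 and 0.5, giving an MRR of 0.75.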

Status integration

The status command now includes semantic health metrics: lifecycle state, entry count, dimension, total queries, cache hit ratio, average query time, and provider info. The OpenCode TUI sidebar surfaces these alongside the existing index state.

Config trust boundary

backend, base_url, and api_key_env are user-only fields — project-level aft.jsonc cannot inject these. A hostile repository cannot redirect embeddings at an attacker-controlled endpoint or exfiltrate API keys. The plugin logs a warning when it strips a project-level setting.
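
The stripping step can be pictured as below. Note that the PR implements this in the TypeScript plugin; this Rust sketch with a flat string map, the constant `USER_ONLY_FIELDS`, and the function `strip_project_fields` is purely illustrative and only covers the three fields named above.

```rust
use std::collections::BTreeMap;

/// Fields that only user-level config may set (from the description above).
const USER_ONLY_FIELDS: [&str; 3] = ["backend", "base_url", "api_key_env"];

/// Removes user-only keys from a project-level config and returns the names
/// that were stripped, so the caller can log a warning for each.
fn strip_project_fields(project: &mut BTreeMap<String, String>) -> Vec<&'static str> {
    let mut stripped = Vec::new();
    for key in USER_ONLY_FIELDS {
        if project.remove(key).is_some() {
            stripped.push(key);
        }
    }
    stripped
}
```

Using an explicit deny-list makes the boundary easy to audit, but it also means every new sensitive field has to be added to the list by hand.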

Contextualized document-chunk embedding (partial)

Initial support for Perplexity-style document/chunk grouped embedding — chunks from the same source document are batched together rather than flattened. Oversized document handling and retry logic are still in progress (see roadmap).

How to test

Default fastembed (zero-config)

# Enable semantic search in your AFT config
# ~/.config/opencode/aft.jsonc or ~/.pi/agent/aft.jsonc:
{ "semantic_search": true }

# Start a session — index builds in background
# Run aft_search with a concept query:
aft_search({ "query": "authentication middleware" })

Verify: results appear with source: semantic or source: hybrid tags. Status shows [index: ready] after build completes.

Provider switching

// Switch to OpenAI-compatible
{
  "semantic_search": true,
  "semantic": {
    "backend": "openai_compatible",
    "model": "text-embedding-3-small",
    "base_url": "https://api.openai.com/v1",
    "api_key_env": "OPENAI_API_KEY"
  }
}

Verify: index rebuilds automatically on next session start. Status shows new provider/model.

Reranking

{
  "semantic_search": true,
  "semantic": {
    "backend": "openai_compatible",
    "model": "text-embedding-3-small",
    "base_url": "https://api.openai.com/v1",
    "api_key_env": "OPENAI_API_KEY"
  },
  "rerank": {
    "enabled": true,
    "model": "rerank-english-v3.0",
    "base_url": "https://api.cohere.com",
    "api_key_env": "COHERE_API_KEY"
  }
}

Verify: search results show reranker-sorted order. Disable reranker — results fall back to cosine order.

Semantic doctor

aft_search({ "query": "test" })  # trigger index build if cold
# Then check health via status command or semantic_doctor

Verify: health report shows ConfigSummary, IndexSummary, MetricsSummary, ProviderSummary.

Eval harness

// Create eval-cases.jsonl:
{"query": "authentication handler", "expected_paths": ["src/auth/middleware.ts"], "expected_symbols": ["authMiddleware"], "top_k": 10}
{"query": "database connection", "expected_paths": ["src/db/pool.ts"], "expected_symbols": ["createPool"], "top_k": 10}

Verify: returns recall@k and MRR scores.

Test coverage

~93 tests across 8 test sub-tasks covering:

  • Config parsing and backward compatibility
  • Fingerprint diff matrix (all field combinations → Rebuild/ClearQueryCache/None)
  • File policy, docs chunking, and manifest handling
  • VectorStore trait with DenseF32 and BinaryPacked implementations
  • Binary packed-vector storage and Hamming search
  • Lifecycle states, snapshots, and stale-vector pruning
  • Search pipeline metrics, diagnostics, and DiagnosticsOutputMode
  • Concurrency, race conditions, and cancellation token behavior
  • Security trust boundary enforcement (project config stripping)
  • Semantic doctor health report
  • Semantic eval harness (JSONL parsing, scoring, recall/MRR)
  • Reranking pipeline (parse multiple JSON shapes, fallback on failure)

Roadmap

Still in progress or planned for follow-up:

  • aft-t6p.23: Complete contextualized document-chunk embedding (oversized docs, retry logic) — partially implemented
  • aft-t6p.2.2: Configurable snippet truncation in reranking (currently hardcoded at 200 chars)
  • aft-t6p.18: End-to-end verification across all backends
  • aft-t6p.5: Configuration and operations documentation
  • Performance benchmarking suite
  • Migration tooling for index format upgrades

Architecture notes

Key new modules:

  • crates/aft/src/semantic_rerank.rs — reranking pipeline with safe fallback
  • crates/aft/src/semantic_diagnostics.rs — JSONL diagnostic logging
  • crates/aft/src/semantic_doctor.rs — health-check report generation
  • crates/aft/src/semantic_eval.rs — evaluation harness (JSONL parser, scoring)
  • crates/aft/src/vector_store.rs — VectorStore trait with DenseF32 and BinaryPacked implementations
  • crates/aft/src/commands/semantic_doctor.rs — doctor command handler
  • crates/aft/src/commands/semantic_eval.rs — eval command handler

Modified significantly:

  • crates/aft/src/semantic_index.rs — lifecycle management, fingerprint-driven invalidation, non-blocking build, stale pruning, typed vectors
  • crates/aft/src/config.rs — provider profiles, rerank config, trust boundary fields
  • crates/aft/src/commands/status.rs — semantic health metrics
  • crates/aft/src/commands/semantic_search.rs — reranking integration, diagnostics output mode



Summary by cubic

Provider‑aware semantic search with typed vectors, cross‑encoder/chat reranking, rich diagnostics, doctor/eval tools, and a Semble benchmark suite. This alpha also ships model2vec auto‑download with health/version checks, overflow‑safe embedding chunking, contextualized embeddings, an optional semantic-fts5 baseline, prompt profiles, result capping, and updated builds/workflows.

  • New Features

    • Reranking: chat or /v1/rerank via rerank_api_type, custom rerank_prompt_template, fence‑tolerant parsing, 2 MiB body cap, and overfetch‑then‑truncate (rerank_max_candidates, rerank_max_candidate_chars[_cross_encoder]).
    • Embeddings/indexing: typed vectors (f32/int8/binary→Hamming), overflow‑safe chunking (max_embed_tokens, chunk_overlap_tokens), contextualized document‑chunk embeddings, prompt profiles, effective document_prompt_template at index time, and per‑file result caps via max_results_per_file.
    • Providers: capability profiles plus model2vec backend with HF download/cache/validation, doctor integration, version checks, and local support behind semantic-model2vec.
    • Lifecycle: file policy + docs chunker; fingerprint‑driven rebuilds; background cold start with Partial status and cancellation; stale‑vector pruning; VectorStore abstraction; V8 snapshot with per‑entry chunk_hash and a file manifest.
    • Diagnostics/bench/builds: JSONL logging with rotation/retention, DiagnosticsOutputMode, warning dedup, cache‑hit metrics, semantic_doctor, semantic_eval; optional FTS5 lexical baseline behind semantic-fts5; default builds enable semantic-model2vec/semantic-fts5; manual multi‑platform build-aft workflow; Docker‑based Rust validation; Semble pilot/CI scripts and reports.
  • Bug Fixes

    • Security: project configs cannot set rerank_prompt_template; SSRF validation for rerank_base_url; normalized, traversal‑safe JSONL log paths; bounded streaming for reranker bodies.
    • Reranker/search correctness: score‑sorted handling across results/data/scores shapes, duplicate‑index filtering, out‑of‑bounds index warnings, accurate more_available after post‑rerank truncation, prevent candidate starvation by overfetching before rerank, clamp cosine NaNs, enforce per‑file caps and fusion ordering.
    • UX/diagnostics: surface reranker failures in minimal mode, warning dedup, fix diagnostics gating; eval harness wired to the real search pipeline.

Written for commit baacf80. Summary will update on new commits.


Greptile Summary

This alpha PR replaces the prototype semantic search subsystem with a production-oriented retrieval pipeline: typed vectors (DenseF32, Int8SourceDecoded, BinaryPacked), provider capability profiles, fingerprint-driven index invalidation, background build with cooperative cancellation, optional reranking, diagnostics/doctor/eval tooling, and a security trust boundary that prevents project configs from redirecting embeddings to attacker-controlled endpoints.

  • Core pipeline overhaul (semantic_index.rs, vector_store.rs): typed vectors with correct distance metrics (Hamming for binary, cosine for f32), stale-vector pruning, priority-ordered cold-start build, and fingerprint-based rebuild decisions.
  • Reranking pipeline (semantic_rerank.rs): supports chat-completions and cross-encoder /v1/rerank endpoints with multiple response-format parsers, body-size cap, and safe fallback to original cosine order on any failure.
  • Diagnostics, doctor, and eval (semantic_diagnostics.rs, semantic_doctor.rs, semantic_eval.rs): JSONL diagnostic logging, health-check command, and JSONL-based eval harness with recall@k and MRR scoring.

Confidence Score: 4/5

Safe to merge with awareness that several known issues from the previous review cycle remain open (reranker failures silent in minimal mode, more_available undercount, data-format rerank ordering, non-ASCII eval file corruption), plus the new gap where project configs can supply an adversarial rerank prompt template.

The new rerank_prompt_template stripping gap is the only genuinely new finding in this update; all other open issues were already identified in earlier review rounds. The change is an alpha build with documented limitations and strong test coverage (~93 tests). The trust boundary is mostly correct, the TypeScript enum mismatches are now fixed, and the vector-store and Hamming-distance math are sound.

packages/opencode-plugin/src/config.ts — add rerank_prompt_template to the stripProjectSemanticFields function. crates/aft/src/semantic_diagnostics.rs — surface RerankerFailure in minimal output mode.

Security Review

  • Prompt injection via rerank_prompt_template in project config (packages/opencode-plugin/src/config.ts:1481): query_prompt_template and document_prompt_template are correctly stripped from project-level configs; rerank_prompt_template is not. A hostile repository can supply an adversarial reranker prompt that manipulates search result ordering for the user. Data exfiltration is not possible (the reranker endpoint URL is user-only), but result integrity can be silently degraded.
  • No new credential-leakage, injection, or authentication bypass issues introduced by this PR. The trust-boundary stripping of backend, base_url, api_key_env, rerank_base_url, and rerank_api_key_env from project configs is correctly implemented.

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/aft/src/semantic_rerank.rs | New reranking pipeline with body-size cap, fence stripping, and multiple response-format parsers. The build_rerank_endpoint trailing-slash fix is present. The data/results formats in extract_indices_from_rerank_results return insertion order rather than score-sorted order (flagged in previous review cycle). |
| crates/aft/src/semantic_eval.rs | New eval harness with JSONL parsing, recall@k, and MRR scoring. strip_trailing_commas is implemented but corrupts non-ASCII bytes (previously flagged). The score_case docstring says hits beyond k don't affect first_hit_rank, but the code does set it, so MRR includes full-list ranks, not @k ranks. |
| crates/aft/src/semantic_diagnostics.rs | New diagnostic subsystem. format_warning_minimal returns None for RerankerFailure, silently suppressing reranker failures in the default Minimal output mode (flagged in previous review cycle). |
| crates/aft/src/commands/semantic_search.rs | Reranking, diagnostics, and fusion-limit integration. more_available is computed against fusion_limit rather than top_k, so candidates between top_k and fusion_limit are silently dropped after reranking without setting more_available = true (flagged in previous cycle). OOB reranker-index warning still gated on diagnostics_enabled. |
| packages/opencode-plugin/src/config.ts | TypeScript Zod schema updated with 16+ new fields including corrected enum values (base64_binary, binary_packed, dot_product, perplexity). Trust-boundary stripping function added. rerank_prompt_template is exposed in the schema but not stripped from project configs, unlike query_prompt_template and document_prompt_template. |
| crates/aft/src/vector_store.rs | New VectorStore trait with FlatF32 (cosine) and FlatBinaryHamming (Hamming distance) implementations. Score normalization, orphan pruning, and stale-vector pruning look correct. Hamming similarity computation is sound. |
| crates/aft/src/config.rs | Provider capability profiles, RerankApiType enum, trust-boundary doc comments, and expanded SemanticBackendConfig. RerankApiType uses snake_case serde, matching the TypeScript ["chat", "rerank"] values. |
| crates/aft/src/compress/trust.rs | Added security-focused tests: atomic writes, multi-project trust, idempotent untrust, and reload survival. No logic changes. |

Comments Outside Diff (2)

  1. packages/opencode-plugin/src/config.ts, line 37-54

    P1 TypeScript enum values don't match the Rust serde strings — config will fail to deserialize

    Several new enum schemas use values that don't align with the Rust serde representation:

    • SemanticOutputEncodingEnum allows "binary", "ubinary", "int8", "uint8" but Rust OutputEncoding deserializes from "base64_binary" and "base64_int8".
    • SemanticStorageStrategyEnum allows "flat" and "binary_pack" but Rust StorageStrategy expects "native_f32" and "binary_packed".
    • SemanticInputModeEnum includes "chunk_extracts" and "contextualized" but Rust InputMode only has "flat_texts" and "document_chunks".
    • SemanticDistanceMetricEnum uses "dot" but Rust DistanceMetric expects "dot_product".
    • SemanticBackendEnum is missing the new "perplexity" variant added to Rust.

    A user who follows the TypeScript autocomplete and picks output_encoding: "int8" will pass TypeScript validation but receive a deserialization error (or silent fallback to default) from the Rust binary at runtime.

  2. crates/aft/src/commands/semantic_search.rs, line 116-119

    P1 more_available understates available results when reranking is active

    fused_more_available is now computed as results.len() > fusion_limit (e.g., > 20) rather than > top_k (e.g., > 10). After reranking, results.truncate(top_k) discards any candidates between positions top_k and fusion_limit, but more_available has already been set and stays false. Concretely: if the fused pool yields 15 candidates (top_k=10, rerank_max_candidates=20), fused_more_available = 15 > 20 = false, more_available = false, and the 5 reranked-but-discarded candidates are silently dropped with no "more results" hint surfaced to the agent.

    Capture the pool size before truncation and fold it into more_available after the rerank block, before results.truncate(top_k).
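
    One way to express the suggested fix, as a sketch only: `truncate_with_more_available` is a hypothetical helper, and the surrounding handler code in semantic_search.rs is not reproduced here.

```rust
/// Truncates `results` to `top_k` and reports whether more results exist,
/// either because fusion already overflowed or because truncation dropped some.
fn truncate_with_more_available<T>(
    results: &mut Vec<T>,
    top_k: usize,
    fused_more_available: bool,
) -> bool {
    // Capture the pool size before truncation so dropped candidates are counted.
    let pool_size = results.len();
    results.truncate(top_k);
    fused_more_available || pool_size > top_k
}
```

    With the example above (15 fused candidates, top_k = 10, fused_more_available = false), this returns true and leaves 10 results.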

Reviews (29): Last reviewed commit: "feat(semantic): add model prompt profile..."

Zireael and others added 30 commits May 24, 2026 11:10
Add scripts, docs, Dockerfile, and package.json scripts for Docker-based
Rust validation (fmt/check/clippy/test) so Windows users without MSVC
Build Tools can still validate Rust code.

- scripts/docker-rust.ps1: PowerShell script supporting fmt/check/clippy/
  test/validate/shell tasks with persistent Docker volumes
- Dockerfile.rust: minimal Rust image with rustfmt + clippy pre-installed
- docs/docker-rust-validation.md: full usage and design documentation
- package.json: 6 new docker:rust:* convenience scripts

Design: Linux-target validation via rust:1-bookworm, persistent cargo
volumes for caching, fail-fast sequential validation.
- SemanticFilePolicy config struct with include_code/include_docs/
  include_configs/binary_detection/generated_file_detection/globs
- parse_semantic_files_config handler in configure.rs
- File policy evaluation: should_index_file(), is_generated_file(),
  is_config_file(), is_docs_file()
- Docs chunker: collect_docs_chunks() with heading-based splitting
  for markdown, splitting by file for other doc types
- collect_chunks routes doc files through docs chunker, skips
  binary/generated/config files per policy
- SemanticIndexFingerprint extended with file_policy_hash and
  docs_chunker_version; diff() triggers rebuild on policy change
- build_with_progress/refresh_stale_files accept &SemanticFilePolicy
- compute_file_policy_hash() deterministic hash of policy fields
- Re-export SemanticFilePolicy from semantic_index module
- All test callers updated with &SemanticFilePolicy::default()
…iority ordering, backoff

- CancellationToken (Arc<AtomicU64> generation counter) for cooperative build cancellation on reconfigure
- Cancel old semantic index builds instead of detaching when config changes
- Priority file ordering: README/docs first, then core source, then tests, then rest
- Embedding backoff: exponential retry with jitter for remote provider rate limits
- SemanticIndexStatus::Partial variant with completeness percentage for partial builds
- Search reports partial index state during cold start
- Phase-boundary cancellation checks between model init, disk read, incremental refresh, and full rebuild
Add Perplexity backend with InputMode::DocumentChunks support for
contextualized embedding where chunks carry document-level context.

- SemanticBackend::Perplexity variant with config, profile, engine
- DocumentChunks/PerDocumentChunks/DocumentEmbeddings structs
- embed_document_chunks() routes Perplexity to grouped embedding API
- build_with_progress_contextualized() groups chunks by document
- Wire configure.rs to branch on input_mode: DocumentChunks
- SemanticEmbeddingModel::input_mode() public accessor
- EmbeddingModelProfile with contextualized_supported guard
- Response validation: index continuity, missing documents, dimension
…to trait-backed module

Bead: aft-t6p.12

Extracts Vec<EmbeddingEntry> storage and search from SemanticIndexSnapshot
into a VectorStore trait with FlatF32VectorStore implementation. This
decouples the storage layer from the lifecycle logic and prepares for
alternative backends (binary Hamming, approximate ANN).

Key changes:
- vector_store.rs: VectorStore trait + ScoredChunk/PruneStats types
- FlatF32VectorStore: flat scan with cosine similarity (preserves existing
  behaviour exactly)
- FlatBinaryHammingVectorStore: forward-looking Hamming-search impl
- SemanticIndexSnapshot delegates search/len/prune/entries to store
- Fixed dimension-sync bug where set_dimension updated the snapshot
  dimension but not the store dimension, causing search to return 0
- EmbeddingEntry and IndexedFileMetadata made pub for trait compatibility
On Windows, use copyFileSync for the binary replacement (which overwrites
the target — renameSync fails with EEXIST). If it fails, the original
binary at binaryPath is preserved.

The temp file cleanup is now wrapped in its own try/catch so a cleanup
failure does NOT propagate as a download failure — the binary was already
successfully placed at binaryPath.

Addresses PR cortexkit#69 cubic review finding P2.
Implement bead aft-t6p.24: file identity manifest + vector ownership records.

Changes:
- **FileRecord struct**: identity record with content_hash, size_bytes, mtime,
  language, document_kind, inclusion_policy_hash, indexed_at
- **file_manifest on SemanticIndexSnapshot**: HashMap<PathBuf, FileRecord>
  tracking which files produced which vectors, enabling precise stale-vector
  pruning when files are edited, deleted, or excluded
- **V8 serialization format**: extends V7 with per-entry chunk_hash (after
  each vector) and file manifest block (after all entry vectors). Full
  backward compatibility with V1-V7 reads.
- **chunk_hash on EmbeddingEntry**: deterministic hash of chunk content fields
  for tracing which version of a chunk produced a stored vector
- **compute_chunk_hash**: blake3-based deterministic hash
- **build_manifest_from_store helper**: populates file_manifest from store's
  file_metadata, called in all builder functions (build_from_chunks,
  build_with_progress_contextualized, refresh_stale_files) and from_bytes
  for V1-V7 cache migration
- **next_chunk_id, fingerprint_string**: forward-looking fields on snapshot
  for future unique ID assignment and fingerprint tracking
…rmalization, and model profiles

Adds aft-t6p.20 (Typed embedding vector representation +
storage-strategy resolution):

- TypedVector (source-side) and StoredVector (persisted) enums
  with DenseF32, DenseInt8, BinaryPacked, and Quantized variants
- StorageStrategy (NativeF32, DecodeNormalizeF32, BinaryPacked)
- VectorKind enum for runtime type tagging
- DistanceMetric (Cosine, DotProduct, Euclidean, Hamming)
- NormalizationPolicy (AlreadyNormalized, NormalizeOnInsertQuery,
  NotApplicable)
- EmbeddingModelProfile fields: source_vector_kind, stored_vector_kind,
  metric, normalization
- convert_vector() / validate_compatible() on EmbeddingModelProfile
- blake3 dependency for chunk hashing
… + dummy base_url for Perplexity profile test

Two fixes for `fingerprint_invalidation_tests`:
- Mock HTTP server now lowercases header names before matching
  Content-Length (reqwest/hyper sends lowercase `content-length:`).
- `base64_int8_profile_from_config_selects_correctly` test provides a
  dummy `base_url` for the Perplexity backend (required by `from_config`).

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
- Add StorageStrategy::BinaryPacked variant for packed-bit vector storage
- Add EmbeddingModelProfile::perplexity_binary() with BinaryPacked → Hamming path
- Wire from_config to select perplexity_binary profile when Base64Binary encoding
- Implement parse_embedding_value for Base64Binary (decode → 0.0/1.0 f32 vec)
- Implement into_stored for TypedVector::BinaryPacked (requires BinaryPacked strategy)
- Update validate_config and validate_compatible to accept Base64Binary+BinaryPacked
- Replace old "not yet supported" test with parse_embedding_value_base64_binary_succeeds
- 886/893 tests pass (7 pre-existing Docker failures)

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
Add semantic_diagnostics module with SearchDiagnostics, SearchPipelineType,
SearchWarning, SearchMetricsCollector, PhaseTimer, score_statistics,
top1_margin. Instrument handle_semantic_search with per-phase timing
and warning collection. Wire SearchMetricsCollector into AppContext.
17 new tests, 902/910 lib tests pass (8 pre-existing Docker failures).

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
- Add SemanticDiagnosticsLogger with file append, rotation (50 MB), and
  retention cleanup (file-deletion based on mtime)
- Add SearchDiagnosticsEvent struct for JSONL serialization with
  raw_query redaction (opt-in via include_raw_queries) and snippet
  placeholder (include_snippets)
- Add config fields: jsonl_logging, jsonl_path, include_raw_queries,
  include_snippets, retention_days to SemanticBackendConfig
- Add lazy-init diagnostics_logger on AppContext with
  resolve_diagnostics_log_path helper (env var → project root → ~/.cache)
- Wire JSONL record into handle_semantic_search diagnostics block
- 4 new tests: raw query redaction, raw query inclusion, disk write
  verification, missing-file recovery
- 907/914 lib tests pass (7 pre-existing Docker failures)

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
…rch output

Add DiagnosticsOutputMode enum (Off/Minimal/Verbose) and output_mode field
to SemanticBackendConfig. Implement format_diagnostics_prefix() for
Minimal (warnings only) and Verbose (scores + latency + warnings)
output modes. Wire into handle_semantic_search response text.
4 new tests, 25 diagnostics tests total. 910/918 lib tests pass
(8 pre-existing Docker failures).

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
Add optional reranking via OpenAI-compatible chat endpoint. When
enabled, aft_search overfetches candidates, sends them to a reranker
model, and re-sorts by relevance. Falls back gracefully on any error.

- Add RerankConfig fields to SemanticBackendConfig (rerank_enabled,
  rerank_model, rerank_base_url, rerank_api_key_env, rerank_timeout_ms,
  rerank_max_candidates)
- Create semantic_rerank.rs with RerankerClient, RerankOutcome enum,
  and rerank_candidates function
- Add RerankerFailure warning variant to SearchWarning
- Wire reranking into handle_semantic_search (overfetch → rerank → re-sort)
- Add rerank_latency_ms to SearchDiagnostics and SearchDiagnosticsEvent
- Include rerank latency in verbose diagnostics output
- 6 unit tests for reranker parsing, skip conditions, and failure handling

All 25 diagnostics + 6 reranker tests pass. 917/924 total tests pass
(7 pre-existing Docker infrastructure failures).
Add 40+ unit tests to fingerprint_invalidation_tests covering:
- SemanticBackendConfig deserialization (minimal, all-fields, defaults)
- EmbeddingModelProfile validation for all encoding types
- TypedVector conversion and StoredVector roundtrip
- convert_vector and validate_compatible rejection paths
- Distance metric auto-resolution for f32/int8/binary
- base64_int8 signed int8 decode correctness
- Template hashing, enum roundtrips, resolve helpers

Minor: add #[derive(Debug)] to StoredVector for test ergonomics.

Closes aft-t6p.6.1
Add 6 new tests to fingerprint_invalidation_tests covering:
- file_policy_hash mismatch triggers rebuild
- docs_chunker_version mismatch triggers rebuild
- multi-field changes still trigger rebuild
- rebuild+query_prompt: rebuild wins
- only query_prompt change: ClearQueryCache
- non-fingerprint field changes: NoChange

Total: 22 fingerprint tests. Closes aft-t6p.6.2
Add 29 tests covering:
- is_generated_file: protobuf, minified, dist, build, generated, dart
- is_doc_extension and is_config_extension validation
- classify_semantic_file for code/doc/config
- collect_docs_chunks markdown heading splitting
- SemanticFilePolicy defaults and builtin globs
- FileRecord field population
- build_manifest_from_store construction and cleanup

Closes aft-t6p.6.3
… tests

Add 23 tests covering:
- FlatF32VectorStore: search, empty, dimension mismatch, CRUD, prune, stats
- FlatBinaryHammingVectorStore: search, ranking, prune, delete, stats
- hamming_distance and popcount64 correctness
- Binary decode: byte-aligned, non-byte-aligned, padding, error

Closes aft-t6p.6.4
Add 8 tests covering:
- SemanticIndexLifecycle: cold start, set/get, failed+error, all variants
- SemanticIndexSnapshot: search ranking, immutability after clone
- VectorStore: prune_stale_vectors, prune_orphans

Closes aft-t6p.6.5
Add 10 tests covering:
- HybridRerank pipeline type display
- Metrics collector: window size 1, cache hit rate, zero result rate,
  low confidence rate, latency percentiles
- Diagnostics output mode defaults
- Warning formatting: minimal (all variants, verifies suppressed),
  verbose (all 9 variants)
- SearchWarning serde roundtrip for all 8 variants

Closes aft-t6p.6.6
Add 4 tests covering:
- Concurrent snapshot clones produce independent results
- Concurrent read threads see identical data via Arc
- Mutex contention across 10 threads does not deadlock
- Arc strong_count tracks clone/drop correctly

Closes aft-t6p.6.7
Add 6 tests covering:
- Trust file atomic write (no tmp files left behind)
- Multiple projects trusted independently
- Untrust is idempotent
- Trust state survives reload (serde roundtrip)
- Nonexistent project path is untrusted (fail-closed)
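
The "no tmp files left behind" property is typically achieved by writing to a sibling temporary file and renaming it over the target. The sketch below shows that pattern using only the standard library; `write_atomic` and the `.tmp` suffix are hypothetical and not taken from the PR.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Write `contents` to `target` via a sibling temp file plus rename, so a
/// reader never observes a half-written file. The `.tmp` suffix is an
/// assumption made for this sketch.
pub fn write_atomic(target: &Path, contents: &[u8]) -> io::Result<()> {
    let mut tmp_name = target
        .file_name()
        .map(|name| name.to_os_string())
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "target has no file name"))?;
    tmp_name.push(".tmp");
    let tmp_path = target.with_file_name(tmp_name);
    {
        let mut file = fs::File::create(&tmp_path)?;
        file.write_all(contents)?;
        // Flush to disk before the rename makes the new content visible.
        file.sync_all()?;
    }
    // The rename consumes the temp file, so nothing is left behind on success.
    fs::rename(&tmp_path, target)
}
```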

Closes aft-t6p.6.8
The validate_compatible_rejects_binary_stored_with_cosine_metric test
was missing source_vector_kind: BinaryPacked, causing the first match
block to fail with 'unsupported source→stored vector conversion' instead
of reaching the metric compatibility check.
Zireael added 29 commits June 13, 2026 12:56
…g tests

Replace invalid partial tokenizer JSON with valid full-root BPE fixtures.
Un-ignore 5 model2vec loading tests that now pass offline.

- Add build_tokenizer_json() helper with correct schema
- Remove invalid root-level strategy field
- Set padding: null, ignore_merges: true, Whitespace pre_tokenizer
- Add tokenizer_json_characterization() test
- Document fixture contract in semantic-search-upgrade doc

model2vec-rs 0.2.1 uses tokenizers 0.21.4 (transitive), while AFT uses 0.22.2.
…ment

Resolve 34 merge conflicts across:
- Lock files (Cargo.lock, bun.lock)
- Package configs (package.json, Cargo.toml)
- TypeScript sources (bridge, opencode-plugin, pi-plugin)
- Rust sources (search_index, inspect, callgraph_store, main)
- Semantic search (semantic_index.rs — took upstream structural refactoring)

Semantic search features (rerank, diagnostics, doctor, eval, model2vec)
were in stashed changes and will be re-applied after merge validation.
… on top of upstream merge

semantic_index.rs re-integration tracked in separate atomic beads:
- aft-t6p.merge.typed-vectors: VectorKind, NormalizationPolicy, TypedVector, StoredVector
- aft-t6p.merge.v7v8-format: SEMANTIC_INDEX_VERSION_V7/V8, serialization
- aft-t6p.merge.file-policy: SemanticFilePolicy, classify_semantic_file
- aft-t6p.merge.contextualized: retry logic, chunking constants
- aft-t6p.merge.profiles: EmbeddingModelProfile
…tion errors (aft-t6p.49)

- Restore [features] section with semantic-model2vec and semantic-fts5
- Add missing base64, model2vec-rs dependencies
- Fix clippy::explicit-auto-deref in semantic_doctor.rs:249
- Add SemanticIndexStatus::Partial to match in main.rs
- Allow dead_code for upstream unused functions in cache_freshness.rs
…e (aft-t6p.56)

RerankerFailure now shows in minimal mode ("reranker unavailable, using original order").
Test expectation updated to match current behavior.
FTS5 has zero production callers — only unit tests. Compiled stubs bloat every default build.
Now opt-in only via --features semantic-fts5.
Verified: cargo check passes both with and without FTS5 feature.
Added: dimensions, distance_metric, diagnostics_enabled, output_mode,
rerank_enabled, rerank_api_type, rerank_max_candidate_chars, cap_per_file.
All fields documented with types, defaults, and descriptions.
…e parser (aft-fts5e2e.1)

Add the foundation for the opt-in FTS5 side feature:
- Fts5Config struct with safe defaults (all disabled) in config.rs
- parse_fts5_config() in configure.rs for NDJSON protocol parsing
- Feature-gated command stubs (fts5_index, fts5_search, fts5_find_symbol,
  fts5_read_symbol, fts5_doctor) that return clear disabled/unavailable
  responses when the feature is compiled but runtime-disabled
- Dispatch entries in main.rs behind #[cfg(feature = "semantic-fts5")]
- 9 new tests covering config defaults, disabled-state responses, and
  doctor command output

All FTS5 code is behind the semantic-fts5 Cargo feature, which is NOT
in default features. The feature is invisible unless it is both compiled
with --features semantic-fts5 AND enabled via [fts5].enabled = true in config.
…5e2e.2)

Replace the single-table spike architecture with a production-shaped,
versioned multi-table SQLite schema for the FTS5 side feature.

Schema v1 tables:
- fts5_meta: schema version, build metadata
- fts5_files: file paths, content hashes, mtime
- fts5_symbols: symbol names, kinds, ranges, bodies
- fts5_symbols_fts: FTS5 virtual table for symbol name search (unicode61)
- fts5_symbol_bodies_fts: FTS5 virtual table for body search (trigram)
- fts5_paths_fts: FTS5 virtual table for file path search (trigram)

Fts5Store provides:
- Versioned schema creation with automatic rebuild detection
- WAL mode for concurrent read performance
- Transactional file/symbol upsert with ON CONFLICT handling
- Cascade delete (file → symbols)
- Exact SQL symbol lookup by name
- FTS5 row counts, integrity checks, and DB size reporting
- Body truncation with configurable char/line limits
- 10 unit tests covering CRUD, cascade, schema, and diagnostics
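
As an illustration of the body truncation bullet above, here is a minimal sketch that applies a line limit first and a character limit second. The function name, the order of the two limits, and the returned `truncated` flag are assumptions, not the store's actual API.

```rust
/// Truncate a symbol body to at most `max_lines` lines and `max_chars`
/// characters. Returns the kept prefix and whether anything was cut.
pub fn truncate_body(body: &str, max_chars: usize, max_lines: usize) -> (String, bool) {
    // Keep whole lines (including their newline) up to the line limit.
    let by_lines: String = body.split_inclusive('\n').take(max_lines).collect();
    // Then cap by characters, not bytes, so multi-byte text is never split.
    let kept: String = by_lines.chars().take(max_chars).collect();
    // `kept` is always a prefix of `body`, so comparing byte lengths is enough.
    let truncated = kept.len() < body.len();
    (kept, truncated)
}
```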
Implement the FTS5 indexer that walks project files, extracts symbols
with tree-sitter, and populates the Fts5Store.

Fts5Indexer provides:
- Full project indexing with bounded file count
- Incremental update (skips unchanged files by content hash + mtime)
- Stale file removal (files no longer in project)
- Full rebuild (clear + reindex)
- Symbol extraction via LanguageProvider trait (tree-sitter)
- File records with content hashes for change detection
- Symbol records with names, kinds, ranges, and bodies
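
The incremental update and stale-removal behaviour listed above can be sketched as a pure planning step over two snapshots. All names here (`FileStamp`, `UpdatePlan`, `plan_update`) are hypothetical, and treating "unchanged" as "same content hash and same mtime" is an assumption based on the description.

```rust
use std::collections::BTreeMap;

/// What the index remembers about a file, and what is observed on disk.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FileStamp {
    pub content_hash: u64,
    pub mtime: i64,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct UpdatePlan {
    pub added: Vec<String>,
    pub updated: Vec<String>,
    pub removed: Vec<String>,
    pub skipped: usize,
}

/// Compare the indexed snapshot with the on-disk snapshot and decide which
/// files need work. Files with an identical stamp are skipped.
pub fn plan_update(
    indexed: &BTreeMap<String, FileStamp>,
    on_disk: &BTreeMap<String, FileStamp>,
) -> UpdatePlan {
    let mut plan = UpdatePlan::default();
    for (path, stamp) in on_disk {
        match indexed.get(path) {
            None => plan.added.push(path.clone()),
            Some(old) if old == stamp => plan.skipped += 1,
            Some(_) => plan.updated.push(path.clone()),
        }
    }
    // Anything indexed but no longer on disk is stale and should be removed.
    for path in indexed.keys() {
        if !on_disk.contains_key(path) {
            plan.removed.push(path.clone());
        }
    }
    plan
}
```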

4 unit tests covering:
- Basic indexing and store population
- Skip-unchanged optimization
- Stale file removal
- Rebuild clears and reindexes
…e2e.4)

Add staleness detection to the FTS5 store for incremental lifecycle:

- stale_files() method compares indexed mtime against current disk mtime
  to detect files that have been modified since last indexing
- Returns StaleFileInfo with path, indexed_mtime, and current_mtime
- Handles deleted files (current_mtime = -1 sentinel)
- Enables doctor command to report stale index state
- Enables incremental update to skip re-indexing fresh files

The indexer already uses content hash + mtime for change detection.
This addition provides an explicit staleness query for diagnostics
and the fts5_doctor command.
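
A minimal sketch of the staleness query described above, written as a pure function so it can be exercised without a database. `StaleFileInfo` and its three fields come from the commit message; the free-function form, the closure parameter, and treating any mtime difference as stale are assumptions.

```rust
/// Sentinel used when the file no longer exists on disk.
pub const DELETED_MTIME: i64 = -1;

#[derive(Debug, PartialEq, Eq)]
pub struct StaleFileInfo {
    pub path: String,
    pub indexed_mtime: i64,
    pub current_mtime: i64,
}

/// Report every indexed file whose on-disk mtime differs from the indexed
/// one. `current_mtime_of` returns `None` for files that were deleted.
pub fn stale_files<F>(indexed: &[(String, i64)], current_mtime_of: F) -> Vec<StaleFileInfo>
where
    F: Fn(&str) -> Option<i64>,
{
    indexed
        .iter()
        .filter_map(|(path, indexed_mtime)| {
            let current = current_mtime_of(path.as_str()).unwrap_or(DELETED_MTIME);
            if current == *indexed_mtime {
                None
            } else {
                Some(StaleFileInfo {
                    path: path.clone(),
                    indexed_mtime: *indexed_mtime,
                    current_mtime: current,
                })
            }
        })
        .collect()
}
```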
…on (aft-fts5e2e.5)

Implement the FTS5 query planner that routes queries to appropriate
search lanes and fuses results with score normalization.

Query analysis:
- Tokenizes queries using code-aware tokenization
- Detects exact symbol queries (single identifier)
- Detects path queries (contains / or .)
- Identifies short tokens needing fallback

Lane routing:
1. exact_symbol_sql — exact SQL match on symbol name (highest priority)
2. prefix_symbol_sql — SQL LIKE prefix match on symbol name
3. symbol_fts — FTS5 search with unicode61 tokenizer
4. path_fts — FTS5 search with trigram tokenizer
5. body_fts — FTS5 search with trigram tokenizer
6. short_token_fallback — SQL LIKE for tokens < 3 chars
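
The query analysis and lane routing above might be sketched as follows. The lane names mirror the list; the `Lane` enum, `select_lanes`, plain whitespace tokenization (instead of the code-aware tokenizer), and the exact conditions for enabling each lane are assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lane {
    ExactSymbolSql,
    PrefixSymbolSql,
    SymbolFts,
    PathFts,
    BodyFts,
    ShortTokenFallback,
}

/// True for a single identifier such as `parse_config`.
fn is_identifier(token: &str) -> bool {
    let mut chars = token.chars();
    match chars.next() {
        Some(first) if first.is_alphabetic() || first == '_' => {
            chars.all(|c| c.is_alphanumeric() || c == '_')
        }
        _ => false,
    }
}

/// Pick search lanes for a query, in priority order.
pub fn select_lanes(query: &str) -> Vec<Lane> {
    let tokens: Vec<&str> = query.split_whitespace().collect();
    let single_identifier = tokens.len() == 1 && is_identifier(tokens[0]);
    let path_like = query.contains('/') || query.contains('.');
    // Trigram-based lanes need at least three characters per token.
    let has_long = tokens.iter().any(|t| t.chars().count() >= 3);
    let has_short = tokens.iter().any(|t| t.chars().count() < 3);

    let mut lanes = Vec::new();
    if single_identifier {
        lanes.push(Lane::ExactSymbolSql);
        lanes.push(Lane::PrefixSymbolSql);
    }
    if has_long {
        lanes.push(Lane::SymbolFts);
    }
    if path_like {
        lanes.push(Lane::PathFts);
    }
    if has_long {
        lanes.push(Lane::BodyFts);
    }
    if has_short {
        lanes.push(Lane::ShortTokenFallback);
    }
    lanes
}
```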

Result fusion:
- Normalizes scores across lanes (higher is better)
- Applies lane weights (exact > prefix > symbol_fts > path > body > fallback)
- Deduplicates by symbol_id with multi-lane bonus
- Returns top-k results sorted by fused score
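
The fusion steps above can be illustrated with a small sketch. The lane names are taken from the routing list, but the numeric weights, the bonus size, the `LaneHit` type, and the assumption that each lane reports a symbol at most once are all hypothetical.

```rust
use std::collections::HashMap;

/// One hit from one lane, with a score already normalized to 0.0..=1.0.
pub struct LaneHit {
    pub symbol_id: i64,
    pub lane: &'static str,
    pub score: f64,
}

/// Illustrative weights preserving the documented ordering
/// exact > prefix > symbol_fts > path > body > fallback.
fn lane_weight(lane: &str) -> f64 {
    match lane {
        "exact_symbol_sql" => 1.0,
        "prefix_symbol_sql" => 0.9,
        "symbol_fts" => 0.7,
        "path_fts" => 0.5,
        "body_fts" => 0.4,
        "short_token_fallback" => 0.2,
        _ => 0.0,
    }
}

/// Deduplicate by symbol, keep the best weighted score, add a bonus for each
/// additional lane that found the symbol, and return the top `top_k`.
pub fn fuse(hits: &[LaneHit], top_k: usize) -> Vec<(i64, f64)> {
    const MULTI_LANE_BONUS: f64 = 0.1;
    let mut best: HashMap<i64, (f64, usize)> = HashMap::new();
    for hit in hits {
        let weighted = hit.score * lane_weight(hit.lane);
        let entry = best.entry(hit.symbol_id).or_insert((0.0, 0));
        entry.0 = entry.0.max(weighted);
        entry.1 += 1;
    }
    let mut fused: Vec<(i64, f64)> = best
        .into_iter()
        .map(|(id, (score, lanes))| (id, score + MULTI_LANE_BONUS * (lanes - 1) as f64))
        .collect();
    // Highest fused score first; ties broken by symbol id for determinism.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused.truncate(top_k);
    fused
}
```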

7 unit tests covering analysis, lane selection, and fusion dedup.
Made Fts5Store.conn public for planner SQL access.
…ion (aft-fts5e2e.6)

Replace the fts5_search stub with a real implementation that:
- Parses query, top_k, and scope parameters
- Resolves the FTS5 database path from project root (.aft/fts5.sqlite)
- Opens the FTS5 store and checks index readiness
- Executes search via the QueryPlanner with multi-lane routing
- Returns structured JSON results with file, symbol, kind, lines, score, lane
- Handles empty index gracefully with warning message
- Validates non-empty query

3 new tests: disabled state, empty index warning, empty query rejection.
…e.7)

Replace stubs with real implementations for the FTS5 lifecycle commands.

fts5_index actions:
- status: show index existence, schema version, file/symbol counts, db size,
  stale file count, FTS row counts, integrity check, and db path
- update: incremental index via Fts5Indexer (skip unchanged files)
- rebuild: clear + reindex all files
- prune: remove files no longer present on disk from the index

fts5_doctor:
- Reports compiled status, FTS5 availability, runtime enabled state
- Shows full config (auto_index, max_results, max_body_chars, etc.)
- Shows real index status: schema version, file/symbol counts, db size,
  FTS row counts, stale files, integrity check result
- Builds warnings list (disabled, stale files, no FTS5)
- Builds suggestions list (how to enable, how to refresh index)

Fixed all command tests to use isolated temp directories (doctor tests
went from ~18s to ~0.15s, search test no longer hits stale disk state).
…(aft-fts5e2e.8)

Replace stubs with real implementations for symbol lookup and read.

fts5_find_symbol:
- Accepts name and mode (exact/prefix) parameters
- Exact mode: SQL exact match first, falls back to FTS planner
- Prefix mode: uses query planner with multi-lane routing
- Returns structured results with symbol_id, file_id, name, kind, lines, snippet, lane
- Warns when index is empty

fts5_read_symbol:
- Accepts symbol_id or name (+ optional file for disambiguation)
- SQL lookup by ID or name exact match
- Ambiguous name matches return candidate list with file/kind/lines
- Reads file content from disk and extracts symbol body with optional context lines
- Returns symbol metadata + line-numbered source body
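
To make the "line-numbered source body with optional context lines" concrete, here is a minimal sketch. The function name, 1-based inclusive line ranges, and the `N: text` output format are assumptions, not the command's actual output contract.

```rust
/// Extract lines `start_line..=end_line` (1-based, inclusive) plus `context`
/// lines on each side, prefixing every line with its line number.
pub fn read_lines_with_context(
    source: &str,
    start_line: usize,
    end_line: usize,
    context: usize,
) -> String {
    // Clamp the window so it never starts before line 1.
    let first = start_line.saturating_sub(context).max(1);
    let last = end_line.saturating_add(context);
    source
        .lines()
        .enumerate()
        .map(|(idx, line)| (idx + 1, line))
        .filter(|(number, _)| *number >= first && *number <= last)
        .map(|(number, line)| format!("{}: {}", number, line))
        .collect::<Vec<_>>()
        .join("\n")
}
```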

Added get_file_by_id method to Fts5Store for file lookup by primary key.
…ces (aft-fts5e2e.9)

Register all 5 FTS5 commands as tools in both coding-agent plugins,
gated by the fts5.enabled config flag.

OpenCode plugin (packages/opencode-plugin):
- Added fts5 config schema (enabled, auto_index, index_on_start, max_results)
- Created tools/fts5.ts with all 5 tool definitions using callBridge
- Registered fts5Tools in index.ts, gated on config.fts5?.enabled

Pi plugin (packages/pi-plugin):
- Added fts5 config type definition
- Created tools/fts5.ts with all 5 tool definitions using bridgeFor/callBridge
- Registered registerFts5Tool in index.ts, gated on config.fts5?.enabled
- Uses Pi-native execute signature (untyped params, textResult return)

All remaining TypeScript errors for fts5 in opencode-plugin follow the
pre-existing @opencode-ai/plugin module error pattern (the same one
semantic.ts shows). Pi plugin has zero TS errors.
…t-fts5e2e.10)

Add plain-text rendering helpers so OpenCode/Pi agents get readable
output from FTS5 tools, matching the semantic_search convention where
the `text` field carries the clean agent-facing summary.

Text renderers:
- render_search_text: header line + numbered results with kind, name,
  file:line-range, score, lane, and truncated snippet
- render_find_symbol_text: header + numbered matches with kind, name,
  file:lines, and lane
- render_read_symbol_text: header line with file:line range + source body
- render_index_status_text: "N files, N symbols, X.X MiB" summary with
  stale-file warning when applicable
- render_index_action_text: processed/added/updated/removed/symbols counts
- render_doctor_text: compiled/available/enabled status, index health,
  and warnings
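
As an example of the renderer style described above, the status summary could be produced roughly like this. The function name, parameter list, and the wording of the stale-file warning are assumptions; only the "N files, N symbols, X.X MiB" shape comes from the commit message.

```rust
/// Render a one-line index status summary for agent-facing `text` output.
pub fn index_status_summary(files: u64, symbols: u64, db_bytes: u64, stale_files: u64) -> String {
    let mib = db_bytes as f64 / (1024.0 * 1024.0);
    let mut text = format!("{} files, {} symbols, {:.1} MiB", files, symbols, mib);
    if stale_files > 0 {
        // Warning wording is illustrative only.
        text.push_str(&format!(" ({} stale files, index update recommended)", stale_files));
    }
    text
}
```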

Every Response::success path in fts5_index, fts5_search, fts5_find_symbol,
fts5_read_symbol, and fts5_doctor now includes a `text` field. Empty-index
and no-index-found paths also render readable text.
…e2e.11)

Add 9 end-to-end integration tests that exercise the full FTS5 command
loop through the binary protocol: spawn aft → configure with FTS5 →
index → search → find → read → doctor.

Tests cover:
- Index lifecycle: status on empty project, update builds index,
  status after indexing shows file/symbol counts
- Search: finds symbols by name, empty index returns warning
- Find symbol: exact match returns correct kind and name
- Read symbol: returns source body for symbol by ID
- Doctor: reports health with compiled/enabled/index status
- Regression: short identifiers (process, items) are found
- Disabled state: all 5 commands return fts5_disabled when not enabled

Note: integration test binary doesn't appear in nextest output due to
missing [[test]] entry in Cargo.toml (same as watcher_integration).
Tests compile and will run once wired.
Add FTS5 as a comparable search mode in the Semble benchmark infrastructure,
enabling head-to-head comparison against ripgrep lexical and semantic baselines.

benchmarks/semble/pilot.ts:
- Added fts5Search() function that spawns aft binary with configure + fts5_index
  + fts5_search NDJSON commands
- Added --binary flag for specifying AFT binary path
- FTS5 results are included in aggregate metrics alongside lexical mode
- FTS5 mode is conditional on binary availability (gracefully skipped if no results)

benchmarks/semble/baseline-fts5.ts:
- New standalone FTS5 baseline runner mirroring baseline-rg.ts structure
- Measures recall@k, MRR, and latency across pilot corpus
- Supports --pilot, --cache-dir, --input, --k, --output, --binary flags
- Reports aggregate and per-category metrics
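
For reference, recall@k and the reciprocal rank behind MRR can be computed as below. The benchmark itself is TypeScript; this Rust sketch only illustrates the metrics, and the function names and the choice to return 0.0 for empty inputs are assumptions.

```rust
/// Fraction of relevant items that appear in the top `k` results.
/// Returns 0.0 when there is nothing relevant to find.
pub fn recall_at_k(results: &[&str], relevant: &[&str], k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let top = &results[..results.len().min(k)];
    let found = relevant.iter().filter(|r| top.contains(*r)).count();
    found as f64 / relevant.len() as f64
}

/// 1 / rank of the first relevant result, or 0.0 if none is relevant.
pub fn reciprocal_rank(results: &[&str], relevant: &[&str]) -> f64 {
    results
        .iter()
        .position(|r| relevant.contains(r))
        .map_or(0.0, |idx| 1.0 / (idx as f64 + 1.0))
}

/// Mean reciprocal rank over per-query reciprocal ranks.
pub fn mean_reciprocal_rank(per_query: &[f64]) -> f64 {
    if per_query.is_empty() {
        return 0.0;
    }
    per_query.iter().sum::<f64>() / per_query.len() as f64
}
```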

benchmarks/semble/README.md:
- Updated Quick Start to document FTS5 baseline alongside ripgrep baseline
- Added "Run the baseline benchmarks" section with both commands
…t-fts5e2e.13)

Add comprehensive documentation for the FTS5 side feature, including
user-facing docs, architecture updates, and a graduation decision report.

docs/fts5.md:
- Complete user guide covering enablement, commands, architecture, and usage
- Documents all 5 FTS5 commands with JSON examples
- Explains database store, index lifecycle, and query planner
- Lists known limitations and benchmark comparison instructions
- Includes graduation criteria for feature maturity

docs/fts5-graduation-report.md:
- Formal evaluation report for FTS5 graduation to selectable backend
- Completes bead 0-12 status matrix showing 12/15 beads done
- Evaluates 5 graduation criteria: benchmarks, operational maturity,
  agent feedback, documentation, and E2E validation
- Provides risk assessment with mitigations
- Recommends conditional graduation pending benchmark validation

ARCHITECTURE.md:
- Added fts5_store.rs to Key Characteristics shared engines list

STRUCTURE.md:
- Added benchmarks/semble/ directory description
- Added docs/fts5.md and docs/fts5-graduation-report.md entries
Add completion summary documenting the full FTS5 implementation across
16 beads in the aft-fts5e2e epic.

Summary covers:
- Bead matrix (16/16 complete) with commit references
- Implementation summary: core infrastructure, commands, plugins,
  testing, validation, and documentation
- Complete file list (23 files changed across Rust, TypeScript, docs)
- Test results: 51 unit tests passing, 9 integration tests compiled
- Configuration reference with defaults
- Usage examples for all 5 FTS5 commands
- Next steps: benchmark validation, agent feedback, graduation decision
- Known limitations and potential improvements

This completes the FTS5 e2e opt-in side feature epic. The feature is
now ready for benchmark validation and potential graduation to a
selectable lexical backend.
Update Cargo.lock with dependency changes from FTS5 feature implementation:
- Add base64 0.22.1 dependency
- Add model2vec-rs dependency
- Update ndarray to 0.17.2

These changes reflect the new dependencies required for FTS5 indexing
and semantic search features.
Add baseline-aft.ts to benchmark the current AFT search behavior (trigram-indexed
grep and semantic search) against ripgrep and FTS5 baselines.

benchmarks/semble/baseline-aft.ts:
- New standalone AFT baseline runner supporting grep, semantic, and hybrid modes
- Spawns aft binary with configure + grep/semantic_search NDJSON commands
- Measures recall@k, MRR, and latency across pilot corpus
- Supports --pilot, --cache-dir, --input, --k, --output, --binary, --mode flags

benchmarks/semble/pilot.ts:
- Added aftGrepSearch() function for AFT grep mode comparison
- Pilot now runs 4 modes: lexical (ripgrep), aft-grep, fts5, and semantic
- All modes included in aggregate metrics

benchmarks/semble/README.md:
- Added baseline-aft.ts to directory listing
- Updated Quick Start with AFT grep and semantic baseline commands
- Updated full pilot description to include all 4 modes
Root cause: spawnSync with input closes stdin immediately after sending
all data. AFT's reader thread sees EOF, the channel disconnects, and
the main loop exits before the search command finishes processing.
This caused 0% recall and 0.5ms latency across all AFT benchmarks.

Fix: Replace spawnSync with async spawn that keeps stdin open until all
responses are received. Created shared aft-ndjson.ts helper that:
- Spawns aft with stdin piped
- Writes NDJSON commands one at a time
- Reads stdout line-by-line collecting responses
- Resolves after receiving all expected responses
- Keeps stdin open until responses arrive (prevents premature EOF)

Updated all three benchmark files:
- baseline-aft.ts: async main + aftNdjson for AFT grep/semantic modes
- baseline-fts5.ts: async main + aftNdjson for FTS5 mode
- pilot.ts: async main + aftNdjson for FTS5 and AFT grep modes

The ripgrep baseline (execSync) is unaffected — it's a one-shot command.
- Add null guards to recallAtK, mrr, and ndcgAtK so benchmarks
  don't crash when the AFT binary is missing and results are empty
- Add statSync check before running benchmarks to fail fast with a
  clear error message instead of silently producing 0% recall
Multiple bugs fixed:
1. Missing `await` on async aftSearch() — Promise object was pushed
   instead of results, causing NaN latency and 0% recall
2. Push frames (configure_warnings) counted as command responses —
   helper resolved before actual search response arrived
3. Windows \\?\ path prefix not stripped in normalizePath — suffix
   matching failed on absolute paths from AFT grep
4. Grep responses use `matches` key, not `results` — extract from
   either key for forward compatibility
5. Null guards on recallAtK/mrr/ndcgAtK prevent crashes when
   binary is missing
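
Fix 3 concerns the Windows extended-length prefix. The real `normalizePath` lives in the TypeScript benchmark code; the Rust sketch below only illustrates the idea, and converting backslashes to forward slashes is an assumption about how suffix matching is made to work.

```rust
/// Strip the Windows extended-length prefix (`\\?\`) and normalize
/// separators so suffix matching against repo-relative paths works.
pub fn normalize_path(path: &str) -> String {
    let stripped = path.strip_prefix(r"\\?\").unwrap_or(path);
    stripped.replace('\\', "/")
}
```

With this normalization, an absolute path such as `\\?\C:\repo\src\main.rs` ends with `src/main.rs` and can be matched against an expected relative path.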

AFT grep baseline now works: recall@10=10.0% mrr=0.066 latency=681ms
vs ripgrep recall@10=14.0% mrr=0.073 latency=102ms
FTS5 commands require [fts5].enabled=true at runtime. The benchmark
configure commands now pass this as a top-level param so FTS5 index
and search work without requiring a project-level aft.jsonc.

Pilot results with v0.39.1 binary (semantic-fts5 feature):
  FTS5:      recall=100%  mrr=1.000  latency=894ms
  AFT grep:  recall=33%   mrr=0.245  latency=795ms
  Ripgrep:   recall=10%   mrr=0.037  latency=110ms
@Zireael Zireael changed the title feat(semantic): [alpha build] provider-aware typed embeddings, reranking, diagnostics, and eval harness feat(semantic): [merge candidate build] FTS5 index and search tools, provider-aware typed embeddings, reranking, diagnostics, and eval harness Jun 15, 2026