diff --git a/docs/guide/benchmarking/2026-06-11-competitor-strength-evidence-matrix.md b/docs/guide/benchmarking/2026-06-11-competitor-strength-evidence-matrix.md
new file mode 100644
index 00000000..1802eaf5
--- /dev/null
+++ b/docs/guide/benchmarking/2026-06-11-competitor-strength-evidence-matrix.md
@@ -0,0 +1,160 @@
+# Competitor-Strength Evidence Matrix - June 11, 2026
+
+Goal: Define a durable competitor-strength matrix so ELF benchmark claims are tied to
+measured evidence classes, typed blockers, and explicit next measurement gates.
+Read this when: You need to decide whether ELF can claim a win, tie, loss, gap, or
+non-claim against a tracked memory, RAG, or graph project.
+Inputs: `docs/guide/benchmarking/2026-06-10-production-adoption-refresh.md`,
+`docs/guide/benchmarking/2026-06-10-real-world-comparison-report.md`,
+`docs/guide/benchmarking/2026-06-10-live-real-world-sweep-report.md`,
+`docs/guide/research/external_memory_improvement_plan.md`,
+`docs/guide/research/research_projects_inventory.md`,
+`apps/elf-eval/fixtures/real_world_external_adapters/memory_projects_manifest.json`,
+and `Makefile.toml`.
+Depends on: `docs/spec/real_world_agent_memory_benchmark_v1.md`,
+`docs/guide/benchmarking/live_baseline_benchmark.md`, and the current external adapter
+manifest.
+Outputs: Human-readable matrix, claim boundaries, scenario next-measurement gates,
+and the machine-readable companion file
+`docs/research/2026-06-11-xy-897-competitor-strength-matrix.json`.
+
+## Decision Boundary
+
+Do not claim that ELF beats, ties, or loses to a competitor unless the named scenario
+is encoded and run at a comparable evidence class.
+
+Current boundary:
+
+- ELF and qmd have full-suite `live_real_world` sweeps, but neither has a full-suite
+  live pass. Each sweep produced 38 jobs with 18 pass, 5 wrong_result, 1 incomplete,
+  2 blocked, and 12 not_encoded.
+- ELF fixture evidence is strong: `cargo make real-world-memory` reports 38 jobs
+  across 11 suites with 36 pass and 2 blocked production-ops operator boundaries.
+  That proves the fixture contract, not live-service parity.
+- qmd is the strongest measured local retrieval-debug comparison, but the current
+  evidence still separates its same-corpus/live-retrieval strengths from the full-suite
+  live non-pass sweep.
+- Most other projects are `live_baseline_only` or `research_gate`. They must not be
+  treated as beaten until a comparable scenario is encoded and run.
+- Private-corpus and credentialed production-ops checks remain operator-owned
+  `blocked` states.
+
+## Current Ledger Summary
+
+The current manifest has 21 adapter records across 17 projects. Evidence-class counts:
+1 `fixture_backed`, 6 `live_baseline_only`, 2 `live_real_world`, and 12
+`research_gate`. Overall adapter-status counts: 1 `pass`, 6 `wrong_result`, 1
+`lifecycle_fail`, 6 `blocked`, and 7 `not_encoded`.
+
+## State Taxonomy
+
+This report uses the benchmark's snake_case state names. Hyphenated prose names map
+directly to these states: fixture-backed -> `fixture_backed`,
+live-baseline -> `live_baseline_only`, live-real-world -> `live_real_world`,
+research-gate -> `research_gate`, wrong-result -> `wrong_result`,
+lifecycle-fail -> `lifecycle_fail`, and not-encoded -> `not_encoded`.
+
+| State | Meaning | Claim boundary |
+| --- | --- | --- |
+| `fixture_backed` | Checked-in real-world jobs or fixture responses are scored by the benchmark runner. | Useful for contract coverage, not live runtime proof. |
+| `live_baseline_only` | Docker same-corpus or lifecycle checks ran, but no real-world job suite was scored for that project. | Cannot imply real-world job parity. |
+| `live_real_world` | A runtime or CLI adapter materialized and scored real-world job records. | Can support scenario claims only for the encoded suite statuses. |
+| `research_gate` | Source, setup, resource, retry, or output-contract metadata exists. | Follow-up routing only; not pass evidence. |
+| `blocked` | Safe measurement needs unavailable credentials, private data, setup proof, or external dependency. | Keep typed until the missing input exists. |
+| `unsupported` | Capability is outside the project shape or requires a non-comparable path. | Do not turn into a loss. |
+| `wrong_result` | The system ran but missed expected memory, answer, or evidence terms. | Behavioral non-pass. |
+| `lifecycle_fail` | Retrieval may work, but update/delete/reload/persistence/cold-start behavior fails. | Lifecycle non-pass, not a retrieval win. |
+| `incomplete` | The run did not reach the behavioral check because setup or runtime failed. | Setup/runtime non-pass, not quality evidence. |
+| `not_encoded` | The scenario is not currently covered. | No pass/fail claim is allowed. |
+
+## Project Matrix
+
+| Project | Strongest user-facing scenario | Current evidence | Measured status and proof | Unsupported or blocked status | Required benchmark before ELF claim | Borrow if stronger |
+| --- | --- | --- | --- | --- | --- | --- |
+| ELF | Evidence-linked source-of-truth memory service with real-world fixtures and live retrieval sweeps. | `live_real_world`; supporting `fixture_backed`. | `wrong_result` full live sweep: `cargo make real-world-memory-live-adapters`, `tmp/real-world-memory/live-adapters/elf-report.md`. Fixture contract: `cargo make real-world-memory`, `tmp/real-world-memory/real-world-memory-report.json`. | `blocked`: private manifest and provider credentials; broader live suites remain `wrong_result`, `incomplete`, or `not_encoded`. | Full-suite live pass plus separate private-corpus and credentialed production-ops proof. | Keep borrowing qmd debug knobs, OpenViking staged trajectory, mem0 history, Letta core memory, and graph/RAG navigation. |
+| qmd | Local retrieval-debug workflow with transparent CLI indexing, querying, expansion, fusion, and rerank ergonomics. | `live_real_world`; supporting `live_baseline_only` and `research_gate`. | `wrong_result` full live sweep: `cargo make real-world-memory-live-adapters`, `tmp/real-world-memory/live-adapters/qmd-report.md`; targeted retrieval suites pass. | `not_encoded`: deep profile and non-retrieval live behavior are not encoded; memory_evolution is `wrong_result`. | qmd deep retrieval/debug profile plus full-suite live replay with trace-level diagnostics. | Weighted fusion, rerank explanation, local debug knobs, and command-line replay. |
+| agentmemory | Coding-agent continuity, MCP/REST packaging, viewer workflow, and durable cross-agent memory lifecycle. | `live_baseline_only`. | `lifecycle_fail`: `ELF_BASELINE_PROJECTS=agentmemory cargo make baseline-live-docker`, `tmp/live-baseline/live-baseline-report.json`. | `blocked`: durable cold-start and real-world adapter coverage are missing. | Durable local adapter with update, delete, cold-start reload, work_resume, capture/write-policy, and lifecycle-staleness jobs. | Cross-agent hooks, packaging, continuity scenarios, and viewer affordances. |
+| mem0/OpenMemory | Memory lifecycle, personalization, hosted/OpenMemory UI ergonomics, and optional graph memory. | `live_baseline_only`. | `wrong_result`: `ELF_BASELINE_PROJECTS=mem0 cargo make baseline-live-docker`, `tmp/live-baseline/live-baseline-report.json`. | `not_encoded`: OpenMemory UI, hosted claims, and real-world personalization coverage are not encoded. | Fix local same-corpus result, then encode memory_evolution, personalization, UI readback, and optional graph-context jobs. | Entity-scoped history, lifecycle surfaces, async update ergonomics, and OpenMemory inspection UX. |
+| memsearch | Markdown-first canonical store with rebuildable local index and practical hybrid retrieval. | `live_baseline_only`. | `wrong_result`: `ELF_BASELINE_PROJECTS=memsearch cargo make baseline-live-docker`, `tmp/live-baseline/live-baseline-report.json`. | `incomplete`: source-of-truth and real-world reindex behavior are not cleanly scored. | Fix Docker same-corpus retrieval and reindex/update/delete reload evidence, then score source-of-truth and retrieval-debug jobs. | Canonical markdown store, local reindex clarity, and user-inspectable source files. |
+| OpenViking | Filesystem-like context trajectory, hierarchical retrieval, and staged context loading. | `live_baseline_only`; supporting `research_gate`. | `wrong_result`: `ELF_BASELINE_PROJECTS=OpenViking cargo make baseline-live-docker`, `tmp/live-baseline/live-baseline-report.json`. | `not_encoded`: hierarchical context trajectory is not encoded; same-corpus output still misses expected evidence. | Make evidence-bearing same-corpus output pass, then score staged trajectory and hierarchy expansion. | `viking://`-style context model, trajectory readback, and staged retrieval planning. |
+| claude-mem | Progressive disclosure, automatic capture loop, repository-local lifecycle, and local viewer workflow. | `live_baseline_only`. | `wrong_result`: `ELF_BASELINE_PROJECTS=claude-mem cargo make baseline-live-docker`, `tmp/live-baseline/live-baseline-report.json`. | `not_encoded`: progressive-disclosure real-world jobs are not encoded. | Durable repository-backed work_resume, operator_debugging_ux, capture/write-policy, and progressive-disclosure jobs. | Progressive disclosure, automatic capture review loops, and local viewer/operator comfort. |
+| RAGFlow | Full RAG application workflow with document, chunk, and reference evidence handles. | `research_gate`. | `blocked`: `ELF_RAGFLOW_SMOKE_START=1 ELF_RAGFLOW_SMOKE_ACCEPT_RESOURCE_ENVELOPE=1 cargo make ragflow-docker-smoke`, `tmp/real-world-memory/ragflow-smoke/ragflow-smoke.json`. | `blocked`: Docker resource envelope and adapter output mapping still need proof. | XY-885 tiny Docker evidence-smoke adapter mapping `reference.chunks` to scored evidence. | Document/chunk references, resource-envelope reporting, and RAG app evidence handles. |
+| LightRAG | Lightweight graph/RAG context export with source file-path citation shape. | `research_gate`. | `blocked`: `ELF_LIGHTRAG_CONTEXT_START=1 cargo make lightrag-docker-context-smoke`, `tmp/real-world-memory/lightrag-context/summary.json`. | `blocked`: Docker service setup and context export are not proven. | XY-886 Docker context-export adapter with explicit provider config and source citation mapping. | Context-only query modes, graph-aware retrieval layout, and file-path citation readback. |
+| GraphRAG | GraphRAG indexing, graph summaries, and document/text-unit evidence tables. | `research_gate`. | `blocked`: `ELF_GRAPHRAG_SMOKE_RUN=1 cargo make graphrag-docker-smoke`, `tmp/real-world-memory/graphrag-smoke/summary.json`. | `blocked`: indexing resource envelope and source citation mapping are not proven. | XY-887 cost-bounded Docker adapter over a tiny corpus and scored output tables. | Graph summary artifacts, local/global search separation, and source table evidence mapping. |
+| Graphiti/Zep | Temporal graph memory with current, historical, and future fact validity windows. | `research_gate`. | `blocked`: `ELF_GRAPHITI_ZEP_SMOKE_START=1 ELF_GRAPHITI_ZEP_SMOKE_RUN=1 cargo make graphiti-zep-docker-temporal-smoke`, `tmp/real-world-memory/graphiti-zep-smoke/summary.json`. | `blocked`: Docker graph-store and temporal adapter are not proven. | XY-888 Docker-local temporal graph adapter scoring current/historical fact validity. | Temporal fact windows, invalidation/supersession semantics, and graph fact provenance. |
+| Letta | Core memory blocks versus archival memory with explicit operating-context surfaces. | `research_gate`. | `not_encoded`: `docs/research/2026-06-10-xy-882-rag-graph-adapter-feasibility.json`. | `blocked`: contained evidence export path is not selected. | Select contained export contract, then encode core-vs-archival, personalization, and project-decision jobs. | Core memory block ergonomics, archival separation, and shared operating context readback. |
+| LangGraph | Checkpoint/replay regression workflow and durable state replay for agent runs. | `research_gate`. | `not_encoded`: `docs/research/2026-06-10-xy-882-rag-graph-adapter-feasibility.json`. | `unsupported`: not a standalone memory backend adapter. | Non-goal for direct win/loss until a standalone memory output contract exists; use replay jobs as benchmark infrastructure reference. | Checkpoint replay, deterministic regression, and state-diff evaluation patterns. |
+| nanograph | Typed graph schema and query ergonomics for graph-lite developer experience. | `research_gate`. | `not_encoded`: `docs/research/2026-06-10-xy-882-rag-graph-adapter-feasibility.json`. | `unsupported`: not a memory backend comparison target. | Non-goal for direct win/loss unless a contained memory-backed target emerges; measure ELF graph-lite DX instead. | Typed relation schema, query ergonomics, and small graph developer experience. |
+| llm-wiki | LLM-maintained wiki or knowledge-page workflow with query-save and lint loops. | `research_gate`. | `not_encoded`: `docs/research/2026-06-10-xy-882-rag-graph-adapter-feasibility.json`. | `unsupported`: no live service runtime for adapter proof. | Select contained plugin or instruction harness, then score knowledge pages for citations, unsupported claims, rebuild, and stale-source lint. | Maintained wiki workflows, page lint, query-save loops, and topic-scoped navigation. |
+| gbrain | Operational knowledge brain with compiled_truth pages, timelines, enrichment, and maintenance loops. | `research_gate`. | `not_encoded`: `docs/research/2026-06-10-xy-882-rag-graph-adapter-feasibility.json`. | `blocked`: Docker-local brain repo and database path are missing. | Prove Docker-local repository/database setup, then encode compiled_truth/timeline and operator-continuity jobs. | Compiled truth pages, timeline maintenance, and human-operable knowledge-brain navigation. |
+| graphify | Graph-compressed navigation with `graph.json` and `GRAPH_REPORT` evidence outputs. | `research_gate`. | `blocked`: `cargo make graphify-docker-graph-report-smoke`, `tmp/real-world-memory/graphify-smoke/graphify-smoke.json`. | `blocked`: Docker CLI graph/report generation is not proven; host-global assistant hooks are out of scope. | XY-889 Docker-only graph/report adapter over `graph.json` and `GRAPH_REPORT.md`. | Graph compression, source-location graph reports, and navigation hints for large code or document spaces. |
+
+## Scenario Matrix
+
+| Scenario | Current ELF evidence | Strongest competitor/reference | Current competitor evidence | Next measurement before claim |
+| --- | --- | --- | --- | --- |
+| Retrieval/debug | Fixture retrieval passes; live retrieval passes. | qmd. | qmd live retrieval passes and live baseline passes, but full-suite live status is `wrong_result`. | Run qmd deep profile and ELF/qmd trace-level replay with expansion, fusion, rerank, and candidate-drop diagnostics. |
+| Work resume | Fixture and live work_resume pass. | agentmemory, claude-mem, OpenViking. | agentmemory `lifecycle_fail`, claude-mem `wrong_result`, OpenViking work_resume `not_encoded`. | Encode durable work_resume adapters or keep each blocked with lifecycle/setup evidence. |
+| Project decisions | Fixture and live project_decisions pass. | qmd, Letta. | qmd live project_decisions pass; Letta is `research_gate` `not_encoded`. | Add Letta core/archival decision jobs only after a contained export path exists. |
+| Source-of-truth | Fixture and live trust_source_of_truth pass. | memsearch. | memsearch canonical-store evidence exists, but source-of-truth is `incomplete` and retrieval is `wrong_result`. | Fix memsearch reindex/retrieval evidence and score source-of-truth rebuild/reload jobs. |
+| Temporal/current-vs-historical memory | Fixture memory_evolution passes; live memory_evolution is `wrong_result`. | Graphiti/Zep, mem0/OpenMemory. | Graphiti/Zep is `research_gate` `blocked`; mem0/OpenMemory is `wrong_result`. | Fix ELF/qmd live memory_evolution evidence links and run XY-888. |
+| Consolidation | Fixture consolidation passes; live consolidation is `not_encoded`. | agentmemory, managed-memory references, llm-wiki. | No manifest project has live consolidation scoring. | Run reviewable consolidation proposal generation with source refs, unsupported-claim flags, and audit transitions. |
+| Knowledge pages | Fixture knowledge_compilation passes; live knowledge_compilation is `not_encoded`. | llm-wiki, gbrain, GraphRAG, graphify. | llm-wiki and gbrain are `research_gate` `not_encoded` or `blocked`; GraphRAG and graphify are `blocked`. | Encode live derived-page rebuild/lint scoring and run contained knowledge/RAG adapters only after setup proof. |
+| Operator debugging | Fixture operator_debugging_ux passes; live operator_debugging_ux is `not_encoded`. | qmd, claude-mem, OpenMemory. | qmd has debug strengths but operator_debugging_ux is `not_encoded`; claude-mem and OpenMemory UX are `not_encoded`. | Score trace hydration, stage attribution, raw-SQL avoidance, and repair-action clarity through live artifacts. |
+| Capture/write policy | Fixture capture_integration passes; live capture_integration is `not_encoded`. | agentmemory, claude-mem. | agentmemory capture is `blocked`; claude-mem capture is `not_encoded`. | Run live capture/write-policy jobs proving redaction, exclusion, evidence binding, and no secret leakage. |
+| Production ops | Fixture production_ops has 4 pass and 2 blocked; live production_ops is `incomplete`; production adoption has provider/backfill/restore evidence. | ELF production gate, qmd, RAG/RAGFlow resource gates. | qmd live production_ops is `incomplete`; RAG/resource gates are `research_gate` `blocked`. | Rerun private-corpus and credentialed gates only when operator-owned manifest and credentials exist. |
+| Personalization | Fixture and live personalization pass. | mem0/OpenMemory, Letta. | mem0/OpenMemory and Letta personalization are `not_encoded`. | Encode scoped preference readback for mem0/OpenMemory and Letta before personalization superiority claims. |
+| Context trajectory | ELF has trace direction but no comparable staged trajectory scenario. | OpenViking. | OpenViking setup is pinned, same-corpus retrieval is `wrong_result`, and hierarchy trajectory is `not_encoded`. | Make OpenViking evidence-bearing retrieval pass, then score staged context trajectory outputs. |
+| Core-vs-archival memory | ELF core-block semantics exist in the service contract, but comparative benchmark coverage is not encoded here. | Letta. | Letta is `research_gate` `not_encoded` until contained export proof exists. | Add ELF core-block versus archival-search jobs; compare Letta only after contained export proof. |
+| Graph/RAG navigation | ELF relation context is not enough to claim graph/RAG navigation parity. | RAGFlow, LightRAG, GraphRAG, Graphiti/Zep, graphify. | All named RAG/graph projects are `research_gate` `blocked` or `not_encoded`. | Run XY-885 through XY-889 Docker-contained adapters with evidence-linked outputs. |
+
+## Parallelizable Benchmark Follow-Ups
+
+These workstreams can proceed after this matrix lands because the claim boundaries are
+now explicit:
+
+| Workstream | Issue or candidate | Parallelizable | Blocked by | Measurement |
+| --- | --- | --- | --- | --- |
+| qmd deep retrieval/debug profile | New benchmark issue | yes | None after this matrix lands. | Stress profile plus trace-level retrieval-debug artifacts for qmd and ELF. |
+| agentmemory durable lifecycle adapter | `[ELF benchmark P0] Make external adapters lifecycle-durable and fail-typed` | yes | Durable local adapter path selection. | Update, delete, cold-start reload, work_resume, and capture/write-policy jobs. |
+| mem0/OpenMemory local and UI coverage | New adapter repair issue | yes | Comparable local OSS path for UI/readback evidence. | Same-corpus fix plus memory_evolution, personalization, and OpenMemory inspection jobs. |
+| memsearch source-of-truth and reindex coverage | New adapter repair issue | yes | Docker same-corpus retrieval and reindex correctness. | Canonical markdown store, rebuild/reindex, retrieval, update/delete/reload jobs. |
+| OpenViking context trajectory | New benchmark issue after evidence output fix | yes | Evidence-bearing same-corpus retrieval output. | Hierarchical expansion, staged trajectory, and resume/retrieval evidence jobs. |
+| claude-mem progressive disclosure | New adapter issue | yes | Durable repository path and progressive-disclosure output contract. | Work resume, operator debugging, capture/write-policy, and progressive disclosure jobs. |
+| RAGFlow evidence smoke | XY-885 | yes | Resource envelope accepted for tiny Docker smoke. | `reference.chunks` to benchmark evidence mapping. |
+| LightRAG context export | XY-886 | yes | Docker service setup and explicit provider config. | Retrieved context export and source file-path citations. |
+| GraphRAG cost-bounded adapter | XY-887 | yes | Tiny corpus cost/resource envelope. | Document, text-unit, graph-summary, and citation output tables. |
+| Graphiti/Zep temporal graph adapter | XY-888 | yes | Docker-local graph store setup. | Current/historical/future fact validity and evidence ids. |
+| graphify graph report adapter | XY-889 | yes | Docker CLI graph/report generation proof. | `graph.json` and `GRAPH_REPORT` evidence for graph navigation and knowledge synthesis. |
+| Private corpus and credentialed production ops | Operator-owned benchmark gates | no | Sanitized private manifest and routed provider credentials. | Private-corpus retrieval quality and credentialed production-ops evidence. |
+| Letta, LangGraph, nanograph, llm-wiki direct adapters | Research-only until output contract | no | Contained evidence export or non-memory-backend comparability contract. | Run only after each has a comparable output contract; otherwise keep as product-reference evidence. |
+
+## Validation Contract
+
+Consistency checks for this report should verify:
+
+- The Markdown project matrix includes every project currently present in
+  `memory_projects_manifest.json`: ELF, qmd, agentmemory, mem0/OpenMemory, memsearch,
+  OpenViking, claude-mem, RAGFlow, LightRAG, GraphRAG, Graphiti/Zep, Letta, LangGraph,
+  nanograph, llm-wiki, gbrain, and graphify.
+- The machine-readable matrix has the same project set and includes every required
+  scenario id: `retrieval_debug`, `work_resume`, `project_decisions`,
+  `source_of_truth`, `temporal_current_historical`, `consolidation`,
+  `knowledge_pages`, `operator_debugging`, `capture_write_policy`, `production_ops`,
+  `personalization`, `context_trajectory`, `core_vs_archival_memory`, and
+  `graph_rag_navigation`.
+- Evidence states remain typed. Do not collapse `research_gate`, `blocked`,
+  `unsupported`, `wrong_result`, `lifecycle_fail`, `incomplete`, or `not_encoded`
+  into pass/fail aggregates.
+
+## Claim Rules
+
+- A project can be called stronger only for a named scenario with comparable measured
+  evidence.
+- `research_gate` plus setup metadata can justify a follow-up adapter issue, not a
+  product win.
+- A blocked measurement is not a hidden loss. Keep the typed reason and rerun only when
+  the missing operator or setup input exists.
+- If a project remains stronger on user-facing workflow but lacks comparable measured
+  evidence, record what ELF should borrow and add a benchmark gate before changing any
+  README-level claim.
diff --git a/docs/guide/benchmarking/index.md b/docs/guide/benchmarking/index.md
index 18824179..37798553 100644
--- a/docs/guide/benchmarking/index.md
+++ b/docs/guide/benchmarking/index.md
@@ -47,6 +47,10 @@ cleanup, use `docs/guide/single_user_production.md`.
   adoption refresh that keeps the decision at adopt with bounded caveats and separates
   fixture, live adapter, private corpus, credentialed, blocked, and research-gate
   evidence.
+- `2026-06-11-competitor-strength-evidence-matrix.md`: XY-897 competitor-strength
+  matrix contract that maps every tracked memory/RAG/graph project to its strongest
+  scenario, current evidence class, typed blockers, next measurement gate, and ELF
+  borrow-if-stronger direction.
 - `real_world_agent_memory_benchmark.md`: operator overview for the v1 real-world
   agent memory benchmark contract, including suite taxonomy, typed report states,
   knowledge-compilation fixture tasks, and the production-ops fixture target.
diff --git a/docs/research/2026-06-11-xy-897-competitor-strength-matrix.json b/docs/research/2026-06-11-xy-897-competitor-strength-matrix.json
new file mode 100644
index 00000000..b847ecc7
--- /dev/null
+++ b/docs/research/2026-06-11-xy-897-competitor-strength-matrix.json
@@ -0,0 +1,648 @@
+{
+  "schema": "elf.competitor_strength_evidence_matrix/v1",
+  "matrix_id": "xy-897-competitor-strength-evidence-matrix-2026-06-11",
+  "date": "2026-06-11",
+  "authority": "XY-897",
+  "purpose": "Keep competitor-strength claims tied to measured evidence classes, typed blockers, and next benchmark gates.",
+  "source_inputs": [
+    "docs/guide/benchmarking/2026-06-10-production-adoption-refresh.md",
+    "docs/guide/benchmarking/2026-06-10-real-world-comparison-report.md",
+    "docs/guide/benchmarking/2026-06-10-live-real-world-sweep-report.md",
+    "docs/guide/research/external_memory_improvement_plan.md",
+    "docs/guide/research/research_projects_inventory.md",
+    "apps/elf-eval/fixtures/real_world_external_adapters/memory_projects_manifest.json",
+    "Makefile.toml"
+  ],
+  "claim_boundary": {
+    "summary": "Do not claim ELF beats, ties, or loses to a project unless the named scenario is encoded and run at a comparable evidence class.",
+    "current_live_real_world_boundary": "ELF and qmd have full-suite live_real_world sweeps, but both are typed non-pass sweeps, not full-suite live passes.",
+    "research_gate_boundary": "Research-gate records are routing evidence for future adapters and must not be counted as fixture-backed, live-baseline, or live-real-world pass evidence.",
+    "operator_boundary": "Private corpus and credentialed production-ops checks remain blocked until operator-owned inputs are supplied."
+  },
+  "manifest_summary": {
+    "adapter_records": 21,
+    "project_count": 17,
+    "evidence_class_counts": {
+      "fixture_backed": 1,
+      "live_baseline_only": 6,
+      "live_real_world": 2,
+      "research_gate": 12
+    },
+    "overall_status_counts": {
+      "pass": 1,
+      "wrong_result": 6,
+      "lifecycle_fail": 1,
+      "blocked": 6,
+      "not_encoded": 7
+    }
+  },
+  "state_taxonomy": [
+    {
+      "state": "fixture_backed",
+      "meaning": "A checked-in fixture or generated fixture response is scored by the real-world job runner. This is evidence for the benchmark contract, not live runtime behavior."
+    },
+    {
+      "state": "live_baseline_only",
+      "meaning": "A Docker live-baseline adapter ran same-corpus or lifecycle checks, but no real-world job suite was scored through that project."
+    },
+    {
+      "state": "live_real_world",
+      "meaning": "A project adapter materialized and scored real-world job records through a runtime or CLI path."
+    },
+    {
+      "state": "research_gate",
+      "meaning": "Source, setup, resource, retry, and output-contract metadata exists, but the project has not produced live adapter pass evidence."
+    },
+    {
+      "state": "blocked",
+      "meaning": "A safe measurement cannot run without operator-owned credentials, private data, setup proof, or a dependency outside the lane."
+    },
+    {
+      "state": "unsupported",
+      "meaning": "The capability is out of scope for the project shape or would require a non-comparable path such as host-global state."
+    },
+    {
+      "state": "wrong_result",
+      "meaning": "The system ran but missed expected memory, evidence, or answer terms."
+    },
+    {
+      "state": "lifecycle_fail",
+      "meaning": "Basic retrieval may work, but update, delete, reload, persistence, or cold-start behavior is wrong or incomplete."
+    },
+    {
+      "state": "incomplete",
+      "meaning": "The run did not reach the behavioral check because setup, install, dependency, or runtime execution failed."
+    },
+    {
+      "state": "not_encoded",
+      "meaning": "The scenario is not currently encoded for that project or evidence class, so no pass or fail claim is allowed."
+    }
+  ],
+  "project_matrix": [
+    {
+      "project": "ELF",
+      "strongest_user_facing_scenario": "Evidence-linked source-of-truth memory service with real-world fixtures and live service retrieval sweeps.",
+      "current_evidence_class": "live_real_world",
+      "supporting_evidence_classes": [
+        "fixture_backed",
+        "live_real_world"
+      ],
+      "measured_status": "wrong_result",
+      "proof": {
+        "command": "cargo make real-world-memory-live-adapters",
+        "artifact": "tmp/real-world-memory/live-adapters/elf-report.md"
+      },
+      "unsupported_or_blocked_status": {
+        "state": "blocked",
+        "typed_reason": "private_manifest_and_provider_credentials",
+        "details": "Fixture production-ops keeps private corpus and provider credential gates blocked; live sweep keeps broader non-retrieval suites typed non-pass."
+      },
+      "benchmark_before_claim": "A full-suite live_real_world pass plus separate private-corpus and credentialed production-ops evidence is required before broad live parity or production proof claims.",
+      "borrow_if_stronger": "Keep borrowing qmd debug knobs, OpenViking staged trajectory, mem0 history, Letta core memory, and graph/RAG navigation patterns where they remain stronger."
+    },
+    {
+      "project": "qmd",
+      "strongest_user_facing_scenario": "Local retrieval-debug workflow with transparent CLI indexing, querying, expansion, fusion, and rerank ergonomics.",
+      "current_evidence_class": "live_real_world",
+      "supporting_evidence_classes": [
+        "live_baseline_only",
+        "live_real_world",
+        "research_gate"
+      ],
+      "measured_status": "wrong_result",
+      "proof": {
+        "command": "cargo make real-world-memory-live-adapters",
+        "artifact": "tmp/real-world-memory/live-adapters/qmd-report.md"
+      },
+      "unsupported_or_blocked_status": {
+        "state": "not_encoded",
+        "typed_reason": "deep_profile_and_non_retrieval_suites_not_encoded",
+        "details": "The full live sweep passes targeted retrieval suites but keeps memory_evolution wrong_result and several broader suites not_encoded or incomplete."
+      },
+      "benchmark_before_claim": "Run qmd deep retrieval/debug profile and full-suite live real-world replay with trace-level diagnostics before claiming ELF wins, ties, or loses on retrieval debugging.",
+      "borrow_if_stronger": "Borrow transparent local knobs for query rewriting, weighted fusion, rerank explanation, and command-line replay."
+    },
+    {
+      "project": "agentmemory",
+      "strongest_user_facing_scenario": "Coding-agent continuity, MCP/REST packaging, viewer workflow, and durable cross-agent memory lifecycle.",
+      "current_evidence_class": "live_baseline_only",
+      "supporting_evidence_classes": [
+        "live_baseline_only"
+      ],
+      "measured_status": "lifecycle_fail",
+      "proof": {
+        "command": "ELF_BASELINE_PROJECTS=agentmemory cargo make baseline-live-docker",
+        "artifact": "tmp/live-baseline/live-baseline-report.json"
+      },
+      "unsupported_or_blocked_status": {
+        "state": "blocked",
+        "typed_reason": "durable_lifecycle_adapter_missing",
+        "details": "Same-corpus retrieval can run, but durable cold-start and real-world job adapter coverage are blocked by the current adapter path."
+      },
+      "benchmark_before_claim": "Add a durable local adapter that covers update, delete, cold-start reload, work resume, capture/write policy, and lifecycle-staleness jobs.",
+      "borrow_if_stronger": "Borrow cross-agent hooks, packaging, continuity scenarios, and operator-visible viewer affordances."
+    },
+    {
+      "project": "mem0/OpenMemory",
+      "strongest_user_facing_scenario": "Memory lifecycle, personalization, hosted/OpenMemory UI ergonomics, and optional graph memory.",
+      "current_evidence_class": "live_baseline_only",
+      "supporting_evidence_classes": [
+        "live_baseline_only"
+      ],
+      "measured_status": "wrong_result",
+      "proof": {
+        "command": "ELF_BASELINE_PROJECTS=mem0 cargo make baseline-live-docker",
+        "artifact": "tmp/live-baseline/live-baseline-report.json"
+      },
+      "unsupported_or_blocked_status": {
+        "state": "not_encoded",
+        "typed_reason": "openmemory_ui_and_hosted_claims_not_encoded",
+        "details": "Local OSS setup is represented, but hosted/OpenMemory UI parity and real-world personalization coverage are not encoded."
+      },
+      "benchmark_before_claim": "Fix the local adapter's same-corpus result, then encode memory_evolution, personalization, OpenMemory UI readback, and optional graph-context jobs.",
+      "borrow_if_stronger": "Borrow entity-scoped memory history, lifecycle surfaces, async update ergonomics, and OpenMemory-style inspection UX."
+ }, + { + "project": "memsearch", + "strongest_user_facing_scenario": "Markdown-first canonical store with rebuildable local index and practical hybrid retrieval.", + "current_evidence_class": "live_baseline_only", + "supporting_evidence_classes": [ + "live_baseline_only" + ], + "measured_status": "wrong_result", + "proof": { + "command": "ELF_BASELINE_PROJECTS=memsearch cargo make baseline-live-docker", + "artifact": "tmp/live-baseline/live-baseline-report.json" + }, + "unsupported_or_blocked_status": { + "state": "incomplete", + "typed_reason": "source_of_truth_and_reindex_real_world_jobs_incomplete", + "details": "Same-corpus retrieval is wrong_result and source-of-truth plus real-world reindex behavior is not yet cleanly scored." + }, + "benchmark_before_claim": "Fix Docker same-corpus retrieval and reindex/update/delete reload evidence, then score source-of-truth and retrieval-debug real-world jobs.", + "borrow_if_stronger": "Borrow the canonical markdown-store ergonomics, local reindex clarity, and user-inspectable source files." + }, + { + "project": "OpenViking", + "strongest_user_facing_scenario": "Filesystem-like context trajectory, hierarchical retrieval, and staged context loading.", + "current_evidence_class": "live_baseline_only", + "supporting_evidence_classes": [ + "live_baseline_only", + "research_gate" + ], + "measured_status": "wrong_result", + "proof": { + "command": "ELF_BASELINE_PROJECTS=OpenViking cargo make baseline-live-docker", + "artifact": "tmp/live-baseline/live-baseline-report.json" + }, + "unsupported_or_blocked_status": { + "state": "not_encoded", + "typed_reason": "hierarchical_context_trajectory_not_encoded", + "details": "Pinned Docker local embedding setup reaches add_resource/find, but same-corpus output misses expected evidence and trajectory jobs are not encoded." 
+ }, + "benchmark_before_claim": "First make evidence-bearing same-corpus output pass, then run a context-trajectory suite that scores staged retrieval paths and hierarchy expansion.", + "borrow_if_stronger": "Borrow the viking-style filesystem context model, trajectory readback, and staged retrieval planning." + }, + { + "project": "claude-mem", + "strongest_user_facing_scenario": "Progressive disclosure, automatic capture loop, repository-local lifecycle, and practical local viewer workflow.", + "current_evidence_class": "live_baseline_only", + "supporting_evidence_classes": [ + "live_baseline_only" + ], + "measured_status": "wrong_result", + "proof": { + "command": "ELF_BASELINE_PROJECTS=claude-mem cargo make baseline-live-docker", + "artifact": "tmp/live-baseline/live-baseline-report.json" + }, + "unsupported_or_blocked_status": { + "state": "not_encoded", + "typed_reason": "progressive_disclosure_real_world_jobs_not_encoded", + "details": "Current Docker evidence is not a clean retrieval pass and progressive-disclosure jobs are not encoded." + }, + "benchmark_before_claim": "Add durable repository-backed work_resume, operator_debugging_ux, capture/write-policy, and progressive-disclosure jobs.", + "borrow_if_stronger": "Borrow progressive disclosure, automatic capture review loops, and local viewer/operator comfort." 
+ }, + { + "project": "RAGFlow", + "strongest_user_facing_scenario": "Full RAG application workflow with document, chunk, and reference evidence handles.", + "current_evidence_class": "research_gate", + "supporting_evidence_classes": [ + "research_gate" + ], + "measured_status": "blocked", + "proof": { + "command": "ELF_RAGFLOW_SMOKE_START=1 ELF_RAGFLOW_SMOKE_ACCEPT_RESOURCE_ENVELOPE=1 cargo make ragflow-docker-smoke", + "artifact": "tmp/real-world-memory/ragflow-smoke/ragflow-smoke.json" + }, + "unsupported_or_blocked_status": { + "state": "blocked", + "typed_reason": "docker_service_resource_envelope_and_adapter_output_mapping", + "details": "Research says adapter candidate, but Docker runtime proof and reference.chunks to benchmark evidence mapping must still run." + }, + "benchmark_before_claim": "Run XY-885 tiny Docker evidence-smoke adapter and map RAGFlow reference chunks to scored retrieval/debug evidence.", + "borrow_if_stronger": "Borrow document/chunk reference surfaces, resource-envelope reporting, and RAG app evidence handles." + }, + { + "project": "LightRAG", + "strongest_user_facing_scenario": "Lightweight graph/RAG context export with source file-path citation shape.", + "current_evidence_class": "research_gate", + "supporting_evidence_classes": [ + "research_gate" + ], + "measured_status": "blocked", + "proof": { + "command": "ELF_LIGHTRAG_CONTEXT_START=1 cargo make lightrag-docker-context-smoke", + "artifact": "tmp/real-world-memory/lightrag-context/summary.json" + }, + "unsupported_or_blocked_status": { + "state": "blocked", + "typed_reason": "docker_service_setup_and_context_export_not_proven", + "details": "The project is an adapter candidate, but retrieved-context export and real-world adapter scoring remain blocked." 
+ }, + "benchmark_before_claim": "Run XY-886 Docker context-export adapter with explicit LLM and embedding config plus source citation mapping.", + "borrow_if_stronger": "Borrow context-only query modes, graph-aware retrieval layout, and file-path citation readback." + }, + { + "project": "GraphRAG", + "strongest_user_facing_scenario": "GraphRAG indexing, graph summaries, and document/text-unit evidence tables.", + "current_evidence_class": "research_gate", + "supporting_evidence_classes": [ + "research_gate" + ], + "measured_status": "blocked", + "proof": { + "command": "ELF_GRAPHRAG_SMOKE_RUN=1 cargo make graphrag-docker-smoke", + "artifact": "tmp/real-world-memory/graphrag-smoke/summary.json" + }, + "unsupported_or_blocked_status": { + "state": "blocked", + "typed_reason": "indexing_resource_envelope_and_source_citation_mapping", + "details": "Cost-bounded Docker CLI/API and parquet outputs are identified, but indexing and evidence mapping have not passed." + }, + "benchmark_before_claim": "Run XY-887 cost-bounded Docker adapter over a tiny corpus and score output tables against retrieval and knowledge-synthesis evidence.", + "borrow_if_stronger": "Borrow graph summary artifacts, local/global search separation, and source table evidence mapping." 
+ }, + { + "project": "Graphiti/Zep", + "strongest_user_facing_scenario": "Temporal graph memory with current, historical, and future fact validity windows.", + "current_evidence_class": "research_gate", + "supporting_evidence_classes": [ + "research_gate" + ], + "measured_status": "blocked", + "proof": { + "command": "ELF_GRAPHITI_ZEP_SMOKE_START=1 ELF_GRAPHITI_ZEP_SMOKE_RUN=1 cargo make graphiti-zep-docker-temporal-smoke", + "artifact": "tmp/real-world-memory/graphiti-zep-smoke/summary.json" + }, + "unsupported_or_blocked_status": { + "state": "blocked", + "typed_reason": "docker_graph_store_and_temporal_adapter_not_proven", + "details": "Temporal graph memory is an adapter candidate, but Docker graph-store setup and real-world job scoring are blocked." + }, + "benchmark_before_claim": "Run XY-888 Docker-local temporal graph adapter and score current versus historical fact validity with evidence ids.", + "borrow_if_stronger": "Borrow temporal fact windows, invalidation/supersession semantics, and graph fact provenance." + }, + { + "project": "Letta", + "strongest_user_facing_scenario": "Core memory blocks versus archival memory with explicit operating-context surfaces.", + "current_evidence_class": "research_gate", + "supporting_evidence_classes": [ + "research_gate" + ], + "measured_status": "not_encoded", + "proof": { + "command": null, + "artifact": "docs/research/2026-06-10-xy-882-rag-graph-adapter-feasibility.json" + }, + "unsupported_or_blocked_status": { + "state": "blocked", + "typed_reason": "contained_evidence_export_path_not_selected", + "details": "Research-only until a supported contained server path can export core/archival evidence without relying on unsupported setup." 
+ }, + "benchmark_before_claim": "Select a contained evidence export contract, then encode core-vs-archival memory, personalization, and project-decision jobs.", + "borrow_if_stronger": "Borrow explicit core memory block ergonomics, archival separation, and shared operating context readback." + }, + { + "project": "LangGraph", + "strongest_user_facing_scenario": "Checkpoint/replay regression workflow and durable state replay for agent runs.", + "current_evidence_class": "research_gate", + "supporting_evidence_classes": [ + "research_gate" + ], + "measured_status": "not_encoded", + "proof": { + "command": null, + "artifact": "docs/research/2026-06-10-xy-882-rag-graph-adapter-feasibility.json" + }, + "unsupported_or_blocked_status": { + "state": "unsupported", + "typed_reason": "not_a_standalone_memory_backend_adapter", + "details": "Keep as a checkpoint/replay reference, not as a direct memory backend competitor until a comparable memory output contract exists." + }, + "benchmark_before_claim": "Non-goal for direct win/loss until a standalone memory adapter contract exists; use replay regression jobs as a benchmark infrastructure reference.", + "borrow_if_stronger": "Borrow checkpoint replay, deterministic regression, and state-diff evaluation patterns." + }, + { + "project": "nanograph", + "strongest_user_facing_scenario": "Typed graph schema and query ergonomics for graph-lite developer experience.", + "current_evidence_class": "research_gate", + "supporting_evidence_classes": [ + "research_gate" + ], + "measured_status": "not_encoded", + "proof": { + "command": null, + "artifact": "docs/research/2026-06-10-xy-882-rag-graph-adapter-feasibility.json" + }, + "unsupported_or_blocked_status": { + "state": "unsupported", + "typed_reason": "not_a_memory_backend_comparison_target", + "details": "The official project has no server and no Docker path; use it as a graph-lite DX reference rather than as adapter proof."
+ }, + "benchmark_before_claim": "Non-goal for direct win/loss unless a contained memory-backed comparison target emerges; measure ELF graph-lite DX against typed schema/query acceptance instead.", + "borrow_if_stronger": "Borrow typed relation schema, query ergonomics, and small graph developer experience." + }, + { + "project": "llm-wiki", + "strongest_user_facing_scenario": "LLM-maintained wiki or knowledge-page workflow with query-save and lint loops.", + "current_evidence_class": "research_gate", + "supporting_evidence_classes": [ + "research_gate" + ], + "measured_status": "not_encoded", + "proof": { + "command": null, + "artifact": "docs/research/2026-06-10-xy-882-rag-graph-adapter-feasibility.json" + }, + "unsupported_or_blocked_status": { + "state": "unsupported", + "typed_reason": "live_service_runtime_not_available_for_adapter_proof", + "details": "Research-only until a contained plugin or instruction harness can emit scored knowledge-page evidence." + }, + "benchmark_before_claim": "Select a contained plugin or instruction harness, then score knowledge pages for citation coverage, unsupported claims, rebuild, and stale-source lint.", + "borrow_if_stronger": "Borrow maintained wiki workflows, page lint, query-save loops, and topic-scoped knowledge navigation." + }, + { + "project": "gbrain", + "strongest_user_facing_scenario": "Operational knowledge brain with compiled_truth pages, timelines, enrichment, and maintenance loops.", + "current_evidence_class": "research_gate", + "supporting_evidence_classes": [ + "research_gate" + ], + "measured_status": "not_encoded", + "proof": { + "command": null, + "artifact": "docs/research/2026-06-10-xy-882-rag-graph-adapter-feasibility.json" + }, + "unsupported_or_blocked_status": { + "state": "blocked", + "typed_reason": "docker_local_brain_repo_and_database_path_missing", + "details": "Research remains blocked until a Docker-local brain repo and database path can be proven without operator-owned state." 
+ }, + "benchmark_before_claim": "First prove Docker-local repository and database setup, then encode compiled_truth/timeline page scoring and operator-continuity jobs.", + "borrow_if_stronger": "Borrow compiled truth pages, timeline maintenance, and human-operable knowledge-brain navigation." + }, + { + "project": "graphify", + "strongest_user_facing_scenario": "Graph-compressed navigation with graph.json and GRAPH_REPORT evidence outputs.", + "current_evidence_class": "research_gate", + "supporting_evidence_classes": [ + "research_gate" + ], + "measured_status": "blocked", + "proof": { + "command": "cargo make graphify-docker-graph-report-smoke", + "artifact": "tmp/real-world-memory/graphify-smoke/graphify-smoke.json" + }, + "unsupported_or_blocked_status": { + "state": "blocked", + "typed_reason": "docker_cli_graph_report_generation_not_proven", + "details": "Adapter candidate, but graph report generation and real-world scoring are still blocked; host-global assistant hooks are out of scope." + }, + "benchmark_before_claim": "Run XY-889 Docker-only graph/report adapter over graph.json and GRAPH_REPORT.md, then score graph navigation and knowledge-synthesis evidence.", + "borrow_if_stronger": "Borrow graph compression, source-location graph reports, and navigation hints for large code or document spaces." 
+ } + ], + "scenario_matrix": [ + { + "scenario_id": "retrieval_debug", + "scenario": "retrieval/debug", + "current_elf_evidence": "ELF fixture-backed retrieval passes and ELF live_real_world retrieval passes in the full sweep.", + "strongest_competitor_or_reference": "qmd", + "current_competitor_evidence": "qmd live_real_world retrieval passes and qmd live_baseline_only checks pass, but qmd full-suite live status is wrong_result.", + "current_state": "Measured tie on encoded retrieval answers; qmd remains stronger on local debug ergonomics not fully scored.", + "next_measurement": "Run qmd deep retrieval/debug profile and ELF/qmd trace-level wrong-result replay with expansion, fusion, rerank, and candidate-drop diagnostics." + }, + { + "scenario_id": "work_resume", + "scenario": "work resume", + "current_elf_evidence": "ELF fixture-backed work_resume passes and ELF live_real_world work_resume passes.", + "strongest_competitor_or_reference": "agentmemory, claude-mem, OpenViking", + "current_competitor_evidence": "agentmemory is live_baseline_only with lifecycle_fail; claude-mem is wrong_result; OpenViking work_resume is not_encoded.", + "current_state": "ELF and qmd have current encoded live pass evidence, but continuity-oriented competitors remain undermeasured.", + "next_measurement": "Encode durable agentmemory, claude-mem, and OpenViking work_resume adapters or declare each blocked with lifecycle/setup evidence." 
+ }, + { + "scenario_id": "project_decisions", + "scenario": "project decisions", + "current_elf_evidence": "ELF fixture-backed and live_real_world project_decisions suites pass.", + "strongest_competitor_or_reference": "qmd, Letta", + "current_competitor_evidence": "qmd live_real_world project_decisions passes; Letta project_decisions is research_gate not_encoded.", + "current_state": "ELF and qmd are the only measured live competitors for this scenario.", + "next_measurement": "Add core/archival decision-memory jobs for Letta only after a contained export path exists; otherwise keep Letta as design reference." + }, + { + "scenario_id": "source_of_truth", + "scenario": "source-of-truth", + "current_elf_evidence": "ELF fixture-backed trust_source_of_truth passes and ELF live_real_world trust_source_of_truth passes.", + "strongest_competitor_or_reference": "memsearch", + "current_competitor_evidence": "memsearch has live_baseline_only canonical store evidence but trust_source_of_truth is incomplete and retrieval is wrong_result.", + "current_state": "ELF has stronger measured source-of-truth evidence; memsearch remains a local-store ergonomics reference.", + "next_measurement": "Fix memsearch same-corpus retrieval/reindex evidence, then run source-of-truth rebuild and reload jobs before any win/loss claim." 
+ }, + { + "scenario_id": "temporal_current_historical", + "scenario": "temporal/current-vs-historical memory", + "current_elf_evidence": "ELF fixture-backed memory_evolution passes, but ELF live_real_world memory_evolution is wrong_result.", + "strongest_competitor_or_reference": "Graphiti/Zep, mem0/OpenMemory", + "current_competitor_evidence": "Graphiti/Zep is research_gate blocked; mem0/OpenMemory is live_baseline_only wrong_result.", + "current_state": "No project has a comparable live pass for current-vs-historical evidence; ELF cannot claim live superiority yet.", + "next_measurement": "Fix ELF/qmd live memory_evolution evidence links and run XY-888 Graphiti/Zep temporal graph adapter." + }, + { + "scenario_id": "consolidation", + "scenario": "consolidation", + "current_elf_evidence": "ELF fixture-backed consolidation passes, but live_real_world consolidation is not_encoded.", + "strongest_competitor_or_reference": "agentmemory, managed dreaming references, llm-wiki", + "current_competitor_evidence": "Manifest projects do not yet have live consolidation scoring; llm-wiki knowledge workflow is research_gate not_encoded.", + "current_state": "Fixture-only ELF evidence is useful, but no live proposal-generation parity claim is allowed.", + "next_measurement": "Run a reviewable consolidation-worker benchmark that emits proposals, source refs, unsupported-claim flags, and apply/discard/defer audit events." 
+ }, + { + "scenario_id": "knowledge_pages", + "scenario": "knowledge pages", + "current_elf_evidence": "ELF fixture-backed knowledge_compilation passes, but live_real_world knowledge_compilation is not_encoded.", + "strongest_competitor_or_reference": "llm-wiki, gbrain, GraphRAG, graphify", + "current_competitor_evidence": "llm-wiki and gbrain are research_gate not_encoded or blocked; GraphRAG and graphify are research_gate blocked.", + "current_state": "No live knowledge-page competitor result exists; ELF has only fixture-backed derived-page evidence.", + "next_measurement": "Encode live knowledge-page rebuild/lint scoring for ELF and run contained llm-wiki, gbrain, GraphRAG, or graphify adapters only after setup proof exists." + }, + { + "scenario_id": "operator_debugging", + "scenario": "operator debugging", + "current_elf_evidence": "ELF fixture-backed operator_debugging_ux passes, but ELF live_real_world operator_debugging_ux is not_encoded.", + "strongest_competitor_or_reference": "qmd, claude-mem, OpenMemory", + "current_competitor_evidence": "qmd has local debug strengths but operator_debugging_ux is not_encoded in live sweeps; claude-mem and OpenMemory UX are not_encoded.", + "current_state": "Operator debugging remains mostly product/UX evidence, not comparable live benchmark evidence.", + "next_measurement": "Score trace hydration, candidate-stage attribution, raw-SQL avoidance, and repair-action clarity through live viewer or CLI artifacts." 
+ }, + { + "scenario_id": "capture_write_policy", + "scenario": "capture/write policy", + "current_elf_evidence": "ELF fixture-backed capture_integration passes, but ELF live_real_world capture_integration is not_encoded.", + "strongest_competitor_or_reference": "agentmemory, claude-mem", + "current_competitor_evidence": "agentmemory capture_integration is blocked and claude-mem capture_integration is not_encoded.", + "current_state": "ELF fixture evidence is strongest, but live capture and write-policy behavior still needs runtime scoring.", + "next_measurement": "Run capture/write-policy jobs that prove redaction, exclusion, evidence binding, and no secret leakage through live ingestion paths." + }, + { + "scenario_id": "production_ops", + "scenario": "production ops", + "current_elf_evidence": "ELF production runbooks and fixture production_ops cover restore, Qdrant rebuild, backfill resume, resource envelope, and typed private/credential blockers; live_real_world production_ops is incomplete.", + "strongest_competitor_or_reference": "ELF production gate, qmd, RAGFlow/GraphRAG/LightRAG resource gates", + "current_competitor_evidence": "qmd live production_ops is incomplete; RAGFlow/GraphRAG/LightRAG resource gates are research_gate blocked.", + "current_state": "ELF has the strongest checked-in production evidence, but private corpus and credentialed gates remain blocked.", + "next_measurement": "Rerun private-corpus and credentialed production-ops gates only when operator-owned manifest and credentials are supplied."
+ }, + { + "scenario_id": "personalization", + "scenario": "personalization", + "current_elf_evidence": "ELF fixture-backed personalization passes and ELF live_real_world personalization passes.", + "strongest_competitor_or_reference": "mem0/OpenMemory, Letta", + "current_competitor_evidence": "mem0/OpenMemory personalization is not_encoded and Letta personalization is research_gate not_encoded.", + "current_state": "ELF and qmd have live encoded evidence; personalization-specialized competitors are not yet comparable.", + "next_measurement": "Encode mem0/OpenMemory and Letta scoped-preference readback jobs before making personalization superiority claims." + }, + { + "scenario_id": "context_trajectory", + "scenario": "context trajectory", + "current_elf_evidence": "ELF has trace and trajectory directions, but staged context trajectory is not yet a comparable live scenario.", + "strongest_competitor_or_reference": "OpenViking", + "current_competitor_evidence": "OpenViking Docker setup is pinned, same-corpus retrieval is wrong_result, and hierarchical trajectory is research_gate not_encoded.", + "current_state": "OpenViking remains the strongest design reference, but not a measured live winner.", + "next_measurement": "Make OpenViking same-corpus evidence-bearing retrieval pass, then score hierarchical expansion and staged context trajectory outputs." 
+ }, + { + "scenario_id": "core_vs_archival_memory", + "scenario": "core-vs-archival memory", + "current_elf_evidence": "ELF spec and admin surfaces define core blocks, but comparative benchmark coverage is not yet encoded here.", + "strongest_competitor_or_reference": "Letta", + "current_competitor_evidence": "Letta is research_gate not_encoded until a contained evidence export path is selected.", + "current_state": "Scenario is a product gap measurement target, not a current win/loss surface.", + "next_measurement": "Add core-block versus archival-search jobs for ELF and only compare Letta after contained export proof exists." + }, + { + "scenario_id": "graph_rag_navigation", + "scenario": "graph/RAG navigation", + "current_elf_evidence": "ELF relation context and graph-lite work are not enough to claim graph/RAG navigation parity.", + "strongest_competitor_or_reference": "RAGFlow, LightRAG, GraphRAG, Graphiti/Zep, graphify", + "current_competitor_evidence": "All named RAG/graph projects are research_gate blocked or not_encoded, with adapter-candidate follow-ups for RAGFlow, LightRAG, GraphRAG, Graphiti/Zep, and graphify.", + "current_state": "No RAG/graph project has live_real_world pass evidence; research gates define follow-up adapter work only.", + "next_measurement": "Run XY-885 through XY-889 Docker-contained adapters and require evidence-linked outputs before any graph/RAG navigation claim." + } + ], + "parallelizable_followups": [ + { + "workstream": "qmd deep retrieval/debug profile", + "issue_or_candidate": "new benchmark issue", + "parallelizable": true, + "blocked_by": "None after this matrix lands.", + "measurement": "Stress profile plus trace-level retrieval-debug artifacts for qmd and ELF." 
+ }, + { + "workstream": "agentmemory durable lifecycle adapter", + "issue_or_candidate": "[ELF benchmark P0] Make external adapters lifecycle-durable and fail-typed", + "parallelizable": true, + "blocked_by": "Durable local adapter path selection.", + "measurement": "Update, delete, cold-start reload, work_resume, and capture/write-policy jobs." + }, + { + "workstream": "mem0/OpenMemory local and UI coverage", + "issue_or_candidate": "new adapter repair issue", + "parallelizable": true, + "blocked_by": "Comparable local OSS path for UI/readback evidence.", + "measurement": "Same-corpus fix plus memory_evolution, personalization, and OpenMemory inspection jobs." + }, + { + "workstream": "memsearch source-of-truth and reindex coverage", + "issue_or_candidate": "new adapter repair issue", + "parallelizable": true, + "blocked_by": "Docker same-corpus retrieval and reindex correctness.", + "measurement": "Canonical markdown store, rebuild/reindex, retrieval, update/delete/reload jobs." + }, + { + "workstream": "OpenViking context trajectory", + "issue_or_candidate": "new benchmark issue after evidence output fix", + "parallelizable": true, + "blocked_by": "Evidence-bearing same-corpus retrieval output.", + "measurement": "Hierarchical expansion, staged trajectory, and resume/retrieval evidence jobs." + }, + { + "workstream": "claude-mem progressive disclosure", + "issue_or_candidate": "new adapter issue", + "parallelizable": true, + "blocked_by": "Durable repository path and progressive-disclosure output contract.", + "measurement": "Work resume, operator debugging, capture/write-policy, and progressive disclosure jobs." + }, + { + "workstream": "RAGFlow evidence smoke", + "issue_or_candidate": "XY-885", + "parallelizable": true, + "blocked_by": "Resource envelope accepted for tiny Docker smoke.", + "measurement": "reference.chunks to benchmark evidence mapping." 
+ }, + { + "workstream": "LightRAG context export", + "issue_or_candidate": "XY-886", + "parallelizable": true, + "blocked_by": "Docker service setup and explicit provider config.", + "measurement": "Retrieved context export and source file-path citations." + }, + { + "workstream": "GraphRAG cost-bounded adapter", + "issue_or_candidate": "XY-887", + "parallelizable": true, + "blocked_by": "Tiny corpus cost/resource envelope.", + "measurement": "Document, text-unit, graph-summary, and citation output tables." + }, + { + "workstream": "Graphiti/Zep temporal graph adapter", + "issue_or_candidate": "XY-888", + "parallelizable": true, + "blocked_by": "Docker-local graph store setup.", + "measurement": "Current/historical/future fact validity and evidence ids." + }, + { + "workstream": "graphify graph report adapter", + "issue_or_candidate": "XY-889", + "parallelizable": true, + "blocked_by": "Docker CLI graph/report generation proof.", + "measurement": "graph.json and GRAPH_REPORT evidence for graph navigation and knowledge synthesis." + }, + { + "workstream": "Private corpus and credentialed production ops", + "issue_or_candidate": "operator-owned benchmark gates", + "parallelizable": false, + "blocked_by": "Sanitized private manifest and routed provider credentials.", + "measurement": "Private-corpus retrieval quality and credentialed production-ops pass/fail evidence." + }, + { + "workstream": "Letta, LangGraph, nanograph, llm-wiki direct adapters", + "issue_or_candidate": "research-only until output contract", + "parallelizable": false, + "blocked_by": "Contained evidence export or non-memory-backend comparability contract.", + "measurement": "Only run after each has a comparable output contract; otherwise treat as product-reference evidence." + } + ] +}