hack-ink · yvette-carlisle · Jun 11, 2026 · Jun 11, 2026
diff --git a/...e/benchmarking/2026-06-11-elf-iteration-direction-from-competitor-benchmarks.md b/...e/benchmarking/2026-06-11-elf-iteration-direction-from-competitor-benchmarks.md
@@ -0,0 +1,282 @@
+# ELF Iteration Direction From Competitor Benchmarks - June 11, 2026
+
+Goal: Convert the current benchmark evidence and competitor-strength matrix into an
+iteration direction for ELF without overstating wins.
+Read this when: You need to decide what ELF should learn from adjacent memory,
+RAG, graph, and agent-continuity projects.
+Inputs: `2026-06-11-competitor-strength-evidence-matrix.md`,
+`2026-06-10-live-real-world-sweep-report.md`,
+`2026-06-10-production-adoption-refresh.md`,
+`2026-06-10-real-world-comparison-report.md`,
+`apps/elf-eval/fixtures/real_world_external_adapters/memory_projects_manifest.json`,
+and `docs/guide/research/external_memory_improvement_plan.md`.
+Depends on: `docs/spec/real_world_agent_memory_benchmark_v1.md`.
+Outputs: Current measured data, scenario gaps, and a prioritized optimization
+direction for future ELF work.
+
+## Executive Judgment
+
+ELF is a credible personal-production foundation for a high-trust memory service, but
+the current evidence does not prove broad superiority over all tracked projects.
+
+The strongest current statement is:
+
+- ELF is ahead on source-of-truth discipline, evidence-bound writes, rebuildable
+  derived indexes, typed failure reporting, and checked-in production-operation
+  evidence.
+- ELF and qmd are tied on the encoded live retrieval, work-resume, and
+  project-decision slices. ELF does not yet beat qmd's local retrieval-debug
+  ergonomics.
+- Many competitor strengths are still undermeasured: OpenViking context trajectory,
+  mem0/OpenMemory entity history and UI, agentmemory and claude-mem continuity
+  capture, Letta core-vs-archival memory, Graphiti/Zep temporal graph behavior, and
+  llm-wiki/gbrain/graphify knowledge workflows.
+- The right next strategy is not to replace ELF with any one project. It is to keep
+  ELF's evidence-bound core and absorb the best measured or plausible product
+  patterns behind benchmark gates.
+
+## Current Measured Data
+
+### Fixture-Backed ELF Aggregate
+
+`cargo make real-world-memory` currently reports:
+
+| Metric | Value |
+| --- | ---: |
+| Jobs | `38` |
+| Encoded suites | `11` |
+| Pass | `36` |
+| Blocked | `2` |
+| Wrong result | `0` |
+| Lifecycle fail | `0` |
+| Incomplete | `0` |
+| Not encoded | `0` |
+| Unsupported claim | `0` |
+| Mean score | `0.947` |
+| Evidence coverage | `84/84` |
+| Expected evidence recall | `77/77` |
+
+This proves the fixture contract is broad and well controlled. It does not prove that
+every live adapter or every competitor runtime passes those scenarios.
+
+### Live Real-World Sweep
+
+`cargo make real-world-memory-live-adapters` produced comparable full-suite live
+sweeps for ELF and qmd:
+
+| Adapter | Jobs | Pass | Wrong result | Incomplete | Blocked | Not encoded | Mean score | Evidence recall |
+| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| ELF live service adapter | `38` | `18` | `5` | `1` | `2` | `12` | `0.514` | `41/75` |
+| qmd live CLI adapter | `38` | `18` | `5` | `1` | `2` | `12` | `0.512` | `41/75` |
+
+Interpretation:
+
+- This is a tie for the currently encoded live real-world sweep.
+- Both pass `trust_source_of_truth`, `work_resume`, `project_decisions`,
+  `retrieval`, and `personalization`.
+- Both fail `memory_evolution` live conflict evidence with `wrong_result`.
+- Both leave consolidation, knowledge compilation, operator debugging, capture
+  integration, and parts of production operations as `not_encoded` or incomplete.
+
+### Production Evidence
+
+ELF has the strongest production-operation evidence among the tracked systems:
+
+| Run | Scope | Result |
+| --- | --- | --- |
+| Provider synthetic | 8 documents, 6 queries, Qwen3-Embedding-8B, 4096 dimensions | `8/8`, `pass`, 59 seconds |
+| Provider stress | 480 generated documents, 16 queries | `9/9`, `pass`, 779 seconds |
+| Provider backfill | 2,000 generated documents, 16 queries, resume 1,000 -> 2,000 | `9/9`, `pass`, 2,804 seconds |
+| Restore proof | Docker Compose backup/restore plus Qdrant rebuild | restored note searchable, zero rebuild errors |
+| Private production corpus | operator-owned manifest required | failed closed, no pass claimed |
+
+This is enough to support personal production use with bounded caveats. It is not a
+private-corpus quality proof.
+
+### External Adapter Ledger
+
+The current adapter manifest records 21 adapter records across 17 projects:
+
+| Evidence class | Count | Meaning |
+| --- | ---: | --- |
+| `fixture_backed` | `1` | ELF real-world fixture scoring. |
+| `live_baseline_only` | `6` | Docker same-corpus or lifecycle evidence without real-world job scoring. |
+| `live_real_world` | `2` | ELF and qmd full-suite live sweeps. |
+| `research_gate` | `12` | Source/setup/resource/output-contract evidence only. |
+
+Overall adapter statuses:
+
+| Status | Count |
+| --- | ---: |
+| `pass` | `1` |
+| `wrong_result` | `6` |
+| `lifecycle_fail` | `1` |
+| `blocked` | `6` |
+| `not_encoded` | `7` |
+
+The ledger is intentionally not a leaderboard. It prevents fixture evidence,
+same-corpus checks, research gates, and live real-world runs from being collapsed into
+one misleading score.
+
+## Scenario Conclusions
+
+| Scenario | Current position | What ELF should learn next |
+| --- | --- | --- |
+| Retrieval/debug | ELF and qmd are tied on encoded live retrieval; qmd remains the stronger debug UX reference. | Add trace-level replay, expansion/fusion/rerank knobs, candidate-drop diagnosis, and command-line replay. |
+| Work resume | ELF live work-resume passes; continuity-oriented competitors are undermeasured. | Borrow agentmemory/claude-mem capture breadth and OpenViking staged context, but require durable adapter proof. |
+| Project decisions | ELF and qmd live project-decision suites pass; Letta is not encoded. | Add core-vs-archival decision-memory scenarios before comparing Letta. |
+| Source of truth | ELF has the strongest measured source-of-truth evidence. | Borrow memsearch's local canonical-store ergonomics without making files or vectors authoritative. |
+| Temporal memory | ELF fixture passes, but live memory evolution is wrong_result. | Prioritize current-vs-historical evidence links and Graphiti/Zep-style validity windows. |
+| Consolidation | ELF fixture passes, but live proposal generation is not encoded. | Build reviewable derived proposals with source refs, confidence, unsupported-claim flags, and apply/defer/discard audit. |
+| Knowledge pages | ELF fixture pages pass; live knowledge generation is not encoded. | Borrow llm-wiki lint/query-save loops, gbrain timelines, and graphify reports behind rebuild/lint benchmarks. |
+| Operator debugging | Fixture UX passes; live trace/viewer scoring is not encoded. | Make viewer/CLI debugging a scored live surface, not just an admin convenience. |
+| Capture/write policy | Fixture capture boundary passes; live capture is not encoded. | Borrow agentmemory/claude-mem capture hooks while preserving redaction and evidence binding. |
+| Production ops | ELF has the strongest checked-in evidence, with private/credential gates blocked. | Keep Docker-first production proof and add private corpus only when an operator-owned manifest exists. |
+| Personalization | ELF live personalization passes; mem0/OpenMemory and Letta are not encoded. | Add entity-scoped preference history and UI readback before claiming stronger personalization. |
+| Context trajectory | Not comparable yet; OpenViking remains the reference. | Score staged retrieval, hierarchy expansion, and trajectory readback. |
+| Core-vs-archival | Product gap, not a measured comparison yet. | Borrow Letta's core memory block shape with explicit scope, provenance, and read-only attachment. |
+| Graph/RAG navigation | Research gates only. | Run RAGFlow, LightRAG, GraphRAG, Graphiti/Zep, and graphify adapters only when Docker outputs map to evidence ids. |
+
+## Project Guidance Matrix
+
+| Project | Current evidence | User-facing strength | ELF direction |
+| --- | --- | --- | --- |
+| ELF | `fixture_backed` plus `live_real_world`; live full sweep is `wrong_result`. | Evidence-linked memory service, strict provenance, rebuildable Qdrant, production backfill/restore proof. | Keep this as the core; do not weaken source-of-truth or typed failure semantics while adding product ergonomics. |
+| qmd | `live_real_world` plus `live_baseline_only`; targeted retrieval passes, full sweep is `wrong_result`. | Local retrieval-debug workflow, transparent CLI, weighted fusion, rerank, replayable commands. | Treat qmd as the retrieval-debug bar. ELF should match its introspection and local replay without becoming CLI-only. |
+| agentmemory | `live_baseline_only`; current status is `lifecycle_fail`. | Coding-agent continuity, hooks, MCP/REST packaging, viewer/console observability. | Borrow capture breadth and continuity UX, but require durable lifecycle proof before claims. |
+| mem0/OpenMemory | `live_baseline_only`; current status is `wrong_result`. | Entity-scoped memory, lifecycle/history surfaces, hosted ecosystem, OpenMemory UI. | Add entity/preference history and UI readback patterns, while keeping hosted claims out of local OSS benchmarks. |
+| memsearch | `live_baseline_only`; current status is `wrong_result` with source-of-truth gaps. | Markdown-first canonical store and local reindex clarity. | Borrow local inspectability and canonical-file ergonomics, not file-as-authority semantics. |
+| OpenViking | `live_baseline_only` plus `research_gate`; current status is `wrong_result`. | Filesystem-like context model, hierarchy, staged context trajectory. | Add staged retrieval and trajectory scoring after same-corpus evidence output is correct. |
+| claude-mem | `live_baseline_only`; current status is `wrong_result`. | Progressive disclosure, automatic capture, local viewer workflow. | Borrow progressive disclosure and viewer comfort; benchmark capture and operator-debugging live paths. |
+| RAGFlow | `research_gate`; current status is `blocked`. | Full RAG application workflow with document/chunk/reference handles. | Use as a resource-aware RAG adapter benchmark, not as a current ELF competitor win/loss. |
+| LightRAG | `research_gate`; current status is `blocked`. | Lightweight graph/RAG context export and source-path citation shape. | Borrow context-export ideas for graph/RAG navigation after Docker proof. |
+| GraphRAG | `research_gate`; current status is `blocked`. | Graph summaries, document/text-unit tables, local/global search separation. | Borrow graph summary artifacts for knowledge pages and graph navigation after cost-bounded output proof. |
+| Graphiti/Zep | `research_gate`; current status is `blocked`. | Temporal graph facts, validity windows, current-vs-historical answers. | Use as the semantic model for ELF temporal memory and relation validity benchmarks. |
+| Letta | `research_gate`; current status is `not_encoded`. | Core memory blocks versus archival memory. | Add explicit scoped core blocks in ELF, but compare Letta only after a contained export path exists. |
+| LangGraph | `research_gate`; current status is `not_encoded` or `unsupported` as a direct memory backend. | Checkpoint, replay, fork, and regression debugging for agent state. | Borrow replay/regression patterns for benchmark infrastructure, not as direct memory parity. |
+| nanograph | `research_gate`; current status is `not_encoded` or `unsupported` as a full memory backend. | Typed graph schema and query ergonomics. | Borrow graph-lite DX and typed relation query ideas. |
+| llm-wiki | `research_gate`; current status is `not_encoded`. | Maintained wiki pages, query-save, lint, and repair loops. | Use as a reference for rebuildable, cited knowledge pages. |
+| gbrain | `research_gate`; current status is `not_encoded` and setup-blocked. | Compiled truth pages, timelines, and human-operable knowledge navigation. | Borrow current-truth plus timeline presentation after Docker-local setup proof exists. |
+| graphify | `research_gate`; current status is `blocked`. | `graph.json`, `GRAPH_REPORT`, source-location graph navigation. | Borrow graph-compressed navigation only after Docker graph/report output maps to evidence ids. |
+
+## Optimization Direction
+
+### P0 - Close Measured Quality Gaps
+
+These are the highest leverage because current evidence already shows an ELF gap or a
+near tie.
+
+1. Live memory evolution correctness
+   - Current state: fixture pass, live `wrong_result`.
+   - Borrow from: Graphiti/Zep validity windows, mem0 history, ELF ingest-decision
+     audit rows.
+   - Target: live answers cite both current and historical conflict evidence, not only
+     current retrieved text.
+   - Benchmark gate: live `memory_evolution` pass for ELF before superiority claims.
+
+2. qmd-level retrieval debugging
+   - Current state: ELF and qmd tie on encoded results; qmd remains stronger in
+     local debug ergonomics.
+   - Borrow from: qmd weighted fusion, rerank explanation, local replay commands.
+   - Target: every wrong result can be traced through expansion, dense retrieval,
+     sparse retrieval, fusion, rerank, graph context, and final selection.
+   - Benchmark gate: qmd deep profile plus ELF/qmd trace-level replay report.
+
+3. Live operator debugging UX
+   - Current state: fixture pass, live `not_encoded`.
+   - Borrow from: claude-mem viewer, OpenMemory inspector, qmd command output.
+   - Target: no raw SQL needed to explain a bad memory result.
+   - Benchmark gate: live operator-debugging jobs score trace hydration, stage
+     attribution, and repair-action clarity.
+
+### P1 - Turn ELF Into A Better Daily Memory Product
+
+These improve day-to-day usefulness while preserving ELF's evidence-bound core.
+
+1. Capture and continuity
+   - Borrow from: agentmemory hook breadth and claude-mem automatic capture review.
+   - ELF shape: live ingestion must preserve redaction, excluded spans, source ids,
+     and write-policy audit.
+   - Benchmark gate: capture/write-policy live jobs with no secret leakage.
+
+2. Reviewable consolidation
+   - Borrow from: managed memory dreaming and Always-On Memory Agent scheduling.
+   - ELF shape: derived proposals only; source notes are not silently rewritten.
+   - Benchmark gate: consolidation proposals include lineage, confidence,
+     unsupported-claim flags, and apply/defer/discard audit.
+
+3. Knowledge pages
+   - Borrow from: llm-wiki, gbrain, graphify, and GraphRAG.
+   - ELF shape: project/entity/concept pages are rebuilt from authoritative notes and
+     linted for unsupported or stale sections.
+   - Benchmark gate: live knowledge-page rebuild/lint report, not fixture-only proof.
+
+4. Core memory blocks
+   - Borrow from: Letta core memory versus archival memory.
+   - ELF shape: scoped read-only blocks with provenance and attachment rules, separate
+     from archival search.
+   - Benchmark gate: core-vs-archival jobs prove correct attachment, sharing, and
+     fallback to search.
+
+### P2 - Expand External Comparison Without Fake Wins
+
+These are needed for broad credibility but should not block personal production use.
+
+1. RAG and graph adapters
+   - Current state: RAGFlow, LightRAG, GraphRAG, Graphiti/Zep, and graphify are
+     adapter candidates, but still `research_gate`.
+   - Benchmark gate: Docker-contained adapters must emit evidence-linked outputs
+     before any live pass claim.
+
+2. OpenViking context trajectory
+   - Current state: setup is pinned, same-corpus retrieval is `wrong_result`, and
+     staged trajectory is `not_encoded`.
+   - Benchmark gate: evidence-bearing retrieval pass, then staged hierarchy/trajectory
+     scoring.
+
+3. mem0/OpenMemory and memsearch coverage
+   - Current state: both are `wrong_result` or partially incomplete in local checks.
+   - Benchmark gate: fix same-corpus correctness first; only then score entity
+     history, UI readback, markdown store, and reindex workflows.
+
+## What Not To Claim Yet
+
+Do not claim:
+
+- ELF beats qmd overall. Current live sweep is essentially tied, and qmd still owns
+  stronger local retrieval-debug ergonomics.
+- ELF has full-suite live real-world pass evidence. It does not.
+- ELF has private-corpus production quality proof. The private profile currently
+  fails closed without an operator-owned manifest.
+- ELF beats OpenViking on context trajectory. That scenario is not encoded.
+- ELF beats mem0/OpenMemory on hosted memory, entity history, UI, or optional graph
+  memory. Those scenarios are not encoded.
+- ELF beats Letta on core-vs-archival memory. That scenario is not encoded.
+- ELF beats RAGFlow, LightRAG, GraphRAG, Graphiti/Zep, or graphify on graph/RAG
+  navigation. Current evidence is research-gate or blocked.
+
+## Suggested Report Cadence
+
+Use this cadence for future benchmark-driven iteration:
+
+1. Keep `2026-06-11-competitor-strength-evidence-matrix.md` as the claim gate.
+2. Keep this report as the optimization direction.
+3. For each new adapter or suite, publish a dated benchmark report only when the run
+   changes a README-level claim or a production-adoption decision.
+4. Every report must classify evidence as `fixture_backed`, `live_baseline_only`,
+   `live_real_world`, or `research_gate`.
+5. Do not promote a reference project into a win/loss claim until the relevant
+   scenario is encoded and run at a comparable evidence class.
+
+## Recommended Next Reports
+
+The next reporting work should be ordered by decision value:
+
+1. ELF/qmd retrieval-debug deep profile.
+2. ELF live memory-evolution repair report.
+3. Operator-debugging live trace/viewer report.
+4. Capture/write-policy live adapter report.
+5. OpenViking context-trajectory report after evidence-bearing retrieval works.
+6. RAG/graph adapter pack report after Docker-contained outputs map to evidence ids.
+
+These are report and measurement directions, not implementation commitments.
diff --git a/docs/guide/benchmarking/index.md b/docs/guide/benchmarking/index.md
@@ -51,6 +51,9 @@ cleanup, use `docs/guide/single_user_production.md`.
   matrix contract that maps every tracked memory/RAG/graph project to its strongest
   scenario, current evidence class, typed blockers, next measurement gate, and ELF
   borrow-if-stronger direction.
+- `2026-06-11-elf-iteration-direction-from-competitor-benchmarks.md`: current
+  optimization-direction report that translates measured benchmark data and competitor
+  strengths into prioritized ELF iteration themes and explicit non-claims.
 - `real_world_agent_memory_benchmark.md`: operator overview for the v1 real-world
   agent memory benchmark contract, including suite taxonomy, typed report states,
   knowledge-compilation fixture tasks, and the production-ops fixture target.