@@ -0,0 +1,282 @@
# ELF Iteration Direction From Competitor Benchmarks - June 11, 2026

Goal: Convert the current benchmark evidence and competitor-strength matrix into an
iteration direction for ELF without overstating wins.
Read this when: You need to decide what ELF should learn from adjacent memory,
RAG, graph, and agent-continuity projects.
Inputs: `2026-06-11-competitor-strength-evidence-matrix.md`,
`2026-06-10-live-real-world-sweep-report.md`,
`2026-06-10-production-adoption-refresh.md`,
`2026-06-10-real-world-comparison-report.md`,
`apps/elf-eval/fixtures/real_world_external_adapters/memory_projects_manifest.json`,
and `docs/guide/research/external_memory_improvement_plan.md`.
Depends on: `docs/spec/real_world_agent_memory_benchmark_v1.md`.
Outputs: Current measured data, scenario gaps, and a prioritized optimization
direction for future ELF work.

## Executive Judgment

ELF is a credible personal-production foundation for a high-trust memory service, but
the current evidence does not prove broad superiority over all tracked projects.

The strongest current statement is:

- ELF is ahead on source-of-truth discipline, evidence-bound writes, rebuildable
derived indexes, typed failure reporting, and checked-in production-operation
evidence.
- ELF and qmd are tied on the encoded live retrieval, work-resume, and
project-decision slices. ELF does not yet beat qmd's local retrieval-debug
ergonomics.
- Many competitor strengths are still undermeasured: OpenViking context trajectory,
mem0/OpenMemory entity history and UI, agentmemory and claude-mem continuity
capture, Letta core-vs-archival memory, Graphiti/Zep temporal graph behavior, and
llm-wiki/gbrain/graphify knowledge workflows.
- The right next strategy is not to replace ELF with any one project. It is to keep
ELF's evidence-bound core and absorb the best measured or plausible product
patterns behind benchmark gates.

## Current Measured Data

### Fixture-Backed ELF Aggregate

`cargo make real-world-memory` currently reports:

| Metric | Value |
| --- | ---: |
| Jobs | `38` |
| Encoded suites | `11` |
| Pass | `36` |
| Blocked | `2` |
| Wrong result | `0` |
| Lifecycle fail | `0` |
| Incomplete | `0` |
| Not encoded | `0` |
| Unsupported claim | `0` |
| Mean score | `0.947` |
| Evidence coverage | `84/84` |
| Expected evidence recall | `77/77` |

This proves the fixture contract is broad and well controlled. It does not prove that
every live adapter or every competitor runtime passes those scenarios.

### Live Real-World Sweep

`cargo make real-world-memory-live-adapters` produced comparable full-suite live
sweeps for ELF and qmd:

| Adapter | Jobs | Pass | Wrong result | Incomplete | Blocked | Not encoded | Mean score | Evidence recall |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| ELF live service adapter | `38` | `18` | `5` | `1` | `2` | `12` | `0.514` | `41/75` |
| qmd live CLI adapter | `38` | `18` | `5` | `1` | `2` | `12` | `0.512` | `41/75` |

Interpretation:

- This is effectively a tie on the currently encoded live real-world sweep: status
counts are identical and mean scores differ by `0.002`.
- Both pass `trust_source_of_truth`, `work_resume`, `project_decisions`,
`retrieval`, and `personalization`.
- Both fail `memory_evolution` live conflict evidence with `wrong_result`.
- Both leave consolidation, knowledge compilation, operator debugging, capture
integration, and parts of production operations as `not_encoded` or incomplete.
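
The tie reading above can be checked mechanically. The following is a minimal sketch, not the `elf-eval` implementation: the `SweepRow` type, the `is_tie` helper, and the `0.005` score tolerance are all assumptions introduced for illustration. Only the numbers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepRow:
    """One row of the live sweep table (hypothetical shape)."""
    adapter: str
    jobs: int
    passed: int
    wrong_result: int
    incomplete: int
    blocked: int
    not_encoded: int
    mean_score: float
    recall_hit: int
    recall_total: int

    def status_counts(self):
        return (self.passed, self.wrong_result, self.incomplete,
                self.blocked, self.not_encoded)

    def is_consistent(self):
        # Every job must land in exactly one typed status.
        return sum(self.status_counts()) == self.jobs


def is_tie(a, b, score_tolerance=0.005):
    """Treat two sweeps as tied when statuses and recall match exactly and
    mean scores differ by no more than an assumed tolerance."""
    return (
        a.status_counts() == b.status_counts()
        and (a.recall_hit, a.recall_total) == (b.recall_hit, b.recall_total)
        and abs(a.mean_score - b.mean_score) <= score_tolerance
    )


elf = SweepRow("ELF live service adapter", 38, 18, 5, 1, 2, 12, 0.514, 41, 75)
qmd = SweepRow("qmd live CLI adapter", 38, 18, 5, 1, 2, 12, 0.512, 41, 75)
print(is_tie(elf, qmd))
```

A stricter tolerance would turn the same data into a non-tie, so any real gate should state its tolerance next to the claim.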

### Production Evidence

ELF has the strongest production-operation evidence among the tracked systems:

| Run | Scope | Result |
| --- | --- | --- |
| Provider synthetic | 8 documents, 6 queries, Qwen3-Embedding-8B, 4096 dimensions | `8/8`, `pass`, 59 seconds |
| Provider stress | 480 generated documents, 16 queries | `9/9`, `pass`, 779 seconds |
| Provider backfill | 2,000 generated documents, 16 queries, resume 1,000 -> 2,000 | `9/9`, `pass`, 2,804 seconds |
| Restore proof | Docker Compose backup/restore plus Qdrant rebuild | restored note searchable, zero rebuild errors |
| Private production corpus | operator-owned manifest required | failed closed, no pass claimed |

This is enough to support personal production use with bounded caveats. It is not a
private-corpus quality proof.

### External Adapter Ledger

The current adapter manifest lists 21 adapter records across 17 projects:

| Evidence class | Count | Meaning |
| --- | ---: | --- |
| `fixture_backed` | `1` | ELF real-world fixture scoring. |
| `live_baseline_only` | `6` | Docker same-corpus or lifecycle evidence without real-world job scoring. |
| `live_real_world` | `2` | ELF and qmd full-suite live sweeps. |
| `research_gate` | `12` | Source/setup/resource/output-contract evidence only. |

Overall adapter statuses:

| Status | Count |
| --- | ---: |
| `pass` | `1` |
| `wrong_result` | `6` |
| `lifecycle_fail` | `1` |
| `blocked` | `6` |
| `not_encoded` | `7` |

The ledger is intentionally not a leaderboard. It prevents fixture evidence,
same-corpus checks, research gates, and live real-world runs from being collapsed into
one misleading score.
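
One way to keep evidence classes from collapsing into a single score is to report statuses per class and never across classes. This is a sketch under stated assumptions: the record keys `evidence_class` and `status` are hypothetical and are not taken from `memory_projects_manifest.json`. The sample rows are a small subset drawn from the project guidance in this report.

```python
from collections import Counter, defaultdict

EVIDENCE_CLASSES = (
    "fixture_backed",
    "live_baseline_only",
    "live_real_world",
    "research_gate",
)


def breakdown_by_class(records):
    """Group status counts under each evidence class; reject unknown classes."""
    table = defaultdict(Counter)
    for record in records:
        evidence_class = record["evidence_class"]
        if evidence_class not in EVIDENCE_CLASSES:
            raise ValueError(f"unknown evidence class: {evidence_class!r}")
        table[evidence_class][record["status"]] += 1
    # Preserve the canonical class order and omit classes with no records.
    return {cls: dict(table[cls]) for cls in EVIDENCE_CLASSES if cls in table}


# Illustrative subset only; the record shape is an assumption.
sample = [
    {"project": "agentmemory", "evidence_class": "live_baseline_only", "status": "lifecycle_fail"},
    {"project": "mem0/OpenMemory", "evidence_class": "live_baseline_only", "status": "wrong_result"},
    {"project": "RAGFlow", "evidence_class": "research_gate", "status": "blocked"},
    {"project": "Letta", "evidence_class": "research_gate", "status": "not_encoded"},
]
report = breakdown_by_class(sample)
print(report)
```

The function deliberately returns no global pass rate, which mirrors the ledger's refusal to act as a leaderboard.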

## Scenario Conclusions

| Scenario | Current position | What ELF should learn next |
| --- | --- | --- |
| Retrieval/debug | ELF and qmd are tied on encoded live retrieval; qmd remains the stronger debug UX reference. | Add trace-level replay, expansion/fusion/rerank knobs, candidate-drop diagnosis, and command-line replay. |
| Work resume | ELF live work-resume passes; continuity-oriented competitors are undermeasured. | Borrow agentmemory/claude-mem capture breadth and OpenViking staged context, but require durable adapter proof. |
| Project decisions | ELF and qmd live project-decision suites pass; Letta is not encoded. | Add core-vs-archival decision-memory scenarios before comparing Letta. |
| Source of truth | ELF has the strongest measured source-of-truth evidence. | Borrow memsearch's local canonical-store ergonomics without making files or vectors authoritative. |
| Temporal memory | ELF fixture passes, but live memory evolution is `wrong_result`. | Prioritize current-vs-historical evidence links and Graphiti/Zep-style validity windows. |
| Consolidation | ELF fixture passes, but live proposal generation is not encoded. | Build reviewable derived proposals with source refs, confidence, unsupported-claim flags, and apply/defer/discard audit. |
| Knowledge pages | ELF fixture pages pass; live knowledge generation is not encoded. | Borrow llm-wiki lint/query-save loops, gbrain timelines, and graphify reports behind rebuild/lint benchmarks. |
| Operator debugging | Fixture UX passes; live trace/viewer scoring is not encoded. | Make viewer/CLI debugging a scored live surface, not just an admin convenience. |
| Capture/write policy | Fixture capture boundary passes; live capture is not encoded. | Borrow agentmemory/claude-mem capture hooks while preserving redaction and evidence binding. |
| Production ops | ELF has the strongest checked-in evidence, with private/credential gates blocked. | Keep Docker-first production proof and add private corpus only when an operator-owned manifest exists. |
| Personalization | ELF live personalization passes; mem0/OpenMemory and Letta are not encoded. | Add entity-scoped preference history and UI readback before claiming stronger personalization. |
| Context trajectory | Not comparable yet; OpenViking remains the reference. | Score staged retrieval, hierarchy expansion, and trajectory readback. |
| Core-vs-archival | Product gap, not a measured comparison yet. | Borrow Letta's core memory block shape with explicit scope, provenance, and read-only attachment. |
| Graph/RAG navigation | Research gates only. | Run RAGFlow, LightRAG, GraphRAG, Graphiti/Zep, and graphify adapters only when Docker outputs map to evidence ids. |
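
The temporal-memory row refers to validity windows and current-vs-historical evidence links. A minimal sketch of that idea follows. It is not Graphiti/Zep's API or ELF's schema: the `Fact` type, its field names, and the half-open window convention are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """A fact bound to an evidence id and a validity window (hypothetical shape)."""
    evidence_id: str
    subject: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact is still current

    def is_valid_on(self, day):
        # Half-open window: valid_from <= day < valid_to.
        return self.valid_from <= day and (self.valid_to is None or day < self.valid_to)


def answer_with_history(facts, subject, today):
    """Return evidence ids for the current fact and for superseded facts."""
    relevant = sorted(
        (f for f in facts if f.subject == subject),
        key=lambda f: f.valid_from,
    )
    current = [f.evidence_id for f in relevant if f.is_valid_on(today)]
    historical = [
        f.evidence_id
        for f in relevant
        if f.valid_to is not None and f.valid_to <= today
    ]
    return {"current": current, "historical": historical}


facts = [
    Fact("ev-1", "deploy_target", "staging-a", date(2026, 1, 1), date(2026, 3, 1)),
    Fact("ev-2", "deploy_target", "staging-b", date(2026, 3, 1)),
]
print(answer_with_history(facts, "deploy_target", date(2026, 6, 11)))
```

An answer built this way cites both the current and the superseded evidence, which is the behavior the live `memory_evolution` suite currently marks as missing.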

## Project Guidance Matrix

| Project | Current evidence | User-facing strength | ELF direction |
| --- | --- | --- | --- |
| ELF | `fixture_backed` plus `live_real_world`; live full sweep is `wrong_result`. | Evidence-linked memory service, strict provenance, rebuildable Qdrant, production backfill/restore proof. | Keep this as the core; do not weaken source-of-truth or typed failure semantics while adding product ergonomics. |
| qmd | `live_real_world` plus `live_baseline_only`; targeted retrieval passes, full sweep is `wrong_result`. | Local retrieval-debug workflow, transparent CLI, weighted fusion, rerank, replayable commands. | Treat qmd as the retrieval-debug bar. ELF should match its introspection and local replay without becoming CLI-only. |
| agentmemory | `live_baseline_only`; current status is `lifecycle_fail`. | Coding-agent continuity, hooks, MCP/REST packaging, viewer/console observability. | Borrow capture breadth and continuity UX, but require durable lifecycle proof before claims. |
| mem0/OpenMemory | `live_baseline_only`; current status is `wrong_result`. | Entity-scoped memory, lifecycle/history surfaces, hosted ecosystem, OpenMemory UI. | Add entity/preference history and UI readback patterns, while keeping hosted claims out of local OSS benchmarks. |
| memsearch | `live_baseline_only`; current status is `wrong_result` with source-of-truth gaps. | Markdown-first canonical store and local reindex clarity. | Borrow local inspectability and canonical-file ergonomics, not file-as-authority semantics. |
| OpenViking | `live_baseline_only` plus `research_gate`; current status is `wrong_result`. | Filesystem-like context model, hierarchy, staged context trajectory. | Add staged retrieval and trajectory scoring after same-corpus evidence output is correct. |
| claude-mem | `live_baseline_only`; current status is `wrong_result`. | Progressive disclosure, automatic capture, local viewer workflow. | Borrow progressive disclosure and viewer comfort; benchmark capture and operator-debugging live paths. |
| RAGFlow | `research_gate`; current status is `blocked`. | Full RAG application workflow with document/chunk/reference handles. | Use as a resource-aware RAG adapter benchmark, not as a current ELF competitor win/loss. |
| LightRAG | `research_gate`; current status is `blocked`. | Lightweight graph/RAG context export and source-path citation shape. | Borrow context-export ideas for graph/RAG navigation after Docker proof. |
| GraphRAG | `research_gate`; current status is `blocked`. | Graph summaries, document/text-unit tables, local/global search separation. | Borrow graph summary artifacts for knowledge pages and graph navigation after cost-bounded output proof. |
| Graphiti/Zep | `research_gate`; current status is `blocked`. | Temporal graph facts, validity windows, current-vs-historical answers. | Use as the semantic model for ELF temporal memory and relation validity benchmarks. |
| Letta | `research_gate`; current status is `not_encoded`. | Core memory blocks versus archival memory. | Add explicit scoped core blocks in ELF, but compare Letta only after a contained export path exists. |
| LangGraph | `research_gate`; current status is `not_encoded` or `unsupported` as a direct memory backend. | Checkpoint, replay, fork, and regression debugging for agent state. | Borrow replay/regression patterns for benchmark infrastructure, not as direct memory parity. |
| nanograph | `research_gate`; current status is `not_encoded` or `unsupported` as a full memory backend. | Typed graph schema and query ergonomics. | Borrow graph-lite DX and typed relation query ideas. |
| llm-wiki | `research_gate`; current status is `not_encoded`. | Maintained wiki pages, query-save, lint, and repair loops. | Use as a reference for rebuildable, cited knowledge pages. |
| gbrain | `research_gate`; current status is `not_encoded` and setup-blocked. | Compiled truth pages, timelines, and human-operable knowledge navigation. | Borrow current-truth plus timeline presentation after Docker-local setup proof exists. |
| graphify | `research_gate`; current status is `blocked`. | `graph.json`, `GRAPH_REPORT`, source-location graph navigation. | Borrow graph-compressed navigation only after Docker graph/report output maps to evidence ids. |

## Optimization Direction

### P0 - Close Measured Quality Gaps

These items have the highest leverage because current evidence already shows an ELF
gap or a near tie.

1. Live memory evolution correctness
   - Current state: fixture pass, live `wrong_result`.
   - Borrow from: Graphiti/Zep validity windows, mem0 history, ELF ingest-decision
     audit rows.
   - Target: live answers cite both current and historical conflict evidence, not only
     current retrieved text.
   - Benchmark gate: live `memory_evolution` pass for ELF before superiority claims.

2. qmd-level retrieval debugging
   - Current state: ELF and qmd tie on encoded results; qmd remains stronger in
     local debug ergonomics.
   - Borrow from: qmd weighted fusion, rerank explanation, local replay commands.
   - Target: every wrong result can be traced through expansion, dense retrieval,
     sparse retrieval, fusion, rerank, graph context, and final selection.
   - Benchmark gate: qmd deep profile plus ELF/qmd trace-level replay report.

3. Live operator debugging UX
   - Current state: fixture pass, live `not_encoded`.
   - Borrow from: claude-mem viewer, OpenMemory inspector, qmd command output.
   - Target: no raw SQL needed to explain a bad memory result.
   - Benchmark gate: live operator-debugging jobs score trace hydration, stage
     attribution, and repair-action clarity.
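
The stage-attribution target in the retrieval-debugging item can be sketched as a candidate-drop diagnosis over a recorded trace. The trace shape (an ordered list of stage name and candidate-id set) and the `diagnose_drop` helper are assumptions for illustration, not an existing ELF or qmd interface. The stage names follow the list in the target above.

```python
def diagnose_drop(trace, expected_id):
    """Name the stage that dropped an expected evidence id.

    Returns None if the id survives to the last stage, "never_retrieved" if no
    stage ever saw it, and otherwise the first stage after its last appearance.
    Dense and sparse retrieval run side by side, so missing from only one of
    them does not count as a drop.
    """
    present = [i for i, (_, ids) in enumerate(trace) if expected_id in ids]
    if not present:
        return "never_retrieved"
    last_seen = present[-1]
    if last_seen == len(trace) - 1:
        return None
    return trace[last_seen + 1][0]


# Hypothetical trace; in this sketch expansion yields queries, not candidates.
trace = [
    ("expansion", set()),
    ("dense_retrieval", {"ev-1", "ev-2"}),
    ("sparse_retrieval", {"ev-2", "ev-3"}),
    ("fusion", {"ev-1", "ev-2", "ev-3"}),
    ("rerank", {"ev-1", "ev-2"}),
    ("graph_context", {"ev-1", "ev-2"}),
    ("final_selection", {"ev-1"}),
]
for evidence_id in ("ev-1", "ev-2", "ev-3", "ev-9"):
    print(evidence_id, diagnose_drop(trace, evidence_id))
```

A report built on this kind of attribution would let an operator explain a bad result without raw SQL, provided each stage actually persists its candidate set.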

### P1 - Turn ELF Into A Better Daily Memory Product

These improve day-to-day usefulness while preserving ELF's evidence-bound core.

1. Capture and continuity
   - Borrow from: agentmemory hook breadth and claude-mem automatic capture review.
   - ELF shape: live ingestion must preserve redaction, excluded spans, source ids,
     and write-policy audit.
   - Benchmark gate: capture/write-policy live jobs with no secret leakage.

2. Reviewable consolidation
   - Borrow from: managed memory dreaming and Always-On Memory Agent scheduling.
   - ELF shape: derived proposals only; source notes are not silently rewritten.
   - Benchmark gate: consolidation proposals include lineage, confidence,
     unsupported-claim flags, and apply/defer/discard audit.

3. Knowledge pages
   - Borrow from: llm-wiki, gbrain, graphify, and GraphRAG.
   - ELF shape: project/entity/concept pages are rebuilt from authoritative notes and
     linted for unsupported or stale sections.
   - Benchmark gate: live knowledge-page rebuild/lint report, not fixture-only proof.

4. Core memory blocks
   - Borrow from: Letta core memory versus archival memory.
   - ELF shape: scoped read-only blocks with provenance and attachment rules, separate
     from archival search.
   - Benchmark gate: core-vs-archival jobs prove correct attachment, sharing, and
     fallback to search.
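
The reviewable-consolidation item describes proposals that carry lineage, confidence, unsupported-claim flags, and an apply/defer/discard audit. A minimal sketch of such a record follows. All type and field names are hypothetical, and the rule that `apply` is refused when lineage is missing or claims are unsupported is an assumption about how the gate might be enforced.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    APPLY = "apply"
    DEFER = "defer"
    DISCARD = "discard"


@dataclass
class Proposal:
    """A derived consolidation proposal; source notes are never edited here."""
    proposal_id: str
    summary: str
    source_refs: tuple            # lineage: ids of the notes it was derived from
    confidence: float
    unsupported_claims: list = field(default_factory=list)
    audit: list = field(default_factory=list)


def review(proposal, decision, reviewer):
    """Record a review decision, refusing to apply an unsupported proposal."""
    if decision is Decision.APPLY:
        if not proposal.source_refs:
            raise ValueError("cannot apply a proposal without lineage")
        if proposal.unsupported_claims:
            raise ValueError("cannot apply a proposal with unsupported claims")
    proposal.audit.append({"decision": decision.value, "reviewer": reviewer})
    return proposal


proposal = Proposal(
    proposal_id="p-1",
    summary="merge two duplicate preference notes",
    source_refs=("note-1", "note-2"),
    confidence=0.8,
    unsupported_claims=["preference applies to all projects"],
)
review(proposal, Decision.DEFER, "operator")
print(proposal.audit)
```

Keeping the audit trail on the proposal rather than on the source notes matches the "derived proposals only" shape described above.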

### P2 - Expand External Comparison Without Fake Wins

These are needed for broad credibility but should not block personal production use.

1. RAG and graph adapters
   - Current state: RAGFlow, LightRAG, GraphRAG, Graphiti/Zep, and graphify are
     adapter candidates, but still `research_gate`.
   - Benchmark gate: Docker-contained adapters must emit evidence-linked outputs
     before any live pass claim.

2. OpenViking context trajectory
   - Current state: setup is pinned, same-corpus retrieval is `wrong_result`, and
     staged trajectory is `not_encoded`.
   - Benchmark gate: evidence-bearing retrieval pass, then staged hierarchy/trajectory
     scoring.

3. mem0/OpenMemory and memsearch coverage
   - Current state: both are `wrong_result` or partially incomplete in local checks.
   - Benchmark gate: fix same-corpus correctness first; only then score entity
     history, UI readback, markdown store, and reindex workflows.

## What Not To Claim Yet

Do not claim:

- ELF beats qmd overall. The current live sweep is essentially tied, and qmd still
has stronger local retrieval-debug ergonomics.
- ELF has full-suite live real-world pass evidence. It does not.
- ELF has private-corpus production quality proof. The private profile currently
fails closed without an operator-owned manifest.
- ELF beats OpenViking on context trajectory. That scenario is not encoded.
- ELF beats mem0/OpenMemory on hosted memory, entity history, UI, or optional graph
memory. Those scenarios are not encoded.
- ELF beats Letta on core-vs-archival memory. That scenario is not encoded.
- ELF beats RAGFlow, LightRAG, GraphRAG, Graphiti/Zep, or graphify on graph/RAG
navigation. Current evidence is research-gate or blocked.

## Suggested Report Cadence

Use this cadence for future benchmark-driven iteration:

1. Keep `2026-06-11-competitor-strength-evidence-matrix.md` as the claim gate.
2. Keep this report as the optimization direction.
3. For each new adapter or suite, publish a dated benchmark report only when the run
changes a README-level claim or a production-adoption decision.
4. Every report must classify evidence as `fixture_backed`, `live_baseline_only`,
`live_real_world`, or `research_gate`.
5. Do not promote a reference project into a win/loss claim until the relevant
scenario is encoded and run at a comparable evidence class.
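
Rules 4 and 5 can be expressed as a small claim gate. This is a sketch under assumptions: "comparable evidence class" is read here as "the same class", and `research_gate` is treated as never sufficient for a win/loss claim. The `can_claim_win_loss` helper is hypothetical; only the four class names come from this report.

```python
from enum import Enum


class EvidenceClass(Enum):
    FIXTURE_BACKED = "fixture_backed"
    LIVE_BASELINE_ONLY = "live_baseline_only"
    LIVE_REAL_WORLD = "live_real_world"
    RESEARCH_GATE = "research_gate"


def can_claim_win_loss(scenario_encoded, ours, theirs):
    """Allow a win/loss claim only for an encoded scenario that both projects
    ran at the same evidence class (assumed reading of 'comparable')."""
    if not scenario_encoded:
        return False
    if ours is not theirs:
        return False
    # Research gates record setup and output-contract evidence only.
    return ours is not EvidenceClass.RESEARCH_GATE


# Parsing a report string fails loudly for anything outside the four classes.
live = EvidenceClass("live_real_world")
print(can_claim_win_loss(True, live, EvidenceClass.LIVE_REAL_WORLD))
```

Under this reading, the ELF and qmd live sweeps are comparable, while a fixture-backed ELF pass cannot be set against a competitor's research gate.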

## Recommended Next Reports

The next reporting work should be ordered by decision value:

1. ELF/qmd retrieval-debug deep profile.
2. ELF live memory-evolution repair report.
3. Operator-debugging live trace/viewer report.
4. Capture/write-policy live adapter report.
5. OpenViking context-trajectory report after evidence-bearing retrieval works.
6. RAG/graph adapter pack report after Docker-contained outputs map to evidence ids.

These are report and measurement directions, not implementation commitments.
3 changes: 3 additions & 0 deletions docs/guide/benchmarking/index.md
@@ -51,6 +51,9 @@ cleanup, use `docs/guide/single_user_production.md`.
matrix contract that maps every tracked memory/RAG/graph project to its strongest
scenario, current evidence class, typed blockers, next measurement gate, and ELF
borrow-if-stronger direction.
- `2026-06-11-elf-iteration-direction-from-competitor-benchmarks.md`: current
optimization-direction report that translates measured benchmark data and competitor
strengths into prioritized ELF iteration themes and explicit non-claims.
- `real_world_agent_memory_benchmark.md`: operator overview for the v1 real-world
agent memory benchmark contract, including suite taxonomy, typed report states,
knowledge-compilation fixture tasks, and the production-ops fixture target.