diff --git a/docs/guide/benchmarking/2026-06-11-elf-qmd-memory-evolution-diagnostic.md b/docs/guide/benchmarking/2026-06-11-elf-qmd-memory-evolution-diagnostic.md new file mode 100644 index 00000000..bf4e53a1 --- /dev/null +++ b/docs/guide/benchmarking/2026-06-11-elf-qmd-memory-evolution-diagnostic.md @@ -0,0 +1,211 @@ +# ELF/qmd Memory-Evolution Diagnostic - June 11, 2026 + +Goal: Explain the fresh live memory-evolution failures for ELF and qmd, and turn the +measured gaps into benchmark and optimization directions without implementing those +optimizations here. +Read this when: You need to decide whether ELF currently beats qmd on +current-vs-historical memory, supersession, delete/tombstone handling, or temporal +relation validity. +Inputs: Fresh local runs of `cargo make real-world-memory-evolution` and +`cargo make real-world-memory-live-adapters` on commit `87a388b`. +Outputs: Fixture evidence, live ELF/qmd job-level diagnosis, claim boundaries, and +future iteration directions. + +## Executive Judgment + +ELF does not yet have a production-quality live memory-evolution win. The fixture +suite passes, but the live adapter path still fails five of six current-vs-historical +jobs. + +The narrow fresh result is: + +- Fixture memory-evolution: `5/5` pass. +- ELF live memory-evolution: `1/6` pass, `5/6` wrong_result. +- qmd live memory-evolution: `0/6` pass, `6/6` wrong_result. + +ELF is better than qmd on this fresh live slice only in a limited sense: ELF retrieves +all required memory-evolution evidence and passes the delete/TTL tombstone job; qmd +misses three required evidence links and fails the delete/TTL job. + +That is not enough to claim ELF has solved memory evolution. The main live ELF gap is +not basic retrieval. ELF retrieves the current evidence, rationale evidence, and often +the relevant historical evidence, but the answer and trace do not explicitly encode +that a historical fact was superseded, invalidated, or preserved as history. 
The +scorer therefore records no conflict detection and assigns `0.0` lifecycle behavior +on the five supersession jobs. + +For a memory system meant to support real agents, this is a P0 product-quality gap: +users do not only ask for the newest note. They ask what changed, why, what used to be +true, which source is current, and whether an old conclusion is stale. + +## Fresh Runs + +| Command | Result | Runtime | +| --- | --- | ---: | +| `cargo make real-world-memory-evolution` | pass | 50.34 seconds | +| `cargo make real-world-memory-live-adapters` | pass | 112.26 seconds | + +The live adapter command emitted repeated Qdrant client/server compatibility warnings, +but it completed and wrote ELF and qmd reports. Treat the warning as benchmark-harness +risk, not as a run failure. + +## Fixture Baseline + +`cargo make real-world-memory-evolution` proves the benchmark contract itself can +score the intended behavior: + +| Metric | Value | +| --- | ---: | +| Jobs | `5` | +| Pass | `5` | +| Wrong result | `0` | +| Mean score | `1.000` | +| Expected evidence recall | `11/11` | +| Evidence coverage | `11/11` | +| Conflict detections | `5` | +| Update rationales available | `5` | +| History-readback encoded jobs | `1` | + +This is fixture evidence. It proves the scenario contract is encoded and scored. It +does not prove the ELF live service or qmd CLI path can produce the same behavior. 
+ +## Live Full-Sweep Context + +The fresh live sweep changed the qmd full-suite shape compared with the previous +coverage audit: + +| Adapter | Jobs | Pass | Wrong result | Blocked | Not encoded | Mean score | Mean latency | Expected evidence recall | Evidence coverage | +| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | +| ELF live service adapter | `38` | `18` | `5` | `2` | `13` | `0.525` | `8.620 ms` | `41/77` | `48/84` | +| qmd live CLI adapter | `38` | `17` | `6` | `2` | `13` | `0.486` | `691.163 ms` | `38/77` | `45/84` | + +Do not turn this into a broad win claim. The difference is explained by this +memory-evolution slice: qmd failed the delete/TTL job that ELF passed. + +## Live Memory-Evolution Result + +| Adapter | Jobs | Pass | Wrong result | Mean score | Expected evidence matched | Produced evidence | +| --- | ---: | ---: | ---: | ---: | ---: | ---: | +| ELF live service adapter | `6` | `1` | `5` | `0.492` | `13/13` | `13` | +| qmd live CLI adapter | `6` | `0` | `6` | `0.325` | `10/13` | `10` | + +### Job Matrix + +| Job | ELF status | ELF score | qmd status | qmd score | Diagnosis | +| --- | --- | ---: | --- | ---: | --- | +| `memory-evolution-benchmark-verdict-001` | wrong_result | `0.40` | wrong_result | `0.15` | ELF retrieved current verdict, caveat, and rationale, but did not cite the old not-ready verdict as historical. qmd also missed the private-corpus caveat evidence. | +| `memory-evolution-deploy-method-001` | wrong_result | `0.40` | wrong_result | `0.40` | Both retrieved current production runbook and supersession rationale, but neither explicitly preserved the old quickstart path as historical conflict evidence. | +| `memory-evolution-issue-state-001` | wrong_result | `0.40` | wrong_result | `0.40` | Both answered the current done state and resolution rationale, but neither surfaced the earlier blocked state as superseded history. 
| +| `memory-evolution-preference-001` | wrong_result | `0.40` | wrong_result | `0.15` | ELF retrieved current preference and rationale, but did not preserve the old terse preference as historical. qmd only returned the rationale evidence. | +| `memory-evolution-relation-temporal-001` | wrong_result | `0.35` | wrong_result | `0.35` | Both retrieved current and historical owners, but neither produced a scored temporal-validity explanation or update rationale. | +| `memory-evolution-delete-ttl-001` | pass | `1.00` | wrong_result | `0.50` | ELF retrieved both tombstone and current plan evidence. qmd retrieved only the current plan and missed the tombstone. | + +### Dimension Pattern + +For ELF's five wrong-result jobs, the pattern is consistent: + +| Dimension | Score pattern | +| --- | --- | +| `answer_correctness` | `0.0` on all five wrong-result jobs | +| `evidence_grounding` | `1.0` on all five wrong-result jobs | +| `lifecycle_behavior` | `0.0` on all five wrong-result jobs | +| `trap_avoidance` | `1.0` on all five wrong-result jobs | + +That means ELF usually finds the right evidence and avoids presenting stale facts as +current, but the answer is not lifecycle-aware enough. It does not represent the historical version +as a first-class part of the answer, so the benchmark cannot credit conflict +detection. + +qmd has the same lifecycle pattern, plus evidence misses: + +| qmd miss | Effect | +| --- | --- | +| `verdict-bounded-private-caveat` missing | Benchmark verdict job drops to `0.15`. | +| `pref-current-concise-rationale` missing | Preference job drops to `0.15`. | +| `delete-tombstone` missing | Delete/TTL job is `wrong_result` despite answering the current plan. | + +## What This Says About ELF + +ELF currently looks strong at current-fact retrieval and typed source-of-truth +discipline. It is not yet strong enough at memory evolution. + +The missing product behavior is a temporal reconciliation layer: + +1. 
Detect that current and historical evidence both relate to the same claim. +2. Explain which evidence is current and which is historical. +3. Preserve old facts when the user asks what changed. +4. Mark superseded facts as no longer current without deleting their historical value. +5. Expose tombstones and invalidation evidence as answerable lifecycle facts. +6. Emit trace artifacts that show conflict candidates, current winner, historical + loser, and update rationale. + +This is why the fixture can pass while the live path fails. The fixture response is a +curated memory-evolution answer. The live adapters are retrieval-backed materializers, +not full temporal reconciliation engines. + +## What ELF Should Borrow + +These are optimization directions, not implemented changes in this report: + +| Source/reference | Useful idea for ELF | Benchmark gate before claiming progress | +| --- | --- | --- | +| Graphiti/Zep | Temporal fact validity windows, invalidation, and current/historical graph facts. | Run the Graphiti/Zep temporal graph adapter and compare current, historical, and future-validity jobs. | +| mem0/OpenMemory | Entity-scoped memory history and user-visible memory lifecycle inspection. | Add entity/preference history readback and UI/export evidence checks. | +| Letta | Core memory blocks separate from archival memory. | Add core-vs-archival jobs that distinguish always-loaded operating context from retrieved history. | +| qmd | Local replay and candidate inspection ergonomics. | Emit ELF trace hydration with conflict candidates, demoted historical facts, and replay commands. | +| Existing ELF production ops | Tombstone and deletion semantics. | Extend delete/TTL scoring from one isolated job into update/delete/recreate history cases. | + +## Next Benchmark And Report Directions + +1. Live temporal reconciliation report + - Score whether ELF can answer "what changed?" with current evidence, + historical evidence, and update rationale in the same answer. 
+ - Include trace hydration for current winner, historical loser, and conflict + resolution reason. + +2. Graphiti/Zep temporal graph comparison + - Use the existing Graphiti/Zep research gate as the next real adapter target. + - The goal is not to copy a graph database blindly; it is to measure validity + windows and supersession semantics against ELF. + +3. mem0/OpenMemory history comparison + - Measure preference/entity history, correction, deletion, and user-visible + inspection. + - This directly maps to personal agent-memory expectations. + +4. qmd tombstone/delete diagnostic + - qmd is already the retrieval-debug reference, but it missed the delete tombstone + in this run. + - Keep this as a measured qmd gap before using qmd as a lifecycle reference. + +5. ELF trace-candidate conflict profile + - Add a report that shows top candidates for conflict jobs, not only final mapped + evidence ids. + - This should make it obvious whether historical evidence was absent, present but + unselected, or selected but not narrated. + +## Claim Boundaries + +Allowed claims: + +- The fixture memory-evolution suite passes. +- In the fresh live memory-evolution run, ELF outscored qmd and passed one job qmd + failed. +- ELF retrieved all required memory-evolution evidence in the live run. +- ELF still failed five of six live memory-evolution jobs because current-vs-historical + conflict detection was not encoded in the answer behavior. + +Not allowed: + +- ELF has solved memory evolution. +- ELF broadly beats qmd as a memory system. +- The fixture memory-evolution pass is live production proof. +- Graphiti/Zep, mem0/OpenMemory, or Letta are beaten; their strongest + scenarios still need comparable adapter reports. + +## Bottom Line + +The next ELF iteration should prioritize temporal reconciliation over more +generic retrieval work. 
Retrieval is good enough to find the needed evidence in this +slice; the failing behavior is deciding and explaining how current, historical, +deleted, and superseded memories relate. diff --git a/docs/guide/benchmarking/index.md b/docs/guide/benchmarking/index.md index 81e90780..1cc0563b 100644 --- a/docs/guide/benchmarking/index.md +++ b/docs/guide/benchmarking/index.md @@ -61,6 +61,10 @@ cleanup, use `docs/guide/single_user_production.md`. - `2026-06-11-elf-qmd-retrieval-debug-profile.md`: fresh ELF/qmd retrieval-debug profile with real-world retrieval-suite evidence, 480-document stress baseline evidence, qmd top-10 artifact inspection, and explicit rerank/fusion non-claims. +- `2026-06-11-elf-qmd-memory-evolution-diagnostic.md`: fresh ELF/qmd + memory-evolution diagnostic showing fixture pass, live ELF/qmd current-vs-historical + wrong-result patterns, qmd tombstone evidence miss, and temporal-reconciliation + iteration directions. - `real_world_agent_memory_benchmark.md`: operator overview for the v1 real-world agent memory benchmark contract, including suite taxonomy, typed report states, knowledge-compilation fixture tasks, and the production-ops fixture target. 
diff --git a/docs/research/2026-06-11-elf-qmd-memory-evolution-diagnostic.json b/docs/research/2026-06-11-elf-qmd-memory-evolution-diagnostic.json new file mode 100644 index 00000000..f7a639ae --- /dev/null +++ b/docs/research/2026-06-11-elf-qmd-memory-evolution-diagnostic.json @@ -0,0 +1,197 @@ +{ + "schema": "elf.memory_evolution_diagnostic_report/v1", + "run_id": "2026-06-11-elf-qmd-memory-evolution-diagnostic", + "commit": "87a388b6f33ff0142359876e5d9632fc096ee956", + "created_at": "2026-06-11", + "scope": "ELF versus qmd live memory-evolution behavior, current-vs-historical conflict diagnosis, and optimization directions", + "commands": [ + { + "command": "cargo make real-world-memory-evolution", + "status": "pass", + "runtime_seconds": 50.34, + "artifact": "tmp/real-world-memory/evolution-report.json" + }, + { + "command": "cargo make real-world-memory-live-adapters", + "status": "pass", + "runtime_seconds": 112.26, + "artifact": "tmp/real-world-memory/live-adapters/" + } + ], + "fixture_memory_evolution": { + "job_count": 5, + "pass": 5, + "wrong_result": 0, + "mean_score": 1.0, + "expected_evidence_total": 11, + "expected_evidence_matched": 11, + "conflict_detection_count": 5, + "update_rationale_available_count": 5, + "history_readback_encoded_count": 1 + }, + "live_full_sweep_context": { + "elf": { + "job_count": 38, + "pass": 18, + "wrong_result": 5, + "blocked": 2, + "not_encoded": 13, + "mean_score": 0.525, + "mean_latency_ms": 8.62, + "expected_evidence_total": 77, + "expected_evidence_matched": 41, + "evidence_required_count": 84, + "evidence_covered_count": 48 + }, + "qmd": { + "job_count": 38, + "pass": 17, + "wrong_result": 6, + "blocked": 2, + "not_encoded": 13, + "mean_score": 0.486, + "mean_latency_ms": 691.163, + "expected_evidence_total": 77, + "expected_evidence_matched": 38, + "evidence_required_count": 84, + "evidence_covered_count": 45 + } + }, + "live_memory_evolution": { + "elf": { + "jobs": 6, + "pass": 1, + "wrong_result": 5, + 
"mean_score": 0.4916666666666667, + "expected_evidence_total": 13, + "expected_evidence_matched": 13, + "produced_evidence_total": 13, + "diagnosis": "ELF retrieved all required evidence but failed supersession jobs because conflict detection and lifecycle-aware current-vs-historical answer behavior were not emitted." + }, + "qmd": { + "jobs": 6, + "pass": 0, + "wrong_result": 6, + "mean_score": 0.325, + "expected_evidence_total": 13, + "expected_evidence_matched": 10, + "produced_evidence_total": 10, + "diagnosis": "qmd had the same missing conflict-detection pattern and additionally missed three required evidence links, including the delete tombstone." + } + }, + "job_diagnosis": [ + { + "job_id": "memory-evolution-benchmark-verdict-001", + "elf_status": "wrong_result", + "elf_score": 0.4, + "qmd_status": "wrong_result", + "qmd_score": 0.15, + "diagnosis": "ELF retrieved current verdict, caveat, and rationale but did not cite the old not-ready verdict as historical; qmd also missed private-corpus caveat evidence." + }, + { + "job_id": "memory-evolution-deploy-method-001", + "elf_status": "wrong_result", + "elf_score": 0.4, + "qmd_status": "wrong_result", + "qmd_score": 0.4, + "diagnosis": "Both retrieved the current runbook and supersession rationale but did not preserve the old quickstart path as historical conflict evidence." + }, + { + "job_id": "memory-evolution-issue-state-001", + "elf_status": "wrong_result", + "elf_score": 0.4, + "qmd_status": "wrong_result", + "qmd_score": 0.4, + "diagnosis": "Both answered the current done state and rationale but did not surface the earlier blocked state as superseded history." + }, + { + "job_id": "memory-evolution-preference-001", + "elf_status": "wrong_result", + "elf_score": 0.4, + "qmd_status": "wrong_result", + "qmd_score": 0.15, + "diagnosis": "ELF retrieved current preference and rationale but did not preserve the old terse preference as historical; qmd only returned rationale evidence." 
+ }, + { + "job_id": "memory-evolution-relation-temporal-001", + "elf_status": "wrong_result", + "elf_score": 0.35, + "qmd_status": "wrong_result", + "qmd_score": 0.35, + "diagnosis": "Both retrieved current and historical owners but did not emit scored temporal-validity explanation or update rationale." + }, + { + "job_id": "memory-evolution-delete-ttl-001", + "elf_status": "pass", + "elf_score": 1.0, + "qmd_status": "wrong_result", + "qmd_score": 0.5, + "diagnosis": "ELF retrieved tombstone and current plan evidence; qmd retrieved only the current plan and missed the tombstone." + } + ], + "elf_failure_pattern": { + "wrong_result_jobs": 5, + "answer_correctness_score": 0.0, + "evidence_grounding_score": 1.0, + "lifecycle_behavior_score": 0.0, + "trap_avoidance_score": 1.0, + "interpretation": "The issue is lifecycle-aware reconciliation and narration, not basic evidence retrieval." + }, + "claim_boundary": { + "fixture_claim": "fixture_memory_evolution_passes", + "live_claim": "elf_narrowly_outscores_qmd_on_this_fresh_slice_but_does_not_solve_memory_evolution", + "not_allowed": [ + "ELF broadly beats qmd as a memory system", + "ELF has solved temporal memory evolution", + "fixture pass is production proof", + "Graphiti/Zep, mem0/OpenMemory, or Letta are beaten" + ] + }, + "optimization_directions": [ + { + "direction": "temporal_reconciliation_layer", + "description": "Detect current and historical evidence for the same claim, choose the current winner, preserve the historical loser, and cite update rationale." + }, + { + "direction": "history_readback_and_note_version_links", + "description": "Expose add/update/delete/ignore history and version links for user preference and entity memory changes." + }, + { + "direction": "tombstone_and_invalidation_evidence", + "description": "Treat deletion and TTL tombstones as answerable evidence instead of only suppressing stale retrieval." 
+ }, + { + "direction": "trace_conflict_candidates", + "description": "Hydrate trace artifacts with conflict candidates, current winners, historical losers, dropped candidates, and replay commands." + } + ], + "borrow_from": [ + { + "project": "Graphiti/Zep", + "borrow": "temporal fact windows, invalidation, supersession, and graph fact provenance", + "benchmark_gate": "Graphiti/Zep temporal graph adapter for current, historical, and future-valid facts" + }, + { + "project": "mem0/OpenMemory", + "borrow": "entity-scoped history, lifecycle inspection, and memory UI/readback", + "benchmark_gate": "entity and preference history readback with correction and deletion evidence" + }, + { + "project": "Letta", + "borrow": "core memory blocks versus archival memory", + "benchmark_gate": "core-vs-archival jobs for operating context and historical retrieval" + }, + { + "project": "qmd", + "borrow": "local replay and candidate inspection ergonomics", + "benchmark_gate": "ELF trace hydration with conflict candidates and replay commands" + } + ], + "next_reports": [ + "Live temporal reconciliation report", + "Graphiti/Zep temporal graph comparison", + "mem0/OpenMemory history comparison", + "qmd tombstone/delete diagnostic", + "ELF trace-candidate conflict profile" + ] +}
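As a consistency check on the summary figures in this report, the per-job statuses and scores in `job_diagnosis` can be re-aggregated and compared with the `live_memory_evolution` totals. The following is a minimal Python sketch, not part of the benchmark harness: the `summarize` helper and the inlined `JOB_DIAGNOSIS` list are hypothetical names used only for illustration, and the values are copied by hand from the JSON report rather than read from a harness artifact.

```python
from statistics import fmean

# Statuses and scores copied from the report's job_diagnosis array.
# Each tuple is (job_id, elf_status, elf_score, qmd_status, qmd_score).
JOB_DIAGNOSIS = [
    ("memory-evolution-benchmark-verdict-001", "wrong_result", 0.40, "wrong_result", 0.15),
    ("memory-evolution-deploy-method-001", "wrong_result", 0.40, "wrong_result", 0.40),
    ("memory-evolution-issue-state-001", "wrong_result", 0.40, "wrong_result", 0.40),
    ("memory-evolution-preference-001", "wrong_result", 0.40, "wrong_result", 0.15),
    ("memory-evolution-relation-temporal-001", "wrong_result", 0.35, "wrong_result", 0.35),
    ("memory-evolution-delete-ttl-001", "pass", 1.00, "wrong_result", 0.50),
]


def summarize(jobs, adapter):
    """Aggregate job count, pass/wrong_result counts, and mean score for one adapter.

    `adapter` is "elf" or "qmd"; this helper is illustrative, not a harness API.
    """
    status_index, score_index = (1, 2) if adapter == "elf" else (3, 4)
    statuses = [job[status_index] for job in jobs]
    scores = [job[score_index] for job in jobs]
    return {
        "jobs": len(jobs),
        "pass": statuses.count("pass"),
        "wrong_result": statuses.count("wrong_result"),
        # Rounded to three decimals to match the precision used in the report tables.
        "mean_score": round(fmean(scores), 3),
    }


ELF_SUMMARY = summarize(JOB_DIAGNOSIS, "elf")
QMD_SUMMARY = summarize(JOB_DIAGNOSIS, "qmd")

if __name__ == "__main__":
    print("elf:", ELF_SUMMARY)
    print("qmd:", QMD_SUMMARY)
```

With this report's values, the recomputed counts should match the `1/6` and `0/6` pass figures, and the means should round to the reported `0.492` for ELF and `0.325` for qmd. A mismatch would suggest the job matrix and the summary block have drifted apart.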