diff --git a/README.md b/README.md
index 828d1821..60535d0f 100644
--- a/README.md
+++ b/README.md
@@ -143,6 +143,10 @@
   with the production embedding provider path, `Qwen3-Embedding-8B`, and passed same-corpus retrieval
   but failed lifecycle/cold-start coverage. memsearch, mem0, OpenViking, and claude-mem remained
   `incomplete` or wrong-result typed states; those states are reported as limitations, not hidden as proof.
+- Real-world agent memory aggregate after the P1 benchmark batch: 38 fixture-backed
+  jobs across 11 suites, 35 pass, 1 incomplete, 2 blocked, 0 wrong-result,
+  0 not-encoded, and 0 unsupported-claim results. The remaining non-pass jobs are
+  production-ops operator boundaries, not hidden benchmark wins.
 - The benchmark runner and report publisher are checked in and Docker-isolated:
   `cargo make baseline-live-docker`, `cargo make baseline-backfill-docker`,
   `cargo make baseline-production-private-addendum`,
@@ -157,19 +161,30 @@ Detailed evidence and interpretation:
 - [Live Baseline Benchmark Report - June 9, 2026](docs/guide/benchmarking/2026-06-09-live-baseline-report.md)
 - [Synthetic Production Corpus Benchmark Report - June 9, 2026](docs/guide/benchmarking/2026-06-09-production-corpus-report.md)
 - [Production Adoption Gate Report - June 9, 2026](docs/guide/benchmarking/2026-06-09-production-adoption-gate-report.md)
+- [Real-World Comparison Report - June 10, 2026](docs/guide/benchmarking/2026-06-10-real-world-comparison-report.md)
 - [Live Baseline Benchmark Runbook](docs/guide/benchmarking/live_baseline_benchmark.md)
 - [Single-User Production Runbook](docs/guide/single_user_production.md)
-- Future benchmark contract:
+- Benchmark contract:
   [Real-World Agent Memory Benchmark v1](docs/spec/real_world_agent_memory_benchmark_v1.md).
-  This contract defines job-level suites for agent work. Checked-in fixture runners now
-  cover a smoke work-resume slice and proposal-only consolidation cases through
-  `cargo make real-world-job-smoke` and `cargo make real-world-memory-consolidation`,
-  and `cargo make real-world-memory` now reports the first external adapter coverage
-  manifest for ELF, qmd, agentmemory, mem0/OpenMemory, claude-mem, memsearch, and
-  OpenViking. Those real-world reports still distinguish fixture-backed and
-  live-baseline-only evidence from true live real-world adapter runs; no external
-  project has a live real-world suite win until an adapter actually executes
-  `real_world_job` prompts and scoring.
+  This contract defines job-level suites for agent work. `cargo make real-world-memory`
+  now reports fixture-backed ELF evidence plus the external adapter coverage manifest
+  for ELF, qmd, agentmemory, mem0/OpenMemory, claude-mem, memsearch, and OpenViking.
+  The report still distinguishes fixture-backed and live-baseline-only evidence from
+  true live real-world adapter runs; no external project has a live real-world suite win
+  until an adapter actually executes `real_world_job` prompts and scoring.
+
+Evidence-backed position after the June 10 real-world report:
+
+- ELF is better evidenced than the tested alternatives on evidence-bound writes,
+  deterministic ingestion boundaries, Postgres source-of-truth plus rebuildable Qdrant
+  indexing, scoped service APIs, and fixture-backed provenance/resume/evolution checks.
+- ELF and qmd are both strong in the current encoded retrieval evidence: qmd remains
+  the local retrieval-debug baseline, while ELF has the stronger service and provenance
+  contract.
+- ELF is still behind or not yet proven on live real-world external adapters,
+  private-corpus production quality, credentialed production-ops gates, qmd-style local
+  debug knobs, agentmemory/claude-mem/OpenMemory-style continuity UX, OpenViking-style
+  context trajectory, and hosted managed memory.
 Quick comparison snapshot (objective/high-level).
 This table compares capability coverage, not overall project quality.
@@ -222,7 +237,8 @@ Detailed comparison, mechanism-level analysis, and source map:
 - [Agent Memory Selection Research Run](docs/research/2026-06-08-agent-memory-selection.json)
 - [Real-World Benchmark Dimension Research Run](docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json)
 
-Latest external research refresh: June 9, 2026.
+Latest real-world benchmark report: June 10, 2026. Latest external research refresh:
+June 9, 2026.
 
 ## Documentation
 
diff --git a/apps/elf-eval/fixtures/real_world_external_adapters/memory_projects_manifest.json b/apps/elf-eval/fixtures/real_world_external_adapters/memory_projects_manifest.json
index c66ebd56..1c37fc4c 100644
--- a/apps/elf-eval/fixtures/real_world_external_adapters/memory_projects_manifest.json
+++ b/apps/elf-eval/fixtures/real_world_external_adapters/memory_projects_manifest.json
@@ -20,7 +20,7 @@
       "evidence_class": "fixture_backed",
       "docker_default": true,
       "host_global_installs_required": false,
-      "overall_status": "wrong_result",
+      "overall_status": "incomplete",
       "setup": {
         "status": "pass",
         "evidence": "The checked-in real_world_memory fixtures parse and score through the ELF fixture runner.",
@@ -28,13 +28,13 @@
         "artifact": "tmp/real-world-memory/real-world-memory-report.json"
       },
       "run": {
-        "status": "wrong_result",
-        "evidence": "The current fixture set reports 27 jobs, 25 pass, 1 wrong_result, and 1 not_encoded.",
+        "status": "incomplete",
+        "evidence": "The current fixture set reports 38 jobs, 35 pass, 1 incomplete, 2 blocked, 0 wrong_result, 0 not_encoded, and 0 unsupported_claim.",
         "command": "cargo make real-world-memory",
         "artifact": "tmp/real-world-memory/real-world-memory-report.json"
       },
       "result": {
-        "status": "wrong_result",
+        "status": "incomplete",
         "evidence": "This is fixture-backed ELF scoring, not a live external adapter result.",
         "artifact": "tmp/real-world-memory/real-world-memory-report.md"
       },
@@ -66,40 +66,50 @@
           "status": "pass",
           "evidence": "Checked-in work-resume fixtures are encoded and passing."
         },
+        {
+          "suite_id": "project_decisions",
+          "status": "pass",
+          "evidence": "Checked-in project-decision fixtures cover accepted decisions, reversals, current validation gates, rationale, and bounded caveats."
+        },
         {
           "suite_id": "retrieval",
           "status": "pass",
-          "evidence": "Checked-in retrieval fixtures are encoded; one deliberate operator-debug wrong-result case is reported under operator_debugging_ux."
+          "evidence": "Checked-in retrieval fixtures cover alternate phrasing, distractors, multi-hop routing, current-versus-obsolete selection, and minimal context."
         },
         {
           "suite_id": "memory_evolution",
-          "status": "not_encoded",
-          "evidence": "The relation temporal-validity case is deliberately not_encoded until temporal graph validity is implemented."
+          "status": "pass",
+          "evidence": "Checked-in memory-evolution fixtures cover current-versus-historical facts and the relation temporal-validity case is encoded."
         },
         {
-          "suite_id": "operator_debugging_ux",
-          "status": "wrong_result",
-          "evidence": "The aggregate fixture set includes one deliberate wrong-result trace attribution case."
+          "suite_id": "consolidation",
+          "status": "pass",
+          "evidence": "Proposal-only consolidation fixtures are encoded and passing without source mutation."
         },
         {
-          "suite_id": "capture_integration",
+          "suite_id": "knowledge_compilation",
           "status": "pass",
-          "evidence": "The redaction and capture-boundary fixture is encoded and passing."
+          "evidence": "Knowledge page fixtures are encoded and passing with citation and rebuild metrics."
         },
         {
-          "suite_id": "personalization",
+          "suite_id": "operator_debugging_ux",
           "status": "pass",
-          "evidence": "The scoped preference fixture is encoded and passing."
+          "evidence": "Operator-debugging fixtures now expose stage attribution and dropped-candidate evidence without raw SQL."
         },
         {
-          "suite_id": "consolidation",
+          "suite_id": "capture_integration",
           "status": "pass",
-          "evidence": "Proposal-only consolidation fixtures are encoded and passing without source mutation."
+          "evidence": "The redaction and capture-boundary fixture is encoded and passing."
         },
         {
-          "suite_id": "knowledge_compilation",
+          "suite_id": "production_ops",
+          "status": "incomplete",
+          "evidence": "Production-ops fixtures encode restore, Qdrant rebuild, backfill resume, resource-envelope interpretation, plus typed incomplete and blocked operator boundaries."
+        },
+        {
+          "suite_id": "personalization",
           "status": "pass",
-          "evidence": "Knowledge page fixtures are encoded and passing with citation and rebuild metrics."
+          "evidence": "The scoped preference fixture is encoded and passing."
         }
       ],
       "evidence": [
@@ -115,7 +125,8 @@
         }
       ],
       "notes": [
-        "This adapter record exists to keep ELF fixture results separate from live external adapter results."
+        "This adapter record exists to keep ELF fixture results separate from live external adapter results.",
+        "The remaining non-pass ELF fixture states are production-ops operator boundaries: a Docker local-embedding dependency, provider credentials, and an operator-owned private corpus manifest."
       ],
       "follow_up": {
         "title": "[ELF benchmark vNext] Replace fixture-only ELF answers with live real-world adapter execution where appropriate",
diff --git a/apps/elf-eval/tests/real_world_job_benchmark.rs b/apps/elf-eval/tests/real_world_job_benchmark.rs
index eb1d38ca..04a8b409 100644
--- a/apps/elf-eval/tests/real_world_job_benchmark.rs
+++ b/apps/elf-eval/tests/real_world_job_benchmark.rs
@@ -224,7 +224,7 @@ fn real_world_report_includes_external_adapter_coverage_manifest() -> Result<()>
         report
             .pointer("/external_adapters/summary/overall_status_counts/wrong_result")
             .and_then(Value::as_u64),
-        Some(4)
+        Some(3)
     );
     assert_eq!(
         report
@@ -236,7 +236,7 @@ fn real_world_report_includes_external_adapter_coverage_manifest() -> Result<()>
         report
             .pointer("/external_adapters/summary/overall_status_counts/incomplete")
             .and_then(Value::as_u64),
-        Some(1)
+        Some(2)
     );
     assert_eq!(
         report
@@ -258,6 +258,7 @@ fn real_world_report_includes_external_adapter_coverage_manifest() -> Result<()>
     let openviking = find_by_field(adapters, "/adapter_id", "openviking_live_baseline")?;
 
     assert_eq!(elf.pointer("/evidence_class").and_then(Value::as_str), Some("fixture_backed"));
+    assert_eq!(elf.pointer("/overall_status").and_then(Value::as_str), Some("incomplete"));
     assert_eq!(qmd.pointer("/overall_status").and_then(Value::as_str), Some("pass"));
     assert_eq!(qmd.pointer("/suites/0/status").and_then(Value::as_str), Some("not_encoded"));
     assert_eq!(
diff --git a/docs/guide/benchmarking/2026-06-10-real-world-comparison-report.md b/docs/guide/benchmarking/2026-06-10-real-world-comparison-report.md
new file mode 100644
index 00000000..1082526c
--- /dev/null
+++ b/docs/guide/benchmarking/2026-06-10-real-world-comparison-report.md
@@ -0,0 +1,177 @@
+# Real-World Comparison Report - June 10, 2026
+
+Goal: Publish the post-P1 real-world agent memory benchmark evidence and adoption
+implications.
+Read this when: You need the checked-in evidence behind README-level real-world
+benchmark claims after XY-833 and XY-861 through XY-864 landed.
+Inputs: Generated reports under `tmp/real-world-memory/` and `tmp/real-world-job/`,
+`apps/elf-eval/fixtures/real_world_external_adapters/memory_projects_manifest.json`,
+and the live-baseline reports linked from this guide.
+Depends on: `docs/spec/real_world_agent_memory_benchmark_v1.md`,
+`docs/guide/benchmarking/real_world_agent_memory_benchmark.md`, and
+`docs/guide/benchmarking/live_baseline_benchmark.md`.
+Verification: The commands listed below were run from branch `y/elf-xy-865`. The
+generated reports used runner version
+`0.2.0-89d30dc04a854771f2a62f607e1d13498ccb3073-aarch64-apple-darwin`; the working
+tree also contained the adapter manifest refresh recorded here.
+
+## Context
+
+Dependency batch state at report time:
+
+| Issue | Result | PR |
+| --- | --- | --- |
+| XY-833 operator-debugging UX repair | Done | `https://github.com/hack-ink/ELF/pull/147` |
+| XY-861 project-decision suite | Done | `https://github.com/hack-ink/ELF/pull/151` |
+| XY-862 production-ops suite | Done | `https://github.com/hack-ink/ELF/pull/148` |
+| XY-863 graph temporal validity | Done | `https://github.com/hack-ink/ELF/pull/150` |
+| XY-864 external adapter comparison contract | Done | `https://github.com/hack-ink/ELF/pull/149` |
+
+This report is for the XY-865 branch `y/elf-xy-865` and PR title
+`XY-865: [ELF benchmark vNext P1] Publish real-world comparison report and adoption plan`.
+
+No private-corpus or credentialed provider checks were run for this report because no
+operator-owned private manifest or routed provider credentials were supplied. Those
+paths remain typed `blocked` boundaries, not passes.
+
+## Commands
+
+| Command | Generated artifact | Run ID | Generated at |
+| --- | --- | --- | --- |
+| `cargo make real-world-memory` | `tmp/real-world-memory/real-world-memory-report.{json,md}` | `real-world-memory` | `2026-06-10T04:21:32.545027Z` |
+| `cargo make real-world-memory-project-decisions` | `tmp/real-world-memory/project-decisions/report.{json,md}` | `real-world-memory-project-decisions` | `2026-06-10T04:21:52.403238Z` |
+| `cargo make real-world-memory-production-ops` | `tmp/real-world-memory/production-ops-report.{json,md}` | `real-world-memory-production-ops` | `2026-06-10T04:21:59.520163Z` |
+| `cargo make real-world-memory-evolution` | `tmp/real-world-memory/evolution-report.{json,md}` | `real-world-memory-evolution` | `2026-06-10T04:22:06.325152Z` |
+| `cargo make real-world-job-operator-ux` | `tmp/real-world-job/real-world-job-operator-ux-report.{json,md}` | `real-world-job-operator-ux` | `2026-06-10T04:22:12.28938Z` |
+
+All generated reports used runner version
+`0.2.0-89d30dc04a854771f2a62f607e1d13498ccb3073-aarch64-apple-darwin`.
+
+## Aggregate Result
+
+`cargo make real-world-memory` now reports `38` jobs across all `11` encoded real-world
+suites:
+
+| Metric | Value |
+| --- | ---: |
+| Pass | `35` |
+| Incomplete | `1` |
+| Blocked | `2` |
+| Wrong result | `0` |
+| Lifecycle fail | `0` |
+| Not encoded | `0` |
+| Unsupported claim | `0` |
+| Mean score | `0.921` |
+| Evidence coverage | `82/82` (`1.000`) |
+| Source-ref coverage | `82/82` (`1.000`) |
+| Quote coverage | `82/82` (`1.000`) |
+| Expected evidence recall | `75/75` (`1.000`) |
+| Redaction leaks | `0` |
+| Scope violations | `0` |
+| Temporal validity gaps | `0` |
+| Qdrant rebuild cases | `2/2` pass |
+
+Suite-level outcomes:
+
+| Suite | Jobs | Status | Mean score | Interpretation |
+| --- | ---: | --- | ---: | --- |
+| `trust_source_of_truth` | 1 | `pass` | `1.000` | Source-of-truth rebuild fixture passed. |
+| `work_resume` | 5 | `pass` | `1.000` | Resume and exact next-action fixtures passed. |
+| `project_decisions` | 5 | `pass` | `1.000` | Current decisions, reversals, rationale, and caveats passed. |
+| `retrieval` | 5 | `pass` | `1.000` | Retrieval fixtures with distractors and obsolete context passed. |
+| `memory_evolution` | 6 | `pass` | `1.000` | Current-vs-historical and temporal relation validity passed. |
+| `consolidation` | 4 | `pass` | `1.000` | Proposal-only consolidation passed with `0` source mutations. |
+| `knowledge_compilation` | 2 | `pass` | `1.000` | Derived page fixtures passed with citation/rebuild checks. |
+| `operator_debugging_ux` | 1 | `pass` | `1.000` | Aggregate stage-attribution fixture passed. |
+| `capture_integration` | 2 | `pass` | `1.000` | Redaction and capture-boundary fixtures passed. |
+| `production_ops` | 6 | `incomplete` | `0.500` | Three jobs passed, one is a typed dependency `incomplete`, and two are typed operator `blocked`. |
+| `personalization` | 1 | `pass` | `1.000` | Scoped preference correction passed. |
+
+## Focused P1 Slices
+
+| Command | Jobs | Status summary | Evidence notes |
+| --- | ---: | --- | --- |
+| `cargo make real-world-memory-project-decisions` | 5 | `5` pass | Current decision, historical/reversed decision, validation gate, tradeoff rationale, and private-manifest caveat all passed. |
+| `cargo make real-world-memory-evolution` | 5 | `5` pass | Temporal relation validity is now encoded and passing; stale answers `0`, conflict detections `5`, update rationales `5`. |
+| `cargo make real-world-job-operator-ux` | 5 | `5` pass | Dropped evidence, rerank promotion, provider latency, rebuild change, and misleading relation-context debug cases passed with raw SQL needed `0`. |
+| `cargo make real-world-memory-production-ops` | 6 | `3` pass, `1` incomplete, `2` blocked | Restore/Qdrant rebuild, interrupted backfill resume, and resource envelope passed; local embedding dependency, provider credentials, and private manifest remain typed non-pass boundaries. |
+
+## External Adapter Evidence
+
+The real-world runner loads
+`apps/elf-eval/fixtures/real_world_external_adapters/memory_projects_manifest.json`.
+That manifest is an evidence ledger, not a leaderboard. It keeps three evidence classes
+separate:
+
+| Evidence class | Count | Meaning |
+| --- | ---: | --- |
+| `fixture_backed` | 1 | ELF fixture scoring through checked-in real-world jobs. |
+| `live_baseline_only` | 6 | Docker same-corpus/lifecycle evidence from the live-baseline runner only. |
+| `live_real_world` | 0 | No external project currently executes `real_world_job` prompts and scoring. |
+
+Adapter-level status after refreshing the manifest:
+
+| Project | Evidence class | Overall status | What is proven | What is not proven |
+| --- | --- | --- | --- | --- |
+| ELF | `fixture_backed` | `incomplete` | Fixture-backed real-world scoring passes 10 of 11 suites, with production-ops typed boundaries preserved. | A live end-to-end real-world service adapter is not encoded. |
+| qmd | `live_baseline_only` | `pass` | Docker same-corpus retrieval, update, delete, and cold-start live-baseline checks pass. | qmd does not yet run any real-world job suite. |
+| agentmemory | `live_baseline_only` | `lifecycle_fail` | Same-corpus retrieval can run through current adapter. | Durable storage/cold-start lifecycle and real-world suites are blocked by the current in-memory adapter path. |
+| mem0/OpenMemory | `live_baseline_only` | `wrong_result` | Local OSS setup is represented separately from hosted/OpenMemory claims. | Same-corpus retrieval was not a clean pass and no real-world job adapter is encoded. |
+| memsearch | `live_baseline_only` | `wrong_result` | Markdown-first design remains a source-of-truth ergonomics reference. | Same-corpus retrieval was not a clean pass and real-world suites are incomplete/not encoded. |
+| OpenViking | `live_baseline_only` | `incomplete` | Hierarchical context trajectory remains a reference direction. | Docker local-embedding setup must be pinned before fair retrieval or real-world jobs can run. |
+| claude-mem | `live_baseline_only` | `wrong_result` | Progressive disclosure and local viewer remain UX references. | Current Docker evidence is not a clean same-corpus pass and progressive disclosure jobs are not encoded. |
+
+External summary counters: `7` adapter records, `6` external projects, `7` Docker-default,
+`0` host-global-install requirements, `0` live real-world adapters, `3` external
+wrong-result overall states, `1` lifecycle-fail state, and `1` external incomplete state.
+
+## Remaining Gaps
+
+Every remaining non-pass state is either a follow-up or an explicit non-goal for this
+report:
+
+| Gap | Status | Follow-up or non-goal |
+| --- | --- | --- |
+| ELF production-ops cold-start dependency fixture | `incomplete` | `[ELF benchmark P0] Pin Docker-compatible local embedding dependency for cold-start adapter checks`. |
+| ELF provider-backed production-ops gate | `blocked` | Run only with routed operator credentials; credentials were not supplied for this report. |
+| ELF private production corpus | `blocked` | Supply an operator-owned sanitized private manifest; private-corpus checks were a non-goal without that manifest. |
+| ELF fixture-backed scoring is not live service execution | `not_encoded` capability | `[ELF benchmark vNext] Replace fixture-only ELF answers with live real-world adapter execution where appropriate`. |
+| qmd real-world job adapter | `not_encoded` suites | Add a qmd adapter that executes `real_world_job` prompts and scoring before claiming real-world suite parity. |
+| agentmemory durable lifecycle | `lifecycle_fail` / `blocked` | `[ELF benchmark P0] Make agentmemory adapter lifecycle-durable and fail-typed`. |
+| mem0/OpenMemory same-corpus and real-world coverage | `wrong_result` / `not_encoded` | Add/fix a local OSS adapter before claiming lifecycle, personalization, or OpenMemory UI parity. |
+| memsearch same-corpus and real-world coverage | `wrong_result` / `incomplete` | Fix Docker same-corpus retrieval/reindex evidence before scoring Markdown-first real-world jobs. |
+| OpenViking Docker local embedding path | `incomplete` | `[ELF benchmark adapter] Pin OpenViking Docker local embedding dependency path`. |
+| claude-mem durable/progressive-disclosure adapter | `wrong_result` / `not_encoded` | Add durable local repository and progressive-disclosure job coverage before UX parity claims. |
+
+## Adoption Implications
+
+What ELF is better at in the current evidence:
+
+- Evidence-bound writes, deterministic ingestion boundaries, source-of-truth discipline,
+  rebuildable Qdrant indexing, scoped service APIs, and audited fixture-backed real-world
+  provenance are stronger than the currently tested alternatives.
+- The P1 fixture batch removed the previous real-world `wrong_result` and `not_encoded`
+  aggregate gaps for project decisions, temporal relation validity, and operator
+  debugging UX.
+
+Where ELF is comparable or still being tested:
+
+- qmd remains the strongest local retrieval-debug baseline. It passes current
+  live-baseline checks, while ELF has the stronger evidence/provenance service contract.
+- The fixture-backed retrieval and memory-evolution suites pass, but this is not the
+  same as proving every external project on the same real-world jobs.
+
+Where ELF is behind or not yet proven:
+
+- No external project has a live real-world adapter win, including ELF as a live service
+  adapter; the current ELF result is fixture-backed.
+- Production-ops is intentionally not a full pass because credentialed and private
+  corpus checks need operator-owned inputs.
+- ELF still needs to absorb external strengths: qmd-style local debug knobs,
+  agentmemory/claude-mem/OpenMemory-style continuity and viewer ergonomics,
+  OpenViking-style context trajectory, mem0-style entity history, and memsearch-style
+  canonical local-store ergonomics.
+
+The current adoption statement is therefore: ELF is the best-supported foundation in
+this repository for high-trust evidence-linked agent memory, but this report does not
+claim overall external superiority or private-corpus production proof.
diff --git a/docs/guide/benchmarking/index.md b/docs/guide/benchmarking/index.md
index e6ea0bff..7cbb67ec 100644
--- a/docs/guide/benchmarking/index.md
+++ b/docs/guide/benchmarking/index.md
@@ -37,6 +37,9 @@ cleanup, use `docs/guide/single_user_production.md`.
 - `2026-06-09-operator-debugging-ux-report.md`: checked-in real-world job
   operator-debugging UX report with trace/viewer links, raw-SQL avoidance, root-cause
   step counts, dropped-candidate visibility, and repair-action clarity.
+- `2026-06-10-real-world-comparison-report.md`: checked-in post-P1 real-world
+  comparison report with aggregate fixture evidence, external-adapter evidence classes,
+  remaining typed gaps, and adoption implications.
 - `real_world_agent_memory_benchmark.md`: operator overview for the v1 real-world agent
   memory benchmark contract, including suite taxonomy, typed report states,
   knowledge-compilation fixture tasks, and the production-ops fixture target.
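
Review note: the aggregate figures in the June 10 report can be cross-checked against its own suite table with a small std-only Rust sketch. The `SuiteRow` type and the helper names below are hypothetical and not part of `elf-eval`; the values are copied from the "Suite-level outcomes" table. The sketch assumes the reported aggregate mean score is the job-weighted average of the per-suite mean scores, which the report does not state explicitly.

```rust
/// One row of the "Suite-level outcomes" table.
/// Hypothetical type for illustration; not part of elf-eval.
struct SuiteRow {
    suite: &'static str,
    jobs: u32,
    mean_score: f64,
}

/// Values copied from the June 10 report's suite table.
const SUITES: [SuiteRow; 11] = [
    SuiteRow { suite: "trust_source_of_truth", jobs: 1, mean_score: 1.0 },
    SuiteRow { suite: "work_resume", jobs: 5, mean_score: 1.0 },
    SuiteRow { suite: "project_decisions", jobs: 5, mean_score: 1.0 },
    SuiteRow { suite: "retrieval", jobs: 5, mean_score: 1.0 },
    SuiteRow { suite: "memory_evolution", jobs: 6, mean_score: 1.0 },
    SuiteRow { suite: "consolidation", jobs: 4, mean_score: 1.0 },
    SuiteRow { suite: "knowledge_compilation", jobs: 2, mean_score: 1.0 },
    SuiteRow { suite: "operator_debugging_ux", jobs: 1, mean_score: 1.0 },
    SuiteRow { suite: "capture_integration", jobs: 2, mean_score: 1.0 },
    SuiteRow { suite: "production_ops", jobs: 6, mean_score: 0.5 },
    SuiteRow { suite: "personalization", jobs: 1, mean_score: 1.0 },
];

/// Total number of jobs across all suites.
fn total_jobs(rows: &[SuiteRow]) -> u32 {
    rows.iter().map(|r| r.jobs).sum()
}

/// Job-weighted mean of the per-suite mean scores (assumed aggregation rule).
fn weighted_mean(rows: &[SuiteRow]) -> f64 {
    let weighted: f64 = rows.iter().map(|r| f64::from(r.jobs) * r.mean_score).sum();
    weighted / f64::from(total_jobs(rows))
}

fn main() {
    let jobs = total_jobs(&SUITES);
    // Status counts from the "Aggregate Result" table: pass, incomplete, blocked.
    let (pass, incomplete, blocked) = (35u32, 1u32, 2u32);
    assert_eq!(pass + incomplete + blocked, jobs);

    for row in SUITES.iter().filter(|r| r.mean_score < 1.0) {
        println!("below full score: {} ({} jobs)", row.suite, row.jobs);
    }
    println!(
        "{} suites, {} jobs, mean score {:.3}",
        SUITES.len(),
        jobs,
        weighted_mean(&SUITES)
    );
}
```

Under that assumption the table is internally consistent: the suite job counts sum to `38`, and `production_ops` is the only suite pulling the mean below `1.000`.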