hack-ink · yvette-carlisle · Jun 9, 2026 · Jun 9, 2026
diff --git a/README.md b/README.md
@@ -154,6 +154,10 @@ Detailed evidence and interpretation:
 - [Production Adoption Gate Report - June 9, 2026](docs/guide/benchmarking/2026-06-09-production-adoption-gate-report.md)
 - [Live Baseline Benchmark Runbook](docs/guide/benchmarking/live_baseline_benchmark.md)
 - [Single-User Production Runbook](docs/guide/single_user_production.md)
+- Future benchmark contract:
+  [Real-World Agent Memory Benchmark v1](docs/spec/real_world_agent_memory_benchmark_v1.md).
+  This contract defines job-level suites for agent work, but no system win is claimed
+  under it until a runner encodes and reports those suites.
 
 Quick comparison snapshot (objective/high-level).
 This table compares capability coverage, not overall project quality.
@@ -199,6 +203,7 @@ Detailed comparison, mechanism-level analysis, and source map:
 - [Synthetic Production Corpus Benchmark Report - June 9, 2026](docs/guide/benchmarking/2026-06-09-production-corpus-report.md)
 - [Production Adoption Gate Report - June 9, 2026](docs/guide/benchmarking/2026-06-09-production-adoption-gate-report.md)
 - [Live Baseline Benchmark Runbook](docs/guide/benchmarking/live_baseline_benchmark.md)
+- [Real-World Agent Memory Benchmark](docs/guide/benchmarking/real_world_agent_memory_benchmark.md)
 - [External Memory Improvement Plan](docs/guide/research/external_memory_improvement_plan.md)
 - [Detailed External Comparison](docs/guide/research/comparison_external_projects.md)
 - [Research Projects Inventory](docs/guide/research/research_projects_inventory.md)

diff --git a/docs/guide/benchmarking/2026-06-09-production-adoption-gate-report.md b/docs/guide/benchmarking/2026-06-09-production-adoption-gate-report.md
@@ -254,6 +254,9 @@ Recommended non-blocking follow-ups:
 
 - Rerun `baseline-production-private` when an operator-owned private manifest is
   available, and publish a private-corpus addendum that does not expose private text.
+- Treat `docs/spec/real_world_agent_memory_benchmark_v1.md` as the future-work
+  contract for job-level memory evaluation. This report does not claim any pass under
+  that new suite because no real-world job runner was encoded in this gate.
 - Keep qmd as the strongest external local baseline for routing/fusion/debuggability
   comparison work.
 - Treat agentmemory, memsearch, mem0, OpenViking, and claude-mem adapter failures as

diff --git a/docs/guide/benchmarking/index.md b/docs/guide/benchmarking/index.md
@@ -33,6 +33,8 @@ cleanup, use `docs/guide/single_user_production.md`.
 - `2026-06-09-production-adoption-gate-report.md`: XY-836 production adoption
   decision report with fresh provider-backed synthetic, stress, backfill, restore, and
   external adapter evidence.
+- `real_world_agent_memory_benchmark.md`: operator overview for the v1 real-world
+  agent memory benchmark contract, including suite taxonomy and typed report states.
 
 ## Update Rules
 
@@ -42,3 +44,6 @@ cleanup, use `docs/guide/single_user_production.md`.
 - Link the newest decision-relevant report from README and this index.
 - When benchmark semantics change, update `live_baseline_benchmark.md` and the
   relevant spec before publishing a new result.
+- Real-world job benchmark changes are governed by
+  `docs/spec/real_world_agent_memory_benchmark_v1.md`; keep this guide as a routing aid
+  and do not duplicate the normative schema here.
diff --git a/docs/guide/benchmarking/live_baseline_benchmark.md b/docs/guide/benchmarking/live_baseline_benchmark.md
@@ -251,6 +251,10 @@ by the live baseline runner. It does not remove the host report directory.
 
 ## Result Semantics
 
+The result terms below belong to the current Docker live baseline. For the future
+job-level suite contract, including `unsupported_claim`, see
+`docs/spec/real_world_agent_memory_benchmark_v1.md`.
+
 - `pass`: the project installed and every encoded check for that project passed in the
   selected corpus profile.
 - `wrong_result`: a retrieval check completed but returned the wrong memory or missed

diff --git a/docs/guide/benchmarking/real_world_agent_memory_benchmark.md b/docs/guide/benchmarking/real_world_agent_memory_benchmark.md
@@ -0,0 +1,117 @@
+# Real-World Agent Memory Benchmark
+
+Goal: Explain the v1 real-world agent memory benchmark suite and route implementation
+work to the governing spec.
+Read this when: You need to create jobs, extend benchmark suites, interpret reports,
+or understand why retrieval-only comparisons are insufficient.
+Inputs: `docs/spec/real_world_agent_memory_benchmark_v1.md`, current live baseline
+reports, external project comparison docs, and the intended user-job scenario.
+Depends on: `docs/spec/real_world_agent_memory_benchmark_v1.md`,
+`live_baseline_benchmark.md`, and `docs/guide/research/comparison_external_projects.md`.
+Outputs: Operator-facing suite overview, bias explanation, and implementation routing.
+
+## Governing Spec
+
+The authoritative contract is:
+
+- `docs/spec/real_world_agent_memory_benchmark_v1.md`
+
+Use the spec for field names, suite ids, report states, scoring rules, and claim
+boundaries. This guide is only an operator map.
+
+## Why This Suite Exists
+
+The current live baseline proves useful behavior: ELF and qmd can pass the encoded
+Docker smoke checks, and ELF can pass provider-backed synthetic, stress, backfill,
+restore, and lifecycle checks. That evidence remains valid for the existing benchmark.
+
+It is incomplete for real agent work. A memory system can retrieve the right chunk and
+still fail the user's job by repeating completed work, trusting stale evidence, missing
+a blocker, leaking private context, or inventing a decision that was never recorded.
+
+The suite changes the evaluation unit from a query to a `real_world_job` made up of:
+
+- corpus
+- timeline
+- prompt
+- expected answer
+- required evidence
+- negative traps
+- scoring rubric
+- allowed uncertainty
+
+This shape rewards systems that help agents resume, decide, debug, update stale memory,
+compile knowledge, and state honest uncertainty.
+
+## Suite Overview
+
+| Suite | What It Tests | Example Job |
+| --- | --- | --- |
+| Trust/source-of-truth | Provenance, rebuildability, and derived-index boundaries. | Restore a note after index rebuild and cite authoritative source evidence. |
+| Work resume | Resuming agent work without repeating completed steps. | Identify the next action after a retained lane failure. |
+| Project decisions | Current decisions, rationale, reversals, and caveats. | Explain why a benchmark gate uses typed failures. |
+| Retrieval | Task-relevant search with decoys and alternates. | Answer a task query while avoiding near-duplicate project evidence. |
+| Memory evolution | Update, delete, expiry, contradiction, and history behavior. | Report what superseded an old fact and suppress deleted memory. |
+| Consolidation | Reviewable derived memories without hidden mutation. | Produce a proposal with lineage and unsupported-claim flags. |
+| Knowledge compilation | Evidence-linked project/entity/concept pages. | Compile current project status with timeline and stale-section lint. |
+| Operator debugging UX | Ability to diagnose wrong results without raw store access. | Show which retrieval stage dropped expected evidence. |
+| Capture/integration | Accuracy of hooks, imports, exclusions, and write policies. | Capture a session decision while excluding private spans. |
+| Production ops | Backfill, restore, cold start, resource, and bounded-failure behavior. | Resume interrupted import without duplicate source notes. |
+| Personalization | Scoped preferences without cross-tenant leakage. | Apply the user's current preference and ignore another project's note. |
+
+## External Reference Mapping
+
+The suite cites each project's strengths as design references, not as benchmark wins:
+
+- ELF: evidence-bound writes, deterministic ingestion boundaries, source-of-truth plus
+  rebuildable index, production ops, and evaluation tooling.
+- qmd: local retrieval quality, query expansion/routing, weighted fusion, rerank, and
+  transparent debug ergonomics.
+- agentmemory: cross-agent hooks, coding-agent continuity, local viewer, consolidation
+  lifecycle, and observability console.
+- claude-mem: progressive disclosure, automatic capture loop, local inspection, and
+  operator comfort.
+- OpenViking: filesystem context model, hierarchical retrieval, staged trajectory, and
+  session iteration.
+- mem0: multi-entity scoping, lifecycle history, optional graph context, hosted/OpenMemory
+  ecosystem, and personalization references.
+- memsearch: Markdown-first source-of-truth pattern, incremental indexing, and practical
+  local hybrid retrieval.
+- llm-wiki and gbrain: compiled knowledge pages, query-save/lint loops, current-truth
+  plus timeline shape.
+- Always-On Memory Agent, Claude Dreams, and Gemini CLI Auto Memory: background
+  consolidation patterns, with ELF's requirement that derived outputs remain reviewable.
+- Graphiti/Zep, Letta, LangGraph, graphify, and nanograph: temporal facts, core versus
+  archival memory, replay mindset, graph-compressed navigation, and typed graph ergonomics.
+
+## Report Interpretation
+
+A real-world benchmark report must preserve typed outcomes:
+
+- `pass`
+- `wrong_result`
+- `lifecycle_fail`
+- `incomplete`
+- `blocked`
+- `not_encoded`
+- `unsupported_claim`
+
+Do not collapse these states into one leaderboard score. `unsupported_claim` is
+especially important: it means the system made a substantive claim that the corpus or
+evidence did not support. That is a different, higher-risk failure than missing a result.
+
+## Implementation Routing
+
+Downstream runner issues can cite the spec directly. Each should start with a small
+suite slice and report every untouched suite as `not_encoded`.
+
+Recommended first increments:
+
+1. Encode one `work_resume` job over the synthetic production corpus.
+2. Encode one `retrieval` job with decoys and required evidence.
+3. Encode one `memory_evolution` job that proves update/delete/supersession behavior.
+4. Add report output for `unsupported_claim` before broadening the suite count.
+
+Do not generate large fixtures or update production-adoption verdicts while adding the
+contract. The current adoption gate stands as a decision under the existing benchmark
+until new real-world job reports are implemented and published.
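
The guide added above describes three things a runner would need: the `real_world_job` shape, the seven typed report states, and the rule that untouched suites are reported as `not_encoded`. Below is a minimal Python sketch of how those could fit together. It is not the spec's schema. The class names, field types, and the `build_report` and `tally` helpers are hypothetical; only the state strings, the eight job components, and the suite ids `work_resume`, `retrieval`, and `memory_evolution` come from the guide.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class ReportState(str, Enum):
    """Typed outcomes listed under "Report Interpretation" in the guide."""
    PASS = "pass"
    WRONG_RESULT = "wrong_result"
    LIFECYCLE_FAIL = "lifecycle_fail"
    INCOMPLETE = "incomplete"
    BLOCKED = "blocked"
    NOT_ENCODED = "not_encoded"
    UNSUPPORTED_CLAIM = "unsupported_claim"


@dataclass
class RealWorldJob:
    """Hypothetical container for the eight job components; field types are assumed."""
    suite: str
    corpus: str
    timeline: list[str]
    prompt: str
    expected_answer: str
    required_evidence: list[str]
    negative_traps: list[str]
    scoring_rubric: dict[str, float]
    allowed_uncertainty: str


def build_report(all_suites, results):
    """Group (suite_id, state) results by suite; mark untouched suites not_encoded."""
    report = {suite: [] for suite in all_suites}
    for suite, state in results:
        report.setdefault(suite, []).append(state)
    for states in report.values():
        if not states:
            states.append(ReportState.NOT_ENCODED)
    return report


def tally(report):
    """Count outcomes per state instead of collapsing them into one score."""
    return Counter(state.value for states in report.values() for state in states)


if __name__ == "__main__":
    suites = ["work_resume", "retrieval", "memory_evolution"]
    results = [
        ("work_resume", ReportState.PASS),
        ("retrieval", ReportState.UNSUPPORTED_CLAIM),
    ]
    for state, count in sorted(tally(build_report(suites, results)).items()):
        print(state, count)
```

Keeping the tally as per-state counts, rather than a pass rate, is one way to keep `unsupported_claim` visible next to `wrong_result` and `not_encoded` in a published report.
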
diff --git a/docs/spec/index.md b/docs/spec/index.md
@@ -39,6 +39,8 @@ Question this index answers: "what must remain true?"
   whether ELF meets or exceeds selected external memory-system baselines.
 - `production_corpus_manifest_v1.md`: Sanitized/private coding-agent production
   corpus manifest schema for adoption benchmark runs.
+- `real_world_agent_memory_benchmark_v1.md`: Real-world agent memory benchmark job
+  schema, suite taxonomy, scoring dimensions, and report state semantics.
 
 ## Spec document contract