5 changes: 5 additions & 0 deletions README.md
@@ -154,6 +154,10 @@ Detailed evidence and interpretation:
- [Production Adoption Gate Report - June 9, 2026](docs/guide/benchmarking/2026-06-09-production-adoption-gate-report.md)
- [Live Baseline Benchmark Runbook](docs/guide/benchmarking/live_baseline_benchmark.md)
- [Single-User Production Runbook](docs/guide/single_user_production.md)
- Future benchmark contract:
[Real-World Agent Memory Benchmark v1](docs/spec/real_world_agent_memory_benchmark_v1.md).
This contract defines job-level suites for agent work, but no system win is claimed
under it until a runner encodes and reports those suites.

Quick comparison snapshot (objective/high-level).
This table compares capability coverage, not overall project quality.
@@ -199,6 +203,7 @@ Detailed comparison, mechanism-level analysis, and source map:
- [Synthetic Production Corpus Benchmark Report - June 9, 2026](docs/guide/benchmarking/2026-06-09-production-corpus-report.md)
- [Production Adoption Gate Report - June 9, 2026](docs/guide/benchmarking/2026-06-09-production-adoption-gate-report.md)
- [Live Baseline Benchmark Runbook](docs/guide/benchmarking/live_baseline_benchmark.md)
- [Real-World Agent Memory Benchmark](docs/guide/benchmarking/real_world_agent_memory_benchmark.md)
- [External Memory Improvement Plan](docs/guide/research/external_memory_improvement_plan.md)
- [Detailed External Comparison](docs/guide/research/comparison_external_projects.md)
- [Research Projects Inventory](docs/guide/research/research_projects_inventory.md)
@@ -254,6 +254,9 @@ Recommended non-blocking follow-ups:

- Rerun `baseline-production-private` when an operator-owned private manifest is
available, and publish a private-corpus addendum that does not expose private text.
- Treat `docs/spec/real_world_agent_memory_benchmark_v1.md` as the future-work
contract for job-level memory evaluation. This report does not claim any pass under
that new suite because no real-world job runner was encoded in this gate.
- Keep qmd as the strongest external local baseline for routing/fusion/debuggability
comparison work.
- Treat agentmemory, memsearch, mem0, OpenViking, and claude-mem adapter failures as
5 changes: 5 additions & 0 deletions docs/guide/benchmarking/index.md
@@ -33,6 +33,8 @@ cleanup, use `docs/guide/single_user_production.md`.
- `2026-06-09-production-adoption-gate-report.md`: XY-836 production adoption
decision report with fresh provider-backed synthetic, stress, backfill, restore, and
external adapter evidence.
- `real_world_agent_memory_benchmark.md`: operator overview for the v1 real-world
agent memory benchmark contract, including suite taxonomy and typed report states.

## Update Rules

@@ -42,3 +44,6 @@ cleanup, use `docs/guide/single_user_production.md`.
- Link the newest decision-relevant report from README and this index.
- When benchmark semantics change, update `live_baseline_benchmark.md` and the
relevant spec before publishing a new result.
- Real-world job benchmark changes are governed by
`docs/spec/real_world_agent_memory_benchmark_v1.md`; keep this guide as a routing
document and do not duplicate the normative schema here.
4 changes: 4 additions & 0 deletions docs/guide/benchmarking/live_baseline_benchmark.md
@@ -251,6 +251,10 @@ by the live baseline runner. It does not remove the host report directory.

## Result Semantics

The result terms below belong to the current Docker live baseline. For the future
job-level suite contract, including `unsupported_claim`, see
`docs/spec/real_world_agent_memory_benchmark_v1.md`.

- `pass`: the project installed and every encoded check for that project passed in the
selected corpus profile.
- `wrong_result`: a retrieval check completed but returned the wrong memory or missed
117 changes: 117 additions & 0 deletions docs/guide/benchmarking/real_world_agent_memory_benchmark.md
@@ -0,0 +1,117 @@
# Real-World Agent Memory Benchmark

Goal: Explain the v1 real-world agent memory benchmark suite and route implementation
work to the governing spec.
Read this when: You need to create jobs, extend benchmark suites, interpret reports,
or understand why retrieval-only comparisons are insufficient.
Inputs: `docs/spec/real_world_agent_memory_benchmark_v1.md`, current live baseline
reports, external project comparison docs, and the intended user-job scenario.
Depends on: `docs/spec/real_world_agent_memory_benchmark_v1.md`,
`live_baseline_benchmark.md`, and `docs/guide/research/comparison_external_projects.md`.
Outputs: Operator-facing suite overview, bias explanation, and implementation routing.

## Governing Spec

The authoritative contract is:

- `docs/spec/real_world_agent_memory_benchmark_v1.md`

Use the spec for field names, suite ids, report states, scoring rules, and claim
boundaries. This guide is only an operator map.

## Why This Suite Exists

The current live baseline proves useful behavior: ELF and qmd can pass the encoded
Docker smoke checks, and ELF can pass provider-backed synthetic, stress, backfill,
restore, and lifecycle checks. That evidence remains valid for the existing benchmark.

The evidence is nonetheless incomplete for real agent work. A memory system can retrieve the right chunk and
still fail the user's job by repeating completed work, trusting stale evidence, missing
a blocker, leaking private context, or inventing a decision that was never recorded.

The real-world suite changes the unit from a query to a `real_world_job`:

- corpus
- timeline
- prompt
- expected answer
- required evidence
- negative traps
- scoring rubric
- allowed uncertainty

This shape rewards systems that help agents resume, decide, debug, update stale memory,
compile knowledge, and state honest uncertainty.
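
As a rough illustration of the job shape listed above, the sketch below models one `real_world_job` as a Python dataclass. The class name, field names, field types, and both helper functions are assumptions made for this example; the spec remains the authority for the real field names and scoring rules.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RealWorldJob:
    """Hypothetical container for one real_world_job (names and types are assumed)."""

    corpus: str                    # identifier of the corpus the job runs against
    timeline: list[str]            # ordered events the agent must respect
    prompt: str                    # the task handed to the agent
    expected_answer: str           # reference answer used by the rubric
    required_evidence: list[str]   # evidence ids the answer must cite
    negative_traps: list[str] = field(default_factory=list)        # ids that must not be cited
    scoring_rubric: dict[str, float] = field(default_factory=dict)  # dimension -> weight
    allowed_uncertainty: str = ""  # what the agent may honestly leave open


def missing_evidence(job: RealWorldJob, cited: set[str]) -> list[str]:
    """Return required evidence ids that the answer did not cite, in job order."""
    return [e for e in job.required_evidence if e not in cited]


def triggered_traps(job: RealWorldJob, cited: set[str]) -> list[str]:
    """Return negative-trap ids that the answer cited, in job order."""
    return [t for t in job.negative_traps if t in cited]
```

A runner built on this shape could treat a non-empty `missing_evidence` or `triggered_traps` result as input to the rubric rather than as a verdict by itself; how those signals map to report states is defined by the spec, not by this sketch.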

## Suite Overview

| Suite | What It Tests | Example Job |
| --- | --- | --- |
| Trust/source-of-truth | Provenance, rebuildability, and derived-index boundaries. | Restore a note after index rebuild and cite authoritative source evidence. |
| Work resume | Resuming agent work without repeating completed steps. | Identify the next action after a retained lane failure. |
| Project decisions | Current decisions, rationale, reversals, and caveats. | Explain why a benchmark gate uses typed failures. |
| Retrieval | Task-relevant search with decoys and alternates. | Answer a task query while avoiding near-duplicate project evidence. |
| Memory evolution | Update, delete, expiry, contradiction, and history behavior. | Report what superseded an old fact and suppress deleted memory. |
| Consolidation | Reviewable derived memories without hidden mutation. | Produce a proposal with lineage and unsupported-claim flags. |
| Knowledge compilation | Evidence-linked project/entity/concept pages. | Compile current project status with timeline and stale-section lint. |
| Operator debugging UX | Ability to diagnose wrong results without raw store access. | Show which retrieval stage dropped expected evidence. |
| Capture/integration | Accuracy of hooks, imports, exclusions, and write policies. | Capture a session decision while excluding private spans. |
| Production ops | Backfill, restore, cold start, resource, and bounded-failure behavior. | Resume interrupted import without duplicate source notes. |
| Personalization | Scoped preferences without cross-tenant leakage. | Apply the user's current preference and ignore another project's note. |

## External Reference Mapping

The suite uses external strengths as references, not as winners:

- ELF: evidence-bound writes, deterministic ingestion boundaries, source-of-truth plus
rebuildable index, production ops, and evaluation tooling.
- qmd: local retrieval quality, query expansion/routing, weighted fusion, rerank, and
transparent debug ergonomics.
- agentmemory: cross-agent hooks, coding-agent continuity, local viewer, consolidation
lifecycle, and observability console.
- claude-mem: progressive disclosure, automatic capture loop, local inspection, and
operator comfort.
- OpenViking: filesystem context model, hierarchical retrieval, staged trajectory, and
session iteration.
- mem0: multi-entity scoping, lifecycle history, optional graph context, hosted/OpenMemory
ecosystem, and personalization references.
- memsearch: Markdown-first source-of-truth pattern, incremental indexing, and practical
local hybrid retrieval.
- llm-wiki and gbrain: compiled knowledge pages, query-save/lint loops, current-truth
plus timeline shape.
- Always-On Memory Agent, Claude Dreams, and Gemini CLI Auto Memory: background
consolidation patterns, with ELF's requirement that derived outputs remain reviewable.
- Graphiti/Zep, Letta, LangGraph, graphify, and nanograph: temporal facts, core versus
archival memory, replay mindset, graph-compressed navigation, and typed graph ergonomics.

## Report Interpretation

A real-world benchmark report must preserve typed outcomes:

- `pass`
- `wrong_result`
- `lifecycle_fail`
- `incomplete`
- `blocked`
- `not_encoded`
- `unsupported_claim`

Do not collapse these outcomes into a single leaderboard score. `unsupported_claim` is
especially important: it means the system made a substantive claim that the corpus or
evidence did not support. That is a different and higher-risk failure than simply missing a result.
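
One way to keep the outcomes typed is to report a count per state instead of a single score. The sketch below assumes outcomes arrive as plain strings; the `Outcome` enum and `tally` function are hypothetical names for this example, and only the seven state strings come from this guide.

```python
from collections import Counter
from enum import Enum


class Outcome(str, Enum):
    """The seven typed report states listed in this guide."""

    PASS = "pass"
    WRONG_RESULT = "wrong_result"
    LIFECYCLE_FAIL = "lifecycle_fail"
    INCOMPLETE = "incomplete"
    BLOCKED = "blocked"
    NOT_ENCODED = "not_encoded"
    UNSUPPORTED_CLAIM = "unsupported_claim"


def tally(outcomes: list[str]) -> dict[str, int]:
    """Count outcomes per state without merging them into one number.

    An unrecognized string raises ValueError, so a typo cannot be
    silently counted under some other state.
    """
    counts = Counter(Outcome(o) for o in outcomes)
    # Emit every state, including zero counts, so no category disappears.
    return {state.value: counts.get(state, 0) for state in Outcome}
```

Emitting zero counts explicitly is a deliberate choice here: a report that omits `unsupported_claim` is indistinguishable from one that never checked for it.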

## Implementation Routing

Downstream runner issues can cite the spec directly. Each runner change should encode a
small suite slice first, then report every untouched suite as `not_encoded`.
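
A minimal sketch of that reporting rule follows. The suite ids `work_resume`, `retrieval`, and `memory_evolution` appear in this guide; the other ids in `ALL_SUITES` are assumed snake_case forms of the suite names in the overview table, and `complete_report` is a hypothetical helper. Take the real ids from the spec.

```python
# Suite ids: three are named in this guide; the rest are ASSUMED spellings
# derived from the suite overview table and may differ from the spec.
ALL_SUITES = (
    "trust_source_of_truth",
    "work_resume",
    "project_decisions",
    "retrieval",
    "memory_evolution",
    "consolidation",
    "knowledge_compilation",
    "operator_debugging_ux",
    "capture_integration",
    "production_ops",
    "personalization",
)


def complete_report(encoded: dict[str, str]) -> dict[str, str]:
    """Return one state per suite, filling untouched suites with not_encoded."""
    unknown = set(encoded) - set(ALL_SUITES)
    if unknown:
        # Reject misspelled ids instead of dropping their results.
        raise KeyError(f"unknown suite ids: {sorted(unknown)}")
    return {suite: encoded.get(suite, "not_encoded") for suite in ALL_SUITES}
```

With this helper, a runner that only encodes `work_resume` still produces a report covering every suite, and readers can see at a glance which suites were never exercised.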

Recommended first increments:

1. Encode one `work_resume` job over the synthetic production corpus.
2. Encode one `retrieval` job with decoys and required evidence.
3. Encode one `memory_evolution` job that proves update/delete/supersession behavior.
4. Add report output for `unsupported_claim` before broadening the suite count.

Do not generate large fixtures or update production-adoption verdicts while adding the
contract. The current adoption gate verdict stands as a decision under the existing
benchmark until new real-world job reports are implemented and published.
2 changes: 2 additions & 0 deletions docs/spec/index.md
@@ -39,6 +39,8 @@ Question this index answers: "what must remain true?"
whether ELF meets or exceeds selected external memory-system baselines.
- `production_corpus_manifest_v1.md`: Sanitized/private coding-agent production
corpus manifest schema for adoption benchmark runs.
- `real_world_agent_memory_benchmark_v1.md`: Real-world agent memory benchmark job
schema, suite taxonomy, scoring dimensions, and report state semantics.

## Spec document contract
