From 1d8b3033e28ecaf1bed998bc3f2ed7d9e0064a2f Mon Sep 17 00:00:00 2001 From: Yvette Carlisle Date: Tue, 9 Jun 2026 21:15:31 +0800 Subject: [PATCH] {"schema":"decodex/commit/1","summary":"Define real-world agent memory benchmark contract","authority":"XY-840"} --- README.md | 5 + ...6-06-09-production-adoption-gate-report.md | 3 + docs/guide/benchmarking/index.md | 5 + .../benchmarking/live_baseline_benchmark.md | 4 + .../real_world_agent_memory_benchmark.md | 117 +++++++ docs/spec/index.md | 2 + .../real_world_agent_memory_benchmark_v1.md | 328 ++++++++++++++++++ 7 files changed, 464 insertions(+) create mode 100644 docs/guide/benchmarking/real_world_agent_memory_benchmark.md create mode 100644 docs/spec/real_world_agent_memory_benchmark_v1.md diff --git a/README.md b/README.md index 0fb0a90f..356504c4 100644 --- a/README.md +++ b/README.md @@ -154,6 +154,10 @@ Detailed evidence and interpretation: - [Production Adoption Gate Report - June 9, 2026](docs/guide/benchmarking/2026-06-09-production-adoption-gate-report.md) - [Live Baseline Benchmark Runbook](docs/guide/benchmarking/live_baseline_benchmark.md) - [Single-User Production Runbook](docs/guide/single_user_production.md) +- Future benchmark contract: + [Real-World Agent Memory Benchmark v1](docs/spec/real_world_agent_memory_benchmark_v1.md). + This contract defines job-level suites for agent work, but no system win is claimed + under it until a runner encodes and reports those suites. Quick comparison snapshot (objective/high-level). This table compares capability coverage, not overall project quality. 
@@ -199,6 +203,7 @@ Detailed comparison, mechanism-level analysis, and source map: - [Synthetic Production Corpus Benchmark Report - June 9, 2026](docs/guide/benchmarking/2026-06-09-production-corpus-report.md) - [Production Adoption Gate Report - June 9, 2026](docs/guide/benchmarking/2026-06-09-production-adoption-gate-report.md) - [Live Baseline Benchmark Runbook](docs/guide/benchmarking/live_baseline_benchmark.md) +- [Real-World Agent Memory Benchmark](docs/guide/benchmarking/real_world_agent_memory_benchmark.md) - [External Memory Improvement Plan](docs/guide/research/external_memory_improvement_plan.md) - [Detailed External Comparison](docs/guide/research/comparison_external_projects.md) - [Research Projects Inventory](docs/guide/research/research_projects_inventory.md) diff --git a/docs/guide/benchmarking/2026-06-09-production-adoption-gate-report.md b/docs/guide/benchmarking/2026-06-09-production-adoption-gate-report.md index f8bfb7be..d1491423 100644 --- a/docs/guide/benchmarking/2026-06-09-production-adoption-gate-report.md +++ b/docs/guide/benchmarking/2026-06-09-production-adoption-gate-report.md @@ -254,6 +254,9 @@ Recommended non-blocking follow-ups: - Rerun `baseline-production-private` when an operator-owned private manifest is available, and publish a private-corpus addendum that does not expose private text. +- Treat `docs/spec/real_world_agent_memory_benchmark_v1.md` as the future-work + contract for job-level memory evaluation. This report does not claim any pass under + that new suite because no real-world job runner was encoded in this gate. - Keep qmd as the strongest external local baseline for routing/fusion/debuggability comparison work. 
- Treat agentmemory, memsearch, mem0, OpenViking, and claude-mem adapter failures as diff --git a/docs/guide/benchmarking/index.md b/docs/guide/benchmarking/index.md index d5921631..c47f491b 100644 --- a/docs/guide/benchmarking/index.md +++ b/docs/guide/benchmarking/index.md @@ -33,6 +33,8 @@ cleanup, use `docs/guide/single_user_production.md`. - `2026-06-09-production-adoption-gate-report.md`: XY-836 production adoption decision report with fresh provider-backed synthetic, stress, backfill, restore, and external adapter evidence. +- `real_world_agent_memory_benchmark.md`: operator overview for the v1 real-world + agent memory benchmark contract, including suite taxonomy and typed report states. ## Update Rules @@ -42,3 +44,6 @@ cleanup, use `docs/guide/single_user_production.md`. - Link the newest decision-relevant report from README and this index. - When benchmark semantics change, update `live_baseline_benchmark.md` and the relevant spec before publishing a new result. +- Real-world job benchmark changes are governed by + `docs/spec/real_world_agent_memory_benchmark_v1.md`; keep this guide as routing and + do not duplicate the normative schema here. diff --git a/docs/guide/benchmarking/live_baseline_benchmark.md b/docs/guide/benchmarking/live_baseline_benchmark.md index e6995f00..d1238181 100644 --- a/docs/guide/benchmarking/live_baseline_benchmark.md +++ b/docs/guide/benchmarking/live_baseline_benchmark.md @@ -251,6 +251,10 @@ by the live baseline runner. It does not remove the host report directory. ## Result Semantics +The result terms below belong to the current Docker live baseline. For the future +job-level suite contract, including `unsupported_claim`, see +`docs/spec/real_world_agent_memory_benchmark_v1.md`. + - `pass`: the project installed and every encoded check for that project passed in the selected corpus profile. 
- `wrong_result`: a retrieval check completed but returned the wrong memory or missed diff --git a/docs/guide/benchmarking/real_world_agent_memory_benchmark.md b/docs/guide/benchmarking/real_world_agent_memory_benchmark.md new file mode 100644 index 00000000..df11d9ef --- /dev/null +++ b/docs/guide/benchmarking/real_world_agent_memory_benchmark.md @@ -0,0 +1,117 @@ +# Real-World Agent Memory Benchmark + +Goal: Explain the v1 real-world agent memory benchmark suite and route implementation +work to the governing spec. +Read this when: You need to create jobs, extend benchmark suites, interpret reports, +or understand why retrieval-only comparisons are insufficient. +Inputs: `docs/spec/real_world_agent_memory_benchmark_v1.md`, current live baseline +reports, external project comparison docs, and the intended user-job scenario. +Depends on: `docs/spec/real_world_agent_memory_benchmark_v1.md`, +`live_baseline_benchmark.md`, and `docs/guide/research/comparison_external_projects.md`. +Outputs: Operator-facing suite overview, bias explanation, and implementation routing. + +## Governing Spec + +The authoritative contract is: + +- `docs/spec/real_world_agent_memory_benchmark_v1.md` + +Use the spec for field names, suite ids, report states, scoring rules, and claim +boundaries. This guide is only an operator map. + +## Why This Suite Exists + +The current live baseline proves useful behavior: ELF and qmd can pass the encoded +Docker smoke checks, and ELF can pass provider-backed synthetic, stress, backfill, +restore, and lifecycle checks. That evidence remains valid for the existing benchmark. + +It is incomplete for real agent work. A memory system can retrieve the right chunk and +still fail the user's job by repeating completed work, trusting stale evidence, missing +a blocker, leaking private context, or inventing a decision that was never recorded. 
+ +The real-world suite changes the unit from a query to a `real_world_job`: + +- corpus +- timeline +- prompt +- expected answer +- required evidence +- negative traps +- scoring rubric +- allowed uncertainty + +This shape rewards systems that help agents resume, decide, debug, update stale memory, +compile knowledge, and state honest uncertainty. + +## Suite Overview + +| Suite | What It Tests | Example Job | +| --- | --- | --- | +| Trust/source-of-truth | Provenance, rebuildability, and derived-index boundaries. | Restore a note after index rebuild and cite authoritative source evidence. | +| Work resume | Resuming agent work without repeating completed steps. | Identify the next action after a retained lane failure. | +| Project decisions | Current decisions, rationale, reversals, and caveats. | Explain why a benchmark gate uses typed failures. | +| Retrieval | Task-relevant search with decoys and alternates. | Answer a task query while avoiding near-duplicate project evidence. | +| Memory evolution | Update, delete, expiry, contradiction, and history behavior. | Report what superseded an old fact and suppress deleted memory. | +| Consolidation | Reviewable derived memories without hidden mutation. | Produce a proposal with lineage and unsupported-claim flags. | +| Knowledge compilation | Evidence-linked project/entity/concept pages. | Compile current project status with timeline and stale-section lint. | +| Operator debugging UX | Ability to diagnose wrong results without raw store access. | Show which retrieval stage dropped expected evidence. | +| Capture/integration | Accuracy of hooks, imports, exclusions, and write policies. | Capture a session decision while excluding private spans. | +| Production ops | Backfill, restore, cold start, resource, and bounded-failure behavior. | Resume interrupted import without duplicate source notes. | +| Personalization | Scoped preferences without cross-tenant leakage. 
| Apply the user's current preference and ignore another project's note. | + +## External Reference Mapping + +The suite uses external strengths as references, not as winners: + +- ELF: evidence-bound writes, deterministic ingestion boundaries, source-of-truth plus + rebuildable index, production ops, and evaluation tooling. +- qmd: local retrieval quality, query expansion/routing, weighted fusion, rerank, and + transparent debug ergonomics. +- agentmemory: cross-agent hooks, coding-agent continuity, local viewer, consolidation + lifecycle, and observability console. +- claude-mem: progressive disclosure, automatic capture loop, local inspection, and + operator comfort. +- OpenViking: filesystem context model, hierarchical retrieval, staged trajectory, and + session iteration. +- mem0: multi-entity scoping, lifecycle history, optional graph context, hosted/OpenMemory + ecosystem, and personalization references. +- memsearch: Markdown-first source-of-truth pattern, incremental indexing, and practical + local hybrid retrieval. +- llm-wiki and gbrain: compiled knowledge pages, query-save/lint loops, current-truth + plus timeline shape. +- Always-On Memory Agent, Claude Dreams, and Gemini CLI Auto Memory: background + consolidation patterns, with ELF's requirement that derived outputs remain reviewable. +- Graphiti/Zep, Letta, LangGraph, graphify, and nanograph: temporal facts, core versus + archival memory, replay mindset, graph-compressed navigation, and typed graph ergonomics. + +## Report Interpretation + +A real-world benchmark report must preserve typed outcomes: + +- `pass` +- `wrong_result` +- `lifecycle_fail` +- `incomplete` +- `blocked` +- `not_encoded` +- `unsupported_claim` + +Do not collapse those terms into one leaderboard. `unsupported_claim` is especially +important: it means the system made a substantive claim that the corpus or evidence did +not support. That is a different and higher-risk failure than simply missing a result. 
+ +## Implementation Routing + +Downstream runner issues can cite the spec directly. They should choose a small suite +slice first, then report every untouched suite as `not_encoded`. + +Recommended first increments: + +1. Encode one `work_resume` job over the synthetic production corpus. +2. Encode one `retrieval` job with decoys and required evidence. +3. Encode one `memory_evolution` job that proves update/delete/supersession behavior. +4. Add report output for `unsupported_claim` before broadening the suite count. + +Do not generate large fixtures or update production-adoption verdicts while adding the +contract. The current adoption gate remains an existing benchmark decision until new +real-world job reports are implemented and published. diff --git a/docs/spec/index.md b/docs/spec/index.md index 7cec41ce..228c81a8 100644 --- a/docs/spec/index.md +++ b/docs/spec/index.md @@ -39,6 +39,8 @@ Question this index answers: "what must remain true?" whether ELF meets or exceeds selected external memory-system baselines. - `production_corpus_manifest_v1.md`: Sanitized/private coding-agent production corpus manifest schema for adoption benchmark runs. +- `real_world_agent_memory_benchmark_v1.md`: Real-world agent memory benchmark job + schema, suite taxonomy, scoring dimensions, and report state semantics. ## Spec document contract diff --git a/docs/spec/real_world_agent_memory_benchmark_v1.md b/docs/spec/real_world_agent_memory_benchmark_v1.md new file mode 100644 index 00000000..fa94656f --- /dev/null +++ b/docs/spec/real_world_agent_memory_benchmark_v1.md @@ -0,0 +1,328 @@ +# Real-World Agent Memory Benchmark v1 + +Purpose: Define the v1 benchmark contract for evaluating agent memory systems through +real user jobs instead of isolated top-k retrieval queries. +Status: normative +Read this when: You are implementing, validating, reporting, or extending real-world +agent memory benchmark suites. 
+Not this document: Runner implementation steps, large fixture generation, operator +commands, or production adoption verdicts. +Defines: `real_world_job` schema, suite taxonomy, scoring dimensions, report states, +allowed uncertainty, and external reference mapping. + +## Scope + +The benchmark unit is `real_world_job`: a replayable user job that combines a corpus, +timeline, user prompt, expected answer, required evidence, negative traps, scoring +rubric, and allowed uncertainty. A job is intended to answer one question: would this +memory system help an agent do real work correctly, with less repetition and fewer +unsupported claims? + +This contract is future benchmark authority only. Existing live baseline reports remain +valid evidence for their encoded retrieval and lifecycle checks. A project must not +claim wins under this v1 suite until a runner encodes the relevant suites and publishes +a report against this contract. + +## Design Goals + +- Evaluate job completion, not only whether one expected chunk appears in top-k. +- Reward evidence-backed answers, stale-fact handling, and recoverable reasoning. +- Penalize confident but unsupported claims even when retrieval looks plausible. +- Preserve typed failure states instead of flattening every result into one leaderboard. +- Keep external project strengths visible as suite references, not as automatic + superiority claims. + +## Why The Current Benchmark Is Incomplete + +The June 2026 live baseline is necessary but biased toward service-style retrieval and +encoded lifecycle checks. ELF and qmd leading that matrix proves that those systems can +retrieve expected evidence and pass encoded update/delete/cold-start checks under the +selected Docker profiles. It does not prove that they help an agent resume a lane, +explain a decision, debug a failed retrieval, reconcile stale notes, compile durable +knowledge, or avoid unsupported claims during an end-to-end user job. 
+ +This suite fixes that bias by making the job transcript, expected answer, required +evidence, traps, and scoring rubric first-class. A system can pass retrieval and still +fail a real-world job if it repeats completed work, cites obsolete evidence, omits a +blocking caveat, or fabricates a decision that is not in the corpus. + +## Real-World Job Schema + +A `real_world_job` record MUST include the fields below. JSON is the canonical exchange +shape; YAML fixtures MAY be used only when converted to the same field names before +runner execution. + +```json +{ + "schema": "elf.real_world_job/v1", + "job_id": "trust-sot-restore-001", + "suite": "trust_source_of_truth", + "title": "Recover the authoritative restore decision", + "corpus": {}, + "timeline": [], + "prompt": {}, + "expected_answer": {}, + "required_evidence": [], + "negative_traps": [], + "scoring_rubric": {}, + "allowed_uncertainty": {}, + "tags": [] +} +``` + +### Required Top-Level Fields + +| Field | Type | Required semantics | +| --- | --- | --- | +| `schema` | string | MUST equal `elf.real_world_job/v1`. | +| `job_id` | string | Stable ASCII identifier unique within a suite. | +| `suite` | string | One suite id from the Suite Taxonomy section. | +| `title` | string | Human-readable job title. | +| `corpus` | object | Documents, memory items, traces, source refs, and adapter setup needed to replay the job. | +| `timeline` | array | Ordered events that establish what happened before the user prompt. | +| `prompt` | object | The user-facing request sent to the evaluated memory system or agent harness. | +| `expected_answer` | object | Required answer content, accepted uncertainty, and forbidden claims. | +| `required_evidence` | array | Evidence ids, source refs, quotes, or trace handles that must support the answer. | +| `negative_traps` | array | Distractors, stale facts, or misleading memories that must not drive the answer. 
| +| `scoring_rubric` | object | Dimensions, weights, thresholds, and hard-fail rules for this job. | +| `allowed_uncertainty` | object | Explicit uncertainty language and fallback behavior accepted for the job. | +| `tags` | array | Optional labels such as `private_corpus`, `synthetic`, `adapter_required`, or `no_live_claim`. | + +### `corpus` + +`corpus` MUST identify all replay inputs without relying on hidden host state. + +Required fields: + +- `corpus_id`: stable id. +- `profile`: `synthetic`, `private_sanitized`, `generated_public`, or `external_adapter`. +- `items`: array of corpus items. + +Each `items[]` entry MUST include: + +- `evidence_id`: stable id used by `required_evidence` and `negative_traps`. +- `kind`: `note`, `document`, `trace`, `issue`, `pr`, `runbook`, `decision`, `message`, + `compiled_page`, or `adapter_state`. +- `text` or `local_ref`: inline sanitized text or a local fixture pointer. +- `source_ref`: object; MAY be `{}` only for generated synthetic fixtures. +- `created_at`: RFC3339 timestamp or `null` when time is intentionally irrelevant. + +Private corpus fixtures MUST use sanitized inline text or local refs excluded from git. +Reports MAY publish evidence ids and score summaries without publishing private text. + +### `timeline` + +`timeline` MUST model the user job as prior agent work, not just a bag of documents. + +Each event MUST include: + +- `event_id` +- `ts` +- `actor`: `user`, `agent`, `tool`, `system`, `operator`, or `external` +- `action`: short verb phrase such as `created_issue`, `made_decision`, + `ran_command`, `hit_blocker`, `updated_memory`, `deleted_memory`, or + `published_report` +- `evidence_ids`: one or more ids from `corpus.items[]` +- `summary`: compact English summary + +Timeline order is normative. If a later event supersedes an earlier fact, the expected +answer MUST follow the later event unless `allowed_uncertainty` permits a historical +answer. 
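
The `timeline` rules above can be checked mechanically before a job is replayed. The following is a minimal sketch, not part of the contract: the function name `validate_timeline` and the shape of the returned problem strings are assumptions made for illustration, and the job is assumed to be already parsed into a Python dict that uses the field names from this spec.

```python
# Hypothetical pre-flight check for `timeline` events (illustrative only).
ALLOWED_ACTORS = {"user", "agent", "tool", "system", "operator", "external"}
REQUIRED_EVENT_FIELDS = ("event_id", "ts", "actor", "action", "evidence_ids", "summary")


def validate_timeline(job: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    # Every timeline evidence id must resolve to an entry in corpus.items[].
    known_ids = {item["evidence_id"] for item in job["corpus"]["items"]}
    problems = []
    for index, event in enumerate(job["timeline"]):
        label = event.get("event_id", f"timeline[{index}]")
        missing = [name for name in REQUIRED_EVENT_FIELDS if name not in event]
        if missing:
            problems.append(f"{label}: missing fields {missing}")
            continue
        if event["actor"] not in ALLOWED_ACTORS:
            problems.append(f"{label}: unknown actor {event['actor']!r}")
        if not event["evidence_ids"]:
            problems.append(f"{label}: needs at least one evidence id")
        unknown = [eid for eid in event["evidence_ids"] if eid not in known_ids]
        if unknown:
            problems.append(f"{label}: evidence ids not in corpus.items {unknown}")
    return problems
```

The sketch deliberately does not reorder events or parse `ts`, because the spec treats the array order itself as normative.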
+ +### `prompt` + +`prompt` MUST include: + +- `role`: normally `user`. +- `content`: the exact user request. +- `job_mode`: `resume`, `answer`, `debug`, `decide`, `compile`, `personalize`, or + `operate`. +- `constraints`: array of explicit instructions such as `do_not_run_live_actions`, + `cite_evidence`, `avoid_repeating_completed_work`, or `state_blockers`. + +The evaluated system MAY retrieve memory, inspect its own state, or call adapter tools +only when the runner profile permits those actions. + +### `expected_answer` + +`expected_answer` MUST define answer correctness at the job level. + +Required fields: + +- `must_include`: array of claims or actions that must appear. +- `must_not_include`: array of forbidden claims, stale facts, or unsafe actions. +- `evidence_links`: mapping from required claim ids to acceptable evidence ids. +- `answer_type`: `direct_answer`, `work_plan`, `resume_summary`, `debug_report`, + `decision_record`, `compiled_knowledge`, or `ops_runbook`. + +Optional fields: + +- `accepted_alternates`: array of alternate phrasings or equivalent evidence ids. +- `requires_caveat`: boolean; when true, omitting the caveat is a scoring failure. +- `requires_refusal`: boolean; when true, the correct answer is to decline or stop + because the memory system lacks evidence or authority. + +### `required_evidence` + +Each required evidence entry MUST include: + +- `evidence_id` +- `claim_id` +- `requirement`: `cite`, `use`, `avoid`, or `explain` +- `quote` or `selector`: exact quote for inline fixtures, or a stable selector for + local/private fixtures. + +An answer that states a required claim without any acceptable evidence link is an +`unsupported_claim` unless the job's `allowed_uncertainty` explicitly permits an +uncited low-confidence statement. + +### `negative_traps` + +Negative traps MUST be explicit so systems are tested against realistic memory failure +modes. 
+ +Trap types: + +- `stale_fact`: once true but superseded later in the timeline. +- `near_duplicate`: semantically close but wrong project, user, tenant, or time. +- `decoy_evidence`: shares query terms but does not support the expected claim. +- `unsafe_action`: would perform live, destructive, credentialed, or out-of-scope work. +- `unsupported_prior`: plausible prior decision not present in the corpus. +- `privacy_leak`: private or excluded content that must not appear in the answer. + +Each trap MUST include `trap_id`, `type`, `evidence_ids`, and `failure_if_used`. + +### `scoring_rubric` + +The rubric MUST be job-specific but use the shared dimensions below. + +Required dimensions: + +- `answer_correctness`: expected answer content and action selection. +- `evidence_grounding`: correct use of required evidence and source refs. +- `trap_avoidance`: avoidance of stale, decoy, privacy, and unsafe traps. +- `uncertainty_handling`: honest caveats when evidence is missing or ambiguous. +- `workflow_helpfulness`: whether the answer advances the user job without needless + repetition. + +Optional dimensions: + +- `lifecycle_behavior`: update, delete, expiry, supersession, or cold-start behavior. +- `debuggability`: trace, timeline, viewer, or explanation quality. +- `latency_resource`: bounded runtime, cost proxy, or resource envelope. +- `personalization_fit`: correct user/project preference application without leakage. + +Rubric fields: + +- `dimensions`: object keyed by dimension name, each with `weight`, `max_points`, and + `criteria`. +- `pass_threshold`: total normalized score required for `pass`. +- `hard_fail_rules`: array of rules that force a non-pass status regardless of score. 
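
The rubric fields above can be combined into a single normalized score. This spec does not fix the normalization formula, so the sketch below assumes one: a weight-averaged ratio of awarded points to `max_points` per dimension. The names `normalized_score`, `meets_pass_bar`, `awarded`, and `hard_fail_hits` are hypothetical and exist only for illustration.

```python
# Hypothetical scoring helper; the weighted-average formula is an assumption,
# not something this contract mandates.
def normalized_score(dimensions: dict, awarded: dict) -> float:
    """Weighted average of awarded / max_points across rubric dimensions, in [0, 1]."""
    total_weight = sum(dim["weight"] for dim in dimensions.values())
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    score = 0.0
    for name, dim in dimensions.items():
        # Clamp awarded points to the dimension's allowed range.
        points = min(max(awarded.get(name, 0.0), 0.0), dim["max_points"])
        score += dim["weight"] * points / dim["max_points"]
    return score / total_weight


def meets_pass_bar(rubric: dict, awarded: dict, hard_fail_hits: list) -> bool:
    """A hard-fail hit forces non-pass regardless of score."""
    if hard_fail_hits:
        return False
    return normalized_score(rubric["dimensions"], awarded) >= rubric["pass_threshold"]
```

Which non-pass status a failing job receives is decided separately by the report semantics; this helper only answers whether the score and hard-fail conditions for `pass` are met.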
+ +Hard-fail rules MUST include: + +- unsupported high-confidence claim about a required decision or fact; +- unsafe live/destructive action when the prompt forbids it; +- use of a negative trap marked `failure_if_used = true`; +- missing required refusal when the job has `requires_refusal = true`. + +### `allowed_uncertainty` + +`allowed_uncertainty` MUST distinguish honest uncertainty from failure. + +Required fields: + +- `can_answer_unknown`: boolean. +- `acceptable_phrases`: array of accepted uncertainty phrases or patterns. +- `fallback_action`: `ask_for_evidence`, `state_blocker`, `cite_partial_evidence`, + `refuse`, or `continue_with_caveat`. + +If `can_answer_unknown = false`, an answer that refuses despite sufficient evidence is +`wrong_result`. If `can_answer_unknown = true`, an answer that invents missing evidence +is `unsupported_claim`. + +## Suite Taxonomy + +Suite ids are stable public names. Each suite MUST contain at least one +`real_world_job` before a report may claim suite coverage. + +| Suite id | Goal | User-job examples | Evidence requirements | Scoring dimensions | Strongest external references | +| --- | --- | --- | --- | --- | --- | +| `trust_source_of_truth` | Verify authoritative storage, provenance, rebuild, and non-authoritative derived index handling. | Restore a note after Qdrant rebuild; identify whether a compiled page is derived; explain why a source ref supports a claim. | Source note/document ids, restore or rebuild trace, source_ref lineage, no hidden index-only evidence. | answer_correctness, evidence_grounding, trap_avoidance, lifecycle_behavior. | ELF, memsearch, OpenViking. | +| `work_resume` | Help an agent resume real work without repeating completed steps or losing blockers. | Resume a retained lane; identify next command after a failed run; summarize what remains blocked. | Timeline events, issue/PR ids, run summaries, latest blocker evidence. 
| answer_correctness, workflow_helpfulness, uncertainty_handling, trap_avoidance. | agentmemory, claude-mem, OpenViking. | +| `project_decisions` | Recover durable decisions, rationale, reversals, and current policy. | Explain why a design was chosen; distinguish old vs current validation gate; cite decision evidence. | Decision records, superseding events, accepted alternatives, current-policy timestamp. | answer_correctness, evidence_grounding, trap_avoidance, uncertainty_handling. | ELF, gbrain, llm-wiki, Letta. | +| `retrieval` | Measure task-relevant retrieval quality beyond top-k keyword matching. | Answer a task query with expected evidence; find alternate phrasing; avoid near-duplicate project evidence. | Expected evidence ids, allowed alternates, decoy evidence ids, trace ids when available. | answer_correctness, evidence_grounding, trap_avoidance, latency_resource. | qmd, ELF, memsearch, OpenViking. | +| `memory_evolution` | Verify updates, deletes, expiry, supersession, contradiction handling, and history. | Apply a new preference; suppress a deleted memory; explain what superseded an old fact. | Before/after memory versions, ingest decision rows or adapter history, current timeline event. | lifecycle_behavior, answer_correctness, evidence_grounding, trap_avoidance. | mem0, ELF, Graphiti/Zep, Letta. | +| `consolidation` | Test reviewable derived memory formation without hidden source mutation. | Produce a consolidation proposal; identify unsupported claims; discard stale synthesis. | Source inputs, derived proposal id, lineage, review state, conflict markers. | answer_correctness, evidence_grounding, uncertainty_handling, debuggability. | Claude Dreams, Gemini CLI Auto Memory, Always-On Memory Agent, ELF. | +| `knowledge_compilation` | Compile evidence into maintained project/entity/concept pages while preserving provenance. | Build a project status page; answer from compiled truth plus timeline; lint a stale page section. 
| Page section sources, backlinks, timeline entries, lint evidence. | answer_correctness, evidence_grounding, workflow_helpfulness, trap_avoidance. | llm-wiki, gbrain, graphify, ELF. | +| `operator_debugging_ux` | Show whether a wrong or ambiguous memory result can be debugged without raw store spelunking. | Explain why a result ranked first; inspect a trace; identify which stage dropped expected evidence. | Trace bundle, retrieval trajectory, candidate metrics, viewer or CLI readback. | debuggability, evidence_grounding, workflow_helpfulness, answer_correctness. | claude-mem, qmd, agentmemory, ELF. | +| `capture_integration` | Evaluate how accurately work observations become usable memory across agents and tools. | Capture a session decision; exclude private spans; import external agent observations. | Hook/import logs, write policy audits, excluded spans, resulting note ids. | answer_correctness, evidence_grounding, trap_avoidance, lifecycle_behavior. | agentmemory, claude-mem, memsearch, mem0. | +| `production_ops` | Prove safe operation under backup, restore, backfill, cold start, resource, and credential boundaries. | Resume interrupted import; restore from backup; report missing private manifest as bounded caveat. | Command/report artifacts, resource envelope, checkpoint state, failure guard evidence. | lifecycle_behavior, latency_resource, uncertainty_handling, evidence_grounding. | ELF, qmd, memsearch, LangGraph. | +| `personalization` | Apply user/project preferences correctly without leaking across scopes or overfitting stale preferences. | Remember preferred response style; avoid using another project tenant's note; update a preference. | Scoped memory ids, preference versions, tenant/project/agent context, negative cross-scope traps. | personalization_fit, trap_avoidance, evidence_grounding, answer_correctness. | mem0, Letta, agentmemory, ELF. | + +## Report Semantics + +Reports MUST preserve typed outcomes at job, suite, and project levels. 
A report MUST +NOT collapse the results into a single overall leaderboard without the underlying typed +state table. + +Outcome terms: + +| Term | Meaning | +| --- | --- | +| `pass` | The job or suite is encoded, ran to completion, met the pass threshold, satisfied required evidence, and hit no hard-fail rule. | +| `wrong_result` | The system completed the job but selected the wrong answer, wrong action, wrong current fact, or missed required evidence despite enough available evidence. | +| `lifecycle_fail` | The answer surface may be correct for retrieval, but encoded update, delete, expiry, cold-start, persistence, history, or supersession behavior failed. | +| `incomplete` | The runner could not reach the behavioral check because install, build, dependency, adapter wiring, parse, or runtime setup failed. | +| `blocked` | The check cannot be run safely without credentials, manual setup, private corpus input, durable runtime integration, or host integration outside the run scope. | +| `not_encoded` | The suite, job, adapter path, or scoring dimension is not implemented in the runner, so no pass/fail claim is allowed. | +| `unsupported_claim` | The system produced a substantive claim, decision, evidence citation, or capability claim that is not supported by the job corpus, required evidence, or report metadata. | + +`unsupported_claim` is distinct from `wrong_result`: `wrong_result` can be a supported +but incorrect selection, while `unsupported_claim` is an evidentiary failure. When both +apply, reports SHOULD surface `unsupported_claim` because it is higher risk for memory +systems used by agents. + +Suite status rules: + +- A suite is `pass` only when all encoded required jobs pass. +- A suite is `lifecycle_fail` when at least one lifecycle-scored job proves lifecycle + behavior wrong and no higher-risk `unsupported_claim` is present. 
+- A suite is `wrong_result` when at least one required job returns the wrong result and + no higher-risk `unsupported_claim` is present. +- A suite is `unsupported_claim` when any hard-fail unsupported claim occurs. +- A suite is `incomplete` or `blocked` when required jobs cannot run for those reasons. +- A suite is `not_encoded` when no job in that suite is implemented. + +Reports MUST include: + +- run id, runner version, corpus profile, job ids, suite ids, project adapter metadata; +- per-job status, normalized score, hard-fail hits, evidence ids used, trap ids used; +- per-suite typed status and score distribution; +- unsupported claim list with claim text or a bounded redacted description; +- explicit `not_encoded` suite list; +- private-corpus redaction policy when private fixtures are used. + +## Claim Rules + +- A project MAY claim a suite pass only for suites with encoded jobs and a published + report using this contract. +- A project MUST NOT use generated public jobs to claim private production readiness. +- A project MUST NOT treat `blocked`, `incomplete`, or `not_encoded` as evidence of + weakness or strength; those states only describe benchmark coverage. +- A project MUST NOT claim "best memory system" from this suite. Reports SHOULD describe + dimension-specific results and typed limitations. +- Existing ELF/qmd-leading live baseline results MAY be cited as retrieval/lifecycle + evidence, but MUST NOT be reinterpreted as real-world job suite wins. + +## Downstream Implementation Contract + +Runner implementation issues can cite this spec and choose any subset of suites. The +minimum useful runner increment is: + +- one encoded `real_world_job` fixture; +- one adapter path; +- scoring for all required rubric dimensions in that job; +- typed report output using the Report Semantics section. + +Implementation issues MUST state which suites remain `not_encoded`.
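
The suite status rules in the Report Semantics section can be sketched as a small roll-up over per-job statuses. This is a hedged illustration, not a runner implementation: the function name `suite_status` is hypothetical, and the relative order of `wrong_result`, `lifecycle_fail`, `incomplete`, and `blocked` is an assumption, because the spec only states that `unsupported_claim` outranks the other failures.

```python
# Hypothetical suite roll-up. Only "unsupported_claim first" comes from the
# spec; the rest of this precedence order is an assumption for illustration.
JOB_STATUSES = {
    "pass", "wrong_result", "lifecycle_fail", "incomplete",
    "blocked", "not_encoded", "unsupported_claim",
}
ASSUMED_PRECEDENCE = (
    "unsupported_claim", "wrong_result", "lifecycle_fail", "incomplete", "blocked",
)


def suite_status(job_statuses: list[str]) -> str:
    """Collapse per-job typed statuses into one typed suite status."""
    unknown = sorted(set(job_statuses) - JOB_STATUSES)
    if unknown:
        raise ValueError(f"unknown job statuses: {unknown}")
    # Jobs that are not encoded carry no pass/fail information.
    encoded = [status for status in job_statuses if status != "not_encoded"]
    if not encoded:
        return "not_encoded"
    for status in ASSUMED_PRECEDENCE:
        if status in encoded:
            return status
    # Every remaining encoded job passed.
    return "pass"
```

A report built on this would still publish the per-job table next to the suite status, since the contract forbids collapsing results without the underlying typed states.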