From c60678728a2f8fcd6282bb96f31269c00f814f67 Mon Sep 17 00:00:00 2001
From: Yvette Carlisle <y@acg.box>
Date: Tue, 9 Jun 2026 21:17:58 +0800
Subject: [PATCH] {"schema":"decodex/commit/1","summary":"Refresh external
 memory benchmark dimension map","authority":"XY-841"}

---
 README.md                                     |   3 +-
 .../research/comparison_external_projects.md  |  87 ++++++++++-
 .../research/research_projects_inventory.md   |  49 +++----
 ...-external-memory-benchmark-dimensions.json | 136 ++++++++++++++++++
 4 files changed, 249 insertions(+), 26 deletions(-)
 create mode 100644 docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json

diff --git a/README.md b/README.md
index 0fb0a90f..edc59038 100644
--- a/README.md
+++ b/README.md
@@ -203,8 +203,9 @@ Detailed comparison, mechanism-level analysis, and source map:
 - [Detailed External Comparison](docs/guide/research/comparison_external_projects.md)
 - [Research Projects Inventory](docs/guide/research/research_projects_inventory.md)
 - [Agent Memory Selection Research Run](docs/research/2026-06-08-agent-memory-selection.json)
+- [Real-World Benchmark Dimension Research Run](docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json)
 
-Latest external research refresh: June 8, 2026.
+Latest external research refresh: June 9, 2026.
 
 ## Documentation
 
diff --git a/docs/guide/research/comparison_external_projects.md b/docs/guide/research/comparison_external_projects.md
index 4594b8b2..54be2ba7 100644
--- a/docs/guide/research/comparison_external_projects.md
+++ b/docs/guide/research/comparison_external_projects.md
@@ -10,6 +10,8 @@ Scope note: This document is intentionally detailed and source-heavy. Keep `READ
 For a full list of reviewed and pending projects, see `docs/guide/research/research_projects_inventory.md`.
 For the June 2026 agentmemory and dreaming decision run, see
 `docs/research/2026-06-08-agent-memory-selection.json`.
+For the June 2026 real-world benchmark-dimension refresh, see
+`docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json`.
 
 Comparison focuses on shared capabilities, ELF distinctives, and objective trade-offs. These projects solve adjacent problems, but their primary storage units and default workflows differ.
 
@@ -32,6 +34,87 @@ Legend:
 Note: In this section, mem0 refers to the Mem0 ecosystem, including OpenMemory (an MCP memory server with a built-in UI).
 OpenViking is included as a newly reviewed project with mechanism-level analysis.
 
+## June 2026 Real-World Benchmark-Dimension Map
+
+Snapshot date for this subsection: June 9, 2026.
+
+This map translates the existing external-project research into benchmark dimensions
+for the real-world agent memory suite. It does not add new adapter pass/fail evidence.
+Check each project's evidence class before making claims:
+
+- `benchmark-grounded`: ELF's Docker benchmark has runnable adapter evidence for this
+  project and dimension. Read the exact report before quoting a pass/fail result.
+- `docs-grounded`: official docs or READMEs indicate a likely strength, but ELF has not
+  reproduced the behavior in the benchmark runner.
+- `watch`: the project remains D0 or otherwise pending; do not assign strength claims
+  until a deep dive or adapter run exists.
+
+Current benchmark-grounded scope is narrow. The June 9, 2026 all-project smoke run
+proved encoded same-corpus/lifecycle behavior only for the current adapters: ELF and qmd
+passed their encoded smoke checks; agentmemory passed same-corpus retrieval but failed
+or could not prove durable lifecycle behavior; memsearch, mem0, OpenViking, and
+claude-mem retained `incomplete`, wrong-result, or not-encoded states. The broader
+suite-fit mapping below is research guidance, not a benchmark result.
+
+Benchmark suite labels:
+
+| Suite | Real-world job shape |
+| ----- | -------------------- |
+| `rw.resume-evidence` | Resume a stalled agent task, recover the right prior decision, cite required evidence, and avoid negative traps. |
+| `rw.lifecycle-staleness` | Update, delete, expire, cold-start, and contradiction cases where stale facts must stop winning. |
+| `rw.operator-continuity` | Capture session observations, inspect memory state, and support day-to-day agent continuity with low friction. |
+| `rw.retrieval-debug` | Explain query expansion, hybrid retrieval, fusion, rerank, and wrong-result causes. |
+| `rw.context-trajectory` | Navigate multi-stage or hierarchical context before selecting final evidence. |
+| `rw.knowledge-synthesis` | Compile durable project/entity/concept pages from memory and keep them lintable or repairable. |
+| `rw.consolidation-review` | Run background consolidation while keeping derived output reviewable and evidence-linked. |
+| `rw.graph-temporal` | Track facts, entities, relations, validity windows, and current-versus-historical answers. |
+| `rw.core-archival` | Separate always-loaded operating memory from retrieval-only archival memory. |
+| `rw.replay-regression` | Replay, fork, or checkpoint agent state to debug memory-assisted work and regression failures. |
+| `rw.graph-navigation` | Use graph-compressed corpus structure to guide agents before raw retrieval or file inspection. |
+
+Project-to-suite map:
+
+| Project | Best-fit real-world suites | Why this project matters for that suite | Fair adapter evidence before claims | Evidence class and confidence | Current ELF position |
+| ------- | -------------------------- | -------------------------------------- | ---------------------------------- | ----------------------------- | -------------------- |
+| agentmemory | `rw.operator-continuity`, `rw.resume-evidence`, `rw.lifecycle-staleness` | Cross-agent hooks, MCP/REST packaging, viewer, lifecycle/consolidation claims, and coding-agent continuity focus make it the right reference for daily agent memory ergonomics. | Use durable upstream storage rather than the current in-memory mock; ingest realistic agent sessions through the public hook/API path; prove restart, update/supersede, delete, and viewer/trace readback. | Mixed: benchmark-grounded only for current same-corpus retrieval; current lifecycle evidence is a failure/blocker, while hooks/viewer/consolidation are docs-grounded. Confidence: medium for suite fit, low for durable adapter quality. | ELF is stronger on evidence-bound writes and source-of-truth discipline; agentmemory remains the reference for capture breadth and agent-continuity UX. |
+| qmd | `rw.retrieval-debug`, `rw.lifecycle-staleness`, `rw.resume-evidence` | Its local CLI, structured JSON query output, expansion modes, hybrid routing, weighted fusion, rerank, update, delete, and cold-start path make it the strongest local retrieval-debug baseline. | Run `qmd` over the real-world corpus, capture query JSON, then rewrite/delete corpus files and rerun update/embed/query in fresh processes. | Benchmark-grounded for current smoke retrieval/update/delete/cold-start pass; docs-grounded for deeper query planning ergonomics. Confidence: high for local adapter baseline. | ELF is not yet stronger on local CLI debug ergonomics; treat qmd as the retrieval-debug reference while keeping ELF's service/provenance model. |
+| claude-mem | `rw.operator-continuity`, `rw.resume-evidence`, `rw.retrieval-debug` | Progressive-disclosure search, auto-capture hooks, local viewer, and observation/timeline workflows are directly aligned with real agent resumption jobs. | Exercise a real local repository with hook-driven capture, then evaluate `search -> timeline -> observations` behavior after restart; do not rely on mocked storage. | Docs-grounded for progressive disclosure/viewer; current benchmark adapter evidence is incomplete/wrong-result and mostly not encoded for lifecycle. Confidence: medium for product reference, low for current adapter claims. | ELF has stronger provenance and service boundaries, but claude-mem remains a reference for operator workflow and progressive disclosure UX. |
+| mem0 / OpenMemory | `rw.lifecycle-staleness`, `rw.graph-temporal`, `rw.operator-continuity`, `rw.resume-evidence` | Entity-scoped memory, memory history, expiration, hosted/OSS surfaces, OpenMemory UI, and optional graph memory make it the broadest lifecycle and ecosystem comparison target. | Separate OSS local FastEmbed/Qdrant evidence from hosted Platform claims; prove add/update/delete/history, entity-scoped retrieval, expiration exclusion, OpenMemory UI readback, and optional graph context on the same corpus. | Docs-grounded for lifecycle/entity/graph/UI claims; current local adapter is incomplete/wrong-result for same-corpus retrieval and delete remains not encoded. Confidence: medium for suite fit, low for current adapter quality. | ELF is stronger on deterministic evidence-bound writes; mem0/OpenMemory is the reference for ecosystem reach, entity-scoped history, hosted option, and optional graph UX. |
+| memsearch | `rw.lifecycle-staleness`, `rw.retrieval-debug`, `rw.resume-evidence` | Markdown as canonical memory plus incremental/content-addressed reindexing is a useful model for source transparency and rebuildable derived indexes. | Index a real-world Markdown corpus, mutate/delete files, rerun index/search from fresh processes, and record Milvus mode so Lite/Server/Cloud behavior is not conflated. | Docs-grounded for architecture; current adapter is incomplete/invalid-result, so no pass/fail quality claim is allowed. Confidence: medium for design pattern, low for current adapter evidence. | ELF already owns source-of-truth plus rebuildable index at service level; memsearch remains a reference for simple local canonical-store ergonomics. |
+| OpenViking | `rw.context-trajectory`, `rw.resume-evidence`, `rw.retrieval-debug` | `viking://` context organization, intent analysis, hierarchical retrieval, staged find/search behavior, and session compression are relevant to multi-hop agent context jobs. | Pin or provide a Docker-compatible local embedding path, then evaluate `add_resource`/`find`/`search` over multi-stage jobs with stage output, hierarchy, and session memory evidence. | Docs-grounded for mechanism; current benchmark adapter is incomplete due to a local embedding install failure. Confidence: medium for architecture reference, low for runnable adapter quality. | ELF has first-class traces and evidence-bound notes, but OpenViking is the reference for hierarchical context trajectory and filesystem-like organization. |
+| llm-wiki | `rw.knowledge-synthesis`, `rw.resume-evidence` | Query/save/lint flows and topic-scoped wiki pages are a useful reference for turning retrieved memory into maintained project knowledge. | Run a corpus-to-wiki job, ask resume/decision questions, require page citations back to source memory, then mutate a stale source and prove lint/repair catches it. | Docs-grounded D1; no benchmark adapter evidence. Confidence: medium for derived-knowledge fit. | ELF is not yet stronger on derived knowledge pages; llm-wiki should inform rebuildable, evidence-cited dossiers rather than core storage. |
+| gbrain | `rw.knowledge-synthesis`, `rw.operator-continuity` | `compiled_truth`, timeline sections, backlinks, primary-home routing, and enrichment workflows model a living operational brain for project work. | Build or update pages from the real-world corpus, require current-truth plus timeline answers, and prove enrichment/backlink maintenance does not hide unsupported claims. | Docs-grounded D1; no benchmark adapter evidence. Confidence: medium for operator knowledge UX. | ELF should keep source notes authoritative; gbrain is a reference for presentation, enrichment, and maintenance loops. |
+| Always-On Memory Agent | `rw.consolidation-review`, `rw.operator-continuity` | The file/API/dashboard ingest loop and timer-based consolidation show how background memory formation becomes a user-visible product surface. | Run scheduled consolidation on a fixed corpus, record source rows and output insights, then score whether consolidation is reviewable, repeatable, and bounded against unsupported claims. | Docs-grounded D1; no benchmark adapter evidence. Confidence: medium for consolidation workflow reference. | ELF should borrow scheduling and operator controls while keeping deterministic writes and reviewable derived outputs. |
+| graphify | `rw.graph-navigation`, `rw.knowledge-synthesis`, `rw.resume-evidence` | Deterministic code extraction, LLM-assisted graph building, honesty tags, graph reports, and assistant hooks are strong references for graph-compressed navigation over large corpora. | Generate graph/report artifacts from the benchmark corpus, require answers to use graph structure plus source evidence, and prove rebuild behavior after corpus edits. | Docs-grounded D1; no benchmark adapter evidence. Confidence: medium for graph-navigation reference. | ELF is stronger as a memory service; graphify is the reference for rebuildable graph reports and pre-search guidance. |
+| Letta | `rw.core-archival`, `rw.operator-continuity` | Core memory blocks, archival memory, and shared/read-only memory blocks map directly to always-loaded operating context versus retrievable memory. | Build a multi-agent job where core blocks must be attached/detached/shared read-only, while archival memory is retrieved separately and audited. | Docs-grounded D1; no benchmark adapter evidence. Confidence: medium for memory-semantics reference. | ELF has scoped notes but not first-class core/archival block ergonomics; Letta is the reference dimension. |
+| LangGraph | `rw.replay-regression`, `rw.resume-evidence` | Thread checkpoints, durable execution, replay, fork, and time travel define a strong model for debugging agent-state and memory-regression behavior. | Run an agent job with memory reads across checkpoints, replay/fork the thread after a stale-memory failure, and verify side-effect boundaries. | Docs-grounded D1; no benchmark adapter evidence. Confidence: medium for replay workflow reference. | ELF traces are useful but do not replace full agent checkpoint replay; LangGraph is the reference for replay-regression jobs. |
+| Graphiti / Zep | `rw.graph-temporal`, `rw.resume-evidence` | Temporal entities, relations, fact triples, validity windows, and graph search directly target stale/contradictory factual memory. | Add fact triples with validity changes, query current and historical answers, and score invalidation/append behavior under contradiction traps. | Docs-grounded D1; no benchmark adapter evidence. Confidence: medium-high for temporal-graph dimension. | ELF graph-lite is not yet stronger on temporal graph validity; Graphiti/Zep is the reference dimension. |
+| nanograph | `rw.graph-temporal`, `rw.retrieval-debug` | Typed schema and typed query ergonomics are relevant to making ELF graph-lite interactions inspectable and hard to misuse. | Define typed graph schemas and queries for the same fact set, then score developer-visible validation, query shape, and explainability rather than retrieval quality alone. | Docs-grounded D1; no benchmark adapter evidence. Confidence: medium for DX reference, low for memory-system comparison. | ELF should borrow typed graph ergonomics without treating nanograph as a full memory backend. |
+
+Pending watch items remain D0. Keep them out of benchmark strength claims until current
+evidence is gathered:
+
+| Watch item | Candidate suite if promoted | Minimum evidence needed before adapter or quality claims |
+| ---------- | --------------------------- | ------------------------------------------------------- |
+| RAGFlow | `rw.resume-evidence`, `rw.graph-navigation`, `rw.retrieval-debug` | D1/D2 deep dive on deployability, corpus ingestion, graph/RAG retrieval path, API/CLI outputs, and Docker resource envelope. |
+| LightRAG | `rw.graph-navigation`, `rw.graph-temporal`, `rw.retrieval-debug` | D1/D2 deep dive on graph extraction/update semantics, local persistence, query output, and whether stale/corrected facts can be tested fairly. |
+| GraphRAG | `rw.graph-navigation`, `rw.knowledge-synthesis`, `rw.retrieval-debug` | D1/D2 deep dive on indexing cost, graph summaries, update/rebuild behavior, source citation guarantees, and task-level output inspectability. |
+
+## Where ELF Is Not Yet The Reference
+
+| Benchmark dimension | Current reference project(s) | ELF gap to test before claiming strength |
+| ------------------- | ---------------------------- | ---------------------------------------- |
+| Local retrieval debugging and CLI transparency | qmd | ELF needs equally fast local knobs/readback for expansion, hybrid fusion, rerank, and wrong-result diagnosis. |
+| Turn-by-turn agent capture and daily continuity | agentmemory, claude-mem, OpenMemory | ELF has service and viewer surfaces, but not the same turnkey hook breadth or session-continuity product ergonomics. |
+| Progressive disclosure UX | claude-mem, OpenViking | ELF has L0/L1/L2 shaping and traces, but the operator workflow still needs better search-session navigation. |
+| Entity-scoped history and managed ecosystem reach | mem0/OpenMemory | ELF has ingest decisions and versions, but not the same hosted option, SDK reach, or first-class memory history surface. |
+| Core memory versus archival memory | Letta | ELF scopes notes well, but lacks attachable/read-only core memory blocks as a distinct user-facing layer. |
+| Temporal graph validity | Graphiti/Zep | ELF graph-lite persists relation context, but its temporal invalidation and current-versus-historical graph behavior is not yet at reference level. |
+| Agent replay and forkable regression debugging | LangGraph | ELF traces are replay evidence for retrieval, not full persisted agent-state replay with side-effect boundaries. |
+| Derived knowledge pages and lint/repair loops | llm-wiki, gbrain | ELF does not yet ship rebuildable entity/project pages with unsupported-claim lint as a first-class workflow. |
+| Scheduled consolidation as a product surface | Always-On Memory Agent | ELF's target should be reviewable derived consolidation, but the scheduling/operator-control workflow is not implemented. |
+| Graph-compressed navigation over large corpora | graphify, GraphRAG/LightRAG watch items | ELF relation context is bounded and evidence-linked, but broader graph report/navigation workflows remain future work. |
+
 ## June 2026 Agentmemory And Dreaming Refresh
 
 Snapshot date for this subsection: June 8, 2026.
@@ -276,7 +359,9 @@ Key takeaways for ELF from this deeper pass:
 
 ## Where ELF Is Currently Weaker (Objective Gaps)
 
-- No built-in web UI viewer yet (claude-mem and OpenMemory provide this today).
+- ELF now has a local admin viewer and retrieval observability surfaces, but
+  claude-mem, OpenMemory, and agentmemory remain stronger references for turnkey
+  memory-inspection and session-continuity ergonomics.
 - No hosted/cloud product option (mem0 provides managed deployment).
 - Graph support is currently graph-lite (`POST /v2/graph/query`) and does not yet include multi-hop/global graph reasoning patterns used by GraphRAG-focused projects.
 - Less turnkey for zero-config local plugin workflows than memsearch/claude-mem defaults.
diff --git a/docs/guide/research/research_projects_inventory.md b/docs/guide/research/research_projects_inventory.md
index 6cf50e62..c84ddab6 100644
--- a/docs/guide/research/research_projects_inventory.md
+++ b/docs/guide/research/research_projects_inventory.md
@@ -6,7 +6,7 @@ Inputs: Existing research notes, open architecture questions, and tracked adopti
 Depends on: `docs/guide/research/comparison_external_projects.md`.
 Outputs: A current inventory of reviewed and pending external projects.
 
-Last updated: June 8, 2026.
+Last updated: June 9, 2026.
 
 ## Legend
 
@@ -16,28 +16,28 @@ Last updated: June 8, 2026.
 
 ## Inventory
 
-| Project | Research depth | Current status | Why it matters to ELF | Primary reference |
-| ------- | -------------- | -------------- | --------------------- | ----------------- |
-| [agentmemory](https://github.com/rohitg00/agentmemory) | D1 | Reviewed | Cross-agent coding-memory hooks, MCP/REST surface, viewer, consolidation lifecycle, and external benchmark target | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-08-agent-memory-selection.json` |
-| [OpenAI ChatGPT Memory Dreaming](https://openai.com/index/chatgpt-memory-dreaming/) | D1 | Reviewed | Background memory synthesis and staleness repair as a product direction | `docs/research/2026-06-08-agent-memory-selection.json` |
-| [Claude Managed Agents Dreams](https://platform.claude.com/docs/en/managed-agents/dreams) | D1 | Reviewed | Reviewable derived memory-store output over past sessions; strong safety shape for ELF consolidation | `docs/research/2026-06-08-agent-memory-selection.json` |
-| [Gemini CLI Auto Memory](https://github.com/google-gemini/gemini-cli/blob/main/docs/cli/auto-memory.md) | D1 | Reviewed | Background session mining with project-local review inbox for memory patches and skills | `docs/research/2026-06-08-agent-memory-selection.json` |
-| [mem0](https://github.com/mem0ai/mem0) | D2 | Reviewed | Graph memory as additive context, memory history and async mode trade-offs | `docs/guide/research/comparison_external_projects.md` |
-| [memsearch](https://github.com/zilliztech/memsearch) | D2 | Reviewed | Markdown-first SoT + rebuildable index pattern | `docs/guide/research/comparison_external_projects.md` |
-| [qmd](https://github.com/tobi/qmd) | D2 | Reviewed | Retrieval routing, weighted fusion, and local-first explainability | `docs/guide/research/comparison_external_projects.md` |
-| [claude-mem](https://github.com/thedotmack/claude-mem) | D2 | Reviewed | Progressive disclosure and strong operator workflow | `docs/guide/research/comparison_external_projects.md` |
-| [OpenViking](https://github.com/volcengine/OpenViking) | D2 | Reviewed | Filesystem context paradigm, hierarchical retrieval, trajectory observability | `docs/guide/research/comparison_external_projects.md` |
-| [llm-wiki](https://github.com/nvk/llm-wiki) | D1 | Reviewed | LLM-maintained wiki pattern, topic-scoped knowledge bases, query-save and lint workflows | `docs/guide/research/comparison_external_projects.md` |
-| [gbrain](https://github.com/garrytan/gbrain) | D1 | Reviewed | Operational knowledge brain, `compiled_truth` + timeline pages, enrichment and maintenance loops | `docs/guide/research/comparison_external_projects.md` |
-| [Always-On Memory Agent](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/gemini/agents/always-on-memory-agent) | D1 | Reviewed | Always-on multimodal ingest + scheduled consolidation loop with simple local ops surface | `docs/guide/research/comparison_external_projects.md` |
-| [graphify](https://github.com/safishamsi/graphify) | D1 | Reviewed | Multimodal graph compression, deterministic code extraction, and always-on graph-guided assistant workflow | `docs/guide/research/comparison_external_projects.md` |
-| [Letta](https://github.com/letta-ai/letta) | D1 | Reviewed | Core vs archival memory split, shared blocks | `docs/guide/research/comparison_external_projects.md` |
-| [LangGraph](https://docs.langchain.com/oss/python/langgraph/persistence) | D1 | Reviewed | Checkpoint/replay mindset for quality regression workflows | `docs/guide/research/comparison_external_projects.md` |
-| [Graphiti / Zep](https://help.getzep.com/graphiti/core-concepts/temporal-awareness) | D1 | Reviewed | Temporal fact validity model for graph-like memory evolution | `docs/guide/research/comparison_external_projects.md` |
-| [nanograph](https://github.com/aaltshuler/nanograph) | D1 | Reviewed | Typed schema + typed query ergonomics for graph-lite developer experience | `docs/guide/research/comparison_external_projects.md` |
-| [RAGFlow](https://github.com/infiniflow/ragflow) | D0 | Pending deep dive | Potential framework integration discussion; not yet audited to adoption level | Discussion history only |
-| [LightRAG](https://github.com/HKUDS/LightRAG) | D0 | Pending deep dive | Graph-augmented RAG strategy relevance; not yet audited to adoption level | Discussion history only |
-| [GraphRAG](https://www.microsoft.com/en-us/research/project/graphrag/) | D0 | Pending deep dive | Graph-based retrieval concepts; not yet audited to implementation decision level | Discussion history only |
+| Project | Research depth | Current status | Benchmark dimension role | Why it matters to ELF | Primary reference |
+| ------- | -------------- | -------------- | ------------------------ | --------------------- | ----------------- |
+| [agentmemory](https://github.com/rohitg00/agentmemory) | D1 | Reviewed | `rw.operator-continuity`, `rw.resume-evidence`, `rw.lifecycle-staleness` | Cross-agent coding-memory hooks, MCP/REST surface, viewer, consolidation lifecycle, and external benchmark target | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-08-agent-memory-selection.json`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [OpenAI ChatGPT Memory Dreaming](https://openai.com/index/chatgpt-memory-dreaming/) | D1 | Reviewed | `rw.consolidation-review` | Background memory synthesis and staleness repair as a product direction | `docs/research/2026-06-08-agent-memory-selection.json` |
+| [Claude Managed Agents Dreams](https://platform.claude.com/docs/en/managed-agents/dreams) | D1 | Reviewed | `rw.consolidation-review` | Reviewable derived memory-store output over past sessions; strong safety shape for ELF consolidation | `docs/research/2026-06-08-agent-memory-selection.json` |
+| [Gemini CLI Auto Memory](https://github.com/google-gemini/gemini-cli/blob/main/docs/cli/auto-memory.md) | D1 | Reviewed | `rw.consolidation-review`, `rw.operator-continuity` | Background session mining with project-local review inbox for memory patches and skills | `docs/research/2026-06-08-agent-memory-selection.json` |
+| [mem0](https://github.com/mem0ai/mem0) | D2 | Reviewed | `rw.lifecycle-staleness`, `rw.graph-temporal`, `rw.operator-continuity`, `rw.resume-evidence` | Graph memory as additive context, memory history and async mode trade-offs | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [memsearch](https://github.com/zilliztech/memsearch) | D2 | Reviewed | `rw.lifecycle-staleness`, `rw.retrieval-debug`, `rw.resume-evidence` | Markdown-first SoT + rebuildable index pattern | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [qmd](https://github.com/tobi/qmd) | D2 | Reviewed | `rw.retrieval-debug`, `rw.lifecycle-staleness`, `rw.resume-evidence` | Retrieval routing, weighted fusion, and local-first explainability | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [claude-mem](https://github.com/thedotmack/claude-mem) | D2 | Reviewed | `rw.operator-continuity`, `rw.resume-evidence`, `rw.retrieval-debug` | Progressive disclosure and strong operator workflow | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [OpenViking](https://github.com/volcengine/OpenViking) | D2 | Reviewed | `rw.context-trajectory`, `rw.resume-evidence`, `rw.retrieval-debug` | Filesystem context paradigm, hierarchical retrieval, trajectory observability | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [llm-wiki](https://github.com/nvk/llm-wiki) | D1 | Reviewed | `rw.knowledge-synthesis`, `rw.resume-evidence` | LLM-maintained wiki pattern, topic-scoped knowledge bases, query-save and lint workflows | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [gbrain](https://github.com/garrytan/gbrain) | D1 | Reviewed | `rw.knowledge-synthesis`, `rw.operator-continuity` | Operational knowledge brain, `compiled_truth` + timeline pages, enrichment and maintenance loops | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [Always-On Memory Agent](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/gemini/agents/always-on-memory-agent) | D1 | Reviewed | `rw.consolidation-review`, `rw.operator-continuity` | Always-on multimodal ingest + scheduled consolidation loop with simple local ops surface | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [graphify](https://github.com/safishamsi/graphify) | D1 | Reviewed | `rw.graph-navigation`, `rw.knowledge-synthesis`, `rw.resume-evidence` | Multimodal graph compression, deterministic code extraction, and always-on graph-guided assistant workflow | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [Letta](https://github.com/letta-ai/letta) | D1 | Reviewed | `rw.core-archival`, `rw.operator-continuity` | Core vs archival memory split, shared blocks | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [LangGraph](https://docs.langchain.com/oss/python/langgraph/persistence) | D1 | Reviewed | `rw.replay-regression`, `rw.resume-evidence` | Checkpoint/replay mindset for quality regression workflows | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [Graphiti / Zep](https://help.getzep.com/graphiti/core-concepts/temporal-awareness) | D1 | Reviewed | `rw.graph-temporal`, `rw.resume-evidence` | Temporal fact validity model for graph-like memory evolution | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [nanograph](https://github.com/aaltshuler/nanograph) | D1 | Reviewed | `rw.graph-temporal`, `rw.retrieval-debug` | Typed schema + typed query ergonomics for graph-lite developer experience | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [RAGFlow](https://github.com/infiniflow/ragflow) | D0 | Watch item; pending deep dive | Candidate `rw.resume-evidence`, `rw.graph-navigation`, `rw.retrieval-debug`; no strength claim | Potential framework integration discussion; not yet audited to adoption level | Discussion history only; see watch-item evidence requirements in `docs/guide/research/comparison_external_projects.md` |
+| [LightRAG](https://github.com/HKUDS/LightRAG) | D0 | Watch item; pending deep dive | Candidate `rw.graph-navigation`, `rw.graph-temporal`, `rw.retrieval-debug`; no strength claim | Graph-augmented RAG strategy relevance; not yet audited to adoption level | Discussion history only; see watch-item evidence requirements in `docs/guide/research/comparison_external_projects.md` |
+| [GraphRAG](https://www.microsoft.com/en-us/research/project/graphrag/) | D0 | Watch item; pending deep dive | Candidate `rw.graph-navigation`, `rw.knowledge-synthesis`, `rw.retrieval-debug`; no strength claim | Graph-based retrieval concepts; not yet audited to implementation decision level | Discussion history only; see watch-item evidence requirements in `docs/guide/research/comparison_external_projects.md` |
 
 ## June 2026 Activity Snapshot
 
@@ -70,8 +70,9 @@ replacing ELF's evidence-bound service contract.
   - [XY-40](https://linear.app/hack-ink/issue/XY-40/vision-track-elf-as-a-high-trust-memory-system-for-singlemulti-agent)
   - [XY-51](https://linear.app/hack-ink/issue/XY-51/agent-memory-ux-mcp-surface-skills-doc-pointers-epic)
   - [XY-63](https://linear.app/hack-ink/issue/XY-63/research-openviking-as-optional-doc-backend-integration-sketch)
-- Current June 2026 research run:
+- Current June 2026 research runs:
   - `docs/research/2026-06-08-agent-memory-selection.json`
+  - `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json`
 
 ## Notes
 
diff --git a/docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json b/docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json
new file mode 100644
index 00000000..198df1af
--- /dev/null
+++ b/docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json
@@ -0,0 +1,136 @@
+{
+  "schema": "research-run/2",
+  "run_id": "2026-06-09-xy-841-external-memory-benchmark-dimensions",
+  "question": "How should ELF map reviewed external memory projects to real-world benchmark dimensions without overstating docs-only evidence as benchmark proof?",
+  "success_criteria": [
+    "Map every reviewed external project in the issue scope to one or more real-world benchmark suites.",
+    "Separate benchmark-grounded adapter evidence from docs-grounded research claims.",
+    "Identify dimensions where ELF should not be treated as the reference yet.",
+    "Keep pending D0 projects as watch items unless current evidence is gathered in scope."
+  ],
+  "constraints": [
+    "Do not implement benchmark adapters or change ELF runtime behavior.",
+    "Do not make benchmark pass/fail claims without runnable evidence from checked-in reports.",
+    "Use existing reviewed docs and benchmark reports as the authority for this docs-only refresh."
+  ],
+  "stop_rule": "Stop once the comparison and inventory can route future real_world_job benchmark design without implying unproven external quality claims.",
+  "primary_hypothesis": "The capability map should treat qmd, claude-mem, agentmemory, mem0/OpenMemory, OpenViking, memsearch, llm-wiki, gbrain, Always-On Memory Agent, graphify, Letta, LangGraph, Graphiti/Zep, and nanograph as dimension references only where docs or benchmark evidence supports the fit; D0 RAG projects should remain watch items.",
+  "rival_hypotheses": [
+    "Use the current smoke benchmark status alone to rank external projects.",
+    "Treat official external README claims as sufficient benchmark-quality evidence.",
+    "Drop pending RAGFlow, LightRAG, and GraphRAG from the map until adapters exist."
+  ],
+  "falsifiers": [
+    "If a current runnable adapter report exists for a broader dimension, docs-only confidence would be too conservative.",
+    "If a listed project lacks any documented mechanism matching the assigned suite, the suite map would overstate its reference role.",
+    "If D0 watch items are assigned strengths, the map would violate the no-current-evidence boundary."
+  ],
+  "coverage": {
+    "mode": "repo_docs_and_existing_external_research",
+    "min_source_families": 3
+  },
+  "events": [
+    {
+      "seq": 1,
+      "type": "probe_completed",
+      "remaining_option_count": 3,
+      "independent_option_questions": [
+        "Which benchmark dimensions are already proven by ELF's checked-in adapter evidence?",
+        "Which projects should be treated as docs-grounded references for unencoded dimensions?",
+        "Which pending projects must stay as watch items?"
+      ],
+      "external_slices": []
+    },
+    {
+      "seq": 2,
+      "type": "evidence_recorded",
+      "evidence": [
+        {
+          "id": "E1",
+          "kind": "observation",
+          "summary": "README states that the June 9 Docker live baseline and production adoption gate prove a bounded ELF production-provider path, while the all-project smoke has ELF and qmd passing encoded checks and other external projects retaining typed failure or incomplete states.",
+          "source_family": "repo_docs",
+          "source_locator": "README.md"
+        },
+        {
+          "id": "E2",
+          "kind": "observation",
+          "summary": "The production adoption gate explicitly bounds external comparison as an objective adapter matrix, not an overall superiority claim, and records qmd pass, agentmemory lifecycle_fail, and memsearch/mem0/OpenViking/claude-mem incomplete or wrong_result states.",
+          "source_family": "benchmark_report",
+          "source_locator": "docs/guide/benchmarking/2026-06-09-production-adoption-gate-report.md"
+        },
+        {
+          "id": "E3",
+          "kind": "observation",
+          "summary": "The live baseline runbook defines pass, wrong_result, lifecycle_fail, incomplete, blocked, and not_encoded semantics, and warns that incomplete, blocked, and not_encoded are not passes.",
+          "source_family": "repo_runbook",
+          "source_locator": "docs/guide/benchmarking/live_baseline_benchmark.md"
+        },
+        {
+          "id": "E4",
+          "kind": "observation",
+          "summary": "The existing comparison contains D1/D2 docs-grounded mechanism research for agentmemory, qmd, claude-mem, mem0/OpenMemory, memsearch, OpenViking, llm-wiki, gbrain, Always-On Memory Agent, graphify, Letta, LangGraph, Graphiti/Zep, and nanograph.",
+          "source_family": "repo_research_docs",
+          "source_locator": "docs/guide/research/comparison_external_projects.md"
+        },
+        {
+          "id": "E5",
+          "kind": "observation",
+          "summary": "The inventory marks RAGFlow, LightRAG, and GraphRAG as D0 pending deep dives, so they can only be watch items in this lane.",
+          "source_family": "repo_research_docs",
+          "source_locator": "docs/guide/research/research_projects_inventory.md"
+        }
+      ]
+    },
+    {
+      "seq": 3,
+      "type": "tradeoffs_recorded",
+      "tradeoffs": [
+        {
+          "id": "T1",
+          "summary": "Using only current smoke results would hide useful future benchmark dimensions such as operator continuity, temporal graph validity, core/archival memory, and knowledge synthesis.",
+          "supporting_evidence_ids": [
+            "E2",
+            "E4"
+          ],
+          "disconfirming_evidence_ids": []
+        },
+        {
+          "id": "T2",
+          "summary": "Using docs-grounded references without labels would overstate external project quality because the benchmark runner has not reproduced most broader claims.",
+          "supporting_evidence_ids": [
+            "E2",
+            "E3"
+          ],
+          "disconfirming_evidence_ids": []
+        },
+        {
+          "id": "T3",
+          "summary": "Keeping D0 RAG projects as watch items preserves future coverage without pretending that adapter feasibility, resource envelope, or evidence quality has been audited.",
+          "supporting_evidence_ids": [
+            "E3",
+            "E5"
+          ],
+          "disconfirming_evidence_ids": []
+        }
+      ]
+    },
+    {
+      "seq": 4,
+      "type": "challenge_recorded",
+      "summary": "The main risk is that a broad suite map could read like a quality ranking. The mitigation is to label evidence class per project, repeat that only current adapter reports can support pass/fail claims, and call out ELF gaps by reference dimension instead of claiming overall superiority.",
+      "resolved": true
+    },
+    {
+      "seq": 5,
+      "type": "finalized_decision_ready",
+      "confidence": "medium",
+      "decision": "Update the comparison and inventory with a real-world benchmark-dimension map. Treat qmd, claude-mem, agentmemory, mem0/OpenMemory, memsearch, OpenViking, llm-wiki, gbrain, Always-On Memory Agent, graphify, Letta, LangGraph, Graphiti/Zep, and nanograph as reference projects for specific dimensions, but separate benchmark-grounded evidence from docs-grounded suite fit. Keep RAGFlow, LightRAG, and GraphRAG as D0 watch items.",
+      "missing_evidence": [
+        "No new upstream source refresh was performed in this lane.",
+        "No new benchmark adapter or real_world_job suite was executed.",
+        "Most non-smoke dimensions remain docs-grounded until future adapter evidence exists."
+      ]
+    }
+  ]
+}