From c60678728a2f8fcd6282bb96f31269c00f814f67 Mon Sep 17 00:00:00 2001
From: Yvette Carlisle
Date: Tue, 9 Jun 2026 21:17:58 +0800
Subject: [PATCH] {"schema":"decodex/commit/1","summary":"Refresh external memory benchmark dimension map","authority":"XY-841"}

---
 README.md                                     |   3 +-
 .../research/comparison_external_projects.md  |  87 ++++++++++-
 .../research/research_projects_inventory.md   |  49 +++----
 ...-external-memory-benchmark-dimensions.json | 136 ++++++++++++++++++
 4 files changed, 249 insertions(+), 26 deletions(-)
 create mode 100644 docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json

diff --git a/README.md b/README.md
index 0fb0a90f..edc59038 100644
--- a/README.md
+++ b/README.md
@@ -203,8 +203,9 @@ Detailed comparison, mechanism-level analysis, and source map:
 - [Detailed External Comparison](docs/guide/research/comparison_external_projects.md)
 - [Research Projects Inventory](docs/guide/research/research_projects_inventory.md)
 - [Agent Memory Selection Research Run](docs/research/2026-06-08-agent-memory-selection.json)
+- [Real-World Benchmark Dimension Research Run](docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json)
 
-Latest external research refresh: June 8, 2026.
+Latest external research refresh: June 9, 2026.
 
 ## Documentation
 
diff --git a/docs/guide/research/comparison_external_projects.md b/docs/guide/research/comparison_external_projects.md
index 4594b8b2..54be2ba7 100644
--- a/docs/guide/research/comparison_external_projects.md
+++ b/docs/guide/research/comparison_external_projects.md
@@ -10,6 +10,8 @@ Scope note: This document is intentionally detailed and source-heavy. Keep `READ
 
 For a full list of reviewed and pending projects, see `docs/guide/research/research_projects_inventory.md`.
 For the June 2026 agentmemory and dreaming decision run, see `docs/research/2026-06-08-agent-memory-selection.json`.
+For the June 2026 real-world benchmark-dimension refresh, see
+`docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json`.
 
 Comparison focuses on shared capabilities, ELF distinctives, and objective trade-offs.
 These projects solve adjacent problems, but their primary storage units and default workflows differ.
@@ -32,6 +34,87 @@ Legend:
 Note: In this section, mem0 refers to the Mem0 ecosystem, including OpenMemory (an MCP memory server with a built-in UI).
 OpenViking is included as a newly reviewed project with mechanism-level analysis.
 
+## June 2026 Real-World Benchmark-Dimension Map
+
+Snapshot date for this subsection: June 9, 2026.
+
+This map translates the existing external-project research into benchmark dimensions
+for the real-world agent memory suite. It does not add new adapter pass/fail evidence.
+Use the evidence class before making claims:
+
+- `benchmark-grounded`: ELF's Docker benchmark has runnable adapter evidence for this
+  project and dimension. Read the exact report before quoting a pass/fail result.
+- `docs-grounded`: official docs or READMEs indicate a likely strength, but ELF has not
+  reproduced the behavior in the benchmark runner.
+- `watch`: the project remains D0 or otherwise pending; do not assign strength claims
+  until a deep dive or adapter run exists.
+
+Current benchmark-grounded scope is narrow. The June 9, 2026 all-project smoke run
+proved encoded same-corpus/lifecycle behavior only for the current adapters: ELF and qmd
+passed their encoded smoke checks; agentmemory passed same-corpus retrieval but failed
+or could not prove durable lifecycle behavior; memsearch, mem0, OpenViking, and
+claude-mem retained `incomplete`, wrong-result, or not-encoded states. All broader suite
+fit below is research guidance, not a benchmark result.
+
+Benchmark suite labels:
+
+| Suite | Real-world job shape |
+| ----- | -------------------- |
+| `rw.resume-evidence` | Resume a stalled agent task, recover the right prior decision, cite required evidence, and avoid negative traps. |
+| `rw.lifecycle-staleness` | Update, delete, expire, cold-start, and contradiction cases where stale facts must stop winning. |
+| `rw.operator-continuity` | Capture session observations, inspect memory state, and support day-to-day agent continuity with low friction. |
+| `rw.retrieval-debug` | Explain query expansion, hybrid retrieval, fusion, rerank, and wrong-result causes. |
+| `rw.context-trajectory` | Navigate multi-stage or hierarchical context before selecting final evidence. |
+| `rw.knowledge-synthesis` | Compile durable project/entity/concept pages from memory and keep them lintable or repairable. |
+| `rw.consolidation-review` | Run background consolidation while keeping derived output reviewable and evidence-linked. |
+| `rw.graph-temporal` | Track facts, entities, relations, validity windows, and current-versus-historical answers. |
+| `rw.core-archival` | Separate always-loaded operating memory from retrieval-only archival memory. |
+| `rw.replay-regression` | Replay, fork, or checkpoint agent state to debug memory-assisted work and regression failures. |
+| `rw.graph-navigation` | Use graph-compressed corpus structure to guide agents before raw retrieval or file inspection. |
+
+Project-to-suite map:
+
+| Project | Best-fit real-world suites | Why this project matters for that suite | Fair adapter evidence before claims | Evidence class and confidence | Current ELF position |
+| ------- | -------------------------- | -------------------------------------- | ---------------------------------- | ----------------------------- | -------------------- |
+| agentmemory | `rw.operator-continuity`, `rw.resume-evidence`, `rw.lifecycle-staleness` | Cross-agent hooks, MCP/REST packaging, viewer, lifecycle/consolidation claims, and coding-agent continuity focus make it the right reference for daily agent memory ergonomics. | Use durable upstream storage rather than the current in-memory mock; ingest realistic agent sessions through the public hook/API path; prove restart, update/supersede, delete, and viewer/trace readback. | Mixed: benchmark-grounded only for current same-corpus retrieval; current lifecycle evidence is a failure/blocker, while hooks/viewer/consolidation are docs-grounded. Confidence: medium for suite fit, low for durable adapter quality. | ELF is stronger on evidence-bound writes and source-of-truth discipline; agentmemory remains the reference for capture breadth and agent-continuity UX. |
+| qmd | `rw.retrieval-debug`, `rw.lifecycle-staleness`, `rw.resume-evidence` | Its local CLI, structured JSON query output, expansion modes, hybrid routing, weighted fusion, rerank, update, delete, and cold-start path make it the strongest local retrieval-debug baseline. | Run `qmd` over the real-world corpus, capture query JSON, then rewrite/delete corpus files and rerun update/embed/query in fresh processes. | Benchmark-grounded for current smoke retrieval/update/delete/cold-start pass; docs-grounded for deeper query planning ergonomics. Confidence: high for local adapter baseline. | ELF is not yet stronger on local CLI debug ergonomics; treat qmd as the retrieval-debug reference while keeping ELF's service/provenance model. |
+| claude-mem | `rw.operator-continuity`, `rw.resume-evidence`, `rw.retrieval-debug` | Progressive-disclosure search, auto-capture hooks, local viewer, and observation/timeline workflows are directly aligned with real agent resumption jobs. | Exercise a real local repository with hook-driven capture, then evaluate `search -> timeline -> observations` behavior after restart; do not rely on mocked storage. | Docs-grounded for progressive disclosure/viewer; current benchmark adapter evidence is incomplete/wrong-result and mostly not encoded for lifecycle. Confidence: medium for product reference, low for current adapter claims. | ELF has stronger provenance and service boundaries, but claude-mem remains a reference for operator workflow and progressive disclosure UX. |
+| mem0 / OpenMemory | `rw.lifecycle-staleness`, `rw.graph-temporal`, `rw.operator-continuity`, `rw.resume-evidence` | Entity-scoped memory, memory history, expiration, hosted/OSS surfaces, OpenMemory UI, and optional graph memory make it the broadest lifecycle and ecosystem comparison target. | Separate OSS local FastEmbed/Qdrant evidence from hosted Platform claims; prove add/update/delete/history, entity-scoped retrieval, expiration exclusion, OpenMemory UI readback, and optional graph context on the same corpus. | Docs-grounded for lifecycle/entity/graph/UI claims; current local adapter is incomplete/wrong-result for same-corpus retrieval and delete remains not encoded. Confidence: medium for suite fit, low for current adapter quality. | ELF is stronger on deterministic evidence-bound writes; mem0/OpenMemory is the reference for ecosystem reach, entity-scoped history, hosted option, and optional graph UX. |
+| memsearch | `rw.lifecycle-staleness`, `rw.retrieval-debug`, `rw.resume-evidence` | Markdown as canonical memory plus incremental/content-addressed reindexing is a useful model for source transparency and rebuildable derived indexes. | Index a real-world Markdown corpus, mutate/delete files, rerun index/search from fresh processes, and record Milvus mode so Lite/Server/Cloud behavior is not conflated. | Docs-grounded for architecture; current adapter is incomplete/invalid-result, so no pass/fail quality claim is allowed. Confidence: medium for design pattern, low for current adapter evidence. | ELF already owns source-of-truth plus rebuildable index at service level; memsearch remains a reference for simple local canonical-store ergonomics. |
+| OpenViking | `rw.context-trajectory`, `rw.resume-evidence`, `rw.retrieval-debug` | `viking://` context organization, intent analysis, hierarchical retrieval, staged find/search behavior, and session compression are relevant to multi-hop agent context jobs. | Pin or provide a Docker-compatible local embedding path, then evaluate `add_resource`/`find`/`search` over multi-stage jobs with stage output, hierarchy, and session memory evidence. | Docs-grounded for mechanism; current benchmark adapter is incomplete due local embedding install failure. Confidence: medium for architecture reference, low for runnable adapter quality. | ELF has first-class traces and evidence-bound notes, but OpenViking is the reference for hierarchical context trajectory and filesystem-like organization. |
+| llm-wiki | `rw.knowledge-synthesis`, `rw.resume-evidence` | Query/save/lint flows and topic-scoped wiki pages are a useful reference for turning retrieved memory into maintained project knowledge. | Run a corpus-to-wiki job, ask resume/decision questions, require page citations back to source memory, then mutate a stale source and prove lint/repair catches it. | Docs-grounded D1; no benchmark adapter evidence. Confidence: medium for derived-knowledge fit. | ELF is not yet stronger on derived knowledge pages; llm-wiki should inform rebuildable, evidence-cited dossiers rather than core storage. |
+| gbrain | `rw.knowledge-synthesis`, `rw.operator-continuity` | `compiled_truth`, timeline sections, backlinks, primary-home routing, and enrichment workflows model a living operational brain for project work. | Build or update pages from the real-world corpus, require current-truth plus timeline answers, and prove enrichment/backlink maintenance does not hide unsupported claims. | Docs-grounded D1; no benchmark adapter evidence. Confidence: medium for operator knowledge UX. | ELF should keep source notes authoritative; gbrain is a reference for presentation, enrichment, and maintenance loops. |
+| Always-On Memory Agent | `rw.consolidation-review`, `rw.operator-continuity` | The file/API/dashboard ingest loop and timer-based consolidation show how background memory formation becomes a user-visible product surface. | Run scheduled consolidation on a fixed corpus, record source rows and output insights, then score whether consolidation is reviewable, repeatable, and bounded against unsupported claims. | Docs-grounded D1; no benchmark adapter evidence. Confidence: medium for consolidation workflow reference. | ELF should borrow scheduling and operator controls while keeping deterministic writes and reviewable derived outputs. |
+| graphify | `rw.graph-navigation`, `rw.knowledge-synthesis`, `rw.resume-evidence` | Deterministic code extraction, LLM-assisted graph building, honesty tags, graph reports, and assistant hooks are strong references for graph-compressed navigation over large corpora. | Generate graph/report artifacts from the benchmark corpus, require answers to use graph structure plus source evidence, and prove rebuild behavior after corpus edits. | Docs-grounded D1; no benchmark adapter evidence. Confidence: medium for graph-navigation reference. | ELF is stronger as a memory service; graphify is the reference for rebuildable graph reports and pre-search guidance. |
+| Letta | `rw.core-archival`, `rw.operator-continuity` | Core memory blocks, archival memory, and shared/read-only memory blocks map directly to always-loaded operating context versus retrievable memory. | Build a multi-agent job where core blocks must be attached/detached/shared read-only, while archival memory is retrieved separately and audited. | Docs-grounded D1; no benchmark adapter evidence. Confidence: medium for memory-semantics reference. | ELF has scoped notes but not first-class core/archival block ergonomics; Letta is the reference dimension. |
+| LangGraph | `rw.replay-regression`, `rw.resume-evidence` | Thread checkpoints, durable execution, replay, fork, and time travel define a strong model for debugging agent-state and memory-regression behavior. | Run an agent job with memory reads across checkpoints, replay/fork the thread after a stale-memory failure, and verify side-effect boundaries. | Docs-grounded D1; no benchmark adapter evidence. Confidence: medium for replay workflow reference. | ELF traces are useful but do not replace full agent checkpoint replay; LangGraph is the reference for replay-regression jobs. |
+| Graphiti / Zep | `rw.graph-temporal`, `rw.resume-evidence` | Temporal entities, relations, fact triples, validity windows, and graph search directly target stale/contradictory factual memory. | Add fact triples with validity changes, query current and historical answers, and score invalidation/append behavior under contradiction traps. | Docs-grounded D1; no benchmark adapter evidence. Confidence: medium-high for temporal-graph dimension. | ELF graph-lite is not yet stronger on temporal graph validity; Graphiti/Zep is the reference dimension. |
+| nanograph | `rw.graph-temporal`, `rw.retrieval-debug` | Typed schema and typed query ergonomics are relevant to making ELF graph-lite interactions inspectable and hard to misuse. | Define typed graph schemas and queries for the same fact set, then score developer-visible validation, query shape, and explainability rather than retrieval quality alone. | Docs-grounded D1; no benchmark adapter evidence. Confidence: medium for DX reference, low for memory-system comparison. | ELF should borrow typed graph ergonomics without treating nanograph as a full memory backend. |
+
+Pending watch items remain D0. Keep them out of benchmark strength claims until current
+evidence is gathered:
+
+| Watch item | Candidate suite if promoted | Minimum evidence needed before adapter or quality claims |
+| ---------- | --------------------------- | ------------------------------------------------------- |
+| RAGFlow | `rw.resume-evidence`, `rw.graph-navigation`, `rw.retrieval-debug` | D1/D2 deep dive on deployability, corpus ingestion, graph/RAG retrieval path, API/CLI outputs, and Docker resource envelope. |
+| LightRAG | `rw.graph-navigation`, `rw.graph-temporal`, `rw.retrieval-debug` | D1/D2 deep dive on graph extraction/update semantics, local persistence, query output, and whether stale/corrected facts can be tested fairly. |
+| GraphRAG | `rw.graph-navigation`, `rw.knowledge-synthesis`, `rw.retrieval-debug` | D1/D2 deep dive on indexing cost, graph summaries, update/rebuild behavior, source citation guarantees, and task-level output inspectability. |
+
+## Where ELF Is Not Yet The Reference
+
+| Benchmark dimension | Current reference project(s) | ELF gap to test before claiming strength |
+| ------------------- | ---------------------------- | ---------------------------------------- |
+| Local retrieval debugging and CLI transparency | qmd | ELF needs equally fast local knobs/readback for expansion, hybrid fusion, rerank, and wrong-result diagnosis. |
+| Turn-by-turn agent capture and daily continuity | agentmemory, claude-mem, OpenMemory | ELF has service and viewer surfaces, but not the same turnkey hook breadth or session-continuity product ergonomics. |
+| Progressive disclosure UX | claude-mem, OpenViking | ELF has L0/L1/L2 shaping and traces, but the operator workflow still needs better search-session navigation. |
+| Entity-scoped history and managed ecosystem reach | mem0/OpenMemory | ELF has ingest decisions and versions, but not the same hosted option, SDK reach, or first-class memory history surface. |
+| Core memory versus archival memory | Letta | ELF scopes notes well, but lacks attachable/read-only core memory blocks as a distinct user-facing layer. |
+| Temporal graph validity | Graphiti/Zep | ELF graph-lite persists relation context, but temporal invalidation/current-vs-historical graph behavior is not the reference yet. |
+| Agent replay and forkable regression debugging | LangGraph | ELF traces are replay evidence for retrieval, not full persisted agent-state replay with side-effect boundaries. |
+| Derived knowledge pages and lint/repair loops | llm-wiki, gbrain | ELF does not yet ship rebuildable entity/project pages with unsupported-claim lint as a first-class workflow. |
+| Scheduled consolidation as a product surface | Always-On Memory Agent | ELF's target should be reviewable derived consolidation, but the scheduling/operator-control workflow is not implemented. |
+| Graph-compressed navigation over large corpora | graphify, GraphRAG/LightRAG watch items | ELF relation context is bounded and evidence-linked, but broader graph report/navigation workflows remain future work. |
+
 ## June 2026 Agentmemory And Dreaming Refresh
 
 Snapshot date for this subsection: June 8, 2026.
@@ -276,7 +359,9 @@ Key takeaways for ELF from this deeper pass:
 
 ## Where ELF Is Currently Weaker (Objective Gaps)
 
-- No built-in web UI viewer yet (claude-mem and OpenMemory provide this today).
+- ELF now has a local admin viewer and retrieval observability surfaces, but
+  claude-mem, OpenMemory, and agentmemory remain stronger references for turnkey
+  memory-inspection and session-continuity ergonomics.
 - No hosted/cloud product option (mem0 provides managed deployment).
 - Graph support is currently graph-lite (`POST /v2/graph/query`) and does not yet include multi-hop/global graph reasoning patterns used by GraphRAG-focused projects.
 - Less turnkey for zero-config local plugin workflows than memsearch/claude-mem defaults.
diff --git a/docs/guide/research/research_projects_inventory.md b/docs/guide/research/research_projects_inventory.md
index 6cf50e62..c84ddab6 100644
--- a/docs/guide/research/research_projects_inventory.md
+++ b/docs/guide/research/research_projects_inventory.md
@@ -6,7 +6,7 @@ Inputs: Existing research notes, open architecture questions, and tracked adopti
 Depends on: `docs/guide/research/comparison_external_projects.md`.
 Outputs: A current inventory of reviewed and pending external projects.
 
-Last updated: June 8, 2026.
+Last updated: June 9, 2026.
 
 ## Legend
 
@@ -16,28 +16,28 @@ Last updated: June 8, 2026.
 
 ## Inventory
 
-| Project | Research depth | Current status | Why it matters to ELF | Primary reference |
-| ------- | -------------- | -------------- | --------------------- | ----------------- |
-| [agentmemory](https://github.com/rohitg00/agentmemory) | D1 | Reviewed | Cross-agent coding-memory hooks, MCP/REST surface, viewer, consolidation lifecycle, and external benchmark target | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-08-agent-memory-selection.json` |
-| [OpenAI ChatGPT Memory Dreaming](https://openai.com/index/chatgpt-memory-dreaming/) | D1 | Reviewed | Background memory synthesis and staleness repair as a product direction | `docs/research/2026-06-08-agent-memory-selection.json` |
-| [Claude Managed Agents Dreams](https://platform.claude.com/docs/en/managed-agents/dreams) | D1 | Reviewed | Reviewable derived memory-store output over past sessions; strong safety shape for ELF consolidation | `docs/research/2026-06-08-agent-memory-selection.json` |
-| [Gemini CLI Auto Memory](https://github.com/google-gemini/gemini-cli/blob/main/docs/cli/auto-memory.md) | D1 | Reviewed | Background session mining with project-local review inbox for memory patches and skills | `docs/research/2026-06-08-agent-memory-selection.json` |
-| [mem0](https://github.com/mem0ai/mem0) | D2 | Reviewed | Graph memory as additive context, memory history and async mode trade-offs | `docs/guide/research/comparison_external_projects.md` |
-| [memsearch](https://github.com/zilliztech/memsearch) | D2 | Reviewed | Markdown-first SoT + rebuildable index pattern | `docs/guide/research/comparison_external_projects.md` |
-| [qmd](https://github.com/tobi/qmd) | D2 | Reviewed | Retrieval routing, weighted fusion, and local-first explainability | `docs/guide/research/comparison_external_projects.md` |
-| [claude-mem](https://github.com/thedotmack/claude-mem) | D2 | Reviewed | Progressive disclosure and strong operator workflow | `docs/guide/research/comparison_external_projects.md` |
-| [OpenViking](https://github.com/volcengine/OpenViking) | D2 | Reviewed | Filesystem context paradigm, hierarchical retrieval, trajectory observability | `docs/guide/research/comparison_external_projects.md` |
-| [llm-wiki](https://github.com/nvk/llm-wiki) | D1 | Reviewed | LLM-maintained wiki pattern, topic-scoped knowledge bases, query-save and lint workflows | `docs/guide/research/comparison_external_projects.md` |
-| [gbrain](https://github.com/garrytan/gbrain) | D1 | Reviewed | Operational knowledge brain, `compiled_truth` + timeline pages, enrichment and maintenance loops | `docs/guide/research/comparison_external_projects.md` |
-| [Always-On Memory Agent](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/gemini/agents/always-on-memory-agent) | D1 | Reviewed | Always-on multimodal ingest + scheduled consolidation loop with simple local ops surface | `docs/guide/research/comparison_external_projects.md` |
-| [graphify](https://github.com/safishamsi/graphify) | D1 | Reviewed | Multimodal graph compression, deterministic code extraction, and always-on graph-guided assistant workflow | `docs/guide/research/comparison_external_projects.md` |
-| [Letta](https://github.com/letta-ai/letta) | D1 | Reviewed | Core vs archival memory split, shared blocks | `docs/guide/research/comparison_external_projects.md` |
-| [LangGraph](https://docs.langchain.com/oss/python/langgraph/persistence) | D1 | Reviewed | Checkpoint/replay mindset for quality regression workflows | `docs/guide/research/comparison_external_projects.md` |
-| [Graphiti / Zep](https://help.getzep.com/graphiti/core-concepts/temporal-awareness) | D1 | Reviewed | Temporal fact validity model for graph-like memory evolution | `docs/guide/research/comparison_external_projects.md` |
-| [nanograph](https://github.com/aaltshuler/nanograph) | D1 | Reviewed | Typed schema + typed query ergonomics for graph-lite developer experience | `docs/guide/research/comparison_external_projects.md` |
-| [RAGFlow](https://github.com/infiniflow/ragflow) | D0 | Pending deep dive | Potential framework integration discussion; not yet audited to adoption level | Discussion history only |
-| [LightRAG](https://github.com/HKUDS/LightRAG) | D0 | Pending deep dive | Graph-augmented RAG strategy relevance; not yet audited to adoption level | Discussion history only |
-| [GraphRAG](https://www.microsoft.com/en-us/research/project/graphrag/) | D0 | Pending deep dive | Graph-based retrieval concepts; not yet audited to implementation decision level | Discussion history only |
+| Project | Research depth | Current status | Benchmark dimension role | Why it matters to ELF | Primary reference |
+| ------- | -------------- | -------------- | ------------------------ | --------------------- | ----------------- |
+| [agentmemory](https://github.com/rohitg00/agentmemory) | D1 | Reviewed | `rw.operator-continuity`, `rw.resume-evidence`, `rw.lifecycle-staleness` | Cross-agent coding-memory hooks, MCP/REST surface, viewer, consolidation lifecycle, and external benchmark target | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-08-agent-memory-selection.json`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [OpenAI ChatGPT Memory Dreaming](https://openai.com/index/chatgpt-memory-dreaming/) | D1 | Reviewed | `rw.consolidation-review` | Background memory synthesis and staleness repair as a product direction | `docs/research/2026-06-08-agent-memory-selection.json` |
+| [Claude Managed Agents Dreams](https://platform.claude.com/docs/en/managed-agents/dreams) | D1 | Reviewed | `rw.consolidation-review` | Reviewable derived memory-store output over past sessions; strong safety shape for ELF consolidation | `docs/research/2026-06-08-agent-memory-selection.json` |
+| [Gemini CLI Auto Memory](https://github.com/google-gemini/gemini-cli/blob/main/docs/cli/auto-memory.md) | D1 | Reviewed | `rw.consolidation-review`, `rw.operator-continuity` | Background session mining with project-local review inbox for memory patches and skills | `docs/research/2026-06-08-agent-memory-selection.json` |
+| [mem0](https://github.com/mem0ai/mem0) | D2 | Reviewed | `rw.lifecycle-staleness`, `rw.graph-temporal`, `rw.operator-continuity` | Graph memory as additive context, memory history and async mode trade-offs | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [memsearch](https://github.com/zilliztech/memsearch) | D2 | Reviewed | `rw.lifecycle-staleness`, `rw.retrieval-debug`, `rw.resume-evidence` | Markdown-first SoT + rebuildable index pattern | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [qmd](https://github.com/tobi/qmd) | D2 | Reviewed | `rw.retrieval-debug`, `rw.lifecycle-staleness`, `rw.resume-evidence` | Retrieval routing, weighted fusion, and local-first explainability | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [claude-mem](https://github.com/thedotmack/claude-mem) | D2 | Reviewed | `rw.operator-continuity`, `rw.resume-evidence`, `rw.retrieval-debug` | Progressive disclosure and strong operator workflow | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [OpenViking](https://github.com/volcengine/OpenViking) | D2 | Reviewed | `rw.context-trajectory`, `rw.resume-evidence`, `rw.retrieval-debug` | Filesystem context paradigm, hierarchical retrieval, trajectory observability | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [llm-wiki](https://github.com/nvk/llm-wiki) | D1 | Reviewed | `rw.knowledge-synthesis`, `rw.resume-evidence` | LLM-maintained wiki pattern, topic-scoped knowledge bases, query-save and lint workflows | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [gbrain](https://github.com/garrytan/gbrain) | D1 | Reviewed | `rw.knowledge-synthesis`, `rw.operator-continuity` | Operational knowledge brain, `compiled_truth` + timeline pages, enrichment and maintenance loops | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [Always-On Memory Agent](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/gemini/agents/always-on-memory-agent) | D1 | Reviewed | `rw.consolidation-review`, `rw.operator-continuity` | Always-on multimodal ingest + scheduled consolidation loop with simple local ops surface | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [graphify](https://github.com/safishamsi/graphify) | D1 | Reviewed | `rw.graph-navigation`, `rw.knowledge-synthesis`, `rw.resume-evidence` | Multimodal graph compression, deterministic code extraction, and always-on graph-guided assistant workflow | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [Letta](https://github.com/letta-ai/letta) | D1 | Reviewed | `rw.core-archival`, `rw.operator-continuity` | Core vs archival memory split, shared blocks | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [LangGraph](https://docs.langchain.com/oss/python/langgraph/persistence) | D1 | Reviewed | `rw.replay-regression`, `rw.resume-evidence` | Checkpoint/replay mindset for quality regression workflows | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [Graphiti / Zep](https://help.getzep.com/graphiti/core-concepts/temporal-awareness) | D1 | Reviewed | `rw.graph-temporal`, `rw.resume-evidence` | Temporal fact validity model for graph-like memory evolution | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [nanograph](https://github.com/aaltshuler/nanograph) | D1 | Reviewed | `rw.graph-temporal`, `rw.retrieval-debug` | Typed schema + typed query ergonomics for graph-lite developer experience | `docs/guide/research/comparison_external_projects.md`; `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json` |
+| [RAGFlow](https://github.com/infiniflow/ragflow) | D0 | Watch item; pending deep dive | Candidate `rw.resume-evidence`, `rw.graph-navigation`, `rw.retrieval-debug`; no strength claim | Potential framework integration discussion; not yet audited to adoption level | Discussion history only; see watch-item evidence requirements in `docs/guide/research/comparison_external_projects.md` |
+| [LightRAG](https://github.com/HKUDS/LightRAG) | D0 | Watch item; pending deep dive | Candidate `rw.graph-navigation`, `rw.graph-temporal`, `rw.retrieval-debug`; no strength claim | Graph-augmented RAG strategy relevance; not yet audited to adoption level | Discussion history only; see watch-item evidence requirements in `docs/guide/research/comparison_external_projects.md` |
+| [GraphRAG](https://www.microsoft.com/en-us/research/project/graphrag/) | D0 | Watch item; pending deep dive | Candidate `rw.graph-navigation`, `rw.knowledge-synthesis`, `rw.retrieval-debug`; no strength claim | Graph-based retrieval concepts; not yet audited to implementation decision level | Discussion history only; see watch-item evidence requirements in `docs/guide/research/comparison_external_projects.md` |
 
 ## June 2026 Activity Snapshot
 
@@ -70,8 +70,9 @@ replacing ELF's evidence-bound service contract.
   - [XY-40](https://linear.app/hack-ink/issue/XY-40/vision-track-elf-as-a-high-trust-memory-system-for-singlemulti-agent)
   - [XY-51](https://linear.app/hack-ink/issue/XY-51/agent-memory-ux-mcp-surface-skills-doc-pointers-epic)
   - [XY-63](https://linear.app/hack-ink/issue/XY-63/research-openviking-as-optional-doc-backend-integration-sketch)
-- Current June 2026 research run:
+- Current June 2026 research runs:
   - `docs/research/2026-06-08-agent-memory-selection.json`
+  - `docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json`
 
 ## Notes
 
diff --git a/docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json b/docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json
new file mode 100644
index 00000000..198df1af
--- /dev/null
+++ b/docs/research/2026-06-09-xy-841-external-memory-benchmark-dimensions.json
@@ -0,0 +1,136 @@
+{
+  "schema": "research-run/2",
+  "run_id": "2026-06-09-xy-841-external-memory-benchmark-dimensions",
+  "question": "How should ELF map reviewed external memory projects to real-world benchmark dimensions without overstating docs-only evidence as benchmark proof?",
+  "success_criteria": [
+    "Map every reviewed external project in the issue scope to one or more real-world benchmark suites.",
+    "Separate benchmark-grounded adapter evidence from docs-grounded research claims.",
+    "Identify dimensions where ELF should not be treated as the reference yet.",
+    "Keep pending D0 projects as watch items unless current evidence is gathered in scope."
+  ],
+  "constraints": [
+    "Do not implement benchmark adapters or change ELF runtime behavior.",
+    "Do not make benchmark pass/fail claims without runnable evidence from checked-in reports.",
+    "Use existing reviewed docs and benchmark reports as the authority for this docs-only refresh."
+ ], + "stop_rule": "Stop once the comparison and inventory can route future real_world_job benchmark design without implying unproven external quality claims.", + "primary_hypothesis": "The capability map should treat qmd, claude-mem, agentmemory, mem0/OpenMemory, OpenViking, memsearch, llm-wiki, gbrain, Always-On Memory Agent, graphify, Letta, LangGraph, Graphiti/Zep, and nanograph as dimension references only where docs or benchmark evidence supports the fit; D0 RAG projects should remain watch items.", + "rival_hypotheses": [ + "Use the current smoke benchmark status alone to rank external projects.", + "Treat official external README claims as sufficient benchmark-quality evidence.", + "Drop pending RAGFlow, LightRAG, and GraphRAG from the map until adapters exist." + ], + "falsifiers": [ + "If a current runnable adapter report exists for a broader dimension, docs-only confidence would be too conservative.", + "If a listed project lacks any documented mechanism matching the assigned suite, the suite map would overstate its reference role.", + "If D0 watch items are assigned strengths, the map would violate the no-current-evidence boundary." + ], + "coverage": { + "mode": "repo_docs_and_existing_external_research", + "min_source_families": 3 + }, + "events": [ + { + "seq": 1, + "type": "probe_completed", + "remaining_option_count": 3, + "independent_option_questions": [ + "Which benchmark dimensions are already proven by ELF's checked-in adapter evidence?", + "Which projects should be treated as docs-grounded references for unencoded dimensions?", + "Which pending projects must stay as watch items?" 
+ ], + "external_slices": [] + }, + { + "seq": 2, + "type": "evidence_recorded", + "evidence": [ + { + "id": "E1", + "kind": "observation", + "summary": "README states that the June 9 Docker live baseline and production adoption gate prove a bounded ELF production-provider path, while the all-project smoke has ELF and qmd passing encoded checks and other external projects retaining typed failure or incomplete states.", + "source_family": "repo_docs", + "source_locator": "README.md" + }, + { + "id": "E2", + "kind": "observation", + "summary": "The production adoption gate explicitly bounds external comparison as an objective adapter matrix, not an overall superiority claim, and records qmd pass, agentmemory lifecycle_fail, and memsearch/mem0/OpenViking/claude-mem incomplete or wrong-result states.", + "source_family": "benchmark_report", + "source_locator": "docs/guide/benchmarking/2026-06-09-production-adoption-gate-report.md" + }, + { + "id": "E3", + "kind": "observation", + "summary": "The live baseline runbook defines pass, wrong_result, lifecycle_fail, incomplete, blocked, and not_encoded semantics, and warns that incomplete, blocked, and not_encoded are not passes.", + "source_family": "repo_runbook", + "source_locator": "docs/guide/benchmarking/live_baseline_benchmark.md" + }, + { + "id": "E4", + "kind": "observation", + "summary": "The existing comparison contains D1/D2 docs-grounded mechanism research for agentmemory, qmd, claude-mem, mem0/OpenMemory, memsearch, OpenViking, llm-wiki, gbrain, Always-On Memory Agent, graphify, Letta, LangGraph, Graphiti/Zep, and nanograph.", + "source_family": "repo_research_docs", + "source_locator": "docs/guide/research/comparison_external_projects.md" + }, + { + "id": "E5", + "kind": "observation", + "summary": "The inventory marks RAGFlow, LightRAG, and GraphRAG as D0 pending deep dives, so they can only be watch items in this lane.", + "source_family": "repo_research_docs", + "source_locator": 
"docs/guide/research/research_projects_inventory.md" + } + ] + }, + { + "seq": 3, + "type": "tradeoffs_recorded", + "tradeoffs": [ + { + "id": "T1", + "summary": "Using only current smoke results would hide useful future benchmark dimensions such as operator continuity, temporal graph validity, core/archival memory, and knowledge synthesis.", + "supporting_evidence_ids": [ + "E2", + "E4" + ], + "disconfirming_evidence_ids": [] + }, + { + "id": "T2", + "summary": "Using docs-grounded references without labels would overstate external project quality because the benchmark runner has not reproduced most broader claims.", + "supporting_evidence_ids": [ + "E2", + "E3" + ], + "disconfirming_evidence_ids": [] + }, + { + "id": "T3", + "summary": "Keeping D0 RAG projects as watch items preserves future coverage without pretending that adapter feasibility, resource envelope, or evidence quality has been audited.", + "supporting_evidence_ids": [ + "E3", + "E5" + ], + "disconfirming_evidence_ids": [] + } + ] + }, + { + "seq": 4, + "type": "challenge_recorded", + "summary": "The main risk is that a broad suite map could read like a quality ranking. The mitigation is to label evidence class per project, repeat that only current adapter reports can support pass/fail claims, and call out ELF gaps by reference dimension instead of claiming overall superiority.", + "resolved": true + }, + { + "seq": 5, + "type": "finalized_decision_ready", + "confidence": "medium", + "decision": "Update the comparison and inventory with a real-world benchmark-dimension map. Treat qmd, claude-mem, agentmemory, mem0/OpenMemory, memsearch, OpenViking, llm-wiki, gbrain, Always-On Memory Agent, graphify, Letta, LangGraph, Graphiti/Zep, and nanograph as reference projects for specific dimensions, but separate benchmark-grounded evidence from docs-grounded suite fit. 
Keep RAGFlow, LightRAG, and GraphRAG as D0 watch items.", + "missing_evidence": [ + "No new upstream source refresh was performed in this lane.", + "No new benchmark adapter or real_world_job suite was executed.", + "Most non-smoke dimensions remain docs-grounded until future adapter evidence exists." + ] + } + ] +}
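
Reviewer sketch (not part of the patch): the claim-gating rule that evidence E3 and the evidence-class labels describe could be checked mechanically along these lines. The status and evidence-class strings come from the text above; the names `AdapterStatus`, `may_quote_pass`, and `may_claim_strength` are hypothetical and do not refer to any existing ELF code.

```python
from enum import Enum
from typing import Optional


class AdapterStatus(str, Enum):
    """Typed adapter states named in the live baseline runbook summary (E3)."""
    PASS = "pass"
    WRONG_RESULT = "wrong_result"
    LIFECYCLE_FAIL = "lifecycle_fail"
    INCOMPLETE = "incomplete"
    BLOCKED = "blocked"
    NOT_ENCODED = "not_encoded"


# Evidence classes used by the benchmark-dimension map.
EVIDENCE_CLASSES = ("benchmark-grounded", "docs-grounded", "watch")


def _check_class(evidence_class: str) -> None:
    if evidence_class not in EVIDENCE_CLASSES:
        raise ValueError(f"unknown evidence class: {evidence_class!r}")


def may_quote_pass(evidence_class: str, status: Optional[str]) -> bool:
    """A pass may be quoted only for benchmark-grounded evidence whose
    recorded adapter status is exactly `pass`. Every other state, including
    incomplete, blocked, and not_encoded, is not a pass."""
    _check_class(evidence_class)
    if evidence_class != "benchmark-grounded" or status is None:
        return False
    # AdapterStatus(...) raises ValueError for an unrecognized status string.
    return AdapterStatus(status) is AdapterStatus.PASS


def may_claim_strength(evidence_class: str) -> bool:
    """Watch items (D0 or otherwise pending) get no strength claim."""
    _check_class(evidence_class)
    return evidence_class != "watch"


if __name__ == "__main__":
    # Statuses as summarized in E2: qmd pass, agentmemory lifecycle_fail.
    print(may_quote_pass("benchmark-grounded", "pass"))            # qmd
    print(may_quote_pass("benchmark-grounded", "lifecycle_fail"))  # agentmemory
    print(may_claim_strength("watch"))                             # D0 RAG projects
```

The two checks are kept separate on purpose: a docs-grounded project can still be listed as a likely reference for a dimension, but only a checked-in adapter report with a `pass` status supports a pass/fail statement.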