An open ecosystem for tool-agnostic AI agent memory.
Download · Report Bug · Specification
- Cognitive CLI: commands named after memory processes (`memorize`, `remember`, `note`, `learn`, `consolidate`, `reflect`, `forget`)
- Two-level memory: project (`~/.mnemonist/{project}/`) and global (`~/.mnemonist/global/`)
- Working memory inbox: capacity-limited staging area (default 7 items) with attention scoring; items are promoted to long-term memory via `consolidate`
- Memory metadata: strength, access count, last accessed, source tracking; Hebbian reinforcement on retrieval
- Plain markdown with YAML frontmatter: human-readable, git-friendly
- Typed memories: user, feedback, project, reference
- Local embedding: `candle` crate with `all-MiniLM-L6-v2` (384-dim, CPU/CUDA); no external server needed; model downloads from HuggingFace Hub on first use
- Layered graph: three HNSW layers, namely code (`.code-index.hnsw`), project memory (`.memory-index.hnsw`), and global memory; inter-layer edges via the `refs` frontmatter field
- Pluggable code chunking: `ChunkingStrategy` trait with built-in `ParagraphChunking` (blank-line boundaries) and `FixedLineChunking` (sliding window with overlap); no tree-sitter dependency
- Cross-layer recall: `remember` searches memory and code indices in parallel with blended relevance scoring (semantic + temporal); follows `refs` edges to surface referenced code chunks
- Consolidation: `consolidate` promotes inbox items, decays stale memories, and re-embeds
- Fuzzy forget: `forget` resolves partial and suffix matches so you don't need the full filename
- Embedding quality metrics: `learn` reports anisotropy and similarity_range after indexing
- TurboQuant: vector quantization (1-4 bit) available as a research/eval module (see the storage-footprint benchmark); not yet wired into `learn`/`remember`, which store full f32 embeddings
- JSON-first: stdout for structured JSON, stderr for UX; pipe-friendly
- Works with Claude Code, Codex, Gemini, Copilot, Cursor, or any AI tool
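To make the two built-in chunking strategies concrete, here is a minimal Python sketch of blank-line and sliding-window chunking. It is an illustration only: the function names, the `window`/`overlap` parameters and their defaults are assumptions, and it does not reproduce the Rust `ChunkingStrategy` trait or its actual behavior at edge cases.

```python
def paragraph_chunks(text: str) -> list[str]:
    """Split text into chunks at blank-line boundaries (hypothetical helper)."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the chunk collected so far.
            chunks.append("\n".join(current))
            current = []
    if current:
        chunks.append("\n".join(current))
    return chunks


def fixed_line_chunks(text: str, window: int = 4, overlap: int = 1) -> list[str]:
    """Sliding window of `window` lines; neighbours share `overlap` lines."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    lines = text.splitlines()
    step = window - overlap
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + window]))
        if start + window >= len(lines):
            break  # the last window already reached the end of the file
    return chunks
```

Paragraph chunking follows the author's own layout of the file, while the fixed window bounds chunk size regardless of layout; the overlap keeps context that would otherwise be cut at a window edge.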
curl -fsSL https://raw.githubusercontent.com/urmzd/mnemonist/main/install.sh | sh

Pre-built binaries run on the CPU with pure-Rust matmuls: functional, but large batch operations like `learn` are slower than in accelerated builds. If you have a Rust toolchain, you can build from source with hardware acceleration:
# macOS — Apple's Accelerate BLAS (~2x faster embedding throughput)
cargo install mnemonist --features accelerate
# Linux/Windows with an NVIDIA GPU
cargo install mnemonist --features cuda

# 1. Install
curl -fsSL https://raw.githubusercontent.com/urmzd/mnemonist/main/install.sh | sh
# Or, if you have a Rust toolchain: cargo install mnemonist
# 2. Ingest the codebase — auto-creates ~/.mnemonist/{project}/ and embeds source files
mnemonist learn .
# 3. Memorize long-term knowledge
mnemonist memorize "prefer Rust for CLI tools" -t feedback
mnemonist memorize "deep Go expertise, new to React" -t user
# 4. Jot quick notes into the working memory inbox
mnemonist note "look into async runtime choices"
mnemonist note "check Linear project INGEST for pipeline bugs"
# 5. Consolidate — promote inbox to long-term memory, decay stale items, re-embed
mnemonist consolidate
# 6. Recall — semantic + text search across memories and code
mnemonist remember "rust async patterns"
# 7. Review everything
mnemonist reflect --all
# 8. Forget something you no longer need (fuzzy name matching)
mnemonist forget prefer-rust

| Level | Location | Scope |
|---|---|---|
| Project | `~/.mnemonist/{project}/` | Per-repo corrections, decisions |
| Global | `~/.mnemonist/global/` | Cross-project preferences, expertise |
Project memory takes precedence over global when they conflict.
| Type | When | Example |
|---|---|---|
| `user` | Expertise, preferences | "Deep Rust knowledge, new to React" |
| `feedback` | Corrections, validated approaches | "Never mock the database in tests" |
| `project` | Repo-specific context (project-level only) | "Auth rewrite driven by compliance" |
| `reference` | External resource pointers | "Bugs tracked in Linear project INGEST" |
| Command | Description |
|---|---|
| `mnemonist memorize "<point>" [-t type] [-n name]` | Deliberately encode a point into long-term memory (auto-embeds) |
| `mnemonist note "<point>"` | Jot a quick note into working memory inbox |
| `mnemonist remember "<ask>" [--budget N] [--level both]` | Recall memories by cue: searches memory and code indices in parallel with blended relevance scoring, follows refs |
| `mnemonist learn [path] [--attend glob] [--capacity N]` | Ingest a codebase; chunks files with `ParagraphChunking`, embeds into `.code-index.hnsw`, reports quality metrics. `--attend` only indexes files matching this glob |
| `mnemonist consolidate [--dry-run]` | Promote inbox items, decay stale memories, re-embed into `.memory-index.hnsw` |
| `mnemonist reflect [--all] [--global]` | Introspect: review memories and inbox contents |
| `mnemonist forget <file>` | Deliberately forget a memory (supports fuzzy/suffix name matching) |
| `mnemonist config init` | Create default config file |
| `mnemonist config show` | Show current configuration |
| `mnemonist config get <key>` | Get a config value (dot-notation) |
| `mnemonist config set <key> <value>` | Set a config value |
| `mnemonist config path` | Print config file path |
All commands output JSON to stdout (`{"ok": true, "data": {...}}`). stdout is always JSON; `--format json` (default) is compact, `--format human` pretty-prints it.
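Because stdout always carries the `ok`/`data` envelope, a caller can unwrap it in a few lines. The sketch below parses a captured stdout string; the fields inside `data` (`results`, `name`, `score`) are made up for illustration, and the shape of a failure envelope is not described here, so the error branch simply surfaces whatever was returned.

```python
import json


def unwrap(stdout: str) -> dict:
    """Return the `data` payload of a mnemonist-style JSON envelope."""
    envelope = json.loads(stdout)
    if not envelope.get("ok"):
        # Failure shape is an assumption: report the raw envelope.
        raise RuntimeError(f"command failed: {envelope}")
    return envelope["data"]


# Hypothetical payload; only the outer {"ok", "data"} shape comes from the README.
sample = '{"ok": true, "data": {"results": [{"name": "prefer-rust", "score": 0.91}]}}'
data = unwrap(sample)
```

In a real pipeline the string would come from the command's stdout (for example via `subprocess.run(..., capture_output=True)`), with stderr left for human-facing messages.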
The inbox is a capacity-limited staging area modeled after human working memory (default capacity: 7). Items enter via note (manual) or learn (code ingestion) and are scored by attention:
- Items are sorted by attention score; lowest-scored items are evicted at capacity
- `consolidate` promotes inbox items to long-term memory and clears the inbox
- Stored in `.inbox.json` alongside memory files
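The eviction behavior described above can be sketched as a small bounded container. This is a simplified model, not the crate's implementation: the `Inbox` class, its method names, and the tie-breaking rule are assumptions, and the attention score is passed in rather than computed because the scoring formula is not given here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Inbox:
    """Capacity-limited staging area; the lowest attention score is evicted first."""
    capacity: int = 7
    items: list = field(default_factory=list)  # (attention, text) pairs

    def push(self, text: str, attention: float) -> Optional[str]:
        """Add an item and return the evicted text, if any."""
        self.items.append((attention, text))
        # Highest attention first; the tail holds the eviction candidate.
        self.items.sort(key=lambda item: item[0], reverse=True)
        if len(self.items) > self.capacity:
            return self.items.pop()[1]
        return None

    def drain(self) -> list:
        """Hand every item over (as consolidation would) and clear the inbox."""
        drained = [text for _, text in self.items]
        self.items.clear()
        return drained
```

Note that under this model a new note scoring below everything already held is itself the one evicted.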
mnemonist consolidate runs a sleep-like consolidation cycle:
- Promote: inbox items become long-term memories with type and strength
- Decay: memories not accessed within `consolidation.decay_days` (default 90) and below `protected_access_count` (default 5) are pruned
- Re-embed: all surviving memories are re-embedded for fresh semantic search
Use --dry-run to preview what would change.
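The decay rule combines two conditions: a memory must be both stale and insufficiently accessed to be pruned. A minimal sketch of that predicate follows; the function name is hypothetical and the boundary handling (strictly older than `decay_days`, protected at exactly `protected_access_count` accesses) is an assumption the text does not settle.

```python
from datetime import datetime, timedelta, timezone


def should_prune(
    last_accessed: datetime,
    access_count: int,
    now: datetime,
    decay_days: int = 90,
    protected_access_count: int = 5,
) -> bool:
    """True when a memory is stale AND has too few accesses to be protected."""
    stale = now - last_accessed > timedelta(days=decay_days)
    protected = access_count >= protected_access_count
    return stale and not protected


now = datetime(2026, 6, 9, tzinfo=timezone.utc)
old = now - timedelta(days=120)
recent = now - timedelta(days=10)
```

With these values, `should_prune(old, 1, now)` is true, while a frequently used memory of the same age (`should_prune(old, 8, now)`) and a recently touched one (`should_prune(recent, 0, now)`) both survive.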
Each memory file tracks cognitive metadata in its frontmatter:
| Field | Description |
|---|---|
| `strength` | Consolidation strength (increases on survival) |
| `access_count` | Retrieval count (Hebbian reinforcement) |
| `last_accessed` | ISO 8601 timestamp of last retrieval |
| `created_at` | When the memory was first created |
| `source` | How it was created: `memorize`, `note`, `learn`, `consolidation` |
| `consolidated_from` | Original files if created via merge |
| `refs` | Inter-layer edges: code chunk IDs or memory filenames this memory links to |
Layered config: ~/.mnemonist/mnemonist.toml (global default, created with mnemonist config init) + ./mnemonist.toml at the project root (per-project overrides; missing fields inherit).
[storage]
root = "~/.mnemonist"
[embedding]
provider = "candle"
model = "sentence-transformers/all-MiniLM-L6-v2"
[recall]
budget = 2000
expand_refs = true
max_ref_expansions = 3
min_results = 2
[index]
max_lines = 200
[code]
# Trimmed here; `mnemonist config show` prints the full default list.
exclude_patterns = ["dist", "node_modules", "target", "package-lock", ".min.js"]
[consolidation]
decay_days = 90
merge_threshold = 0.85
protected_access_count = 5
max_memory_tokens = 120
[inbox]
capacity = 7
[output]
quiet = false

Use `mnemonist config set recall.budget 3000` to change values. Every key shown above is read by the CLI; a test enforces that no dead keys are accepted.
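The layering rule (project file overrides the global file, missing fields inherit) amounts to a recursive dictionary merge, and dot-notation keys are a path walk over the result. The sketch below works on already-parsed tables; `merge` and `get` are illustrative helpers, not the CLI's internals, and only a few keys from the config above are shown.

```python
def merge(base: dict, override: dict) -> dict:
    """Return base overlaid with override, recursing into nested tables."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def get(config: dict, dotted_key: str):
    """Resolve a dot-notation key such as 'recall.budget'."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]
    return node


# Global defaults (subset) and a project file that overrides one value.
global_cfg = {"recall": {"budget": 2000, "expand_refs": True}, "inbox": {"capacity": 7}}
project_cfg = {"recall": {"budget": 3000}}
effective = merge(global_cfg, project_cfg)
```

Here `get(effective, "recall.budget")` yields the project's 3000, while `recall.expand_refs` and `inbox.capacity` fall through to the global values.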
See the full Specification for details on file format, dynamic loading, precedence rules, and integration guides.
Two kinds of benchmark live here:
- System-level evaluation — does the memory/RAG pipeline retrieve the right thing and answer correctly? Measured on real code (your repositories) and on the LongMemEval conversational-memory dataset.
- Microbenchmarks: how fast are the individual primitives (distance kernels, HNSW, quantization)? Measured with `cargo bench` (criterion).
Provenance. LongMemEval experiments 1–6 (retrieval side): commit `b25e88e` (clean working tree), 2026-06-09, via `just longmemeval-json` (`mnemonist-bench --dataset data/longmemeval_s_cleaned.json --temporal-cycles 10 --format json`) on an Apple M4 Pro (CPU, candle embeddings, default features + `bench-cli`), embedder all-MiniLM-L6-v2 (384-dim, candle + accelerate), HNSW `m=16, m0=32, ef_construction=200, ef_search=100`, dataset `longmemeval_s_cleaned.json` (19,195 sessions / 500 questions, sha256 `d6f21ea9d60a0d56f34a05b609c79c88a451d2ae03597821ea3d5a9678c3a442`). The code-RAG results and the gpt-4o-judged QA accuracy are from the earlier 2026-05 run at `f68c13a` and were not rerun. Reproduce with the commands in each section.
The product's headline use case: mnemonist learn <repo> ingests a codebase, then
mnemonist remember "<question>" should surface the source files that answer it. This
measures that directly with natural-language, intent-based queries mapped to gold
target files, over five real repositories spanning Rust, Go, and Python.
| repo | n | recall@1 | recall@3 | recall@5 | recall@10 | MRR | precision@5 |
|---|---|---|---|---|---|---|---|
| fsrc (Rust/Py) | 16 | 38% | 69% | 81% | 81% | 0.542 | 0.163 |
| sr (Rust) | 16 | 31% | 50% | 56% | 88% | 0.441 | 0.125 |
| teasr (Rust) | 16 | 62% | 75% | 81% | 81% | 0.703 | 0.163 |
| saige (Go) | 14 | 21% | 57% | 86% | 93% | 0.461 | 0.186 |
| mnemonist (Rust) | 16 | 19% | 38% | 56% | 62% | 0.332 | 0.125 |
| overall (macro) | 78 | 34% | 58% | 72% | 81% | 0.496 | 0.152 |
For ~3 in 4 natural-language queries a relevant file lands in the top 5 (recall@5 72%),
and 4 in 5 by top 10. recall@1 (34%) is lower — the single best chunk is often a
sibling of the true answer. Retrieval is strong at the file level; exact top-chunk ranking has
headroom (a code-tuned embedder or reranker would lift it). Gold sets are in
docs/benchmarks/rag_gold/ — intent-based queries with verified gold
paths. Each repo is learned into an isolated storage root (HOME override) so the
real ~/.mnemonist is never touched.
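For readers unfamiliar with the column headers, the sketch below shows one common way to compute file-level recall@k, precision@k, and reciprocal rank for a single query. It assumes hit-style recall (a query counts as recalled when any gold file appears in the top k); `scripts/rag_eval.py` may define the metrics differently, and the file names are invented.

```python
def recall_at_k(ranked: list, gold: set, k: int) -> float:
    """1.0 if any gold file appears in the top k results, else 0.0."""
    return 1.0 if any(path in gold for path in ranked[:k]) else 0.0


def precision_at_k(ranked: list, gold: set, k: int) -> float:
    """Fraction of the k result slots occupied by gold files."""
    return sum(1 for path in ranked[:k] if path in gold) / k


def reciprocal_rank(ranked: list, gold: set) -> float:
    """1/rank of the first gold file, or 0.0 when none is retrieved."""
    for rank, path in enumerate(ranked, start=1):
        if path in gold:
            return 1.0 / rank
    return 0.0


# Two hypothetical queries: (ranked result files, gold files).
queries = [
    (["a.rs", "b.rs", "c.rs"], {"b.rs"}),
    (["x.go", "y.go"], {"z.go"}),
]
mean_recall_at_3 = sum(recall_at_k(r, g, 3) for r, g in queries) / len(queries)
mrr = sum(reciprocal_rank(r, g) for r, g in queries) / len(queries)
```

Under these definitions a query with a single gold file can score at most 1/k on precision@k, which would explain why precision@5 in the table stays below 0.2 even where recall@5 is high.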
uv run scripts/rag_eval.py \
--gold-dir docs/benchmarks/rag_gold --repos-root ~/github \
--out docs/benchmarks/rag_results.json --md docs/benchmarks/rag_results.md

Six experiments in `crates/mnemonist-core/src/evals/bench/` run against a LongMemEval dataset:
just longmemeval # all experiments
just longmemeval-select 2,4 # specific experiments

| # | Experiment | Measures |
|---|---|---|
| 1 | Vector retrieval | per-question session retrieval recall@k (NOT QA) |
| 2 | Latency scaling | index build + p50/p95/p99 query latency, 100–10k docs |
| 3 | Storage footprint | raw vs TurboQuant size + recall, 1–4 bits |
| 4 | Temporal retrieval | recall lift from Hebbian reinforcement over time |
| 5 | MemPalace comparison | apples-to-apples retrieval parity (NOT a QA score) |
| 6 | LongMemEval QA | real end-to-end QA accuracy (retrieve → LLM → judge) |
Per question, an HNSW index is built from that question's ~48-session haystack and
queried. This is retrieval recall, not QA accuracy. Intervals are 95% Wilson (n=500),
emitted by the harness and committed in docs/benchmarks/results.json.
| metric | value | 95% CI |
|---|---|---|
| recall_any@5 | 96.4% | [94.4, 97.7] |
| recall_all@5 | 84.8% | [81.4, 87.7] |
| recall_any@10 | 98.2% | [96.6, 99.1] |
| recall_all@10 | 93.2% | [90.6, 95.1] |
| MRR | 0.873 | — |
| avg query | 9.7 ms (incl. embedding) | — |
Timing scope (corrected 2026-06-09): `embed_time_ms` (1,583,789 ms) now measures haystack embedding alone, and the experiment loop's wall time is reported separately as `total_time_ms` (1,590,953 ms); the previously committed value timed the whole loop. The recall numbers were reproduced exactly under the fix (96.4% / 84.8%).
MemPalace's quoted "82.8% LongMemEval score" is retrieval recall, not QA accuracy; its 96.6% figure was never independently reproduced. mnemonist's 96.4% above is the comparable retrieval number. The QA number is below.
Full pipeline: retrieve top-5 sessions → a reader LLM answers from the retrieved transcripts → an LLM judge scores against the gold answer.
just longmemeval-select 6 # Phase A: retrieve context (--qa-output)
# Phase B+C: generate answers, then judge (needs OPENAI_API_KEY)
uv run scripts/longmemeval_qa.py all --context context.jsonl \
--reader-model gpt-4o-mini --judge-model gpt-4o --report report.json

Configuration: reader gpt-4o-mini, judge gpt-4o, top_k=5, embedder all-MiniLM-L6-v2,
500 questions. (LongMemEval scores are configuration-dependent; a stronger reader raises
them.) Accuracy comes from the 2026-05 judged run at f68c13a; the LLM-judge pipeline
was not rerun for the 2026-06-09 update and is unaffected by the retrieval-side fixes.
The retrieval-recall column was reproduced exactly by the 2026-06-09 rerun.
Accuracy intervals are 95% Wilson, computed from the committed per-type counts in
docs/benchmarks/results.json; both the judge script and the substring scorer now emit
them on every run. The small-n rows are wide — single-session-preference at n=30 is
consistent with anything from 3.5% to 25.6%.
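The Wilson score interval quoted above can be recomputed from a success count and a sample size. The sketch below uses the standard closed form; the function name is illustrative and this is not the harness's own code. The small-n example assumes 10.0% of n=30 corresponds to 3 correct answers.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion, as (low, high)."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - margin, center + margin


# single-session-preference: 3 correct out of 30
low, high = wilson_interval(3, 30)
```

Rounded to one decimal, `low` and `high` come out at 3.5% and 25.6%, matching the single-session-preference row. Unlike the naive normal approximation, the interval is asymmetric around 10% and never dips below zero, which is why it suits the small-n rows.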
| question type | accuracy | 95% CI | retrieval recall | n |
|---|---|---|---|---|
| single-session-assistant | 87.5% | [76.4, 93.8] | 96.4% | 56 |
| single-session-user | 74.3% | [63.0, 83.1] | 91.4% | 70 |
| knowledge-update | 39.7% | [29.6, 50.8] | 100% | 78 |
| multi-session | 19.5% | [13.7, 27.1] | 99.2% | 133 |
| temporal-reasoning | 18.8% | [13.1, 26.3] | 94.0% | 133 |
| single-session-preference | 10.0% | [3.5, 25.6] | 96.7% | 30 |
| overall | 37.2% | [33.1, 41.5] | 96.4% | 500 |
Retrieval is strong everywhere (91–100%), so the gap to QA accuracy comes from the reader's reasoning, not the memory's recall. Accuracy on factual single-session questions is high (74–88%); the 87.5% on single-session-assistant is only possible because the QA pipeline now emits full transcripts (user + assistant turns). It previously fed user turns only, which made assistant-sourced answers impossible. The hard categories are genuinely hard for a small reader: temporal date arithmetic, multi-session synthesis, and preference (the last is largely an artifact of a factual-recall reader prompt that abstains rather than applying the remembered preference; retrieval recall there is 96.7%).
Per-query latency over the HNSW index at increasing corpus sizes (M4 Pro, 384-dim). Methodology: 50 warmup queries per scale point, percentiles are the median of 5 runs, and session subsets are deterministic (sorted session IDs) so every run measures the same documents.
| n_docs | build | p50 | p95 | p99 |
|---|---|---|---|---|
| 100 | 18 ms | 44 µs | 46 µs | 48 µs |
| 500 | 167 ms | 111 µs | 131 µs | 137 µs |
| 1,000 | 407 ms | 130 µs | 167 µs | 179 µs |
| 5,000 | 2.8 s | 148 µs | 221 µs | 253 µs |
| 10,000 | 6.0 s | 156 µs | 244 µs | 295 µs |
Query latency stays sub-millisecond even at 10k documents and scales sub-linearly.
Does recall improve as memories are used? This experiment runs 10 consolidation cycles with Hebbian reinforcement (access strengthens frequently retrieved memories) in the global-retrieval setting. Both arms are scored on the same held-out eval slice (n=250), and a no-reinforcement control runs the identical retrieve-20/rerank/take-5 path with zero access counts, so the reported delta isolates the reinforcement signal itself.
| arm | recall_any@5 |
|---|---|
| baseline (static) | 37.2% |
| control (rerank, no reinforcement) | 37.2% |
| reinforced (10 cycles) | 37.2% |
| delta vs control | +0.0 pp |
All three arms sit at 37.2% (95% Wilson [31.4, 43.3], n=250); the exact McNemar test on the paired reinforced-vs-control hits gives p = 1.0000 with 0/0 discordant pairs — the two arms retrieve identical top-5 sets for every eval query. The previously published +9.4 pp compared arms over different query populations (all 500 queries vs the held-out 250) and subtracted the pure-cosine baseline, which credited the reranker's own contribution to reinforcement; under the corrected design the reinforcement signal contributes nothing on this dataset. (Access patterns are synthetic repeated queries, not real logs — a controlled demonstration of the mechanism.)
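The exact McNemar test looks only at discordant pairs: queries one arm gets right and the other gets wrong. A minimal sketch of the two-sided exact version follows, using the binomial tail under a fair-coin null; the function name is illustrative, and returning 1.0 when there are no discordant pairs is the usual convention rather than something stated by the harness.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: pairs where only arm A hit; c: pairs where only arm B hit.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: the arms are indistinguishable
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_identical = mcnemar_exact(0, 0)   # the reinforced-vs-control case above
p_lopsided = mcnemar_exact(1, 9)    # hypothetical: 1 vs 9 discordant pairs
```

With 0/0 discordant pairs the test has nothing to compare, which is why the reported p-value is exactly 1.0; the hypothetical 1-vs-9 split gives roughly 0.021.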
Raw vs TurboQuant (MSE) at 1–4 bits over 19,195 × 384-dim embeddings. Recall is the global retrieval setting (every query vs all sessions — far harder than the per-question haystack above, hence lower absolute numbers; the point is degradation vs raw).
| storage | bytes/vector | total | compression | recall@5 | recall@10 | McNemar p vs raw (@5 / @10) | cosine distortion |
|---|---|---|---|---|---|---|---|
| raw f32 | 1536 B | 29.5 MB | 1.0× | 26.8% | 37.0% | — | 0 |
| 4-bit | 196 B | 3.76 MB | 7.8× | 27.0% | 36.8% | 1.000 / 1.000 | 0.0047 |
| 3-bit | 148 B | 2.84 MB | 10.4× | 26.2% | 35.6% | 0.581 / 0.092 | 0.0173 |
| 2-bit | 100 B | 1.92 MB | 15.4× | 26.0% | 34.6% | 0.523 / 0.017 | 0.0603 |
| 1-bit | 52 B | 1.00 MB | 29.5× | 28.8% | 35.2% | 0.121 / 0.262 | 0.2017 |
At 3–4 bits, cosine distortion is <0.02 and recall matches the raw baseline within
noise (4-bit recall@10 36.8% vs raw 37.0%, exact McNemar p = 1.0 at both cutoffs) —
quantization is effectively lossless for retrieval at 8–10× compression. The 1-bit
recall@5 cell exceeding raw (28.8% vs 26.8%) is sampling noise, not signal: the paired
exact McNemar test gives p = 0.121, and at n=500 the 95% Wilson intervals are raw
[23.1, 30.8] and 1-bit [25.0, 32.9], overlapping heavily. The only cell that clears
p < 0.05 is 2-bit recall@10 (34.6% vs raw 37.0%, p = 0.017, uncorrected for the eight
comparisons), consistent with a small real degradation at 2 bits. TurboQuant is
currently a research/eval module and is
not wired into learn/remember (which store full f32); this benchmark is the basis
for deciding whether to integrate it.
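The bytes/vector and compression columns can be reproduced with simple arithmetic. The table is consistent with bit-packed codes plus 4 bytes of per-vector overhead (196 − 192 at 4 bits); that overhead figure is inferred from the numbers, not stated in the text, and what it stores is not specified. The helper below is a hypothetical sketch under that assumption.

```python
DIM = 384          # all-MiniLM-L6-v2 embedding width
N_VECTORS = 19_195  # sessions in the benchmark
RAW_BYTES = DIM * 4  # f32 storage: 1536 bytes per vector


def quantized_bytes(dim: int, bits: int, overhead: int = 4) -> int:
    """Bit-packed code size, rounded up to whole bytes, plus assumed overhead."""
    return (dim * bits + 7) // 8 + overhead


footprint = {}
for bits in (4, 3, 2, 1):
    per_vector = quantized_bytes(DIM, bits)
    footprint[bits] = {
        "bytes_per_vector": per_vector,
        "total_mb": N_VECTORS * per_vector / 1e6,
        "compression": RAW_BYTES / per_vector,
    }
```

For 4 bits this gives 196 bytes per vector, about 3.76 MB in total, and a ratio of about 7.8, in line with the table; the fixed overhead is also why 1-bit compression lands near 29.5× rather than 32×.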
Run `just bench` (all) or `cargo bench -p mnemonist-core --features evals,ann,quant`.
| Function | 32-d | 128-d | 384-d |
|---|---|---|---|
| `cosine_similarity` | 12 ns | 59 ns | 207 ns |
| `dot_product` | 4 ns | 28 ns | 120 ns |
| `l2_distance_squared` | 5 ns | 30 ns | 125 ns |
| `normalize` | 18 ns | 82 ns | 239 ns |
| Operation | Time |
|---|---|
| Build (500 inserts) | 32.7 ms |
| Search top-1/10/50 | 15.2 µs |
| Save / Load | 91 µs / 85 µs |
| Operation | Time |
|---|---|
| Train (k-means, 16 clusters) | 2.2 ms |
| Search top-1/10/50 | ~12 µs |
| Save / Load | 66 µs / 57 µs |
| Bit-width | MSE quantize | MSE dequantize | Prod quantize | Prod dequantize |
|---|---|---|---|---|
| 1-bit | 3.9 µs | 991 ns | — | — |
| 2-bit | 3.9 µs | 988 ns | 116 µs | 141 µs |
| 3-bit | 3.9 µs | 997 ns | 115 µs | 111 µs |
| 4-bit | 4.1 µs | 998 ns | 115 µs | 111 µs |
| Operation | 128x2b | 384x2b | 384x4b |
|---|---|---|---|
| Pack | 161 ns | 539 ns | 264 ns |
| Unpack | 90 ns | 270 ns | 241 ns |
| Operation | 128d x 100 | 384d x 100 | 384d x 500 |
|---|---|---|---|
| `upsert` | 34.3 µs | 76.2 µs | 493.5 µs |
| `get` | 72 ns | 71 ns | 71 ns |
| `remove` | 4.0 µs | 5.5 µs | 35.7 µs |
| `save` | 46.9 µs | 61.0 µs | 177.4 µs |
| `load` | 15.4 µs | 20.9 µs | 74.1 µs |
| Operation | cap=7 | cap=50 |
|---|---|---|
| `push_to_capacity` | 698 ns | 10.3 µs |
| `push_with_eviction` | 1.3 µs | 26.9 µs |
| `save` | 39.6 µs | 46.9 µs |
| `load` | 10.9 µs | 19.7 µs |
| `drain` | 688 ns | 10.4 µs |
| Operation | 10 entries | 100 entries |
|---|---|---|
| `parse_line` | 73 ns | — |
| `to_line` | 91 ns | — |
| `search` | 612 ns | 6.0 µs |
| `upsert_new` | 549 ns | 4.7 µs |
| `upsert_existing` | 508 ns | 4.5 µs |
| Function | 32d x 50 | 128d x 50 | 384d x 20 |
|---|---|---|---|
| `anisotropy` | 15.8 µs | 73.0 µs | 40.9 µs |
| `similarity_range` | 15.8 µs | 73.0 µs | 40.9 µs |
| `mean_center` | 1.3 µs | 3.8 µs | 5.6 µs |
| `discrimination_gap` | 16.0 µs | — | — |
Measured on Apple Silicon (M4 Pro) with `cargo bench`. Run `just bench` to reproduce. Raw results in `docs/benchmarks/`.
just test # cargo test --workspace
bash scripts/validate.sh # full E2E validation (requires release build)

See CONTRIBUTING.md for what each test suite covers and per-crate test counts.
This repo's conventions are available as portable agent skills in skills/, following the Agent Skills Specification.
Related standards: AGENTS.md · llms.txt
Apache-2.0

