mnemonist

An open ecosystem for tool-agnostic AI agent memory.

Download · Report Bug · Specification

Features

Cognitive CLI — commands named after memory processes: memorize, remember, note, learn, consolidate, reflect, forget
Two-level memory — project (~/.mnemonist/{project}/) and global (~/.mnemonist/global/)
Working memory inbox — capacity-limited staging area (default 7 items) with attention scoring; items promoted to long-term memory via consolidate
Memory metadata — strength, access count, last accessed, source tracking; Hebbian reinforcement on retrieval
Plain markdown with YAML frontmatter — human-readable, git-friendly
Typed memories — user, feedback, project, reference
Local embedding — candle crate with all-MiniLM-L6-v2 (384-dim, CPU/CUDA); no external server needed; model downloads from HuggingFace Hub on first use
Layered graph — three HNSW layers: code (.code-index.hnsw), project memory (.memory-index.hnsw), and global memory; inter-layer edges via refs frontmatter field
Pluggable code chunking — ChunkingStrategy trait with built-in ParagraphChunking (blank-line boundaries) and FixedLineChunking (sliding window with overlap); no tree-sitter dependency
Cross-layer recall — remember searches memory and code indices in parallel with blended relevance scoring (semantic + temporal); follows refs edges to surface referenced code chunks
Consolidation — consolidate promotes inbox items, decays stale memories, and re-embeds
Fuzzy forget — forget resolves partial and suffix matches so you don't need the full filename
Embedding quality metrics — learn reports anisotropy and similarity_range after indexing
TurboQuant — vector quantization (1-4 bit) available as a research/eval module (see the storage-footprint benchmark); not yet wired into learn/remember, which store full f32 embeddings
JSON-first — stdout for structured JSON, stderr for UX; pipe-friendly
Works with Claude Code, Codex, Gemini, Copilot, Cursor, or any AI tool
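
The two built-in chunking strategies named in the feature list can be illustrated in a few lines. This is a minimal Python sketch of the ideas only (blank-line boundaries, and a sliding window with overlap); the real ChunkingStrategy trait is Rust, and the function names, parameters, and edge-case behavior below are assumptions made for illustration.

```python
from typing import List


def paragraph_chunks(text: str) -> List[str]:
    """Split text into chunks at runs of blank lines."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the chunk that is being collected.
            chunks.append("\n".join(current))
            current = []
    if current:
        chunks.append("\n".join(current))
    return chunks


def fixed_line_chunks(text: str, size: int, overlap: int) -> List[str]:
    """Sliding window of `size` lines; consecutive windows share `overlap` lines."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    lines = text.splitlines()
    step = size - overlap
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + size]))
        if start + size >= len(lines):
            break  # the last window already reached the end of the file
    return chunks
```

Paragraph chunking keeps logically grouped lines together without a parser, while the fixed-line window bounds chunk size for files that have few blank lines.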

Install

curl -fsSL https://raw.githubusercontent.com/urmzd/mnemonist/main/install.sh | sh

Hardware acceleration

Pre-built binaries run on CPU with pure Rust matmuls — functional, but large batch operations like learn are slower than accelerated builds. If you have a Rust toolchain, you can build from source with hardware acceleration:

# macOS — Apple's Accelerate BLAS (~2x faster embedding throughput)
cargo install mnemonist --features accelerate

# Linux/Windows with an NVIDIA GPU
cargo install mnemonist --features cuda

Quick Start

# 1. Install
curl -fsSL https://raw.githubusercontent.com/urmzd/mnemonist/main/install.sh | sh
# Or, if you have a Rust toolchain: cargo install mnemonist

# 2. Ingest the codebase — auto-creates ~/.mnemonist/{project}/ and embeds source files
mnemonist learn .

# 3. Memorize long-term knowledge
mnemonist memorize "prefer Rust for CLI tools" -t feedback
mnemonist memorize "deep Go expertise, new to React" -t user

# 4. Jot quick notes into the working memory inbox
mnemonist note "look into async runtime choices"
mnemonist note "check Linear project INGEST for pipeline bugs"

# 5. Consolidate — promote inbox to long-term memory, decay stale items, re-embed
mnemonist consolidate

# 6. Recall — semantic + text search across memories and code
mnemonist remember "rust async patterns"

# 7. Review everything
mnemonist reflect --all

# 8. Forget something you no longer need (fuzzy name matching)
mnemonist forget prefer-rust

Usage

Memory Levels

Level	Location	Scope
Project	`~/.mnemonist/{project}/`	Per-repo corrections, decisions
Global	`~/.mnemonist/global/`	Cross-project preferences, expertise

Project memory takes precedence over global when they conflict.

Memory Types

Type	When	Example
`user`	Expertise, preferences	"Deep Rust knowledge, new to React"
`feedback`	Corrections, validated approaches	"Never mock the database in tests"
`project`	Repo-specific context (project-level only)	"Auth rewrite driven by compliance"
`reference`	External resource pointers	"Bugs tracked in Linear project INGEST"

CLI at a glance

CLI Commands

Command	Description
`mnemonist memorize "<point>" [-t type] [-n name]`	Deliberately encode a point into long-term memory (auto-embeds)
`mnemonist note "<point>"`	Jot a quick note into working memory inbox
`mnemonist remember "<ask>" [--budget N] [--level both]`	Recall memories by cue — searches memory and code indices in parallel with blended relevance scoring, follows refs
`mnemonist learn [path] [--attend glob] [--capacity N]`	Ingest a codebase; chunks files with `ParagraphChunking`, embeds into `.code-index.hnsw`, reports quality metrics. `--attend` only indexes files matching this glob
`mnemonist consolidate [--dry-run]`	Promote inbox items, decay stale memories, re-embed into `.memory-index.hnsw`
`mnemonist reflect [--all] [--global]`	Introspect — review memories and inbox contents
`mnemonist forget <file>`	Deliberately forget a memory (supports fuzzy/suffix name matching)
`mnemonist config init`	Create default config file
`mnemonist config show`	Show current configuration
`mnemonist config get <key>`	Get a config value (dot-notation)
`mnemonist config set <key> <value>`	Set a config value
`mnemonist config path`	Print config file path

All commands write JSON to stdout in the form {"ok": true, "data": {...}}; UX messages go to stderr. --format json (the default) emits compact JSON, and --format human pretty-prints it.
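
Because stdout is always JSON, a calling script can unwrap the envelope before using the payload. The sketch below assumes only the {"ok": ..., "data": ...} shape shown above; how a failed command is reported is not described here, so treating a falsy "ok" as an error is an assumption, and parse_envelope is a hypothetical helper name.

```python
import json
from typing import Any


def parse_envelope(stdout_text: str) -> Any:
    """Return the `data` payload of a JSON envelope read from stdout."""
    envelope = json.loads(stdout_text)
    # Assumption: a command that failed sets "ok" to false.
    if not envelope.get("ok"):
        raise RuntimeError(f"command reported failure: {envelope}")
    return envelope.get("data")
```

In practice stdout_text would come from something like subprocess.run([...], capture_output=True, text=True).stdout, with stderr left free for human-facing messages.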

Working Memory (Inbox)

The inbox is a capacity-limited staging area modeled after human working memory (default capacity: 7). Items enter via note (manual) or learn (code ingestion) and are scored by attention:

Items are sorted by attention score; lowest-scored items are evicted at capacity
consolidate promotes inbox items to long-term memory and clears the inbox
Stored in .inbox.json alongside memory files
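
The eviction behavior described above can be sketched as a small bounded container. This is an illustration, not the actual implementation: the attention score is a plain float supplied by the caller (the real scoring formula is not shown in this README), and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InboxItem:
    text: str
    attention: float  # hypothetical score; the real formula is not shown here


@dataclass
class Inbox:
    capacity: int = 7
    items: List[InboxItem] = field(default_factory=list)

    def push(self, item: InboxItem) -> List[InboxItem]:
        """Add an item, keep items sorted by attention, return evicted items."""
        self.items.append(item)
        self.items.sort(key=lambda i: i.attention, reverse=True)
        evicted = self.items[self.capacity:]
        del self.items[self.capacity:]
        return evicted

    def drain(self) -> List[InboxItem]:
        """Hand every item over (as consolidate would) and clear the inbox."""
        drained, self.items = self.items, []
        return drained
```

Note that in this sketch a newly pushed item can itself be evicted if it scores lowest; whether mnemonist behaves the same way is an assumption.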

Consolidation

mnemonist consolidate runs a sleep-like consolidation cycle:

Promote — inbox items become long-term memories with type and strength
Decay — memories not accessed within consolidation.decay_days (default 90) and below protected_access_count (default 5) are pruned
Re-embed — all surviving memories are re-embedded for fresh semantic search

Use --dry-run to preview what would change.
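
The decay rule can be read as a conjunction of two conditions: a memory is pruned only if it is both stale and rarely accessed. The following sketch encodes that reading with the documented defaults; the function name and the strict comparison at exactly decay_days are assumptions.

```python
from datetime import datetime, timedelta


def should_prune(last_accessed: datetime, access_count: int, now: datetime,
                 decay_days: int = 90, protected_access_count: int = 5) -> bool:
    """True when a memory is stale AND below the protected access count."""
    stale = now - last_accessed > timedelta(days=decay_days)
    unprotected = access_count < protected_access_count
    return stale and unprotected
```

Under this reading, a memory retrieved five or more times survives no matter how long ago it was last accessed.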

Memory Metadata

Each memory file tracks cognitive metadata in its frontmatter:

Field	Description
`strength`	Consolidation strength (increases on survival)
`access_count`	Retrieval count (Hebbian reinforcement)
`last_accessed`	ISO 8601 timestamp of last retrieval
`created_at`	When the memory was first created
`source`	How it was created: `memorize`, `note`, `learn`, `consolidation`
`consolidated_from`	Original files if created via merge
`refs`	Inter-layer edges — code chunk IDs or memory filenames this memory links to
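
As an illustration of how the retrieval-related fields move together, the sketch below updates access_count and last_accessed when a memory is recalled. It operates on a plain dict standing in for parsed frontmatter; record_retrieval is a hypothetical helper, and storing timestamps in UTC is an assumption.

```python
from datetime import datetime, timezone


def record_retrieval(meta: dict, now: datetime) -> dict:
    """Return a copy of the frontmatter with retrieval metadata updated."""
    updated = dict(meta)
    updated["access_count"] = int(meta.get("access_count", 0)) + 1
    # ISO 8601 timestamp, normalized to UTC (assumed convention).
    updated["last_accessed"] = now.astimezone(timezone.utc).isoformat()
    return updated
```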

Configuration

Configuration is layered: ~/.mnemonist/mnemonist.toml holds the global defaults (created with mnemonist config init), and ./mnemonist.toml at the project root supplies per-project overrides. Fields missing from the project file inherit the global value.

[storage]
root = "~/.mnemonist"

[embedding]
provider = "candle"
model = "sentence-transformers/all-MiniLM-L6-v2"

[recall]
budget = 2000
expand_refs = true
max_ref_expansions = 3
min_results = 2

[index]
max_lines = 200

[code]
# Trimmed here; `mnemonist config show` prints the full default list.
exclude_patterns = ["dist", "node_modules", "target", "package-lock", ".min.js"]

[consolidation]
decay_days = 90
merge_threshold = 0.85
protected_access_count = 5
max_memory_tokens = 120

[inbox]
capacity = 7

[output]
quiet = false

Use mnemonist config set recall.budget 3000 to change values. Every key shown above is read by the CLI; a test enforces that no dead keys are accepted.
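
The layering and dot-notation lookup can be pictured as a recursive merge followed by a path walk. This sketch works on already-parsed dictionaries and is only meant to show the inheritance rule; merge_config and get_by_path are hypothetical names, not part of the CLI.

```python
from typing import Any, Dict


def merge_config(global_cfg: Dict[str, Any], project_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Project values override global ones; missing fields inherit."""
    merged = dict(global_cfg)
    for key, value in project_cfg.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)  # merge tables
        else:
            merged[key] = value  # scalar or list: project wins outright
    return merged


def get_by_path(cfg: Dict[str, Any], dotted_key: str) -> Any:
    """Look up a value with dot-notation, e.g. "recall.budget"."""
    node: Any = cfg
    for part in dotted_key.split("."):
        node = node[part]
    return node
```

For example, a project file containing only [recall] budget = 3000 would still resolve recall.expand_refs and inbox.capacity from the global file.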

See the full Specification for details on file format, dynamic loading, precedence rules, and integration guides.

Benchmarks

Two kinds of benchmark live here:

System-level evaluation — does the memory/RAG pipeline retrieve the right thing and answer correctly? Measured on real code (your repositories) and on the LongMemEval conversational-memory dataset.
Microbenchmarks — how fast are the individual primitives (distance kernels, HNSW, quantization)? Measured with cargo bench (criterion).

Provenance. LongMemEval experiments 1–6 (retrieval side): commit b25e88e (clean working tree), 2026-06-09, via just longmemeval-json (mnemonist-bench --dataset data/longmemeval_s_cleaned.json --temporal-cycles 10 --format json) on an Apple M4 Pro (CPU, candle embeddings, default features + bench-cli), embedder all-MiniLM-L6-v2 (384-dim, candle + accelerate), HNSW m=16, m0=32, ef_construction=200, ef_search=100, dataset longmemeval_s_cleaned.json (19,195 sessions / 500 questions, sha256 d6f21ea9d60a0d56f34a05b609c79c88a451d2ae03597821ea3d5a9678c3a442). The code-RAG results and the gpt-4o-judged QA accuracy are from the earlier 2026-05 run at f68c13a and were not rerun. Reproduce with the commands in each section.

Code retrieval — RAG over real repositories

The product's headline use case: mnemonist learn <repo> ingests a codebase, then mnemonist remember "<question>" should surface the source files that answer it. This measures that directly with natural-language, intent-based queries mapped to gold target files, over five real repositories spanning Rust, Go, and Python.

repo	n	recall@1	recall@3	recall@5	recall@10	MRR	precision@5
fsrc (Rust/Py)	16	38%	69%	81%	81%	0.542	0.163
sr (Rust)	16	31%	50%	56%	88%	0.441	0.125
teasr (Rust)	16	62%	75%	81%	81%	0.703	0.163
saige (Go)	14	21%	57%	86%	93%	0.461	0.186
mnemonist (Rust)	16	19%	38%	56%	62%	0.332	0.125
overall (macro)	78	34%	58%	72%	81%	0.496	0.152

For roughly 3 in 4 natural-language queries a relevant file lands in the top 5 (recall@5 72%), and for about 4 in 5 it lands in the top 10 (recall@10 81%). recall@1 (34%) is lower because the single best chunk is often a sibling of the true answer. Retrieval is strong at the file level; exact top-chunk ranking has headroom (a code-tuned embedder or reranker would lift it). Gold sets are in docs/benchmarks/rag_gold/ and consist of intent-based queries with verified gold paths. Each repo is learned into an isolated storage root (HOME override) so the real ~/.mnemonist is never touched.

uv run scripts/rag_eval.py \
  --gold-dir docs/benchmarks/rag_gold --repos-root ~/github \
  --out docs/benchmarks/rag_results.json --md docs/benchmarks/rag_results.md
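
For readers unfamiliar with the columns in the table above, here is a minimal sketch of the per-query metrics. It assumes recall@k is scored as a hit when any gold file appears in the top k (which matches the "a relevant file lands in the top 5" wording) and that the reported figures are means over queries; scripts/rag_eval.py may differ in detail.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """1.0 if any gold file appears in the top k results, else 0.0."""
    return 1.0 if any(path in gold for path in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1/rank of the first gold file, or 0.0 if none was retrieved."""
    for rank, path in enumerate(ranked, start=1):
        if path in gold:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of the top k slots occupied by gold files."""
    return sum(1 for path in ranked[:k] if path in gold) / k
```

MRR is then the mean of reciprocal_rank over all queries in a repo.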

LongMemEval — conversational memory

Six experiments in crates/mnemonist-core/src/evals/bench/ run against a LongMemEval dataset:

just longmemeval                # all experiments
just longmemeval-select 2,4     # specific experiments

#	Experiment	Measures
1	Vector retrieval	per-question session retrieval recall@k (NOT QA)
2	Latency scaling	index build + p50/p95/p99 query latency, 100–10k docs
3	Storage footprint	raw vs TurboQuant size + recall, 1–4 bits
4	Temporal retrieval	recall lift from Hebbian reinforcement over time
5	MemPalace comparison	apples-to-apples retrieval parity (NOT a QA score)
6	LongMemEval QA	real end-to-end QA accuracy (retrieve → LLM → judge)

Vector retrieval recall (Exp 1 / 5)

Per question, an HNSW index is built from that question's ~48-session haystack and queried. This is retrieval recall, not QA accuracy. Intervals are 95% Wilson (n=500), emitted by the harness and committed in docs/benchmarks/results.json.

metric	value	95% CI
recall_any@5	96.4%	[94.4, 97.7]
recall_all@5	84.8%	[81.4, 87.7]
recall_any@10	98.2%	[96.6, 99.1]
recall_all@10	93.2%	[90.6, 95.1]
MRR	0.873	—
avg query	9.7 ms (incl. embedding)	—

Timing scope (corrected 2026-06-09): embed_time_ms (1,583,789 ms) now measures haystack embedding alone, and the experiment loop's wall time is reported separately as total_time_ms (1,590,953 ms); the previously committed value timed the whole loop. The recall numbers were reproduced exactly under the fix (96.4% / 84.8%).

MemPalace's quoted "82.8% LongMemEval score" is retrieval recall, not QA accuracy; its 96.6% figure was never independently reproduced. mnemonist's 96.4% above is the comparable retrieval number. The QA number is below.

End-to-end QA accuracy (Exp 6) — "LongMemEval for real"

Full pipeline: retrieve top-5 sessions → a reader LLM answers from the retrieved transcripts → an LLM judge scores against the gold answer.

just longmemeval-select 6        # Phase A: retrieve context (--qa-output)
# Phase B+C: generate answers, then judge (needs OPENAI_API_KEY)
uv run scripts/longmemeval_qa.py all --context context.jsonl \
  --reader-model gpt-4o-mini --judge-model gpt-4o --report report.json

Configuration: reader gpt-4o-mini, judge gpt-4o, top_k=5, embedder all-MiniLM-L6-v2, 500 questions. (LongMemEval scores are configuration-dependent; a stronger reader raises them.) Accuracy comes from the 2026-05 judged run at f68c13a; the LLM-judge pipeline was not rerun for the 2026-06-09 update and is unaffected by the retrieval-side fixes. The retrieval-recall column was reproduced exactly by the 2026-06-09 rerun. Accuracy intervals are 95% Wilson, computed from the committed per-type counts in docs/benchmarks/results.json; both the judge script and the substring scorer now emit them on every run. The small-n rows are wide — single-session-preference at n=30 is consistent with anything from 3.5% to 25.6%.

question type	accuracy	95% CI	retrieval recall	n
single-session-assistant	87.5%	[76.4, 93.8]	96.4%	56
single-session-user	74.3%	[63.0, 83.1]	91.4%	70
knowledge-update	39.7%	[29.6, 50.8]	100%	78
multi-session	19.5%	[13.7, 27.1]	99.2%	133
temporal-reasoning	18.8%	[13.1, 26.3]	94.0%	133
single-session-preference	10.0%	[3.5, 25.6]	96.7%	30
overall	37.2%	[33.1, 41.5]	96.4%	500
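
The 95% Wilson intervals in the table can be checked by hand with the standard Wilson score formula. The sketch below is a standalone implementation, not the harness code; the success counts used in the usage note (3 of 30 and 186 of 500) are derived from the reported percentages and n.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z defaults to 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

wilson_interval(3, 30) gives about (0.035, 0.256), matching the single-session-preference row, and wilson_interval(186, 500) gives about (0.331, 0.415), matching the overall row.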

Retrieval is strong everywhere (91–100%) — the gap to QA accuracy is the reader's reasoning, not the memory's recall. Factual single-session recall is high (74–88%); single-session-assistant 87.5% is only possible because the QA pipeline now emits full transcripts (user + assistant turns) — it previously fed user-turns only, making assistant-sourced answers impossible. The hard categories are genuinely hard for a small reader: temporal date-arithmetic, multi-session synthesis, and preference (the last is largely an artifact of a factual-recall reader prompt that abstains rather than applying the remembered preference — recall there is 96.7%).

Latency scaling (Exp 2)

Per-query latency over the HNSW index at increasing corpus sizes (M4 Pro, 384-dim). Methodology: 50 warmup queries per scale point, percentiles are the median of 5 runs, and session subsets are deterministic (sorted session IDs) so every run measures the same documents.

n_docs	build	p50	p95	p99
100	18 ms	44 µs	46 µs	48 µs
500	167 ms	111 µs	131 µs	137 µs
1,000	407 ms	130 µs	167 µs	179 µs
5,000	2.8 s	148 µs	221 µs	253 µs
10,000	6.0 s	156 µs	244 µs	295 µs

Query latency stays sub-millisecond even at 10k documents and scales sub-linearly: p50 grows about 3.5× (44 µs to 156 µs) while the corpus grows 100×.
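
The "median of 5 runs" methodology can be sketched as follows. The percentile definition used by the harness is not stated here, so the nearest-rank method below is an assumption, and both function names are hypothetical.

```python
import math
from statistics import median
from typing import Dict, Sequence


def percentile(samples: Sequence[float], q: int) -> float:
    """Nearest-rank percentile: the smallest sample with >= q% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_runs(runs: Sequence[Sequence[float]],
                   qs: Sequence[int] = (50, 95, 99)) -> Dict[int, float]:
    """For each percentile, take the median of that percentile across runs."""
    return {q: median(percentile(run, q) for run in runs) for q in qs}
```

Taking the median across runs damps the effect of a single noisy run on the tail percentiles.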

Temporal reinforcement (Exp 4) — performance over time

Does recall improve as memories are used? Over 10 consolidation cycles with Hebbian reinforcement (access strengthens frequently-retrieved memories), global-retrieval setting. Both arms are scored on the same held-out eval slice (n=250), and a no-reinforcement control runs the identical retrieve-20/rerank/take-5 path with zero access counts, so the reported delta isolates the reinforcement signal itself.

arm	recall_any@5
baseline (static)	37.2%
control (rerank, no reinforcement)	37.2%
reinforced (10 cycles)	37.2%
delta vs control	+0.0 pp

All three arms sit at 37.2% (95% Wilson [31.4, 43.3], n=250); the exact McNemar test on the paired reinforced-vs-control hits gives p = 1.0000 with 0/0 discordant pairs — the two arms retrieve identical top-5 sets for every eval query. The previously published +9.4 pp compared arms over different query populations (all 500 queries vs the held-out 250) and subtracted the pure-cosine baseline, which credited the reranker's own contribution to reinforcement; under the corrected design the reinforcement signal contributes nothing on this dataset. (Access patterns are synthetic repeated queries, not real logs — a controlled demonstration of the mechanism.)
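
For reference, the exact McNemar test depends only on the discordant pairs: b queries hit by one arm but not the other, and c the reverse. The sketch below is a standalone two-sided binomial version, not the harness code; with 0/0 discordant pairs it returns 1.0, as reported above.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # the arms never disagree, so there is nothing to test
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```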

Storage & quantization (Exp 3)

Raw vs TurboQuant (MSE) at 1–4 bits over 19,195 × 384-dim embeddings. Recall is the global retrieval setting (every query vs all sessions — far harder than the per-question haystack above, hence lower absolute numbers; the point is degradation vs raw).

storage	bytes/vector	total	compression	recall@5	recall@10	McNemar p vs raw (@5 / @10)	cosine distortion
raw f32	1536 B	29.5 MB	1.0×	26.8%	37.0%	—	0
4-bit	196 B	3.76 MB	7.8×	27.0%	36.8%	1.000 / 1.000	0.0047
3-bit	148 B	2.84 MB	10.4×	26.2%	35.6%	0.581 / 0.092	0.0173
2-bit	100 B	1.92 MB	15.4×	26.0%	34.6%	0.523 / 0.017	0.0603
1-bit	52 B	1.00 MB	29.5×	28.8%	35.2%	0.121 / 0.262	0.2017

At 3–4 bits, cosine distortion is <0.02 and recall matches the raw baseline within noise (4-bit recall@10 36.8% vs raw 37.0%, exact McNemar p = 1.0 at both cutoffs), so quantization is effectively lossless for retrieval at roughly 8–10× compression (7.8× at 4 bits, 10.4× at 3 bits). The 1-bit recall@5 cell exceeding raw (28.8% vs 26.8%) is sampling noise, not signal: the paired exact McNemar test gives p = 0.121, and at n=500 the 95% Wilson intervals are raw [23.1, 30.8] and 1-bit [25.0, 32.9], overlapping heavily. The only cell that clears p < 0.05 is 2-bit recall@10 (34.6% vs raw 37.0%, p = 0.017, uncorrected for the eight comparisons), consistent with a small real degradation at 2 bits. TurboQuant is currently a research/eval module and is not wired into learn/remember (which store full f32); this benchmark is the basis for deciding whether to integrate it.
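
The bytes/vector and compression columns follow from simple arithmetic. The sketch below reproduces them under one inferred assumption: each quantized vector carries 4 bytes of overhead on top of the packed codes. That figure is read off the table (for example 384 × 4 / 8 = 192, and the table reports 196); what those 4 bytes actually store is not stated here.

```python
def quantized_bytes(dim: int, bits: int, overhead_bytes: int = 4) -> int:
    """Packed code size in bytes plus an assumed fixed per-vector overhead."""
    return (dim * bits + 7) // 8 + overhead_bytes


def compression_ratio(dim: int, bits: int, overhead_bytes: int = 4) -> float:
    """Raw f32 size (4 bytes per dimension) divided by the quantized size."""
    return (dim * 4) / quantized_bytes(dim, bits, overhead_bytes)


def total_megabytes(n_vectors: int, bytes_per_vector: int) -> float:
    """Total size in decimal megabytes (10**6 bytes), as the table appears to use."""
    return n_vectors * bytes_per_vector / 1e6
```

With dim=384 this yields 196, 148, 100, and 52 bytes for 4, 3, 2, and 1 bits, and 19,195 vectors at 196 bytes come to about 3.76 MB.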

Microbenchmarks (`cargo bench`)

Run just bench (all) or cargo bench -p mnemonist-core --features evals,ann,quant.

Distance Functions

Function	32-d	128-d	384-d
`cosine_similarity`	12 ns	59 ns	207 ns
`dot_product`	4 ns	28 ns	120 ns
`l2_distance_squared`	5 ns	30 ns	125 ns
`normalize`	18 ns	82 ns	239 ns

HNSW Index (500 vectors, dim=32)

Operation	Time
Build (500 inserts)	32.7 ms
Search top-1/10/50	15.2 µs
Save / Load	91 µs / 85 µs

IVF-Flat Index (500 vectors, dim=32)

Operation	Time
Train (k-means, 16 clusters)	2.2 ms
Search top-1/10/50	~12 µs
Save / Load	66 µs / 57 µs

TurboQuant (dim=128)

Bit-width	MSE quantize	MSE dequantize	Prod quantize	Prod dequantize
1-bit	3.9 µs	991 ns	—	—
2-bit	3.9 µs	988 ns	116 µs	141 µs
3-bit	3.9 µs	997 ns	115 µs	111 µs
4-bit	4.1 µs	998 ns	115 µs	111 µs

Bit Packing

Operation	128x2b	384x2b	384x4b
Pack	161 ns	539 ns	264 ns
Unpack	90 ns	270 ns	241 ns

Embedding Store

Operation	128d x 100	384d x 100	384d x 500
`upsert`	34.3 µs	76.2 µs	493.5 µs
`get`	72 ns	71 ns	71 ns
`remove`	4.0 µs	5.5 µs	35.7 µs
`save`	46.9 µs	61.0 µs	177.4 µs
`load`	15.4 µs	20.9 µs	74.1 µs

Inbox

Operation	cap=7	cap=50
`push_to_capacity`	698 ns	10.3 µs
`push_with_eviction`	1.3 µs	26.9 µs
`save`	39.6 µs	46.9 µs
`load`	10.9 µs	19.7 µs
`drain`	688 ns	10.4 µs

Memory Index

Operation	10 entries	100 entries
`parse_line`	73 ns	—
`to_line`	91 ns	—
`search`	612 ns	6.0 µs
`upsert_new`	549 ns	4.7 µs
`upsert_existing`	508 ns	4.5 µs

Eval Functions

Function	32d x 50	128d x 50	384d x 20
`anisotropy`	15.8 µs	73.0 µs	40.9 µs
`similarity_range`	15.8 µs	73.0 µs	40.9 µs
`mean_center`	1.3 µs	3.8 µs	5.6 µs
`discrimination_gap`	16.0 µs	—	—

Measured on Apple Silicon (M4 Pro) with cargo bench. Run just bench to reproduce. Raw results in docs/benchmarks/.

Testing

just test                  # cargo test --workspace
bash scripts/validate.sh   # full E2E validation (requires release build)

See CONTRIBUTING.md for what each test suite covers and per-crate test counts.

Agent Skill

This repo's conventions are available as portable agent skills in skills/, following the Agent Skills Specification.

Related standards: AGENTS.md · llms.txt

License

Apache-2.0
