The built-in benchmark (bench/bench.py) is honest but self-authored. The memory-tools space compares on LoCoMo / LongMemEval — and self-reported numbers there are famously contested, so an independent, reproducible harness would stand out. Task: a bench/locomo.py that downloads a public subset, maps it to remember/recall calls, and publishes the score with the exact commit + command. Zero-dependency constraint applies to mind.py only — the harness may use whatever it needs.
The built-in benchmark (
bench/bench.py) is honest but self-authored. The memory-tools space compares on LoCoMo / LongMemEval — and self-reported numbers there are famously contested, so an independent, reproducible harness would stand out. Task: abench/locomo.pythat downloads a public subset, maps it to remember/recall calls, and publishes the score with the exact commit + command. Zero-dependency constraint applies tomind.pyonly — the harness may use whatever it needs.