sdebench: does memory help a coding agent? (boltons-hosted benchmark) by nicoloboschi · Pull Request #23 · vectorize-io/agent-memory-benchmark

nicoloboschi · 2026-07-02T12:20:38Z

sdebench — does memory help a coding agent?

A benchmark measuring whether a coding agent benefits from a memory system that has ingested a
project's git history and past developer conversations. Each task is a bug-fix whose correct solution
hinges on a non-guessable, project-specific decision: the obvious fix passes the visible repro but
fails a held-out hidden test. Where that decision lives (git history vs. a past chat) is the
independent variable.

What's here (`sdebench/`)

- 10 tasks hosted in the real boltons library (pinned at 979fa9b, ~1600 commits used as retrieval
  noise): 5 on untested edges of real functions, 5 as planted modules inside the real repo.
  Sources: 1 H (git history) / 9 F (past conversation).
- Docker grading with pristine test copies: pass_to_pass + repro (FAIL_TO_PASS) + a held-out
  HIDDEN_TO_PASS. Every task is verified: HEAD fails, the correct fix passes all, the naive fix
  passes the repro but fails the hidden test.
- Primary metric: interventions. On a failing grade the harness feeds back the failing-test
  output and resumes (cap 5). 0 = solved first try.
- memsys/: a local, file-based memory system (language-agnostic, no AST) that ingests git
  rationale + past chats, retrieves by TF-IDF + a code-symbol boost, and surfaces the decision
  (not the answer). Also Hindsight arms (hsreflect = reflect() as an agentic reranker;
  hscoding = the reflect+inject plugin) and an oracle upper bound.
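The "TF-IDF + a code-symbol boost" retrieval named above can be pictured with a minimal sketch. This is not the memsys implementation: the tokenizer, the `symbols` heuristic (identifiers with an underscore or an inner capital), the IDF smoothing and the `symbol_boost` weight are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

WORD_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(text):
    """Lowercased identifier-like tokens."""
    return [t.lower() for t in WORD_RE.findall(text)]


def symbols(text):
    """Assumed heuristic: tokens with an underscore or an inner capital letter."""
    found = set()
    for tok in WORD_RE.findall(text):
        has_underscore = "_" in tok.strip("_")
        has_inner_capital = any(c.isupper() for c in tok[1:]) and not tok.isupper()
        if has_underscore or has_inner_capital:
            found.add(tok)
    return found


def rank(query, docs, symbol_boost=2.0):
    """Return doc indices, best first: TF-IDF overlap plus a boost per shared symbol."""
    doc_tokens = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(tok for counts in doc_tokens for tok in counts)
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0 for tok in doc_freq}
    query_tokens = set(tokenize(query))
    query_symbols = symbols(query)
    scored = []
    for i, counts in enumerate(doc_tokens):
        total = sum(counts.values()) or 1
        score = sum(counts[t] / total * idf[t] for t in query_tokens if t in counts)
        score += symbol_boost * len(query_symbols & symbols(docs[i]))
        scored.append((score, i))
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

The boost is what lets a commit that mentions the exact identifier from the bug report outrank commits that merely share common words with it.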

Headline result

Across the F suite, memory takes the plain agent's 56 interventions → 1 (n=5, all solved, no
legitimacy problems), cutting turns ~46% on a real codebase with ~1500 real commits as noise.

The `sdebench-polish` commits on top

- harden the omdset invariant commit + add the oracle baseline; drop the experimental omdeq task
- language-agnostic memsys (regex symbols, no AST) + REF-pinned upstream history as realistic noise
  (removes the "ingest only the task's own commits" shortcut)
- Hindsight hsreflect/hscoding arms + AMB UI export
- sync DATASET.md + regenerate MANIFEST.json to the current 10-task set

Datasheet: sdebench/DATASET.md. Reproduce steps at the bottom of it. validate_dataset.py passes.

Draft: the whole benchmark lives on this branch (never previously pushed); opening as a draft for
review of scope/structure before targeting a merge.

Dataset is now a submodule

The 10-task dataset (+ MANIFEST, datasheet, validator) lives in its own repo,
vectorize-io/sde-bench, and is mounted here as a git
submodule at sdebench/datasets — so the data versions independently of the runner. Harness paths are
unchanged (sdebench/datasets/boltons-*/tasks/main/task.json). Clone with --recurse-submodules (or
git submodule update --init). The legacy synthetic scratch datasets were dropped from the tree.

- synthetic repo built with engineered git history (build.py); a perf commit bundles an int() refill floor (regression) with a legit available() accessor
- FAIL_TO_PASS regression repro + HIDDEN_TO_PASS held-out variants + PASS_TO_PASS suite
- validated: suite green at HEAD, regression+hidden red, surgical fix -> all green, git revert of the regression commit conflicts (forces surgical fix)
- README documents the benchmark design + grading + history A/B

- Dockerfile: deterministic python+pytest+git grading env
- harness/run.py: build repo (full|squashed history) -> ship bug report + failing repro -> run opencode (gemini-3.5-flash) -> capture SOURCE diff (tests excluded) -> grade in Docker vs FAIL_TO_PASS+PASS_TO_PASS+HIDDEN from pristine copies
- reports resolution/cost(tokens)/speed(wall,turns)
- smoke test (full history): agent solved it (591B fix, 10 passed, 20 turns, 176s)

…y pass/fail
- if grading fails, surface the NEW problem (failing tests' assertion output, NOT the fix) back to the agent and RESUME its session (opencode -c) to continue its work
- metric = number of human-like interventions needed (capped at 5 = drift guard)
- cost/turns/wall accumulate across all rounds; solved = passed within the cap
- realistic: feedback is what CI/QA would report; agent must still generalize the fix
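The grade, feed back, resume loop described in this commit can be sketched as follows. This is a minimal illustration, not the harness code: `run_with_feedback`, `Outcome`, and the `run_agent` / `grade` callables are hypothetical names, and the exact cap semantics (stop once 5 interventions have been spent and the grade still fails) are an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

MAX_INTERVENTIONS = 5  # cap stated in the PR; acts as a drift guard


@dataclass
class Outcome:
    solved: bool
    interventions: int  # 0 means solved on the first try


def run_with_feedback(
    run_agent: Callable[[Optional[str]], str],
    grade: Callable[[str], Tuple[bool, str]],
    cap: int = MAX_INTERVENTIONS,
) -> Outcome:
    """Grade each patch; on failure, resume the agent with the failing-test output."""
    feedback: Optional[str] = None
    interventions = 0
    while True:
        patch = run_agent(feedback)  # feedback is None on the first round
        passed, failing_output = grade(patch)
        if passed:
            return Outcome(solved=True, interventions=interventions)
        if interventions >= cap:
            return Outcome(solved=False, interventions=interventions)
        # Only the failing-test output goes back to the agent, never the fix.
        feedback = failing_output
        interventions += 1
```

Under these assumptions an agent that needs two rounds of feedback scores 2, and an agent that never converges is stopped after its fifth intervention and counted as unsolved.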

… regression
- a refactor changed DEFAULT_TTL 300->600 and dropped the rationale comment; 300 now lives ONLY in git history (blame/log the constant). bundled with a legit clear().
- repro proves entries expire too late but CANNOT reveal the value; hidden tests pin it to EXACTLY 300. validated: TTL=450 passes repro but fails hidden (underdetermined).
- with history: read 300 off blame -> 1-shot. without: binary-search via feedback rounds. this is the task designed to make the intervention-count A/B diverge.

…c discriminates
- 300 is a conventional TTL the agent guesses without history (A/B showed 0 interventions both, though full used ~half the tokens). 287 (a 'measured p99' value) can't be guessed: validated TTL=300 now passes repro but FAILS hidden. only history states 287.
- expectation: full reads 287 off git blame (0 interv); squashed must discover it via feedback rounds (>0 interv) -> intervention metric diverges.
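To make the "passes repro but fails hidden" property concrete, here is a small sketch of an underdetermined repro next to a pinning hidden test. Everything in it is hypothetical: the `TTLCache` stand-in, the probe time of 400s in the repro, and the strict `<` expiry boundary are assumptions, not the task's actual code or tests. Only the values 600, 300 and 287 come from the commit messages.

```python
class TTLCache:
    """Hypothetical stand-in for the task's cache; only the TTL check matters here."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._stored_at = {}

    def set(self, key, now):
        self._stored_at[key] = now

    def is_live(self, key, now):
        # Assumed boundary: an entry expires once its age reaches the TTL.
        return key in self._stored_at and now - self._stored_at[key] < self.ttl


def repro_passes(ttl):
    """Visible repro: an entry must be gone by t=400 (shows 'expires too late')."""
    cache = TTLCache(ttl)
    cache.set("k", now=0)
    return not cache.is_live("k", now=400)


def hidden_passes(ttl):
    """Held-out test: pins the TTL to exactly 287 by probing both sides of it."""
    cache = TTLCache(ttl)
    cache.set("k", now=0)
    return cache.is_live("k", now=286) and not cache.is_live("k", now=287)
```

With these definitions the regressed value 600 fails the repro, the guessable 300 passes the repro but fails the hidden test, and only 287 passes both.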

…e full trace
- run_agent returns a token SPLIT {input, output, reasoning, cache_read, cache_write} (verified cache_read populated, e.g. 101k cached of 186k input); $ computed later per model
- captures the structured trajectory (tool steps + assistant text) per round
- main accumulates the split across all feedback rounds; writes result.json (metrics) + trace.json (full multi-round conversation: bug report -> agent -> feedback -> agent ...)

- ui_export.py maps trace.json -> outputs/sdebench/<run>/agent/all.json (view='agent'); each task-run is a QueryResult whose trajectory is the FULL multi-round conversation (bug report -> agent -> feedback -> agent), with the token split + interventions in meta
- ported the purpose-built agent-trace view (RunDetail.vue + style.css) from feat/swebench-cl and rebuilt ui/dist
- harness now stores final_patch in trace.json (the UI 'answer')
- verified: full & squashed runs render with bug report, clickable tool steps, patch diff

…kens in UI
- compute_cost() per-class: $1.50/1M input, $0.15 cached, $9.00 output(+reasoning); cost_usd now live in result.json + ui_export (full ~$0.31, squashed ~$0.59; the extra intervention ~doubles $)
- per model-step in/out tokens stamped onto each trajectory step (tok_in/tok_out)
- UI: run-level stat boxes (Total cost / Interventions / Tokens in-out), per-task pills (interventions, $cost, in/out tokens, turns, wall), per-tool-step in→out badge

…adding, compact tokens
- each tool step is now one truncated line: [tool] [arg ↳ output-preview …] [in→out] [▸] (was wrapping one char per line via word-break:break-all on a squished flex column)
- combined arg + out-preview into one ellipsised .traj-mid (min-width:0 so it truncates)
- tighter L/R padding (1rem -> 0.55rem); tokens compact (9.4k→338, tabular, right-aligned)
- feedback/say steps wrap normally (pre-wrap), expand still shows full input/output

…put)
- verified the mapping is correct (not inverted): tok_in sums to input+cached (285k), tok_out to output+reasoning (4.7k); input context is just naturally >> output
- show ↑ for input/prompt and ↓ for output/generated so it's unambiguous (was a bare →)

…ount) empirically confirmed via raw step tokens: total = input + cache_read + output + reasoning, so 'input' is the NON-cached prompt and 'cache_read' the cached prompt (separate, not nested). => tok_in = input + cache_read and cost = input*$1.50 + cache_read*$0.15 + out*$9.00 are correct.
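The accounting confirmed above (non-cached `input` and `cache_read` are separate buckets, and output is billed together with reasoning) can be written out as a short sketch. The rates are the per-1M-token prices quoted in the commits. The function bodies are an illustration rather than the harness's `compute_cost()`, and ignoring `cache_write` is an assumption, since no price for it is given.

```python
# USD per 1M tokens, as quoted in the commit messages.
PRICE_PER_MILLION = {"input": 1.50, "cache_read": 0.15, "output": 9.00}


def tok_in(split):
    """Total prompt tokens processed: non-cached input plus cached prefix."""
    return split["input"] + split["cache_read"]


def tok_out(split):
    """Generated tokens: visible output plus reasoning."""
    return split["output"] + split["reasoning"]


def compute_cost(split):
    """Cost in USD for one token split (cache_write is not priced in this sketch)."""
    dollars_times_million = (
        split["input"] * PRICE_PER_MILLION["input"]
        + split["cache_read"] * PRICE_PER_MILLION["cache_read"]
        + tok_out(split) * PRICE_PER_MILLION["output"]
    )
    return dollars_times_million / 1_000_000
```

Because cached prompt tokens are billed at a tenth of the non-cached rate, two runs with the same `tok_in` can differ noticeably in cost depending on the cache hit rate.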

- harness records each round's submitted patch + grade outcome (passed + pytest) on the trace round, incl. the ones that FAILED eval and triggered the feedback
- ui_export inserts a 'patch' step after each round's trajectory; UI renders it as a clickable ✓/✗ row (round + pytest summary) that expands to that round's diff
- so the trajectory now shows: agent work -> ✗ submitted (2 failed) -> 🔁 feedback -> agent work -> ✓ submitted (9 passed)

… opencode plugin
- new --history hindsight mode: builds the SQUASHED repo (agent can't git-blame), ingests the full git history (each commit's message+diff) into a Hindsight bank, and runs with the opencode Hindsight plugin active (recall mode) pointed at that bank
- tests whether memory of history recovers the value of history (the buried 287 constant) when raw git access is gone, vs full (git) and squashed (nothing)
- local Hindsight server (main-ish, gemini-3.1-flash-lite) on :8888; ingest verified to surface '287 measured p99' on recall

…ng MODE)
- a refactor switched round_cents from banker's rounding (ROUND_HALF_EVEN) to half-up and dropped the 'to match the ledger' rationale; the rule now lives ONLY in git history
- different mechanism than ttlcache (algorithmic choice, not a magic number): tests generalization. validated: repro red at HEAD, banker's fix -> all green, a half-DOWN fix passes the repro but FAILS hidden (underdetermined), banker's only in history
- bundled with a legit format_cents() so revert fails PASS_TO_PASS

…r non-guessable
- removed task #1 ratelimiter (int() floor too obvious: full==squashed, 0 interv both)
- ledger first A/B failed to discriminate: banker's rounding is the GUESSABLE convention for money, so squashed solved it without history. Changed the real rule to round-half-DOWN ('match legacy billing') which agents DON'T default to. Validated: a banker's fix passes the repro but fails hidden -> the natural guess is wrong -> only history reveals half-down.
- README documents the non-guessability design rule
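The reason a banker's-rounding fix can pass a repro yet fail a hidden test is that half-even and half-down agree on some ties and disagree on others. The sketch below shows this with the standard `decimal` module. The `round_cents` signature and the two sample amounts are assumptions chosen for illustration, not the task's real function or test inputs.

```python
from decimal import Decimal, ROUND_HALF_DOWN, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_cents(amount, mode):
    """Round a decimal string to two places with the given rounding mode."""
    return Decimal(amount).quantize(Decimal("0.01"), rounding=mode)


# A tie whose kept digit is even: half-even and half-down agree, half-up differs.
repro_amount = "0.125"
# A tie whose kept digit is odd: only half-down rounds toward zero.
hidden_amount = "0.135"

for mode in (ROUND_HALF_UP, ROUND_HALF_EVEN, ROUND_HALF_DOWN):
    print(mode, round_cents(repro_amount, mode), round_cents(hidden_amount, mode))
```

A repro built on the first kind of tie is red under half-up and green under both candidate fixes, while a hidden test on the second kind of tie accepts only half-down.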

… in the UI
- capture_git_history(): the task repo's engineered commits (sha/subject/body/diff), newest first
- ui_export attaches it to every QueryResult (backfills existing runs by rebuilding the repo); also splits exports by task (ttlcache.json / ledger.json) so tasks don't overwrite each other
- UI: a 'Repository history' panel listing the commits, each expandable to its full diff, so you can see the source documents the full/hindsight arms had and squashed didn't

… multi-task layout
- billing: 4 longer modules (money/discount/tax/invoice) + an 18-commit history full of noise (docs, changelog, vague refactors) where TWO regressions and their guarantees are buried: 'tidy money module' (rounding half-down->half-up) and 'refactor invoice pipeline' (tax base discounted->pre-discount). 'find the relevant history' is now actually exercised.
- new dataset layout: datasets/<codebase>/{build.py, tasks/<task>/{task.json,*_test.py}}; many tasks share ONE codebase + git history (easy to iterate). task.json gains 'codebase'.
- harness resolves build from codebase, test files from the task dir; stores codebase.
- ui_export splits per task_id (tasks on a codebase share the git history view).
- two tasks validated: billing-rounding-001 (critical, non-guessable half-down) and billing-taxbase-001 (navigate noisy history). both: regression+hidden red -> fix -> 10 green.

…e input
- ↑ is the CUMULATIVE prompt (system prompt + tool defs + all prior steps, mostly cached); added +Δ = the NEW context that step introduced (cumulative minus previous step), so a big read shows up as the jump on the FOLLOWING step
- first step's ↑ (~9k) is the baseline: opencode system prompt + tool schemas + bug report, re-sent every call; +Δ on step 1 equals that baseline
- clearer tooltip distinguishing cumulative vs delta vs generated
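The +Δ badge described above is a first difference over the per-step cumulative prompt sizes. A minimal sketch, assuming the per-step ↑ values are available as a plain list (the function name and the sample numbers are illustrative, not taken from the UI code):

```python
def prompt_deltas(cumulative_prompt_tokens):
    """New context introduced at each step: cumulative minus the previous step.

    The first delta equals the baseline (system prompt + tool schemas + bug report).
    """
    deltas = []
    previous = 0
    for current in cumulative_prompt_tokens:
        deltas.append(current - previous)
        previous = current
    return deltas


# Illustrative values: a ~9k baseline, a small step, then the step after a big file read.
print(prompt_deltas([9000, 9400, 21400]))  # [9000, 400, 12000]
```

The large delta lands on the third step because the file contents only enter the prompt on the call that follows the read.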

- group tool calls into TURNS: opencode issues several tools from ONE prompt, so tokens are per-turn. Each turn shows ONE ↑cumulative +Δnew ↓generated badge instead of repeating it on every batched tool row (the +0 rows that looked free / the row that looked like it 'cost' the whole previous batch). Feedback markers and submitted patches stay as standalone blocks.
- reasoning is token-only: gemini-3.5-flash emits no thinking TEXT (event types are just tool_use/step_start/step_finish/text), only a reasoning token COUNT in step_finish. Harness now stamps per-turn reasoning tokens; UI shows '🧠 N' on the turn (new runs).

…ed box)
the ↑ prompt is re-sent every turn (sum of per-turn ↑ = total input processed; e.g. 581k over 28 turns even though the last prompt is only 32k). most of it is CACHED (re-sent prefix -> ~60% cache hit) and billed ~10x cheaper, but that was invisible:
- per-turn badge now shows ⚡<cached> alongside ↑<total prompt>
- run sidebar adds an '⚡ Cached input' stat (k + % of ↑)
- per-task pill shows ⚡<cached> within the in-tokens

harness stamps per-turn tok_cache (new runs get the per-turn ⚡; run-level works on all runs)

- mem_index.py builds /tmp/sdebench/memindex/<codebase>.json (raw commits: subject/body/files/diff)
- --history memtool: squashed repo (no git trail) + recall_intent tool over the full history's index, so the TOOL replaces git (the fair 'beat full git' comparison)
- load_env/run_agent gain mem_index (sets MEM_INDEX, enables the recall_intent plugin mode)
- smoke (rounding): memtool 16 turns $0.30 48s vs full 25 turns $0.41 80s

…a engine)
- 9 modules (tokens/nodes/parser/refs/sheet/functions/evaluator/errors/engine), longer files, ~22-commit noisy history. Keeps billing as the 'easy' codebase.
- HARD regression FAR FROM SYMPTOM + history-dependent + underdetermined: a 'centralize argument evaluation' refactor made the evaluator short-circuit on any error arg, so COUNT/AVG/MIN/MAX over a range with an error cell return the error instead of aggregating the numbers (SUM is unaffected since it propagates anyway -> slips past existing tests). The bug is in evaluator.py, not in the COUNT code the symptom points at. Policy (functions decide; SUM propagates, aggregates skip) lives in functions.py + history; the precise 'don't short-circuit calls' invariant is only in history. Validated: HEAD 8 green / regression+hidden 3 fail; COUNT-only fix passes repro but fails AVG/MIN/MAX hidden; correct general fix -> 13 green.
- also: base PROMPT now asks the agent to work efficiently (applied to ALL arms = fair)
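The minicalc regression is easier to see in miniature. The sketch below is a hypothetical reduction, not the benchmark's evaluator.py or functions.py: `CellError`, `FUNCTIONS`, `call_short_circuit` and `call_delegating` are invented names that only mirror the policy stated above (SUM propagates errors, the other aggregates skip them).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellError:
    code: str


def _numbers(args):
    return [a for a in args if not isinstance(a, CellError)]


def _first_error(args):
    return next((a for a in args if isinstance(a, CellError)), None)


def fn_sum(args):
    # SUM propagates: any error argument becomes the result.
    err = _first_error(args)
    return err if err is not None else sum(args)


def fn_avg(args):
    nums = _numbers(args)
    return sum(nums) / len(nums) if nums else CellError("#DIV/0!")


# The other aggregates skip error cells and work on the remaining numbers.
FUNCTIONS = {
    "SUM": fn_sum,
    "COUNT": lambda args: len(_numbers(args)),
    "AVG": fn_avg,
    "MIN": lambda args: min(_numbers(args)),
    "MAX": lambda args: max(_numbers(args)),
}


def call_short_circuit(name, args):
    """Regressed behaviour: the evaluator returns the first error before dispatching."""
    err = _first_error(args)
    if err is not None:
        return err
    return FUNCTIONS[name](args)


def call_delegating(name, args):
    """Intended behaviour: always dispatch; each function applies its own error policy."""
    return FUNCTIONS[name](args)
```

SUM returns the same error on both paths, which is why a suite that only checks SUM over error cells stays green, and why patching COUNT alone leaves AVG/MIN/MAX broken.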

…per bound) Research ablation (not a deployable method): injects the exact regression commit's diff into the prompt. The ceiling of 'perfect pushed memory' — if oracle doesn't beat full, behavior (not knowledge/retrieval) is the wall. Added cause_subject to all 5 tasks.

…ers stack push (inject) beats git on 4/5 tasks where pull (recall_intent tool) doesn't; behavioral constraint (minimal) helps exploration-bound but not knowledge-bound tasks; the two levers fix different bottlenecks and STACK (inject+minimal -30% to -41% vs git, never hurts).

…al ablations Offline result: neither top-k nor a bug+repro 'rich query' surfaces minicalc's symptom-distant cause commit — simple symptom-based retrieval can't find it (the bug is a wrong return value, no traceback; the cause shares no terms with the symptom). Bounds what push retrieval can do.

…), not the 7 decision-types The UI's category pivot used dataset.categories() = the 7 decision-types, so the category axis looked over-split. Make the PRIMARY AMB category the axis (history vs conversation — 2 values, the benchmark's core 'where the decision lives' split); load_queries filters by source. Tier + decision-type stay as secondary breakdown axes via get_result_categories.

…t, both agents)
Prune superseded results: old 10-task opencode n=3 (ov-*, repro2-*), ancient agent-mode single-task runs (opencode+...), and the pre-bugfix claude pair. Rename the fixed claude runs to canonical names. Final 4 runs, all 19 tasks:
- nz-oc-none / nz-oc-hs (opencode/gemini, vanilla / hindsight)
- nz-cc-none / nz-cc-hs (claude/sonnet-5, vanilla / hindsight; both bug fixes)

Coding datasets don't have QA metrics (accuracy/recall/context-tokens) — every task is solved, so the signal is interventions/cost/turns/tokens. Server now aggregates the per-result agent metrics for coding-mode runs (interventions, cost, turns, tokens in/out, wall, solved) into the results list. DatasetDetail renders a coding table (Run | Agent | Tasks | Solved | Interventions | Cost | Turns | Tokens in/out | Wall) when the dataset is coding, and hides the QA charts/table. Sorted by interventions.

…t/tokens/turns/wall) Grouped bar charts, vanilla vs memory per agent, from the committed results. Arms are matched by run-name globs so an n=3 rerun (nz-oc-none-1/-2/-3) auto-averages with error bars. Tokens split into input/cached/output (log scale). Run: uv run --with matplotlib python scripts/sdebench_charts.py --out ~/Documents/charts
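Matching arms by run-name glob, so that reruns average automatically, might look like the sketch below. It is not the code in scripts/sdebench_charts.py: the `summarize_arm` helper, the use of sample standard deviation for the error bar, and the numbers in the usage example are assumptions.

```python
from fnmatch import fnmatch
from statistics import mean, stdev


def summarize_arm(metric_by_run, pattern):
    """Average one metric over every run whose name matches the glob pattern.

    Returns (mean, spread); spread is 0.0 when only a single run matches.
    """
    values = [v for name, v in sorted(metric_by_run.items()) if fnmatch(name, pattern)]
    if not values:
        raise ValueError(f"no runs match {pattern!r}")
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread


# Illustrative interventions totals, keyed by run name.
runs = {"nz-oc-none-1": 10, "nz-oc-none-2": 12, "nz-oc-none-3": 14, "nz-oc-hs-1": 4}
print(summarize_arm(runs, "nz-oc-none*"))  # (12, 2.0)
```

A pattern such as `nz-oc-none*` picks up both a single `nz-oc-none` run and its numbered reruns, so adding repeats needs no change to the chart script.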

- Replace n=1 result files with 12 n=3 runs (nz-{oc,cc}-{none,hs}-{1,2,3}).
- Backfill excluded from wall via pre-backfill + bank reuse (production-accurate: ingest once, reflect per task).
- Effect stronger & tighter: OpenCode interv -58%, Claude -75%; low variance.
- Wall flipped now that backfill is out of wall: memory faster/neutral for both (OpenCode -6%, Claude -25%).
- Solve rate 100% in every arm.
- OVERNIGHT_FINDINGS.md: n=3 section added.

…de reflect into wall_s Reflect-latency spikes from a single local Hindsight under concurrent benchmark load (agentic reflect, 6.5 LLM calls each; per-reflect wall p50 12s / p99 75s / max 99s) inflated OpenCode wall on a few tasks (under2camel, slugify), muting the turn win. Normalize retrieval to a flat 10s/task (managed-service p50) for both arms: Claude 1161->1063s (-8%), OpenCode 2532->2308s (-9%). run.py now times claude's hs_reflect into wall_s for parity with the opencode plugin. Only wall uses this normalization; interventions/cost/tokens/turns are as-measured.

…illa per-turn rate + 10s lookup) Per-turn slowdown on hindsight was load-noise (mixed-sign s/turn; correlated with the high-reflect-spike tasks), not a real effect. Model wall from the clean signal (turns): vanilla as-measured, memory arm = turns x vanilla_s_per_turn + 10s/task. OpenCode 2532->2180 (-14%), Claude 1161->1070 (-8%). Only wall is modeled; all other metrics as-measured.
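The wall-time model stated above (memory arm = turns x vanilla_s_per_turn + 10s/task) is a one-line formula. The sketch below applies it at the run level, which is an assumption: the commit does not say whether the model is evaluated per task or over run totals. The function name and the sample inputs are illustrative.

```python
def modeled_memory_wall(vanilla_wall_s, vanilla_turns, memory_turns, n_tasks, lookup_s=10.0):
    """Modeled wall time for the memory arm.

    Uses the vanilla arm's measured seconds-per-turn as the clean per-turn rate
    and adds a flat lookup cost per task for retrieval.
    """
    vanilla_s_per_turn = vanilla_wall_s / vanilla_turns
    return memory_turns * vanilla_s_per_turn + lookup_s * n_tasks


# Illustrative inputs: 2000s over 400 vanilla turns, 300 memory-arm turns, 19 tasks.
print(modeled_memory_wall(2000.0, 400, 300, 19))  # 1690.0
```

Under this model any wall-time win comes entirely from the turn reduction, minus the fixed lookup overhead.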

…ackfill
- datasets submodule -> hardening-2026-07 (33 tasks: 12 hard-tier + 2 conversation-amended on top of the published 19; all validated)
- coding.py: backfill accepts multi-chat conversations (list of lists), one chat doc per session (needed for conversation-amended); the vanilla seeding path already supported this
- JOURNEY.md: hardening log (contamination fix, hard-tier design + validation, sweep restarts)

…hart title run.py reads the plugin's reflect diagnostic from the container after the run, stores it as result.memory_diag, and prints a loud warning when a memory-arm run had no injected memory — the failure mode that silently invalidated a full sweep (see JOURNEY 2026-07-04). Charts: interventions panel retitled 'Corrections needed per task'.

…d) + charts from dz outputs Four arms x 3 runs on the hardened suite with the decontaminated plugin and v2 banks: opencode corrections/task 0.97->0.76 (-22%), claude 0.89->0.37 (-58%); solve 99/99 everywhere except one capped task in one claude memory run (98/99, reported as-is). Chart script reads dz-* runs with a reasoning-string fallback; wall+tokens panels dropped (not uniformly backed after the disk incident). JOURNEY completed.

nicoloboschi added 30 commits June 25, 2026 17:23

docs(sdebench): document codebases (small/billing/minicalc) and tasks

81ffcf4

feat(sdebench): inject arm — PUSH memory (relevant commit diffs in th…

a154fd4

…e prompt) Exp1: tests push (auto-inject top-2 TF-ranked commits into the bug report) vs pull (recall_intent tool the agent must invoke). Same retrieval, different delivery. squashed repo + injected context.

feat(sdebench): behavioral prompt variants (base/hypothesis/minimal) …

0125091

…via SDE_VARIANT Exp2: test whether constraining exploration (uniform, fair prompt) cuts cost independent of memory. variant stored in result.json.

docs(sdebench): retrieval ceiling — symptom-distant causes unretrieva…

b18bccc

…ble by symptom query

feat(sdebench): hybrid arm — push policy context + pull recall_intent…

1a49398

… tool Tests whether keeping the pull tool available on top of pushed symptom-context lets the agent find symptom-distant causes (the inject->oracle gap) by querying after it understands the code.

docs(sdebench): correct morning summary — claude memory 12->3 after b…

15a4726

…ug fixes (was bug artifact)

chore(sdebench): bump sde-bench submodule (source H/F -> history/conv…

2c1fe1d

…ersation rename)

nicoloboschi added 21 commits July 3, 2026 09:21

fix(ui): correct claude runs' memory_provider (none/hscoding) so vani…

95f57a3

…lla/memory badge is right; robust codingArm fallback

chore(scripts): charts use Hindsight brand color/label + model names …

f3b3a29

…on axes

docs(sdebench): measure Hindsight's own cost — ~$1.14 one-time ingest…

20e3b88

…ion + ~1c/reflect (gemini-3.1-flash-lite)

docs(sdebench): mark Hindsight cost figures internal-only (pricing no…

42516ab

…t final, margin applies)

fix(omb): copy agent/model into coding QueryResult.meta (so claude ru…

14ff3a6

…ns label correctly in UI)

sdebench charts: report per-task averages instead of run totals

4286449

More interpretable ($0.60/bug, 0.5 correction rounds/bug) and independent of the 19-task count. All 5 charts + titles/axes/notes switched to per-task; % deltas unchanged. Doc table + captions updated to match.

sdebench charts: rename per-task outputs to *-pertask.png (cache-bust)

acba08b

chore(sdebench): datasets -> 9cc3d38 (enrichment-preserving emitters)…

080d829

…; journal

fix(sdebench): delete per-task repo/grade copies after grading

8c47d89

Sweeps accumulated two full host-repo clones per task run and filled the disk to 100%, killing the docker daemon mid-sweep. Keep result.json and trace.json, drop the repo copies once the result is written.
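The cleanup described in this commit amounts to writing the two result files first and only then deleting the scratch copies. A minimal sketch under stated assumptions: `finalize_run` and the `repo` / `grade` directory names are hypothetical, chosen to echo the commit title rather than taken from run.py.

```python
import json
import shutil
from pathlib import Path


def finalize_run(run_dir, result, trace, scratch_dirs=("repo", "grade")):
    """Persist result.json and trace.json, then drop the per-task repo copies."""
    run_dir = Path(run_dir)
    (run_dir / "result.json").write_text(json.dumps(result))
    (run_dir / "trace.json").write_text(json.dumps(trace))
    # Delete only after the results are on disk, so a crash never loses both.
    for name in scratch_dirs:
        shutil.rmtree(run_dir / name, ignore_errors=True)
```

Ordering the writes before the deletes keeps a failed sweep debuggable: either the results exist, or the repo copies are still there to inspect.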

data(sdebench): force-add dz campaign result files (12 runs)

cdab5c8

docs(sdebench): capped-task post-mortem in JOURNEY

c605a32

data(sdebench): claude n=5 runs (none 4-5, hs 4-5); journal closed

72460de

cc corrections/task at n=5: vanilla 0.91 vs memory 0.38 (-58%, unchanged from n=3); solve 165/165 vs 163/165 (second capped run disclosed).

sdebench: does memory help a coding agent? (boltons-hosted benchmark) #23
nicoloboschi wants to merge 142 commits into main from sdebench-polish
