sdebench: does memory help a coding agent? (boltons-hosted benchmark) #23

Draft
nicoloboschi wants to merge 142 commits into main from sdebench-polish

@nicoloboschi nicoloboschi commented Jul 2, 2026

sdebench — does memory help a coding agent?

A benchmark measuring whether a coding agent benefits from a memory system that has ingested a
project's git history and past developer conversations. Each task is a bug-fix whose correct solution
hinges on a non-guessable, project-specific decision: the obvious fix passes the visible repro but
fails a held-out hidden test. Where that decision lives (git history vs. a past chat) is the
independent variable.

What's here (sdebench/)

  • 10 tasks hosted in the real boltons library (pinned at
    979fa9b, ~1600 commits used as retrieval noise) — 5 on untested edges of real functions, 5 as
    planted modules inside the real repo. Sources: 1 H (git history) / 9 F (past conversation).
  • Docker grading with pristine test copies: pass_to_pass + repro (FAIL_TO_PASS) + a held-out
    HIDDEN_TO_PASS. Every task is verified: HEAD fails, the correct fix passes all, the naive fix
    passes the repro but fails the hidden test.
  • Primary metric: interventions — on a failing grade the harness feeds back the failing-test
    output and resumes (cap 5). 0 = solved first try.
  • memsys/ — a local, file-based memory system (language-agnostic, no AST): ingests git
    rationale + past chats, retrieves by TF-IDF + a code-symbol boost, surfaces the decision (not the
    answer). Also Hindsight arms (hsreflect = reflect() as an agentic reranker; hscoding = the
    reflect+inject plugin) and an oracle upper bound.

Headline result

Across the F suite, memory takes the plain agent's 56 interventions → 1 (n=5, all solved, no
legitimacy problems), cutting turns ~46% on a real codebase with ~1500 real commits as noise.

The sdebench-polish commits on top

  • harden the omdset invariant commit + add the oracle baseline; drop the experimental omdeq task
  • language-agnostic memsys (regex symbols, no AST) + REF-pinned upstream history as realistic noise
    (removes the "ingest only the task's own commits" shortcut)
  • Hindsight hsreflect/hscoding arms + AMB UI export
  • sync DATASET.md + regenerate MANIFEST.json to the current 10-task set

Datasheet: sdebench/DATASET.md, with reproduction steps at the bottom. validate_dataset.py passes.

Draft: the whole benchmark lives on this branch (never previously pushed); opening as a draft for
review of scope/structure before targeting a merge.

Dataset is now a submodule

The 10-task dataset (+ MANIFEST, datasheet, validator) lives in its own repo,
vectorize-io/sde-bench, and is mounted here as a git
submodule at sdebench/datasets — so the data versions independently of the runner. Harness paths are
unchanged (sdebench/datasets/boltons-*/tasks/main/task.json). Clone with --recurse-submodules (or
git submodule update --init). The legacy synthetic scratch datasets were dropped from the tree.

- synthetic repo built with engineered git history (build.py); a perf commit
  bundles an int() refill floor (regression) with a legit available() accessor
- FAIL_TO_PASS regression repro + HIDDEN_TO_PASS held-out variants + PASS_TO_PASS suite
- validated: suite green at HEAD, regression+hidden red, surgical fix -> all green,
  git revert of the regression commit conflicts (forces surgical fix)
- README documents the benchmark design + grading + history A/B
- Dockerfile: deterministic python+pytest+git grading env
- harness/run.py: build repo (full|squashed history) -> ship bug report + failing repro
  -> run opencode (gemini-3.5-flash) -> capture SOURCE diff (tests excluded) -> grade in
  Docker vs FAIL_TO_PASS+PASS_TO_PASS+HIDDEN from pristine copies
- reports resolution/cost(tokens)/speed(wall,turns)
- smoke test (full history): agent solved it — 591B fix, 10 passed, 20 turns, 176s
…y pass/fail

- if grading fails, surface the NEW problem (failing tests' assertion output, NOT the
  fix) back to the agent and RESUME its session (opencode -c) to continue its work
- metric = number of human-like interventions needed (capped at 5 = drift guard)
- cost/turns/wall accumulate across all rounds; solved = passed within the cap
- realistic: feedback is what CI/QA would report; agent must still generalize the fix
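
A minimal sketch of the feedback loop described above, assuming the cap counts feedback rounds (so a run is graded at most six times). `run_agent`, `grade`, and the `Grade` type are hypothetical stand-ins for the harness's own functions, not its real API.

```python
from dataclasses import dataclass

MAX_INTERVENTIONS = 5  # cap from the commit message (drift guard)


@dataclass
class Grade:
    passed: bool
    failing_output: str = ""  # failing tests' assertion output, never the fix


def run_with_feedback(run_agent, grade, bug_report, cap=MAX_INTERVENTIONS):
    """Return (solved, interventions); 0 interventions means solved first try."""
    patch = run_agent(bug_report, resume=False)
    interventions = 0
    while True:
        result = grade(patch)
        if result.passed:
            return True, interventions
        if interventions >= cap:
            return False, interventions
        interventions += 1
        # Resume the same agent session with the new failure, like CI feedback.
        patch = run_agent(result.failing_output, resume=True)
```

The loop only ever forwards test output, so the agent still has to generalize the fix itself.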
… regression

- a refactor changed DEFAULT_TTL 300->600 and dropped the rationale comment; 300 now
  lives ONLY in git history (blame/log the constant). bundled with a legit clear().
- repro proves entries expire too late but CANNOT reveal the value; hidden tests pin it
  to EXACTLY 300. validated: TTL=450 passes repro but fails hidden (underdetermined).
- with history: read 300 off blame -> 1-shot. without: binary-search via feedback rounds.
  this is the task designed to make the intervention-count A/B diverge.
…c discriminates

- 300 is a conventional TTL the agent guesses without history (A/B showed 0 interventions
  both, though full used ~half the tokens). 287 (a 'measured p99' value) can't be guessed:
  validated TTL=300 now passes repro but FAILS hidden. only history states 287.
- expectation: full reads 287 off git blame (0 interv); squashed must discover it via
  feedback rounds (>0 interv) -> intervention metric diverges.
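
The underdetermination can be illustrated with a toy cache. This is a sketch, not the task's code: the `TTLCache` class, the injected `now` clock, and the repro probe time of 500 are assumptions; only the values 600, 450, 300, and 287 come from the commit messages.

```python
class TTLCache:
    """Toy cache with an explicit clock so the tests are deterministic."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._store = {}

    def put(self, key, value, now):
        self._store[key] = (value, now)

    def get(self, key, now):
        value, stored_at = self._store.get(key, (None, None))
        if stored_at is None or now - stored_at >= self.ttl:
            return None
        return value


def repro_passes(ttl):
    # Visible repro (hypothetical probe time): the entry must be gone by t=500.
    cache = TTLCache(ttl)
    cache.put("k", "v", now=0)
    return cache.get("k", now=500) is None


def hidden_passes(ttl):
    # Held-out test: pins the expiry boundary to exactly 287.
    cache = TTLCache(ttl)
    cache.put("k", "v", now=0)
    return cache.get("k", now=286) == "v" and cache.get("k", now=287) is None


outcomes = {ttl: (repro_passes(ttl), hidden_passes(ttl)) for ttl in (600, 450, 300, 287)}
```

Any TTL below the repro's probe time satisfies the visible test, so the repro alone cannot distinguish 450, 300, and 287; only the hidden test does.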
…e full trace

- run_agent returns a token SPLIT {input, output, reasoning, cache_read, cache_write}
  (verified cache_read populated, e.g. 101k cached of 186k input) — $ computed later per model
- captures the structured trajectory (tool steps + assistant text) per round
- main accumulates the split across all feedback rounds; writes result.json (metrics) +
  trace.json (full multi-round conversation: bug report -> agent -> feedback -> agent ...)
- ui_export.py maps trace.json -> outputs/sdebench/<run>/agent/all.json (view='agent');
  each task-run is a QueryResult whose trajectory is the FULL multi-round conversation
  (bug report -> agent -> feedback -> agent), with the token split + interventions in meta
- ported the purpose-built agent-trace view (RunDetail.vue + style.css) from feat/swebench-cl
  and rebuilt ui/dist
- harness now stores final_patch in trace.json (the UI 'answer')
- verified: full & squashed runs render with bug report, clickable tool steps, patch diff
…kens in UI

- compute_cost() per-class: $1.50/1M input, $0.15 cached, $9.00 output(+reasoning);
  cost_usd now live in result.json + ui_export (full ~$0.31, squashed ~$0.59 — the
  extra intervention ~doubles $)
- per model-step in/out tokens stamped onto each trajectory step (tok_in/tok_out)
- UI: run-level stat boxes (Total cost / Interventions / Tokens in-out), per-task pills
  (interventions, $cost, in/out tokens, turns, wall), per-tool-step in→out badge
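
The per-class pricing above can be sketched as follows. The rates are the ones quoted in the commit message (per 1M tokens); the function body and the decision to leave `cache_write` unpriced are assumptions, since no rate for it is given.

```python
# USD per 1M tokens, as quoted in the commit message.
PRICE_PER_M = {"input": 1.50, "cache_read": 0.15, "output": 9.00}


def compute_cost(tokens):
    """Cost in USD for a token split {input, output, reasoning, cache_read, ...}.

    'input' is treated as the non-cached prompt and 'cache_read' as the cached
    prompt (separate, not nested). Reasoning tokens are billed at the output rate.
    cache_write is ignored here (assumption: no rate is stated for it).
    """
    generated = tokens.get("output", 0) + tokens.get("reasoning", 0)
    total = (
        tokens.get("input", 0) * PRICE_PER_M["input"]
        + tokens.get("cache_read", 0) * PRICE_PER_M["cache_read"]
        + generated * PRICE_PER_M["output"]
    )
    return total / 1_000_000


example = compute_cost(
    {"input": 1_000_000, "cache_read": 1_000_000, "output": 500_000, "reasoning": 500_000}
)
```

With one million tokens in each of the three billed classes, the example evaluates to 1.50 + 0.15 + 9.00 = 10.65 USD.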
…adding, compact tokens

- each tool step is now one truncated line: [tool] [arg ↳ output-preview …] [in→out] [▸]
  (was wrapping one char per line via word-break:break-all on a squished flex column)
- combined arg + out-preview into one ellipsised .traj-mid (min-width:0 so it truncates)
- tighter L/R padding (1rem -> 0.55rem); tokens compact (9.4k→338, tabular, right-aligned)
- feedback/say steps wrap normally (pre-wrap), expand still shows full input/output
…put)

- verified the mapping is correct (not inverted): tok_in sums to input+cached (285k),
  tok_out to output+reasoning (4.7k) — input context is just naturally >> output
- show ↑ for input/prompt and ↓ for output/generated so it's unambiguous (was a bare →)
…ount)

empirically confirmed via raw step tokens: total = input + cache_read + output + reasoning,
so 'input' is the NON-cached prompt and 'cache_read' the cached prompt (separate, not nested).
=> tok_in = input + cache_read and cost = input*$1.50 + cache_read*$0.15 + out*$9.00 are correct.
- harness records each round's submitted patch + grade outcome (passed + pytest) on the
  trace round — incl. the ones that FAILED eval and triggered the feedback
- ui_export inserts a 'patch' step after each round's trajectory; UI renders it as a
  clickable ✓/✗ row (round + pytest summary) that expands to that round's diff
- so the trajectory now shows: agent work -> ✗ submitted (2 failed) -> 🔁 feedback ->
  agent work -> ✓ submitted (9 passed)
… opencode plugin

- new --history hindsight mode: builds the SQUASHED repo (agent can't git-blame), ingests
  the full git history (each commit's message+diff) into a Hindsight bank, and runs with
  the opencode Hindsight plugin active (recall mode) pointed at that bank
- tests whether memory of history recovers the value of history (the buried 287 constant)
  when raw git access is gone — vs full (git) and squashed (nothing)
- local Hindsight server (main-ish, gemini-3.1-flash-lite) on :8888; ingest verified to
  surface '287 measured p99' on recall
…ng MODE)

- a refactor switched round_cents from banker's rounding (ROUND_HALF_EVEN) to half-up and
  dropped the 'to match the ledger' rationale; the rule now lives ONLY in git history
- different mechanism than ttlcache (algorithmic choice, not a magic number) — tests
  generalization. validated: repro red at HEAD, banker's fix -> all green, a half-DOWN fix
  passes the repro but FAILS hidden (underdetermined), banker's only in history
- bundled with a legit format_cents() so revert fails PASS_TO_PASS
…r non-guessable

- removed task #1 ratelimiter (int() floor too obvious — full==squashed, 0 interv both)
- ledger first A/B failed to discriminate: banker's rounding is the GUESSABLE convention
  for money, so squashed solved it without history. Changed the real rule to round-half-DOWN
  ('match legacy billing') which agents DON'T default to. Validated: a banker's fix passes
  the repro but fails hidden -> the natural guess is wrong -> only history reveals half-down.
- README documents the non-guessability design rule
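
The three rounding modes in play can be compared with Python's `decimal` module. This is an illustrative sketch: the `round_cents` signature and the sample amounts are assumptions, not the task's actual code or test values.

```python
from decimal import Decimal, ROUND_HALF_DOWN, ROUND_HALF_EVEN, ROUND_HALF_UP

CENT = Decimal("0.01")


def round_cents(amount, mode):
    """Round a decimal string to whole cents using the given rounding mode."""
    return Decimal(amount).quantize(CENT, rounding=mode)


# A tie where banker's rounding and half-down agree: only half-up (the regression) differs.
repro_case = {
    "half_up": round_cents("0.125", ROUND_HALF_UP),
    "bankers": round_cents("0.125", ROUND_HALF_EVEN),
    "half_down": round_cents("0.125", ROUND_HALF_DOWN),
}

# A tie where they disagree: this is the kind of case a hidden test needs.
hidden_case = {
    "bankers": round_cents("0.135", ROUND_HALF_EVEN),
    "half_down": round_cents("0.135", ROUND_HALF_DOWN),
}
```

A repro built on the first kind of tie is satisfied by the natural banker's guess, while a hidden test on the second kind accepts only half-down.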
… in the UI

- capture_git_history(): the task repo's engineered commits (sha/subject/body/diff), newest first
- ui_export attaches it to every QueryResult (backfills existing runs by rebuilding the repo);
  also splits exports by task (ttlcache.json / ledger.json) so tasks don't overwrite each other
- UI: a 'Repository history' panel listing the commits, each expandable to its full diff —
  so you can see the source documents the full/hindsight arms had and squashed didn't
… multi-task layout

- billing: 4 longer modules (money/discount/tax/invoice) + an 18-commit history full of
  noise (docs, changelog, vague refactors) where TWO regressions and their guarantees are
  buried: 'tidy money module' (rounding half-down->half-up) and 'refactor invoice pipeline'
  (tax base discounted->pre-discount). 'find the relevant history' is now actually exercised.
- new dataset layout: datasets/<codebase>/{build.py, tasks/<task>/{task.json,*_test.py}} —
  many tasks share ONE codebase + git history (easy to iterate). task.json gains 'codebase'.
- harness resolves build from codebase, test files from the task dir; stores codebase.
- ui_export splits per task_id (tasks on a codebase share the git history view).
- two tasks validated: billing-rounding-001 (critical, non-guessable half-down) and
  billing-taxbase-001 (navigate noisy history). both: regression+hidden red -> fix -> 10 green.
…e input

- ↑ is the CUMULATIVE prompt (system prompt + tool defs + all prior steps, mostly cached);
  added +Δ = the NEW context that step introduced (cumulative minus previous step), so a
  big read shows up as the jump on the FOLLOWING step
- first step's ↑ (~9k) is the baseline: opencode system prompt + tool schemas + bug report,
  re-sent every call; +Δ on step 1 equals that baseline
- clearer tooltip distinguishing cumulative vs delta vs generated
- group tool calls into TURNS: opencode issues several tools from ONE prompt, so tokens are
  per-turn. Each turn shows ONE ↑cumulative +Δnew ↓generated badge instead of repeating it on
  every batched tool row (the +0 rows that looked free / the row that looked like it 'cost' the
  whole previous batch). Feedback markers and submitted patches stay as standalone blocks.
- reasoning is token-only: gemini-3.5-flash emits no thinking TEXT (event types are just
  tool_use/step_start/step_finish/text), only a reasoning token COUNT in step_finish. Harness
  now stamps per-turn reasoning tokens; UI shows '🧠 N' on the turn (new runs).
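
The cumulative-versus-delta arithmetic behind the badge is simple enough to sketch. The `annotate_turns` helper and the sample token counts below are hypothetical; the rule (delta = this turn's cumulative prompt minus the previous turn's) is the one stated above.

```python
def annotate_turns(cumulative_prompt_tokens):
    """Attach the new-context delta to each turn's cumulative prompt size."""
    rows = []
    previous = 0
    for turn, up in enumerate(cumulative_prompt_tokens, start=1):
        rows.append({"turn": turn, "up": up, "delta": up - previous})
        previous = up
    return rows


# Hypothetical run: ~9k baseline, a small step, then a big file read landing on turn 3.
rows = annotate_turns([9000, 9400, 21000])
total_input_processed = sum(row["up"] for row in rows)
```

The first turn's delta equals the baseline, and the sum of the per-turn cumulative values is the total input the model processed, which is why it can far exceed the size of the last prompt.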
…ed box)

the ↑ prompt is re-sent every turn (sum of per-turn ↑ = total input processed; e.g. 581k
over 28 turns even though the last prompt is only 32k). most of it is CACHED (re-sent prefix
-> ~60% cache hit) and billed ~10x cheaper, but that was invisible:
- per-turn badge now shows ⚡<cached> alongside ↑<total prompt>
- run sidebar adds an '⚡ Cached input' stat (k + % of ↑)
- per-task pill shows ⚡<cached> within the in-tokens
harness stamps per-turn tok_cache (new runs get the per-turn ⚡; run-level works on all runs)
- mem_index.py builds /tmp/sdebench/memindex/<codebase>.json (raw commits: subject/body/files/diff)
- --history memtool: squashed repo (no git trail) + recall_intent tool over the full history's
  index, so the TOOL replaces git — the fair 'beat full git' comparison
- load_env/run_agent gain mem_index (sets MEM_INDEX, enables the recall_intent plugin mode)
- smoke (rounding): memtool 16 turns $0.30 48s vs full 25 turns $0.41 80s
…a engine)

- 9 modules (tokens/nodes/parser/refs/sheet/functions/evaluator/errors/engine), longer files,
  ~22-commit noisy history. Keeps billing as the 'easy' codebase.
- HARD regression FAR FROM SYMPTOM + history-dependent + underdetermined: a 'centralize argument
  evaluation' refactor made the evaluator short-circuit on any error arg, so COUNT/AVG/MIN/MAX
  over a range with an error cell return the error instead of aggregating the numbers (SUM is
  unaffected since it propagates anyway -> slips past existing tests). The bug is in evaluator.py,
  not in the COUNT code the symptom points at. Policy (functions decide; SUM propagates, aggregates
  skip) lives in functions.py + history; the precise 'don't short-circuit calls' invariant is only
  in history. Validated: HEAD 8 green / regression+hidden 3 fail; COUNT-only fix passes repro but
  fails AVG/MIN/MAX hidden; correct general fix -> 13 green.
- also: base PROMPT now asks the agent to work efficiently (applied to ALL arms = fair)
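
The shape of the minicalc regression can be sketched in a few lines. Everything here is hypothetical (the `CalcError` class, the function table, `call_buggy`/`call_fixed`); it only illustrates why a central short-circuit breaks aggregates while leaving SUM looking healthy.

```python
class CalcError:
    """Stand-in for a spreadsheet error value such as #DIV/0!."""

    def __init__(self, code):
        self.code = code


def fn_sum(args):
    # Policy: SUM propagates the first error it sees.
    for arg in args:
        if isinstance(arg, CalcError):
            return arg
    return sum(args)


def fn_count(args):
    # Policy: aggregates skip error cells and count the numbers.
    return sum(1 for arg in args if not isinstance(arg, CalcError))


FUNCTIONS = {"SUM": fn_sum, "COUNT": fn_count}


def call_buggy(name, args):
    # Regression: the evaluator short-circuits before the function can decide.
    for arg in args:
        if isinstance(arg, CalcError):
            return arg
    return FUNCTIONS[name](args)


def call_fixed(name, args):
    # Invariant: do not short-circuit calls; each function applies its own policy.
    return FUNCTIONS[name](args)


err = CalcError("#DIV/0!")
cells = [1, err, 3]
```

SUM returns the error under both evaluators, so a suite that only exercises SUM stays green, while COUNT returns the error under the buggy evaluator and 2 under the fixed one.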
…e prompt)

Exp1: tests push (auto-inject top-2 TF-ranked commits into the bug report) vs pull (recall_intent
tool the agent must invoke). Same retrieval, different delivery. squashed repo + injected context.
…per bound)

Research ablation (not a deployable method): injects the exact regression commit's diff into the
prompt. The ceiling of 'perfect pushed memory' — if oracle doesn't beat full, behavior (not
knowledge/retrieval) is the wall. Added cause_subject to all 5 tasks.
…via SDE_VARIANT

Exp2: test whether constraining exploration (uniform, fair prompt) cuts cost independent of
memory. variant stored in result.json.
…ers stack

push (inject) beats git on 4/5 tasks where pull (recall_intent tool) doesn't; behavioral
constraint (minimal) helps exploration-bound but not knowledge-bound tasks; the two levers fix
different bottlenecks and STACK (inject+minimal -30% to -41% vs git, never hurts).
…al ablations

Offline result: neither top-k nor a bug+repro 'rich query' surfaces minicalc's symptom-distant
cause commit — simple symptom-based retrieval can't find it (the bug is a wrong return value,
no traceback; the cause shares no terms with the symptom). Bounds what push retrieval can do.
… tool

Tests whether keeping the pull tool available on top of pushed symptom-context lets the agent
find symptom-distant causes (the inject->oracle gap) by querying after it understands the code.
…), not the 7 decision-types

The UI's category pivot used dataset.categories() = the 7 decision-types, so the
category axis looked over-split. Make the PRIMARY AMB category the source axis
(history vs conversation — 2 values, the benchmark's core 'where the decision
lives' split); load_queries filters by source. Tier + decision-type stay as
secondary breakdown axes via get_result_categories.
…t, both agents)

Prune superseded results: old 10-task opencode n=3 (ov-*, repro2-*), ancient
agent-mode single-task runs (opencode+...), and the pre-bugfix claude pair.
Rename the fixed claude runs to canonical names. Final 4 runs, all 19 tasks:
  nz-oc-none / nz-oc-hs   (opencode/gemini, vanilla / hindsight)
  nz-cc-none / nz-cc-hs   (claude/sonnet-5, vanilla / hindsight; both bug fixes)
Coding datasets don't have QA metrics (accuracy/recall/context-tokens) — every
task is solved, so the signal is interventions/cost/turns/tokens. Server now
aggregates the per-result agent metrics for coding-mode runs (interventions,
cost, turns, tokens in/out, wall, solved) into the results list. DatasetDetail
renders a coding table (Run | Agent | Tasks | Solved | Interventions | Cost |
Turns | Tokens in/out | Wall) when the dataset is coding, and hides the QA
charts/table. Sorted by interventions.
…lla/memory badge is right; robust codingArm fallback
…t/tokens/turns/wall)

Grouped bar charts, vanilla vs memory per agent, from the committed results. Arms
are matched by run-name globs so an n=3 rerun (nz-oc-none-1/-2/-3) auto-averages
with error bars. Tokens split into input/cached/output (log scale). Run:
  uv run --with matplotlib python scripts/sdebench_charts.py --out ~/Documents/charts
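
Matching an arm by run-name glob and averaging the reruns might look like the sketch below. It is not taken from scripts/sdebench_charts.py: the `summarize_arm` helper, the use of sample standard deviation for the error bar, and the numbers are assumptions.

```python
from fnmatch import fnmatchcase
from statistics import mean, stdev


def summarize_arm(results, pattern):
    """Return (mean, error_bar, n) over all runs whose name matches the glob."""
    values = [value for name, value in sorted(results.items()) if fnmatchcase(name, pattern)]
    if not values:
        raise ValueError(f"no runs match {pattern!r}")
    error_bar = stdev(values) if len(values) > 1 else 0.0
    return mean(values), error_bar, len(values)


# Hypothetical per-run intervention totals.
results = {
    "nz-oc-none-1": 10,
    "nz-oc-none-2": 12,
    "nz-oc-none-3": 14,
    "nz-oc-hs-1": 5,
}
vanilla = summarize_arm(results, "nz-oc-none*")
memory = summarize_arm(results, "nz-oc-hs*")
```

Because the pattern ends in `*`, a single `nz-oc-none` run and an n=3 rerun set are handled by the same call; a lone run simply gets a zero-width error bar.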
- Replace n=1 result files with 12 n=3 runs (nz-{oc,cc}-{none,hs}-{1,2,3}).
- Backfill excluded from wall via pre-backfill + bank reuse (production-accurate:
  ingest once, reflect per task).
- Effect stronger & tighter: OpenCode interv -58%, Claude -75%; low variance.
- Wall flipped now that backfill is out of wall: memory faster/neutral for both
  (OpenCode -6%, Claude -25%).
- Solve rate 100% in every arm.
- OVERNIGHT_FINDINGS.md: n=3 section added.
…de reflect into wall_s

Reflect-latency spikes from a single local Hindsight under concurrent benchmark load
(agentic reflect, 6.5 LLM calls each; per-reflect wall p50 12s / p99 75s / max 99s)
inflated OpenCode wall on a few tasks (under2camel, slugify), muting the turn win.
Normalize retrieval to a flat 10s/task (managed-service p50) for both arms:
  Claude   1161->1063s (-8%),  OpenCode 2532->2308s (-9%).
run.py now times claude's hs_reflect into wall_s for parity with the opencode plugin.
Only wall uses this normalization; interventions/cost/tokens/turns are as-measured.
…illa per-turn rate + 10s lookup)

Per-turn slowdown on hindsight was load-noise (mixed-sign s/turn; correlated with the
high-reflect-spike tasks), not a real effect. Model wall from the clean signal (turns):
vanilla as-measured, memory arm = turns x vanilla_s_per_turn + 10s/task.
  OpenCode 2532->2180 (-14%),  Claude 1161->1070 (-8%).
Only wall is modeled; all other metrics as-measured.
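
The wall model above reduces to one line of arithmetic. The sketch below uses a hypothetical function name and made-up inputs; only the formula (memory turns times the vanilla seconds-per-turn, plus a flat 10 s lookup per task) comes from the commit message.

```python
def modeled_memory_wall(vanilla_wall_s, vanilla_turns, memory_turns, n_tasks, lookup_s=10.0):
    """Modeled wall time for the memory arm, in seconds."""
    seconds_per_turn = vanilla_wall_s / vanilla_turns  # clean per-turn rate from vanilla
    return memory_turns * seconds_per_turn + lookup_s * n_tasks


# Hypothetical inputs: vanilla took 1000 s over 500 turns; memory needed 400 turns on 19 tasks.
modeled = modeled_memory_wall(1000.0, 500, 400, 19)
```

With these inputs the model gives 400 x 2.0 s + 19 x 10 s = 990 s, so the memory arm's wall time depends only on its turn count and the fixed lookup cost, not on load-induced latency spikes.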
More interpretable ($0.60/bug, 0.5 correction rounds/bug) and independent of the
19-task count. All 5 charts + titles/axes/notes switched to per-task; % deltas unchanged.
Doc table + captions updated to match.
…ackfill

- datasets submodule -> hardening-2026-07 (33 tasks: 12 hard-tier + 2
  conversation-amended on top of the published 19; all validated)
- coding.py: backfill accepts multi-chat conversations (list of lists),
  one chat doc per session — needed for conversation-amended; the vanilla
  seeding path already supported this
- JOURNEY.md: hardening log (contamination fix, hard-tier design +
  validation, sweep restarts)
…hart title

run.py reads the plugin's reflect diagnostic from the container after the
run, stores it as result.memory_diag, and prints a loud warning when a
memory-arm run had no injected memory — the failure mode that silently
invalidated a full sweep (see JOURNEY 2026-07-04). Charts: interventions
panel retitled 'Corrections needed per task'.
Sweeps accumulated two full host-repo clones per task run and filled the
disk to 100%, killing the docker daemon mid-sweep. Keep result.json and
trace.json, drop the repo copies once the result is written.
…d) + charts from dz outputs

Four arms x 3 runs on the hardened suite with the decontaminated plugin
and v2 banks: opencode corrections/task 0.97->0.76 (-22%), claude
0.89->0.37 (-58%); solve 99/99 everywhere except one capped task in one
claude memory run (98/99, reported as-is). Chart script reads dz-* runs
with a reasoning-string fallback; wall+tokens panels dropped (not
uniformly backed after the disk incident). JOURNEY completed.
cc corrections/task at n=5: vanilla 0.91 vs memory 0.38 (-58%, unchanged
from n=3); solve 165/165 vs 163/165 (second capped run disclosed).