sdebench: does memory help a coding agent? (boltons-hosted benchmark)#23
Draft
nicoloboschi wants to merge 142 commits into
- synthetic repo built with engineered git history (build.py); a perf commit bundles an int() refill floor (regression) with a legit available() accessor
- FAIL_TO_PASS regression repro + HIDDEN_TO_PASS held-out variants + PASS_TO_PASS suite
- validated: suite green at HEAD, regression+hidden red, surgical fix -> all green, git revert of the regression commit conflicts (forces surgical fix)
- README documents the benchmark design + grading + history A/B
- Dockerfile: deterministic python+pytest+git grading env
- harness/run.py: build repo (full|squashed history) -> ship bug report + failing repro -> run opencode (gemini-3.5-flash) -> capture SOURCE diff (tests excluded) -> grade in Docker vs FAIL_TO_PASS+PASS_TO_PASS+HIDDEN from pristine copies
- reports resolution/cost(tokens)/speed(wall,turns)
- smoke test (full history): agent solved it (591B fix, 10 passed, 20 turns, 176s)
…y pass/fail
- if grading fails, surface the NEW problem (failing tests' assertion output, NOT the fix) back to the agent and RESUME its session (opencode -c) to continue its work
- metric = number of human-like interventions needed (capped at 5 = drift guard)
- cost/turns/wall accumulate across all rounds; solved = passed within the cap
- realistic: feedback is what CI/QA would report; agent must still generalize the fix
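A minimal sketch of the grade-and-resume loop described above, to make the intervention metric concrete. The callables `start`, `resume` and `grade` are hypothetical stand-ins (not the harness's real functions) for "run opencode", "resume the session with the failure output" and "grade in Docker"; only the cap of 5 comes from the commit message.

```python
from typing import Callable, Optional

MAX_INTERVENTIONS = 5  # drift guard: the cap named in the commit message


def run_with_feedback(
    start: Callable[[], str],
    resume: Callable[[str], str],
    grade: Callable[[str], Optional[str]],
    cap: int = MAX_INTERVENTIONS,
) -> dict:
    """Run the agent, feeding failures back until it passes or the cap is hit.

    start()        -> first patch from a fresh session (hypothetical)
    resume(report) -> next patch after the agent sees the failure report
    grade(patch)   -> None if every test passes, else the failing-test output
    """
    patch = start()
    interventions = 0
    while True:
        report = grade(patch)
        if report is None:
            # Passed within the cap: 0 interventions means solved first try.
            return {"solved": True, "interventions": interventions}
        if interventions >= cap:
            return {"solved": False, "interventions": interventions}
        interventions += 1
        patch = resume(report)  # the agent sees the failure, never the fix
```

With this shape, an agent that needs two rounds of feedback scores 2, and one that never converges is reported as unsolved at 5.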
… regression
- a refactor changed DEFAULT_TTL 300->600 and dropped the rationale comment; 300 now lives ONLY in git history (blame/log the constant). bundled with a legit clear().
- repro proves entries expire too late but CANNOT reveal the value; hidden tests pin it to EXACTLY 300. validated: TTL=450 passes repro but fails hidden (underdetermined).
- with history: read 300 off blame -> 1-shot. without: binary-search via feedback rounds. this is the task designed to make the intervention-count A/B diverge.
…c discriminates
- 300 is a conventional TTL the agent guesses without history (A/B showed 0 interventions both, though full used ~half the tokens). 287 (a 'measured p99' value) can't be guessed: validated TTL=300 now passes repro but FAILS hidden. only history states 287.
- expectation: full reads 287 off git blame (0 interv); squashed must discover it via feedback rounds (>0 interv) -> intervention metric diverges.
…e full trace
- run_agent returns a token SPLIT {input, output, reasoning, cache_read, cache_write}
(verified cache_read populated, e.g. 101k cached of 186k input) — $ computed later per model
- captures the structured trajectory (tool steps + assistant text) per round
- main accumulates the split across all feedback rounds; writes result.json (metrics) +
trace.json (full multi-round conversation: bug report -> agent -> feedback -> agent ...)
- ui_export.py maps trace.json -> outputs/sdebench/<run>/agent/all.json (view='agent'); each task-run is a QueryResult whose trajectory is the FULL multi-round conversation (bug report -> agent -> feedback -> agent), with the token split + interventions in meta
- ported the purpose-built agent-trace view (RunDetail.vue + style.css) from feat/swebench-cl and rebuilt ui/dist
- harness now stores final_patch in trace.json (the UI 'answer')
- verified: full & squashed runs render with bug report, clickable tool steps, patch diff
…kens in UI
- compute_cost() per-class: $1.50/1M input, $0.15 cached, $9.00 output(+reasoning); cost_usd now live in result.json + ui_export (full ~$0.31, squashed ~$0.59; the extra intervention ~doubles $)
- per model-step in/out tokens stamped onto each trajectory step (tok_in/tok_out)
- UI: run-level stat boxes (Total cost / Interventions / Tokens in-out), per-task pills (interventions, $cost, in/out tokens, turns, wall), per-tool-step in→out badge
…adding, compact tokens
- each tool step is now one truncated line: [tool] [arg ↳ output-preview …] [in→out] [▸] (was wrapping one char per line via word-break:break-all on a squished flex column)
- combined arg + out-preview into one ellipsised .traj-mid (min-width:0 so it truncates)
- tighter L/R padding (1rem -> 0.55rem); tokens compact (9.4k→338, tabular, right-aligned)
- feedback/say steps wrap normally (pre-wrap), expand still shows full input/output
…put)
- verified the mapping is correct (not inverted): tok_in sums to input+cached (285k), tok_out to output+reasoning (4.7k); input context is just naturally >> output
- show ↑ for input/prompt and ↓ for output/generated so it's unambiguous (was a bare →)
…ount) empirically confirmed via raw step tokens: total = input + cache_read + output + reasoning, so 'input' is the NON-cached prompt and 'cache_read' the cached prompt (separate, not nested). => tok_in = input + cache_read and cost = input*$1.50 + cache_read*$0.15 + out*$9.00 are correct.
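The pricing rule confirmed above can be written out as a short sketch. The rates and the "input and cache_read are separate, not nested" accounting come from the commit messages; the function body is an illustration, not the harness's actual compute_cost(), and it assumes cache_write is not billed since no rate is given for it.

```python
# USD per 1M tokens, as stated in the commit messages.
PRICE_INPUT = 1.50
PRICE_CACHED = 0.15
PRICE_OUTPUT = 9.00  # applies to output + reasoning tokens


def compute_cost(tokens: dict) -> float:
    """Cost in USD for one token split.

    'input' is the NON-cached prompt and 'cache_read' the cached prompt,
    so the two are added, never subtracted from each other.
    Assumption: 'cache_write' carries no charge (no rate appears in the source).
    """
    generated = tokens.get("output", 0) + tokens.get("reasoning", 0)
    usd_times_1m = (
        tokens.get("input", 0) * PRICE_INPUT
        + tokens.get("cache_read", 0) * PRICE_CACHED
        + generated * PRICE_OUTPUT
    )
    return usd_times_1m / 1_000_000


def tok_in(tokens: dict) -> int:
    """Total prompt tokens processed: non-cached plus cached."""
    return tokens.get("input", 0) + tokens.get("cache_read", 0)
```

For an illustrative split of 100k input, 200k cache_read, 4k output and 1k reasoning, this gives $0.15 + $0.03 + $0.045 = $0.225.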
- harness records each round's submitted patch + grade outcome (passed + pytest) on the trace round, incl. the ones that FAILED eval and triggered the feedback
- ui_export inserts a 'patch' step after each round's trajectory; UI renders it as a clickable ✓/✗ row (round + pytest summary) that expands to that round's diff
- so the trajectory now shows: agent work -> ✗ submitted (2 failed) -> 🔁 feedback -> agent work -> ✓ submitted (9 passed)
… opencode plugin
- new --history hindsight mode: builds the SQUASHED repo (agent can't git-blame), ingests the full git history (each commit's message+diff) into a Hindsight bank, and runs with the opencode Hindsight plugin active (recall mode) pointed at that bank
- tests whether memory of history recovers the value of history (the buried 287 constant) when raw git access is gone, vs full (git) and squashed (nothing)
- local Hindsight server (main-ish, gemini-3.1-flash-lite) on :8888; ingest verified to surface '287 measured p99' on recall
…ng MODE)
- a refactor switched round_cents from banker's rounding (ROUND_HALF_EVEN) to half-up and dropped the 'to match the ledger' rationale; the rule now lives ONLY in git history
- different mechanism than ttlcache (algorithmic choice, not a magic number): tests generalization. validated: repro red at HEAD, banker's fix -> all green, a half-DOWN fix passes the repro but FAILS hidden (underdetermined), banker's only in history
- bundled with a legit format_cents() so revert fails PASS_TO_PASS
…r non-guessable
- removed task #1 ratelimiter (int() floor too obvious: full==squashed, 0 interv both)
- ledger first A/B failed to discriminate: banker's rounding is the GUESSABLE convention for money, so squashed solved it without history. Changed the real rule to round-half-DOWN ('match legacy billing') which agents DON'T default to. Validated: a banker's fix passes the repro but fails hidden -> the natural guess is wrong -> only history reveals half-down.
- README documents the non-guessability design rule
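To see why a banker's fix can pass a repro yet fail a hidden test, here is a small sketch using the standard decimal module. The sample amounts 0.125 and 0.135 are hypothetical (the actual repro and hidden inputs are not shown in this thread); they are chosen because half-even and half-down agree on the first and disagree on the second.

```python
from decimal import Decimal, ROUND_HALF_DOWN, ROUND_HALF_EVEN, ROUND_HALF_UP

CENT = Decimal("0.01")
MODES = {
    "half_up": ROUND_HALF_UP,      # the regressed behavior
    "half_even": ROUND_HALF_EVEN,  # banker's rounding, the guessable convention
    "half_down": ROUND_HALF_DOWN,  # the rule that lives only in history
}


def round_under_all_modes(amount: str) -> dict:
    """Round a decimal string to cents under each candidate rule."""
    value = Decimal(amount)
    return {name: str(value.quantize(CENT, rounding=mode)) for name, mode in MODES.items()}


# A tie whose kept digit is even: half_even and half_down both give 0.12,
# so a repro built on this value cannot tell them apart.
print(round_under_all_modes("0.125"))
# -> {'half_up': '0.13', 'half_even': '0.12', 'half_down': '0.12'}

# A tie whose kept digit is odd: half_even goes up to 0.14, half_down stays at 0.13,
# which is the kind of case a hidden test can use to pin the rule.
print(round_under_all_modes("0.135"))
# -> {'half_up': '0.14', 'half_even': '0.14', 'half_down': '0.13'}
```

Any tie on an even kept digit leaves the task underdetermined from the repro alone, which is the property the non-guessability rule relies on.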
… in the UI
- capture_git_history(): the task repo's engineered commits (sha/subject/body/diff), newest first
- ui_export attaches it to every QueryResult (backfills existing runs by rebuilding the repo); also splits exports by task (ttlcache.json / ledger.json) so tasks don't overwrite each other
- UI: a 'Repository history' panel listing the commits, each expandable to its full diff, so you can see the source documents the full/hindsight arms had and squashed didn't
… multi-task layout
- billing: 4 longer modules (money/discount/tax/invoice) + an 18-commit history full of
noise (docs, changelog, vague refactors) where TWO regressions and their guarantees are
buried: 'tidy money module' (rounding half-down->half-up) and 'refactor invoice pipeline'
(tax base discounted->pre-discount). 'find the relevant history' is now actually exercised.
- new dataset layout: datasets/<codebase>/{build.py, tasks/<task>/{task.json,*_test.py}} —
many tasks share ONE codebase + git history (easy to iterate). task.json gains 'codebase'.
- harness resolves build from codebase, test files from the task dir; stores codebase.
- ui_export splits per task_id (tasks on a codebase share the git history view).
- two tasks validated: billing-rounding-001 (critical, non-guessable half-down) and
billing-taxbase-001 (navigate noisy history). both: regression+hidden red -> fix -> 10 green.
…e input
- ↑ is the CUMULATIVE prompt (system prompt + tool defs + all prior steps, mostly cached); added +Δ = the NEW context that step introduced (cumulative minus previous step), so a big read shows up as the jump on the FOLLOWING step
- first step's ↑ (~9k) is the baseline: opencode system prompt + tool schemas + bug report, re-sent every call; +Δ on step 1 equals that baseline
- clearer tooltip distinguishing cumulative vs delta vs generated
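The +Δ definition above (cumulative minus previous step, with step 1 equal to the baseline) amounts to a first difference over the per-step prompt sizes. A sketch, with a hypothetical function name and made-up token counts:

```python
def context_deltas(cumulative_prompt_tokens: list[int]) -> list[int]:
    """New context introduced at each step.

    cumulative_prompt_tokens[i] is the full prompt size (the ↑ value) sent at step i.
    The first delta equals the baseline prompt, because nothing precedes it.
    """
    deltas = []
    previous = 0
    for current in cumulative_prompt_tokens:
        deltas.append(current - previous)
        previous = current
    return deltas


# Made-up example: a ~9k baseline, a small step, then a big file read
# whose output only lands in the prompt of the following step.
print(context_deltas([9000, 9400, 15000]))  # -> [9000, 400, 5600]
```

The 5600 on the third step is the cost of whatever the second step read, which is why a large read appears as a jump one step later.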
- group tool calls into TURNS: opencode issues several tools from ONE prompt, so tokens are per-turn. Each turn shows ONE ↑cumulative +Δnew ↓generated badge instead of repeating it on every batched tool row (the +0 rows that looked free / the row that looked like it 'cost' the whole previous batch). Feedback markers and submitted patches stay as standalone blocks.
- reasoning is token-only: gemini-3.5-flash emits no thinking TEXT (event types are just tool_use/step_start/step_finish/text), only a reasoning token COUNT in step_finish. Harness now stamps per-turn reasoning tokens; UI shows '🧠 N' on the turn (new runs).
…ed box)
the ↑ prompt is re-sent every turn (sum of per-turn ↑ = total input processed; e.g. 581k over 28 turns even though the last prompt is only 32k). most of it is CACHED (re-sent prefix -> ~60% cache hit) and billed ~10x cheaper, but that was invisible:
- per-turn badge now shows ⚡<cached> alongside ↑<total prompt>
- run sidebar adds an '⚡ Cached input' stat (k + % of ↑)
- per-task pill shows ⚡<cached> within the in-tokens
harness stamps per-turn tok_cache (new runs get the per-turn ⚡; run-level works on all runs)
- mem_index.py builds /tmp/sdebench/memindex/<codebase>.json (raw commits: subject/body/files/diff)
- --history memtool: squashed repo (no git trail) + recall_intent tool over the full history's index, so the TOOL replaces git (the fair 'beat full git' comparison)
- load_env/run_agent gain mem_index (sets MEM_INDEX, enables the recall_intent plugin mode)
- smoke (rounding): memtool 16 turns $0.30 48s vs full 25 turns $0.41 80s
…a engine)
- 9 modules (tokens/nodes/parser/refs/sheet/functions/evaluator/errors/engine), longer files, ~22-commit noisy history. Keeps billing as the 'easy' codebase.
- HARD regression FAR FROM SYMPTOM + history-dependent + underdetermined: a 'centralize argument evaluation' refactor made the evaluator short-circuit on any error arg, so COUNT/AVG/MIN/MAX over a range with an error cell return the error instead of aggregating the numbers (SUM is unaffected since it propagates anyway -> slips past existing tests). The bug is in evaluator.py, not in the COUNT code the symptom points at. Policy (functions decide; SUM propagates, aggregates skip) lives in functions.py + history; the precise 'don't short-circuit calls' invariant is only in history. Validated: HEAD 8 green / regression+hidden 3 fail; COUNT-only fix passes repro but fails AVG/MIN/MAX hidden; correct general fix -> 13 green.
- also: base PROMPT now asks the agent to work efficiently (applied to ALL arms = fair)
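A toy reconstruction of the minicalc regression, to show why the symptom (COUNT) points away from the cause (the evaluator). None of these names are taken from the real codebase; `CellError`, `fn_sum`, `fn_count` and the two `call_*` helpers are hypothetical and only mirror the policy stated above: functions decide, SUM propagates, aggregates skip.

```python
class CellError:
    """Stand-in for an error value stored in a cell, e.g. #DIV/0!."""

    def __init__(self, code: str):
        self.code = code

    def __repr__(self) -> str:
        return f"CellError({self.code!r})"


def fn_sum(args):
    # SUM propagates the first error it meets.
    for arg in args:
        if isinstance(arg, CellError):
            return arg
    return sum(args)


def fn_count(args):
    # COUNT (like AVG/MIN/MAX) skips error cells and aggregates the rest.
    return sum(1 for arg in args if not isinstance(arg, CellError))


def call_short_circuit(fn, args):
    # The regressed evaluator: bails out before the function gets a say.
    for arg in args:
        if isinstance(arg, CellError):
            return arg
    return fn(args)


def call_delegating(fn, args):
    # The invariant from history: do not short-circuit calls, let the function decide.
    return fn(args)


cells = [1, CellError("#DIV/0!"), 3]
print(call_short_circuit(fn_count, cells))  # -> CellError('#DIV/0!')  (the symptom)
print(call_delegating(fn_count, cells))     # -> 2
print(call_short_circuit(fn_sum, cells))    # -> CellError('#DIV/0!')  (same either way)
```

SUM returns the error under both evaluators, which is why a suite that only exercises SUM with error cells would stay green through the refactor.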
…e prompt) Exp1: tests push (auto-inject top-2 TF-ranked commits into the bug report) vs pull (recall_intent tool the agent must invoke). Same retrieval, different delivery. squashed repo + injected context.
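A sketch of the "push" side of Exp1: rank commits by term overlap with the bug report and keep the top 2 for injection. The scoring here (raw term-frequency counts over subject + body) is an assumption; the commit message only says "TF-ranked". The commit subjects are borrowed from the billing history mentioned earlier, while the bodies and the bug reports are invented for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase identifier-like terms."""
    return re.findall(r"[a-z_][a-z0-9_]*", text.lower())


def rank_commits(bug_report: str, commits: list[dict], k: int = 2) -> list[dict]:
    """Return up to k commits with the highest term-frequency overlap (score > 0)."""
    query_terms = set(tokenize(bug_report))
    scored = []
    for commit in commits:
        counts = Counter(tokenize(commit["subject"] + " " + commit["body"]))
        score = sum(counts[term] for term in query_terms)
        scored.append((score, commit))
    # Sort on the score only; the sort is stable, so ties keep history order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [commit for score, commit in scored[:k] if score > 0]


commits = [
    {"subject": "tidy money module", "body": "switch rounding to half-up"},
    {"subject": "docs: update changelog", "body": ""},
    {"subject": "refactor invoice pipeline", "body": "compute tax on pre-discount base"},
]
top = rank_commits("invoice total is wrong: tax computed on wrong base", commits)
print([c["subject"] for c in top])  # -> ['refactor invoice pipeline']
```

A query that shares no terms with the cause commit returns nothing, which is the failure mode a symptom-distant bug runs into with this kind of retrieval.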
…per bound) Research ablation (not a deployable method): injects the exact regression commit's diff into the prompt. The ceiling of 'perfect pushed memory' — if oracle doesn't beat full, behavior (not knowledge/retrieval) is the wall. Added cause_subject to all 5 tasks.
…via SDE_VARIANT Exp2: test whether constraining exploration (uniform, fair prompt) cuts cost independent of memory. variant stored in result.json.
…ers stack push (inject) beats git on 4/5 tasks where pull (recall_intent tool) doesn't; behavioral constraint (minimal) helps exploration-bound but not knowledge-bound tasks; the two levers fix different bottlenecks and STACK (inject+minimal -30% to -41% vs git, never hurts).
…al ablations Offline result: neither top-k nor a bug+repro 'rich query' surfaces minicalc's symptom-distant cause commit — simple symptom-based retrieval can't find it (the bug is a wrong return value, no traceback; the cause shares no terms with the symptom). Bounds what push retrieval can do.
…ble by symptom query
… tool Tests whether keeping the pull tool available on top of pushed symptom-context lets the agent find symptom-distant causes (the inject->oracle gap) by querying after it understands the code.
…ug fixes (was bug artifact)
…), not the 7 decision-types The UI's category pivot used dataset.categories() = the 7 decision-types, so the category axis looked over-split. Make the PRIMARY AMB category the axis (history vs conversation — 2 values, the benchmark's core 'where the decision lives' split); load_queries filters by source. Tier + decision-type stay as secondary breakdown axes via get_result_categories.
…t, both agents)
Prune superseded results: old 10-task opencode n=3 (ov-*, repro2-*), ancient agent-mode single-task runs (opencode+...), and the pre-bugfix claude pair. Rename the fixed claude runs to canonical names. Final 4 runs, all 19 tasks:
nz-oc-none / nz-oc-hs (opencode/gemini, vanilla / hindsight)
nz-cc-none / nz-cc-hs (claude/sonnet-5, vanilla / hindsight; both bug fixes)
Coding datasets don't have QA metrics (accuracy/recall/context-tokens) — every task is solved, so the signal is interventions/cost/turns/tokens. Server now aggregates the per-result agent metrics for coding-mode runs (interventions, cost, turns, tokens in/out, wall, solved) into the results list. DatasetDetail renders a coding table (Run | Agent | Tasks | Solved | Interventions | Cost | Turns | Tokens in/out | Wall) when the dataset is coding, and hides the QA charts/table. Sorted by interventions.
…lla/memory badge is right; robust codingArm fallback
…t/tokens/turns/wall) Grouped bar charts, vanilla vs memory per agent, from the committed results. Arms are matched by run-name globs so an n=3 rerun (nz-oc-none-1/-2/-3) auto-averages with error bars. Tokens split into input/cached/output (log scale). Run: uv run --with matplotlib python scripts/sdebench_charts.py --out ~/Documents/charts
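A sketch of the glob-based arm matching described above: every run whose name matches the arm's pattern contributes one value, and reruns are averaged with a sample standard deviation as the error bar. The function name and the numbers are hypothetical; this is not the code in scripts/sdebench_charts.py.

```python
from fnmatch import fnmatch
from statistics import mean, stdev


def aggregate_arm(run_values: dict, pattern: str) -> tuple:
    """Average one metric over all runs whose name matches a glob.

    Returns (mean, error bar). A single matching run gets a zero error bar,
    since a sample standard deviation needs at least two points.
    """
    values = [value for name, value in run_values.items() if fnmatch(name, pattern)]
    if not values:
        raise ValueError(f"no runs match {pattern!r}")
    error = stdev(values) if len(values) > 1 else 0.0
    return mean(values), error


# Made-up interventions-per-task values for an n=3 vanilla rerun and one memory run.
runs = {
    "nz-oc-none-1": 1.0,
    "nz-oc-none-2": 2.0,
    "nz-oc-none-3": 3.0,
    "nz-oc-hs-1": 0.5,
}
print(aggregate_arm(runs, "nz-oc-none*"))  # -> (2.0, 1.0)
print(aggregate_arm(runs, "nz-oc-hs*"))    # -> (0.5, 0.0)
```

Matching on a pattern rather than an exact name is what lets a later rerun join the average without changing the chart script.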
…ion + ~1c/reflect (gemini-3.1-flash-lite)
…t final, margin applies)
…ns label correctly in UI)
- Replace n=1 result files with 12 n=3 runs (nz-{oc,cc}-{none,hs}-{1,2,3}).
- Backfill excluded from wall via pre-backfill + bank reuse (production-accurate:
ingest once, reflect per task).
- Effect stronger & tighter: OpenCode interv -58%, Claude -75%; low variance.
- Wall flipped now that backfill is out of wall: memory faster/neutral for both
(OpenCode -6%, Claude -25%).
- Solve rate 100% in every arm.
- OVERNIGHT_FINDINGS.md: n=3 section added.
…de reflect into wall_s Reflect-latency spikes from a single local Hindsight under concurrent benchmark load (agentic reflect, 6.5 LLM calls each; per-reflect wall p50 12s / p99 75s / max 99s) inflated OpenCode wall on a few tasks (under2camel, slugify), muting the turn win. Normalize retrieval to a flat 10s/task (managed-service p50) for both arms: Claude 1161->1063s (-8%), OpenCode 2532->2308s (-9%). run.py now times claude's hs_reflect into wall_s for parity with the opencode plugin. Only wall uses this normalization; interventions/cost/tokens/turns are as-measured.
…illa per-turn rate + 10s lookup) Per-turn slowdown on hindsight was load-noise (mixed-sign s/turn; correlated with the high-reflect-spike tasks), not a real effect. Model wall from the clean signal (turns): vanilla as-measured, memory arm = turns x vanilla_s_per_turn + 10s/task. OpenCode 2532->2180 (-14%), Claude 1161->1070 (-8%). Only wall is modeled; all other metrics as-measured.
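The wall model stated above (memory arm = turns x vanilla_s_per_turn + 10s/task) as a short sketch. It assumes a single aggregate per-turn rate taken from the vanilla arm's totals; whether the real script computes the rate per task or per run is not stated here. All inputs in the example are made up.

```python
LOOKUP_S_PER_TASK = 10.0  # flat retrieval cost per task, from the commit message


def modeled_memory_wall(
    vanilla_wall_s: float,
    vanilla_turns: int,
    memory_turns: int,
    n_tasks: int,
    lookup_s: float = LOOKUP_S_PER_TASK,
) -> float:
    """Modeled wall time for the memory arm.

    The vanilla arm stays as-measured; the memory arm is priced at the
    vanilla per-turn rate plus a flat lookup per task.
    """
    s_per_turn = vanilla_wall_s / vanilla_turns
    return memory_turns * s_per_turn + lookup_s * n_tasks


# Made-up inputs: 1000s over 200 vanilla turns is 5 s/turn;
# 150 memory turns over 19 tasks gives 750s + 190s.
print(modeled_memory_wall(1000.0, 200, 150, 19))  # -> 940.0
```

Because only turns enter the model, load-induced latency spikes in either arm cannot move the modeled number.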
More interpretable ($0.60/bug, 0.5 correction rounds/bug) and independent of the 19-task count. All 5 charts + titles/axes/notes switched to per-task; % deltas unchanged. Doc table + captions updated to match.
…ackfill
- datasets submodule -> hardening-2026-07 (33 tasks: 12 hard-tier + 2 conversation-amended on top of the published 19; all validated)
- coding.py: backfill accepts multi-chat conversations (list of lists), one chat doc per session (needed for conversation-amended); the vanilla seeding path already supported this
- JOURNEY.md: hardening log (contamination fix, hard-tier design + validation, sweep restarts)
…hart title run.py reads the plugin's reflect diagnostic from the container after the run, stores it as result.memory_diag, and prints a loud warning when a memory-arm run had no injected memory — the failure mode that silently invalidated a full sweep (see JOURNEY 2026-07-04). Charts: interventions panel retitled 'Corrections needed per task'.
Sweeps accumulated two full host-repo clones per task run and filled the disk to 100%, killing the docker daemon mid-sweep. Keep result.json and trace.json, drop the repo copies once the result is written.
…d) + charts from dz outputs Four arms x 3 runs on the hardened suite with the decontaminated plugin and v2 banks: opencode corrections/task 0.97->0.76 (-22%), claude 0.89->0.37 (-58%); solve 99/99 everywhere except one capped task in one claude memory run (98/99, reported as-is). Chart script reads dz-* runs with a reasoning-string fallback; wall+tokens panels dropped (not uniformly backed after the disk incident). JOURNEY completed.
cc corrections/task at n=5: vanilla 0.91 vs memory 0.38 (-58%, unchanged from n=3); solve 165/165 vs 163/165 (second capped run disclosed).
sdebench — does memory help a coding agent?
A benchmark measuring whether a coding agent benefits from a memory system that has ingested a
project's git history and past developer conversations. Each task is a bug-fix whose correct solution
hinges on a non-guessable, project-specific decision: the obvious fix passes the visible repro but
fails a held-out hidden test. Where that decision lives (git history vs. a past chat) is the
independent variable.
What's here (sdebench/)
- …979fa9b, ~1600 commits used as retrieval noise): 5 on untested edges of real functions, 5 as planted modules inside the real repo. Sources: 1 H (git history) / 9 F (past conversation).
- …pass_to_pass + repro (FAIL_TO_PASS) + a held-out HIDDEN_TO_PASS. Every task is verified: HEAD fails, the correct fix passes all, the naive fix passes the repro but fails the hidden test.
- …output and resumes (cap 5). 0 = solved first try.
- memsys/: a local, file-based memory system (language-agnostic, no AST). Ingests git rationale + past chats, retrieves by TF-IDF + a code-symbol boost, surfaces the decision (not the answer). Also Hindsight arms (hsreflect = reflect() as an agentic reranker; hscoding = the reflect+inject plugin) and an oracle upper bound.
Headline result
Across the F suite, memory takes the plain agent's 56 interventions → 1 (n=5, all solved, no
legitimacy problems), cutting turns ~46% on a real codebase with ~1500 real commits as noise.
The sdebench-polish commits on top
- …(removes the "ingest only the task's own commits" shortcut)
- hsreflect/hscoding arms + AMB UI export
- DATASET.md + regenerate MANIFEST.json to the current 10-task set
Datasheet: sdebench/DATASET.md. Reproduce steps at the bottom of it. validate_dataset.py passes.
Draft: the whole benchmark lives on this branch (never previously pushed); opening as a draft for review of scope/structure before targeting a merge.
Dataset is now a submodule
The 10-task dataset (+ MANIFEST, datasheet, validator) lives in its own repo, vectorize-io/sde-bench, and is mounted here as a git submodule at sdebench/datasets, so the data versions independently of the runner. Harness paths are unchanged (sdebench/datasets/boltons-*/tasks/main/task.json). Clone with --recurse-submodules (or git submodule update --init). The legacy synthetic scratch datasets were dropped from the tree.