sde-bench

Does a coding agent benefit from a memory system? sde-bench is a benchmark of bug-fix tasks whose correct solution hinges on a non-guessable, project-specific decision: the obvious fix passes the visible repro but fails a held-out hidden test. The independent variable is where that decision lives: in a real commit in the project's git history (history) or in a past developer conversation (conversation). What a memory system can reach is the point.

This repo is the dataset (19 tasks hosted in the real boltons library). The runner/harness and the memory systems under test live in open-memory-benchmark, which consumes this repo as a git submodule.

Layout

boltons-<name>/
  build.py                 # materializes the task's codebase (a boltons fork at 979fa9b)
  tasks/main/
    task.json              # metadata: source (history/conversation), tier, category, module/function, policy, tests, conversations
    regression_test.py     # the visible repro (red at HEAD)
    hidden_test.py         # the held-out test the naive fix fails
  tasks/oracle/            # (some tasks) an upper-bound arm
MANIFEST.json              # task index + by_source/by_tier/by_category counts
DATASET.md                 # datasheet: axes (source/tier/category), tasks, grading, metric
DATASET_DESIGN.md          # design rationale
validate.py                # structural integrity check
gen/                       # the task generator (self-contained here) — traps, emitters, categories

Axes

Each task is classified on three orthogonal axes: source (history = git history / conversation = past chat), tier (real-function / planted), and category — the kind of non-guessable decision (mapping, set-membership, numeric-policy, ordering, collection-merge, filter-rule, invariant). The canonical category taxonomy is gen/categories.py.
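
As a rough illustration of how the three axes partition a task set, the sketch below tallies tasks per axis value. The records and their names are made up, and the flat "source" / "tier" / "category" keys are an assumption about how task.json stores the axes, not a statement of its actual schema.

```python
from collections import Counter

# Hypothetical records: names and key layout are illustrative only and are
# not taken from the dataset's task.json files.
tasks = [
    {"name": "example-a", "source": "history", "tier": "real-function", "category": "mapping"},
    {"name": "example-b", "source": "conversation", "tier": "planted", "category": "numeric-policy"},
    {"name": "example-c", "source": "history", "tier": "planted", "category": "ordering"},
]


def count_by_axis(records, axis):
    """Count how many records carry each value of the given axis."""
    return Counter(record[axis] for record in records)


for axis in ("source", "tier", "category"):
    print(axis, dict(count_by_axis(tasks, axis)))
```

Counts of this shape are what the by_source/by_tier/by_category entries in MANIFEST.json summarize.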

Generate / add tasks

Full guide: GENERATING.md — the task model (the discrimination invariant), planted vs real-function, trap anatomy, and step-by-step for adding each kind.

In short: a task is a bug whose obvious fix passes the repro but fails a hidden test that encodes the non-guessable policy. You author a trap (gen/realfn_traps.py for real-function tasks, gen/traps.py for planted), give it a category in gen/categories.py, then emit and validate:

git clone https://github.com/mahmoud/boltons ~/dev/_sdebench_hosts/boltons   # or set SDEBENCH_BOLTONS_HOST
python gen/emit_realfn.py     # (re)emit + validate the real-function tasks
python gen/emit_host.py       # (re)emit the planted tasks
python validate.py            # structural check

The generator is vendored here so the dataset is self-contained: each planted task's build.py imports gen/traps.py at build time, so gen/ must ship with the data.
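
To make the task model concrete, here is a toy illustration of the discrimination invariant. It is not a task from the dataset: the rounding functions and the "round half up" policy are hypothetical, chosen only to show how an obvious fix can pass the visible repro yet fail the hidden test.

```python
import math


def buggy(x):
    # Original bug: truncates instead of rounding.
    return int(x)


def naive_fix(x):
    # Obvious fix: Python's built-in round(), which rounds ties to even.
    return round(x)


def policy_fix(x):
    # Hypothetical project policy: ties always round up.
    return math.floor(x + 0.5)


def visible_repro(fn):
    # The visible repro only exercises a tie where both fixes agree.
    return fn(1.5) == 2


def hidden_test(fn):
    # The held-out test exercises a tie where the policy matters.
    return fn(2.5) == 3


for fn in (buggy, naive_fix, policy_fix):
    print(fn.__name__, visible_repro(fn), hidden_test(fn))
```

Only policy_fix passes both checks. Nothing in the repro reveals the tie-breaking rule, so an agent has to recover it from wherever the decision was recorded.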

Validate

python validate.py     # checks every boltons-* task: required fields, tests parse, manifest consistency

Use as a submodule

git submodule add https://github.com/vectorize-io/sde-bench.git <path>

See DATASET.md for the full datasheet (sources, tiers, grading, the interventions metric, and results). boltons is © Mahmoud Hashemi, BSD, used unmodified as a fixture; the traps, planted modules, tests, conversations, and datasheet are this project's.
