EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

News

June 2026. EvoCode-Bench has migrated to Harbor's official multi-step task format. It previously ran on our harbor_multiturn evaluation framework; that format and its runner are preserved under legacy/ for reproducibility of the paper's original evaluation. On the official format, each task is a sequence of [[steps]] run in one persistent container, with a per-step verifier after each step and trial-level reward aggregation.

EvoCode-Bench tests whether coding agents can keep a project working as user requests change. It contains 26 stateful coding tasks and 227 evaluated rounds (Harbor steps). Each task keeps the same workspace and agent session for 5-15 rounds, while cumulative executable tests check new requirements and still-active prior requirements.

Overview

Most coding benchmarks evaluate one specification followed by one final assessment. EvoCode-Bench instead evaluates an interactive coding session. Later rounds inherit earlier implementation decisions, dependencies, file layouts, API choices, and test behavior. Each round (Harbor step) is scored by a cumulative verifier, and the trial reward is the mean of the per-step rewards.

The benchmark is organized along two axes from the paper:

Engineering activity	Explorative	Contractual	Document-driven	Total
Construction	9 / 80	3 / 37	1 / 7	13 / 124
Spec Evolution	1 / 8	1 / 7	1 / 7	3 / 22
Review	3 / 21	1 / 7	1 / 9	5 / 37
Migration	3 / 29	1 / 7	1 / 8	5 / 44
Total	16 / 138	6 / 58	4 / 31	26 / 227

Each cell reports tasks / rounds. A round maps one-to-one to a Harbor step.
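
The row and column totals can be cross-checked with a few lines of arithmetic. This is only a sketch: the cell values are transcribed from the table above, and the names CELLS, row_totals, and grand_total are hypothetical, not part of the repository.

```python
# (tasks, rounds) per cell, transcribed from the table above.
# Column order: Explorative, Contractual, Document-driven.
CELLS = {
    "Construction":   [(9, 80), (3, 37), (1, 7)],
    "Spec Evolution": [(1, 8), (1, 7), (1, 7)],
    "Review":         [(3, 21), (1, 7), (1, 9)],
    "Migration":      [(3, 29), (1, 7), (1, 8)],
}


def row_totals(cells):
    """Sum (tasks, rounds) across the three columns of each row."""
    return {
        name: (sum(t for t, _ in row), sum(r for _, r in row))
        for name, row in cells.items()
    }


def grand_total(cells):
    """Sum (tasks, rounds) over every cell in the table."""
    totals = list(row_totals(cells).values())
    return (sum(t for t, _ in totals), sum(r for _, r in totals))


print(row_totals(CELLS)["Construction"])  # (13, 124)
print(grand_total(CELLS))                 # (26, 227)
```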

Task Format

EvoCode-Bench tasks use the Harbor official multi-step layout — one sub-directory per step under steps/, executed in the order declared by the [[steps]] array in task.toml:

task/
├── task.toml                       # metadata + [[steps]] list + reward strategy
├── environment/
│   └── Dockerfile                  # single container shared across all steps
└── steps/
    ├── round-1/
    │   ├── instruction.md          # this round's user request (WHAT, not HOW)
    │   ├── solution/solve.sh       # reference delta for this round
    │   └── tests/test.sh           # cumulative tests through this round
    ├── round-2/
    │   ├── instruction.md
    │   ├── solution/solve.sh
    │   └── tests/test.sh
    └── round-N/ ...
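
A task directory can be checked against this layout before it is run. The following is a minimal sketch and does not reproduce evaluation/validate_dataset.py; the function name missing_files and the REQUIRED_STEP_FILES tuple are hypothetical and only mirror the tree shown above.

```python
from pathlib import Path

# Files each step directory is expected to contain, per the layout above.
REQUIRED_STEP_FILES = ("instruction.md", "solution/solve.sh", "tests/test.sh")


def missing_files(task_dir):
    """Return relative paths that the layout expects but that are absent."""
    task_dir = Path(task_dir)
    missing = []

    # Task-level files shared by all steps.
    for rel in ("task.toml", "environment/Dockerfile"):
        if not (task_dir / rel).is_file():
            missing.append(rel)

    # One sub-directory per step under steps/.
    steps_dir = task_dir / "steps"
    step_dirs = []
    if steps_dir.is_dir():
        step_dirs = sorted(p for p in steps_dir.iterdir() if p.is_dir())
    if not step_dirs:
        missing.append("steps/<at least one step>")

    for step in step_dirs:
        for rel in REQUIRED_STEP_FILES:
            if not (step / rel).is_file():
                missing.append(f"steps/{step.name}/{rel}")
    return missing
```

An empty list means the directory matches the tree above; it says nothing about whether task.toml itself is valid.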

task.toml follows the official schema (schema_version = "1.2"):

schema_version = "1.2"
multi_step_reward_strategy = "mean"      # trial reward = mean of per-step rewards

[metadata]
name = "service-mesh-health-router"
difficulty = "hard"
category = "systems-networking"

[metadata.requirement_chain]
num_steps = 8

[[metadata.requirement_chain.steps]]
step = "round-1"
change_types = ["extension"]
# ... one entry per step (extension / correction / conflict)

[agent]
timeout_sec = 1800.0                      # global default; override per step via [steps.agent]

[verifier]
timeout_sec = 1800.0                      # global default; override per step via [steps.verifier]

[environment]
build_timeout_sec = 600.0
cpus = 1
memory_mb = 4096
storage_mb = 10240

[[steps]]
name = "round-1"                          # matches steps/round-1/

[[steps]]
name = "round-2"
# ... one [[steps]] entry per step, in execution order

The task format is built around three constraints:

Persistent workspace: the same Docker container carries files, dependencies, and generated artifacts across steps.
Continuous agent session: the agent receives a sequence of user requests rather than independent prompts.
Cumulative tests: round i verifies every still-active requirement from rounds 1..i, so regressions are caught immediately. Each step's tests/test.sh writes a binary reward to /logs/verifier/reward.txt.
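
The binary reward rule can be sketched as follows. The real verifiers are the per-step tests/test.sh scripts; this Python version is illustrative only. The function name write_binary_reward, the dictionary of test results, the "1"/"0" file contents, and the choice to score an empty result set as 0 are all assumptions.

```python
import tempfile
from pathlib import Path


def write_binary_reward(test_results, reward_path):
    """Write 1 if every cumulative test passed, else 0, and return the reward.

    test_results maps a test name to True/False (hypothetical structure).
    An empty result set is scored 0 here, which is an assumption.
    """
    reward = 1 if test_results and all(test_results.values()) else 0
    path = Path(reward_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"{reward}\n")
    return reward


# Demo in a temporary directory instead of the container's /logs/verifier/.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "logs" / "verifier" / "reward.txt"
    # A round-2 verifier also re-checks the still-active round-1 requirement.
    results = {"round1_health_endpoint": True, "round2_retry_policy": False}
    print(write_binary_reward(results, target))  # 0
    print(target.read_text().strip())            # 0
```

Because the round-2 test set includes the round-1 check, a regression in earlier behavior turns the whole step's reward to 0.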

Framework

EvoCode-Bench's standard multi-step evaluation runs on upstream Harbor — the same framework used by Terminal-Bench 2.0 — using its native multi-step support. No fork is required to run a full task (all steps).

uv tool install harbor      # or: pip install harbor
harbor run --help

Upstream Harbor's official multi-step runner provides:

native [[steps]] sequencing in the order declared in task.toml;
a single persistent Docker workspace shared across all steps;
a continuous agent session across steps;
a per-step verifier run against the cumulative test suite after each step;
trial-level reward aggregation via multi_step_reward_strategy (mean for EvoCode-Bench).

Single-Round Fast-Forward (SR) — solving a target round after fast-forwarding the earlier rounds with reference deltas — is not yet supported upstream. It is provided by our Harbor fork harbor-official-fast-forward, which adds --fast-forward-mode oracle-solution on top of official Harbor. (The legacy harbor_multiturn framework also supports SR.)

Capability	Upstream Harbor	harbor-official-fast-forward (our fork)	legacy harbor_multiturn
Full multi-step run (all steps)	✓	✓	✓
Single-round fast-forward (SR)	✗ (not yet)	✓	✓

Quick Start

1. Prerequisites

Python 3.11+ (the evaluation/*.py helpers use the stdlib tomllib).
Docker running, or a remote Daytona target.
A model endpoint for your agent.

Install the Harbor CLI:

# uv runs the Harbor CLI. See https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh
uv tool install harbor      # or: pip install harbor

Upstream Harbor, installed as above, is enough to run full tasks (all steps). Single-Round Fast-Forward (SR) additionally needs our fork; see the Single-Round Fast-Forward (SR) section.

2. Prepare Tasks

Download the released EvoCode-Bench task directories from Hugging Face and place them under data/EvoCodeBench. If you already have the Terminal-X repository, the tasks are also available under Terminal-X/data/EvoCodeBench/.

3. Configure Model Endpoint

For the claude-code agent:

export AGENT_TYPE="claude-code"
export AGENT_MODEL="claude-opus-4-7"
export ANTHROPIC_BASE_URL="https://api.your-provider.com"
export ANTHROPIC_AUTH_TOKEN="sk-..."

For the terminus-2 agent (OpenAI-compatible):

export AGENT_TYPE="terminus-2"
export AGENT_MODEL="openai/gpt-5.5"
export OPENAI_API_KEY="sk-..."
export OPENAI_API_BASE="https://api.your-provider.com/v1"
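
Before launching a run, it can help to confirm that the variables for the chosen agent are set. The sketch below is not part of the repository: missing_env and REQUIRED_ENV are hypothetical names, and treating every variable from the export lines above as mandatory is an assumption (the runner may accept defaults for some of them).

```python
import os

# Variables per agent type, taken from the export lines above.
# Assumption: all of them are required for a run.
REQUIRED_ENV = {
    "claude-code": ["AGENT_MODEL", "ANTHROPIC_BASE_URL", "ANTHROPIC_AUTH_TOKEN"],
    "terminus-2": ["AGENT_MODEL", "OPENAI_API_KEY", "OPENAI_API_BASE"],
}


def missing_env(env=None):
    """Return the names of unset or empty variables for the selected agent."""
    env = os.environ if env is None else env
    agent_type = env.get("AGENT_TYPE", "")
    if agent_type not in REQUIRED_ENV:
        raise ValueError(f"unknown AGENT_TYPE: {agent_type!r}")
    return [name for name in REQUIRED_ENV[agent_type] if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example = {"AGENT_TYPE": "terminus-2", "AGENT_MODEL": "openai/gpt-5.5"}
print(missing_env(example))  # ['OPENAI_API_KEY', 'OPENAI_API_BASE']
```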

4. Validate the Dataset

python evaluation/validate_dataset.py data/EvoCodeBench

The released benchmark should report 26 tasks and 227 steps.

5. Run One Task

# Agent (pass@1 by default; set AGENT_ATTEMPTS for pass@k)
AGENT_TYPE=claude-code AGENT_MODEL=claude-opus-4-7 \
  ./evaluation/run_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation agent

# Oracle verification (reference solutions; should score 1.0 on every step)
./evaluation/run_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation oracle

# No-op baseline (empty submission; should score 0)
./evaluation/run_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation nop

6. Run All Tasks

AGENT_TYPE=claude-code AGENT_MODEL=claude-opus-4-7 \
  ./evaluation/run_all.sh data/EvoCodeBench agent

Each task writes Harbor outputs under:

data/EvoCodeBench/<task>/harbor_jobs/<model>/

Single-Round Fast-Forward (SR)

The paper reports SR as a complementary metric: the agent solves a target round after Harbor fast-forwards all previous rounds with reference deltas.

SR requires our Harbor fork, because upstream Harbor does not yet support --fast-forward-mode. Clone harbor-official-fast-forward and point the runner at it:

git clone git@github.com:UniPat-AI/harbor-official-fast-forward.git
export HARBOR_BIN="uv --directory $(pwd)/harbor-official-fast-forward run harbor"

Solve only round 5 from a reference-completed prior state:

AGENT_MODEL=claude-opus-4-7 \
  ./evaluation/run_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation \
    agent --start-step 5 --end-step 5

Solve rounds 3-7 after fast-forwarding rounds 1-2:

AGENT_MODEL=claude-opus-4-7 \
  ./evaluation/run_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation \
    agent --start-step 3 --end-step 7

When --start-step > 1, the runner adds --fast-forward-mode oracle-solution so the earlier steps are prepared with the reference solutions (this flag exists only in the fork).
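
The rule above amounts to a small decision on the step range. The sketch below restates it in Python for clarity; it does not reproduce run_single.sh, and the function names fast_forward_args and fast_forwarded_steps are hypothetical. Only the flag --fast-forward-mode oracle-solution comes from the text.

```python
def fast_forward_args(start_step, end_step):
    """Return the extra Harbor arguments for a step range.

    Mirrors the rule above: the flag is added only when start_step > 1.
    """
    if start_step < 1 or end_step < start_step:
        raise ValueError("expected 1 <= start_step <= end_step")
    if start_step > 1:
        return ["--fast-forward-mode", "oracle-solution"]
    return []


def fast_forwarded_steps(start_step):
    """Steps prepared with reference solutions before the agent starts."""
    return list(range(1, start_step))


print(fast_forward_args(5, 5))  # ['--fast-forward-mode', 'oracle-solution']
print(fast_forwarded_steps(3))  # [1, 2]
print(fast_forward_args(1, 8))  # []
```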

Metrics

Each step is scored with a binary reward — 1 if all of that step's key requirements pass, 0 otherwise — written by the verifier to /logs/verifier/reward.txt. Harbor aggregates a trial's per-step rewards into a trial-level reward via multi_step_reward_strategy = "mean".

The primary score is therefore the mean per-step reward:

per-task score = (passed steps) / (total steps) for the trial;
dataset score = mean of per-task scores across the 26 tasks.
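
As a worked example of the two formulas, the sketch below scores two made-up tasks. The task names and reward lists are illustrative, and task_score and dataset_score are hypothetical helpers, not functions from compute_metrics.py.

```python
from statistics import mean


def task_score(step_rewards):
    """Per-task score: passed steps divided by total steps."""
    return mean(step_rewards)


def dataset_score(rewards_by_task):
    """Dataset score: mean of per-task scores, so each task weighs equally."""
    return mean(task_score(r) for r in rewards_by_task.values())


# Illustrative per-step binary rewards for two made-up tasks.
rewards = {
    "task-a": [1, 1, 0, 1],  # 3 of 4 steps passed
    "task-b": [1, 0],        # 1 of 2 steps passed
}

print(task_score(rewards["task-a"]))  # 0.75
print(task_score(rewards["task-b"]))  # 0.5
print(dataset_score(rewards))         # 0.625
```

Note that the dataset score averages per-task scores rather than pooling steps: pooling would give 4 passed steps out of 6, which weights long tasks more heavily.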

For continuity with the paper, compute_metrics.py also derives the paper's metrics from the same per-step rewards:

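MT@4: mean_t (1/N_t) sum_i max_{a<=4} r_{t,a,i} (best-of-4 per round, averaged), where r_{t,a,i} is the binary reward of attempt a on round i of task t and N_t is the number of rounds in task t;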
SR: single-round pass rate after reference fast-forwarding earlier rounds;
Comp: fraction of tasks completed through the final round in at least one attempt.

python evaluation/compute_metrics.py \
  --tasks-dir data/EvoCodeBench \
  --results-dir data/EvoCodeBench \
  --model claude-opus-4-7          # score one agent; add --json for machine-readable output

--model selects the harbor_jobs/<model>/ results to score (the oracle and nop baselines are excluded by default).
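
The MT@4 and Comp definitions above can be sketched directly from per-step rewards. This is not the code in compute_metrics.py: the nested rewards structure (task, then attempt, then round) and the function names are hypothetical, and Comp is read here as "some attempt passes every round of the task", which is an assumption about the paper's definition.

```python
def mt_at_k(rewards, k=4):
    """Mean over tasks of (1/N_t) * sum_i max_{a<=k} r[t][a][i]."""
    per_task = []
    for attempts in rewards.values():
        attempts = attempts[:k]      # keep at most k attempts
        n_rounds = len(attempts[0])  # N_t
        best = [max(a[i] for a in attempts) for i in range(n_rounds)]
        per_task.append(sum(best) / n_rounds)
    return sum(per_task) / len(per_task)


def comp(rewards):
    """Fraction of tasks where at least one attempt passes every round.

    Assumed reading of "completed through the final round".
    """
    done = sum(
        any(all(r == 1 for r in attempt) for attempt in attempts)
        for attempts in rewards.values()
    )
    return done / len(rewards)


# Illustrative data: rewards[task][attempt][round], two attempts per task.
rewards = {
    "task-a": [[1, 0, 0, 1], [0, 1, 0, 1]],  # best per round: 1, 1, 0, 1
    "task-b": [[1, 1], [1, 0]],              # best per round: 1, 1
}

print(mt_at_k(rewards))  # 0.875
print(comp(rewards))     # 0.5
```

In this example task-a reaches 0.75 under best-of-k per round even though neither attempt passes round 3, which is why MT@4 and Comp can diverge.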

Results

Paper results (original evaluation; MT@4 / SR / Comp as defined in the paper):

Agent	MT@4	SR	Comp
Claude-Opus-4.7	54.0	76.7	42.3
GPT-5.5	52.4	74.4	38.5
Claude-Opus-4.6	44.0	78.9	34.6

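SR exceeds MT@4 by 22-40 points for most agents; for the three agents listed above, the gap is 22.7, 22.0, and 34.9 points. Solving an isolated round from a reference-prepared state is much easier than keeping the agent's own workspace correct across many rounds.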
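
The gap for each listed agent follows from simple subtraction. In the sketch below the numbers are copied from the table above; RESULTS and sr_gap are hypothetical names.

```python
# Paper results copied from the table above.
RESULTS = {
    "Claude-Opus-4.7": {"MT@4": 54.0, "SR": 76.7, "Comp": 42.3},
    "GPT-5.5":         {"MT@4": 52.4, "SR": 74.4, "Comp": 38.5},
    "Claude-Opus-4.6": {"MT@4": 44.0, "SR": 78.9, "Comp": 34.6},
}


def sr_gap(results):
    """SR minus MT@4 per agent, rounded to one decimal place."""
    return {
        agent: round(m["SR"] - m["MT@4"], 1)
        for agent, m in results.items()
    }


print(sr_gap(RESULTS))
# {'Claude-Opus-4.7': 22.7, 'GPT-5.5': 22.0, 'Claude-Opus-4.6': 34.9}
```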

Relation to Terminal-X

EvoCode-Bench is the iteration component of Terminal-X, alongside DeepTerminalBench for single-shot depth and RoadmapBench for version upgrades. Terminal-X contains the combined benchmark suite and cross-dataset blog; this repository focuses on the EvoCode-Bench task format, evaluation protocol, and official-Harbor runner.

Citation

@misc{shen2026evocodebench,
  title = {EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions},
  author = {Haiyang Shen and Xuanzhong Chen and Wendong Xu and Yun Ma and Liang Chen and Kuan Li},
  year = {2026},
  eprint = {2605.24110},
  archivePrefix = {arXiv},
  primaryClass = {cs.SE},
  url = {https://arxiv.org/abs/2605.24110}
}

License

Code in this repository is released under the MIT License. Dataset terms follow the dataset release metadata.
