Add local agent-driven verification pipeline (verify.py) by jodli · Pull Request #55 · jodli/CreativeMod

jodli · 2026-06-26T21:23:17Z


What problems was I solving

Verifying a change to creative-mod meant a human running debug.sh, watching a Factorio server boot, and eyeballing the log. That toolchain was interactive, had no assertions, and could block forever, which made it unusable by an autonomous agent:

No greppable pass/fail verdict — success vs. failure had to be read out of free-form Factorio log spew, and exit codes are unreliable (Factorio can exit 0 while logging a prototype/control error).
The documented silent control-stage crash — a mid-require crash in control.lua leaves the game running with storage.creative_mode == nil — had no detector.
Unbounded/interactive flows could hang forever and leave orphaned factorio processes.

This PR replaces that with a single bounded tool, uv run verify.py <subcommand>, that loads the mod in the maintainer's local Factorio install, runs assertions, and exits 0/non-zero with a stable, greppable RESULT: line. The result: an agent can edit → verify → read result → iterate without a human, with every command guaranteed to terminate under a watchdog. This is local-only tooling (not CI); runtime behavior is unchanged except one added sentinel log() line.

What user-facing changes did I ship

verify.py (new) — the single entrypoint: static, load, behavior, all, debug, shell, doctor, each emitting one RESULT: <name>=PASS|FAIL (detail) line and matching exit code.
sandbox.py (new) — single source of .debug/ sandbox truth (tree, live-tree symlink, mod-list.json, generated config.ini, create/server/gui launchers).
control.lua — adds the verification sentinel log("CREATIVE_MOD_CONTROL_OK") (the only runtime change).
debug.sh, rcon.sh, rcon-shell.sh (deleted) — superseded by verify.py's debug/shell modes.
VERIFY.md (new) + skill/doc updates — teach the factorio-mod-dev skill that verify.py is the canonical check.
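For a consumer of the contract, such as an agent script, reading the verdict could look like the sketch below. The line shape is inferred from the examples in this description (`RESULT: <name>=PASS|FAIL (detail)` and the combined `all` line); `parse_result` is a hypothetical helper, not part of the PR.

```python
import re

# Matches name=PASS or name=FAIL tokens, as in the RESULT: examples above.
_PAIR = re.compile(r"(\w+)=(PASS|FAIL)")


def parse_result(output: str) -> dict[str, bool]:
    """Return {name: passed} from the last RESULT: line in the tool output."""
    lines = [ln for ln in output.splitlines() if ln.startswith("RESULT: ")]
    if not lines:
        raise ValueError("no RESULT: line found")
    # Drop the optional "(detail)" suffix so its text cannot be misread as a verdict.
    head = lines[-1].split(" (", 1)[0]
    return {name: status == "PASS" for name, status in _PAIR.findall(head)}
```

An agent would typically check the process exit code first and use the parsed names only to decide which layer to look at next.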

How I implemented it

Built in five phases, one commit each, cheap layer → expensive layer.

Phase 1 — skeleton, `doctor` + `static` (d5432f1)

verify.py — PEP 723 inline header (stdlib only, run via uv), self-locating paths from Path(__file__), the result(name, ok, detail) contract helper, argparse dispatcher for every subcommand, cmd_doctor (Factorio binary + --version, uv/jq on PATH) and cmd_static (same luacheck/stylua invocations as lint.yml, excluding the gitignored .debug/).

Phase 2 — `load` gate + sentinel + sandbox (e4d2dce)

sandbox.py — bootstrap_sandbox(), run_create() (returns captured log text directly so a stale factorio-current.log can't leak in), start_server() (own session so it can be reaped), start_gui().
control.lua — sentinel at the very bottom, after all requires + event registration.
verify.py cmd_load — run --create, scan the log: ^Error → data/control error; sentinel absent → control stage incomplete; else PASS.
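The scan order described for cmd_load (error lines first, then the sentinel) can be sketched as a pure function over the captured log text. The sentinel string and the `^Error` pattern come from this description; `scan_load_log` and its return shape are hypothetical.

```python
import re

SENTINEL = "CREATIVE_MOD_CONTROL_OK"
ERROR_LINE = re.compile(r"^Error", re.MULTILINE)  # any line that starts with "Error"


def scan_load_log(log_text: str) -> tuple[bool, str]:
    """Return (ok, detail) for the text captured from the --create run."""
    match = ERROR_LINE.search(log_text)
    if match:
        # Report the first offending line as the detail.
        first = log_text[match.start():].splitlines()[0]
        return False, f"data/control error: {first}"
    if SENTINEL not in log_text:
        # No error was logged, but control.lua never reached its last line.
        return False, "control stage incomplete"
    return True, ""
```

Checking errors before the sentinel means a run that logs both still fails, which fits the point that Factorio's exit code alone cannot be trusted.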

Phase 3 — `behavior` + RCON-as-module + `all` (0b24105)

verify.py cmd_behavior — start the headless server, poll RCON (connect + auth handshake) for the ready signal, run the read-only assertion batch, then SIGTERM→SIGKILL reap the whole process group under a hard watchdog.
rcon.py — imported as a module by verify.py (CLI __main__ kept intact).
verify.py cmd_all — runs all three layers (no short-circuit) and aggregates into one RESULT: all=… line.

Phase 4 — `debug`/`shell` + retire wrappers (d7932f7)

verify.py cmd_shell/cmd_debug — bounded RCON pass-through (one-shot or stdin REPL, auto-/c) and a bounded headless debug session with a --gui manual escape hatch.
Deleted debug.sh / rcon.sh / rcon-shell.sh; .gitignore re-attributes .debug/ and ignores .humanlayer/tasks/ + __pycache__/.

Phase 5 — docs (f8cecee)

New VERIFY.md (canonical reference + replicable install), plus SKILL.md, DEBUGGING.md, and DEBUG.md updated to point at verify.py.

Deviations from the plan

No *plan*.md exists for this task; the planning artifact is the structure outline (04-structure-outline-verification-pipeline.md), whose five phases are all checked off. Comparing the final code to that outline:

Implemented as planned

All seven subcommands, the RESULT:/exit-code contract, the PEP 723 self-locating tool, the control-stage sentinel, the Python sandbox bootstrap, the RCON-polling ready signal, the all aggregation, the --gui escape hatch, deletion of the three shell wrappers, and the skill/doc updates all match the outline.

Deviations/surprises

behavior assertions drive the mod's remote interface, not storage.creative_mode directly. The outline's snippet read storage.creative_mode over RCON; in practice a bare RCON /c runs in the scenario script context, where that global is the scenario's storage (always nil). The shipped batch calls remote.call("creative-mode", "is_enabled") so the read executes in the mod's context — documented inline at cmd_behavior. Same intent (storage-initialized + default-disabled), corrected mechanism.
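A minimal sketch of how such a probe could be built and read back from Python follows. The Lua snippet uses the standard `pcall`, `remote.call`, and `rcon.print` calls; the space-separated response format, the `default_disabled` assertion name, and both function names are assumptions made for illustration, not the shipped assertion batch.

```python
# Lua sent over RCON: call the mod's remote interface inside pcall and print
# both the success flag and the returned value on one line.
PROBE_LUA = (
    'local ok, value = pcall(remote.call, "creative-mode", "is_enabled") '
    'rcon.print(tostring(ok) .. " " .. tostring(value))'
)


def build_probe() -> str:
    """Return the console command that runs the probe."""
    return "/c " + PROBE_LUA


def interpret(response: str) -> dict[str, bool]:
    """Map the probe output (assumed "<ok> <value>") to named assertions."""
    parts = response.strip().split(maxsplit=1)
    call_ok = bool(parts) and parts[0] == "true"
    value = parts[1] if len(parts) > 1 else ""
    return {
        "storage_initialized": call_ok,                      # pcall succeeded
        "default_disabled": call_ok and value == "false",    # hypothetical name
    }
```

Because the call goes through the remote interface, the read runs in the mod's own Lua state, which is the correction this deviation describes.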

Additions not in plan

.gitignore also ignores .humanlayer/tasks/ and __pycache__/ (incidental hygiene for the new Python tooling and cloud artifacts).

Items planned but not implemented

None. Two manual-verification checkboxes in the outline (Phase 1 RESULT: readability, Phase 2 .debug/ layout parity) are left unchecked but the corresponding automated checks pass.

How to verify it

git fetch origin
git worktree add ../creative-mod-verify verification-pipeline
cd ../creative-mod-verify

Manual Testing

uv run verify.py doctor → exits 0, prints the detected Factorio version.
uv run verify.py static → matches a manual `luacheck .` + `stylua --check .` run.
uv run verify.py load → RESULT: load=PASS, returns control (non-blocking).
uv run verify.py behavior → assert storage_initialized=PASS, no orphaned factorio process after exit (pgrep -af x64/factorio).
uv run verify.py all → combined RESULT: static=PASS load=PASS behavior=PASS all=PASS.
Break a prototype, re-run load → RESULT: load=FAIL (data/control error: …), exit non-zero; revert.

Automated Tests

uv run verify.py all   # static -> load -> behavior, one RESULT line, exit 0/non-zero
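The no-short-circuit aggregation behind `all` can be sketched as below, assuming each layer is a callable that returns a pass/fail boolean. `run_all` is a hypothetical name and the output shape follows the combined RESULT: example in Manual Testing.

```python
from typing import Callable


def run_all(layers: dict[str, Callable[[], bool]]) -> tuple[str, int]:
    """Run every layer in order and build one combined RESULT: line plus exit code."""
    outcomes: dict[str, bool] = {}
    for name, check in layers.items():
        try:
            outcomes[name] = bool(check())
        except Exception:
            outcomes[name] = False  # a crashing layer counts as a failure
    overall = all(outcomes.values())
    parts = [f"{name}={'PASS' if ok else 'FAIL'}" for name, ok in outcomes.items()]
    parts.append(f"all={'PASS' if overall else 'FAIL'}")
    return "RESULT: " + " ".join(parts), 0 if overall else 1
```

Running later layers after an earlier failure costs time, but it gives the agent the full picture in a single pass instead of one failure per iteration.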

Description for the changelog

Add verify.py, a single bounded local verification tool (static/load/behavior/all/doctor/debug/shell) with a greppable RESULT: contract, replacing the interactive debug.sh/rcon*.sh wrappers.

Single uv-run entrypoint (PEP 723, stdlib-only) with self-locating paths, an argparse dispatcher for all seven subcommands, a shared RESULT:/exit-code contract, and two working layers:

- doctor: preflight for Factorio binary/version, uv, jq
- static: wraps luacheck + stylua --check with per-tool PASS/FAIL

Also removes stray vendored serpent.lua (Factorio ships serpent built-in).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- control.lua: emit log("CREATIVE_MOD_CONTROL_OK") sentinel at end of control stage so a silent mid-require crash is detectable
- sandbox.py: port debug.sh .debug/ sandbox bootstrap to Python (bootstrap_sandbox + run_create), prune stale versioned symlinks
- verify.py: implement cmd_load (bootstrap -> --create -> scan captured output for ^Error and the sentinel); add --clean/--timeout flags

--create always runs (it is the load test) and the scan uses the subprocess's own captured output to avoid stale prior-run log leakage.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- sandbox.py: add start_server() launching the headless server with RCON + console-log in its own session for clean process-group reaping
- verify.py: implement cmd_behavior (boot -> poll RCON ready -> read-only assertion batch -> SIGTERM/SIGKILL reap under watchdog) and cmd_all (static -> load -> behavior, aggregated RESULT, non-zero on any fail)

Assertions drive the mod's own remote interface remote.call("creative-mode", "is_enabled") rather than a bare /c, since console commands run in scenario context and cannot see the mod's storage. A successful pcall proves storage init; the returned value proves the default-disabled state. rcon.py is imported unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- verify.py: implement cmd_shell (one-shot RCON send + non-blocking stdin REPL with /c auto-prefix; attaches to a running server or starts+reaps a bounded one) and cmd_debug (bounded headless session with --command one-shot and --gui manual escape hatch)
- sandbox.py: add start_gui (full client via --load-game) and set stdin=DEVNULL on run_create/start_server so the headless server no longer drains the REPL's piped stdin
- delete debug.sh, rcon.sh, rcon-shell.sh (superseded by verify.py)
- rcon.py: update connection-refused hint to verify.py debug
- .gitignore: reference verify.py; ignore __pycache__/

rcon.py is kept and imported as a module (its __main__ CLI intact).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- VERIFY.md (new): canonical reference - subcommands + flags, the RESULT:/exit-code contract, the read-only behavior batch, and the replicable local install (Factorio 2.1.7 at ../../bin/x64/factorio, uv/jq/stylua, luacheck built against Lua 5.3, .debug/ sandbox layout)
- SKILL.md: replace the debug.sh/rcon*.sh block with a Verification loop section (layered model, reading RESULT/exit codes, debug/shell)
- DEBUGGING.md / DEBUG.md: point at verify.py load/behavior/shell and note the mod-context remote.call vs scenario-context /c caveat

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

jodli and others added 5 commits June 26, 2026 20:07

jodli merged commit 6fd19fb into master Jun 26, 2026
2 checks passed

jodli deleted the verification-pipeline branch June 26, 2026 21:44
