
Add local agent-driven verification pipeline (verify.py)#55

Merged
jodli merged 5 commits into master from verification-pipeline
Jun 26, 2026


@jodli jodli commented Jun 26, 2026

Owner


What problems was I solving

Verifying a change to creative-mod meant a human running debug.sh, watching a Factorio server boot, and eyeballing the log. That toolchain was interactive, assertion-free, and could block forever — unusable by an autonomous agent:

  • No greppable pass/fail verdict — success vs. failure had to be read out of free-form Factorio log spew, and exit codes are unreliable (Factorio can exit 0 while logging a prototype/control error).
  • The documented silent control-stage crash — a mid-require crash in control.lua leaves the game running with storage.creative_mode == nil — had no detector.
  • Unbounded/interactive flows could hang forever and leave orphaned factorio processes.

This PR replaces that with a single bounded tool, uv run verify.py <subcommand>, that loads the mod in the maintainer's local Factorio install, runs assertions, and exits 0/non-zero with a stable, greppable RESULT: line. The result: an agent can edit → verify → read result → iterate without a human, with every command guaranteed to terminate under a watchdog. This is local-only tooling (not CI); runtime behavior is unchanged except one added sentinel log() line.

What user-facing changes did I ship

  • verify.py (new) — the single entrypoint: static, load, behavior, all, debug, shell, doctor, each emitting one RESULT: <name>=PASS|FAIL (detail) line and matching exit code.
  • sandbox.py (new) — single source of .debug/ sandbox truth (tree, live-tree symlink, mod-list.json, generated config.ini, create/server/gui launchers).
  • control.lua — adds the verification sentinel log("CREATIVE_MOD_CONTROL_OK") (the only runtime change).
  • debug.sh, rcon.sh, rcon-shell.sh (deleted) — superseded by verify.py's debug/shell modes.
  • VERIFY.md (new) + skill/doc updates — teach the factorio-mod-dev skill that verify.py is the canonical check.

How I implemented it

Built in five phases, one commit each, cheap layer → expensive layer.

Phase 1 — skeleton, doctor + static (d5432f1)

  • verify.py — PEP 723 inline header (stdlib only, run via uv), self-locating paths from Path(__file__), the result(name, ok, detail) contract helper, argparse dispatcher for every subcommand, cmd_doctor (Factorio binary + --version, uv/jq on PATH) and cmd_static (same luacheck/stylua invocations as lint.yml, excluding the gitignored .debug/).
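A minimal sketch of what the PEP 723 header and the result(name, ok, detail) contract helper could look like. The `requires-python` bound, the `format_result` helper, and the choice of exit code 1 for failure are assumptions of this sketch (the description only promises "0/non-zero"), not a copy of the shipped verify.py.

```python
# /// script
# requires-python = ">=3.11"   # assumption: the real bound is not stated in this PR
# dependencies = []            # stdlib only, as described above
# ///
"""Illustrative RESULT:/exit-code contract helper (not the shipped verify.py)."""


def format_result(name: str, ok: bool, detail: str = "") -> str:
    """Build one greppable line: RESULT: <name>=PASS|FAIL (detail)."""
    verdict = "PASS" if ok else "FAIL"
    suffix = f" ({detail})" if detail else ""
    return f"RESULT: {name}={verdict}{suffix}"


def result(name: str, ok: bool, detail: str = "") -> int:
    """Print the verdict line and return the matching exit code.

    Hypothetical convention: 0 on PASS, 1 on FAIL.
    """
    print(format_result(name, ok, detail), flush=True)
    return 0 if ok else 1
```

A subcommand would then end with something like `sys.exit(result("doctor", True, "factorio found"))`, so the printed verdict and the process exit code can never disagree.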

Phase 2 — load gate + sentinel + sandbox (e4d2dce)

  • sandbox.py: bootstrap_sandbox(), run_create() (returns captured log text directly so a stale factorio-current.log can't leak in), start_server() (own session so it can be reaped), start_gui().
  • control.lua — sentinel at the very bottom, after all requires + event registration.
  • verify.py cmd_load runs --create and scans the log: ^Error → data/control error; sentinel absent → control stage incomplete; otherwise PASS.
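The load-gate decision can be sketched as a pure function over the captured log text. The `^Error` pattern and the sentinel string come from this description; the function name `scan_load_log` and the exact detail strings are hypothetical, and the real cmd_load may match log lines differently.

```python
import re

# Sentinel logged at the very bottom of control.lua (from this PR).
SENTINEL = "CREATIVE_MOD_CONTROL_OK"

# Any line starting with "Error" counts as a data/control error.
# Assumption: the real scan may use a looser or stricter pattern.
ERROR_RE = re.compile(r"^Error.*$", re.MULTILINE)


def scan_load_log(log_text: str) -> tuple[bool, str]:
    """Return (ok, detail) for the output captured from a --create run."""
    match = ERROR_RE.search(log_text)
    if match:
        # An explicit error line wins, even if the sentinel was also logged.
        return False, f"data/control error: {match.group(0).strip()}"
    if SENTINEL not in log_text:
        # No error line, but control.lua never reached its last statement:
        # this is the silent mid-require crash case.
        return False, "control stage incomplete"
    return True, ""
```

Checking for the error line first keeps the more specific diagnosis when both conditions hold; the sentinel check then catches the crash that leaves no error line at all.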

Phase 3 — behavior + RCON-as-module + all (0b24105)

  • verify.py cmd_behavior — start the headless server, poll RCON (connect + auth handshake) for the ready signal, run the read-only assertion batch, then SIGTERM→SIGKILL reap the whole process group under a hard watchdog.
  • rcon.py — imported as a module by verify.py (CLI __main__ kept intact).
  • verify.py cmd_all — runs all three layers (no short-circuit) and aggregates into one RESULT: all=… line.
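The SIGTERM→SIGKILL reap of a server started in its own session can be sketched as follows (POSIX only). The helper names and the 5-second grace period are assumptions of this sketch; it relies on the fact that a child started with `start_new_session=True` leads its own process group, so its PID doubles as the group ID.

```python
import os
import signal
import subprocess


def spawn_in_own_session(argv: list[str]) -> subprocess.Popen:
    """Start a child as leader of a new session and process group."""
    return subprocess.Popen(argv, start_new_session=True, stdin=subprocess.DEVNULL)


def _signal_group(pgid: int, sig: int) -> None:
    """Send `sig` to every process in the group, ignoring an already-dead group."""
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:
        pass


def reap_process_group(proc: subprocess.Popen, grace: float = 5.0) -> int:
    """SIGTERM the whole group, escalate to SIGKILL after `grace` seconds.

    Returns the child's return code. Both waits are bounded, so the call
    cannot hang indefinitely on a stuck child.
    """
    _signal_group(proc.pid, signal.SIGTERM)
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        _signal_group(proc.pid, signal.SIGKILL)
        return proc.wait(timeout=grace)
```

Signalling the group rather than the single PID is what prevents orphans: any helper processes the server spawned receive the same signal.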

Phase 4 — debug/shell + retire wrappers (d7932f7)

Phase 5 — docs (f8cecee)

Deviations from the plan

No *plan*.md exists for this task; the planning artifact is the structure outline (04-structure-outline-verification-pipeline.md), whose five phases are all checked off. Comparing the final code to that outline:

Implemented as planned

  • All seven subcommands, the RESULT:/exit-code contract, the PEP 723 self-locating tool, the control-stage sentinel, the Python sandbox bootstrap, the RCON-polling ready signal, the all aggregation, the --gui escape hatch, deletion of the three shell wrappers, and the skill/doc updates all match the outline.

Deviations/surprises

  • behavior assertions drive the mod's remote interface, not storage.creative_mode directly. The outline's snippet read storage.creative_mode over RCON; in practice a bare RCON /c runs in the scenario script context, where that global is the scenario's storage (always nil). The shipped batch calls remote.call("creative-mode", "is_enabled") so the read executes in the mod's context — documented inline at cmd_behavior. Same intent (storage-initialized + default-disabled), corrected mechanism.
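A sketch of how such an assertion might be sent and interpreted. The `remote.call("creative-mode", "is_enabled")` call and the pcall reasoning come from this PR; the exact Lua string, the space-separated "ok value" reply format, and the `default_disabled` assertion name are assumptions made for illustration.

```python
# Lua sent over RCON. remote.call runs inside the mod that registered the
# interface, so the mod's own storage is read instead of the scenario's.
# Assumption: the shipped batch may phrase this differently.
ASSERT_LUA = (
    'local ok, v = pcall(remote.call, "creative-mode", "is_enabled"); '
    'rcon.print(tostring(ok) .. " " .. tostring(v))'
)
RCON_COMMAND = "/c " + ASSERT_LUA


def interpret_reply(reply: str) -> dict[str, bool]:
    """Turn the hypothetical "ok value" reply into named assertions."""
    parts = reply.strip().split(maxsplit=1)
    call_ok = bool(parts) and parts[0] == "true"
    value = parts[1] if len(parts) > 1 else ""
    return {
        # A successful pcall means the interface answered, so storage exists.
        "storage_initialized": call_ok,
        # The returned value must be false for the default-disabled state.
        "default_disabled": call_ok and value == "false",
    }
```

A reply of `true false` passes both assertions, while an empty reply or a pcall failure fails both.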

Additions not in plan

  • .gitignore also ignores .humanlayer/tasks/ and __pycache__/ (incidental hygiene for the new Python tooling and cloud artifacts).

Items planned but not implemented

  • None. Two manual-verification checkboxes in the outline (Phase 1 RESULT: readability, Phase 2 .debug/ layout parity) are left unchecked but the corresponding automated checks pass.

How to verify it

git fetch origin
git worktree add ../creative-mod-verify verification-pipeline
cd ../creative-mod-verify

Manual Testing

  • uv run verify.py doctor → exits 0, prints the detected Factorio version.
  • uv run verify.py static → matches running luacheck . and stylua --check . by hand.
  • uv run verify.py load → RESULT: load=PASS, returns control (non-blocking).
  • uv run verify.py behavior → assert storage_initialized=PASS, no orphaned factorio process after exit (pgrep -af x64/factorio).
  • uv run verify.py all → combined RESULT: static=PASS load=PASS behavior=PASS all=PASS.
  • Break a prototype, re-run load → RESULT: load=FAIL (data/control error: …), exit non-zero; revert.
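For an agent consuming these runs, the RESULT: lines can be pulled out of otherwise noisy output with a small parser. This is a sketch that assumes the line shapes shown in the checklist above; `parse_results` is a hypothetical helper, not part of verify.py.

```python
import re

# A verdict line starts with "RESULT:" and carries one or more name=PASS|FAIL fields.
RESULT_LINE = re.compile(r"^RESULT:[ \t]*(.+)$", re.MULTILINE)
VERDICT = re.compile(r"(\w+)=(PASS|FAIL)")


def parse_results(output: str) -> dict[str, bool]:
    """Map each layer name found on RESULT: lines to True (PASS) or False (FAIL)."""
    verdicts: dict[str, bool] = {}
    for line in RESULT_LINE.finditer(output):
        for name, verdict in VERDICT.findall(line.group(1)):
            verdicts[name] = verdict == "PASS"
    return verdicts
```

The exit code remains the primary signal; parsing the line is only needed to tell which layer failed.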

Automated Tests

uv run verify.py all   # static -> load -> behavior, one RESULT line, exit 0/non-zero
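The aggregation behind `all` (every layer runs, nothing short-circuits, one combined line) might look roughly like this. The `run_all` name, the callable-per-layer shape, and exit code 1 on failure are assumptions of this sketch.

```python
from typing import Callable

# Each layer is a (name, check) pair; check returns True on PASS.
Layer = tuple[str, Callable[[], bool]]


def run_all(layers: list[Layer]) -> tuple[str, int]:
    """Run every layer even after a failure and build one RESULT: line."""
    outcomes: list[tuple[str, bool]] = []
    for name, check in layers:
        try:
            ok = bool(check())
        except Exception:
            # A crashing layer counts as FAIL but must not stop later layers.
            ok = False
        outcomes.append((name, ok))
    all_ok = all(ok for _, ok in outcomes)
    fields = [f"{name}={'PASS' if ok else 'FAIL'}" for name, ok in outcomes]
    fields.append(f"all={'PASS' if all_ok else 'FAIL'}")
    return "RESULT: " + " ".join(fields), 0 if all_ok else 1
```

Running all layers regardless of earlier failures gives the agent the full picture in one pass instead of one failure per iteration.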

Description for the changelog

Add verify.py, a single bounded local verification tool (static/load/behavior/doctor/debug/shell) with a greppable RESULT: contract, replacing the interactive debug.sh/rcon*.sh wrappers.

jodli and others added 5 commits June 26, 2026 20:07
Single uv-run entrypoint (PEP 723, stdlib-only) with self-locating
paths, an argparse dispatcher for all seven subcommands, a shared
RESULT:/exit-code contract, and two working layers:
- doctor: preflight for Factorio binary/version, uv, jq
- static: wraps luacheck + stylua --check with per-tool PASS/FAIL

Also removes stray vendored serpent.lua (Factorio ships serpent built-in).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- control.lua: emit log("CREATIVE_MOD_CONTROL_OK") sentinel at end of
  control stage so a silent mid-require crash is detectable
- sandbox.py: port debug.sh .debug/ sandbox bootstrap to Python
  (bootstrap_sandbox + run_create), prune stale versioned symlinks
- verify.py: implement cmd_load (bootstrap -> --create -> scan captured
  output for ^Error and the sentinel); add --clean/--timeout flags

--create always runs (it is the load test) and the scan uses the
subprocess's own captured output to avoid stale prior-run log leakage.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- sandbox.py: add start_server() launching the headless server with
  RCON + console-log in its own session for clean process-group reaping
- verify.py: implement cmd_behavior (boot -> poll RCON ready -> read-only
  assertion batch -> SIGTERM/SIGKILL reap under watchdog) and cmd_all
  (static -> load -> behavior, aggregated RESULT, non-zero on any fail)

Assertions drive the mod's own remote interface
remote.call("creative-mode", "is_enabled") rather than a bare /c, since
console commands run in scenario context and cannot see the mod's
storage. A successful pcall proves storage init; the returned value
proves the default-disabled state. rcon.py is imported unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- verify.py: implement cmd_shell (one-shot RCON send + non-blocking
  stdin REPL with /c auto-prefix; attaches to a running server or
  starts+reaps a bounded one) and cmd_debug (bounded headless session
  with --command one-shot and --gui manual escape hatch)
- sandbox.py: add start_gui (full client via --load-game) and set
  stdin=DEVNULL on run_create/start_server so the headless server no
  longer drains the REPL's piped stdin
- delete debug.sh, rcon.sh, rcon-shell.sh (superseded by verify.py)
- rcon.py: update connection-refused hint to verify.py debug
- .gitignore: reference verify.py; ignore __pycache__/

rcon.py is kept and imported as a module (its __main__ CLI intact).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- VERIFY.md (new): canonical reference - subcommands + flags, the
  RESULT:/exit-code contract, the read-only behavior batch, and the
  replicable local install (Factorio 2.1.7 at ../../bin/x64/factorio,
  uv/jq/stylua, luacheck built against Lua 5.3, .debug/ sandbox layout)
- SKILL.md: replace the debug.sh/rcon*.sh block with a Verification
  loop section (layered model, reading RESULT/exit codes, debug/shell)
- DEBUGGING.md / DEBUG.md: point at verify.py load/behavior/shell and
  note the mod-context remote.call vs scenario-context /c caveat

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jodli jodli merged commit 6fd19fb into master Jun 26, 2026
2 checks passed
@jodli jodli deleted the verification-pipeline branch June 26, 2026 21:44