Add local agent-driven verification pipeline (verify.py) #55
Merged
Single uv-run entrypoint (PEP 723, stdlib-only) with self-locating paths, an argparse dispatcher for all seven subcommands, a shared RESULT:/exit-code contract, and two working layers:
- doctor: preflight for Factorio binary/version, uv, jq
- static: wraps luacheck + stylua --check with per-tool PASS/FAIL
Also removes stray vendored serpent.lua (Factorio ships serpent built-in).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
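The shared RESULT:/exit-code contract and the argparse dispatcher described in this commit could be sketched roughly as follows. This is a minimal illustration, not the shipped code: the `result(name, ok, detail)` signature and the `RESULT: <name>=PASS|FAIL (detail)` line format come from the PR description, while the exit code `1` for failure, the placeholder `cmd_doctor` body, and the `build_parser`/`main` names are assumptions.

```python
import argparse


def result(name, ok, detail=""):
    """Print one greppable RESULT line and return the matching exit code."""
    status = "PASS" if ok else "FAIL"
    suffix = f" ({detail})" if detail else ""
    print(f"RESULT: {name}={status}{suffix}")
    # Assumption: any non-zero value signals failure; 1 is used here.
    return 0 if ok else 1


def cmd_doctor(args):
    # Hypothetical placeholder: the real preflight checks Factorio, uv, and jq.
    return result("doctor", True, "placeholder check")


def build_parser():
    parser = argparse.ArgumentParser(prog="verify.py")
    sub = parser.add_subparsers(dest="command", required=True)
    doctor = sub.add_parser("doctor", help="preflight checks")
    # Each subcommand stores its handler so main() can dispatch uniformly.
    doctor.set_defaults(func=cmd_doctor)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    return args.func(args)
```

Calling `main(["doctor"])` prints one `RESULT:` line and returns the exit code, which an entrypoint would pass to `sys.exit`.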
- control.lua: emit log("CREATIVE_MOD_CONTROL_OK") sentinel at end of
control stage so a silent mid-require crash is detectable
- sandbox.py: port debug.sh .debug/ sandbox bootstrap to Python
(bootstrap_sandbox + run_create), prune stale versioned symlinks
- verify.py: implement cmd_load (bootstrap -> --create -> scan captured
output for ^Error and the sentinel); add --clean/--timeout flags

--create always runs (it is the load test) and the scan uses the
subprocess's own captured output to avoid stale prior-run log leakage.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
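The load-gate scan in this commit (look for `^Error`, then for the sentinel) might look like the sketch below. The sentinel string and the two failure details (`data/control error`, `control stage incomplete`) are taken from the PR; the function name `scan_load_output` and the choice to report the first matching line are assumptions.

```python
import re

SENTINEL = "CREATIVE_MOD_CONTROL_OK"
# MULTILINE makes ^ match at the start of every captured line.
ERROR_LINE = re.compile(r"^Error", re.MULTILINE)


def scan_load_output(log_text):
    """Return (ok, detail) for the captured output of one --create run."""
    match = ERROR_LINE.search(log_text)
    if match:
        # Report the first offending line so the RESULT detail is actionable.
        end = log_text.find("\n", match.start())
        if end == -1:
            end = len(log_text)
        line = log_text[match.start():end].strip()
        return False, f"data/control error: {line}"
    if SENTINEL not in log_text:
        # No error line, but control.lua never reached its final log() call.
        return False, "control stage incomplete"
    return True, ""
```

Passing the subprocess's own captured output, rather than re-reading a log file from disk, is what keeps a stale prior-run log from producing a false PASS.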
- sandbox.py: add start_server() launching the headless server with
RCON + console-log in its own session for clean process-group reaping
- verify.py: implement cmd_behavior (boot -> poll RCON ready -> read-only
assertion batch -> SIGTERM/SIGKILL reap under watchdog) and cmd_all
(static -> load -> behavior, aggregated RESULT, non-zero on any fail)

Assertions drive the mod's own remote interface
remote.call("creative-mode", "is_enabled") rather than a bare /c, since
console commands run in scenario context and cannot see the mod's
storage. A successful pcall proves storage init; the returned value
proves the default-disabled state. rcon.py is imported unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
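The "own session for clean process-group reaping" idea can be illustrated with the standard library alone. The sketch below is POSIX-only and is not the shipped `start_server()`: the helper names, the 5-second grace period, and the discarded stdout are assumptions. It shows the SIGTERM-then-SIGKILL escalation applied to the whole process group.

```python
import os
import signal
import subprocess


def start_in_own_session(cmd):
    """Launch cmd as the leader of a new session (and process group)."""
    return subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,  # assumption: output is not needed here
        stderr=subprocess.STDOUT,
        start_new_session=True,  # child calls setsid(), so pgid == pid
    )


def reap_group(proc, grace=5.0):
    """SIGTERM the child's process group; escalate to SIGKILL after grace seconds."""
    if proc.poll() is not None:
        return proc.returncode
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        # The group vanished between poll() and killpg().
        return proc.wait()
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited just after the grace period ended
        return proc.wait()
```

Signalling the group rather than the single PID means any helper processes the server spawned are terminated with it.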
- verify.py: implement cmd_shell (one-shot RCON send + non-blocking stdin REPL with /c auto-prefix; attaches to a running server or starts+reaps a bounded one) and cmd_debug (bounded headless session with --command one-shot and --gui manual escape hatch)
- sandbox.py: add start_gui (full client via --load-game) and set stdin=DEVNULL on run_create/start_server so the headless server no longer drains the REPL's piped stdin
- delete debug.sh, rcon.sh, rcon-shell.sh (superseded by verify.py)
- rcon.py: update connection-refused hint to verify.py debug
- .gitignore: reference verify.py; ignore __pycache__/
rcon.py is kept and imported as a module (its __main__ CLI intact).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
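The `/c` auto-prefix mentioned for the REPL could work along these lines. This is a guess at the behavior, not the shipped logic: the function name `to_console_command` and the rule "lines already starting with `/` are passed through" are assumptions.

```python
def to_console_command(line):
    """Turn one REPL input line into a Factorio console command.

    Bare Lua gets a /c prefix; explicit slash commands pass through unchanged.
    Blank input yields an empty string so the caller can skip sending it.
    """
    stripped = line.strip()
    if not stripped:
        return ""
    if stripped.startswith("/"):
        return stripped
    return f"/c {stripped}"


# Reading mod state goes through the mod's remote interface, because a bare
# /c runs in scenario context and cannot see the mod's storage.
example = to_console_command(
    'rcon.print(remote.call("creative-mode", "is_enabled"))'
)
```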
- VERIFY.md (new): canonical reference - subcommands + flags, the RESULT:/exit-code contract, the read-only behavior batch, and the replicable local install (Factorio 2.1.7 at ../../bin/x64/factorio, uv/jq/stylua, luacheck built against Lua 5.3, .debug/ sandbox layout)
- SKILL.md: replace the debug.sh/rcon*.sh block with a Verification loop section (layered model, reading RESULT/exit codes, debug/shell)
- DEBUGGING.md / DEBUG.md: point at verify.py load/behavior/shell and note the mod-context remote.call vs scenario-context /c caveat
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What problems was I solving
Verifying a change to creative-mod meant a human running `debug.sh`, watching a Factorio server boot, and eyeballing the log. That toolchain was interactive, assertion-free, and could block forever, which made it unusable by an autonomous agent:
- […] `0` while logging a prototype/control error).
- […] `require` crash in `control.lua` leaves the game running with `storage.creative_mode == nil`, had no detector.
- […] `factorio` processes.

This PR replaces that with a single bounded tool, `uv run verify.py <subcommand>`, that loads the mod in the maintainer's local Factorio install, runs assertions, and exits `0`/non-zero with a stable, greppable `RESULT:` line. The result: an agent can edit → verify → read result → iterate without a human, with every command guaranteed to terminate under a watchdog. This is local-only tooling (not CI); runtime behavior is unchanged except one added sentinel `log()` line.

What user-facing changes did I ship
- […] `static`, `load`, `behavior`, `all`, `debug`, `shell`, `doctor`, each emitting one `RESULT: <name>=PASS|FAIL (detail)` line and matching exit code.
- […] `.debug/` sandbox truth (tree, live-tree symlink, `mod-list.json`, generated `config.ini`, create/server/gui launchers).
- […] `log("CREATIVE_MOD_CONTROL_OK")` (the only runtime change).
- […] `verify.py`'s `debug`/`shell` modes.
- […] `factorio-mod-dev` skill that `verify.py` is the canonical check.

How I implemented it
Built in five phases, one commit each, cheap layer → expensive layer.
Phase 1: skeleton, `doctor` + `static` (d5432f1)
- […] `uv`), self-locating paths from `Path(__file__)`, the `result(name, ok, detail)` contract helper, argparse dispatcher for every subcommand, `cmd_doctor` (Factorio binary + `--version`, `uv`/`jq` on PATH) and `cmd_static` (same `luacheck`/`stylua` invocations as `lint.yml`, excluding the gitignored `.debug/`).

Phase 2: `load` gate + sentinel + sandbox (e4d2dce)
- `bootstrap_sandbox()`, `run_create()` (returns captured log text directly so a stale `factorio-current.log` can't leak in), `start_server()` (own session so it can be reaped), `start_gui()`.
- `cmd_load`: run `--create`, scan the log: `^Error` → `data/control error`; sentinel absent → `control stage incomplete`; else PASS.

Phase 3: `behavior` + RCON-as-module + `all` (0b24105)
- `cmd_behavior`: start the headless server, poll RCON (connect + auth handshake) for the ready signal, run the read-only assertion batch, then SIGTERM→SIGKILL reap the whole process group under a hard watchdog.
- […] `verify.py` (CLI `__main__` kept intact).
- `cmd_all`: runs all three layers (no short-circuit) and aggregates into one `RESULT: all=…` line.

Phase 4: `debug`/`shell` + retire wrappers (d7932f7)
- `cmd_shell`/`cmd_debug`: bounded RCON pass-through (one-shot or stdin REPL, auto-`/c`) and a bounded headless debug session with a `--gui` manual escape hatch.
- […] `.debug/` and ignores `.humanlayer/tasks/` + `__pycache__/`.

Phase 5: docs (f8cecee)
- […] `verify.py`.

Deviations from the plan
No `*plan*.md` exists for this task; the planning artifact is the structure outline (`04-structure-outline-verification-pipeline.md`), whose five phases are all checked off. Comparing the final code to that outline:

Implemented as planned
- […] `RESULT:`/exit-code contract, the PEP 723 self-locating tool, the control-stage sentinel, the Python sandbox bootstrap, the RCON-polling ready signal, the `all` aggregation, the `--gui` escape hatch, deletion of the three shell wrappers, and the skill/doc updates all match the outline.

Deviations/surprises
- […] `storage.creative_mode` directly. The outline's snippet read `storage.creative_mode` over RCON; in practice a bare RCON `/c` runs in the scenario script context, where that global is the scenario's storage (always `nil`). The shipped batch calls `remote.call("creative-mode", "is_enabled")` so the read executes in the mod's context, as documented inline at `cmd_behavior`. Same intent (storage-initialized + default-disabled), corrected mechanism.

Additions not in plan
- `.gitignore` also ignores `.humanlayer/tasks/` and `__pycache__/` (incidental hygiene for the new Python tooling and cloud artifacts).

Items planned but not implemented
- […] `RESULT:` readability, Phase 2 `.debug/` layout parity) are left unchecked but the corresponding automated checks pass.

How to verify it
    git fetch origin
    git worktree add ../creative-mod-verify verification-pipeline
    cd ../creative-mod-verify

Manual Testing
- `uv run verify.py doctor` → exits `0`, prints the detected Factorio version.
- `uv run verify.py static` → matches a manual `luacheck .` + `stylua --check .`.
- `uv run verify.py load` → `RESULT: load=PASS`, returns control (non-blocking).
- `uv run verify.py behavior` → `assert storage_initialized=PASS`, no orphaned `factorio` process after exit (`pgrep -af x64/factorio`).
- `uv run verify.py all` → combined `RESULT: static=PASS load=PASS behavior=PASS all=PASS`.
- […] `load` → `RESULT: load=FAIL (data/control error: …)`, exit non-zero; revert.

Automated Tests
    uv run verify.py all  # static -> load -> behavior, one RESULT line, exit 0/non-zero

Description for the changelog
Add `verify.py`, a single bounded local verification tool (`static`/`load`/`behavior`/`doctor`/`debug`/`shell`) with a greppable `RESULT:` contract, replacing the interactive `debug.sh`/`rcon*.sh` wrappers.
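For a caller consuming the greppable `RESULT:` contract (an agent loop, for instance), parsing might look like the sketch below. It assumes only the two line shapes quoted in this PR, `RESULT: <name>=PASS|FAIL (detail)` and the combined `RESULT: static=PASS load=PASS behavior=PASS all=PASS`; the function name `parse_results` is hypothetical and not part of `verify.py`.

```python
import re

RESULT_LINE = re.compile(r"^RESULT: (?P<body>.+)$", re.MULTILINE)
ENTRY = re.compile(r"(?P<name>[\w-]+)=(?P<status>PASS|FAIL)")


def parse_results(output):
    """Map each layer name found on RESULT lines to True (PASS) or False (FAIL)."""
    results = {}
    for line in RESULT_LINE.finditer(output):
        # Drop the parenthesized detail so text inside it is never parsed
        # as a name=status pair.
        body = line.group("body").split(" (", 1)[0]
        for entry in ENTRY.finditer(body):
            results[entry.group("name")] = entry.group("status") == "PASS"
    return results
```

The process exit code remains the primary signal; parsing the line is only useful when the caller wants to know which layer failed.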