Analysis
During a routine install of four workflows via gh aw add <path>@main in rapid sequence, three workflows were pinned to their resolved commit SHA in the emitted source: frontmatter line, but one was left on the mutable ref @main. At runtime, the "Check workflow lock file" step failed for that outlier with:
ERR_CONFIG: Lock file '.github/workflows/implementer-agent.lock.yml' is outdated!
The workflow file '.github/workflows/implementer-agent.md' frontmatter has changed.
The symptom: the hash stored in the .lock.yml did not match the SHA-256 computed from the on-disk .md frontmatter at runtime. Since the .md on disk had not been edited between install-time commit and runtime check (verified via git log -- .md showing no changes), the conclusion is that the lockfile was generated from a different canonical representation than what got written to the .md. The .md/.lock.yml pair was inconsistent from the moment gh aw add produced them.
Reproducing run: verkyyi/agentfolio workflow run 24679888189. Recovery PR where gh aw compile regenerated a consistent lockfile: verkyyi/agentfolio#107.
Root cause
pkg/cli/fetch.go — silent SHA-resolution fallback
// https://github.com/github/gh-aw/blob/main/pkg/cli/fetch.go#L86-L92
commitSHA, err := parser.ResolveRefToSHAForHost(owner, repo, ref, spec.Host)
if err != nil {
	remoteWorkflowLog.Printf("Failed to resolve ref to SHA: %v", err)
	// Continue without SHA - we can still fetch the content
	commitSHA = ""
} else {
	remoteWorkflowLog.Printf("Resolved ref %s to SHA: %s", ref, commitSHA)
	...
}
When ResolveRefToSHAForHost fails for any reason (transient rate-limit under rapid sequential installs is the most plausible cause here — four API calls to the same host within a few seconds), the error is logged to a verbose-only channel and commitSHA is silently zeroed. The install continues as if nothing happened.
pkg/cli/spec.go — fallback to @ref
// https://github.com/github/gh-aw/blob/main/pkg/cli/spec.go#L398-L416
func buildSourceStringWithCommitSHA(workflow *WorkflowSpec, commitSHA string) string {
	if workflow.RepoSlug == "" || workflow.WorkflowPath == "" {
		return ""
	}
	workflowPath := strings.TrimPrefix(workflow.WorkflowPath, "./")
	source := workflow.RepoSlug + "/" + workflowPath
	if commitSHA != "" {
		source += "@" + commitSHA
	} else if workflow.Version != "" {
		// Fallback to the version if no commit SHA is available
		source += "@" + workflow.Version
	}
	return source
}
With an empty commitSHA, the source: line is written with the mutable ref (@main), not a pinned SHA. This by itself is not a hash-mismatch problem — the .md on disk and the .lock.yml it seeded should still agree if both are computed from the same final content.
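To make the fallback concrete, here is a minimal, self-contained sketch that copies the function quoted above and calls it with and without a resolved SHA. The WorkflowSpec type here is a trimmed-down stand-in with only the three fields the function reads, and the repo slug and path are made-up example values; the real struct in pkg/cli may be shaped differently.

```go
package main

import (
	"fmt"
	"strings"
)

// WorkflowSpec is a stand-in for the real type: only the fields read below.
type WorkflowSpec struct {
	RepoSlug     string
	WorkflowPath string
	Version      string
}

// buildSourceStringWithCommitSHA mirrors the logic quoted from pkg/cli/spec.go.
func buildSourceStringWithCommitSHA(workflow *WorkflowSpec, commitSHA string) string {
	if workflow.RepoSlug == "" || workflow.WorkflowPath == "" {
		return ""
	}
	workflowPath := strings.TrimPrefix(workflow.WorkflowPath, "./")
	source := workflow.RepoSlug + "/" + workflowPath
	if commitSHA != "" {
		source += "@" + commitSHA
	} else if workflow.Version != "" {
		// Empty SHA: the mutable ref ends up in the source: line.
		source += "@" + workflow.Version
	}
	return source
}

func main() {
	// Example values only; not taken from the affected repository.
	spec := &WorkflowSpec{
		RepoSlug:     "owner/repo",
		WorkflowPath: "./workflows/example.md",
		Version:      "main",
	}
	// Resolution succeeded: the source line is pinned.
	fmt.Println(buildSourceStringWithCommitSHA(spec, "cb66d12806d7f00d220f11e964bc27dfec672913"))
	// Resolution failed and commitSHA was zeroed: the source line carries @main.
	fmt.Println(buildSourceStringWithCommitSHA(spec, ""))
}
```

The second call is the path the outlier workflow took: nothing in the return value signals that pinning was skipped.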
The asymmetry
Evidence from my install commit (3c36033):
spec-agent.md: source: ...spec-agent.md@cb66d12806d7f00d220f11e964bc27dfec672913
planner-agent.md: source: ...planner-agent.md@cb66d12806d7f00d220f11e964bc27dfec672913
reviewer-agent.md: source: ...reviewer-agent.md@cb66d12806d7f00d220f11e964bc27dfec672913
implementer-agent.md: source: ...implementer-agent.md@main ← outlier
All four were installed with identical gh aw add <path>@main invocations in a single shell session. Only one received the fallback — consistent with one transient API failure in ResolveRefToSHAForHost.
The inconsistency between the .md (which shipped with @main fallback) and the .lock.yml (whose stored hash was computed from a state with the resolved SHA) is the symptom that caused ERR_CONFIG at runtime. The most plausible mechanism is that the compile step uses in-memory processed content that was computed before the fallback was written into the final .md content — the two paths diverge.
Recovery was mechanical: gh aw compile rehashes from the on-disk .md (which has @main) and writes a matching hash. The runtime check then passes.
Implementation plan
Please implement the following changes:
1. Harden ResolveRefToSHAForHost call path (pkg/cli/fetch.go)
- Change the error-swallowing fallback at lines 86–92 from silent to explicit.
- Options (pick one, my preference in order):
- (a) Retry with exponential backoff on transient failures (1s, 3s, 9s) before giving up. Classes of "transient" = HTTP 429, HTTP 5xx, network timeout.
- (b) Fail the install loudly when SHA resolution fails. Emit a user-visible error via console.FormatErrorMessage and return a non-zero exit. The user can retry with the same command.
- (c) Fall back to @ref but mark the .md as "unpinned" and warn loudly; require an explicit --allow-unpinned flag to proceed.
- Recommended default: (a) + (b). Retry first; if all retries fail, surface an error with the specific failure (network vs. rate-limit) and the exact command to retry.
- Error message format (per the project's error-message style guide):
"[what's wrong]. [what's expected]. [example]".
- Example:
"failed to resolve 'main' to commit SHA after 3 retries. Expected the GitHub API to return a commit SHA for the ref. Check rate limits or try: gh aw add <path>@<exact-sha>."
2. Ensure .md and .lock.yml hash agreement (pkg/cli/add_command.go)
- After addSourceToWorkflow and processIncludesWithWorkflowSpec mutate content, write the final content to disk FIRST (os.WriteFile(destFile, ...)) and then invoke compile by reading from disk, not from the in-memory variable.
- This guarantees the lockfile's canonical-JSON hash is computed from the exact byte sequence that will exist on the user's machine at runtime.
- Equivalent alternative: pass the on-disk path (not the in-memory buffer) to the compile pipeline so hash computation reads the same bytes the runtime stale-check will.
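The write-then-read ordering in item 2 could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual add_command.go code: writeThenHash and hashFile are hypothetical names, and a plain SHA-256 over the file bytes stands in for the real canonical frontmatter hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// hashFile hashes the bytes currently on disk.
// Assumption: plain SHA-256 stands in for the real canonical frontmatter hash.
func hashFile(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]), nil
}

// writeThenHash (hypothetical) persists the final content first, then derives
// the hash from the file rather than from the in-memory buffer.
func writeThenHash(destFile string, content []byte) (string, error) {
	if err := os.WriteFile(destFile, content, 0o644); err != nil {
		return "", fmt.Errorf("write %s: %w", destFile, err)
	}
	return hashFile(destFile)
}

// demo simulates install followed by the runtime stale check and reports
// whether both sides computed the same hash.
func demo() (bool, error) {
	dir, err := os.MkdirTemp("", "gh-aw-sketch")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	dest := filepath.Join(dir, "implementer-agent.md")
	content := []byte("---\nsource: owner/repo/workflows/implementer-agent.md@main\n---\nbody\n")

	installHash, err := writeThenHash(dest, content) // install time
	if err != nil {
		return false, err
	}
	runtimeHash, err := hashFile(dest) // runtime check reads the same bytes
	if err != nil {
		return false, err
	}
	return installHash == runtimeHash, nil
}

func main() {
	ok, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("install hash matches runtime hash:", ok)
}
```

Because both hashes are derived from the same file, any late mutation of the in-memory content (such as a fallback source: line) cannot leave the pair out of sync.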
3. Add integration test covering the failure mode (pkg/cli/add_integration_test.go)
- Test case: simulate a transient ResolveRefToSHAForHost failure (inject a mock that fails once then succeeds, or a mock that always fails).
- Assert the install either:
- (a) fails loudly with a non-zero exit and a specific error message, OR
- (b) if fallback is permitted, the .md content and the .lock.yml's stored hash agree.
- Test: after the install step completes successfully, read the generated .md and .lock.yml from disk, compute the frontmatter hash via computeFrontmatterHash from actions/setup/js/frontmatter_hash_pure.cjs, and assert it matches the hash embedded in the lockfile.
4. Add regression test for hash parity (pkg/workflow/stale_check_test.go)
- Generate a .md with both @<SHA> and @<ref> forms in the source: line.
- Compile to .lock.yml in each case.
- Assert the stored hash matches the hash recomputed from the source .md in both cases.
- This would have caught the bug: the current test suite evidently has coverage for the hash itself but not for the invariant that install output = runtime input.
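The shape of the parity check in item 4 can be sketched as follows. Everything here is a simplified stand-in: frontmatterHash hashes the raw text between the "---" delimiters instead of the canonical form the real check uses, and compile, staleCheck, lockFile, and workflowMD are hypothetical helpers, not functions from pkg/workflow.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// lockFile is a hypothetical stand-in holding only the stored hash.
type lockFile struct{ FrontmatterHash string }

// workflowMD builds a tiny workflow file with the given source: value.
func workflowMD(source string) string {
	return "---\nsource: " + source + "\n---\nbody\n"
}

// frontmatterHash hashes the text between the first two "---" lines.
// Assumption: the real check hashes a canonical form, not the raw text.
func frontmatterHash(md string) (string, error) {
	parts := strings.SplitN(md, "---\n", 3)
	if len(parts) < 3 || parts[0] != "" {
		return "", fmt.Errorf("no frontmatter block found")
	}
	sum := sha256.Sum256([]byte(parts[1]))
	return hex.EncodeToString(sum[:]), nil
}

// compile stores the hash of the .md it was given.
func compile(md string) (lockFile, error) {
	h, err := frontmatterHash(md)
	return lockFile{FrontmatterHash: h}, err
}

// staleCheck recomputes the hash and compares it with the stored one.
func staleCheck(md string, lock lockFile) (bool, error) {
	h, err := frontmatterHash(md)
	if err != nil {
		return false, err
	}
	return h == lock.FrontmatterHash, nil
}

func main() {
	shaMD := workflowMD("owner/repo/w.md@cb66d12806d7f00d220f11e964bc27dfec672913")
	refMD := workflowMD("owner/repo/w.md@main")

	// Parity: each form must agree with the lockfile compiled from it.
	for _, md := range []string{shaMD, refMD} {
		lock, err := compile(md)
		if err != nil {
			panic(err)
		}
		ok, _ := staleCheck(md, lock)
		fmt.Println("parity holds:", ok)
	}

	// The reported symptom: lockfile from the SHA form, .md with the ref form.
	lock, _ := compile(shaMD)
	ok, _ := staleCheck(refMD, lock)
	fmt.Println("mixed pair passes:", ok)
}
```

The last check reproduces the failure described in this issue in miniature: a lockfile hashed from one source: form cannot validate an .md written with the other.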
5. Document the failure mode (actions/setup/md/stale_lock_file_failed.md)
- Add a section to the "How to investigate the mismatch" block explaining that a mismatch can occur on fresh installs (not only on post-edit workflows) if SHA resolution failed transiently. Link to the retry command.
- User remediation is unchanged (gh aw compile), but understanding the install-time origin helps users trust the fix.
Test cases
Valid (install succeeds, pair agrees)
gh aw add <path>@main when GitHub API is healthy → .md has @<SHA>, lockfile hash matches. ✓
gh aw add <path>@<exact-sha> → .md has @<SHA> verbatim, no resolution needed, lockfile hash matches. ✓
gh aw add <local-path> → no SHA lookup, source line has no @<ref>, lockfile hash matches. ✓
Invalid (install fails with clear error)
gh aw add <path>@main when ResolveRefToSHAForHost returns HTTP 429 on every attempt → exhaust the retry schedule, give up, fail with "failed to resolve 'main' to commit SHA after 3 retries. ..." (exit non-zero, no .md or .lock.yml written).
gh aw add <path>@main when the upstream repo doesn't exist → fail with "repository 'owner/repo' not found. ..." (immediate, no retry).
Regression guard
- Simulated transient failure (fail once, succeed on retry) → install succeeds, .md has @<SHA>, lockfile hash matches.
- Simulated permanent failure with --allow-unpinned flag (if option (c) is chosen) → .md has @main, lockfile is computed from that .md, hash matches.
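The retry behavior these cases exercise (option (a) followed by option (b)) might be sketched like this. It is a rough illustration, not the proposed patch: resolveWithRetry, resolveFunc, errTransient, defaultDelays, and noSleep are hypothetical names, and the resolver is passed in as a function so a test can substitute a fake for parser.ResolveRefToSHAForHost.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient marks retryable failures (HTTP 429, HTTP 5xx, network timeout).
// Hypothetical sentinel; real code would classify the API error itself.
var errTransient = errors.New("transient failure")

// defaultDelays is the backoff schedule suggested in the plan: 1s, 3s, 9s.
var defaultDelays = []time.Duration{time.Second, 3 * time.Second, 9 * time.Second}

// resolveFunc stands in for a call to parser.ResolveRefToSHAForHost.
type resolveFunc func() (string, error)

// noSleep lets tests and demos skip the real waiting.
func noSleep(time.Duration) {}

// resolveWithRetry makes one attempt plus one retry per delay, retrying only
// transient failures, and returns an explicit error instead of an empty SHA.
func resolveWithRetry(resolve resolveFunc, ref string, delays []time.Duration, sleep func(time.Duration)) (string, error) {
	var lastErr error
	for attempt := 0; attempt <= len(delays); attempt++ {
		sha, err := resolve()
		if err == nil {
			return sha, nil
		}
		lastErr = err
		if !errors.Is(err, errTransient) {
			// Permanent failure (e.g. repository not found): fail immediately.
			return "", fmt.Errorf("failed to resolve '%s' to commit SHA: %w", ref, err)
		}
		if attempt < len(delays) {
			sleep(delays[attempt])
		}
	}
	return "", fmt.Errorf("failed to resolve '%s' to commit SHA after %d retries. "+
		"Expected the GitHub API to return a commit SHA for the ref. "+
		"Check rate limits or try: gh aw add <path>@<exact-sha> (last error: %w)",
		ref, len(delays), lastErr)
}

func main() {
	calls := 0
	// Fake resolver: fails once with a rate limit, then succeeds.
	flaky := func() (string, error) {
		calls++
		if calls == 1 {
			return "", fmt.Errorf("HTTP 429: %w", errTransient)
		}
		return "cb66d12806d7f00d220f11e964bc27dfec672913", nil
	}
	sha, err := resolveWithRetry(flaky, "main", defaultDelays, noSleep)
	fmt.Println(sha, err, calls)
}
```

Injecting the sleep function keeps the regression tests fast, and returning an error instead of an empty string forces the caller to decide explicitly between failing the install and an opt-in unpinned fallback.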
Labels
bug, cli
Follow-up guidelines
- Use console formatting from pkg/console for CLI output.
- Follow the Error Message Style Guide.
- Run make agent-finish before completing.
- The fix is small but structural: it touches the install path that every workflow in the ecosystem depends on, so please add both the regression test (plan item 4) and the integration test (plan item 3) rather than just the unit-level retry mock.