
gh aw add: silent fallback to @ref when SHA resolution fails produces asymmetric .md/.lock.yml that fails ERR_CONFIG at runtime #27407

@verkyyi


Analysis

During a routine install of four workflows via gh aw add <path>@main in rapid sequence, three workflows were pinned to their resolved commit SHA in the emitted source: frontmatter line, but one was pinned to @main. At runtime, the "Check workflow lock file" step failed the asymmetric one with:

ERR_CONFIG: Lock file '.github/workflows/implementer-agent.lock.yml' is outdated!
The workflow file '.github/workflows/implementer-agent.md' frontmatter has changed.

The symptom: the hash stored in the .lock.yml did not match the SHA-256 computed from the on-disk .md frontmatter at runtime. Since the .md on disk had not been edited between install-time commit and runtime check (verified via git log -- .md showing no changes), the conclusion is that the lockfile was generated from a different canonical representation than what got written to the .md. The .md/.lock.yml pair was inconsistent from the moment gh aw add produced them.

Reproducing run: verkyyi/agentfolio workflow run 24679888189. Recovery PR where gh aw compile regenerated a consistent lockfile: verkyyi/agentfolio#107.

Root cause

pkg/cli/fetch.go — silent SHA-resolution fallback

// https://github.com/github/gh-aw/blob/main/pkg/cli/fetch.go#L86-L92
commitSHA, err := parser.ResolveRefToSHAForHost(owner, repo, ref, spec.Host)
if err != nil {
    remoteWorkflowLog.Printf("Failed to resolve ref to SHA: %v", err)
    // Continue without SHA - we can still fetch the content
    commitSHA = ""
} else {
    remoteWorkflowLog.Printf("Resolved ref %s to SHA: %s", ref, commitSHA)
    ...
}

When ResolveRefToSHAForHost fails for any reason (transient rate-limit under rapid sequential installs is the most plausible cause here — four API calls to the same host within a few seconds), the error is logged to a verbose-only channel and commitSHA is silently zeroed. The install continues as if nothing happened.

pkg/cli/spec.go — fallback to @ref

// https://github.com/github/gh-aw/blob/main/pkg/cli/spec.go#L398-L416
func buildSourceStringWithCommitSHA(workflow *WorkflowSpec, commitSHA string) string {
    if workflow.RepoSlug == "" || workflow.WorkflowPath == "" {
        return ""
    }
    workflowPath := strings.TrimPrefix(workflow.WorkflowPath, "./")
    source := workflow.RepoSlug + "/" + workflowPath
    if commitSHA != "" {
        source += "@" + commitSHA
    } else if workflow.Version != "" {
        // Fallback to the version if no commit SHA is available
        source += "@" + workflow.Version
    }
    return source
}

With an empty commitSHA, the source: line is written with the mutable ref (@main), not a pinned SHA. This by itself is not a hash-mismatch problem — the .md on disk and the .lock.yml it seeded should still agree if both are computed from the same final content.

The asymmetry

Evidence from my install commit (3c36033):

spec-agent.md:        source: ...spec-agent.md@cb66d12806d7f00d220f11e964bc27dfec672913
planner-agent.md:     source: ...planner-agent.md@cb66d12806d7f00d220f11e964bc27dfec672913
reviewer-agent.md:    source: ...reviewer-agent.md@cb66d12806d7f00d220f11e964bc27dfec672913
implementer-agent.md: source: ...implementer-agent.md@main                      ← outlier

All four were installed with identical gh aw add <path>@main invocations in a single shell session. Only one received the fallback — consistent with one transient API failure in ResolveRefToSHAForHost.

The inconsistency between the .md (which shipped with @main fallback) and the .lock.yml (whose stored hash was computed from a state with the resolved SHA) is the symptom that caused ERR_CONFIG at runtime. The most plausible mechanism is that the compile step uses in-memory processed content that was computed before the fallback was written into the final .md content — the two paths diverge.

Recovery was mechanical: gh aw compile rehashes from the on-disk .md (which has @main) and writes a matching hash. The runtime check then passes.

Implementation plan

Please implement the following changes:

1. Harden ResolveRefToSHAForHost call path (pkg/cli/fetch.go)

  • Change the error-swallowing fallback at lines 86–92 from silent to explicit.
  • Options (pick one, my preference in order):
    • (a) Retry with exponential backoff on transient failures (1s, 3s, 9s) before giving up. Classes of "transient" = HTTP 429, HTTP 5xx, network timeout.
    • (b) Fail the install loudly when SHA resolution fails. Emit a user-visible error via console.FormatErrorMessage and return a non-zero exit. The user can retry with the same command.
    • (c) Fall back to @ref but mark the .md as "unpinned" and warn loudly; require an explicit --allow-unpinned flag to proceed.
  • Recommended default: (a) + (b). Retry first; if all retries fail, surface an error with the specific failure (network vs. rate-limit) and the exact command to retry.
  • Error message format (per the project's error-message style guide): "[what's wrong]. [what's expected]. [example]".
    • Example: "failed to resolve 'main' to commit SHA after 3 retries. Expected the GitHub API to return a commit SHA for the ref. Check rate limits or try: gh aw add <path>@<exact-sha>."

2. Ensure .md and .lock.yml hash agreement (pkg/cli/add_command.go)

  • After addSourceToWorkflow and processIncludesWithWorkflowSpec mutate content, write the final content to disk FIRST (os.WriteFile(destFile, ...)) and then invoke compile by reading from disk, not from the in-memory variable.
  • This guarantees the lockfile's canonical-JSON hash is computed from the exact byte sequence that will exist on the user's machine at runtime.
  • Equivalent alternative: pass the on-disk path (not the in-memory buffer) to the compile pipeline so hash computation reads the same bytes the runtime stale-check will.

3. Add integration test covering the failure mode (pkg/cli/add_integration_test.go)

  • Test case: simulate a transient ResolveRefToSHAForHost failure (inject a mock that fails once then succeeds, or a mock that always fails).
  • Assert the install either:
    • (a) fails loudly with a non-zero exit and a specific error message, OR
    • (b) if fallback is permitted, the .md content and the .lock.yml's stored hash agree.
  • Test: after the install step completes successfully, read the generated .md and .lock.yml from disk, compute the frontmatter hash via computeFrontmatterHash from actions/setup/js/frontmatter_hash_pure.cjs, and assert it matches the hash embedded in the lockfile.

4. Add regression test for hash parity (pkg/workflow/stale_check_test.go)

  • Generate a .md with both @<SHA> and @<ref> forms in the source: line.
  • Compile to .lock.yml in each case.
  • Assert the stored hash matches the hash recomputed from the source .md in both cases.
  • This would have caught the bug: the current test suite evidently has coverage for the hash itself but not for the invariant that install output = runtime input.
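The invariant in item 4 can be sketched with toy stand-ins. Everything here is an assumption made for illustration: `compileLock` and `isStale` are hypothetical, the `# frontmatter-hash: ` marker is an invented layout (the real lockfile format is not shown in this issue), and the hash is plain SHA-256 over the text rather than the real canonical form.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// hashPrefix is an invented marker line, not the real lockfile layout.
const hashPrefix = "# frontmatter-hash: "

// hashOf returns the hex SHA-256 of the given .md text.
func hashOf(md string) string {
	sum := sha256.Sum256([]byte(md))
	return hex.EncodeToString(sum[:])
}

// compileLock is a toy compile step: it embeds the hash of md in the
// generated lock text.
func compileLock(md string) string {
	return hashPrefix + hashOf(md) + "\n# (compiled workflow body elided)\n"
}

// isStale mimics the runtime check: it extracts the stored hash from the
// lock text and compares it with the hash recomputed from md.
func isStale(md, lock string) bool {
	for _, line := range strings.Split(lock, "\n") {
		if strings.HasPrefix(line, hashPrefix) {
			return strings.TrimPrefix(line, hashPrefix) != hashOf(md)
		}
	}
	return true // a lock with no stored hash counts as stale
}
```

A table-driven test would run both the @<SHA> and @<ref> forms of the source: line through `compileLock` and assert `isStale` is false for each, then assert it is true when the lock was compiled from the other form.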

5. Document the failure mode (actions/setup/md/stale_lock_file_failed.md)

  • Add a section to the "How to investigate the mismatch" block explaining that a mismatch can occur on fresh installs (not only on post-edit workflows) if SHA resolution failed transiently. Link to the retry command.
  • User remediation is unchanged (gh aw compile), but understanding the install-time origin helps users trust the fix.

Test cases

Valid (install succeeds, pair agrees)

  • gh aw add <path>@main when GitHub API is healthy → .md has @<SHA>, lockfile hash matches. ✓
  • gh aw add <path>@<exact-sha> → .md has @<SHA> verbatim, no resolution needed, lockfile hash matches. ✓
  • gh aw add <local-path> → no SHA lookup, source line has no @<ref>, lockfile hash matches. ✓

Invalid (install fails with clear error)

  • gh aw add <path>@main when ResolveRefToSHAForHost returns HTTP 429 on every attempt → retry three times with backoff (1s, 3s, 9s), give up, fail with "failed to resolve 'main' to commit SHA after 3 retries. ..." (exit non-zero, no .md or .lock.yml written).
  • gh aw add <path>@main when the upstream repo doesn't exist → fail with "repository 'owner/repo' not found. ..." (immediate, no retry).

Regression guard

  • Simulated transient failure (fail once, succeed on retry) → install succeeds, .md has @<SHA>, lockfile hash matches.
  • Simulated permanent failure with --allow-unpinned flag (if option (c) is chosen) → .md has @main, lockfile is computed from that .md, hash matches.

Labels

bug, cli

Follow-up guidelines

  • Use console formatting from pkg/console for CLI output.
  • Follow the Error Message Style Guide.
  • Run make agent-finish before completing.
  • The fix is small but structural: it touches the install path that every workflow in the ecosystem depends on, so please add both the regression test (item 4 of the implementation plan) and the integration test (item 3) rather than just the unit-level retry mock.
