
fix(build-cds-containers): drop buildkitd-config on ubuntu-latest runners #49

Closed
nvjaxzin wants to merge 1 commit into main from fix/build-cds-containers-revert-mirror-config


@nvjaxzin

Contributor

Summary

Hotfix for the post-merge failure on `main` introduced by #48. Drops the `buildkitd-config: /etc/buildkit/buildkitd.toml` input from `Set up Docker Buildx` in `build-cds-containers.yml`.

Root cause

The jobs in this workflow set `runs-on: ubuntu-latest` (GitHub-hosted runners), not `nv-gha-runners`. `/etc/buildkit/buildkitd.toml` is pre-populated only on the latter, so the path passed to `buildkitd-config` does not exist on the runners this workflow uses. As a result, every matrix variant of `build-and-push-images` failed in the post-merge `Build CDS Containers` run with:

```
##[error]config file /etc/buildkit/buildkitd.toml not found
```

Failed run: https://github.com/NVIDIA/dsx-github-actions/actions/runs/26478414457

Why this revert is safe

The fix in #48 targeted Docker Hub anonymous rate limits on self-hosted runners. GitHub-hosted runners pull Docker Hub images using the runner's pre-configured Docker Hub auth, which is not subject to the anonymous rate limit. So this workflow never needed the BuildKit mirror config in the first place.

What stays

  • The change in `.github/actions/docker-build/action.yml` (the composite action consumed by nv-gha-runners-based consumers) is correct and stays in place. This revert is scoped to `build-cds-containers.yml` only.

Lesson learned (for the audit)

When applying `buildkitd-config: /etc/buildkit/buildkitd.toml`, confirm that the surrounding job's `runs-on:` targets nv-gha-runners. The path is runner-specific: the file is pre-populated on nv-gha-runners and does not exist on GitHub-hosted runners.

Test plan

  • `python -m yaml` parses `.github/workflows/build-cds-containers.yml` cleanly.
  • `actionlint v1.7.7` exits clean on `.github/workflows/build-cds-containers.yml`.
  • Post-merge `Build CDS Containers` run on `main` succeeds.

Tracks: nvbug 6225636.

cc @huaweic-nv @mmou-nv @abegnoche @lachen-nv

PR #48 added 'with: buildkitd-config: /etc/buildkit/buildkitd.toml'
to the Set up Docker Buildx step in this workflow. That config file
is pre-populated on nv-gha-runners but does not exist on GitHub-hosted
runners. The build-and-push-images job (and others in this workflow)
run on 'ubuntu-latest', so post-merge to main every matrix variant
failed with:

  ##[error]config file /etc/buildkit/buildkitd.toml not found

The original rationale for the buildkitd-config setting does not apply
to GitHub-hosted runners: those runners pull from Docker Hub with the
runner's pre-configured authentication and do not hit the anonymous
rate limit that motivated the change in the first place.

This revert is scoped only to this workflow. The change in
.github/actions/docker-build/action.yml (the composite action consumed
by nv-gha-runners-based consumers) is correct and stays in place.

Tracks: nvbug 6225636.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Brian R. Jackson <brijackson@nvidia.com>
@nvjaxzin

Contributor Author

Alternative architectural approach drafted at #50 — moves this workflow to nv-gha-runners (where the BuildKit mirror config does meaningful work) instead of reverting it. Reviewers should pick one of #49 or #50, not both. See #50's body for the trade-off comparison.

@nvjaxzin nvjaxzin closed this in 1be5ca1 May 26, 2026
nvjaxzin added a commit that referenced this pull request May 26, 2026
…ha-runners

refactor(build-cds-containers): run on nv-gha-runners (supersedes #49)
@github-actions github-actions Bot locked and limited conversation to collaborators May 26, 2026