refactor(build-cds-containers): run on nv-gha-runners (supersedes #49)#50
Merged
Merged
Conversation
Move all five jobs in build-cds-containers.yml from runs-on: ubuntu-latest to runs-on: linux-amd64-cpu4 (NVIDIA self-hosted runners). With the workflow on nv-gha-runners, the buildkitd-config introduced in #48 does meaningful work: BuildKit routes docker.io pulls through dockerhub.nvidia.com instead of going anonymously to Docker Hub. Three of the four matrix images today pull base images from Docker Hub: - grafana-backup-tool: ysde/docker-grafana-backup-tool:1.4.2-slim - go-dev-1.24-debian: golang:1.24.3 - go-dev-1.24-alpine: golang:1.24.3-alpine (The tools image bases on nvcr.io and is unaffected either way.) Rationale: 1. Consistency. Every other docker-build path in this repo (composite action, reusable workflow, security-container-scan examples) assumes nv-gha-runners. This workflow being on ubuntu-latest was the odd one out. 2. Per the NVIDIA GHA platform best practice, BuildKit pulls from nv-gha-runners should go through the dockerhub.nvidia.com Artifactory mirror. That is exactly what the buildkitd-config already in this step provides. 3. Avoid future surprise. If anyone adds another Docker Hub base image to this matrix, the build is now insulated from anonymous rate-limit failures (nvbug 6225636). 4. Org policy. NVIDIA-owned CI on NVIDIA OSS repos generally belongs on NVIDIA-provisioned runners. This change supersedes #49 (the revert hotfix). If this PR is the chosen direction, close #49. Tracks: nvbug 6225636. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Brian R. Jackson <brijackson@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
3 tasks
Contributor
Author
|
/ok to test |
abegnoche
approved these changes
May 26, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary -- alternative to #49
This PR is the architectural alternative to the revert in #49. It moves all five jobs in
build-cds-containers.ymlfromruns-on: ubuntu-latesttoruns-on: linux-amd64-cpu4(NVIDIA self-hosted runners), and keeps thebuildkitd-config: /etc/buildkit/buildkitd.tomlsetting introduced in #48. With the workflow on nv-gha-runners, that config does meaningful work — BuildKit routesdocker.iopulls throughdockerhub.nvidia.com(Artifactory pull-through cache) instead of going anonymously to Docker Hub.Reviewers: please pick either this PR or #49, not both.
Why this is worth considering
Three of the four matrix builds pull their base image from Docker Hub today:
FROMtoolsnvcr.io/nvidia/base/ubuntu:22.04_20240212grafana-backup-toolysde/docker-grafana-backup-tool:1.4.2-slimgo-dev-1.24-debiangolang:1.24.3go-dev-1.24-alpinegolang:1.24.3-alpineThat means the rate-limit risk that motivated #48 applies here too — it just happens to be masked today because GitHub-hosted runners get authenticated Docker Hub pulls. Moving to nv-gha-runners makes the workflow consistent with the rest of this repo and protected from the rate-limit class of failures.
Trade-offs vs. #49
mainruns-on:line edits, 0 new logicdocker runtests)Open questions for reviewers
ubuntu-latestoriginally intentional (e.g., to keep the workflow runnable by external contributors, or to decouple from internal infra)?nv-gha-runnerssupport the workflow's auxiliary steps as-is? Specifically:cache-from: type=gha/cache-to: type=gha,mode=max— GHA cache backend (should be runner-agnostic).docker/login-action@v3againstghcr.iowith${{ secrets.GITHUB_TOKEN }}.docker runinvocations in thetest-go-dev-imageandtest-tools-imagejobs against the freshly built images.linux-amd64-cpu4is 4 CPUs; comparable to GitHub-hosted standard tier. Disk size should be fine for these images. Sanity-check before merge.What stays unchanged
.github/actions/docker-build/action.yml— composite action consumed bynv-gha-runners-based downstream repos. Correct as-is from fix(docker-build): route BuildKit pulls through the NVIDIA Docker Hub mirror #48..github/actions/security-container-scan/README.md— examples targetlinux-amd64-cpu4. Correct as-is from fix(docker-build): route BuildKit pulls through the NVIDIA Docker Hub mirror #48.Test plan
Static validation performed locally:
python -m yamlparses.github/workflows/build-cds-containers.ymlcleanly.actionlint v1.7.7exits clean (one pre-existing warning aboutlinux-amd64-cpu4being unrecognized by upstream actionlint — this is a known limitation for self-hosted labels, no actual issue).Live validation (requires runners; needs a vetter to allow
copy-pr-bot):Build CDS Containersruns onpull-request/**mirror (path filter matches because the workflow file itself changed).nv-gha-runners.test-go-dev-image,test-tools-image) succeed.Tracks: nvbug 6225636.
cc @huaweic-nv @mmou-nv @abegnoche @lachen-nv