refactor(build-cds-containers): run on nv-gha-runners (supersedes #49) #50

Merged
nvjaxzin merged 1 commit into main from refactor/build-cds-containers-nv-gha-runners
May 26, 2026

@nvjaxzin nvjaxzin commented May 26, 2026

Summary -- alternative to #49

This PR is the architectural alternative to the revert in #49. It moves all five jobs in build-cds-containers.yml from runs-on: ubuntu-latest to runs-on: linux-amd64-cpu4 (NVIDIA self-hosted runners), and keeps the buildkitd-config: /etc/buildkit/buildkitd.toml setting introduced in #48. With the workflow on nv-gha-runners, that config does meaningful work — BuildKit routes docker.io pulls through dockerhub.nvidia.com (Artifactory pull-through cache) instead of going anonymously to Docker Hub.
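As a rough sketch of the change described above, a single job might look like the following after the move. The job name `build`, the checkout step, and the action versions are assumptions for illustration. Only `runs-on: linux-amd64-cpu4` and the `buildkitd-config` path are taken from this PR.

```yaml
# Hypothetical excerpt of .github/workflows/build-cds-containers.yml.
# Job name, checkout step, and action versions are illustrative assumptions.
jobs:
  build:
    # Previously: runs-on: ubuntu-latest
    runs-on: linux-amd64-cpu4
    steps:
      - uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
        with:
          # BuildKit config file expected to exist on the runner. Per this PR,
          # it routes docker.io pulls through dockerhub.nvidia.com.
          buildkitd-config: /etc/buildkit/buildkitd.toml
```

The same `runs-on` edit would be repeated in each of the five jobs. No other lines need to change for the mirror config to take effect.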

Reviewers: please pick either this PR or #49, not both.

Why this is worth considering

Three of the four matrix builds pull their base image from Docker Hub today:

| Image | FROM | Hits docker.io? |
| --- | --- | --- |
| tools | nvcr.io/nvidia/base/ubuntu:22.04_20240212 | No |
| grafana-backup-tool | ysde/docker-grafana-backup-tool:1.4.2-slim | Yes |
| go-dev-1.24-debian | golang:1.24.3 | Yes |
| go-dev-1.24-alpine | golang:1.24.3-alpine | Yes |

That means the rate-limit risk that motivated #48 applies here too — it just happens to be masked today because GitHub-hosted runners get authenticated Docker Hub pulls. Moving to nv-gha-runners makes the workflow consistent with the rest of this repo and protected from the rate-limit class of failures.

Trade-offs vs. #49

| Concern | #49 (revert) | This PR (refactor) |
| --- | --- | --- |
| Unbreaks main | Yes, immediately | Yes, by superseding |
| Code change | -2 lines | 5 runs-on: line edits, 0 new logic |
| Consistency with rest of repo | Workflow stays the odd one out on GitHub-hosted | Aligned with composite action, reusable workflow, README examples |
| Rate-limit protection on docker.io pulls | None (relies on GitHub-hosted's pre-auth) | Full (mirror config now applies) |
| Risk | Zero | Slightly higher: nv-gha-runners must support the workflow's other deps (GHA cache backend, GHCR push, docker run tests) |
| Reviewer cost | Trivial | Needs a yes on platform direction |

Open questions for reviewers

  1. Was ubuntu-latest originally intentional (e.g., to keep the workflow runnable by external contributors, or to decouple from internal infra)?
  2. Do nv-gha-runners support the workflow's auxiliary steps as-is? Specifically:
    • cache-from: type=gha / cache-to: type=gha,mode=max — GHA cache backend (should be runner-agnostic).
    • docker/login-action@v3 against ghcr.io with ${{ secrets.GITHUB_TOKEN }}.
    • docker run invocations in the test-go-dev-image and test-tools-image jobs against the freshly built images.
  3. Resource sizing: linux-amd64-cpu4 is 4 CPUs, comparable to the GitHub-hosted standard tier. Disk size should be fine for these images. Sanity-check before merge.
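
For reviewers weighing question 2, the auxiliary steps could look roughly like the sketch below. It is not copied from the workflow: the step names, the image tag, the `context` path, the `build-push-action` version, and the `go version` smoke test are all hypothetical, and the real workflow runs its tests in separate jobs. Only the cache settings, the `docker/login-action@v3` login against ghcr.io, and the use of `docker run` for testing come from the list above.

```yaml
# Hypothetical steps, simplified into one job for illustration.
# Names, tags, and paths are placeholders, not the repository's values.
steps:
  - name: Log in to GHCR
    uses: docker/login-action@v3
    with:
      registry: ghcr.io
      username: ${{ github.actor }}
      password: ${{ secrets.GITHUB_TOKEN }}

  - name: Build and push
    uses: docker/build-push-action@v6
    with:
      context: ./example-image                          # placeholder path
      push: true
      tags: ghcr.io/example-org/example-image:latest    # placeholder tag
      # GitHub Actions cache backend, as named in question 2
      cache-from: type=gha
      cache-to: type=gha,mode=max

  - name: Smoke-test the image
    # Placeholder command; the real test jobs may run something else.
    run: docker run --rm ghcr.io/example-org/example-image:latest go version
```

Pushing to ghcr.io with `GITHUB_TOKEN` also requires the job to have `packages: write` permission, which is worth confirming still holds on the self-hosted runners.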

What stays unchanged

Test plan

Static validation performed locally:

  • python -m yaml parses .github/workflows/build-cds-containers.yml cleanly.
  • actionlint v1.7.7 exits clean (one pre-existing warning about linux-amd64-cpu4 being unrecognized by upstream actionlint — this is a known limitation for self-hosted labels, no actual issue).

Live validation (requires runners; needs a vetter to allow copy-pr-bot):

  • Build CDS Containers runs on pull-request/** mirror (path filter matches because the workflow file itself changed).
  • All four matrix variants succeed on nv-gha-runners.
  • Test jobs (test-go-dev-image, test-tools-image) succeed.
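
The first live-validation bullet depends on the workflow's trigger. A minimal sketch of a trigger that would behave as described is shown below, assuming the workflow runs on pushes to the `pull-request/**` mirror branches created by copy-pr-bot. The actual `paths` list in the repository is not shown in this PR, so the single entry here is an assumption.

```yaml
# Hypothetical trigger block; the real path filter likely lists more paths.
on:
  push:
    branches:
      - "pull-request/**"
    paths:
      # Editing the workflow file itself matches the filter,
      # which is why this PR triggers a run.
      - ".github/workflows/build-cds-containers.yml"
```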

Tracks: nvbug 6225636.

cc @huaweic-nv @mmou-nv @abegnoche @lachen-nv

Move all five jobs in build-cds-containers.yml from runs-on:
ubuntu-latest to runs-on: linux-amd64-cpu4 (NVIDIA self-hosted
runners). With the workflow on nv-gha-runners, the buildkitd-config
introduced in #48 does meaningful work: BuildKit routes docker.io
pulls through dockerhub.nvidia.com instead of going anonymously to
Docker Hub.

Three of the four matrix images today pull base images from
Docker Hub:

  - grafana-backup-tool: ysde/docker-grafana-backup-tool:1.4.2-slim
  - go-dev-1.24-debian:  golang:1.24.3
  - go-dev-1.24-alpine:  golang:1.24.3-alpine

(The tools image bases on nvcr.io and is unaffected either way.)

Rationale:

  1. Consistency. Every other docker-build path in this repo
     (composite action, reusable workflow, security-container-scan
     examples) assumes nv-gha-runners. This workflow being on
     ubuntu-latest was the odd one out.

  2. Platform best practice. Per NVIDIA GHA platform guidance,
     BuildKit pulls from nv-gha-runners should go through the
     dockerhub.nvidia.com Artifactory mirror. That is exactly what
     the buildkitd-config already in this step provides.

  3. Avoid future surprise. If anyone adds another Docker Hub base
     image to this matrix, the build is now insulated from anonymous
     rate-limit failures (nvbug 6225636).

  4. Org policy. NVIDIA-owned CI on NVIDIA OSS repos generally
     belongs on NVIDIA-provisioned runners.

This change supersedes #49 (the revert hotfix). If this PR is the
chosen direction, close #49.

Tracks: nvbug 6225636.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Brian R. Jackson <brijackson@nvidia.com>

copy-pr-bot Bot commented May 26, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@nvjaxzin

/ok to test

nvjaxzin marked this pull request as ready for review May 26, 2026 22:39
nvjaxzin merged commit f1ef91c into main May 26, 2026
2 checks passed
nvjaxzin deleted the refactor/build-cds-containers-nv-gha-runners branch May 26, 2026 22:41
github-actions Bot locked and limited conversation to collaborators May 26, 2026