
Commit ec1c174

arcticflyclaude and Claude Opus 4.6 authored
ci: Auto-build and upload uv cache on miss (#608)
* ci: Auto-build and upload uv cache on miss

  Instead of failing CI when the prebuilt uv cache is missing (requiring a manual rebuild on a separate machine), gracefully fall back to building from scratch and uploading the cache for future runs.

  - Change permissions to `contents: write` for release asset uploads
  - Convert hard failures in cache restore to warnings with a cache-hit output
  - Add an upload step that archives the uv cache after `uv sync` and uploads it via the existing build_and_push_uv_cache.sh script (`--skip-build`)
  - Re-check before upload to avoid races when concurrent CI runs both miss the cache
  - Use `continue-on-error` so upload failures never break quality checks

* ci: Split into cache-status/build-cache/quality-checks jobs

  Move cache building to a separate `build-cache` job that runs on a larger runner (`art-cache-builder`) only when the cache is missing. This avoids OOM on the 16GB `art-large-runner` during cold builds.

  - `cache-status`: lightweight check for an existing cache (art-large-runner)
  - `build-cache`: builds and uploads the cache on a miss (art-cache-builder, >=32GB)
  - `quality-checks`: restores the cache and runs checks (art-large-runner)

  On a cache hit, build-cache is skipped and quality-checks runs immediately. On a cache miss, quality-checks waits for build-cache to finish first.

* ci: Trigger CI run

* ci: Limit uv concurrency in build-cache to avoid OOM

  Restrict parallel downloads (4), installs (1), and native build jobs (2) to keep peak memory usage within the 64GB runner limit.

* ci: Use Docker Buildx for cache builds to avoid OOM

  Docker Buildx manages memory via overlay layers and doesn't get OOM-killed the way bare `uv sync` does. This matches the pre-#560 approach and works on the existing art-large-runner (16GB) without needing a larger runner.

  - Add docker/ci-uv-cache.Dockerfile to build the uv cache in Docker
  - The build-cache job uses Buildx with the GHA cache, then extracts the archive and uploads it via the existing build_and_push_uv_cache.sh script
  - Remove the dependency on the art-cache-builder runner

* docs: Update CONTRIBUTING.md for automatic cache rebuilds

* ci: Run build-cache on art-cache-builder (16-core/64GB)

  The 4-core/16GB art-large-runner thrashes on the Docker build due to the large packages (torch, vllm, cudnn, etc.). Use a dedicated larger runner only for cache builds so they finish faster and more reliably.

* ci: Retrigger CI with clean cache state

* ci: Set cuDNN paths in Dockerfile for transformer-engine build

  transformer-engine-torch needs cudnn.h, which is provided by the pip nvidia-cudnn package. Set CUDNN_PATH and related env vars pointing to the venv location so the native extension can find the headers during compilation.

* ci: Replace Docker Buildx with direct uv sync on art-cache-builder

  With 64GB RAM on art-cache-builder, we can run build_and_push_uv_cache.sh directly without Docker. This is simpler, avoids Dockerfile env var complications (cuDNN paths, etc.), and reuses the existing script that already handles all the build details.

* ci: Retrigger all checks

* ci: Remove container from cache-status job

  The cache-status job only needs python3 and curl (both available natively on the runner) to compute a fingerprint and check the API. Removing the pytorch container avoids a slow image pull on every CI run.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
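The cache-hit and "re-check before upload" logic above hinges on one invariant: a release-asset part set only counts as a hit when parts `000..N` are all present and contiguous, so a partial or racing upload reads as a miss. A minimal standalone sketch of that check (the function name is illustrative; it mirrors the inline Python the workflow embeds):

```python
import re

def part_set_complete(asset_names, prefix):
    """True only if assets named <prefix>000..<prefix>NNN form a contiguous,
    zero-based set -- a partially uploaded cache must count as a miss."""
    pattern = re.compile(r"^" + re.escape(prefix) + r"(\d{3})$")
    indices = sorted(
        int(m.group(1))
        for name in asset_names
        for m in [pattern.match(name)]
        if m
    )
    return bool(indices) and indices == list(range(len(indices)))

prefix = "prek-uv-cache-abc123.tar.zst.part-"
print(part_set_complete([prefix + "000", prefix + "001"], prefix))  # → True
print(part_set_complete([prefix + "000", prefix + "002"], prefix))  # → False (001 missing)
```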
1 parent (5593918) · commit ec1c174

2 files changed: 113 additions & 43 deletions

.github/workflows/prek.yml
Lines changed: 108 additions & 22 deletions
@@ -6,22 +6,80 @@ on:
     branches: [main]

 permissions:
-  contents: read
+  contents: write
+
+env:
+  CI_BASE_IMAGE: "pytorch/pytorch:2.9.0-cuda12.8-cudnn9-devel"
+  CI_PYTHON_MM: "3.11"
+  CI_UV_CACHE_RELEASE_TAG: "prek-uv-cache"
+  CI_UV_CACHE_ASSET_PREFIX: "prek-uv-cache"
+  UV_CACHE_DIR: "/root/.cache/uv"
+  UV_LINK_MODE: "copy"
+  TORCH_CUDA_ARCH_LIST: "8.0"

 jobs:
-  quality-checks:
+  cache-status:
     runs-on: art-large-runner
+    outputs:
+      cache-hit: ${{ steps.check.outputs.cache-hit }}
+      fingerprint: ${{ steps.fingerprint.outputs.fingerprint }}
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Compute expected uv cache fingerprint
+        id: fingerprint
+        run: |
+          fp="$(python3 scripts/ci/compute_uv_fingerprint.py \
+            --pyproject pyproject.toml \
+            --uv-lock uv.lock \
+            --base-image "${CI_BASE_IMAGE}" \
+            --python-mm "${CI_PYTHON_MM}")"
+          echo "fingerprint=${fp}" >> "${GITHUB_OUTPUT}"
+          echo "Expected uv cache fingerprint: ${fp}"
+
+      - name: Check if uv cache exists
+        id: check
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          fingerprint="${{ steps.fingerprint.outputs.fingerprint }}"
+          part_prefix="${CI_UV_CACHE_ASSET_PREFIX}-${fingerprint}.tar.zst.part-"
+          release_api="https://api.github.com/repos/${GITHUB_REPOSITORY}/releases/tags/${CI_UV_CACHE_RELEASE_TAG}"
+
+          release_json="$(curl -fsSL \
+            -H "Authorization: Bearer ${GITHUB_TOKEN}" \
+            -H "Accept: application/vnd.github+json" \
+            "${release_api}" || true)"
+
+          if [ -z "${release_json}" ]; then
+            echo "Cache release '${CI_UV_CACHE_RELEASE_TAG}' not found."
+            echo "cache-hit=false" >> "${GITHUB_OUTPUT}"
+            exit 0
+          fi
+
+          hit="$(RELEASE_JSON="${release_json}" PART_PREFIX="${part_prefix}" python3 -c "
+          import json, os, re
+          payload = json.loads(os.environ['RELEASE_JSON'])
+          prefix = os.environ['PART_PREFIX']
+          pattern = re.compile(r'^' + re.escape(prefix) + r'(\d{3})$')
+          parts = sorted(
+              int(m.group(1))
+              for a in payload.get('assets', [])
+              for m in [pattern.match(a.get('name', ''))]
+              if m and a.get('id') is not None
+          )
+          print('true' if parts and parts == list(range(len(parts))) else 'false')
+          ")"
+          echo "cache-hit=${hit}" >> "${GITHUB_OUTPUT}"
+          echo "Cache hit: ${hit}"
+
+  build-cache:
+    needs: cache-status
+    if: needs.cache-status.outputs.cache-hit != 'true'
+    runs-on: art-cache-builder
     container:
       image: pytorch/pytorch:2.9.0-cuda12.8-cudnn9-devel
-      env:
-        CI_BASE_IMAGE: "pytorch/pytorch:2.9.0-cuda12.8-cudnn9-devel"
-        CI_PYTHON_MM: "3.11"
-        CI_UV_CACHE_RELEASE_TAG: "prek-uv-cache"
-        CI_UV_CACHE_ASSET_PREFIX: "prek-uv-cache"
-        UV_CACHE_DIR: "/root/.cache/uv"
-        UV_LINK_MODE: "copy"
-        TORCH_CUDA_ARCH_LIST: "8.0"
-
     steps:
       - name: Install CI dependencies
         run: |
@@ -31,30 +89,60 @@ jobs:
           curl -LsSf https://astral.sh/uv/install.sh | sh
           echo "/root/.local/bin" >> "${GITHUB_PATH}"

+      - name: Install gh CLI
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          GH_DL_URL="$(curl -fsSL \
+            -H "Authorization: Bearer ${GH_TOKEN}" \
+            https://api.github.com/repos/cli/cli/releases/latest \
+            | python3 -c "import json,sys;r=json.load(sys.stdin);print([a['browser_download_url'] for a in r['assets'] if a['name'].endswith('_linux_amd64.tar.gz')][0])")"
+          curl -fsSL "${GH_DL_URL}" | tar xz --strip-components=1 -C /usr/local
+          gh version
+
       - name: Checkout code
         uses: actions/checkout@v4

       - name: Mark workspace as a safe git directory
         run: |
           git config --global --add safe.directory "${GITHUB_WORKSPACE}"

-      - name: Compute expected uv cache fingerprint
-        id: expected-uv-fingerprint
+      - name: Build and upload uv cache
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
         run: |
-          fp="$(python3 scripts/ci/compute_uv_fingerprint.py \
-            --pyproject pyproject.toml \
-            --uv-lock uv.lock \
+          bash scripts/ci/build_and_push_uv_cache.sh \
             --base-image "${CI_BASE_IMAGE}" \
-            --python-mm "${CI_PYTHON_MM}")"
-          echo "fingerprint=${fp}" >> "${GITHUB_OUTPUT}"
-          echo "Expected uv cache fingerprint: ${fp}"
+            --python-mm "${CI_PYTHON_MM}"
+
+  quality-checks:
+    needs: [cache-status, build-cache]
+    if: ${{ !failure() && !cancelled() }}
+    runs-on: art-large-runner
+    container:
+      image: pytorch/pytorch:2.9.0-cuda12.8-cudnn9-devel
+    steps:
+      - name: Install CI dependencies
+        run: |
+          apt-get update
+          apt-get install -y --no-install-recommends ca-certificates curl git zstd
+          rm -rf /var/lib/apt/lists/*
+          curl -LsSf https://astral.sh/uv/install.sh | sh
+          echo "/root/.local/bin" >> "${GITHUB_PATH}"
+
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Mark workspace as a safe git directory
+        run: |
+          git config --global --add safe.directory "${GITHUB_WORKSPACE}"

       - name: Restore prebuilt uv cache
         env:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
         run: |
           release_api="https://api.github.com/repos/${GITHUB_REPOSITORY}/releases/tags/${CI_UV_CACHE_RELEASE_TAG}"
-          fingerprint="${{ steps.expected-uv-fingerprint.outputs.fingerprint }}"
+          fingerprint="${{ needs.cache-status.outputs.fingerprint }}"
           part_prefix="${CI_UV_CACHE_ASSET_PREFIX}-${fingerprint}.tar.zst.part-"

           release_json="$(curl -fsSL \
@@ -64,14 +152,12 @@ jobs:

           if [ -z "${release_json}" ]; then
             echo "::error::Missing cache release '${CI_UV_CACHE_RELEASE_TAG}'."
-            echo "::error::Build and upload cache with: bash scripts/ci/build_and_push_uv_cache.sh"
             exit 1
           fi

           part_selection_file="/tmp/uv-cache-part-selection.txt"
           if ! RELEASE_JSON="${release_json}" PART_PREFIX="${part_prefix}" python3 -c "import json, os, re, sys; payload=json.loads(os.environ['RELEASE_JSON']); part_prefix=os.environ['PART_PREFIX']; pattern=re.compile(r'^' + re.escape(part_prefix) + r'(\\d{3})$'); parts=[]; [parts.append((int(m.group(1)), int(a.get('id')), a.get('name'))) for a in payload.get('assets', []) for m in [pattern.match(a.get('name', ''))] if m and a.get('id') is not None]; parts.sort(key=lambda x: x[0]); indices=[p[0] for p in parts]; expected=list(range(len(parts))); print('\\n'.join(f'{asset_id} {name}' for _, asset_id, name in parts)) if parts and indices == expected else (_ for _ in ()).throw(SystemExit(2 if not parts else 3))" > "${part_selection_file}"; then
             echo "::error::No complete uv cache part set found for prefix '${part_prefix}'."
-            echo "::error::Build and upload cache with: bash scripts/ci/build_and_push_uv_cache.sh"
             exit 1
           fi

CONTRIBUTING.md
Lines changed: 5 additions & 21 deletions
@@ -37,35 +37,19 @@ uv run prek run pytest

 These checks are automatically run in CI for all pull requests. If your PR fails these checks, re-run the corresponding `prek` hook locally and commit any fixes.

-### CI uv Cache Refresh
+### CI uv Cache

 The PR `prek` workflow uses a prebuilt full `uv` cache (stored as a GitHub release asset) to avoid rebuilding heavy dependencies on every run.

-To refresh the cache after dependency changes, ensure your branch is rebased or merged with main, or checkout the PR merge branch, then run:
+The cache is keyed by a fingerprint computed from `pyproject.toml`, `uv.lock`, the base Docker image, and the Python version. When dependencies change, the fingerprint changes and CI automatically rebuilds the cache using Docker Buildx and uploads it for future runs. The first CI run after a dependency change will be slower while the cache is built.

-```bash
-bash scripts/ci/build_and_push_uv_cache.sh
-```
-
-This command builds a full cache archive locally (using `uv sync --frozen --all-extras --group dev --no-install-project`) and uploads a fingerprinted part set:
-
-- `prek-uv-cache-<fingerprint>.tar.zst.part-000`
-- `prek-uv-cache-<fingerprint>.tar.zst.part-001`
-- ...
-
-The script also prunes old immutable cache assets (keeps newest 4 by default).
-It requires GitHub CLI authentication (`gh auth login`) and should be run in an environment compatible with CI (same base CUDA image/toolchain).
-
-You can override native-build parallelism while preparing cache:
+To manually rebuild the cache (e.g., if the automatic build fails), run:

 ```bash
-bash scripts/ci/build_and_push_uv_cache.sh --build-jobs 2
+bash scripts/ci/build_and_push_uv_cache.sh
 ```

-By default, `--build-jobs auto` is used and resolves from available CPU and memory.
-By default, cache parts are split at `1900 MiB`; override with `--part-size-mb <n>` if needed.
-
-CI computes the expected cache fingerprint from `pyproject.toml`, `uv.lock`, base image, Python version, and cache asset layout contract. If no matching cache part set exists, CI fails fast and tells you to refresh cache with the script above.
+This requires GitHub CLI authentication (`gh auth login`) and should be run in an environment compatible with CI (same base CUDA image/toolchain).

 ### Release Process
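The removed CONTRIBUTING text above documents the part-set layout: the cache archive is split at 1900 MiB (GitHub release assets have a per-file size cap) into `part-000`, `part-001`, and so on. As an illustration of that layout contract only (hypothetical helper names, not the actual shell logic inside `build_and_push_uv_cache.sh`):

```python
def split_into_parts(data: bytes, part_size: int) -> dict[str, bytes]:
    """Split an archive into fixed-size chunks named part-000, part-001, ...
    Zero-padded indices keep lexicographic order equal to numeric order."""
    n_parts = (len(data) + part_size - 1) // part_size or 1
    return {
        f"part-{i:03d}": data[i * part_size : (i + 1) * part_size]
        for i in range(n_parts)
    }

def join_parts(parts: dict[str, bytes]) -> bytes:
    """Reassembly on the restore side is concatenation in index order."""
    return b"".join(parts[name] for name in sorted(parts))
```

Because the names are zero-padded, the restore step can sort asset names as strings and still concatenate parts in the right order.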
