[codex] Add Empire AI NVL72 training launcher by jder · Pull Request #777 · m2lines/Samudra

jder · 2026-06-23T20:15:42Z

Summary

add an Empire AI Beta NVL72 Slurm/Pyxis launcher for half-degree OM4 training
use the public m2lines ARM64 container manifest digest that works with Empire AI's Pyxis/Enroot stack
default W&B runs to the ocean_emulators entity instead of the stale samudra entity
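
The W&B default change in the last bullet can be sketched as follows. This is a minimal illustration, not the project's code: `WandBConfigSketch` and `wandb_init_kwargs` are hypothetical stand-ins for the repository's `WandBConfig` and whatever the launcher does with the overrides. The only detail taken from this PR is the `ocean_emulators` default entity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WandBConfigSketch:
    """Hypothetical stand-in for the project's WandBConfig."""

    entity: str = "ocean_emulators"  # default entity named in this PR
    project: Optional[str] = None  # the PR does not state a default project


def wandb_init_kwargs(
    cfg: WandBConfigSketch,
    entity_override: Optional[str] = None,
    project_override: Optional[str] = None,
) -> dict:
    """Build entity/project keyword arguments, letting explicit overrides win."""
    kwargs = {"entity": entity_override or cfg.entity}
    project = project_override or cfg.project
    if project is not None:
        kwargs["project"] = project
    return kwargs
```

The resulting dictionary could be passed as `wandb.init(**kwargs)`, since `entity` and `project` are both keyword arguments of `wandb.init`.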

Context

The NVL72 nodes use ARM CPUs, and the cluster's Pyxis/Enroot stack did not reliably import GHCR OCI index tags, so the launcher pins the ARM64 manifest digest instead of a tag. The script supports a 2-GPU smoke mode and a 9-node / 36-GPU full mode, validates the expected data layout under ~/data/om4_halfdeg, and passes the W&B entity and project overrides explicitly.
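
The mode selection and data-layout check described above can be sketched in Python. The actual launcher is an sbatch script, so this only illustrates the logic under stated assumptions: the names `Allocation`, `resolve_allocation`, and `missing_paths` are hypothetical, the 4 GPUs per node in full mode is inferred from 36 GPUs over 9 nodes, and the list of required entries is left to the caller because the PR does not enumerate them.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Allocation:
    nodes: int
    gpus_per_node: int

    @property
    def total_gpus(self) -> int:
        return self.nodes * self.gpus_per_node


def resolve_allocation(smoke: bool) -> Allocation:
    """Return the smoke (1 node, 2 GPUs) or full (9 nodes, 36 GPUs) shape."""
    if smoke:
        return Allocation(nodes=1, gpus_per_node=2)
    # Assumption: 36 GPUs spread evenly over 9 nodes, i.e. 4 per node.
    return Allocation(nodes=9, gpus_per_node=4)


def missing_paths(data_root: Path, required: list[str]) -> list[str]:
    """List the required relative paths that do not exist under data_root."""
    return [rel for rel in required if not (data_root / rel).exists()]
```

A caller would pass `Path.home() / "data" / "om4_halfdeg"` as `data_root` and abort before submitting the job if the returned list is non-empty.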

Validation

bash -n scripts/empireai_nvl72_train.sbatch
uv run python - <<'PY' ... WandBConfig().entity == "ocean_emulators" ... PY
pre-commit during commit: ruff, ruff-format, mypy, validate-schemas, detect-secrets, reuse lint
Empire AI smoke job 20654: 2 GPUs on 1 node, completed 0:0
Empire AI smoke job 20673: 8 GPUs on 2 nodes, completed 0:0
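
For reading the "completed 0:0" results above: assuming they are Slurm's `ExitCode` field as reported by `sacct`, the value has the form `exit_status:signal`, and `0:0` means the job script exited with status 0 and was not terminated by a signal. A small sketch of that interpretation, with hypothetical helper names:

```python
def parse_slurm_exit(code: str) -> tuple[int, int]:
    """Split a Slurm ExitCode string such as "0:0" into (exit_status, signal)."""
    exit_status, _, signal = code.partition(":")
    return int(exit_status), int(signal or 0)


def job_succeeded(code: str) -> bool:
    """A job counts as successful only with exit status 0 and no signal."""
    return parse_slurm_exit(code) == (0, 0)
```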

jder added 9 commits June 23, 2026 16:14

02e9624 Add Empire AI NVL72 training launcher
21dc50e Use SMOKE flag for Empire AI launcher
5b798bb Require explicit Empire AI GPU allocation
f84c0c1 Set Empire AI NCCL interface defaults
4fefe0a Bump PhysicsNeMo container to 26.05
3779505 Keep CUDA 13 wheels from PhysicsNeMo base
438b661 Update typing extensions for PhysicsNeMo arm64 tests
ee7d66a Warm Pyxis image cache before Empire AI training
448c47c Add Empire AI NCCL validation diagnostics

jder wants to merge 9 commits into main from codex/empireai-nvl72-launcher
