[codex] Add Empire AI NVL72 training launcher #777

Draft
jder wants to merge 9 commits into main from codex/empireai-nvl72-launcher

jder (Member) commented Jun 23, 2026

Summary

  • add an Empire AI Beta NVL72 Slurm/Pyxis launcher for half-degree OM4 training
  • use the public m2lines ARM64 container manifest digest that works with Empire AI's Pyxis/Enroot stack
  • default W&B runs to the ocean_emulators entity instead of the stale samudra entity
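For readers unfamiliar with the difference between an image index tag and a per-platform manifest digest, the following is a minimal sketch of how an ARM64 digest can be picked out of an OCI image index. It is not the launcher's code: the function name `select_platform_digest` and the sample index (including its placeholder digests) are made up for illustration. Only the index layout (`manifests[].platform.architecture`, `os`, `digest`) follows the OCI image index format.

```python
import json


def select_platform_digest(
    index_json: str, architecture: str = "arm64", os_name: str = "linux"
) -> str:
    """Return the digest of the manifest matching the requested platform.

    `index_json` is the text of an OCI image index (a multi-arch manifest list).
    """
    index = json.loads(index_json)
    for entry in index.get("manifests", []):
        platform = entry.get("platform", {})
        if (
            platform.get("architecture") == architecture
            and platform.get("os") == os_name
        ):
            return entry["digest"]
    raise LookupError(f"no manifest for {os_name}/{architecture} in index")


# Hypothetical index with placeholder digests; not the real m2lines image.
EXAMPLE_INDEX = json.dumps(
    {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.index.v1+json",
        "manifests": [
            {
                "mediaType": "application/vnd.oci.image.manifest.v1+json",
                "digest": "sha256:" + "a" * 64,
                "size": 1234,
                "platform": {"architecture": "amd64", "os": "linux"},
            },
            {
                "mediaType": "application/vnd.oci.image.manifest.v1+json",
                "digest": "sha256:" + "b" * 64,
                "size": 1234,
                "platform": {"architecture": "arm64", "os": "linux"},
            },
        ],
    }
)

arm64_digest = select_platform_digest(EXAMPLE_INDEX)
```

Referencing the image by a digest selected this way, rather than by the index tag, means the container runtime does not have to resolve the multi-arch index itself.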

Context

The NVL72 nodes use ARM CPUs and the cluster's Pyxis/Enroot stack did not import GHCR OCI index tags reliably, so the launcher pins the ARM64 manifest digest. The script supports a 2-GPU smoke mode and a 9-node / 36-GPU full mode, validates the expected data layout under ~/data/om4_halfdeg, and passes W&B entity/project overrides explicitly.
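The mode selection and data-layout check described above could look roughly like the sketch below. This is an illustrative Python rendering, not the contents of `scripts/empireai_nvl72_train.sbatch`. The names `LaunchShape`, `MODES`, `resolve_mode`, and `missing_entries` are hypothetical, the 4-GPUs-per-node figure is inferred from 36 GPUs over 9 nodes, and the single-node smoke shape is taken from the smoke job listed under Validation. The entries expected under `~/data/om4_halfdeg` are not listed in this description, so the caller supplies them.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class LaunchShape:
    """Resource request for one launcher mode."""

    nodes: int
    gpus_per_node: int

    @property
    def total_gpus(self) -> int:
        return self.nodes * self.gpus_per_node


# Shapes taken from the PR description: a 2-GPU smoke mode and a
# 9-node / 36-GPU full mode (assumed here to be 4 GPUs per node).
MODES = {
    "smoke": LaunchShape(nodes=1, gpus_per_node=2),
    "full": LaunchShape(nodes=9, gpus_per_node=4),
}


def resolve_mode(mode: str) -> LaunchShape:
    """Map a mode name to its resource shape, rejecting unknown modes."""
    try:
        return MODES[mode]
    except KeyError:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODES)}")


def missing_entries(data_root: Path, required: list[str]) -> list[str]:
    """Return the required names that do not exist under data_root."""
    return [name for name in required if not (data_root / name).exists()]
```

A launcher following this pattern would call `missing_entries(Path.home() / "data" / "om4_halfdeg", required)` with its own list of required entries and abort before submitting if the result is non-empty, so that a misconfigured data directory fails fast rather than after the job has been scheduled.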

Validation

  • bash -n scripts/empireai_nvl72_train.sbatch
  • uv run python - <<'PY' ... WandBConfig().entity == "ocean_emulators" ... PY
  • pre-commit during commit: ruff, ruff-format, mypy, validate-schemas, detect-secrets, reuse lint
  • Empire AI smoke job 20654: 2 GPUs on 1 node, completed with exit code 0:0
  • Empire AI smoke job 20673: 8 GPUs on 2 nodes, completed with exit code 0:0
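Assuming the `0:0` in the smoke-job lines is Slurm's `ExitCode` field as reported by `sacct` (exit code, then the terminating signal, separated by a colon), a small helper like the following could be used to check such results programmatically. The function names are hypothetical and not part of this PR.

```python
def parse_slurm_exit_code(field: str) -> tuple[int, int]:
    """Split a Slurm ExitCode string such as "0:0" into (exit_code, signal)."""
    code, _, signal = field.strip().partition(":")
    return int(code), int(signal or 0)


def job_succeeded(field: str) -> bool:
    """A job counts as clean only if both the exit code and the signal are zero."""
    return parse_slurm_exit_code(field) == (0, 0)
```

For example, `job_succeeded("0:0")` is true, while a job killed by signal 9 would report `0:9` and be treated as a failure.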
