Skip to content

[codex] Add checkpoint resume parity test#778

Draft
fomo-bot wants to merge 5 commits into
mainfrom
codex/preemptible-checkpoint-resume-test
Draft

[codex] Add checkpoint resume parity test#778
fomo-bot wants to merge 5 commits into
mainfrom
codex/preemptible-checkpoint-resume-test

Conversation

@fomo-bot

Copy link
Copy Markdown
Collaborator

Summary

  • Make non-distributed grouped batch sampling deterministic by epoch/seed, so resumed training sees the same batch order as uninterrupted training.
  • Add sampler coverage for epoch-stable shuffling without crossing group boundaries.
  • Add a CUDA-marked v2 checkpoint/resume parity test that builds a tiny OM4-shaped Zarr dataset, runs a small configs/samudra_om4_v2/train.yaml model, checkpoints after epoch 1, resumes, and asserts epoch-2 train and validation losses match the uninterrupted run.

Notes

  • The test enables deterministic PyTorch algorithms while comparing CUDA losses so numeric drift does not mask checkpoint/resume regressions.
  • Local CUDA is now resolving to torch 2.9.1+cu130 from the cu130 uv source on this aarch64 host.

Validation

  • uv run pytest -m cuda tests/test_trainer.py::test_checkpoint_resume_matches_continuous_tiny_v2_cuda -q
  • uv run pytest tests/test_samplers.py -q
  • uv run pytest -m "not cuda" tests/test_trainer.py::test_checkpoint_ema tests/test_trainer.py::test_checkpoint_dynamic_loss_state tests/test_trainer.py::test_checkpoint_inference -q
  • uvx ruff check src/samudra/utils/samplers.py tests/test_samplers.py tests/test_trainer.py
  • uvx ruff format --check src/samudra/utils/samplers.py tests/test_samplers.py tests/test_trainer.py

@jder

jder commented Jun 24, 2026

Copy link
Copy Markdown
Member

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b02bd56a16

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

def __iter__(self):
# Create batch samplers for each group
batch_sampler = itertools.chain(*self._samplers)
rng = random.Random(self.seed + self.epoch)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Thread configured seed into sampler RNG

In non-distributed training, Trainer constructs EquivalenceGroupBatchSampler.from_datasets(...) without passing a seed after calling set_seed(cfg.experiment.rand_seed). Creating a fresh random.Random(self.seed + self.epoch) here therefore forces the default seed 0 for every run, so changing experiment.rand_seed no longer changes the train/validation batch order as it did when the sampler used the globally seeded random/SubsetRandomSampler state. This makes seed-sweep experiments reuse identical data orders; please pass the configured seed into this sampler or derive it from the seeded global RNG.

Useful? React with 👍 / 👎.

@jder jder left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fomo-bot please take a look at the review feedback

Comment thread src/samudra/utils/samplers.py Outdated
batch_size: int,
shuffle: bool = True,
drop_last: bool = False,
seed: int = 0,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just get rid of this default value here + below; using it is probably a bug.

for group in self.groups
]

def set_epoch(self, epoch: int) -> None:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this get called in the actual checkpoint resume path? (ie not during tests)

Comment thread tests/test_trainer.py Outdated
trainer.run()


def _write_tiny_om4_zarr(data_root: Path) -> None:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use one of the existing mock/remote/etc datasets please

Comment thread tests/test_trainer.py Outdated
Comment on lines +121 to +158
def _tiny_v2_config(tmp_path: Path, data_root: Path, run_name: str) -> TrainConfig:
config_path = Path(__file__).parents[1] / "configs/samudra_om4_v2/train.yaml"
cfg = TrainConfig.from_yaml_and_cli(
[
str(config_path),
"--experiment.data_root",
str(data_root),
"--experiment.base_output_dir",
str(tmp_path / "runs"),
"--experiment.name",
run_name,
"--backend",
"cuda",
"--train_time.start",
"1975-01-03",
"--train_time.end",
"1975-02-12",
"--val_time.start",
"1975-02-12",
"--val_time.end",
"1975-03-24",
]
)
cfg.epochs = 2
cfg.save_freq = 1
cfg.batch_size = 1
cfg.steps = [1]
cfg.step_transition = []
cfg.data.concurrent_compute = False
assert isinstance(cfg.data.loading, CpuDataLoadingConfig)
cfg.data.loading.num_workers = 0
cfg.data.loading.persistent_workers = False

assert hasattr(cfg.model, "unet")
cfg.model.unet.ch_width = [8, 8, 8, 8]
cfg.model.unet.dilation = [1, 1, 1, 1]
cfg.model.unet.n_layers = [1, 1, 1, 1]
return cfg

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this into a test config file. eg reuse train_samudra_mini or make a train_samudra_2_mini if there are differences.

Comment thread tests/test_trainer.py Outdated
Comment on lines +165 to +168
if cfg.resume_ckpt_path is None:
trainer.best_val_loss = 1e8
trainer.best_inf_loss = 1e8
trainer.init_data_loaders(cur_step=cfg.steps[0])

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to reach into trainer like this?

Comment thread tests/test_trainer.py
Comment on lines +198 to +200
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.use_deterministic_algorithms(True)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this duplicated from _initialize_manual_run?

@fomo-bot

Copy link
Copy Markdown
Collaborator Author

Addressed the review feedback locally.

Changed samplers.py:48 so non-distributed grouped samplers require an explicit seed, and threaded cfg.experiment.rand_seed through the real trainer sampler construction in train.py:130. I also fixed the actual resume path so Trainer.run() no longer resets checkpoint-restored best_val_loss / best_inf_loss.

Reworked the CUDA resume parity test in test_trainer.py:124 to use the existing mock-om4 fixture, a dedicated test config at train_samudra_mini_resume.yaml:1, and the real Trainer.run() checkpoint/resume path. Added sampler coverage that different configured seeds produce different shuffled orders in test_samplers.py:127.

Validation:

  • uvx ruff check ... passed
  • uvx ruff format --check ... passed
  • uv run pytest tests/test_samplers.py -q passed
  • uv run pytest -m cuda tests/test_trainer.py::test_checkpoint_resume_matches_continuous_samudra_mini_cuda -q passed
  • uv run pytest -m "not cuda" tests/test_trainer.py::test_checkpoint_ema tests/test_trainer.py::test_checkpoint_dynamic_loss_state tests/test_trainer.py::test_checkpoint_inference -q passed
  • uv run pytest tests/test_datasets.py -q -k "not samudra_multi" passed

Full tests/test_datasets.py still has unrelated setup errors for train_samudra_multi.yaml because flash_perceiver is not installed. I did not post replies or resolve GitHub review threads.

Codex work

Codex transcript

- I’ll use the `github:gh-address-comments` workflow here, since this is review-thread feedback on a PR. I’m going to load its instructions, inspect the unresolved review context, then make any concrete fixes the review calls for.
- I’ve got the PR and branch context. Next I’m pulling both the PR metadata/diff and the thread-aware review state, because inline review resolution state matters here.
- The unresolved feedback clusters around two areas: sampler seeding and the new checkpoint-resume test shape. I’m reading the trainer/test/config paths now so the fix follows the existing test harness instead of adding another custom path.
- A few of the comments are answered by code that already exists: the real `Trainer.run()` path does call `set_epoch` on both train and validation samplers before each epoch. The test currently bypasses that path, which is what made the review ask about it, so I’m going to reshap...[truncated]
- The fix I’m converging on is: require seed at sampler construction, pass `cfg.experiment.rand_seed` from `Trainer`, add sampler coverage for different seeds, and replace the ad hoc tiny Zarr/test config mutation with an existing mock-backed config file plus `Trainer.run()`. Tha...[truncated]
- There’s also a prior automated review saying the sampler seed default made every non-distributed training run use seed `0`. I’m treating that and jder’s “remove the default value” comment as the same root issue and making sampler seeds explicit at the real call sites an...[truncated]
- I’m about to edit four files: sampler API/call sites, sampler tests, the trainer checkpoint-resume test, and a small test config. I’m keeping this scoped to the reviewed behavior and avoiding GitHub replies or thread resolution.
- One important shape issue: `run()` does not return losses, so the stronger test will compare the final continuous checkpoint against the final resumed checkpoint. That covers model, optimizer, EMA, loss state, counters, and saved validation state after the real training path.
- The one small trainer internals tweak I may keep is the existing `num_batches_seen = 1` workaround used by the CUDA smoke test to avoid the torchinfo summary path on constrained GPUs. It does not bypass sampler epoching or checkpoint resume; it just prevents an expensive summary ...[truncated]
- While converting the test to `Trainer.run()`, I spotted a real resume-path issue: `run()` resets `best_val_loss` and `best_inf_loss` after `load_checkpoint()` has restored them. I’m going to move that initialization so fresh runs start with sentinel values and resumed runs keep...[truncated]
- I’m applying the scoped code changes now: explicit non-distributed sampler seed, trainer seed plumbing, preserving loaded best losses during `run()`, and replacing the custom CUDA resume test setup with a mock-backed test config.
- The sampler constructor and factories now require a keyword seed instead of silently falling back to zero. I’m wiring that through the inner distributed sampler and the trainer call sites next.
- `Trainer` now stores the configured random seed once, passes it into all train/validation grouped samplers, and no longer overwrites checkpoint-restored best losses when `run()` starts.
- I’m updating the tests to make their seed choices explicit, then I’ll add coverage that two configured seeds produce different shuffled orders for the non-distributed sampler.
- The direct call sites are now covered; I’m checking doc examples and the updated signatures so type/lint failures don’t come from stale snippets.
- Next I’m adding a dedicated test YAML for the resume-parity case. It uses the existing `mock-om4` fixture data layout, keeps inference disabled to focus on train/validation resume behavior, and saves every epoch so the first run can resume from epoch 1.
- I’m replacing the old manual epoch helper with a test that runs three real trainer executions: uninterrupted 2 epochs, interrupted 1 epoch, then resumed to epoch 2. The assertion compares the final saved checkpoints.
- The old manual helper block is still present after the new test, so I’m removing it now. That also removes the ad hoc Zarr dependency and the direct sampler poking the review called out.
- The code edits are in place. I’m going to run targeted formatting/lint checks and the sampler tests first, then I’ll decide whether the CUDA resume test is practical to run on this host.
- Sampler tests pass. Ruff format flagged `tests/test_trainer.py` for formatting and, unsurprisingly, cannot parse the YAML config when passed as a Python file, so I’ll format the Python targets only and rerun the checks.

@jder

jder commented Jun 24, 2026

Copy link
Copy Markdown
Member

@fomo-bot please push your work up

@fomo-bot

Copy link
Copy Markdown
Collaborator Author

Pushed the review-fix commit to the existing PR branch.



Commit: f205a33e (Address checkpoint resume review feedback)

PR: #778



Local branch is clean and aligned with origin/codex/preemptible-checkpoint-resume-test.

Codex work

Codex transcript

- I’ll use the `github:yeet` workflow here because this is a request to publish the local PR fixes. I’ll verify the worktree, commit the scoped changes, and push the existing branch.
- I’m checking the current branch, auth, and diff scope before staging. Since this is an existing PR branch, I’ll push the new commit to that branch rather than opening another PR.
- The worktree only contains the review-feedback files from the previous fix, plus the new test config. I’m staging those explicit paths and committing them with a focused message.
- The staged scope matches the intended review fixes. I’m committing now, then I’ll push to `origin/codex/preemptible-checkpoint-resume-test`.
- The commit failed because this checkout has no local Git author identity configured. I’m going to set a repository-local identity for `fomo-bot` only, then retry the same commit.
- Commit `f205a33e` was created. I’m pushing it to the existing PR branch now.
- The push succeeded. I’m doing a final status check so I can confirm the local branch is clean and aligned with the remote.
- Pushed the review-fix commit to the existing PR branch.

Commit: `f205a33e` (`Address checkpoint resume review feedback`)  
PR: https://github.com/m2lines/Samudra/pull/778

Local branch is clean and aligned with `origin/codex/preemptible-checkpoint-resume-test`.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants