[codex] Add checkpoint resume parity test by fomo-bot · Pull Request #778 · m2lines/Samudra

fomo-bot · 2026-06-23T20:17:27Z

Summary

Make non-distributed grouped batch sampling deterministic by epoch/seed, so resumed training sees the same batch order as uninterrupted training.
Add sampler coverage for epoch-stable shuffling without crossing group boundaries.
Add a CUDA-marked v2 checkpoint/resume parity test that builds a tiny OM4-shaped Zarr dataset, runs a small configs/samudra_om4_v2/train.yaml model, checkpoints after epoch 1, resumes, and asserts epoch-2 train and validation losses match the uninterrupted run.

Notes

The test enables deterministic PyTorch algorithms while comparing CUDA losses so numeric drift does not mask checkpoint/resume regressions.
Local CUDA is now resolving to torch 2.9.1+cu130 from the cu130 uv source on this aarch64 host.

Validation

uv run pytest -m cuda tests/test_trainer.py::test_checkpoint_resume_matches_continuous_tiny_v2_cuda -q
uv run pytest tests/test_samplers.py -q
uv run pytest -m "not cuda" tests/test_trainer.py::test_checkpoint_ema tests/test_trainer.py::test_checkpoint_dynamic_loss_state tests/test_trainer.py::test_checkpoint_inference -q
uvx ruff check src/samudra/utils/samplers.py tests/test_samplers.py tests/test_trainer.py
uvx ruff format --check src/samudra/utils/samplers.py tests/test_samplers.py tests/test_trainer.py

jder · 2026-06-24T00:19:37Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b02bd56a16

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-24T00:23:53Z

    def __iter__(self):
-        # Create batch samplers for each group
-        batch_sampler = itertools.chain(*self._samplers)
+        rng = random.Random(self.seed + self.epoch)


Thread configured seed into sampler RNG

In non-distributed training, Trainer constructs EquivalenceGroupBatchSampler.from_datasets(...) without passing a seed after calling set_seed(cfg.experiment.rand_seed). Creating a fresh random.Random(self.seed + self.epoch) here therefore forces the default seed 0 for every run, so changing experiment.rand_seed no longer changes the train/validation batch order as it did when the sampler used the globally seeded random/SubsetRandomSampler state. This makes seed-sweep experiments reuse identical data orders; please pass the configured seed into this sampler or derive it from the seeded global RNG.

Useful? React with 👍 / 👎.

jder

@fomo-bot please take a look at the review feedback

jder · 2026-06-24T13:13:34Z

        batch_size: int,
        shuffle: bool = True,
        drop_last: bool = False,
+        seed: int = 0,


Let's just get rid of this default value here + below; using it is probably a bug.

jder · 2026-06-24T13:25:16Z

            for group in self.groups
        ]

+    def set_epoch(self, epoch: int) -> None:


Does this get called in the actual checkpoint resume path? (ie not during tests)

jder · 2026-06-24T13:25:38Z

    trainer.run()


+def _write_tiny_om4_zarr(data_root: Path) -> None:


Can we use one of the existing mock/remote/etc datasets please

jder · 2026-06-24T13:27:06Z

+def _tiny_v2_config(tmp_path: Path, data_root: Path, run_name: str) -> TrainConfig:
+    config_path = Path(__file__).parents[1] / "configs/samudra_om4_v2/train.yaml"
+    cfg = TrainConfig.from_yaml_and_cli(
+        [
+            str(config_path),
+            "--experiment.data_root",
+            str(data_root),
+            "--experiment.base_output_dir",
+            str(tmp_path / "runs"),
+            "--experiment.name",
+            run_name,
+            "--backend",
+            "cuda",
+            "--train_time.start",
+            "1975-01-03",
+            "--train_time.end",
+            "1975-02-12",
+            "--val_time.start",
+            "1975-02-12",
+            "--val_time.end",
+            "1975-03-24",
+        ]
+    )
+    cfg.epochs = 2
+    cfg.save_freq = 1
+    cfg.batch_size = 1
+    cfg.steps = [1]
+    cfg.step_transition = []
+    cfg.data.concurrent_compute = False
+    assert isinstance(cfg.data.loading, CpuDataLoadingConfig)
+    cfg.data.loading.num_workers = 0
+    cfg.data.loading.persistent_workers = False
+
+    assert hasattr(cfg.model, "unet")
+    cfg.model.unet.ch_width = [8, 8, 8, 8]
+    cfg.model.unet.dilation = [1, 1, 1, 1]
+    cfg.model.unet.n_layers = [1, 1, 1, 1]
+    return cfg


Let's move this into a test config file. eg reuse train_samudra_mini or make a train_samudra_2_mini if there are differences.

jder · 2026-06-24T13:28:20Z

+    if cfg.resume_ckpt_path is None:
+        trainer.best_val_loss = 1e8
+        trainer.best_inf_loss = 1e8
+    trainer.init_data_loaders(cur_step=cfg.steps[0])


Why do we need to reach into trainer like this?

jder · 2026-06-24T13:28:58Z

+    torch.backends.cudnn.benchmark = False
+    torch.backends.cudnn.deterministic = True
+    torch.use_deterministic_algorithms(True)


Why is this duplicated from _initialize_manual_run?

fomo-bot · 2026-06-24T13:48:29Z

Addressed the review feedback locally.

Changed samplers.py:48 so non-distributed grouped samplers require an explicit seed, and threaded cfg.experiment.rand_seed through the real trainer sampler construction in train.py:130. I also fixed the actual resume path so Trainer.run() no longer resets checkpoint-restored best_val_loss / best_inf_loss.

Reworked the CUDA resume parity test in test_trainer.py:124 to use the existing mock-om4 fixture, a dedicated test config at train_samudra_mini_resume.yaml:1, and the real Trainer.run() checkpoint/resume path. Added sampler coverage that different configured seeds produce different shuffled orders in test_samplers.py:127.

Validation:

uvx ruff check ... passed
uvx ruff format --check ... passed
uv run pytest tests/test_samplers.py -q passed
uv run pytest -m cuda tests/test_trainer.py::test_checkpoint_resume_matches_continuous_samudra_mini_cuda -q passed
uv run pytest -m "not cuda" tests/test_trainer.py::test_checkpoint_ema tests/test_trainer.py::test_checkpoint_dynamic_loss_state tests/test_trainer.py::test_checkpoint_inference -q passed
uv run pytest tests/test_datasets.py -q -k "not samudra_multi" passed

Full tests/test_datasets.py still has unrelated setup errors for train_samudra_multi.yaml because flash_perceiver is not installed. I did not post replies or resolve GitHub review threads.

Codex work

Repository: m2lines/Samudra
Trigger: [codex] Add checkpoint resume parity test #778 (review)
Request ID: 5d33b4b3-7ee5-4775-abe2-37951f76bb43
Accepted: true

Codex transcript

- I’ll use the `github:gh-address-comments` workflow here, since this is review-thread feedback on a PR. I’m going to load its instructions, inspect the unresolved review context, then make any concrete fixes the review calls for.
- I’ve got the PR and branch context. Next I’m pulling both the PR metadata/diff and the thread-aware review state, because inline review resolution state matters here.
- The unresolved feedback clusters around two areas: sampler seeding and the new checkpoint-resume test shape. I’m reading the trainer/test/config paths now so the fix follows the existing test harness instead of adding another custom path.
- A few of the comments are answered by code that already exists: the real `Trainer.run()` path does call `set_epoch` on both train and validation samplers before each epoch. The test currently bypasses that path, which is what made the review ask about it, so I’m going to reshap...[truncated]
- The fix I’m converging on is: require seed at sampler construction, pass `cfg.experiment.rand_seed` from `Trainer`, add sampler coverage for different seeds, and replace the ad hoc tiny Zarr/test config mutation with an existing mock-backed config file plus `Trainer.run()`. Tha...[truncated]
- There’s also a prior automated review saying the sampler seed default made every non-distributed training run use seed `0`. I’m treating that and jder’s “remove the default value” comment as the same root issue and making sampler seeds explicit at the real call sites an...[truncated]
- I’m about to edit four files: sampler API/call sites, sampler tests, the trainer checkpoint-resume test, and a small test config. I’m keeping this scoped to the reviewed behavior and avoiding GitHub replies or thread resolution.
- One important shape issue: `run()` does not return losses, so the stronger test will compare the final continuous checkpoint against the final resumed checkpoint. That covers model, optimizer, EMA, loss state, counters, and saved validation state after the real training path.
- The one small trainer internals tweak I may keep is the existing `num_batches_seen = 1` workaround used by the CUDA smoke test to avoid the torchinfo summary path on constrained GPUs. It does not bypass sampler epoching or checkpoint resume; it just prevents an expensive summary ...[truncated]
- While converting the test to `Trainer.run()`, I spotted a real resume-path issue: `run()` resets `best_val_loss` and `best_inf_loss` after `load_checkpoint()` has restored them. I’m going to move that initialization so fresh runs start with sentinel values and resumed runs keep...[truncated]
- I’m applying the scoped code changes now: explicit non-distributed sampler seed, trainer seed plumbing, preserving loaded best losses during `run()`, and replacing the custom CUDA resume test setup with a mock-backed test config.
- The sampler constructor and factories now require a keyword seed instead of silently falling back to zero. I’m wiring that through the inner distributed sampler and the trainer call sites next.
- `Trainer` now stores the configured random seed once, passes it into all train/validation grouped samplers, and no longer overwrites checkpoint-restored best losses when `run()` starts.
- I’m updating the tests to make their seed choices explicit, then I’ll add coverage that two configured seeds produce different shuffled orders for the non-distributed sampler.
- The direct call sites are now covered; I’m checking doc examples and the updated signatures so type/lint failures don’t come from stale snippets.
- Next I’m adding a dedicated test YAML for the resume-parity case. It uses the existing `mock-om4` fixture data layout, keeps inference disabled to focus on train/validation resume behavior, and saves every epoch so the first run can resume from epoch 1.
- I’m replacing the old manual epoch helper with a test that runs three real trainer executions: uninterrupted 2 epochs, interrupted 1 epoch, then resumed to epoch 2. The assertion compares the final saved checkpoints.
- The old manual helper block is still present after the new test, so I’m removing it now. That also removes the ad hoc Zarr dependency and the direct sampler poking the review called out.
- The code edits are in place. I’m going to run targeted formatting/lint checks and the sampler tests first, then I’ll decide whether the CUDA resume test is practical to run on this host.
- Sampler tests pass. Ruff format flagged `tests/test_trainer.py` for formatting and, unsurprisingly, cannot parse the YAML config when passed as a Python file, so I’ll format the Python targets only and rerun the checks.

jder · 2026-06-24T13:49:58Z

@fomo-bot please push your work up

fomo-bot · 2026-06-24T13:51:38Z

Pushed the review-fix commit to the existing PR branch.

Commit: f205a33e (Address checkpoint resume review feedback)

PR: #778

Local branch is clean and aligned with origin/codex/preemptible-checkpoint-resume-test.

Codex work

Repository: m2lines/Samudra
Trigger: [codex] Add checkpoint resume parity test #778 (comment)
Request ID: 9d5bf133-24ae-44f8-ae80-5d2ba0ad3247
Accepted: true

Codex transcript

- I’ll use the `github:yeet` workflow here because this is a request to publish the local PR fixes. I’ll verify the worktree, commit the scoped changes, and push the existing branch.
- I’m checking the current branch, auth, and diff scope before staging. Since this is an existing PR branch, I’ll push the new commit to that branch rather than opening another PR.
- The worktree only contains the review-feedback files from the previous fix, plus the new test config. I’m staging those explicit paths and committing them with a focused message.
- The staged scope matches the intended review fixes. I’m committing now, then I’ll push to `origin/codex/preemptible-checkpoint-resume-test`.
- The commit failed because this checkout has no local Git author identity configured. I’m going to set a repository-local identity for `fomo-bot` only, then retry the same commit.
- Commit `f205a33e` was created. I’m pushing it to the existing PR branch now.
- The push succeeded. I’m doing a final status check so I can confirm the local branch is clean and aligned with the remote.
- Pushed the review-fix commit to the existing PR branch.

Commit: `f205a33e` (`Address checkpoint resume review feedback`)  
PR: https://github.com/m2lines/Samudra/pull/778

Local branch is clean and aligned with `origin/codex/preemptible-checkpoint-resume-test`.

Add checkpoint resume parity test

b02bd56

chatgpt-codex-connector Bot reviewed Jun 24, 2026

View reviewed changes

jder reviewed Jun 24, 2026

View reviewed changes

Address checkpoint resume review feedback

f205a33

fomo-bot and others added 3 commits June 26, 2026 09:37

Add small v2 checkpoint resume config

6b4621f

Remove SamudraMini resume parity test

67a954a

Add preemptible Slurm training harness

be41099

		trainer.run()


		def _write_tiny_om4_zarr(data_root: Path) -> None:

Uh oh!

Conversation

fomo-bot commented Jun 23, 2026

Summary

Notes

Validation

Uh oh!

jder commented Jun 24, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

jder left a comment

Choose a reason for hiding this comment

Uh oh!

jder Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

jder Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

jder Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

jder Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

jder Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

jder Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

fomo-bot commented Jun 24, 2026

Uh oh!

jder commented Jun 24, 2026

Uh oh!

fomo-bot commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants