[codex] Add checkpoint resume parity test#778
Conversation
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b02bd56a16
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| def __iter__(self): | ||
| # Create batch samplers for each group | ||
| batch_sampler = itertools.chain(*self._samplers) | ||
| rng = random.Random(self.seed + self.epoch) |
There was a problem hiding this comment.
Thread configured seed into sampler RNG
In non-distributed training, Trainer constructs EquivalenceGroupBatchSampler.from_datasets(...) without passing a seed after calling set_seed(cfg.experiment.rand_seed). Creating a fresh random.Random(self.seed + self.epoch) here therefore forces the default seed 0 for every run, so changing experiment.rand_seed no longer changes the train/validation batch order as it did when the sampler used the globally seeded random/SubsetRandomSampler state. This makes seed-sweep experiments reuse identical data orders; please pass the configured seed into this sampler or derive it from the seeded global RNG.
Useful? React with 👍 / 👎.
| batch_size: int, | ||
| shuffle: bool = True, | ||
| drop_last: bool = False, | ||
| seed: int = 0, |
There was a problem hiding this comment.
Let's just get rid of this default value here + below; using it is probably a bug.
| for group in self.groups | ||
| ] | ||
|
|
||
| def set_epoch(self, epoch: int) -> None: |
There was a problem hiding this comment.
Does this get called in the actual checkpoint resume path? (ie not during tests)
| trainer.run() | ||
|
|
||
|
|
||
| def _write_tiny_om4_zarr(data_root: Path) -> None: |
There was a problem hiding this comment.
Can we use one of the existing mock/remote/etc datasets please
| def _tiny_v2_config(tmp_path: Path, data_root: Path, run_name: str) -> TrainConfig: | ||
| config_path = Path(__file__).parents[1] / "configs/samudra_om4_v2/train.yaml" | ||
| cfg = TrainConfig.from_yaml_and_cli( | ||
| [ | ||
| str(config_path), | ||
| "--experiment.data_root", | ||
| str(data_root), | ||
| "--experiment.base_output_dir", | ||
| str(tmp_path / "runs"), | ||
| "--experiment.name", | ||
| run_name, | ||
| "--backend", | ||
| "cuda", | ||
| "--train_time.start", | ||
| "1975-01-03", | ||
| "--train_time.end", | ||
| "1975-02-12", | ||
| "--val_time.start", | ||
| "1975-02-12", | ||
| "--val_time.end", | ||
| "1975-03-24", | ||
| ] | ||
| ) | ||
| cfg.epochs = 2 | ||
| cfg.save_freq = 1 | ||
| cfg.batch_size = 1 | ||
| cfg.steps = [1] | ||
| cfg.step_transition = [] | ||
| cfg.data.concurrent_compute = False | ||
| assert isinstance(cfg.data.loading, CpuDataLoadingConfig) | ||
| cfg.data.loading.num_workers = 0 | ||
| cfg.data.loading.persistent_workers = False | ||
|
|
||
| assert hasattr(cfg.model, "unet") | ||
| cfg.model.unet.ch_width = [8, 8, 8, 8] | ||
| cfg.model.unet.dilation = [1, 1, 1, 1] | ||
| cfg.model.unet.n_layers = [1, 1, 1, 1] | ||
| return cfg |
There was a problem hiding this comment.
Let's move this into a test config file. eg reuse train_samudra_mini or make a train_samudra_2_mini if there are differences.
| if cfg.resume_ckpt_path is None: | ||
| trainer.best_val_loss = 1e8 | ||
| trainer.best_inf_loss = 1e8 | ||
| trainer.init_data_loaders(cur_step=cfg.steps[0]) |
There was a problem hiding this comment.
Why do we need to reach into trainer like this?
| torch.backends.cudnn.benchmark = False | ||
| torch.backends.cudnn.deterministic = True | ||
| torch.use_deterministic_algorithms(True) |
There was a problem hiding this comment.
Why is this duplicated from _initialize_manual_run?
|
Addressed the review feedback locally. Changed samplers.py:48 so non-distributed grouped samplers require an explicit Reworked the CUDA resume parity test in test_trainer.py:124 to use the existing Validation:
Full Codex work
Codex transcript |
|
@fomo-bot please push your work up |
|
Pushed the review-fix commit to the existing PR branch. Codex work
Codex transcript |
Summary
configs/samudra_om4_v2/train.yamlmodel, checkpoints after epoch 1, resumes, and asserts epoch-2 train and validation losses match the uninterrupted run.Notes
torch 2.9.1+cu130from the cu130 uv source on this aarch64 host.Validation
uv run pytest -m cuda tests/test_trainer.py::test_checkpoint_resume_matches_continuous_tiny_v2_cuda -quv run pytest tests/test_samplers.py -quv run pytest -m "not cuda" tests/test_trainer.py::test_checkpoint_ema tests/test_trainer.py::test_checkpoint_dynamic_loss_state tests/test_trainer.py::test_checkpoint_inference -quvx ruff check src/samudra/utils/samplers.py tests/test_samplers.py tests/test_trainer.pyuvx ruff format --check src/samudra/utils/samplers.py tests/test_samplers.py tests/test_trainer.py