Add model checkpoint update developer skill#924
Conversation
Greptile SummaryThis PR adds a
|
| Filename | Overview |
|---|---|
| .claude/skills-dev/developer-model-checkpoint-update/SKILL.md | New developer skill guiding checkpoint restart implementation; the iterator restore pattern in Step 4 uses super().create_iterator() with a mid-rollout state in a way that contradicts the FCN3 reference implementation and would silently produce wrong outputs in models with initialization side effects. |
Reviews (1): Last reviewed commit: "Add model checkpoint update developer sk..." | Re-trigger Greptile
| Pattern for an iterator: | ||
|
|
||
| ```python | ||
| restored = False | ||
| if self.restart.checkpoint_state_loaded and self.restart.x is not None: | ||
| x = self.restart.x.to(x.device) | ||
| coords = OrderedDict( | ||
| (key, np.asarray(value).copy()) | ||
| for key, value in zip(self.restart.coord_keys, self.restart.coord_values) | ||
| ) | ||
| restored = True | ||
|
|
||
| iterator = super().create_iterator(x, coords) | ||
| if restored: | ||
| next(iterator) # consume the saved boundary internally | ||
|
|
||
| for x_out, coords_out in iterator: | ||
| self._save_checkpoint_state(x_out, coords_out) | ||
| yield x_out, coords_out |
There was a problem hiding this comment.
super().create_iterator() with mid-rollout state triggers re-initialization
The pattern calls super().create_iterator(restored_x, restored_coords) where restored_x is a mid-rollout tensor, not a true initial condition. Any initialization logic the parent runs in create_iterator (e.g., _reset_internal_state in FCN3, normalization stats, warm-up buffers) will be applied to the wrong state.
FCN3 — the reference implementation for this skill — deliberately avoids this pattern. Its _default_generator checks checkpoint_state_loaded inside the generator and conditionally skips the initial-condition yield, so the parent's initialization path is never invoked with a restored boundary. A developer following the pattern shown here and then running into a model with setup side effects in its create_iterator will produce silently wrong outputs, not a visible error.
Consider replacing the super().create_iterator() skeleton with the FCN3-style alternative (restore inside the generator, conditional yield x, coords only when not restored), or add an explicit callout that the super() pattern must only be used when the parent's iterator has no initialization side effects.
| ```python | ||
| checkpoint = Checkpoint("model", path=tmp_path, mode="append", state_policy="full") | ||
|
|
||
| with checkpoint as ckpt: | ||
| model = MyRestartableModel(...) | ||
| x1, coords1 = next(model.create_iterator(x0, coords0)) | ||
| ckpt.write(lead_time=coords1["lead_time"][-1]) | ||
| ckpt.flush() | ||
|
|
||
| with checkpoint.select(-1) as ckpt: | ||
| restarted = MyRestartableModel(...) | ||
| out, coords = next(restarted.create_iterator(x0, coords0)) | ||
|
|
||
| assert ... | ||
| ``` |
There was a problem hiding this comment.
Redundant
ckpt.flush() call after ckpt.write() with default flush_interval=1
Checkpoint defaults to flush_interval=1, so ckpt.write() already commits atomically on every call. The explicit ckpt.flush() after ckpt.write() in the skeleton is a no-op in this configuration. While harmless, it suggests to readers that write() alone is insufficient and may cause confusion when they see the pattern and omit flush() thinking it is optional.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
ada5c50 to
29da9e2
Compare
1ecfed1 to
c0fa0c6
Compare
29da9e2 to
2cd1502
Compare
c0fa0c6 to
9df903d
Compare
2cd1502 to
b610945
Compare
9df903d to
e152432
Compare
b610945 to
ae5b4b3
Compare
e152432 to
832dc17
Compare
ae5b4b3 to
a745cdc
Compare
832dc17 to
72ecf3d
Compare
a745cdc to
494846c
Compare
72ecf3d to
46f35d8
Compare
494846c to
6cd8881
Compare
46f35d8 to
125670e
Compare
Earth2Studio Pull Request
Description
Fourth PR in the checkpointing stack.
Adds a
.claude/skills-devdeveloper skill nameddeveloper-model-checkpoint-update. The skill guides developers through adding checkpoint restart support to an existing Earth2Studio model or component by:bind_checkpoint_stateminimal,state, andfullcheckpoint policiesStack
Validation
python3smoke check for required skill frontmatter/contentgit diff --checkChecklist
Dependencies
Depends on #923.