Skip to content

Add model checkpoint update developer skill#924

Open
NickGeneva wants to merge 1 commit into
codex/checkpoint-fcn3from
codex/checkpoint-dev-skill
Open

Add model checkpoint update developer skill#924
NickGeneva wants to merge 1 commit into
codex/checkpoint-fcn3from
codex/checkpoint-dev-skill

Conversation

@NickGeneva

Copy link
Copy Markdown
Collaborator

Earth2Studio Pull Request

Description

Fourth PR in the checkpointing stack.

Adds a .claude/skills-dev developer skill named developer-model-checkpoint-update. The skill guides developers through adding checkpoint restart support to an existing Earth2Studio model or component by:

  • inspecting call, iterator, device, hook, and random-state behavior before editing
  • implementing a small pickle-free dataclass bound with bind_checkpoint_state
  • supporting minimal, state, and full checkpoint policies
  • staging live tensors on the checkpoint device when relevant
  • restoring iterator boundaries correctly for rollout continuation
  • adding one focused unit test that proves restart behavior

Stack

  1. Add checkpoint utilities and workflow support #912: checkpoint utilities and built-in workflow support
  2. Add Gaussian perturbation checkpoint state #922: Gaussian perturbation checkpoint state
  3. Add FCN3 checkpoint state #923: FCN3 checkpoint state
  4. This PR: developer skill for adding model checkpoint support

Validation

  • python3 smoke check for required skill frontmatter/content
  • git diff --check

Checklist

  • I am familiar with the Contributing Guidelines.
  • The documentation/developer guidance is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

Depends on #923.

@greptile-apps

greptile-apps Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a .claude/skills-dev guidance document (SKILL.md) that walks developers through adding checkpoint restart support to Earth2Studio prognostic models — covering dataclass state design, bind_checkpoint_state usage, the three checkpoint policies (minimal/state/full), tensor staging on the checkpoint device, and unit test structure.

  • The iterator restore pattern in Step 4 calls super().create_iterator(restored_x, restored_coords) with a mid-rollout tensor and then discards the first yield via next(iterator). This contradicts the FCN3 reference implementation, which handles restoration inside its own generator and avoids triggering the parent's initialization path with non-initial-condition inputs.
  • The Step 7 test skeleton includes a redundant ckpt.flush() immediately after ckpt.write() when using the default flush_interval=1, which may mislead developers about the necessity of the explicit flush call.

Confidence Score: 3/5

The skill file is developer-facing documentation; the core iterator restore pattern it teaches conflicts with the FCN3 reference implementation and could produce silently incorrect behavior when applied to models that initialize state in create_iterator.

The iterator restore pattern in Step 4 passes a mid-rollout state to super().create_iterator() and discards its first yield — but the FCN3 implementation deliberately avoids this by handling restore inside its own generator. Any developer following the skill verbatim on a model with setup side effects in create_iterator would get wrong outputs without a visible error.

.claude/skills-dev/developer-model-checkpoint-update/SKILL.md — specifically Step 4 (iterator restore pattern) and the Step 7 test skeleton.

Important Files Changed

Filename Overview
.claude/skills-dev/developer-model-checkpoint-update/SKILL.md New developer skill guiding checkpoint restart implementation; the iterator restore pattern in Step 4 uses super().create_iterator() with a mid-rollout state in a way that contradicts the FCN3 reference implementation and would silently produce wrong outputs in models with initialization side effects.

Reviews (1): Last reviewed commit: "Add model checkpoint update developer sk..." | Re-trigger Greptile

Comment on lines +159 to +177
Pattern for an iterator:

```python
restored = False
if self.restart.checkpoint_state_loaded and self.restart.x is not None:
x = self.restart.x.to(x.device)
coords = OrderedDict(
(key, np.asarray(value).copy())
for key, value in zip(self.restart.coord_keys, self.restart.coord_values)
)
restored = True

iterator = super().create_iterator(x, coords)
if restored:
next(iterator) # consume the saved boundary internally

for x_out, coords_out in iterator:
self._save_checkpoint_state(x_out, coords_out)
yield x_out, coords_out

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 super().create_iterator() with mid-rollout state triggers re-initialization

The pattern calls super().create_iterator(restored_x, restored_coords) where restored_x is a mid-rollout tensor, not a true initial condition. Any initialization logic the parent runs in create_iterator (e.g., _reset_internal_state in FCN3, normalization stats, warm-up buffers) will be applied to the wrong state.

FCN3 — the reference implementation for this skill — deliberately avoids this pattern. Its _default_generator checks checkpoint_state_loaded inside the generator and conditionally skips the initial-condition yield, so the parent's initialization path is never invoked with a restored boundary. A developer following the pattern shown here and then running into a model with setup side effects in its create_iterator will produce silently wrong outputs, not a visible error.

Consider replacing the super().create_iterator() skeleton with the FCN3-style alternative (restore inside the generator, conditional yield x, coords only when not restored), or add an explicit callout that the super() pattern must only be used when the parent's iterator has no initialization side effects.

Comment on lines +263 to +277
```python
checkpoint = Checkpoint("model", path=tmp_path, mode="append", state_policy="full")

with checkpoint as ckpt:
model = MyRestartableModel(...)
x1, coords1 = next(model.create_iterator(x0, coords0))
ckpt.write(lead_time=coords1["lead_time"][-1])
ckpt.flush()

with checkpoint.select(-1) as ckpt:
restarted = MyRestartableModel(...)
out, coords = next(restarted.create_iterator(x0, coords0))

assert ...
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Redundant ckpt.flush() call after ckpt.write() with default flush_interval=1

Checkpoint defaults to flush_interval=1, so ckpt.write() already commits atomically on every call. The explicit ckpt.flush() after ckpt.write() in the skeleton is a no-op in this configuration. While harmless, it suggests to readers that write() alone is insufficient and may cause confusion when they see the pattern and omit flush() thinking it is optional.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@NickGeneva NickGeneva force-pushed the codex/checkpoint-fcn3 branch from ada5c50 to 29da9e2 Compare June 15, 2026 17:31
@NickGeneva NickGeneva force-pushed the codex/checkpoint-dev-skill branch from 1ecfed1 to c0fa0c6 Compare June 15, 2026 17:31
@NickGeneva NickGeneva force-pushed the codex/checkpoint-fcn3 branch from 29da9e2 to 2cd1502 Compare June 15, 2026 17:45
@NickGeneva NickGeneva force-pushed the codex/checkpoint-dev-skill branch from c0fa0c6 to 9df903d Compare June 15, 2026 17:45
@NickGeneva NickGeneva force-pushed the codex/checkpoint-fcn3 branch from 2cd1502 to b610945 Compare June 15, 2026 17:48
@NickGeneva NickGeneva force-pushed the codex/checkpoint-dev-skill branch from 9df903d to e152432 Compare June 15, 2026 17:48
@NickGeneva NickGeneva force-pushed the codex/checkpoint-fcn3 branch from b610945 to ae5b4b3 Compare June 15, 2026 17:51
@NickGeneva NickGeneva force-pushed the codex/checkpoint-dev-skill branch from e152432 to 832dc17 Compare June 15, 2026 17:51
@NickGeneva NickGeneva force-pushed the codex/checkpoint-fcn3 branch from ae5b4b3 to a745cdc Compare June 15, 2026 17:54
@NickGeneva NickGeneva force-pushed the codex/checkpoint-dev-skill branch from 832dc17 to 72ecf3d Compare June 15, 2026 17:54
@NickGeneva NickGeneva force-pushed the codex/checkpoint-fcn3 branch from a745cdc to 494846c Compare June 15, 2026 17:56
@NickGeneva NickGeneva force-pushed the codex/checkpoint-dev-skill branch from 72ecf3d to 46f35d8 Compare June 15, 2026 17:56
@NickGeneva NickGeneva force-pushed the codex/checkpoint-fcn3 branch from 494846c to 6cd8881 Compare June 15, 2026 22:30
@NickGeneva NickGeneva force-pushed the codex/checkpoint-dev-skill branch from 46f35d8 to 125670e Compare June 15, 2026 22:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant