
Add checkpoint utilities and workflow support #912

Merged
NickGeneva merged 35 commits into NVIDIA:main from NickGeneva:codex/checkpoint-catalog
Jun 25, 2026

Conversation

@NickGeneva NickGeneva commented Jun 8, 2026

Collaborator

Earth2Studio Pull Request

Description

Root PR for the checkpointing stack.

Adds lightweight checkpoint/session utilities for restartable inference workflows:

  • Checkpoint and CheckpointSession for cataloged restart rows
  • bind_checkpoint_state for pickle-free dataclass component state
  • append/overwrite modes, atomic commit staging, per-rank checkpoint directories, and no-op checkpoint sessions
  • deterministic, diagnostic, and ensemble workflow integration
  • restart-aware Persistence support for filtered-output workflow tests
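
The "atomic commit staging" item above can be illustrated with a generic temp-then-rename pattern. This is a minimal sketch using only the Python standard library, not the code from this PR: the function name `commit_atomically`, the `meta.json` file, and the `rank_0` / `row_000001` directory names are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def commit_atomically(root: Path, name: str, payload: dict) -> Path:
    """Write payload into a staging directory, then rename it into place."""
    root.mkdir(parents=True, exist_ok=True)
    final = root / name
    # Stage next to the destination so the rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(prefix=f".{name}.", dir=root))
    with open(staging / "meta.json", "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
        fh.flush()
        os.fsync(fh.fileno())
    # A same-filesystem rename is atomic on POSIX: readers see either no
    # checkpoint row or a complete one, never a half-written directory.
    os.rename(staging, final)
    return final


# Hypothetical per-rank layout: one directory per rank, one row per commit.
with tempfile.TemporaryDirectory() as tmp:
    committed = commit_atomically(
        Path(tmp) / "rank_0", "row_000001", {"lead_time_hours": 6}
    )
    restored = json.loads((committed / "meta.json").read_text())
    leftovers = sorted(p.name for p in committed.parent.iterdir())
```

If the process dies before the rename, only a dot-prefixed staging directory is left behind, which a reader can ignore or clean up.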

Follow-up stacked PRs add perturbation/model opt-ins and developer guidance.

Stack

  1. This PR: checkpoint utilities, built-in workflow support, Persistence support, and U-Cast restart support/example
  2. #922: Gaussian perturbation checkpoint state
  3. #923: FCN3 checkpoint state
  4. #924: developer skill for adding model checkpoint support

Validation

  • uv run ruff check earth2studio/models/px/ucast.py test/models/px/test_ucast.py examples/01_getting_started/04_checkpoint_restart.py earth2studio/models/px/persistence.py test/models/px/test_persistence.py test/utils/test_checkpoint.py
  • uv run pytest test/models/px/test_ucast.py::test_ucast_checkpoint_state_round_trip -q
  • uv run pytest test/models/px/test_ucast.py -q -m "not package"
  • uv run pytest test/models/px/test_persistence.py::test_persistence_checkpoint_state_round_trip -q
  • uv run pytest test/utils/test_checkpoint.py -q
  • uv run python examples/01_getting_started/04_checkpoint_restart.py
  • git diff --check

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.
  • Assess and address Greptile feedback (AI code review bot for guidance; use discretion, addressing all feedback is not required).

Dependencies

No new dependencies.

@greptile-apps

greptile-apps Bot commented Jun 8, 2026

Contributor

Greptile Summary

This PR adds a lightweight checkpoint catalog (earth2studio.utils.checkpoint) for restarting long-running inference workflows without pickle, plus resume support in the deterministic, diagnostic, and ensemble built-in workflows. Checkpoints store small restart metadata (latest lead time, optional artifacts, bound dataclass state) in atomically-committed, rank-separated directories; forecast arrays remain owned by the IO backend.

  • Checkpoint / CheckpointSelection: catalog with select, write, flush, and context-manager support; pickle-free serialization covering numpy arrays, tensors, and standard Python types; atomic staging via temp-rename; mode="overwrite"/"append" and keep_last pruning.
  • Workflow integration: deterministic and diagnostic workflows resume from the last completed lead_time by reading from the IO backend; ensemble workflow tracks per-ensemble_batch progress independently; the diagnostic fix saves prognostic step coords before diagnostic coord mapping so lead_time is never dropped.
  • Tests and docs: 13 tests covering round-trip hydration, schema mismatch, flush intervals, distributed rank detection, and all three workflow resume paths, plus a developer guide and a Persistence-based getting-started example.
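
The resume behavior described in the workflow bullet can be pictured as filtering the planned lead times against the last completed one, tracked separately per ensemble batch. The sketch below is an illustration only and does not use the earth2studio API; `remaining_lead_times` and the `progress` dictionary are hypothetical names.

```python
from datetime import timedelta
from typing import Optional


def remaining_lead_times(
    nsteps: int, step: timedelta, last_completed: Optional[timedelta]
) -> list[timedelta]:
    """Return the lead times that still need to be computed."""
    lead_times = [step * i for i in range(nsteps + 1)]
    if last_completed is None:
        # No checkpoint row yet: run the full forecast.
        return lead_times
    # Skip everything up to and including the last completed lead time.
    return [lt for lt in lead_times if lt > last_completed]


# Hypothetical per-batch progress: batch 0 finished through 12 h,
# batch 1 never started, so each batch resumes independently.
progress = {0: timedelta(hours=12), 1: None}
plan = {
    batch: remaining_lead_times(4, timedelta(hours=6), done)
    for batch, done in progress.items()
}
```

In a real workflow the state at the last completed lead time would also have to be read back from the IO backend before stepping forward; the sketch covers only the bookkeeping.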

Confidence Score: 5/5

Safe to merge; the new checkpoint machinery is well-tested and the workflow changes are backward-compatible (checkpoint parameter defaults to None).

The core checkpoint logic is correct and the three workflow resume paths are exercised by integration tests that pass. The two findings are quality nits: the schema hash uses repr(field.type) which can produce different strings if a user's dataclass module switches annotation styles, and AssertionError is caught in the IO read path where only data errors were intended. Neither causes incorrect behavior in normal use.

The _schema_hash function in earth2studio/utils/checkpoint.py and the AssertionError catch in earth2studio/run.py are worth a second look before the next checkpoint format version bump.
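
To see why a hash built from `repr(field.type)` depends on annotation style, the sketch below compares two dataclasses with the same schema. It is not the PR's `_schema_hash`; `repr_based_hash` and `resolved_hash` are hypothetical helpers, and the quoted annotation stands in for what `from __future__ import annotations` would store.

```python
import dataclasses
import hashlib
import typing


@dataclasses.dataclass
class EagerState:
    step: int  # field.type is the class int


@dataclasses.dataclass
class DeferredState:
    step: "int"  # field.type is the string 'int', as with deferred annotations


def repr_based_hash(cls: type) -> str:
    """Hash field names and the raw repr of each stored annotation."""
    parts = [f"{f.name}:{f.type!r}" for f in dataclasses.fields(cls)]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()


def resolved_hash(cls: type) -> str:
    """Hash field names and annotations after resolving string forms."""
    hints = typing.get_type_hints(cls)
    parts = [f"{f.name}:{hints[f.name]!r}" for f in dataclasses.fields(cls)]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()


repr_differs = repr_based_hash(EagerState) != repr_based_hash(DeferredState)
resolved_matches = resolved_hash(EagerState) == resolved_hash(DeferredState)
```

Resolving annotations with `typing.get_type_hints` is one possible way to make such a hash independent of annotation style; whether that suits the checkpoint format is a separate design decision.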

Important Files Changed

| Filename | Overview |
| --- | --- |
| earth2studio/utils/checkpoint.py | New 957-line checkpoint module implementing catalog storage, atomic commits, pickle-free serialization, and context-variable-based state binding; _schema_hash uses field.type!r, which is sensitive to from __future__ import annotations in the user's dataclass module |
| earth2studio/run.py | Deterministic, diagnostic, and ensemble workflows extended with checkpoint resume logic; _read_restart_from_io catches AssertionError, which can hide internal IO bugs |
| test/utils/test_checkpoint.py | Comprehensive 603-line test suite covering round-trip hydration, schema mismatch errors, flush intervals, keep_last pruning, distributed rank detection, workflow integration, and the diagnostic lead-time tracking fix |
| examples/01_getting_started/04_checkpoint_restart.py | New end-to-end example demonstrating a mid-run deterministic forecast restart using Persistence + ZarrBackend; clean and well-annotated |
| docs/userguide/developer/checkpointing.md | New developer guide covering basic use, catalog selection, custom loops, workflow resume, component state binding, serialization rules, distributed layout, and storage format |

Reviews (2): Last reviewed commit: "Fix diagnostic checkpoint lead time trac..."

Comment thread earth2studio/run.py
@NickGeneva

Collaborator Author

@greptile-apps

@NickGeneva NickGeneva force-pushed the codex/checkpoint-catalog branch from b7624e6 to 2cfe7c1 Compare June 15, 2026 16:14
@NickGeneva NickGeneva changed the title from "Adding Inference Workflow Checkpointing" to "Add checkpoint utilities and workflow support" Jun 15, 2026
Comment thread earth2studio/models/px/fcn.py
@NickGeneva NickGeneva requested a review from pzharrington June 16, 2026 18:56
Comment thread earth2studio/utils/checkpoint.py
Comment thread earth2studio/utils/checkpoint.py Outdated
Comment thread earth2studio/run.py Outdated
Comment thread docs/userguide/advanced/checkpointing.md
@NickGeneva NickGeneva added the ! - Release PRs or Issues releating to a release label Jun 22, 2026
Comment thread CHANGELOG.md Outdated
@NickGeneva

Collaborator Author

/blossom-ci

@NickGeneva

Collaborator Author

/blossom-ci

@NickGeneva

Collaborator Author

/blossom-ci

@NickGeneva

Collaborator Author

/blossom-ci

@NickGeneva

Collaborator Author

/blossom-ci

@NickGeneva

Collaborator Author

/blossom-ci

@NickGeneva NickGeneva force-pushed the codex/checkpoint-catalog branch 5 times, most recently from c0a347a to 6408acd Compare June 24, 2026 21:52
Comment thread earth2studio/run.py
@NickGeneva NickGeneva force-pushed the codex/checkpoint-catalog branch from 6408acd to 997b447 Compare June 24, 2026 22:26
Comment thread earth2studio/utils/checkpoint.py
@NickGeneva NickGeneva force-pushed the codex/checkpoint-catalog branch from 997b447 to a4bdde3 Compare June 24, 2026 22:29
@NickGeneva

Collaborator Author

/blossom-ci

@pzharrington pzharrington left a comment

Collaborator

LGTM 🚀

@NickGeneva

Collaborator Author

/blossom-ci

@NickGeneva NickGeneva merged commit 8d7ba92 into NVIDIA:main Jun 25, 2026
9 checks passed

Labels

! - Release PRs or Issues releating to a release

2 participants