Add checkpoint utilities and workflow support #912
Greptile Summary
This PR adds a lightweight checkpoint catalog (
| Filename | Overview |
|---|---|
| earth2studio/utils/checkpoint.py | New 957-line checkpoint module implementing catalog storage, atomic commits, pickle-free serialization, and context-variable-based state binding; `_schema_hash` uses `field.type!r`, which is sensitive to `from __future__ import annotations` in the user's dataclass module |
| earth2studio/run.py | Deterministic, diagnostic, and ensemble workflows extended with checkpoint resume logic; _read_restart_from_io catches AssertionError which can hide internal IO bugs |
| test/utils/test_checkpoint.py | Comprehensive 603-line test suite covering round-trip hydration, schema mismatch errors, flush intervals, keep_last pruning, distributed rank detection, workflow integration, and the diagnostic lead-time tracking fix |
| examples/01_getting_started/04_checkpoint_restart.py | New end-to-end example demonstrating a mid-run deterministic forecast restart using Persistence + ZarrBackend; clean and well-annotated |
| docs/userguide/developer/checkpointing.md | New developer guide covering basic use, catalog selection, custom loops, workflow resume, component state binding, serialization rules, distributed layout, and storage format |
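
The `_schema_hash` note in the table can be reproduced in isolation. The sketch below is not the PR's implementation: `repr_based_hash` and `resolved_hash` are hypothetical helpers, and the string annotations on `DeferredState` only mimic what `from __future__ import annotations` does to every annotation in a module. It shows why hashing `field.type!r` yields different digests for the same logical schema, and one possible way to normalize the types first.

```python
import dataclasses
import hashlib
import typing


@dataclasses.dataclass
class EagerState:
    # Annotations are evaluated eagerly, so field.type is the class `int`.
    step: int
    scale: float


@dataclasses.dataclass
class DeferredState:
    # String annotations mimic `from __future__ import annotations`,
    # under which field.type is the string "int" rather than the class.
    step: "int"
    scale: "float"


def repr_based_hash(cls: type) -> str:
    """Hypothetical hash built from field.type!r (assumed shape, not the PR's code)."""
    parts = [f"{f.name}:{f.type!r}" for f in dataclasses.fields(cls)]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()


def resolved_hash(cls: type) -> str:
    """Hypothetical variant that resolves string annotations before hashing."""
    hints = typing.get_type_hints(cls)
    parts = [f"{f.name}:{hints[f.name]!r}" for f in dataclasses.fields(cls)]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()


# Same logical schema, different digests when hashing the raw repr.
print(repr_based_hash(EagerState) == repr_based_hash(DeferredState))
# Resolving the annotations first makes the digests agree.
print(resolved_hash(EagerState) == resolved_hash(DeferredState))
```

Whether resolving type hints is the right fix for the actual module depends on how it handles annotations that cannot be resolved at hash time; this sketch assumes they always can.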
Reviews (2): Last reviewed commit: "Fix diagnostic checkpoint lead time trac..."
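
The review's concern about `_read_restart_from_io` catching `AssertionError` can be illustrated with a small stand-alone sketch. None of the names below come from the PR: `read_restart_broad`, `read_restart_narrow`, `missing_row`, and `buggy_backend` are hypothetical, and `KeyError` is only an assumed stand-in for "no restart data available".

```python
from typing import Callable, Optional


def read_restart_broad(reader: Callable[[], dict]) -> Optional[dict]:
    """Treats a failed internal assertion the same as 'no restart data'."""
    try:
        return reader()
    except (KeyError, AssertionError):
        return None


def read_restart_narrow(reader: Callable[[], dict]) -> Optional[dict]:
    """Only the expected 'missing data' error maps to None."""
    try:
        return reader()
    except KeyError:
        return None


def missing_row() -> dict:
    # Expected condition: nothing has been checkpointed yet.
    raise KeyError("lead_time")


def buggy_backend() -> dict:
    # Simulates an internal invariant failing inside an IO backend.
    raise AssertionError("internal invariant violated")


# The broad handler silently reports "no restart" for a real bug.
print(read_restart_broad(buggy_backend))
# The narrow handler still handles the expected case.
print(read_restart_narrow(missing_row))
```

With the narrow handler, calling `read_restart_narrow(buggy_backend)` raises `AssertionError`, so the bug surfaces instead of turning into a fresh, non-resumed run.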
Earth2Studio Pull Request
Description
Root PR for the checkpointing stack.
Adds lightweight checkpoint/session utilities for restartable inference workflows:
- `Checkpoint` and `CheckpointSession` for cataloged restart rows
- `bind_checkpoint_state` for pickle-free dataclass component state
- `Persistence` support for filtered-output workflow tests

Follow-up stacked PRs add perturbation/model opt-ins and developer guidance.
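
As a rough illustration of what "pickle-free dataclass component state" with an atomic commit can look like, here is a minimal sketch using only the standard library. It does not use `Checkpoint`, `CheckpointSession`, or `bind_checkpoint_state`, whose signatures are not shown in this PR description; `CounterState`, `save_state`, and `load_state` are hypothetical names, and JSON is an assumed storage format.

```python
import dataclasses
import json
import os
import tempfile
from pathlib import Path


@dataclasses.dataclass
class CounterState:
    # Hypothetical component state made only of JSON-friendly fields.
    step: int = 0
    scale: float = 1.0


def save_state(state: CounterState, path: Path) -> None:
    """Write the state as JSON, then atomically move it into place."""
    payload = json.dumps(dataclasses.asdict(state), sort_keys=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_state(cls: type, path: Path):
    """Rebuild the dataclass, rejecting files whose fields do not match."""
    data = json.loads(path.read_text(encoding="utf-8"))
    expected = {f.name for f in dataclasses.fields(cls)}
    if set(data) != expected:
        raise ValueError(f"schema mismatch: {sorted(data)} != {sorted(expected)}")
    return cls(**data)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "state.json"
    save_state(CounterState(step=3, scale=0.5), target)
    restored = load_state(CounterState, target)

print(restored)
```

Writing to a temporary file in the same directory and renaming it means a reader sees either the previous complete file or the new one, never a partial write. Avoiding pickle means loading a checkpoint does not execute arbitrary code.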
Stack
Validation
- `uv run ruff check earth2studio/models/px/ucast.py test/models/px/test_ucast.py examples/01_getting_started/04_checkpoint_restart.py earth2studio/models/px/persistence.py test/models/px/test_persistence.py test/utils/test_checkpoint.py`
- `uv run pytest test/models/px/test_ucast.py::test_ucast_checkpoint_state_round_trip -q`
- `uv run pytest test/models/px/test_ucast.py -q -m "not package"`
- `uv run pytest test/models/px/test_persistence.py::test_persistence_checkpoint_state_round_trip -q`
- `uv run pytest test/utils/test_checkpoint.py -q`
- `uv run python examples/01_getting_started/04_checkpoint_restart.py`
- `git diff --check`

Checklist
Dependencies
No new dependencies.