feat: add Qureg checkpointing via ADIOS2 (#747)#780
Conversation
Adds saveQuregToFile() and createQuregFromFile() to write a Qureg to disk and restore it later, behind the optional CMake flag ENABLE_CHECKPOINTING (which requires ADIOS2). The file records only the Qureg dimension (numQubits, isDensityMatrix) and its amplitudes - never the incidental deployment fields, nor derivable fields like numAmps - so a Qureg may be restored under a different deployment than it was saved with. Amplitudes are written as an ADIOS2 global array of interleaved (real, imag) reals, with each node contributing only its local slice, so the implementation streams without excessive memory and is distributed- and GPU-ready: GPU state is synced to host before writing and back after reading, and the global-array selection lets any node count read back its own portion. Also adds a validation error when the API is called in a build without checkpointing, reports isCheckpointingCompiled in the environment info (alongside isOmpCompiled, isGpuCompiled, etc), a guarded Catch2 test (tests/unit/checkpoint.cpp) exercising statevector and density-matrix round-trips, and documents the build flag in docs/compile.md.
| #ifdef ENABLE_CHECKPOINTING | ||
| bool isCompiled = true; | ||
| #else | ||
| bool isCompiled = false; | ||
| #endif |
There was a problem hiding this comment.
For this file to see the ENABLE_CHECKPOINTING (and other) preprocessors, you must include
#include "quest/include/config.h"Presently, the undefined macro will default to 0, setting isCompiled=false, and making this validation always trigger. This makes me suspect you did not compile and run the tests yourself before submitting this PR. Please see test instructions here
|
For ease of testing, I've made QuEST's build download adios2 when not locally installed |
|
Looks like the CI is successfully downloading adios2, but then some jobs fail to compile! Strangely, the logs (like this one) don't show an error - compilation just stops. Meanwhile, other jobs on the same platforms (like this one) compile fine! Quite irksome, but anyways suggests you should have a go at running your new unit tests yourself, as guided here. The new CMake provision to download adios2 should make that easier. You can compile the tests and run them locally, even with MPI and few processors, via: I'll dig into your implementation once we're assured it compiles and works! |
…INTING) QuEST defines all compile-time feature macros centrally in config.h (generated from config.h.in). The checkpointing flag was instead passed as a raw target_compile_definitions, so validation.cpp (which doesn't include config.h) saw it undefined and always reported 'not compiled' under the project's normal build path. Add #cmakedefine01 QUEST_COMPILE_CHECKPOINTING to config.h.in, set it from the ENABLE_CHECKPOINTING option, link ADIOS2 to the QuEST target, and switch the sources/tests to #include config.h + #if QUEST_COMPILE_CHECKPOINTING. Remove the per-target compile-definition hacks. Verified: ON build -> config.h has =1 and tests/tests '[checkpoint]' passes (CPU, CPU+OMP); default OFF build has =0 and compiles without ADIOS2.
|
@TysonRayJones You're right, thanks for catching this. I did build and run the tests locally before opening - but via Fixed: added Verified locally:
This was a single-node (non-MPI) build, so I ran the tests serially rather than under Still to do - I'll take these on next:
I'll report back here with results on each. Thanks for your patience while I get this to a properly-tested state. |
Summary
Implements
Quregcheckpointing (issue #747): two new API functions for writing aQuregto disk and restoring it later.This is useful for long-running HPC jobs vulnerable to timeout or failure - an evolving
Quregcan be periodically written to disk and resumed in a later process.Design
Following the approach suggested in the issue, checkpointing is built upon ADIOS2 and gated behind a new CMake option
ENABLE_CHECKPOINTING(OFF by default), so the ADIOS2 dependency is only required when the feature is requested.Quregdimension (numQubits,isDensityMatrix) and the full amplitude set. Incidental deployment fields (multithreading, GPU-acceleration, distribution) are not saved, nor are derivable fields (numAmps, etc). AQuregsaved by one deployment can therefore be restored by any other -createQuregFromFile()creates theQuregwith automatically chosen deployments and a precision marker (sizeof(qreal)) guards against restoring into a mismatched-precision build.(real, imag)reals (reinterpret_castof the contiguousqcompbuffer), keeping the format agnostic to precision and to ADIOS2's complex-type support.start = 2·rank·numAmpsPerNode,count = 2·numAmpsPerNode) of the global array, so the implementation streams without excessive memory and is distributed-ready. GPU-resident state is synced to host before writing (syncQuregFromGpu) and back after reading (syncQuregToGpu).ENABLE_CHECKPOINTINGraises a clear validation error (rather than failing to link), viavalidate_quregCheckpointingIsCompiled.The new API functions live in the existing C-and-C++-agnostic partition of
qureg.h(they pass noqcompby value, so remain C-ABI-safe).Scope
This first pass targets and is verified for CPU, single-node (the ADIOS2 build used here has MPI off). The code is written deployment-agnostically against the global-array abstraction, so enabling distribution (rebuild ADIOS2 + QuEST with MPI) and GPU should work unchanged; I'm happy to extend/verify those if preferred.
Testing
tests/unit/checkpoint.cpp(guarded byENABLE_CHECKPOINTING): statevector and density-matrix round-trips assert the restoredQuregmatches dimension and amplitudes../tests/tests "[checkpoint]"→ all assertions pass (CPU and CPU+OMP).maxAmpDiff = 0) for both statevector and density matrix.cmake .. -D ENABLE_CHECKPOINTING=ON -D CMAKE_PREFIX_PATH=$HOME/.local→ clean compile + link againstadios2::cxx.Build
cmake .. -D ENABLE_CHECKPOINTING=ON -D CMAKE_PREFIX_PATH=/path/to/adios2/prefix cmake --build . --parallelDocumented in
docs/compile.md(new "Checkpointing" section).Notes / open questions
.bpdirectory). Happy to adjust naming/engine conventions.AI-Assisted Contribution Disclosure
I used an AI assistant (Claude) to help explore the QuEST architecture, discuss the design, and review the code and tests. I traced the
Quregstruct, thequreg.cpp/validation.cpppatterns, and the amplitude-access and sync routines myself, made the implementation decisions (interleaved-reals storage to dodge ADIOS2's lack of a long-double-complex type; the global-array slice scheme for memory-efficient, distribution-ready I/O; gating via a validation error rather than a link error; where the API and validation belong), and verified all behaviour locally with the tests above plus bit-exact round-trip and build checks. I can explain and stand behind every line.