Handle pickling for generic pydantic models, fixes #210 #211
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #211 +/- ##
==========================================
- Coverage 95.37% 95.32% -0.05%
==========================================
Files 142 143 +1
Lines 11404 11608 +204
Branches 620 633 +13
==========================================
+ Hits 10876 11065 +189
- Misses 399 412 +13
- Partials 129 131 +2
f4c4745 to 840d895
Signed-off-by: Nijat K <nijat.khanbabayev@gmail.com>
840d895 to 7404c07
Thanks for the very thorough writeup: the problem statement and the cross-process repro/test harness are great. I'd like to propose a much smaller implementation that I believe covers a strict superset of cases.

Proposal: ~6 lines on the metaclass

Add a

```python
import operator

class _SerializeAsAnyMeta(ModelMetaclass):
    # ... existing __new__ ...
    def __reduce__(cls):
        md = getattr(cls, "__pydantic_generic_metadata__", None)
        if md and md.get("origin") is not None and md.get("args"):
            return (operator.getitem, (md["origin"], md["args"]))
        return super().__reduce__()
```

The existing

Why this is sufficient

Pickle saves an instance by first saving its class. For a class object, pickle dispatches on

Concretely, the cases the PR description enumerates are all handled by the recursive pickle pass alone:

Bonus: it also covers the field-value case the PR explicitly skips

The PR notes that

Verified empirically

I ran a fresh-process reproduction (producer process serializes; cold consumer process only imports origin classes, never materializes the specializations) against the metaclass-only fix for:

What I'd suggest keeping from this PR

The implementation can shrink significantly, but the test file is the most valuable part of the PR and should stay essentially as-is. The subprocess harness in
TLDR

The reconstruction recipe in your suggestion is right: rebuilding as That means a repro can pass for a different reason: if When I isolate the metaclass-only patch against the ccflow subprocess matrix, the cases from your comment still fail in a cold consumer ( So I think the simplification has the right rebuild idea, but not the right hook point. Could you share your repro script? I’d like to check whether it also passes on main / without the patch, or whether there’s another reducer path active that I’m missing.

Thanks! This is very helpful. I tried reducing the branch to the metaclass-only implementation and I think I found the discrepancy, but I would like to compare against your repro before drawing a firm conclusion.

The rebuild recipe itself looks right: if Where I am getting stuck is that, in my isolated repros,

Here is a standalone pydantic-shaped version of that check:

```python
import base64
import subprocess
import sys
import tempfile
from pathlib import Path

# Source of a throwaway models module; CALLS records which reducer hooks fire.
models_code = """
from typing import Generic, TypeVar
from pydantic import BaseModel
from pydantic._internal._model_construction import ModelMetaclass

T = TypeVar("T")
CALLS = []

class ReproMeta(ModelMetaclass):
    def __reduce__(cls):
        CALLS.append(("__reduce__", cls.__name__))
        md = getattr(cls, "__pydantic_generic_metadata__", None)
        if md and md.get("origin") is not None and md.get("args"):
            import operator
            return (operator.getitem, (md["origin"], md["args"]))
        return super().__reduce__()

    def __reduce_ex__(cls, protocol):
        CALLS.append(("__reduce_ex__", cls.__name__, protocol))
        return super().__reduce_ex__(protocol)

class Box(BaseModel, Generic[T], metaclass=ReproMeta):
    value: T
"""

with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "pydantic_repro_models.py"
    module_path.write_text(models_code)

    # Producer: materializes Box[int] and serializes an instance.
    producer = f"""
import base64
import cloudpickle
import sys
sys.path.insert(0, {tmp!r})
import pydantic_repro_models as models
# Top-level specialization materialization, matching the risky path from issue #210.
payload = cloudpickle.dumps(models.Box[int](value=5), protocol=5)
print("producer reducer calls:", models.CALLS)
print(base64.b64encode(payload).decode())
"""
    producer_result = subprocess.run(
        [sys.executable, "-c", producer],
        capture_output=True,
        text=True,
        timeout=30,
    )
    print(producer_result.stdout.splitlines()[0])
    encoded_payload = producer_result.stdout.splitlines()[1]

    # Cold consumer: imports the origin module but never evaluates Box[int].
    consumer = f"""
import base64
import cloudpickle
import sys
sys.path.insert(0, {tmp!r})
import pydantic_repro_models as models
print("consumer has Box[int] before:", hasattr(models, "Box[int]"))
obj = cloudpickle.loads(base64.b64decode({encoded_payload!r}))
print(obj)
"""
    consumer_result = subprocess.run(
        [sys.executable, "-c", consumer],
        capture_output=True,
        text=True,
        timeout=30,
    )
    print("consumer returncode:", consumer_result.returncode)
    print("consumer stdout:", consumer_result.stdout.strip())
    print("consumer stderr:", consumer_result.stderr.strip())
```

This prints the important reducer signal:

and the cold consumer fails by generated-name lookup:

So this is the same pydantic generic shape as the ccflow issue, but the proposed metaclass reducer is still not the hook that

The confusing part is that some fresh-process repros can still pass, but I believe that is due to

So I think the issue #210 caveat is the important thing to preserve in any repro comparison: a passing fresh-process case may only show that

For ccflow, the metaclass-only patch in

Those still fail in a cold consumer with errors like:

and similarly for the nested/callable/optional/field-value cases. The focused test command I used for the ccflow matrix was:

```shell
python -m pytest ccflow/tests/test_base_cloudpickle.py::test_ccflow_generic_specializations_pickle_across_fresh_processes -q
```

The reducer itself is sanity-checked directly in

So my current read is: the simplification has merit in the reconstruction recipe, but the proposed hook point does not seem to be the mechanism that

Could you share the exact repro script you used? I want to compare it directly before deciding whether to keep the current instance-level reducer approach or explore a smaller alternative, maybe via a registered class/metaclass reducer path.
Pydantic Generic Pickle and Ray Notes
TLDR
Concrete Pydantic generic BaseModel specializations can be fragile across fresh-process pickle/cloudpickle boundaries. ccflow works around this for concrete generic ccflow.BaseModel instances by pickling them as stable origin + args + state data and recreating the specialized class during load.

Fixes #210.
Summary
This issue is a mismatch between three things:
- GenericResult[int], dynamically at runtime.

The bug is not specific to GenericResult. Any concrete Pydantic generic BaseModel specialization can have the same problem if that specialized class object crosses a process boundary before the receiver has materialized it.

The fix is confusing because there are two separate objects involved:
- GenericResult[int](value=5)
- generic type arguments, such as GenericResult[int], ListResult[int], or CallableModelGenericType[NullContext, GenericResult[int]]

Fixing only the top-level instance class is not enough if generated generic classes are also embedded inside the type arguments that define that instance's specialized class.
What Pydantic Does
Pydantic v2 does not require pydantic.generics.GenericModel; normal BaseModel subclasses can be generic. When code evaluates:

Pydantic runs BaseModel.__class_getitem__. In the local environment, this is Pydantic 2.13.4.

The relevant source is pinned to the v2.13.4 tag here:
https://github.com/pydantic/pydantic/blob/v2.13.4/pydantic/main.py#L904-L969

The relevant flow is:
- GenericResult[int].
- _generics.create_generic_submodel(...).

The generated class has metadata like:

with shape:

```python
{
    "origin": GenericResult,
    "args": (int,),
    "parameters": (),
}
```

For module-level models like this repro, origin is stable and importable. In general, the reducer still relies on the origin class itself being importable or otherwise serializable by cloudpickle. The generated specialized class is runtime-created.
In Pydantic's _generics.create_generic_submodel, the new subclass is created with the origin model's __module__ and generic metadata. The relevant source is pinned here:
https://github.com/pydantic/pydantic/blob/v2.13.4/pydantic/_internal/_generics.py#L105-L149

Then Pydantic conditionally registers the generated class in the origin module when _get_caller_frame_info(...) decides the specialization was created from a global context:
https://github.com/pydantic/pydantic/blob/v2.13.4/pydantic/_internal/_generics.py#L140-L147

That global registration is the key point. Sometimes a process has ccflow.result.generic.GenericResult[int] as a module attribute because that process already materialized it in a context Pydantic considers global. A fresh process may not.

What Pickle and Cloudpickle Do
Pickle generally reconstructs class objects by global reference:
For an ordinary class, this is fine:
can be imported in any process.
For a generated specialization, pickle/cloudpickle may see:
and serialize it by reference as if this were importable:
That works in the process that created and registered the class. It can fail in a fresh process:
Ray makes this easy to hit because Ray workers are separate Python processes. Importing GenericResult in the worker does not necessarily create GenericResult[int]. If unpickling happens first, the generated class name is missing.

Pydantic's Instance Pickle Behavior
Pydantic already has instance pickle machinery. BaseModel.__getstate__() returns a dict containing Pydantic's internal model state, and BaseModel.__setstate__() restores that state directly. The source is pinned here:
https://github.com/pydantic/pydantic/blob/v2.13.4/pydantic/main.py#L1145-L1160

```python
{
    "__dict__": self.__dict__,
    "__pydantic_extra__": self.__pydantic_extra__,
    "__pydantic_fields_set__": self.__pydantic_fields_set__,
    "__pydantic_private__": private,
}
```

That is important because pickle should preserve an already-validated object. It should not rerun normal validation, coerce values again, drop private attrs, or rebuild the object through model_validate.

ccflow already overrides __getstate__/__setstate__ slightly to make __pydantic_fields_set__ deterministic in pickle output.

The new fix keeps this Pydantic state-based behavior. For concrete generic specializations of ccflow.BaseModel, it changes only the reduce recipe.

The Exact Failure
A minimal failing shape is:
Then in a fresh process that has imported GenericResult but has not evaluated GenericResult[int]:

can fail because the receiver has the origin class:

but not the generated specialization:

The problem is broader than the top-level class:

Here the top-level class is GenericResult[ListResult[int]], and the generic arg contains another generated class, ListResult[int].

This also appears inside typing aliases:

typing.Callable is especially annoying because its parameter types can appear as a plain Python list inside typing.get_args():

So a helper that only walks typing.get_args() recursively can still miss generated classes inside that list.

Generated classes can also appear as field values:

Even if the instance class is reconstructed correctly, the field value ListResult[int] would otherwise be pickled by its fragile generated class name. That field-value shape is not fixed by the current change. The current fix deliberately covers generated classes used as the instance class and inside generic type arguments, while leaving arbitrary class objects stored in model state to pickle/cloudpickle's normal behavior.
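The typing.Callable point above can be checked with the standard library alone. The following is a minimal sketch, not ccflow code: Marker is a hypothetical stand-in for a generated specialization such as ListResult[int], and naive_leaves is an illustrative walker that recurses only through typing.get_args().

```python
import typing


class Marker:
    """Hypothetical stand-in for a generated specialization like ListResult[int]."""


def naive_leaves(tp):
    """Collect leaf types by recursing through typing.get_args() only."""
    args = typing.get_args(tp)
    if not args:
        return [tp]
    leaves = []
    for arg in args:
        leaves.extend(naive_leaves(arg))
    return leaves


callable_type = typing.Callable[[Marker, str], int]

# The parameter types come back as a plain Python list, not as a typing object.
callable_args = typing.get_args(callable_type)
param_list = callable_args[0]

# get_args() on that plain list returns (), so the naive walker treats the
# whole list as a single leaf and never visits Marker itself.
leaves = naive_leaves(callable_type)
```

Because the walker stops at the list, Marker never shows up as a leaf, which is why a helper needs explicit handling for list and tuple containers.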
Pydantic-Only Repro
The smallest useful repro does not need ccflow or Ray. It only needs a plain
Pydantic generic model, two Python processes, and a cold receiver that imports
the generic origin class without first materializing the concrete specialization.
Consider the following repro:
Then it runs three subprocess steps:
- Box, evaluates Box[int], constructs Box[int](value=5), and serializes it with cloudpickle.
- Box but does not evaluate Box[int] before calling cloudpickle.loads(...).
- Box, evaluates Box[int], then calls cloudpickle.loads(...).

The observed output is:

That is the core bug in isolation. The creating process has a generated Box[int] class registered on the module. The cold receiving process has only the importable generic origin class Box, so pickle's global lookup for Box[int] fails. The warm receiver evaluates Box[int] at module/global scope before unpickling, which causes Pydantic to install the same generated class as a module attribute before pickle tries to resolve it. That proves this is an import/materialization-ordering problem rather than a ccflow model-definition problem.
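The ordering problem can also be imitated in a single process with only the standard library. This is a rough sketch under stated assumptions, not the Pydantic repro itself: toy_models is a hypothetical in-memory module, Box is a plain class, and materialize() imitates Pydantic registering a generated "Box[int]" class as a module attribute.

```python
import pickle
import sys
import types

# Hypothetical stand-in for the module that defines the generic origin class.
models = types.ModuleType("toy_models")
sys.modules["toy_models"] = models


class Box:
    """Plain stand-in for the importable generic origin class."""


models.Box = Box


def materialize():
    """Create a 'Box[int]' subclass and register it on the module by name."""
    specialized = type("Box[int]", (Box,), {"__module__": "toy_models"})
    setattr(models, "Box[int]", specialized)
    return specialized


# Producer: the generated class is registered, so pickle stores it by
# global reference (module name plus class name).
payload = pickle.dumps(materialize()())

# Cold receiver: the origin is still importable, the generated name is gone.
delattr(models, "Box[int]")
try:
    pickle.loads(payload)
    cold_error = None
except AttributeError as exc:
    cold_error = exc

# Warm receiver: materialize first, then the same payload loads.
warm_cls = materialize()
warm_obj = pickle.loads(payload)
```

The same payload fails or succeeds depending only on whether the generated name exists on the module when pickle resolves it.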
The same repro was checked with several Pydantic 2.x releases using:

[Results table not recoverable from the page; the surviving cells all read "Box[int] module attribute".]

This is not a regression in a recent Pydantic minor release. The behavior is stable across the tested 2.x line.
Why the Fix Uses __reduce_ex__

__reduce_ex__ is the pickle hook that returns a reconstruction recipe. For concrete generic specializations of ccflow.BaseModel, ccflow now returns a recipe like:

```python
(
    _new_ccflow_generic_model,
    (origin, portable_args),
    pydantic_state,
)
```

For:

the recipe is conceptually:

On load, pickle first calls the reducer function:

Then, because the reducer returned pydantic_state as the third tuple element, pickle applies that state to the object:

This uses Pydantic's own generic construction path in the receiving process, while still letting pickle apply Pydantic's normal state protocol. The receiver does not need to already have a global GenericResult[int] module attribute.

Why Type Arguments Need Special Handling
The generic argument portability layer is the ugly part, but it is solving a
real second-order problem.
If we only serialize the top-level class as:
then this works:
but these can still fail:
because ListResult[int] is itself a generated Pydantic generic class.

The helper therefore handles:
- list and tuple containers that can appear inside type expressions, such as the callable parameter list
The raw Pydantic state remains in the outer pickle stream. That is intentional:
pickle keeps its own memo table there, which is required to preserve shared
references, cycles, and protocol-5 buffers. The reducer only changes how the
generic class is recreated; it does not recursively rewrite arbitrary model
field/private state.
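As a rough illustration of the idea behind a portable type-argument layer, the sketch below encodes a type expression into a nested spec that never references a generated class, then rebuilds it via origin[args]. It is not the ccflow helper: Toy, encode, and decode are hypothetical names, Toy is a plain class whose subscripting creates cached runtime subclasses, and special forms such as PEP 604 unions are deliberately left out.

```python
import collections.abc
import typing


class Toy:
    """Toy generic origin; subscripting creates a cached runtime subclass."""

    _cache = {}

    def __class_getitem__(cls, arg):
        if arg not in cls._cache:
            name = f"{cls.__name__}[{getattr(arg, '__name__', arg)}]"
            cls._cache[arg] = type(name, (cls,), {"toy_args": (arg,)})
        return cls._cache[arg]


def encode(tp):
    """Turn a type expression into a spec with no generated classes in it."""
    if isinstance(tp, (list, tuple)):
        return ("seq", type(tp), [encode(item) for item in tp])
    if isinstance(tp, type) and "toy_args" in vars(tp):
        return ("generated", Toy, [encode(arg) for arg in tp.toy_args])
    args = typing.get_args(tp)
    if args:
        return ("alias", typing.get_origin(tp), [encode(arg) for arg in args])
    return ("leaf", tp, [])


def decode(spec):
    """Rebuild the type expression, re-creating generated classes on demand."""
    kind, head, parts = spec
    items = [decode(part) for part in parts]
    if kind == "leaf":
        return head
    if kind == "seq":
        return head(items)
    if kind == "generated":
        return head[items[0]]
    return head[tuple(items)]


expr = collections.abc.Callable[[Toy[int]], list[Toy[str]]]
spec = encode(expr)

# Simulate a receiver that has never materialized Toy[int] or Toy[str].
Toy._cache.clear()
rebuilt = decode(spec)
```

The spec only mentions Toy, int, str, list, and Callable, so nothing in it depends on a generated class name existing on the receiving side.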
It intentionally does not treat model instances as type specs. A value like:
is still a model instance and should be pickled as an instance. Its own __reduce_ex__ will handle its generated class. Accidentally converting instances into class specs would corrupt data.

Blast Radius
There are two related guards:
That predicate identifies concrete generated Pydantic generic classes such as:
The BaseModel.__reduce_ex__ override runs when pickling a ccflow model instance whose type(self) satisfies that predicate. So it does apply to:

It does not apply to:

Normal non-generic BaseModel instances continue using the default reducer path, plus ccflow's existing deterministic __getstate__/__setstate__ hooks.

The custom reduce recipe is created only during pickling of concrete generic ccflow BaseModel instances. It does not run during:
- model_dump

The performance cost is therefore limited to pickling generic model instances. The extra work is walking the generic type arguments for the model class. The actual Pydantic instance state is still handled by the surrounding pickle operation.
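The guard-then-recipe shape described in this section can be sketched with plain Python objects. This is a stand-in under stated assumptions, not the ccflow implementation: ToyModel and _new_toy_generic are hypothetical names, ToyModel is not a Pydantic model, and the synthetic toy_pickle_models module exists only so pickle can resolve both by reference however the snippet is executed.

```python
import pickle
import sys
import types


class ToyModel:
    """Stand-in for a generic model base class; not Pydantic."""

    _cache = {}

    def __init__(self, value):
        self.value = value

    def __class_getitem__(cls, arg):
        key = (cls, arg)
        if key not in ToyModel._cache:
            name = f"{cls.__name__}[{arg.__name__}]"
            attrs = {"toy_origin": cls, "toy_args": (arg,)}
            ToyModel._cache[key] = type(name, (cls,), attrs)
        return ToyModel._cache[key]

    def __reduce_ex__(self, protocol):
        cls = type(self)
        # Guard: only generated specializations get the custom recipe.
        if "toy_args" not in vars(cls):
            return super().__reduce_ex__(protocol)
        # Recipe: (rebuild function, (origin, args), instance state).
        return (_new_toy_generic, (cls.toy_origin, cls.toy_args), self.__dict__)


def _new_toy_generic(origin, args):
    """Re-create the specialization on load and return a blank instance."""
    cls = origin[args[0]]
    return cls.__new__(cls)


# Register both names on a synthetic module so pickle can find them by reference.
_mod = types.ModuleType("toy_pickle_models")
sys.modules[_mod.__name__] = _mod
for _obj in (ToyModel, _new_toy_generic):
    _obj.__module__ = _mod.__name__
    _obj.__qualname__ = _obj.__name__
    setattr(_mod, _obj.__name__, _obj)

payload = pickle.dumps(ToyModel[int](value=5))
ToyModel._cache.clear()  # simulate a receiver that never evaluated ToyModel[int]
restored = pickle.loads(payload)

# Non-generic instances still take the default reducer path.
plain = pickle.loads(pickle.dumps(ToyModel(3)))
```

Because the state travels as the third tuple element, pickle applies it after the rebuild function returns, so __init__ is not rerun on load.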
Why Not Simpler Alternatives
Why not call model_validate on restore?

Because pickle should restore object state, not validate new input.

Revalidation can:

Pydantic's own pickle support uses __getstate__/__setstate__, so the ccflow fix follows that model.

Why not rely on cloudpickle to serialize the generated class by value?
Sometimes cloudpickle can serialize dynamic classes by value. But Pydantic specializations are not just ordinary dynamic classes. They carry generated schemas, validators, serializers, generic metadata, and cache behavior.
Also, if the class appears to be importable by module/name in the creating process, cloudpickle can choose a global-reference path. That is exactly the fragile path that fails in a fresh receiver.
The stable representation for a Pydantic generic specialization is not the generated class object. It is:
Why not globally register every generated specialization?
Pydantic already conditionally registers generated specializations when it thinks they were created globally. But a fresh Ray worker has not necessarily executed the same specialization expression yet.
Trying to eagerly register all possible specializations is impossible. Registering during serialization still would not help the receiver unless the receiver imports side effects in the same order.
Why not monkeypatch pickle/cloudpickle for all classes?
Generated Pydantic specializations are class objects, so a global reducer would mean changing behavior for type or for broad classes of model classes. That is much wider than this bug.

The ccflow fix keeps the custom behavior inside ccflow BaseModel instance pickling.

Why not support fields containing generated classes too?
That is a real broader issue, but fixing it at the state-value level is a much
bigger change. It requires intercepting arbitrary class objects inside the
pickle stream or walking Pydantic state manually, both of which can disturb
pickle's normal identity/cycle/buffer semantics if done carelessly.
The current fix chooses the smaller and safer boundary: generated classes that
define the model instance's own type, plus generated classes inside that type's
generic arguments. A field value like GenericResult[type](value=ListResult[int]) can still fail in a cold receiver and remains out of scope.
How Broad Is This Problem?
It affects concrete Pydantic generic specializations crossing a process
boundary by pickle/cloudpickle when the receiver has not already materialized
the same specialization.
Shapes this PR fixes:
- GenericResult[int](...)
- GenericResult[ListResult[int]](...)
- GenericResult[list[ListResult[int]]](...)
- GenericResult[dict[str, GenericContext[int]]](...)
- typing alias arg: GenericResult[typing.List[ListResult[int]]](...)
- GenericResult[typing.Callable[[ListResult[int]], int]](...)
- as typing.ClassVar[ListResult[int]] and typing.Final[ListResult[int]]
- GenericResult[GenericContext[int] | None](...)
- GenericResult[typing.Optional[ListResult[int]]](...)

Not affected:
- BaseModel classes
- GenericResult

Handled by normal nested pickling, not by rewriting field/private state:
- instances normally, and each instance's own reducer handles its generated class

Helper-level coverage only:
- CallableModelGenericType[NullContext, GenericResult[int]] as a type argument
- typing.Required[ListResult[int]] and typing.NotRequired[ListResult[int]]; Pydantic rejects these as GenericResult[...] arguments in this context, but the restore helper handles the one-argument special-form shape

Known not fixed:
- GenericResult[type](value=ListResult[int])
- Annotated[int, frozenset([ListResult[int]])]; the helper walks normal typing args plus explicit list/tuple containers, not every object that can be embedded in metadata
Why This Is Hard To Read
The code is confusing because it has to preserve three different pieces of
pickle/type state:
- __getstate__/__setstate__
- origin[args]
- as builtin list[...], typing.List[...], typing.Optional[...], typing.Callable[...], or a PEP 604 A | B union

It also has to avoid a dangerous false positive:

That is why there are separate helpers for:
- Pydantic's normal __setstate__

The resulting implementation is not aesthetically simple, but each piece exists to handle a real pickle/Ray failure mode.
References
Pydantic dynamic model docs. Pydantic explicitly notes that dynamically created models must be globally defined and have __module__ provided to be pickleable:
https://docs.pydantic.dev/latest/concepts/models/#dynamic-model-creation

Pydantic BaseModel.__class_getitem__ source for generic specialization:
https://github.com/pydantic/pydantic/blob/v2.13.4/pydantic/main.py#L904-L969

Pydantic _generics.create_generic_submodel source for dynamic generated generic subclasses and conditional module registration:
https://github.com/pydantic/pydantic/blob/v2.13.4/pydantic/_internal/_generics.py#L105-L149

Pydantic BaseModel.__getstate__/BaseModel.__setstate__ source showing state-based pickle behavior:
https://github.com/pydantic/pydantic/blob/v2.13.4/pydantic/main.py#L1145-L1160

Pydantic issue #9668, broad background on Python compatibility work that mentions generic models and pickling among affected areas. It is not this exact bug:
Support Python 3.13 pydantic/pydantic#9668

Prefect's add_cloudpickle_reduction, an analogous downstream precedent for adding reducer logic around Pydantic model classes in workflow/distributed execution contexts:
https://reference.prefect.io/prefect/utilities/pydantic/#add_cloudpickle_reduction