Allow pickling PyExpr #1517
Open
ntjohnson1 wants to merge 9 commits into apache:main from
Conversation
Promote the previously immutable global context slot in `datafusion-python-util` from `OnceLock<Arc<SessionContext>>` to a `RwLock<Arc<SessionContext>>` and expose `set_global_ctx` (Rust) / `SessionContext.set_as_global` (Python). Users who register UDFs or otherwise customize a context can now make it the default seen by `SessionContext.global_ctx()` and the module-level `read_*` helpers. Existing snapshots returned by `get_global_ctx()` are unaffected; the swap only changes what subsequent readers see.

Also fixes a pre-existing clippy `uninlined_format_args` nit in `dataframe.rs` that was tripping the pre-commit hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
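The Python-side semantics of the swappable slot can be sketched as follows. This is a minimal stand-in (a lock-guarded module-level slot), not the real datafusion-python bindings: `SessionContext`, `get_global_ctx`, and `set_global_ctx` here are hypothetical simplifications of the API named above.

```python
import threading

class SessionContext:
    """Toy stand-in for the real datafusion SessionContext."""
    def __init__(self, name="default"):
        self.name = name

_lock = threading.Lock()
_global_ctx = None

def get_global_ctx():
    """Return the current global context, creating a default lazily."""
    global _global_ctx
    with _lock:
        if _global_ctx is None:
            _global_ctx = SessionContext()
        return _global_ctx

def set_global_ctx(ctx):
    """Swap the slot; callers holding an old snapshot keep it."""
    global _global_ctx
    with _lock:
        _global_ctx = ctx

snapshot = get_global_ctx()               # default context
set_global_ctx(SessionContext("custom"))  # install a customized context
assert snapshot.name == "default"         # prior snapshot unaffected
assert get_global_ctx().name == "custom"  # subsequent readers see the swap
```

The two assertions at the end mirror the snapshot semantics described in the commit message: swapping changes only what later readers observe.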
Add `to_bytes` / `from_bytes` on `Expr` (Python wrapper) and the underlying `RawExpr` (Rust). Serialization uses `datafusion-proto`'s `Serializeable` trait, encoding function references by name. The Python wrapper implements `__getstate__` / `__setstate__` on top, so `pickle.dumps` / `dill.dumps` work out of the box. Reconstruction resolves function names against the process-wide global `SessionContext` (introduced as settable in the previous commit). Built-in functions always roundtrip; user-defined functions roundtrip when registered on a context that has been installed via `SessionContext.set_as_global()`.

Adds `dill` to the dev dependency group and parametrized tests covering both serializers across columns, literals, binary ops, casts, between, aggregates, case/when, and a UDF with the global-ctx pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
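A minimal sketch of the `__getstate__` / `__setstate__` bridge, assuming a toy byte codec in place of the real `datafusion-proto` round-trip. `ExprLike` is a hypothetical stand-in for the `Expr` wrapper, not the actual class:

```python
import pickle

class ExprLike:
    """Stand-in for the Expr wrapper; to_bytes/from_bytes mimic the
    datafusion-proto serialization with a trivial name-based encoding."""

    def __init__(self, name):
        self.name = name

    def to_bytes(self):
        return self.name.encode("utf-8")

    @classmethod
    def from_bytes(cls, data):
        # The real implementation resolves function references by name
        # against the process-wide global SessionContext.
        return cls(data.decode("utf-8"))

    def __getstate__(self):
        # pickle stores only the serialized payload
        return self.to_bytes()

    def __setstate__(self, state):
        # rebuild from the payload on unpickle
        self.__dict__.update(ExprLike.from_bytes(state).__dict__)

expr = ExprLike("col_a + 1")
roundtripped = pickle.loads(pickle.dumps(expr))
assert roundtripped.name == "col_a + 1"
```

Because the state is just the serialized bytes, any pickle-compatible serializer (including `dill`) round-trips the wrapper without extra work.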
GitHub Actions defaults to a 6-hour job timeout when `timeout-minutes` is unset, so a hung pytest run (e.g. a deadlocked multiprocessing pool) would otherwise burn six hours of runner time before being killed. Successful test-matrix runs complete in ~4 minutes; 30 minutes leaves ample headroom while bounding the blast radius of a future hang.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spawn workers re-import the module that defined a pickled function by its `__module__` attribute. Pytest's `--import-mode=importlib` (used in CI) can load test modules under synthetic names that the worker's normal import machinery cannot resolve, which manifests as `Pool.map` hanging forever waiting on a worker that died during unpickling.

Move the worker-side helpers (`_apply_builtin_expr`, `_apply_udf_expr`, `_register_udf_on_global_ctx`, `_build_add_ten_udf`) to a regular `tests._pickle_multiprocessing_helpers` module. The leading underscore keeps pytest from collecting it; spawn workers import it under its real dotted name in both parent and child.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
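The failure mode stems from how pickle serializes a function: as a (module, qualname) reference rather than as code, so the unpickling side must be able to import that module by the same dotted name. A quick stdlib-only demonstration (using `os.path.join`; no datafusion required):

```python
import pickle
from os.path import join

# A module-level function pickles as a reference: the defining module's
# name plus the qualified name, both stored verbatim in the stream.
payload = pickle.dumps(join)
assert join.__module__.encode() in payload    # e.g. b"posixpath" on Linux
assert join.__qualname__.encode() in payload  # b"join"

# Unpickling re-imports the module by that name and looks the function up.
# If the name was synthetic (as with --import-mode=importlib test modules),
# a spawned worker fails at exactly this step instead.
restored = pickle.loads(payload)
assert restored is join
```

Moving the helpers into `tests._pickle_multiprocessing_helpers` ensures the stored module name is a real, importable dotted path in both parent and child.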
ntjohnson1
commented
May 9, 2026
```yaml
jobs:
  test-matrix:
    runs-on: ubuntu-latest
    timeout-minutes: 30
```
Contributor
Author
Right now things pass in 4 minutes, but I accidentally committed a change that deadlocked and would have run for 6 hours if I hadn't pushed new commits. Seems like a nice guard against footguns.
Contributor
Author
Hmm, I can either remove the multiprocessing test, since the pickle tests should be sufficient coverage, or investigate later this weekend why it hangs on CI but not locally.
…e. Make sure pytest can see the tests directory as a package to access helpers
…ot cause and we can resolve. Will follow up by reverting this and removing the multiprocessing-specific test because it's not worth the complexity
Which issue does this PR close?
Allows holding a `PyExpr` when spawning Python parallelism.
Closes #1520.
Rationale for this change
I'll capture most of the details in the issue, but this makes it a little nicer to work around `datafusion-distributed` not being deployed everywhere: work can be shared across multiple workers in different processes.
What changes are included in this PR?
This does change how the global context is managed, treating it more like a singleton rather than just a standard default, so we can assume it is the right context to pull when unpacking the expr.
Are there any user-facing changes?
Not really, besides adding new capability.