Skip to content

Allow pickling PyExpr#1517

Open
ntjohnson1 wants to merge 9 commits intoapache:mainfrom
rerun-io:nick/pickle_expr
Open

Allow pickling PyExpr#1517
ntjohnson1 wants to merge 9 commits intoapache:mainfrom
rerun-io:nick/pickle_expr

Conversation

@ntjohnson1
Copy link
Copy Markdown
Contributor

@ntjohnson1 ntjohnson1 commented Apr 28, 2026

Which issue does this PR close?

Allows holding a pyexpr when spawning python parallelism.
Closes #1520.

Rationale for this change

I'll capture most of the details in the issue, but this makes the ergonomics of working around datafusion-distributed not being deployed everywhere a little nicer. Work can be shared across multiple workers in different processes.

What changes are included in this PR?

This does change how global context is managed to treat it more like a singleton rather that just a standard default so we can assume that is the right thing to pull when unpacking the expr.

Are there any user-facing changes?

Not really, besides adding new capability.

ntjohnson1 and others added 3 commits April 28, 2026 11:26
Promote the previously immutable global context slot in `datafusion-python-util`
from `OnceLock<Arc<SessionContext>>` to a `RwLock<Arc<SessionContext>>` and
expose `set_global_ctx` (Rust) / `SessionContext.set_as_global` (Python). Users
who register UDFs or otherwise customize a context can now make it the default
seen by `SessionContext.global_ctx()` and the module-level `read_*` helpers.

Existing snapshots returned by `get_global_ctx()` are unaffected — the swap only
changes what subsequent readers see.

Also fixes a pre-existing clippy `uninlined_format_args` nit in
`dataframe.rs` that was tripping the pre-commit hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `to_bytes` / `from_bytes` on `Expr` (Python wrapper) and the underlying
`RawExpr` (Rust). Serialization uses `datafusion-proto`'s `Serializeable`
trait, encoding function references by name. The Python wrapper implements
`__getstate__` / `__setstate__` on top, so `pickle.dumps` / `dill.dumps` work
out of the box.

Reconstruction resolves function names against the process-wide global
`SessionContext` (introduced as settable in the previous commit). Built-in
functions always roundtrip; user-defined functions roundtrip when registered
on a context that has been installed via `SessionContext.set_as_global()`.

Adds `dill` to the dev dependency group and parametrized tests covering
both serializers across columns, literals, binary ops, casts, between,
aggregates, case/when, and a UDF with the global-ctx pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ntjohnson1 ntjohnson1 marked this pull request as ready for review May 9, 2026 10:57
ntjohnson1 and others added 2 commits May 9, 2026 07:41
GitHub Actions defaults to a 6-hour job timeout when timeout-minutes is
unset, so a hung pytest run (e.g. a deadlocked multiprocessing pool)
would otherwise burn six hours of runner time before being killed.
Successful test-matrix runs complete in ~4 minutes; 30 minutes leaves
ample headroom while bounding the blast radius of a future hang.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spawn workers re-import the module that defined a pickled function by its
``__module__`` attribute. Pytest's ``--import-mode=importlib`` (used in
CI) can load test modules under synthetic names that the worker's normal
import machinery cannot resolve, which manifests as ``Pool.map`` hanging
forever waiting on a worker that died during unpickling.

Move the worker-side helpers (``_apply_builtin_expr``, ``_apply_udf_expr``,
``_register_udf_on_global_ctx``, ``_build_add_ten_udf``) to a regular
``tests._pickle_multiprocessing_helpers`` module. The leading underscore
keeps pytest from collecting it; spawn workers import it under its real
dotted name in both parent and child.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jobs:
test-matrix:
runs-on: ubuntu-latest
timeout-minutes: 30
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now things pass in 4 minutes but I accidentally committed a change that deadlocked and would have run for 6 hours if I didn't push new commits. Seems like a nice guard to footguns

@ntjohnson1
Copy link
Copy Markdown
Contributor Author

Hmm I can either remove the multiprocessing test since the pickle stuff should be sufficient coverage or investigate why it's not hanging locally but is on CI later this weekend

ntjohnson1 added 4 commits May 9, 2026 10:34
…e. Make sure pytest can see the tests directory as a package to access helpers
…ot cause and we can resolve. Will follow up by reverting this and removing multiprocessing specif test because not worth the complexity
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow pickling PyExpr

1 participant