[AURON #1840] Preserve collect_set first-occurrence order by peter941221 · Pull Request #2285 · apache/auron

peter941221 · 2026-05-25T04:59:34Z

What changed

AccSet::merge no longer swaps accumulators by set size. That swap changed the encounter order of collect_set, so no-shuffle Spark checks could see values in rhs-first order instead of first-occurrence order.

Why

Spark's collect_set preserves first-occurrence order in the no-shuffle path used by the affected aggregate suite.

Testing

git diff --check
cargo +nightly test --manifest-path native-engine/datafusion-ext-plans/Cargo.toml test_acc_set_merge -- --nocapture (blocked here by rdkafka-sys Windows build failure: %1 is not a valid Win32 application. (os error 193))

peter941221 · 2026-06-01T22:47:02Z

Rebased this onto current master and kept the fix minimal. The change still just removes the size-based swap in AccSet::merge and adds order assertions for the merge cases. New CI is running on f092709.

richox · 2026-06-03T11:52:59Z

do we really need to keep this order? if i dont understand wrong, this order we be discarded in bucket-merge spill strategy even if we maintain it in updating

peter941221 · 2026-06-03T12:33:51Z

@richox thanks for calling out the spill path. I checked that path and added a focused regression for it.

The spill bucket merge still comes back through the same collect_set merge path this PR changes:
AggTable::output bucket merge in agg_table.rs
-> partial_merge
-> merge_items in collect.rs
-> AccSet::merge in collect.rs

I also checked the spill round-trip itself. AccSetColumn::save_raw writes list.raw as-is, and unspill rebuilds it in order through load_raw / append_item, so spill does not discard the internal encounter order by itself.

To make that concrete, I pushed 645b3236, which adds test_acc_set_merge_preserves_first_occurrence_order_after_rhs_spill. It spills the larger RHS accumulator, unspills it, merges it into the LHS, and still gets [1, 2, 3] on the repo's pinned nightly-2025-05-09.

So I think we still need this fix for the spill path too.

github-actions Bot added the native label May 25, 2026

peter941221 marked this pull request as ready for review May 29, 2026 10:17

[AURON apache#1840] Preserve collect_set first-occurrence order

f092709

peter941221 force-pushed the fix/auron-1840-collect-set-order branch from 46ef225 to f092709 Compare June 1, 2026 22:43

test: cover collect_set spill merge order

645b323

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AURON #1840] Preserve collect_set first-occurrence order#2285

[AURON #1840] Preserve collect_set first-occurrence order#2285
peter941221 wants to merge 2 commits into
apache:masterfrom
peter941221:fix/auron-1840-collect-set-order

peter941221 commented May 25, 2026

Uh oh!

peter941221 commented Jun 1, 2026

Uh oh!

richox commented Jun 3, 2026

Uh oh!

peter941221 commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

peter941221 commented May 25, 2026

What changed

Why

Testing

Uh oh!

peter941221 commented Jun 1, 2026

Uh oh!

richox commented Jun 3, 2026

Uh oh!

peter941221 commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants