[AURON #1840] Preserve collect_set first-occurrence order#2285
[AURON #1840] Preserve collect_set first-occurrence order#2285peter941221 wants to merge 2 commits into
Conversation
46ef225 to
f092709
Compare
|
Rebased this onto current master and kept the fix minimal. The change still just removes the size-based swap in AccSet::merge and adds order assertions for the merge cases. New CI is running on f092709. |
|
do we really need to keep this order? if i dont understand wrong, this order we be discarded in bucket-merge spill strategy even if we maintain it in updating |
|
@richox thanks for calling out the spill path. I checked that path and added a focused regression for it. The spill bucket merge still comes back through the same I also checked the spill round-trip itself. To make that concrete, I pushed So I think we still need this fix for the spill path too. |
What changed
AccSet::mergeno longer swaps accumulators by set size. That swap changed the encounter order ofcollect_set, so no-shuffle Spark checks could see values inrhs-firstorder instead of first-occurrence order.Why
Spark's
collect_setpreserves first-occurrence order in the no-shuffle path used by the affected aggregate suite.Testing
git diff --checkcargo +nightly test --manifest-path native-engine/datafusion-ext-plans/Cargo.toml test_acc_set_merge -- --nocapture(blocked here byrdkafka-sysWindows build failure:%1 is not a valid Win32 application. (os error 193))