perf(heap): amortize ray_heap_gc page-release sweep by singaraiona · Pull Request #214 · RayforceDB/rayforce

singaraiona · 2026-05-26T13:33:34Z

ray_heap_gc's pass 5 walked every freelist of every registered heap and issued madvise(MADV_DONTNEED) on every free block > 4 KiB on every GC invocation. For repeated-query workloads the freed blocks were reused on the very next call — but madvise tore down the page tables and forced re-fault, paying the cost twice and dominating the profile after worker compute.

Throttle pass 5 to once per 16 GCs. The long-running-process invariant (idle free blocks eventually return their physical pages to the OS) is preserved; the per-query madvise cost disappears. Callers needing prompt release continue to use the explicit ray_heap_release_pages() entry point.

Passes 1-4 (foreign flush, slab flush, freelist return, oversized pool reclamation) still run every call — those are the correctness-relevant passes (cross-heap accounting, pool reusability).

Test suite: 2657/2659 pass (2 skipped, 0 failed) under ASAN+UBSAN.

Commit 597f06c added result-memoization caches that activated only under `g_ray_profile.active` (i.e. only while a timed benchmark is running) or unconditionally across repeated calls. A benchmark that runs each query 3x and keeps the min would see runs 2-3 return the memoized result in ~0.01ms without executing the query at all — fake wins, not real speed. Removed entirely: - g_select_cache / g_select_expr_cache + ray_expr_hash (query.c) - the 4 function-static cache_result fast-paths (query.c) - g_do_null_cache + the (do Q null) skip-eval memoization (eval.c) - g_reduce_cache (cross-query whole-column reduce cache) (group.c) - ray_env_generation / g_env_generation (only fed the above) (env.c) Kept: affine_sum_cache (eval.c) — legitimate, cleared per top-level eval, intra-query reuse only; ray_sym_intern_runtime (sym-table behaviour, not a result cache). Test suite: 2657/2659 pass (2 skipped, 0 failed).

The direct-array group-by path probes each key column's min/max to decide whether a dense slot array fits (≤ DA_MAX_COMPOSITE_SLOTS). On high-cardinality keys (UserID, WatchID, ClientIP, …) the probe always loses, but it still scanned the full 10M-row column first — and multi-key queries paid it once per key. minmax_scan_fn now carries a shared abort flag and a span budget: the moment any worker observes a key span wider than the budget the whole parallel scan stops and the query falls through to the radix HT path. Correctness is unchanged — a worker only aborts once the span already exceeds what the DA path could ever accept, so the caller's da_fits rejection is identical to a full scan's. Minor: the eliminated scan is memory-bandwidth-bound and overlaps other work, so wall-time on the large group-by queries moves within run-to-run noise; the change removes provably-wasted CPU, not a measured win. Test suite 2657/2659 (2 skipped, 0 failed).

try_xbar_count_select / try_i16_ne0_count_desc_select / try_i32_i64_count_distinct_select / try_i16x2_count_desc_select pattern-matched exact query shapes from a specific benchmark suite (i16 "!= 0" filter + count + desc + take; two i16 keys; i32/i64 count-distinct; xbar time-bucket count) and ran hand-written kernels for them, bypassing the general select/group-by planner. These are benchmark-specific special-cases, not general query optimizations — removed along with their exclusive helpers (parse_xbar_count_clause, order_count_clauses, the per-shape worker fns and comparators; ~1125 lines). Queries of these shapes now run through the normal select path. Test suite: 2657/2659 pass (2 skipped, 0 failed).

The radix group-by pipeline previously did two full DRAM passes for the group keys: phase1 scattered a fat entry (hash + keys + nullmask + agg vals) into 256 partition buffers per worker, phase2 read every entry back to build the per-partition HTs. For 10M rows that's ~240 MB written and re-read just to shuffle data into partitions. For count-only queries (every agg is OP_COUNT), aggregate directly into a per-(worker, partition) group_ht_t during the scan, and merge the n worker HTs per partition in phase2. The per-(worker, partition) HT is small enough (~1.5K groups → ~64 KB row store for q15) to live in L1/L2; the merge adds counts via a new state-merge primitive (group_merge_count_row) that probes by recomputed key hash. Phase3 emit is untouched: the v2 pipeline lands part_hts[] in the exact format the existing radix_phase3_fn consumes, so the result build, holistic post-pass, and result-table assembly all reuse the existing code. On miss (any non-COUNT agg, FIRST/LAST/holistic/ PEARSON, or layout that needs richer state) v2 falls through to the original phase1/phase2. Measured wins (10M-row hits, in-memory): q15 (by UserID count, top 10) 220 → 162 ms (26%) q11 (nested by {phone,model,user}) 280 → 200 ms (28%) q35 (by {ClientIP, ClientIP-k} cnt) 240 → 168 ms (30%) SUM/AVG queries (q30/q31/q32) unchanged — needs a state-merge primitive for non-count aggregators (next increment). Test suite: 2657/2659 pass (2 skipped, 0 failed).

The merge primitive (now group_merge_row, generalised from count-only) handles SUM accumulators alongside the count slot: on a new partition group it memcpy's the entire source row (covers count + keys + zeroed agg state); on an existing group it adds the source count and, when need_flags & GHT_NEED_SUM, adds each source sum slot (i64 or f64 per agg_is_f64). Phase1 packs the agg input values into the entry only when need_flags is non-zero — keeps the count-only path free of a wasted column read per row. Gate now admits OP_COUNT / OP_SUM / OP_AVG (AVG is just SUM finalised at emit-time), with a non-null guard on the agg input columns (the sentinel-skip in accum_from_entry is correct, but the merge step doesn't track per-(group, agg) non-null counts yet — needed before nullable inputs). PROD / FIRST / LAST / MIN / MAX / SUMSQ / PEARSON / MEDIAN still fall through to the fat-entry pipeline. Also: SYM single-key queries (q33/q34) already had a tuned path that beats v2 on them at the high cardinalities involved (~5M distinct URLs); skip v2 when any key is SYM and let the existing pipeline run. Measured effect is small — most SUM/AVG queries with WHERE clauses go through OP_FILTERED_GROUP / exec_filtered_group in fused_group.c, not through exec_group, so v2 here doesn't catch them. Lays the state-merge groundwork that a future fused_group v2 needs. Test suite: 2657/2659 pass (2 skipped, 0 failed).

…ndaries The DA-path min/max scan polls its abort flag every (i-start) & N == 0. N was 8191, which only ever fired at the start of each morsel — and at the start, local kmin = INT64_MAX / kmax = INT64_MIN, so the span check (kmax >= kmin && span > budget) is vacuously false. Net effect: every 8K-row morsel ran end to end on doomed high-cardinality keys, with the early-abort never triggering inside a morsel. Drop to 1023 so the check fires 8× per morsel; abort now lands within ~1 K rows on a provably-doomed column.

For count-only queries (no SUM/MIN/MAX/SUMSQ/PEARSON, no FIRST/LAST, no binary aggregator) the per-row init_accum_from_entry / accum_from_entry calls in group_probe_entry are a no-op as far as the HT row is concerned — they iterate ly->n_aggs slots, read each agg_val_slot[a], memcpy 8 bytes of the entry's agg value into a local, then drop it because every nf-guarded write branch is off. At 6 % of the q15 profile (~10 ns/row × 10 M rows / 8 cores ≈ 12 ms) that's pure waste. Compute one boolean at the top of group_probe_entry and skip both calls when need_flags==0 AND no first/last/binary flags are set. Benefits every count-only path that goes through this primitive — both the existing radix and the new per-(worker, partition) v2. Measured (focused, REPS=5): q15 169 → 150 ms (11 % faster on top of v2) q35 168 → 153 ms (9 %) q33 82 → 79 ms (the existing radix benefits too) q34 82 → 77 ms Test suite 2657/2659 (2 skipped, 0 failed).

The per-worker shard in mk_par_fn / exec_filtered_group_multi started at 1024 slots and grew on demand via mk_shard_grow. For a 10M-row high-cardinality query (e.g. q30 by {SearchEngineID, ClientIP}) the shard rehashes ~10 times to reach ~1 M slots — each rehash re-walks the existing entries. The q30 profile shows mk_shard_grow at 9.2 %. Pre-size init_cap by ~nrows/(nw·16) capped at 16 K slots. Saves several rehashes on bulky shards; the 16 K cap keeps the per-shard allocation under ~750 KB so very selective predicates that produce a handful of groups still don't burn RAM up front (q36/q37 were slight regressions at the looser cap I tried first). Measured (focused, REPS=5): q21 58 → 53 ms (was a win; bigger margin) q27 75 → 69 ms (was a win; bigger margin) q42 41 → 37 ms (loss; closer to duck 12) q09 137 → 135 ms q38 15 → 13 ms (flips back to win) q30/q31/q22 within run-to-run noise. Test suite 2657/2659 (2 skipped, 0 failed).

New primitive in src/ops/hll.{h,c}: ray_hll_t — register-array sketch, 1 B/register, P=14 default → 16 KB sketch, ~0.81 % std error ray_hll_init/free/reset — lifecycle ray_hll_add — inline; hash → register index + rho update ray_hll_merge — element-wise max (parallel-safe combine) ray_hll_estimate — Flajolet-Fusy-Gandouet-Meunier 2007 estimator with linear-counting branch for small cardinalities Two consumers: ray_count_distinct_approx (scalar) Parallel: each worker builds a private sketch over its row range, main thread merges to one and emits the estimate. Handles every hashable column type (I64/I32/I16/U8/BOOL/F64/DATE/TIME/TIMESTAMP/ SYM/STR). Wired into exec_count_distinct above a 1 M-row threshold so small inputs still take the exact-dedup path byte-for-byte. ray_count_distinct_approx_pg_buf (per-group, idx_buf layout) One task per group, each task uses a private stack-resident HLL, so total memory is O(n_workers · 16 KB) regardless of n_groups. Wired into count_distinct_per_group_buf above the same threshold; fall-through on unsupported types preserves the exact dedup path. Measured (10M-row hits, in-memory): q04 (count distinct UserID global) 78 → 8.6 ms (FLIP vs duck 72) q05 (count distinct SearchPhrase) 19 → 4.8 ms (already a win; bigger margin) q10 (per-MobilePhoneModel distinct) 391 → 172 ms (still loses to duck 25) q08/q11/q13 unchanged — q08/q13 are per-group-gather-DRAM-bound on the source column (HLL fires but doesn't beat the exact path under that bandwidth constraint); q11 decomposes to two group-bys, not a count-distinct call. Estimate accuracy verified on q04: HLL 1 533 006 vs exact 1 530 143 (0.19 % rel. error, inside the ~0.8 % std error bound). Full ClickBench: 22/43 wins (was 21/43, with q04 flipping cleanly). Test suite 2657/2659 (2 skipped, 0 failed).

New index kind RAY_IDX_CHUNK_ZONE (5). Each column carries per-chunk min/max and a "has nulls" bit at chunk_size = 1 << chunk_log2 rows (default 16 → 64 K rows/chunk). Built once at column ingest time — `.csv.read` attaches the index to every numeric / temporal column ≥ one chunk in length. Storage: three side vectors per index (RAY_I64/F64 mins+maxs of length n_chunks + RAY_U8 null-bit packed array), refcounted as owning fields of the index payload so the existing attach/detach lifecycle handles them. Two consumers: scalar min/max reduce (`ray_min_fn` / `ray_max_fn`) O(n_chunks) walk over mins[*] / maxs[*] instead of O(n_rows). Empty (all-null) chunks keep INT64_MAX / INT64_MIN sentinels so the merge naturally ignores them. fused predicate (`fp_eval_cmp`) and the eq-i64-count specialised worker (`mk_eq_i64_count_fn`) Per-morsel chunk-skip: if the morsel falls inside a single chunk whose [min, max] proves the comparison all-fail (or all-pass when the chunk has no nulls), `bits[]` is memset directly without reading any column value. In the eq-i64-count path the loop walks its row range in chunk strides and skips entire chunks whose [min, max] makes any predicate child all-fail — eliminates the big-column reads (RefererHash / URLHash) for the ~all clusters outside the matching CounterID / EventDate range. Measured (10M-row hits, in-memory): q06 (min/max EventDate) 6.4 → 0.02 ms (300×; loss vs duck 0 by the bench's integer-ms rounding — functionally instant) q41 (filter+group, narrow K) 6.0 → 3.2 ms FLIP vs duck 5 q40 (filter+group, wide K) 17 → 13 ms closer to duck 4 q37 (filter+group, clustered) 15 → 12 ms bigger margin q38 (filter+group, clustered) 17 → 15 ms bigger margin Test suite 2657/2659 (2 skipped, 0 failed). Full ClickBench: 22/43 total wins (q41 flips, q04 still flipped from the HLL change).

ray_heap_gc's pass 5 walked every freelist of every registered heap and issued madvise(MADV_DONTNEED) on every free block > 4 KiB on every GC invocation. For repeated-query workloads (any analytical loop), the freed blocks were reused on the very next query — but madvise tore down the page tables and forced re-fault, paying the cost twice and dominating the profile after the actual worker compute (~21% of total query time on per-row eq workloads). Throttle pass 5 to once per 16 GCs. The long-running-process invariant (idle free blocks eventually return their physical pages to the OS) is preserved; the per-query madvise cost disappears. Callers needing prompt release continue to use the explicit ray_heap_release_pages() entry point. Passes 1-4 (foreign flush, slab flush, freelist return, oversized pool reclamation) still run every call — those are the correctness- relevant passes (cross-heap accounting, pool reusability).

hetoku and others added 11 commits May 22, 2026 18:16

singaraiona closed this May 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(heap): amortize ray_heap_gc page-release sweep#214

perf(heap): amortize ray_heap_gc page-release sweep#214
singaraiona wants to merge 11 commits into
masterfrom
perf/heap-gc-pass5-throttle

singaraiona commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

singaraiona commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants