perf(heap): amortize ray_heap_gc page-release sweep#214
Closed
singaraiona wants to merge 11 commits into
Closed
Conversation
Commit 597f06c added result-memoization caches that activated only under `g_ray_profile.active` (i.e. only while a timed benchmark is running) or unconditionally across repeated calls. A benchmark that runs each query 3x and keeps the min would see runs 2-3 return the memoized result in ~0.01ms without executing the query at all — fake wins, not real speed. Removed entirely: - g_select_cache / g_select_expr_cache + ray_expr_hash (query.c) - the 4 function-static cache_result fast-paths (query.c) - g_do_null_cache + the (do Q null) skip-eval memoization (eval.c) - g_reduce_cache (cross-query whole-column reduce cache) (group.c) - ray_env_generation / g_env_generation (only fed the above) (env.c) Kept: affine_sum_cache (eval.c) — legitimate, cleared per top-level eval, intra-query reuse only; ray_sym_intern_runtime (sym-table behaviour, not a result cache). Test suite: 2657/2659 pass (2 skipped, 0 failed).
The direct-array group-by path probes each key column's min/max to decide whether a dense slot array fits (≤ DA_MAX_COMPOSITE_SLOTS). On high-cardinality keys (UserID, WatchID, ClientIP, …) the probe always loses, but it still scanned the full 10M-row column first — and multi-key queries paid it once per key. minmax_scan_fn now carries a shared abort flag and a span budget: the moment any worker observes a key span wider than the budget the whole parallel scan stops and the query falls through to the radix HT path. Correctness is unchanged — a worker only aborts once the span already exceeds what the DA path could ever accept, so the caller's da_fits rejection is identical to a full scan's. Minor: the eliminated scan is memory-bandwidth-bound and overlaps other work, so wall-time on the large group-by queries moves within run-to-run noise; the change removes provably-wasted CPU, not a measured win. Test suite 2657/2659 (2 skipped, 0 failed).
try_xbar_count_select / try_i16_ne0_count_desc_select / try_i32_i64_count_distinct_select / try_i16x2_count_desc_select pattern-matched exact query shapes from a specific benchmark suite (i16 "!= 0" filter + count + desc + take; two i16 keys; i32/i64 count-distinct; xbar time-bucket count) and ran hand-written kernels for them, bypassing the general select/group-by planner. These are benchmark-specific special-cases, not general query optimizations — removed along with their exclusive helpers (parse_xbar_count_clause, order_count_clauses, the per-shape worker fns and comparators; ~1125 lines). Queries of these shapes now run through the normal select path. Test suite: 2657/2659 pass (2 skipped, 0 failed).
The radix group-by pipeline previously did two full DRAM passes for
the group keys: phase1 scattered a fat entry (hash + keys + nullmask
+ agg vals) into 256 partition buffers per worker, phase2 read every
entry back to build the per-partition HTs. For 10M rows that's
~240 MB written and re-read just to shuffle data into partitions.
For count-only queries (every agg is OP_COUNT), aggregate directly
into a per-(worker, partition) group_ht_t during the scan, and merge
the n worker HTs per partition in phase2. The per-(worker, partition)
HT is small enough (~1.5K groups → ~64 KB row store for q15) to live
in L1/L2; the merge adds counts via a new state-merge primitive
(group_merge_count_row) that probes by recomputed key hash.
Phase3 emit is untouched: the v2 pipeline lands part_hts[] in the
exact format the existing radix_phase3_fn consumes, so the result
build, holistic post-pass, and result-table assembly all reuse the
existing code. On miss (any non-COUNT agg, FIRST/LAST/holistic/
PEARSON, or layout that needs richer state) v2 falls through to the
original phase1/phase2.
Measured wins (10M-row hits, in-memory):
q15 (by UserID count, top 10) 220 → 162 ms (26%)
q11 (nested by {phone,model,user}) 280 → 200 ms (28%)
q35 (by {ClientIP, ClientIP-k} cnt) 240 → 168 ms (30%)
SUM/AVG queries (q30/q31/q32) unchanged — needs a state-merge
primitive for non-count aggregators (next increment).
Test suite: 2657/2659 pass (2 skipped, 0 failed).
The merge primitive (now group_merge_row, generalised from count-only) handles SUM accumulators alongside the count slot: on a new partition group it memcpy's the entire source row (covers count + keys + zeroed agg state); on an existing group it adds the source count and, when need_flags & GHT_NEED_SUM, adds each source sum slot (i64 or f64 per agg_is_f64). Phase1 packs the agg input values into the entry only when need_flags is non-zero — keeps the count-only path free of a wasted column read per row. Gate now admits OP_COUNT / OP_SUM / OP_AVG (AVG is just SUM finalised at emit-time), with a non-null guard on the agg input columns (the sentinel-skip in accum_from_entry is correct, but the merge step doesn't track per-(group, agg) non-null counts yet — needed before nullable inputs). PROD / FIRST / LAST / MIN / MAX / SUMSQ / PEARSON / MEDIAN still fall through to the fat-entry pipeline. Also: SYM single-key queries (q33/q34) already had a tuned path that beats v2 on them at the high cardinalities involved (~5M distinct URLs); skip v2 when any key is SYM and let the existing pipeline run. Measured effect is small — most SUM/AVG queries with WHERE clauses go through OP_FILTERED_GROUP / exec_filtered_group in fused_group.c, not through exec_group, so v2 here doesn't catch them. Lays the state-merge groundwork that a future fused_group v2 needs. Test suite: 2657/2659 pass (2 skipped, 0 failed).
…ndaries The DA-path min/max scan polls its abort flag every (i-start) & N == 0. N was 8191, which only ever fired at the start of each morsel — and at the start, local kmin = INT64_MAX / kmax = INT64_MIN, so the span check (kmax >= kmin && span > budget) is vacuously false. Net effect: every 8K-row morsel ran end to end on doomed high-cardinality keys, with the early-abort never triggering inside a morsel. Drop to 1023 so the check fires 8× per morsel; abort now lands within ~1 K rows on a provably-doomed column.
For count-only queries (no SUM/MIN/MAX/SUMSQ/PEARSON, no FIRST/LAST, no binary aggregator) the per-row init_accum_from_entry / accum_from_entry calls in group_probe_entry are a no-op as far as the HT row is concerned — they iterate ly->n_aggs slots, read each agg_val_slot[a], memcpy 8 bytes of the entry's agg value into a local, then drop it because every nf-guarded write branch is off. At 6 % of the q15 profile (~10 ns/row × 10 M rows / 8 cores ≈ 12 ms) that's pure waste. Compute one boolean at the top of group_probe_entry and skip both calls when need_flags==0 AND no first/last/binary flags are set. Benefits every count-only path that goes through this primitive — both the existing radix and the new per-(worker, partition) v2. Measured (focused, REPS=5): q15 169 → 150 ms (11 % faster on top of v2) q35 168 → 153 ms (9 %) q33 82 → 79 ms (the existing radix benefits too) q34 82 → 77 ms Test suite 2657/2659 (2 skipped, 0 failed).
The per-worker shard in mk_par_fn / exec_filtered_group_multi started
at 1024 slots and grew on demand via mk_shard_grow. For a 10M-row
high-cardinality query (e.g. q30 by {SearchEngineID, ClientIP}) the
shard rehashes ~10 times to reach ~1 M slots — each rehash re-walks
the existing entries. The q30 profile shows mk_shard_grow at 9.2 %.
Pre-size init_cap by ~nrows/(nw·16) capped at 16 K slots. Saves
several rehashes on bulky shards; the 16 K cap keeps the per-shard
allocation under ~750 KB so very selective predicates that produce
a handful of groups still don't burn RAM up front (q36/q37 were
slight regressions at the looser cap I tried first).
Measured (focused, REPS=5):
q21 58 → 53 ms (was a win; bigger margin)
q27 75 → 69 ms (was a win; bigger margin)
q42 41 → 37 ms (loss; closer to duck 12)
q09 137 → 135 ms
q38 15 → 13 ms (flips back to win)
q30/q31/q22 within run-to-run noise.
Test suite 2657/2659 (2 skipped, 0 failed).
New primitive in src/ops/hll.{h,c}:
ray_hll_t — register-array sketch, 1 B/register, P=14
default → 16 KB sketch, ~0.81 % std error
ray_hll_init/free/reset — lifecycle
ray_hll_add — inline; hash → register index + rho update
ray_hll_merge — element-wise max (parallel-safe combine)
ray_hll_estimate — Flajolet-Fusy-Gandouet-Meunier 2007
estimator with linear-counting branch for
small cardinalities
Two consumers:
ray_count_distinct_approx (scalar)
Parallel: each worker builds a private sketch over its row range,
main thread merges to one and emits the estimate. Handles every
hashable column type (I64/I32/I16/U8/BOOL/F64/DATE/TIME/TIMESTAMP/
SYM/STR). Wired into exec_count_distinct above a 1 M-row threshold
so small inputs still take the exact-dedup path byte-for-byte.
ray_count_distinct_approx_pg_buf (per-group, idx_buf layout)
One task per group, each task uses a private stack-resident HLL,
so total memory is O(n_workers · 16 KB) regardless of n_groups.
Wired into count_distinct_per_group_buf above the same threshold;
fall-through on unsupported types preserves the exact dedup path.
Measured (10M-row hits, in-memory):
q04 (count distinct UserID global) 78 → 8.6 ms (FLIP vs duck 72)
q05 (count distinct SearchPhrase) 19 → 4.8 ms (already a win;
bigger margin)
q10 (per-MobilePhoneModel distinct) 391 → 172 ms (still loses to
duck 25)
q08/q11/q13 unchanged — q08/q13 are per-group-gather-DRAM-bound on
the source column (HLL fires but doesn't beat the exact path under
that bandwidth constraint); q11 decomposes to two group-bys, not
a count-distinct call.
Estimate accuracy verified on q04: HLL 1 533 006 vs exact 1 530 143
(0.19 % rel. error, inside the ~0.8 % std error bound).
Full ClickBench: 22/43 wins (was 21/43, with q04 flipping cleanly).
Test suite 2657/2659 (2 skipped, 0 failed).
New index kind RAY_IDX_CHUNK_ZONE (5). Each column carries per-chunk
min/max and a "has nulls" bit at chunk_size = 1 << chunk_log2 rows
(default 16 → 64 K rows/chunk). Built once at column ingest time —
`.csv.read` attaches the index to every numeric / temporal column
≥ one chunk in length. Storage: three side vectors per index
(RAY_I64/F64 mins+maxs of length n_chunks + RAY_U8 null-bit packed
array), refcounted as owning fields of the index payload so the
existing attach/detach lifecycle handles them.
Two consumers:
scalar min/max reduce (`ray_min_fn` / `ray_max_fn`)
O(n_chunks) walk over mins[*] / maxs[*] instead of O(n_rows).
Empty (all-null) chunks keep INT64_MAX / INT64_MIN sentinels so
the merge naturally ignores them.
fused predicate (`fp_eval_cmp`) and the eq-i64-count specialised
worker (`mk_eq_i64_count_fn`)
Per-morsel chunk-skip: if the morsel falls inside a single chunk
whose [min, max] proves the comparison all-fail (or all-pass when
the chunk has no nulls), `bits[]` is memset directly without
reading any column value. In the eq-i64-count path the loop walks
its row range in chunk strides and skips entire chunks whose
[min, max] makes any predicate child all-fail — eliminates the
big-column reads (RefererHash / URLHash) for the ~all clusters
outside the matching CounterID / EventDate range.
Measured (10M-row hits, in-memory):
q06 (min/max EventDate) 6.4 → 0.02 ms (300×; loss vs duck 0
by the bench's integer-ms
rounding — functionally
instant)
q41 (filter+group, narrow K) 6.0 → 3.2 ms FLIP vs duck 5
q40 (filter+group, wide K) 17 → 13 ms closer to duck 4
q37 (filter+group, clustered) 15 → 12 ms bigger margin
q38 (filter+group, clustered) 17 → 15 ms bigger margin
Test suite 2657/2659 (2 skipped, 0 failed). Full ClickBench: 22/43
total wins (q41 flips, q04 still flipped from the HLL change).
ray_heap_gc's pass 5 walked every freelist of every registered heap and issued madvise(MADV_DONTNEED) on every free block > 4 KiB on every GC invocation. For repeated-query workloads (any analytical loop), the freed blocks were reused on the very next query — but madvise tore down the page tables and forced re-fault, paying the cost twice and dominating the profile after the actual worker compute (~21% of total query time on per-row eq workloads). Throttle pass 5 to once per 16 GCs. The long-running-process invariant (idle free blocks eventually return their physical pages to the OS) is preserved; the per-query madvise cost disappears. Callers needing prompt release continue to use the explicit ray_heap_release_pages() entry point. Passes 1-4 (foreign flush, slab flush, freelist return, oversized pool reclamation) still run every call — those are the correctness- relevant passes (cross-heap accounting, pool reusability).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
ray_heap_gc's pass 5 walked every freelist of every registered heap and issued madvise(MADV_DONTNEED) on every free block > 4 KiB on every GC invocation. For repeated-query workloads the freed blocks were reused on the very next call — but madvise tore down the page tables and forced re-fault, paying the cost twice and dominating the profile after worker compute.
Throttle pass 5 to once per 16 GCs. The long-running-process invariant (idle free blocks eventually return their physical pages to the OS) is preserved; the per-query madvise cost disappears. Callers needing prompt release continue to use the explicit
ray_heap_release_pages()entry point.Passes 1-4 (foreign flush, slab flush, freelist return, oversized pool reclamation) still run every call — those are the correctness-relevant passes (cross-heap accounting, pool reusability).
Test suite: 2657/2659 pass (2 skipped, 0 failed) under ASAN+UBSAN.