Vamana build optimization set and fp16 support #2264

Open: bkarsin wants to merge 15 commits into NVIDIA:main from bkarsin:vamana-build-opt

bkarsin commented Jun 25, 2026

This PR is a series of GPU Vamana build performance optimizations that address #2178 and #1757. The initial estimates in #1757 turned out to be inaccurate, so I tried many other optimizations, some successful and some abandoned. This PR includes:

GreedySearch optimizations:

  • Multi-warp blocks - increases occupancy of GreedySearch
  • Reduce shared memory used per warp
  • fp16 query approximation (along with fp16 support)
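To give a feel for the precision trade-off behind an fp16 query approximation, here is a CPU-side sketch in Python (not the kernel code from this PR). It rounds a query to IEEE half precision with the stdlib `struct` `e` format and compares the resulting squared L2 distance with the full-precision value. The helper names, the sample vectors, and the choice of squared L2 as the metric are assumptions made for illustration only.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a float to the nearest IEEE 754 half-precision value (stdlib only)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def squared_l2(a, b):
    """Squared Euclidean distance; the metric is an assumption for this sketch."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def query_bytes(dim: int, bytes_per_element: int) -> int:
    """Bytes needed to hold one query vector of `dim` coordinates."""
    return dim * bytes_per_element

# Hypothetical sample data, not taken from the benchmarks.
query = [0.1234567, -0.7654321, 0.3333333, 0.9999999]
candidate = [0.2, -0.7, 0.3, 1.0]

query_fp16 = [to_fp16(v) for v in query]       # half-precision copy of the query
exact = squared_l2(query, candidate)           # distance with the original query
approx = squared_l2(query_fp16, candidate)     # distance with the rounded query
rel_error = abs(exact - approx) / exact

print(f"exact={exact:.6f} approx={approx:.6f} rel_error={rel_error:.2e}")
print("fp32 query bytes at dim 960:", query_bytes(960, 4))
print("fp16 query bytes at dim 960:", query_bytes(960, 2))
```

Halving the bytes per coordinate is what makes room for more resident warps when the query is held in shared memory; whether the small distance error matters depends on the dataset, which is why recall is checked against the baseline.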

RobustPrune optimizations:

  • Multi-warp block support (block size depends on graph degree)
  • Caching accepted candidate vectors during occlusion loop
  • Remove redundant merge of candidate lists and avoid syncs
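For readers less familiar with the occlusion loop, the following is a minimal CPU reference of the textbook RobustPrune rule in Python, not the GPU kernel in this PR. It assumes squared L2 distances, takes query-to-candidate distances as input (mirroring the idea of re-using distances already computed by the search), and keeps copies of accepted vectors in a local list as a stand-in for caching them close to the compute. The function name, parameters, and `alpha` default are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance (assumed metric for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def robust_prune(candidates, vectors, degree, alpha=1.2):
    """Select up to `degree` neighbors from `candidates`.

    candidates: list of (dist_to_query, node_id), distances precomputed elsewhere.
    vectors:    mapping node_id -> coordinate list.
    A candidate is dropped if an already accepted neighbor occludes it:
    alpha * d(accepted, candidate) <= d(query, candidate).
    """
    accepted_ids = []
    accepted_vecs = []  # cached copies of accepted vectors, reused every iteration
    for dist_q, node in sorted(candidates):  # closest candidates first
        if len(accepted_ids) == degree:
            break
        vec = vectors[node]
        occluded = any(alpha * squared_l2(kept, vec) <= dist_q
                       for kept in accepted_vecs)
        if not occluded:
            accepted_ids.append(node)
            accepted_vecs.append(vec)
    return accepted_ids

# Hypothetical toy data: the query sits at the origin.
query = [0.0, 0.0]
vectors = {0: [1.0, 0.0], 1: [1.1, 0.0], 2: [0.0, 1.0], 3: [0.0, 3.0]}
candidates = [(squared_l2(query, v), n) for n, v in vectors.items()]

pruned = robust_prune(candidates, vectors, degree=4)
print(pruned)
```

In this toy case node 1 is occluded by node 0 and node 3 by node 2, so only two of the four candidates survive even though the degree limit is four.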

General optimizations:

  • Replace prefix_sums kernel with cub variant
  • Re-use distances from GreedySearch in RobustPrune kernel
  • fp16 support added
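As a reminder of what a prefix-sum step computes, here is a small Python sketch of an exclusive scan that turns per-node counts into write offsets. On the GPU, CUB offers device-wide scans such as `cub::DeviceScan::ExclusiveSum` for this operation; which CUB call this PR actually uses is not stated in the description, and the names and sample counts below are hypothetical.

```python
from itertools import accumulate

def exclusive_prefix_sum(counts):
    """Return offsets where offsets[i] is the sum of counts[0..i-1]."""
    if not counts:
        return []
    # accumulate(..., initial=0) yields 0, c0, c0+c1, ...; dropping the last
    # input keeps the output the same length as `counts`.
    return list(accumulate(counts[:-1], initial=0))

# Hypothetical number of reverse edges to write for each of four nodes.
edge_counts = [3, 0, 2, 5]
offsets = exclusive_prefix_sum(edge_counts)
total = offsets[-1] + edge_counts[-1]

print(offsets, total)
```

Each node can then write its edges starting at its own offset without coordinating with the others, and `total` gives the buffer size to allocate.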

Together these optimizations speed up nearly every configuration tested, with minimal recall variance compared to the current baseline; the one exception in the tables below is int8 64/32 on H100, where the optimized build is marginally slower (1.74 vs 1.69). I benchmarked performance across a range of synthetic datasets and two real-world datasets. (NOTE: I am still finishing benchmarks and will update the tables below once all results are collected.)

Synthetic dataset build tests (all 1M vector datasets)

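| TYPE | Dim/Degree | RTX pro 6000 BASELINE | RTX pro 6000 OPT | H100 BASELINE | H100 OPT | L4 BASELINE | L4 OPT |
|---|---|---|---|---|---|---|---|
| fp32 | 64/32 | 1.11 | 1.07 | 1.35 | 1.19 | 9.83 | 3.57 |
| fp32 | 768/32 | 5.98 | 4.03 | 6.01 | 4.92 | 36.6 | 25.8 |
| fp32 | 960/32 | 7.82 | 5.61 | 8.23 | 6.52 | 50.1 | 38.7 |
| fp32 | 64/64 | 4.4 | 4.05 | 6.14 | 5.4 | 16.4 | 14.8 |
| fp32 | 768/64 | 28.8 | 18.2 | 33.7 | 25.3 | 194 | 125 |
| fp32 | 960/64 | 39.4 | 24.9 | 50.7 | 36.5 | 251 | 185 |
| fp16 | 64/32 | | 1.05 | | | | 3.8 |
| fp16 | 768/32 | | 2.5 | | | | 11.1 |
| fp16 | 960/32 | | 3.13 | | | | 14.1 |
| fp16 | 64/64 | | 4.01 | | | | 13.6 |
| fp16 | 768/64 | | 10.4 | | | | 53.8 |
| fp16 | 960/64 | | 14.2 | | | | 65.6 |
| int8 | 64/32 | 1.06 | 0.966 | 1.69 | 1.74 | 3.5 | 3.35 |
| int8 | 768/32 | 2.98 | 2.38 | 4.82 | 3.69 | 11.2 | 9.21 |
| int8 | 960/32 | 3.85 | 3 | 6.39 | 4.32 | 15.1 | 12 |
| int8 | 64/64 | 4.15 | 3.61 | 6.36 | 6.05 | 14.4 | 13.3 |
| int8 | 768/64 | 12.9 | 9.74 | 22.4 | 14.9 | 51.7 | 39.7 |
| int8 | 960/64 | 17.1 | 12.5 | 28.8 | 19.9 | 72.1 | 52.1 |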
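To read the table as speedups, one can divide each BASELINE value by the matching OPT value. The sketch below does this for three rows copied from the table, assuming the values are build times (so lower is better); the helper name and the choice of rows are illustrative only.

```python
def speedup(baseline: float, optimized: float) -> float:
    """Ratio > 1 means the optimized build is faster (assuming values are times)."""
    return baseline / optimized

# (BASELINE, OPT) pairs copied from the synthetic dataset table.
rows = {
    "L4 fp32 64/32": (9.83, 3.57),
    "RTX pro 6000 fp32 768/64": (28.8, 18.2),
    "H100 int8 64/32": (1.69, 1.74),
}

for name, (baseline, optimized) in rows.items():
    print(f"{name}: {speedup(baseline, optimized):.2f}x")
```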

I also tested the real-world BIGANN 10M (uint8, 128D) and GIST (fp32, 960D) datasets:

| Dataset | deg / iters | RTX pro 6000 BASELINE | RTX pro 6000 OPT | H100 BASELINE | H100 OPT | L4 BASELINE | L4 OPT |
|---|---|---|---|---|---|---|---|
| BIGANN | 32 / 1.0 | | | 13.1152 | 12.58 | 40.3484 | 39.887 |
| BIGANN | 32 / 2.0 | | | 32.0109 | 30.4 | 90.6138 | 89.994 |
| BIGANN | 64 / 1.0 | | | 48.1399 | 43.65 | 126.248 | 120.394 |
| BIGANN | 64 / 2.0 | | | 118.444 | 107.91 | 299.2 | 285.543 |
| GIST (fp32) | 32 / 1.0 | 6.39807 | 4.9804 | 6.773 | 5.51 | 41.48 | 32.3789 |
| GIST (fp32) | 32 / 2.0 | 16.2624 | 12.1026 | 19.0785 | 13.7675 | 103.197 | 76.84 |
| GIST (fp32) | 64 / 1.0 | 22.9393 | 16.74 | 27.2307 | 20.1767 | 150.2 | 109.98 |
| GIST (fp32) | 64 / 2.0 | 60.1051 | 41.452 | 71.1611 | 53.6482 | 380.185 | 262.1 |
| GIST (fp16) | 32 / 1.0 | | 2.597 | | 3.80264 | | |
| GIST (fp16) | 32 / 2.0 | | 6.176 | | 8.692 | | |
| GIST (fp16) | 64 / 1.0 | | 9.965 | | 13.793 | | |
| GIST (fp16) | 64 / 2.0 | | 23.95 | | 34.1922 | | |

bkarsin added 15 commits June 15, 2026 14:11
…e vector in shared memory in the RobustPrune occlusion loop

(cherry picked from commit f45fd1b49283434eb4a3017da069ead501e938c3)
…efix sum with cub scan and hoist per-batch reverse-edge allocations

(cherry picked from commit 3b8650f5ebea52421f187332a2f6f3bdd599c42e)
…lusion across multiple warps per query (raise occupancy)

(cherry picked from commit 2e02f938f97e65ca073daf07397d538432b52867)
… query->existing-edge distances in the RobustPrune merge (avoid recompute)

(cherry picked from commit 041d355c585b7f98302d054b80ac287d637a07e4)
…oords in FP16 smem for dim>=512 to raise GreedySearch occupancy (salvage of N8)
…nce (one warp) instead of redundantly on all 128 threads, then broadcast
…ck (4 vs 8) to raise occupancy/MLP on the degree-64 occlusion sweep

copy-pr-bot Bot commented Jun 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
