Vamana build optimization set and fp16 support by bkarsin · Pull Request #2264 · NVIDIA/cuvs

bkarsin · 2026-06-25T00:11:03Z

A series of GPU Vamana build performance optimizations that address #2178 and #1757. The initial estimates from #1757 were not accurate, so many other optimizations were tried (some abandoned, some successful). This PR includes:

GreedySearch optimizations:

Multi-warp blocks - increases occupancy of GreedySearch
Reduce shared memory used per warp
fp16 query approximation (along with fp16 support)
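
As a rough illustration of the fp16 query approximation listed above, here is a minimal host-side Python sketch. It is not the kernel code from this PR. It rounds a query vector to half precision with the standard library's `struct` format `e`, then compares squared L2 distances against the unrounded query. Python floats are double precision, so they only stand in for fp32 here. The helper names, the random data, and the 32-point sample are all assumptions made for the example.

```python
import random
import struct


def to_fp16(x: float) -> float:
    """Round a float to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def query_cache_bytes(dim: int, bytes_per_coord: int) -> int:
    """Bytes needed to cache one query vector (hypothetical helper)."""
    return dim * bytes_per_coord


random.seed(0)
dim = 768  # one of the dimensions benchmarked in this PR
query = [random.uniform(-1.0, 1.0) for _ in range(dim)]
points = [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(32)]

# Approximate only the query in fp16; the points stay at full precision here.
query_fp16 = [to_fp16(x) for x in query]

exact = [l2_squared(query, p) for p in points]
approx = [l2_squared(query_fp16, p) for p in points]
max_rel_err = max(abs(a - e) / e for a, e in zip(approx, exact))

print(f"4-byte query cache: {query_cache_bytes(dim, 4)} bytes")
print(f"2-byte query cache: {query_cache_bytes(dim, 2)} bytes")
print(f"max relative distance error: {max_rel_err:.2e}")
```

Halving the bytes per cached query coordinate is what lets more warps share the same shared memory budget. The distance error introduced by rounding the query is small relative to the distances themselves on data in this range, but other value ranges may behave differently.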

RobustPrune optimizations:

Multi-warp block support (block size depends on graph degree)
Caching accepted candidate vectors during occlusion loop
Remove redundant merge of candidate lists and avoid syncs
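
The occlusion loop mentioned above can be sketched on the host as follows. This is a simplified Python version of the standard Vamana RobustPrune rule, not the CUDA kernel in this PR: a remaining candidate `c` is dropped when `alpha * d(accepted, c) <= d(query, c)`. The function name, the `alpha` value, and the toy data are assumptions. The local variable `cached` only stands in for staging the accepted candidate's vector once (in shared memory on the GPU) before sweeping the remaining candidates.

```python
import math


def robust_prune(candidates, vectors, degree, alpha):
    """Select up to `degree` neighbors from `candidates`.

    candidates: list of (distance_to_query, node_id) pairs. The distances
                are passed in rather than recomputed.
    vectors:    mapping from node_id to its coordinate list.
    """
    remaining = sorted(candidates)  # closest candidate first
    accepted = []
    while remaining and len(accepted) < degree:
        _, best = remaining.pop(0)
        accepted.append(best)
        # Fetch the accepted vector once and reuse it for the whole sweep.
        cached = vectors[best]
        remaining = [
            (dist, node)
            for dist, node in remaining
            if alpha * math.dist(cached, vectors[node]) > dist
        ]
    return accepted


# Toy example with the query at the origin.
vectors = {0: [1.0, 0.0], 1: [1.1, 0.0], 2: [0.0, 2.0]}
query = [0.0, 0.0]
candidates = [(math.dist(query, v), node) for node, v in vectors.items()]
print(robust_prune(candidates, vectors, degree=2, alpha=1.2))
```

In the toy data, node 1 sits almost on top of node 0, so it is occluded once node 0 is accepted, while node 2 lies in a different direction and survives.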

General optimizations:

Replace prefix_sums kernel with cub variant
Re-use distances from GreedySearch in RobustPrune kernel
fp16 support added
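
For readers unfamiliar with why a prefix sum appears in a graph build, the sketch below shows one common use: turning per-node counts into write offsets for a flat edge array. It is written in Python for readability. On the GPU this role is typically filled by a CUB device scan such as `cub::DeviceScan::ExclusiveSum`; whether this PR uses that exact call, and whether the scan feeds reverse-edge storage in this way, are assumptions. All function names here are illustrative.

```python
from itertools import accumulate


def exclusive_scan(counts):
    """Exclusive prefix sum: out[i] = sum(counts[:i])."""
    if not counts:
        return []
    return [0] + list(accumulate(counts))[:-1]


def build_reverse_edges(num_nodes, edges):
    """Group reverse edges by destination node in one flat array.

    edges: list of (src, dst) pairs. The reverse edge stores src under dst.
    Returns (offsets, counts, flat), where the reverse neighbors of node n
    are flat[offsets[n] : offsets[n] + counts[n]].
    """
    counts = [0] * num_nodes
    for _, dst in edges:
        counts[dst] += 1

    offsets = exclusive_scan(counts)

    flat = [None] * len(edges)
    cursor = list(offsets)  # next free slot for each destination node
    for src, dst in edges:
        flat[cursor[dst]] = src
        cursor[dst] += 1
    return offsets, counts, flat


offsets, counts, flat = build_reverse_edges(3, [(0, 1), (2, 1), (1, 0), (0, 2)])
print(offsets, counts, flat)
```

Because every offset is known after a single scan, the output array can be allocated once at its final size instead of growing per node.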

Together these optimizations give speedups in nearly every configuration tested, with minimal recall variance compared to the current baseline. The one exception in the tables below is int8 64/32 on H100 (1.69 baseline vs. 1.74 optimized). I benchmarked performance across a range of synthetic datasets and two real-world datasets. (NOTE: I am still finishing the benchmarks and will update the tables below once all results are collected.)

Synthetic dataset build tests (all 1M vector datasets)

		RTX pro 6000	RTX pro 6000	H100	H100	L4	L4
TYPE	Dim/Degree	BASELINE	OPT	BASELINE	OPT	BASELINE	OPT
fp32	64/32	1.11	1.07	1.35	1.19	9.83	3.57
fp32	768/32	5.98	4.03	6.01	4.92	36.6	25.8
fp32	960/32	7.82	5.61	8.23	6.52	50.1	38.7
fp32	64/64	4.4	4.05	6.14	5.4	16.4	14.8
fp32	768/64	28.8	18.2	33.7	25.3	194	125
fp32	960/64	39.4	24.9	50.7	36.5	251	185
fp16	64/32		1.05				3.8
fp16	768/32		2.5				11.1
fp16	960/32		3.13				14.1
fp16	64/64		4.01				13.6
fp16	768/64		10.4				53.8
fp16	960/64		14.2				65.6
int8	64/32	1.06	0.966	1.69	1.74	3.5	3.35
int8	768/32	2.98	2.38	4.82	3.69	11.2	9.21
int8	960/32	3.85	3	6.39	4.32	15.1	12
int8	64/64	4.15	3.61	6.36	6.05	14.4	13.3
int8	768/64	12.9	9.74	22.4	14.9	51.7	39.7
int8	960/64	17.1	12.5	28.8	19.9	72.1	52.1
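
To make the table easier to read, the speedup for a configuration is simply BASELINE divided by OPT. The short Python sketch below computes it for a few rows copied from the table above. It assumes the values are build times where lower is better; the source does not state the unit. The dictionary layout and function name are illustrative.

```python
# (type, dim/degree, GPU) -> (BASELINE, OPT), copied from the table above.
rows = {
    ("fp32", "64/32", "L4"): (9.83, 3.57),
    ("fp32", "768/64", "RTX pro 6000"): (28.8, 18.2),
    ("fp32", "960/64", "H100"): (50.7, 36.5),
    ("int8", "64/32", "H100"): (1.69, 1.74),
}


def speedup(baseline: float, opt: float) -> float:
    """Ratio above 1.0 means the optimized build is faster."""
    return baseline / opt


for (dtype, config, gpu), (baseline, opt) in rows.items():
    print(f"{dtype} {config} on {gpu}: {speedup(baseline, opt):.2f}x")
```

A ratio below 1.0, as in the int8 64/32 row on H100, indicates that the optimized build was slightly slower for that configuration.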

Also tested real-world BIGANN 10M (uint8 128D) and GIST (fp32 960D) datasets:

		RTX pro 6000	RTX pro 6000	H100	H100	L4	L4
Dataset	deg / iters	BASELINE	OPT	BASELINE	OPT	BASELINE	OPT
BIGANN	32 / 1.0			13.1152	12.58	40.3484	39.887
BIGANN	32 / 2.0			32.0109	30.4	90.6138	89.994
BIGANN	64 / 1.0			48.1399	43.65	126.248	120.394
BIGANN	64 / 2.0			118.444	107.91	299.2	285.543
GIST (fp32)	32 / 1.0	6.39807	4.9804	6.773	5.51	41.48	32.3789
GIST (fp32)	32 / 2.0	16.2624	12.1026	19.0785	13.7675	103.197	76.84
GIST (fp32)	64 / 1.0	22.9393	16.74	27.2307	20.1767	150.2	109.98
GIST (fp32)	64 / 2.0	60.1051	41.452	71.1611	53.6482	380.185	262.1
GIST (fp16)	32 / 1.0		2.597		3.80264
GIST (fp16)	32 / 2.0		6.176		8.692
GIST (fp16)	64 / 1.0		9.965		13.793
GIST (fp16)	64 / 2.0		23.95		34.1922

copy-pr-bot · 2026-06-25T00:11:06Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

bkarsin added 15 commits June 15, 2026 14:11

a4d82e1: Initial multi-warp greedy search optimization (cherry picked from commit 14e36b3)
05cc8cc: Add fp16 support for Vamana build/serialize (cherry picked from commit d8b547b)
5bb5e6c: Fix recall issue with multi-warp optimization (cherry picked from commit 4d970e0)
3dca1b0: Speed up fp16 vamana build with promoted query vectors and optimized L2 comparators. (cherry picked from commit 0746d15)
02b5638: perf(vamana): P1_cache_candidate_vector — Cache the accepted candidate vector in shared memory in the RobustPrune occlusion loop (cherry picked from commit f45fd1b49283434eb4a3017da069ead501e938c3)
9ec9b99: perf(vamana): O1_parallel_scan_buffer_pool — Replace single-thread prefix sum with cub scan and hoist per-batch reverse-edge allocations (cherry picked from commit 3b8650f5ebea52421f187332a2f6f3bdd599c42e)
696d3d2: perf(vamana): N3_multiwarp_robust_prune — Parallelize RobustPrune occlusion across multiple warps per query (raise occupancy) (cherry picked from commit 2e02f938f97e65ca073daf07397d538432b52867)
24f2ed7: perf(vamana): N7_reuse_search_distances_in_prune — Reuse GreedySearch query->existing-edge distances in the RobustPrune merge (avoid recompute) (cherry picked from commit 041d355c585b7f98302d054b80ac287d637a07e4)
bbe3d50: perf(vamana): M3_fp16_query_smem_occupancy — Store the cached query coords in FP16 smem for dim>=512 to raise GreedySearch occupancy (salvage of N8)
eeca286: perf(vamana): M5_prune_single_warp_merge — Do the RobustPrune merge once (one warp) instead of redundantly on all 128 threads, then broadcast
4348e3e: perf(vamana): M6_prune_warps_tuning — Sweep RobustPrune warps-per-block (4 vs 8) to raise occupancy/MLP on the degree-64 occlusion sweep
4a0c179: Clean up and remove dead code
cb91c99: pre-commit fixes
2dc5fa9: Fix bug with odd dimension data for some cases
a9b8a8d: Merge branch 'NVIDIA:main' into vamana-build-opt

github-project-automation Bot added this to Unstructured Data Processing Jun 25, 2026

bkarsin wants to merge 15 commits into NVIDIA:main from bkarsin:vamana-build-opt
