perf: optimize vector search SQL to leverage HNSW/GIN index for 40-500x speedup#5285
Open
zhouliang5266 wants to merge 1 commit into
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.
[APPROVALNOTIFIER] This PR is NOT APPROVED
…eedup)
Problem: SQL queries perform full table scans despite per-KB HNSW indexes
existing in common.py. Two root causes:
1. SQL patterns (ORDER BY distance without LIMIT, ts_rank_cd without @@)
don't trigger index usage
2. knowledge_id__in bypasses partial indexes (WHERE knowledge_id = '{k_id}')
Changes (4 files):
SQL optimization:
- embedding_search.sql: CTE + LIMIT to fetch top-K candidates via HNSW
- blend_search.sql: Two-phase - HNSW candidates first, then JOIN for text scores
- keywords_search.sql: Add @@ GIN pre-filter before ts_rank_cd
Query routing (pg_vector.py):
- Split hit_test()/query() to iterate per-KB when multiple knowledge bases
are queried, ensuring each query hits its partial HNSW index
- Update parameter arrays in 3 search classes for new SQL placeholder order
Benchmark (770K vectors, 3840 dims, 22GB, 5 KBs):
- blend: 16,358ms -> ~220ms (74x)
- embedding: 6,551ms -> ~160ms (41x)
- keywords: 10,662ms -> ~20ms (533x)
Summary
The current vector search SQL queries perform full table scans even when HNSW/GIN indexes exist, causing severe performance degradation on large datasets.
This PR rewrites the 3 search SQL files to use index-friendly query patterns and adds per-knowledge-base query routing to leverage partial HNSW indexes, achieving 40-500x speedup on large-scale deployments.
Problem
Two issues prevent HNSW/GIN indexes from being used:
1. SQL queries don't utilize indexes
Although common.py already creates per-knowledge-base HNSW indexes (embedding_hnsw_idx_{k_id}), the SQL queries fail to use them:
- embedding_search.sql: ORDER BY distance scans the entire embedding table (no LIMIT in the inner query), so pgvector falls back to exact search
- blend_search.sql: ts_rank_cd computed for every row in a single pass, a full table scan with no early termination
- keywords_search.sql: ts_rank_cd() computed on every row without an @@ pre-filter, so the GIN index is not used
2. Per-KB partial indexes bypassed by knowledge_id__in
hit_test() and query() use knowledge_id__in=knowledge_id_list, which produces WHERE knowledge_id IN (...). Each partial index is defined with WHERE knowledge_id = '{k_id}', so PostgreSQL cannot use it for this predicate and falls back to a full table scan across all knowledge bases.
Solution
SQL Optimization (3 files)
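A minimal sketch of the candidate-then-refine shape used for the embedding query, not the PR's actual SQL. The table and column names (embedding, paragraph_id, knowledge_id), the cosine-distance operator, and the helper names are assumptions for illustration; only the CTE name and the LIMIT LEAST(top_number * 10, 500) cap come from the description.

```python
# Hypothetical sketch: schema names and the similarity formula are assumptions.
EMBEDDING_SEARCH_SQL = """
WITH vector_top AS (
    -- ORDER BY <distance expression> plus LIMIT lets the HNSW index serve the scan
    SELECT paragraph_id,
           embedding <=> %s::vector AS distance
    FROM embedding
    WHERE knowledge_id = %s
    ORDER BY embedding <=> %s::vector
    LIMIT LEAST(%s * 10, 500)
)
SELECT paragraph_id, similarity
FROM (
    -- keep the closest row per paragraph, computed on the small candidate set only
    SELECT DISTINCT ON (paragraph_id)
           paragraph_id,
           1 - distance AS similarity
    FROM vector_top
    ORDER BY paragraph_id, distance
) AS best_per_paragraph
WHERE similarity > %s
ORDER BY similarity DESC
LIMIT %s
"""


def candidate_limit(top_number: int) -> int:
    """Python mirror of LEAST(top_number * 10, 500) from the CTE."""
    return min(top_number * 10, 500)


def embedding_params(query_vector: str, knowledge_id: str,
                     top_number: int, threshold: float) -> tuple:
    """Return parameters in the same order as the %s placeholders above."""
    return (query_vector, knowledge_id, query_vector,
            top_number, threshold, top_number)
```

Over-fetching ten times the requested rows (capped at 500) leaves room for the DISTINCT ON and threshold steps to discard candidates while the index scan stays small.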
embedding_search.sql: Use a CTE (WITH vector_top AS) to first fetch top-K candidates via the HNSW index with LIMIT LEAST(top_number * 10, 500), then apply DISTINCT ON and threshold filtering on the small candidate set.
blend_search.sql: Two-phase approach. A CTE first gets vector candidates via HNSW (with LIMIT), then JOIN embedding to compute ts_rank_cd text scores only on the candidate set. Uses COALESCE(..., 0) for rows without text scores.
keywords_search.sql: Add an AND search_vector @@ websearch_to_tsquery('simple', %s) pre-filter to leverage the GIN index, avoiding full-table ts_rank_cd computation.
Per-KB Query Routing (pg_vector.py)
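A minimal sketch of the routing idea in this section, not the code in pg_vector.py: issue one query per knowledge base, then merge by similarity. The function name, the search_one_kb callable (a stand-in for a query filtered with an exact knowledge_id match), and the row shape are all hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def search_per_kb(knowledge_ids: List[str],
                  search_one_kb: Callable[[str], List[Row]],
                  top_number: int) -> List[Row]:
    """Query each knowledge base separately and merge the results.

    `search_one_kb(kid)` is assumed to return rows sorted by descending
    "similarity" for exactly one knowledge base.
    """
    if len(knowledge_ids) == 1:
        # Single KB: nothing to merge, return the per-KB result directly.
        return search_one_kb(knowledge_ids[0])[:top_number]

    merged: List[Row] = []
    for kid in knowledge_ids:
        merged.extend(search_one_kb(kid))
    # Global ordering across knowledge bases, then cut to the requested size.
    merged.sort(key=lambda row: row["similarity"], reverse=True)
    return merged[:top_number]
```

The trade-off is one round trip per knowledge base instead of one query in total, which pays off only when each per-KB query is served by its partial index.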
Split hit_test() and query() to iterate per knowledge base when multiple KBs are queried. Each per-KB query uses knowledge_id=kid (exact match), which lets PostgreSQL use the corresponding partial HNSW index. Results from all KBs are merged and sorted by similarity. The single-KB case is optimized to avoid overhead.
Parameter Update (pg_vector.py)
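With positional %s placeholders, each placeholder added to a SQL file needs a matching entry at the same position in the parameter array. A hypothetical keywords query with the @@ pre-filter shows why: the search text now appears twice. The table, column, and function names here are assumptions, not the contents of keywords_search.sql.

```python
# Hypothetical sketch: schema names are assumptions for illustration only.
KEYWORDS_SEARCH_SQL = """
SELECT paragraph_id,
       ts_rank_cd(search_vector, websearch_to_tsquery('simple', %s)) AS score
FROM embedding
WHERE knowledge_id = %s
  -- the @@ match is what a GIN index on search_vector can answer
  AND search_vector @@ websearch_to_tsquery('simple', %s)
ORDER BY score DESC
LIMIT %s
"""


def keywords_params(query_text: str, knowledge_id: str, top_number: int) -> tuple:
    """Return parameters in placeholder order; query_text is passed twice."""
    return (query_text, knowledge_id, query_text, top_number)
```

A cheap guard against drift between a SQL file and its parameter array is to compare the placeholder count with the tuple length in a unit test.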
Update parameter arrays in all 3 search classes (EmbeddingSearch, KeywordsSearch, BlendSearch) to match the new SQL placeholder order.
Performance Results
Tested on production data: 770K vectors, 22GB embedding table, 5 knowledge bases, PostgreSQL 17 + pgvector
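The timings reported in the commit message give the speedup factors directly. A small sketch of the arithmetic follows; the "after" values are the approximate figures (~220 ms, ~160 ms, ~20 ms) from that message, so the ratios are approximate as well.

```python
# (before_ms, after_ms) per search mode, taken from the commit message.
TIMINGS_MS = {
    "blend": (16358, 220),
    "embedding": (6551, 160),
    "keywords": (10662, 20),
}


def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio of the old latency to the new latency."""
    return before_ms / after_ms


for mode, (before_ms, after_ms) in TIMINGS_MS.items():
    print(f"{mode}: {before_ms} ms -> ~{after_ms} ms ({round(speedup(before_ms, after_ms))}x)")
```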
The test environment uses a 3840-dimension embedding model (beyond pgvector's default 2000-dim HNSW limit), with additional halfvec configuration applied separately. The optimizations in this PR are dimension-independent and benefit all deployments.
Additional Recommendation: Disable PostgreSQL JIT
PostgreSQL JIT compilation was designed for long-running analytical queries. For vector search queries that complete in <10ms with HNSW indexes, JIT compilation overhead (50-200ms) far exceeds the actual query execution time.
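JIT can also be switched off for a single application's connections rather than server-wide. A minimal sketch, assuming a libpq-based driver such as psycopg2 that forwards an options keyword to the server; the helper name is hypothetical and this PR does not include such code.

```python
def with_jit_disabled(conn_kwargs: dict) -> dict:
    """Return a copy of the connection kwargs with `-c jit=off` appended.

    libpq passes the `options` string to the server as command-line
    options, so `-c jit=off` sets the parameter for that session only.
    """
    merged = dict(conn_kwargs)
    existing = str(merged.get("options", "")).strip()
    merged["options"] = f"{existing} -c jit=off".strip()
    return merged
```

With psycopg2 installed, the result could be passed as psycopg2.connect(**with_jit_disabled(kwargs)), where kwargs holds the usual connection settings.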
Recommended: Add jit = off to postgresql.conf. PostgreSQL 19 will disable JIT by default, aligning with this recommendation.
Before disabling JIT (single embedding query):
After disabling JIT:
Prerequisites
common.py per knowledge base)
search_vector column (recommended for keywords_search optimization):
Testing