BE-629: Add spherical k-means entity clustering endpoint via /entities/embeddings/clusters #8919
indietyp wants to merge 8 commits
PR Summary (Cursor Bugbot, reviewed at commit 2953664): Medium Risk.
Pull request overview
Adds embedding-based spherical k-means clustering to the graph store and exposes it via a new REST endpoint (POST /entities/embeddings/clusters). The implementation introduces a SIMD-accelerated Rust clustering engine and wires it through the Postgres store, API routing/OpenAPI, and relevant store wrappers/shims.
Changes:
- Introduces a new `embedding` module in `hash_graph_store` with a `Dimension` invariant type, SIMD kernels, and a spherical k-means implementation.
- Extends the `EntityStore` API with `cluster_entities` and implements it in the Postgres store, including permission filtering and embedding truncation via `subvector`.
- Adds the REST endpoint and forwards the new store method through type-fetcher and integration test shims.
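To make the `Dimension` invariant concrete, here is a minimal sketch of a newtype that only admits positive multiples of 8. It is not the PR's code: the error enum `DimensionError` and the methods `new`, `get`, and `chunks` are hypothetical names chosen for illustration.

```rust
/// Hypothetical stand-in for the PR's `Dimension` type: a vector width that
/// is a positive multiple of 8, so it splits evenly into 8-lane chunks.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Dimension(usize);

/// Hypothetical error type for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DimensionError {
    Zero,
    NotMultipleOfEight(usize),
}

impl Dimension {
    /// Number of `f32` lanes processed per SIMD step.
    pub const LANES: usize = 8;

    /// Validates the invariant once, at construction time.
    pub fn new(value: usize) -> Result<Self, DimensionError> {
        if value == 0 {
            Err(DimensionError::Zero)
        } else if value % Self::LANES != 0 {
            Err(DimensionError::NotMultipleOfEight(value))
        } else {
            Ok(Self(value))
        }
    }

    /// The validated width.
    pub fn get(self) -> usize {
        self.0
    }

    /// Number of full 8-lane chunks; there is never a remainder.
    pub fn chunks(self) -> usize {
        self.0 / Self::LANES
    }
}
```

Checking the invariant in the constructor means code that receives a `Dimension` can iterate in fixed 8-lane steps without handling a tail.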
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/graph/integration/postgres/lib.rs | Forwards cluster_entities through the DatabaseApi integration shim. |
| libs/@local/graph/type-fetcher/src/store.rs | Forwards cluster_entities through the type-fetcher store wrapper. |
| libs/@local/graph/store/src/lib.rs | Enables required nightly features and registers the new embedding module. |
| libs/@local/graph/store/src/error.rs | Adds ClusterError for clustering-related failures. |
| libs/@local/graph/store/src/entity/store.rs | Adds request/response types and the EntityStore::cluster_entities trait method. |
| libs/@local/graph/store/src/entity/mod.rs | Re-exports the new clustering API types. |
| libs/@local/graph/store/src/embedding/mod.rs | Declares the new embedding submodules and lint expectations. |
| libs/@local/graph/store/src/embedding/kernel.rs | Implements SIMD-accelerated vector primitives and tests. |
| libs/@local/graph/store/src/embedding/dimension.rs | Adds Dimension newtype enforcing “positive multiple of 8”. |
| libs/@local/graph/store/src/embedding/clustering.rs | Implements spherical k-means (+ seeding/restarts/parallel assignment) and tests. |
| libs/@local/graph/postgres-store/src/store/postgres/knowledge/entity/mod.rs | Implements cluster_entities query + permission filtering + clustering execution. |
| libs/@local/graph/api/src/rest/entity.rs | Registers the new REST endpoint and nests existing embeddings routing. |
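As an illustration of what `kernel.rs` is described as doing, the sketch below shows a dot product accumulated in eight independent lanes next to a scalar reference, plus an in-place normalization. It is an assumption-laden stand-in: the PR uses `f32x8` on nightly, whereas this version uses a plain `[f32; 8]` accumulator so it compiles on stable Rust, and the names `dot_scalar` and `dot_chunked` are hypothetical.

```rust
const LANES: usize = 8;

/// Scalar reference implementation, used to check the chunked version.
pub fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Dot product with eight independent accumulators, the access pattern that
/// maps onto a single 8-lane SIMD register. Lengths must be equal and a
/// multiple of 8.
pub fn dot_chunked(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    assert_eq!(a.len() % LANES, 0, "length must be a multiple of 8");

    let mut acc = [0.0_f32; LANES];
    for (chunk_a, chunk_b) in a.chunks_exact(LANES).zip(b.chunks_exact(LANES)) {
        for lane in 0..LANES {
            acc[lane] += chunk_a[lane] * chunk_b[lane];
        }
    }
    // Horizontal reduction of the eight partial sums.
    acc.iter().sum()
}

/// Scales `v` to unit length in place. Returns `false` and leaves `v`
/// untouched when the norm is zero (a choice made for this sketch).
pub fn normalize(v: &mut [f32]) -> bool {
    let norm = dot_chunked(v, v).sqrt();
    if norm == 0.0 {
        return false;
    }
    for x in v.iter_mut() {
        *x /= norm;
    }
    true
}
```

Because the chunked version sums in a different order than the scalar one, comparisons on general inputs should use a small tolerance rather than exact equality.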
Codecov Report
@@ Coverage Diff @@
## main #8919 +/- ##
==========================================
+ Coverage 59.57% 59.89% +0.31%
==========================================
Files 1366 1369 +3
Lines 132760 134183 +1423
Branches 6045 6095 +50
==========================================
+ Hits 79094 80365 +1271
- Misses 52732 52870 +138
- Partials 934 948 +14
Merging this PR will degrade performance by 15.38%
Warning: Please fix the performance issues or acknowledge them on CodSpeed.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 5db85e2.
Benchmark results
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| resolve_policies_for_actor | user: empty, selectivity: high, policies: 2002 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: medium, policies: 1002 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: high, policies: 3314 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: medium, policies: 1527 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: high, policies: 2078 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: medium, policies: 1033 | Flame Graph |
policy_resolution_medium
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| resolve_policies_for_actor | user: empty, selectivity: high, policies: 102 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: medium, policies: 52 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: high, policies: 269 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: medium, policies: 108 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: high, policies: 133 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: medium, policies: 63 | Flame Graph |
policy_resolution_none
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| resolve_policies_for_actor | user: empty, selectivity: high, policies: 2 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: medium, policies: 2 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: high, policies: 8 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: medium, policies: 3 | Flame Graph |
policy_resolution_small
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| resolve_policies_for_actor | user: empty, selectivity: high, policies: 52 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: medium, policies: 26 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: high, policies: 94 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: medium, policies: 27 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: high, policies: 66 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: medium, policies: 29 | Flame Graph |
read_scaling_complete
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| entity_by_id;one_depth | 1 entities | Flame Graph | |
| entity_by_id;one_depth | 10 entities | Flame Graph | |
| entity_by_id;one_depth | 25 entities | Flame Graph | |
| entity_by_id;one_depth | 5 entities | Flame Graph | |
| entity_by_id;one_depth | 50 entities | Flame Graph | |
| entity_by_id;two_depth | 1 entities | Flame Graph | |
| entity_by_id;two_depth | 10 entities | Flame Graph | |
| entity_by_id;two_depth | 25 entities | Flame Graph | |
| entity_by_id;two_depth | 5 entities | Flame Graph | |
| entity_by_id;two_depth | 50 entities | Flame Graph | |
| entity_by_id;zero_depth | 1 entities | Flame Graph | |
| entity_by_id;zero_depth | 10 entities | Flame Graph | |
| entity_by_id;zero_depth | 25 entities | Flame Graph | |
| entity_by_id;zero_depth | 5 entities | Flame Graph | |
| entity_by_id;zero_depth | 50 entities | Flame Graph |
read_scaling_linkless
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| entity_by_id | 1 entities | Flame Graph | |
| entity_by_id | 10 entities | Flame Graph | |
| entity_by_id | 100 entities | Flame Graph | |
| entity_by_id | 1000 entities | Flame Graph | |
| entity_by_id | 10000 entities | Flame Graph |
representative_read_entity
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/block/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/book/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/building/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/organization/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/page/v/2 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/person/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/playlist/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/song/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/uk-address/v/1 | Flame Graph |
representative_read_entity_type
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| get_entity_type_by_id | Account ID: bf5a9ef5-dc3b-43cf-a291-6210c0321eba | Flame Graph |
representative_read_multiple_entities
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| entity_by_property | traversal_paths=0 | 0 | |
| entity_by_property | traversal_paths=255 | 1,resolve_depths=inherit:1;values:255;properties:255;links:127;link_dests:126;type:true | |
| entity_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:0;links:0;link_dests:0;type:false | |
| entity_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:0;links:1;link_dests:0;type:true | |
| entity_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:2;links:1;link_dests:0;type:true | |
| entity_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:2;properties:2;links:1;link_dests:0;type:true | |
| link_by_source_by_property | traversal_paths=0 | 0 | |
| link_by_source_by_property | traversal_paths=255 | 1,resolve_depths=inherit:1;values:255;properties:255;links:127;link_dests:126;type:true | |
| link_by_source_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:0;links:0;link_dests:0;type:false | |
| link_by_source_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:0;links:1;link_dests:0;type:true | |
| link_by_source_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:2;links:1;link_dests:0;type:true | |
| link_by_source_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:2;properties:2;links:1;link_dests:0;type:true |
scenarios
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| full_test | query-limited | Flame Graph | |
| full_test | query-unlimited | Flame Graph | |
| linked_queries | query-limited | Flame Graph | |
| linked_queries | query-unlimited | Flame Graph |


🌟 What is the purpose of this PR?
This PR adds a `POST /entities/embeddings/clusters` endpoint that groups a set of entities by embedding similarity using spherical k-means clustering. Callers supply a list of entity IDs, a desired cluster count, and an optional embedding dimension (matryoshka truncation). The response contains the cluster assignments with unit-normalized centroids, plus a list of entities that had no stored embedding.

The clustering algorithm is implemented from scratch in Rust using SIMD-accelerated kernels (`f32x8`), k-means++ seeding, multiple restarts, and parallel assignment via Rayon. Embeddings are truncated server-side in Postgres using `subvector` before being sent over the wire, keeping network cost proportional to the requested dimension rather than the full stored width.

The implementation is up to 24x faster than existing CPU-based crates.
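For readers unfamiliar with spherical k-means, the following is a minimal scalar sketch of the core loop: assign each unit-length point to the centroid with the highest cosine similarity, then replace each centroid with the normalized sum of its members. It deliberately simplifies what the PR describes: initial centroids are passed in by the caller instead of using k-means++ seeding, there are no restarts, no Rayon, no SIMD, and empty clusters keep their previous centroid rather than being reseeded. All function names are hypothetical.

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn normalize(v: &mut [f32]) {
    let norm = dot(v, v).sqrt();
    if norm > 0.0 {
        v.iter_mut().for_each(|x| *x /= norm);
    }
}

/// Index of the centroid with the largest dot product (cosine similarity
/// for unit vectors) with `point`. Ties go to the lowest index.
fn nearest(point: &[f32], centroids: &[Vec<f32>]) -> usize {
    let mut best = 0;
    let mut best_similarity = f32::NEG_INFINITY;
    for (index, centroid) in centroids.iter().enumerate() {
        let similarity = dot(point, centroid);
        if similarity > best_similarity {
            best = index;
            best_similarity = similarity;
        }
    }
    best
}

/// Runs Lloyd iterations on unit-length `points`, starting from the given
/// non-empty list of unit-length `centroids`. Stops when assignments no
/// longer change or after `max_iterations`.
pub fn spherical_kmeans(
    points: &[Vec<f32>],
    mut centroids: Vec<Vec<f32>>,
    max_iterations: usize,
) -> (Vec<usize>, Vec<Vec<f32>>) {
    let mut assignments = vec![usize::MAX; points.len()];
    for _ in 0..max_iterations {
        // Assignment step: nearest centroid by cosine similarity.
        let next: Vec<usize> = points.iter().map(|p| nearest(p, &centroids)).collect();
        if next == assignments {
            break;
        }
        assignments = next;

        // Update step: sum the members of each cluster.
        let dim = centroids[0].len();
        let mut sums = vec![vec![0.0_f32; dim]; centroids.len()];
        for (point, &cluster) in points.iter().zip(&assignments) {
            for (s, x) in sums[cluster].iter_mut().zip(point) {
                *s += x;
            }
        }
        // Project each sum back onto the unit sphere. A cluster with no
        // members (or a zero sum) keeps its old centroid in this sketch.
        for (centroid, mut sum) in centroids.iter_mut().zip(sums) {
            if sum.iter().any(|x| *x != 0.0) {
                normalize(&mut sum);
                *centroid = sum;
            }
        }
    }
    (assignments, centroids)
}
```

With the points `[1, 0]`, `[0.8, 0.6]`, `[0, 1]`, `[0.6, 0.8]` and initial centroids `[1, 0]` and `[0, 1]`, this sketch settles on the assignments `[0, 0, 1, 1]` and returns unit-length centroids.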
🔍 What does this change?
- Adds a `Dimension` newtype that enforces the positive-multiple-of-8 invariant required by the SIMD kernels.
- Adds a `kernel` module with SIMD-accelerated primitives: `dot`, `add_into`, `scale_into`, `scale`, `add_scaled_into`, `normalize`, `micro_4x2` (4-point × 2-centroid tiled dot product), and `nearest4` (nearest-centroid search for 4 points simultaneously).
- Adds a `clustering` module implementing spherical k-means with k-means++ D² seeding, Lloyd iterations, empty-cluster reseeding, convergence tolerance, and configurable restarts via a `Config` struct.
- Adds `ClusterEntitiesParams`, `EntityCluster`, and `ClusterEntitiesResponse` types to the entity store API.
- Adds a `ClusterError` error type covering invalid dimension, dimension-too-large, and store failure cases.
- Adds `cluster_entities` to the `EntityStore` trait and implements it in the Postgres store, including permission filtering that avoids leaking which entity IDs were denied versus missing embeddings.
- Registers `POST /entities/embeddings/clusters` and nests the existing `POST /entities/embeddings` handler under `/entities/embeddings/` to keep the routing consistent.
- Forwards `cluster_entities` through the type-fetcher store wrapper and the integration test `DatabaseApi` shim.

Pre-Merge Checklist 🚀
🚢 Has this modified a publishable library?
This PR:
📜 Does this require a change to the docs?
The changes in this PR:
🕸️ Does this require a change to the Turbo Graph?
The changes in this PR:
🛡 What tests cover this?
- Tests for `squared_chord_distance` covering identical, orthogonal, opposite, and zero-norm cases.
- Tests for the SIMD kernels (`dot`, `add_into`, `scale_into`, `scale`, `add_scaled_into`, `normalize`, `micro_4x2`, `nearest4`) verified against scalar reference implementations.
- Tests for the `Dimension` newtype covering valid multiples of 8, zero rejection, and non-multiples rejection.

❓ How to test this?
- Send a `POST /entities/embeddings/clusters` request with a JSON body containing `entityIds`, `clusterCount`, and optionally `dimension` and `seed`.
- Check that the response contains `clusters` (each with `clusterId`, `entityIds`, and `centroid`) and `missingEmbeddings` for any entities without stored embeddings.
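The `squared_chord_distance` cases named in the test list (identical, orthogonal, opposite, zero-norm) follow from the identity that, for unit vectors, the squared chord length is `2 - 2 * cos(a, b)`. The sketch below is a hedged illustration of that identity, not the PR's function: the signature is hypothetical, and returning `None` for a zero-norm input is an assumption, since the description does not state how the PR handles that case.

```rust
/// Squared chord distance between the directions of `a` and `b`, computed
/// as `2 - 2 * cos(a, b)`. Returns `None` if either vector has zero norm
/// (an assumption of this sketch; the PR's behaviour may differ).
pub fn squared_chord_distance(a: &[f32], b: &[f32]) -> Option<f32> {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");

    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }

    // Clamp to guard against rounding pushing the cosine outside [-1, 1].
    let cosine = (dot / (norm_a * norm_b)).clamp(-1.0, 1.0);
    Some(2.0 - 2.0 * cosine)
}
```

Under this definition identical directions give 0, orthogonal directions give 2, and opposite directions give 4, which is the range a test suite for the four listed cases would be expected to exercise.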