Skip to content

Predictable raft::resources#3052

Open
achirkin wants to merge 18 commits into
NVIDIA:mainfrom
achirkin:enh-predictable-resources
Open

Predictable raft::resources#3052
achirkin wants to merge 18 commits into
NVIDIA:mainfrom
achirkin:enh-predictable-resources

Conversation

@achirkin

@achirkin achirkin commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Make the raft::resources resource initialization semantics more predictable.

  1. All resources still are still initialized lazily on-demand (no change), but behave as-if they were initialized during the raft::resources handle construction: all copies of the raft::resources handle point to the same resources (not a breaking change, fixes the re-initialization issue).
  2. set_resource changes to non-const semantics (breaking change).
    Before: all set_xxx resource-updating calls were operating on const handle
    Now: all set_xxx resources require a non-const handle.

Thread-safety

First and foremost, thread-safety of using a specific resource depends on that resource. Here we discuss the thread-safety of using raft::resources handle itself (get_resource and set_resource functions).

Accessing the same resource by const ref

Updates (set_resource) are not possible (a user should copy a handle and modify the new one). All concurrent get_resource calls are atomic and safe, even if they initialize the factories and resources under the hood. The worst can happen is the same resource being initialized concurrently in two threads but only one being stored in the handle (the other one is discarded).

Accessing the same resource by non-const ref

Using the same object by non-const ref from multiple threads is always unsafe.

Accessing copies of the same resource

The resources and factories are updated atomically. Modifying any resource doesn't propagate to the copies. Accessing a resource while another threads access or modifies the same resource via another handle is thus safe.

Implementation details

The PR adds one more layer of indirection (one extra shared_ptr) to each resource, which may cause an extra runtime overhead. This is unavoidable if we want to allow lazy-initialized resources back-propagate across handles.
On the other hand, the PR removes the handle mutex in favor of C++20 std::atomic<shared_ptr>, which may reduce the runtime overheads a little bit.

Breaking changes

  • All resource setters change the function signature - this is a big breaking change. However, all known use-sites already call the resources setters on non-const raft::resources handles.
  • std::atomic<shared_ptr> requires enforcing C++20 for all dependent libraries. An alternative would be to put a mutex per raft::resource_cell.

@achirkin achirkin self-assigned this Jun 9, 2026
@achirkin achirkin requested review from a team as code owners June 9, 2026 11:39
@achirkin achirkin added improvement Improvement / enhancement to an existing function breaking Breaking change labels Jun 9, 2026
@achirkin achirkin moved this to In Progress in Unstructured Data Processing Jun 9, 2026
@achirkin achirkin mentioned this pull request Jun 11, 2026
@achirkin

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Updated the project to use C++20 and CUDA C++20.
    • Resource settings now behave more consistently across copied or moved handles.
  • Bug Fixes

    • Improved reliability when configuring streams, memory resources, communicators, and GPU handles.
    • Fixed cases where resource updates were not applied as expected across related resource objects.
  • Tests

    • Added coverage for resource copying, moving, and lazy initialization behavior.

Walkthrough

The PR replaces the mutex-protected factories_/resources_ vectors in raft::resources with a per-slot atomic shared-cell model (resource_cell). A new ensure_default_factory method installs default factories via lock-free compare_exchange_strong. All resource getter helpers adopt this method; setter helpers lose their const qualifier. resources, device_resources, and handle_t gain defaulted copy/move semantics. Memory-tracking subclasses adapt to the new cells_ storage. C++ and CUDA standards are bumped to 20.

Changes

raft::resources atomic cell model refactor

Layer / File(s) Summary
resource_cell type and resources atomic implementation
cpp/include/raft/core/resource/resource_types.hpp, cpp/include/raft/core/resources.hpp
Replaces empty_resource/empty_resource_factory with resource_cell holding atomic shared_ptr<resource_factory> and shared_ptr<resource>. Rewrites resources constructor, has_resource_factory, add_resource_factory, and get_resource to use a vector of shared cells without a mutex. Adds ensure_default_factory using compare_exchange_strong. Enables defaulted copy/move on resources.
Defaulted copy/move on device_resources and handle_t
cpp/include/raft/core/device_resources.hpp, cpp/include/raft/core/handle.hpp
Removes custom copy constructors and deleted move operations on device_resources and handle_t, replacing them with explicitly defaulted copy/move constructors and assignment operators. Removes <mutex> and <unordered_map> from device_resources.hpp.
Resource getters → ensure_default_factory; setters → non-const resources&
cpp/include/raft/core/resource/cuda_stream.hpp, cpp/include/raft/core/resource/cuda_stream_pool.hpp, cpp/include/raft/core/resource/stream_view.hpp, cpp/include/raft/core/resource/cublas_handle.hpp, cpp/include/raft/core/resource/cublaslt_handle.hpp, cpp/include/raft/core/resource/cusolver_dn_handle.hpp, cpp/include/raft/core/resource/cusolver_sp_handle.hpp, cpp/include/raft/core/resource/cusparse_handle.hpp, cpp/include/raft/core/resource/device_memory_resource.hpp, cpp/include/raft/core/resource/device_id.hpp, cpp/include/raft/core/resource/device_properties.hpp, cpp/include/raft/core/resource/managed_memory_resource.hpp, cpp/include/raft/core/resource/pinned_memory_resource.hpp, cpp/include/raft/core/resource/comms.hpp, cpp/include/raft/core/resource/nccl_comm.hpp, cpp/include/raft/core/resource/sub_comms.hpp, cpp/include/raft/core/resource/thrust_policy.hpp, cpp/include/raft/core/resource/custom_resource.hpp, cpp/include/raft/core/resource/detail/stream_sync_event.hpp
All resource getter helpers replace the has_resource_factory/add_resource_factory conditional with a single ensure_default_factory call. All resource setter helpers change resources const& to resources&.
multi_gpu root_rank value semantics and factory parameterization
cpp/include/raft/core/resource/multi_gpu.hpp
root_rank_resource gains an explicit constructor taking an optional int. root_rank_resource_factory stores and propagates root_rank_ through make_resource(). get_root_rank returns int by value instead of int&. set_root_rank takes resources& and replaces the factory with a parameterized one.
memory_stats_resources and memory_tracking_resources cell adaptation
cpp/include/raft/core/memory_stats_resources.hpp, cpp/include/raft/core/memory_tracking_resources.hpp
snapshot_ member type changes from a pair_resource vector to std::vector<std::shared_ptr<resource::resource_cell>>. init() captures cells_ instead of resources_. THRUST_POLICY invalidation replaces the cell entry with a fresh resource_cell instead of overwriting factories_/resources_ with empty objects.
C++20 standard bump and new resources behavior tests
cpp/CMakeLists.txt, cpp/internal/CMakeLists.txt, cpp/tests/core/handle.cpp
CMakeLists for raft and raft_internal bump required C++/CUDA standards from 17 to 20. New gtests cover lazy-init propagation, non-const requirement for add_resource_factory, explicit-set isolation between copies, and move construction/assignment preserving resource factories.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 47.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and matches the main change: making raft::resources behavior more predictable.
Description check ✅ Passed The description clearly matches the changeset and explains the predictable resources semantics and C++20 impact.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/include/raft/core/resource/device_memory_resource.hpp (1)

255-323: 📐 Maintainability & Code Quality | 🟠 Major | 🏗️ Heavy lift

HIGH: Public workspace setter signatures now hard-break const-callers.

Changing set_workspace_resource, set_workspace_to_pool_resource, set_workspace_to_global_resource, and set_large_workspace_resource from resources const& to resources& is a source-breaking change in a public header, and I don't see deprecated bridge overloads or migration shims here. Downstream code that forwards a const handle into these helpers will stop compiling immediately.

As per path instructions, "public C++ APIs should be stateless and lightweight wrappers" and "breaking C++ API changes require deprecation over releases when possible".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/core/resource/device_memory_resource.hpp` around lines 255 -
323, The public workspace setter APIs now require a mutable resources handle,
which breaks existing const-callers. Restore compatibility in the workspace
helper functions set_workspace_resource, set_workspace_to_pool_resource,
set_workspace_to_global_resource, and set_large_workspace_resource by adding
const-friendly overloads or a non-breaking bridge that forwards from resources
const& without changing existing call sites, while keeping the new behavior
internally where mutation is needed.

Source: Path instructions

cpp/include/raft/core/resource/sub_comms.hpp (1)

54-67: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

HIGH: set_subcomm mutates shared state through a const handle.

Line 59 keeps resources const&, but the function inserts into a shared unordered_map with no synchronization. Under the new shared-cell copy semantics that mutation propagates across copies and can race with get_subcomm or another set_subcomm, which is at odds with the documented "different instances of raft::resources are safe" expectation.

As per path instructions, "different instances of raft::resources are safe", and the PR objective says copies "can be accessed and modified independently without propagating changes."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/core/resource/sub_comms.hpp` around lines 54 - 67,
`set_subcomm` is mutating shared `resources` state through a const handle, which
can propagate across copies and race with `get_subcomm` or another
`set_subcomm`. Update the `set_subcomm` path in `sub_comms.hpp` so
sub-communicator updates are isolated per `resources` instance rather than
writing directly through shared state; use the existing
`resources`/`sub_comms_resource` APIs to make the mutation copy-local and add
synchronization or ownership separation as needed. Keep the fix centered around
`set_subcomm`, `get_subcomm`, and `sub_comms_resource_factory` so the “different
instances are safe” and independent-copy semantics are preserved.

Source: Path instructions

🧹 Nitpick comments (2)
cpp/tests/core/handle.cpp (1)

354-360: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Strengthen the propagation assertion (optional).

stream_a == stream_b is trivially satisfied because both resolve to the default cuda_stream_per_thread, so it doesn't actually prove the concrete resource is shared via the cell. Consider explicitly setting a non-default stream on a before the lazy-init path, or rely solely on has_resource_factory(b) as the propagation signal. Not a defect — the has_resource_factory check on Line 358 already validates the core behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/core/handle.cpp` around lines 354 - 360, The propagation test in
the resource lazy-init path is too weak because `resource::get_cuda_stream(a)`
and `resource::get_cuda_stream(b)` can both return the default
`cuda_stream_per_thread`, so the equality check in the `handle.cpp` test does
not prove shared state. Update the test around `resource::get_cuda_stream` and
`has_resource_factory` by either setting a non-default CUDA stream on `a` before
lazy initialization and then asserting `b` sees the same concrete stream, or
remove the trivial `ASSERT_EQ(stream_a, stream_b)` and rely on the
`has_resource_factory(resource::resource_type::CUDA_STREAM_VIEW)` assertion as
the propagation check.
cpp/include/raft/core/resource/cublaslt_handle.hpp (1)

49-50: 🚀 Performance & Scalability | 🔵 Trivial

Avoid allocating the default factory on every get_cublaslt_handle() call.
ensure_default_factory() still receives a freshly constructed cublaslt_resource_factory each time, so repeated lookups pay an unnecessary heap allocation before the CAS. Have you considered a lazy-builder overload or an early has_resource_factory() check?

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/core/resource/cublaslt_handle.hpp` around lines 49 - 50, The
get_cublaslt_handle() path is still creating a new cublaslt_resource_factory on
every call before ensuring the default factory. Update the cublaslt_handle.hpp
logic around get_cublaslt_handle() and ensure_default_factory() so the default
factory is only constructed lazily when no factory is already present, using an
early has_resource_factory() check or a lazy-builder overload. Keep the resource
lookup behavior the same, but avoid the repeated heap allocation before the CAS.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/include/raft/core/resource/comms.hpp`:
- Line 61: The public resource setter API was changed from const to non-const,
which is a breaking change for existing callers using const resources handles.
Update the relevant public-facing documentation and migration/upgrade notes to
explicitly call out the `set_comms` signature change in `resources` and the
similar setter changes in this PR, and include guidance for downstream users to
update their call sites accordingly. Use the `set_comms` function and any
related setters as the symbols to reference in the release notes so the API
change is easy to find.

In `@cpp/include/raft/core/resource/cublas_handle.hpp`:
- Around line 62-64: The cublas handle access path in cublas_handle.hpp is
mutating a shared CUBLAS_HANDLE cell by rebinding its stream on every call,
which breaks safety when copied resources diverge via CUDA_STREAM_VIEW. Update
the cublas handle setup/access logic around cublas_resource_factory and
get_cublas_handle so each raft::resources instance keeps stream ordering
isolated, rather than reusing and retargeting the same handle across copies.
Ensure the handle’s stream is bound per-resource or per-stream-view in a way
that does not affect sibling copies and preserves correct ordering with
get_cuda_stream().

In `@cpp/include/raft/core/resource/cusolver_dn_handle.hpp`:
- Around line 68-69: get_cusolver_dn_handle currently returns a cached
cusolverDnHandle_t without reattaching it to the current stream, so after
set_cuda_stream() the handle may still target an old stream. Update
get_cusolver_dn_handle to call cusolverDnSetStream(...) on the retrieved handle
using get_cuda_stream(res) before returning it, and apply the same rebinding
pattern to get_cusparse_handle and get_cusolver_sp_handle so all stream-bound
solver handles stay aligned with the active raft::resources stream.

In `@cpp/include/raft/core/resource/multi_gpu.hpp`:
- Around line 55-58: The public root-rank API change in get_root_rank and
set_root_rank needs a temporary deprecated compatibility bridge to avoid
breaking downstream callers. Add a deprecated shim in the multi_gpu resource API
that preserves the previous mutable-reference/caller behavior while forwarding
to the new implementation, and make sure set_root_rank still accepts existing
const resources usage through a migration wrapper. Include Doxygen on any new
public API in resources/multi_gpu.hpp and mark the old entry points deprecated
with a clear migration path before removing them.

In `@cpp/include/raft/core/resource/nccl_comm.hpp`:
- Around line 78-80: `get_resource` in `nccl_comm.hpp` is using
`has_resource_factory(NCCL_COMM)` as an initialization gate, which races under
the shared-cell model and can return a partially initialized communicator vector
or run `_init_nccl_comms(res)` twice. Make `get_resource` (and the NCCL_COMM
path in `ensure_default_factory`/`nccl_comm_resource_factory`) rely on a single
atomic initialization mechanism so that only one caller publishes the factory
and performs `_init_nccl_comms(res)`, while concurrent callers on the same const
handle safely observe the fully initialized state.

In `@cpp/include/raft/core/resources.hpp`:
- Line 81: The public resource update API in resources.hpp has a breaking
constness change, so add a compatibility/deprecation path instead of only
exposing the new mutable form. Keep add_resource_factory and the related set_*
wrappers locatable as the affected symbols, and provide transitional overloads
or forwarding methods that preserve the old public surface where possible while
steering callers to the new API. For the public header, add Doxygen for any new
or transitional APIs and mark the breaking replacements with deprecation
warnings so downstream users get a clear migration path.
- Around line 100-104: `ensure_default_factory` currently indexes `cells_` with
`rtype` without validating the resource type, so a factory returning `LAST_KEY`
can access past the vector. Update `ensure_default_factory` in `resources.hpp`
to mirror the `add_resource_factory` guard for
`resource::resource_type::LAST_KEY` before touching `cells_`, and only perform
the `cells_[rtype]->factory.compare_exchange_strong(...)` path for valid
resource types.

---

Outside diff comments:
In `@cpp/include/raft/core/resource/device_memory_resource.hpp`:
- Around line 255-323: The public workspace setter APIs now require a mutable
resources handle, which breaks existing const-callers. Restore compatibility in
the workspace helper functions set_workspace_resource,
set_workspace_to_pool_resource, set_workspace_to_global_resource, and
set_large_workspace_resource by adding const-friendly overloads or a
non-breaking bridge that forwards from resources const& without changing
existing call sites, while keeping the new behavior internally where mutation is
needed.

In `@cpp/include/raft/core/resource/sub_comms.hpp`:
- Around line 54-67: `set_subcomm` is mutating shared `resources` state through
a const handle, which can propagate across copies and race with `get_subcomm` or
another `set_subcomm`. Update the `set_subcomm` path in `sub_comms.hpp` so
sub-communicator updates are isolated per `resources` instance rather than
writing directly through shared state; use the existing
`resources`/`sub_comms_resource` APIs to make the mutation copy-local and add
synchronization or ownership separation as needed. Keep the fix centered around
`set_subcomm`, `get_subcomm`, and `sub_comms_resource_factory` so the “different
instances are safe” and independent-copy semantics are preserved.

---

Nitpick comments:
In `@cpp/include/raft/core/resource/cublaslt_handle.hpp`:
- Around line 49-50: The get_cublaslt_handle() path is still creating a new
cublaslt_resource_factory on every call before ensuring the default factory.
Update the cublaslt_handle.hpp logic around get_cublaslt_handle() and
ensure_default_factory() so the default factory is only constructed lazily when
no factory is already present, using an early has_resource_factory() check or a
lazy-builder overload. Keep the resource lookup behavior the same, but avoid the
repeated heap allocation before the CAS.

In `@cpp/tests/core/handle.cpp`:
- Around line 354-360: The propagation test in the resource lazy-init path is
too weak because `resource::get_cuda_stream(a)` and
`resource::get_cuda_stream(b)` can both return the default
`cuda_stream_per_thread`, so the equality check in the `handle.cpp` test does
not prove shared state. Update the test around `resource::get_cuda_stream` and
`has_resource_factory` by either setting a non-default CUDA stream on `a` before
lazy initialization and then asserting `b` sees the same concrete stream, or
remove the trivial `ASSERT_EQ(stream_a, stream_b)` and rely on the
`has_resource_factory(resource::resource_type::CUDA_STREAM_VIEW)` assertion as
the propagation check.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 38c5e204-a599-42f7-a81d-cdc0784e4e52

📥 Commits

Reviewing files that changed from the base of the PR and between 7c7f3ec and 473bdc7.

📒 Files selected for processing (29)
  • cpp/CMakeLists.txt
  • cpp/include/raft/core/device_resources.hpp
  • cpp/include/raft/core/handle.hpp
  • cpp/include/raft/core/memory_stats_resources.hpp
  • cpp/include/raft/core/memory_tracking_resources.hpp
  • cpp/include/raft/core/resource/comms.hpp
  • cpp/include/raft/core/resource/cublas_handle.hpp
  • cpp/include/raft/core/resource/cublaslt_handle.hpp
  • cpp/include/raft/core/resource/cuda_stream.hpp
  • cpp/include/raft/core/resource/cuda_stream_pool.hpp
  • cpp/include/raft/core/resource/cusolver_dn_handle.hpp
  • cpp/include/raft/core/resource/cusolver_sp_handle.hpp
  • cpp/include/raft/core/resource/cusparse_handle.hpp
  • cpp/include/raft/core/resource/custom_resource.hpp
  • cpp/include/raft/core/resource/detail/stream_sync_event.hpp
  • cpp/include/raft/core/resource/device_id.hpp
  • cpp/include/raft/core/resource/device_memory_resource.hpp
  • cpp/include/raft/core/resource/device_properties.hpp
  • cpp/include/raft/core/resource/managed_memory_resource.hpp
  • cpp/include/raft/core/resource/multi_gpu.hpp
  • cpp/include/raft/core/resource/nccl_comm.hpp
  • cpp/include/raft/core/resource/pinned_memory_resource.hpp
  • cpp/include/raft/core/resource/resource_types.hpp
  • cpp/include/raft/core/resource/stream_view.hpp
  • cpp/include/raft/core/resource/sub_comms.hpp
  • cpp/include/raft/core/resource/thrust_policy.hpp
  • cpp/include/raft/core/resources.hpp
  • cpp/internal/CMakeLists.txt
  • cpp/tests/core/handle.cpp

Comment thread cpp/include/raft/core/resource/comms.hpp
Comment on lines +62 to 64
res.ensure_default_factory(std::make_shared<cublas_resource_factory>(get_cuda_stream(res)));
auto ret = *res.get_resource<cublasHandle_t>(resource_type::CUBLAS_HANDLE);
RAFT_CUBLAS_TRY(cublasSetStream(ret, get_cuda_stream(res)));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

HIGH: Shared cuBLAS handle is rebound across divergent stream copies.

After a resources copy changes its CUDA_STREAM_VIEW, both copies can still share the same CUBLAS_HANDLE cell, and Line 64 mutates that shared handle’s stream on every access. Interleaved use of the two handles can therefore enqueue BLAS work on the wrong stream or introduce false ordering between otherwise independent copies.

As per path instructions, "different instances of raft::resources are safe" and "ensure work enqueued on raft::resources’ internal streams is correctly ordered w.r.t. get_cuda_stream()".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/core/resource/cublas_handle.hpp` around lines 62 - 64, The
cublas handle access path in cublas_handle.hpp is mutating a shared
CUBLAS_HANDLE cell by rebinding its stream on every call, which breaks safety
when copied resources diverge via CUDA_STREAM_VIEW. Update the cublas handle
setup/access logic around cublas_resource_factory and get_cublas_handle so each
raft::resources instance keeps stream ordering isolated, rather than reusing and
retargeting the same handle across copies. Ensure the handle’s stream is bound
per-resource or per-stream-view in a way that does not affect sibling copies and
preserves correct ordering with get_cuda_stream().

Source: Path instructions

Comment on lines +68 to 69
res.ensure_default_factory(std::make_shared<cusolver_dn_resource_factory>(get_cuda_stream(res)));
return *res.get_resource<cusolverDnHandle_t>(resource_type::CUSOLVER_DN_HANDLE);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n -C3 'get_cusolver|get_cusparse|SetStream|get_cuda_stream\(res\)' cpp/include/raft/core/resource

Repository: rapidsai/raft

Length of output: 9241


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect resource ownership/copy semantics and the sibling handle getters.
ast-grep outline cpp/include/raft/core/resource/resource.hpp --view expanded || true
ast-grep outline cpp/include/raft/core/resource/cublas_handle.hpp --view expanded || true
ast-grep outline cpp/include/raft/core/resource/cusparse_handle.hpp --view expanded || true
ast-grep outline cpp/include/raft/core/resource/cusolver_sp_handle.hpp --view expanded || true
ast-grep outline cpp/include/raft/core/resource/cusolver_dn_handle.hpp --view expanded || true

echo '--- resource/copy semantics search ---'
rg -n -C3 'shared_ptr|copy|copy-construct|clone|resource_cell|ensure_default_factory|get_resource<|set_cuda_stream|CUDA_STREAM_VIEW' cpp/include/raft/core/resource cpp/include/raft/core | sed -n '1,220p'

Repository: rapidsai/raft

Length of output: 18111


🏁 Script executed:

#!/bin/bash
set -euo pipefail

ast-grep outline cpp/include/raft/core/resource/cuda_stream.hpp --view expanded || true
sed -n '1,220p' cpp/include/raft/core/resource/cuda_stream.hpp

Repository: rapidsai/raft

Length of output: 2874


HIGH: Rebind stream-bound solver handles on access

set_cuda_stream() can replace the stream resource after this handle has already been cached, so get_cusolver_dn_handle() may keep enqueuing work on the old stream. get_cusparse_handle() and get_cusolver_sp_handle() have the same pattern; cusolverDnSetStream(...)/cusparseSetStream(...)/cusolverSpSetStream(...) on each access would keep the cached handle aligned with the current raft::resources stream.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/core/resource/cusolver_dn_handle.hpp` around lines 68 - 69,
get_cusolver_dn_handle currently returns a cached cusolverDnHandle_t without
reattaching it to the current stream, so after set_cuda_stream() the handle may
still target an old stream. Update get_cusolver_dn_handle to call
cusolverDnSetStream(...) on the retrieved handle using get_cuda_stream(res)
before returning it, and apply the same rebinding pattern to get_cusparse_handle
and get_cusolver_sp_handle so all stream-bound solver handles stay aligned with
the active raft::resources stream.

Source: Path instructions

Comment on lines +55 to 58
inline int get_root_rank(resources const& res)
{
if (!res.has_resource_factory(resource_type::ROOT_RANK)) {
res.add_resource_factory(std::make_shared<root_rank_resource_factory>());
}
res.ensure_default_factory(std::make_shared<root_rank_resource_factory>());
return *res.get_resource<int>(resource_type::ROOT_RANK);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

HIGH: Add a deprecation bridge for the public root-rank API change.

Issue: get_root_rank no longer exposes the previous mutable reference path, and set_root_rank now rejects const resources& callers.

Why: This public header change can break downstream code at compile time; have you considered a temporary deprecated compatibility shim or explicit migration bridge before removal? As per coding guidelines, “API changes require deprecation warnings.” As per path instructions, “For public headers under cpp/include/raft, also require Doxygen on new APIs and deprecation warnings on breaking changes.”

Also applies to: 123-125

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/core/resource/multi_gpu.hpp` around lines 55 - 58, The
public root-rank API change in get_root_rank and set_root_rank needs a temporary
deprecated compatibility bridge to avoid breaking downstream callers. Add a
deprecated shim in the multi_gpu resource API that preserves the previous
mutable-reference/caller behavior while forwarding to the new implementation,
and make sure set_root_rank still accepts existing const resources usage through
a migration wrapper. Include Doxygen on any new public API in
resources/multi_gpu.hpp and mark the old entry points deprecated with a clear
migration path before removing them.

Sources: Coding guidelines, Path instructions

Comment on lines 78 to 80
if (!res.has_resource_factory(resource_type::NCCL_COMM)) {
res.add_resource_factory(std::make_shared<nccl_comm_resource_factory>());
res.ensure_default_factory(std::make_shared<nccl_comm_resource_factory>());
_init_nccl_comms(res);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🔴 Critical | 🏗️ Heavy lift

CRITICAL: Factory publication races NCCL communicator initialization.

has_resource_factory(NCCL_COMM) is no longer a safe "initialized" check under the shared-cell model. Once one caller stores the factory, another caller can skip _init_nccl_comms(res) and return an empty/partially initialized vector, and two callers that both observe "no factory" can both run _init_nccl_comms against the same resource.

As per path instructions, "Concurrent get_resource calls on the same const handle are safe" and "comment if concurrent get/set semantics can lead to wrong initialization, UAF, or data races."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/core/resource/nccl_comm.hpp` around lines 78 - 80,
`get_resource` in `nccl_comm.hpp` is using `has_resource_factory(NCCL_COMM)` as
an initialization gate, which races under the shared-cell model and can return a
partially initialized communicator vector or run `_init_nccl_comms(res)` twice.
Make `get_resource` (and the NCCL_COMM path in
`ensure_default_factory`/`nccl_comm_resource_factory`) rely on a single atomic
initialization mechanism so that only one caller publishes the factory and
performs `_init_nccl_comms(res)`, while concurrent callers on the same const
handle safely observe the fully initialized state.

Source: Path instructions

Comment thread cpp/include/raft/core/resources.hpp
Comment on lines +100 to +104
void ensure_default_factory(std::shared_ptr<resource::resource_factory> factory) const
{
resource::resource_type rtype = factory->get_resource_type();
std::shared_ptr<resource::resource_factory> expected{};
cells_[rtype]->factory.compare_exchange_strong(expected, std::move(factory));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win

CRITICAL: Guard ensure_default_factory before indexing cells_.

add_resource_factory rejects LAST_KEY, but ensure_default_factory indexes cells_[rtype] without the same check. A factory returning LAST_KEY indexes one past the vector and can crash or corrupt memory.

Suggested fix
 void ensure_default_factory(std::shared_ptr<resource::resource_factory> factory) const
 {
+  RAFT_EXPECTS(factory != nullptr, "Resource factory must not be null.");
   resource::resource_type rtype = factory->get_resource_type();
+  RAFT_EXPECTS(rtype != resource::resource_type::LAST_KEY,
+               "LAST_KEY is a placeholder and not a valid resource factory type.");
   std::shared_ptr<resource::resource_factory> expected{};
   cells_[rtype]->factory.compare_exchange_strong(expected, std::move(factory));
 }

As per coding guidelines, invalid memory access is a CRITICAL issue.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
void ensure_default_factory(std::shared_ptr<resource::resource_factory> factory) const
{
resource::resource_type rtype = factory->get_resource_type();
std::shared_ptr<resource::resource_factory> expected{};
cells_[rtype]->factory.compare_exchange_strong(expected, std::move(factory));
void ensure_default_factory(std::shared_ptr<resource::resource_factory> factory) const
{
RAFT_EXPECTS(factory != nullptr, "Resource factory must not be null.");
resource::resource_type rtype = factory->get_resource_type();
RAFT_EXPECTS(rtype != resource::resource_type::LAST_KEY,
"LAST_KEY is a placeholder and not a valid resource factory type.");
std::shared_ptr<resource::resource_factory> expected{};
cells_[rtype]->factory.compare_exchange_strong(expected, std::move(factory));
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/raft/core/resources.hpp` around lines 100 - 104,
`ensure_default_factory` currently indexes `cells_` with `rtype` without
validating the resource type, so a factory returning `LAST_KEY` can access past
the vector. Update `ensure_default_factory` in `resources.hpp` to mirror the
`add_resource_factory` guard for `resource::resource_type::LAST_KEY` before
touching `cells_`, and only perform the
`cells_[rtype]->factory.compare_exchange_strong(...)` path for valid resource
types.

Source: Coding guidelines

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking Breaking change improvement Improvement / enhancement to an existing function

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant