Fixups for NVIDIA fabric support by artulab · Pull Request #534 · ROCm/iris

artulab · 2026-06-11T23:08:18Z

Motivation

Fix follow-up issues needed for NVIDIA fabric support, including runtime setup, topology handling, and benchmark/example compatibility.

Technical Details

This PR adds NVIDIA CUDA address-range/export hooks, improves VMM chunked external tensor import behavior, supports externally launched multinode torchrun in iris.bench, and updates all-gather GEMM benchmark/example paths for NVIDIA environments.

Test Plan

Ran the benchmark and the example Iris program on the clusters.

Test Result

PASS

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR addresses follow-up fixes for NVIDIA fabric support across Iris topology discovery, memory import paths, and benchmark/example launch behavior.

Changes:

- Add NVIDIA CUDA driver hooks for exporting arbitrary VMM-backed pointer handles and querying allocation address ranges.
- Improve topology discovery on NVIDIA by probing fabric connectivity via CUDA fabric handle import/export when NVML fabric UUIDs are missing.
- Update benchmarks/examples and AG+MM auto-config to work under externally launched multi-node torchrun and NVIDIA environments.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| iris/ops/all_gather_matmul_hbm_buffer.py | Makes a ROCm-specific launch kwarg conditional to avoid applying it on NVIDIA. |
| iris/host/memory/allocators/vmem_chunked_allocator.py | Adds fallback behavior for external tensor import on single-rank local CUDA jobs and improves DriverNotSupported handling. |
| iris/host/distributed/topology.py | Adds an NVIDIA fabric connectivity probe and uses it to synthesize fabric domains when NVML fabric info is missing. |
| iris/drivers/local/nvidia.py | Adds CUDA driver calls to export retained allocation handles and query pointer address ranges, plus context management. |
| iris/bench/_runner.py | Treats an already-launched `torchrun` job as the benchmark rank instead of spawning a nested elastic job. |
| examples/14_all_gather_gemm/example_run_pull.py | Supports `torchrun`-style environment launches, local-rank device selection, and optional topology printing. |
| benchmark/ops/all_gather_matmul/auto_config.py | Adds NVIDIA arch detection and heuristic config fallback for NVIDIA (and currently for any non-AMD arch). |

+    if arch not in ("mi300x", "mi355x"):
+        heuristic_config, heuristic_hbm = _apply_heuristic(M, N, K, arch=arch)
+        return AutoConfigResult(
+            enabled=True,
+            config_params=heuristic_config,
+            hbm_buffer_params=heuristic_hbm,
+            source=f"Heuristic fallback for {arch} (no tuned configs available)",
+        )
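
To make the reach of this guard concrete, here is a minimal sketch of the dispatch it implies. Only the two arch names and the `source` string come from the hunk above. The `AutoConfigResult` stand-in, the body of `_apply_heuristic`, the `auto_config` wrapper, and the parameter names `block_m`, `block_n`, and `k_per_flag` are hypothetical placeholders, not the PR's code.

```python
from dataclasses import dataclass

# Arch names copied from the hunk above; everything else is a stand-in.
TUNED_ARCHS = ("mi300x", "mi355x")


@dataclass
class AutoConfigResult:
    """Hypothetical stand-in for the result type used in the hunk."""
    enabled: bool
    config_params: dict
    hbm_buffer_params: dict
    source: str


def _apply_heuristic(M, N, K, arch):
    """Placeholder heuristic; the real one is not shown on this page."""
    block = 128 if min(M, N) >= 128 else 64
    return {"block_m": block, "block_n": block}, {"k_per_flag": 1}


def auto_config(M, N, K, arch):
    """Return a heuristic result for every arch without tuned configs."""
    if arch not in TUNED_ARCHS:
        heuristic_config, heuristic_hbm = _apply_heuristic(M, N, K, arch=arch)
        return AutoConfigResult(
            enabled=True,
            config_params=heuristic_config,
            hbm_buffer_params=heuristic_hbm,
            source=f"Heuristic fallback for {arch} (no tuned configs available)",
        )
    # Assumed tuned path; the real lookup is outside the hunk.
    return AutoConfigResult(True, {}, {}, f"Tuned config for {arch}")
```

Under these assumptions, any arch string outside the two tuned names takes the fallback, whether it names an NVIDIA part or not, which matches the "any non-AMD arch" remark in the file summary.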


+    local_row = [False] * world_size
+    for record in records:
+        if not record or not record.get("ok"):
+            continue
+        peer_rank = int(record["rank"])
+        if peer_rank == rank:
+            local_row[peer_rank] = True
+            continue
+        if driver is None:
+            continue
+        try:
+            mapping = driver.import_and_map(peer_rank, record["handle"], int(record["size"]))
+            imported_mappings.append(mapping)
+            local_row[peer_rank] = True
+        except Exception as exc:
+            logger.debug("[Rank %d] CUDA fabric handle import from rank %d failed: %s", rank, peer_rank, exc)
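
The hunk above fills one row of a reachability matrix per rank. One way such rows could be turned into fabric domains is sketched below, assuming every rank's `local_row` has been gathered into a square boolean matrix. The function name `synthesize_domains`, the union-find grouping, and the requirement that imports succeed in both directions are assumptions for illustration; the PR's actual grouping logic is not shown on this page.

```python
from typing import List


def synthesize_domains(matrix: List[List[bool]]) -> List[List[int]]:
    """Group ranks into domains using union-find over mutually reachable pairs.

    matrix[i][j] is True when rank i successfully imported rank j's handle.
    """
    n = len(matrix)
    parent = list(range(n))

    def find(x: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            # Assumption: only merge when the import worked in both directions.
            if matrix[i][j] and matrix[j][i]:
                parent[find(j)] = find(i)

    domains = {}
    for rank in range(n):
        domains.setdefault(find(rank), []).append(rank)
    return sorted(domains.values())
```

For example, four ranks where only the pairs (0, 1) and (2, 3) can import each other's handles would form two domains, and a one-directional success would leave the two ranks in separate domains.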


+    # If launched by torchrun/srun for a multi-node job, do not spawn another
+    # local elastic job.  The current process is already one benchmark rank.
+    if all(key in os.environ for key in ("RANK", "LOCAL_RANK", "WORLD_SIZE")):
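
A minimal sketch of the same detection as a standalone helper follows. The three variable names are the ones checked in the hunk and are set by `torchrun` for each worker. The helper name `external_launch_info`, the `RankInfo` tuple, and the optional `env` argument are hypothetical and exist only to make the check testable outside a real launch.

```python
import os
from typing import Mapping, NamedTuple, Optional

# Environment variables checked in the hunk above.
REQUIRED_KEYS = ("RANK", "LOCAL_RANK", "WORLD_SIZE")


class RankInfo(NamedTuple):
    rank: int
    local_rank: int
    world_size: int


def external_launch_info(env: Optional[Mapping[str, str]] = None) -> Optional[RankInfo]:
    """Return rank info if the process was launched externally, else None."""
    env = os.environ if env is None else env
    if not all(key in env for key in REQUIRED_KEYS):
        return None
    return RankInfo(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
    )
```

A caller could use a `None` result to decide to spawn workers itself and a `RankInfo` result to run directly as that rank, which is the behavior the comment in the hunk describes.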


+    def _can_return_external_tensor_alias(self) -> bool:
+        return self.num_ranks == 1 and self.driver.__class__.__name__ == "LocalCudaDriver"
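
To show what this predicate accepts, the sketch below reproduces it against stand-in classes. `LocalCudaDriver` here is an empty placeholder with the same name as the string in the hunk; `TracingCudaDriver` and `Allocator` are hypothetical and are not part of Iris.

```python
class LocalCudaDriver:
    """Empty stand-in that only shares its name with the string in the hunk."""


class TracingCudaDriver(LocalCudaDriver):
    """Hypothetical subclass, used to show how the name comparison behaves."""


class Allocator:
    """Hypothetical holder for the two attributes the predicate reads."""

    def __init__(self, driver, num_ranks: int):
        self.driver = driver
        self.num_ranks = num_ranks

    def _can_return_external_tensor_alias(self) -> bool:
        # Same expression as in the hunk above.
        return self.num_ranks == 1 and self.driver.__class__.__name__ == "LocalCudaDriver"
```

Because the comparison is on the exact class name, a subclass or wrapper of the driver would not qualify in this sketch even on a single rank, whereas an `isinstance` check would accept it.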


+    def export_pointer_handle(self, ptr: int, size: int) -> bytes:
+        """Export the VMM allocation containing ptr as a 4-byte native-endian POSIX FD."""
+        self._check_initialized()
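
The docstring describes the return value as a 4-byte native-endian file descriptor. A minimal sketch of that encoding with the standard `struct` module follows; the helper names `pack_fd` and `unpack_fd` are hypothetical, and the CUDA export call itself is not reproduced here.

```python
import struct

# "=i" selects native byte order with the standard 4-byte int size.
_FD_FORMAT = "=i"


def pack_fd(fd: int) -> bytes:
    """Encode a file descriptor number as 4 native-endian bytes."""
    return struct.pack(_FD_FORMAT, fd)


def unpack_fd(payload: bytes) -> int:
    """Decode a payload produced by pack_fd, rejecting wrong lengths."""
    if len(payload) != struct.calcsize(_FD_FORMAT):
        raise ValueError(f"expected 4 bytes, got {len(payload)}")
    (fd,) = struct.unpack(_FD_FORMAT, payload)
    return fd
```

Note that a descriptor number is only meaningful inside the process that owns it, so a consumer in another process would need the descriptor itself transferred (for example over a Unix domain socket with `SCM_RIGHTS`) rather than just these bytes.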


Fixups for NVIDIA fabric support

eab1440

artulab requested review from mawad-amd and neoblizz as code owners June 11, 2026 23:08

Copilot AI review requested due to automatic review settings June 11, 2026 23:08

artulab requested a review from BKP as a code owner June 11, 2026 23:08

github-actions bot added the in-progress and iris labels Jun 11, 2026

Copilot AI reviewed Jun 11, 2026

mawad-amd approved these changes Jun 11, 2026

Comment thread iris/bench/_runner.py

artulab wants to merge 1 commit into main from artulab/fabric-fixups

3 participants
