
Fixups for NVIDIA fabric support #534

Open: artulab wants to merge 1 commit into main from artulab/fabric-fixups


artulab (Collaborator) commented Jun 11, 2026

Motivation

Fix follow-up issues needed for NVIDIA fabric support, including runtime setup, topology handling, and benchmark/example compatibility.

Technical Details

This PR adds NVIDIA CUDA address-range/export hooks, improves VMM chunked external tensor import behavior, supports externally launched multinode torchrun in iris.bench, and updates all-gather GEMM benchmark/example paths for NVIDIA environments.

Test Plan

Ran the benchmark and the example Iris program on the clusters.

Test Result

PASS

Submission Checklist

Copilot AI review requested due to automatic review settings June 11, 2026 23:08
@artulab requested a review from BKP as a code owner June 11, 2026 23:08
@github-actions (bot) added the labels in-progress (We are working on it) and iris (Iris project issue) Jun 11, 2026

Copilot AI (Contributor) left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR addresses follow-up fixes for NVIDIA fabric support across Iris topology discovery, memory import paths, and benchmark/example launch behavior.

Changes:

  • Add NVIDIA CUDA driver hooks for exporting arbitrary VMM-backed pointer handles and querying allocation address ranges.
  • Improve topology discovery on NVIDIA by probing fabric connectivity via CUDA fabric handle import/export when NVML fabric UUIDs are missing.
  • Update benchmarks/examples and AG+MM auto-config to work under externally launched multi-node torchrun and NVIDIA environments.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| iris/ops/all_gather_matmul_hbm_buffer.py | Makes a ROCm-specific launch kwarg conditional to avoid applying it on NVIDIA. |
| iris/host/memory/allocators/vmem_chunked_allocator.py | Adds fallback behavior for external tensor import on single-rank local CUDA jobs and improves DriverNotSupported handling. |
| iris/host/distributed/topology.py | Adds an NVIDIA fabric connectivity probe and uses it to synthesize fabric domains when NVML fabric info is missing. |
| iris/drivers/local/nvidia.py | Adds CUDA driver calls to export retained allocation handles and query pointer address ranges, plus context management. |
| iris/bench/_runner.py | Treats an already-launched torchrun job as the benchmark rank instead of spawning a nested elastic job. |
| examples/14_all_gather_gemm/example_run_pull.py | Supports torchrun-style environment launches, local-rank device selection, and optional topology printing. |
| benchmark/ops/all_gather_matmul/auto_config.py | Adds NVIDIA arch detection and heuristic config fallback for NVIDIA (and currently for any non-AMD arch). |

Comment on lines +547 to +554
if arch not in ("mi300x", "mi355x"):
    heuristic_config, heuristic_hbm = _apply_heuristic(M, N, K, arch=arch)
    return AutoConfigResult(
        enabled=True,
        config_params=heuristic_config,
        hbm_buffer_params=heuristic_hbm,
        source=f"Heuristic fallback for {arch} (no tuned configs available)",
    )
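The branch above sends every arch other than "mi300x" and "mi355x" to the heuristic path, which matches the review summary's note that the fallback currently applies to any non-AMD arch. As an illustration only, a stricter dispatch could look like the sketch below. The function `select_config_source` and the NVIDIA arch names in `HEURISTIC_ARCHES` are hypothetical and do not come from the PR.

```python
from typing import Optional

# Arches with tuned configs, taken from the excerpt above.
TUNED_ARCHES = ("mi300x", "mi355x")

# Hypothetical arch names used only for illustration; the names that
# auto_config.py actually detects are not shown in this page.
HEURISTIC_ARCHES = ("h100", "b200")


def select_config_source(arch: str) -> Optional[str]:
    """Return which config source applies to arch, or None if unrecognised."""
    if arch in TUNED_ARCHES:
        return "tuned"
    if arch in HEURISTIC_ARCHES:
        return "heuristic"
    # Unknown arch: the caller decides whether to disable auto-config or warn.
    return None
```

With an explicit allow-list, an unrecognised arch surfaces as `None` instead of silently receiving heuristic parameters.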
Comment on lines +389 to +404
local_row = [False] * world_size
for record in records:
    if not record or not record.get("ok"):
        continue
    peer_rank = int(record["rank"])
    if peer_rank == rank:
        local_row[peer_rank] = True
        continue
    if driver is None:
        continue
    try:
        mapping = driver.import_and_map(peer_rank, record["handle"], int(record["size"]))
        imported_mappings.append(mapping)
        local_row[peer_rank] = True
    except Exception as exc:
        logger.debug("[Rank %d] CUDA fabric handle import from rank %d failed: %s", rank, peer_rank, exc)
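The excerpt builds one boolean row per rank, and the file summary says the probe result is used to synthesize fabric domains. The PR's grouping code is not shown on this page, so the following is only a sketch of one plausible approach. It assumes the per-rank rows have been gathered into a `world_size` x `world_size` matrix and treats two ranks as sharing a domain when each can import the other's handle. The function name `group_fabric_domains` is hypothetical.

```python
from typing import Dict, List


def group_fabric_domains(matrix: List[List[bool]]) -> List[List[int]]:
    """Group ranks into domains using mutual connectivity (union-find)."""
    n = len(matrix)
    parent = list(range(n))

    def find(x: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            # Require both directions so a one-sided import does not merge domains.
            if matrix[i][j] and matrix[j][i]:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # The smaller root wins, so each root is its domain's lowest rank.
                    parent[max(ri, rj)] = min(ri, rj)

    domains: Dict[int, List[int]] = {}
    for r in range(n):
        domains.setdefault(find(r), []).append(r)
    return [domains[root] for root in sorted(domains)]
```

For example, if ranks 0 and 1 can import each other's handles and rank 2 can reach only itself, the result is two domains, one containing ranks 0 and 1 and one containing rank 2.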
Comment thread iris/bench/_runner.py
Comment on lines +558 to +560
# If launched by torchrun/srun for a multi-node job, do not spawn another
# local elastic job. The current process is already one benchmark rank.
if all(key in os.environ for key in ("RANK", "LOCAL_RANK", "WORLD_SIZE")):
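The environment check quoted above can be isolated into a small helper for testing. This is a minimal sketch, not the code in iris/bench/_runner.py: the names `is_externally_launched` and `read_rank_info` are hypothetical, and the optional `env` parameter exists only so the logic can be exercised without modifying `os.environ`.

```python
import os
from typing import Mapping, Optional, Tuple

# Variables that torchrun exports for every worker process.
LAUNCHER_ENV_KEYS = ("RANK", "LOCAL_RANK", "WORLD_SIZE")


def is_externally_launched(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when all launcher variables are present."""
    env = os.environ if env is None else env
    return all(key in env for key in LAUNCHER_ENV_KEYS)


def read_rank_info(env: Optional[Mapping[str, str]] = None) -> Tuple[int, int, int]:
    """Return (rank, local_rank, world_size) parsed from the environment."""
    env = os.environ if env is None else env
    if not is_externally_launched(env):
        raise RuntimeError("RANK, LOCAL_RANK and WORLD_SIZE must all be set")
    return int(env["RANK"]), int(env["LOCAL_RANK"]), int(env["WORLD_SIZE"])
```

Requiring all three variables avoids treating a partially configured shell, for example one that exports only `WORLD_SIZE`, as an already-launched job.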
Comment on lines +451 to +452
def _can_return_external_tensor_alias(self) -> bool:
    return self.num_ranks == 1 and self.driver.__class__.__name__ == "LocalCudaDriver"
Comment on lines +472 to +474
def export_pointer_handle(self, ptr: int, size: int) -> bytes:
    """Export the VMM allocation containing ptr as a 4-byte native-endian POSIX FD."""
    self._check_initialized()
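The docstring describes the handle as a 4-byte native-endian POSIX file descriptor. The sketch below shows what that byte layout means using Python's `struct` module. It is not the driver's implementation, and the helper names `encode_fd_handle` and `decode_fd_handle` are hypothetical.

```python
import struct

# "=i" selects native byte order with the standard 4-byte signed int size.
_FD_FORMAT = "=i"


def encode_fd_handle(fd: int) -> bytes:
    """Pack a file descriptor number into a 4-byte native-endian handle."""
    if fd < 0:
        raise ValueError("file descriptors are non-negative")
    return struct.pack(_FD_FORMAT, fd)


def decode_fd_handle(handle: bytes) -> int:
    """Unpack a 4-byte native-endian handle back into a file descriptor number."""
    if len(handle) != struct.calcsize(_FD_FORMAT):
        raise ValueError("expected a 4-byte handle, got %d bytes" % len(handle))
    (fd,) = struct.unpack(_FD_FORMAT, handle)
    return fd
```

A descriptor number is meaningful only inside the process that owns it, so a handle in this format still has to be transferred to a peer process through an operating-system mechanism before the peer can use it.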