Summary
The NVIDIA DRA Driver for GPUs is adding support for allocating GPUs on NVSwitch-based HGX systems via Fabric Manager partitions (kubernetes-sigs/dra-driver-nvidia-gpu#1190). This relies on nv-fabricmanager running on the host in Shared NVSwitch fabric mode (FABRIC_MODE=1), where partitions are queried and activated on demand through the FM SDK rather than activated automatically at boot.
Today, the GPU Operator's driver daemonset runs nv-fabricmanager in its default bare-metal / full-passthrough mode (FABRIC_MODE=0), which programs the whole fabric as a single domain and does not expose selectable partitions. There is currently no supported way to have the Operator-managed driver daemonset start nv-fabricmanager in the shared/partition mode that the DRA driver requires.
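To illustrate the distinction between the two modes, here is a minimal sketch of a check that reads a Fabric Manager configuration and reports whether it selects the shared/partition mode. It assumes the setting appears as a `FABRIC_MODE=<n>` key/value line in a `fabricmanager.cfg`-style file with `#` comments; the function names are hypothetical and are not part of the GPU Operator, the DRA driver, or the FM SDK.

```python
from typing import Optional

# Mode descriptions taken from the summary above; only 0 and 1 are covered here.
FABRIC_MODES = {
    0: "bare-metal / full-passthrough (whole fabric as a single domain)",
    1: "shared NVSwitch (partitions activated on demand)",
}


def parse_fabric_mode(cfg_text: str) -> Optional[int]:
    """Return the FABRIC_MODE value found in the config text, or None.

    Assumes a simple KEY=VALUE format with '#' comments (an assumption
    about the file layout, not a documented parser contract).
    """
    mode = None
    for raw in cfg_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "FABRIC_MODE":
            try:
                mode = int(value.strip())
            except ValueError:
                mode = None  # unparsable value: treat as unset
    return mode


def supports_partitions(cfg_text: str) -> bool:
    """True only when the config selects shared NVSwitch mode (FABRIC_MODE=1)."""
    return parse_fabric_mode(cfg_text) == 1
```

A node-level preflight could call `supports_partitions` on the contents of the host's Fabric Manager config before advertising partition-based devices. This only inspects configuration; it does not confirm that nv-fabricmanager is actually running in that mode.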