Describe the bug
The GPU Operator supports disabling the device plugin on specific nodes via the nvidia.com/gpu.deploy.device-plugin=false label. This is useful for scenarios such as running the DRA driver on some nodes, where the operator manages the device plugin for all the other nodes.
While the device plugin is correctly not deployed on these labeled nodes, the gpu-operator-validator pod fails during the plugin-validation phase. The plugin validator expects GPU resources to be present and, failing to find them, enters a CrashLoopBackOff state.
kubectl get nodes -l nvidia.com/gpu.deploy.device-plugin=false
NAME STATUS ROLES AGE VERSION
collarquill Ready <none> 47m v1.34.2+k0s
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nvidia-operator-validator-rv4qn 0/1 Init:3/4 9 (6m52s ago) 45m 10.244.0.25 collarquill <none> <none>
Container logs:
$ kubectl logs -n gpu-operator nvidia-operator-validator-rv4qn -c plugin-validation
time="2026-06-15T12:43:28Z" level=info msg="version: 84601875-amd64, commit: 8460187"
time="2026-06-15T12:43:29Z" level=info msg="GPU resources are not yet discovered by the node, retry: 1"
time="2026-06-15T12:43:34Z" level=info msg="GPU resources are not yet discovered by the node, retry: 2"
time="2026-06-15T12:43:39Z" level=info msg="GPU resources are not yet discovered by the node, retry: 3"
time="2026-06-15T12:43:44Z" level=info msg="GPU resources are not yet discovered by the node, retry: 4"
time="2026-06-15T12:43:49Z" level=info msg="GPU resources are not yet discovered by the node, retry: 5"
Container status:
plugin-validation:
Container ID: containerd://a0b48c5068bd7bf8e81c6405d2aa6e2f3ce9b66a2a19d7632f0f14b535e0ff40
Image: nvcr.io/nvidia/gpu-operator:v25.10.1
Image ID: nvcr.io/nvidia/gpu-operator@sha256:634471cdfedcc3bd6b4412a905a9fbc9a9bf91df7f436aa00454b088d087c60a
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 15 Jun 2026 14:40:05 +0200
Finished: Mon, 15 Jun 2026 14:42:37 +0200
To Reproduce
- Label a node with nvidia.com/gpu.deploy.device-plugin=false
- Install the NVIDIA GPU Operator
- Observe the gpu-operator-validator pod failing on the affected node
Expected behavior
The plugin validator should check whether the nvidia.com/gpu.deploy.device-plugin label is explicitly set to false on the node and, in that case, skip the validation without writing the status file.
Proposed fix
- In the validateGPUResource() function, just after the Node retrieval, we can check whether the node has the nvidia.com/gpu.deploy.device-plugin label explicitly set to false:
    if node.Labels["nvidia.com/gpu.deploy.device-plugin"] == "false" {
        return pluginDisabledError
    }
- The validate() function of the Plugin struct can check whether a pluginDisabledError has been returned and, in that case, return nil without writing the status file:
    err = p.validateGPUResource()
    if err != nil {
        if errors.Is(err, pluginDisabledError) {
            log.Info("Device plugin is disabled, skipping GPU resource validation")
            return nil
        }
        return err
    }
    ...
    err = createStatusFile(outputDirFlag + "/" + pluginStatusFile)
    ...
If this behavior is confirmed as unintended, I am happy to submit a PR to update the validator logic to respect this label. Please let me know if this approach aligns with the project's goals.