[Bug]: nvidia-operator-validator fails on nodes with nvidia.com/gpu.deploy.device-plugin=false #2550

@claudiolor

Describe the bug

The GPU Operator supports disabling the device plugin on specific nodes via the nvidia.com/gpu.deploy.device-plugin=false label. This is useful, for example, when some nodes run the DRA driver while the operator continues to manage the device plugin on all other nodes.

The device plugin is correctly not deployed on the labeled nodes, but the nvidia-operator-validator pod fails there in its plugin-validation init container. The plugin validator expects GPU resources to be advertised by the node and, failing to find them, exits with an error until the pod enters CrashLoopBackOff.

$ kubectl get nodes -l nvidia.com/gpu.deploy.device-plugin=false
NAME          STATUS   ROLES    AGE   VERSION
collarquill   Ready    <none>   47m   v1.34.2+k0s

NAME                                                          READY   STATUS      RESTARTS        AGE   IP            NODE          NOMINATED NODE   READINESS GATES
nvidia-operator-validator-rv4qn                               0/1     Init:3/4    9 (6m52s ago)   45m   10.244.0.25   collarquill   <none>           <none>

Container logs:

$ kubectl logs -n gpu-operator nvidia-operator-validator-rv4qn -c plugin-validation
time="2026-06-15T12:43:28Z" level=info msg="version: 84601875-amd64, commit: 8460187"
time="2026-06-15T12:43:29Z" level=info msg="GPU resources are not yet discovered by the node, retry: 1"
time="2026-06-15T12:43:34Z" level=info msg="GPU resources are not yet discovered by the node, retry: 2"
time="2026-06-15T12:43:39Z" level=info msg="GPU resources are not yet discovered by the node, retry: 3"
time="2026-06-15T12:43:44Z" level=info msg="GPU resources are not yet discovered by the node, retry: 4"
time="2026-06-15T12:43:49Z" level=info msg="GPU resources are not yet discovered by the node, retry: 5"

Container status:

  plugin-validation:
    Container ID:  containerd://a0b48c5068bd7bf8e81c6405d2aa6e2f3ce9b66a2a19d7632f0f14b535e0ff40
    Image:         nvcr.io/nvidia/gpu-operator:v25.10.1
    Image ID:      nvcr.io/nvidia/gpu-operator@sha256:634471cdfedcc3bd6b4412a905a9fbc9a9bf91df7f436aa00454b088d087c60a
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 15 Jun 2026 14:40:05 +0200
      Finished:     Mon, 15 Jun 2026 14:42:37 +0200

To Reproduce

  • Label a node with nvidia.com/gpu.deploy.device-plugin=false
  • Install the NVIDIA GPU Operator
  • Observe the nvidia-operator-validator pod failing on the labeled node

Expected behavior

The plugin validator should check whether the nvidia.com/gpu.deploy.device-plugin label is explicitly set to false and, in that case, skip the GPU resource validation without writing the status file.

Proposed fix

  • In the validateGPUResource() function, just after the node is retrieved, check whether the node has the nvidia.com/gpu.deploy.device-plugin label explicitly set to false:
    if node.Labels["nvidia.com/gpu.deploy.device-plugin"] == "false" {
    	return pluginDisabledError
    }
    
  • The validate() function of the Plugin struct can check whether a pluginDisabledError has been returned and in that case return nil, without writing the status file:
    err = p.validateGPUResource()
    if err != nil {
    	if errors.Is(err, pluginDisabledError) {
    		log.Info("Device plugin is disabled, skipping GPU resource validation")
    		return nil
    	}
    	return err
    }
    ...
    err = createStatusFile(outputDirFlag + "/" + pluginStatusFile)
    ... 
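
The two changes above can be tried out in isolation with the minimal sketch below. It is not the validator's actual code: the node is reduced to a label map plus a GPU count, validatePlugin is a hypothetical stand-in for the validate() method of the Plugin struct, and noGPUResourcesError is an invented placeholder for whatever error the real function returns after its retries. Only the label key and the pluginDisabledError name come from this issue.

```go
package main

import (
	"errors"
	"fmt"
)

// deployLabel is the node label discussed in this issue.
const deployLabel = "nvidia.com/gpu.deploy.device-plugin"

// pluginDisabledError is the sentinel proposed in this issue.
// noGPUResourcesError is a hypothetical placeholder for the existing failure.
var (
	pluginDisabledError = errors.New("device plugin disabled by node label")
	noGPUResourcesError = errors.New("GPU resources not discovered by the node")
)

// validateGPUResource is a simplified stand-in: it receives the node's labels
// and the number of GPU resources the node advertises, instead of querying
// the Kubernetes API.
func validateGPUResource(labels map[string]string, gpuCount int) error {
	// Only an explicit "false" disables the check; a missing label does not.
	if labels[deployLabel] == "false" {
		return pluginDisabledError
	}
	if gpuCount == 0 {
		return noGPUResourcesError
	}
	return nil
}

// validatePlugin mirrors the proposed validate() flow. The boolean reports
// whether the status file would be written.
func validatePlugin(labels map[string]string, gpuCount int) (bool, error) {
	if err := validateGPUResource(labels, gpuCount); err != nil {
		if errors.Is(err, pluginDisabledError) {
			// Plugin disabled on this node: succeed, but write no status file.
			return false, nil
		}
		return false, err
	}
	return true, nil
}

func main() {
	cases := []struct {
		name   string
		labels map[string]string
		gpus   int
	}{
		{"plugin disabled", map[string]string{deployLabel: "false"}, 0},
		{"label absent, no GPUs", map[string]string{}, 0},
		{"plugin enabled, GPUs present", map[string]string{deployLabel: "true"}, 2},
	}
	for _, c := range cases {
		wrote, err := validatePlugin(c.labels, c.gpus)
		fmt.Printf("%s: statusFileWritten=%v err=%v\n", c.name, wrote, err)
	}
}
```

Matching on a sentinel with errors.Is keeps the skip decision in validate(), so the "no status file" behaviour stays next to the createStatusFile call rather than being hidden inside validateGPUResource().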
    

If this behavior is confirmed as unintended, I am happy to submit a PR to update the validator logic to respect this label. Please let me know if this approach aligns with the project's goals.

Labels

bug, needs-triage
