[Bug]: nvidia-operator-validator fails on nodes with nvidia.com/gpu.deploy.device-plugin=false

**Describe the bug**

The GPU Operator supports disabling the device plugin on specific nodes via the `nvidia.com/gpu.deploy.device-plugin=false` label. This is useful in scenarios such as running the DRA driver on some nodes while the operator manages the device plugin on all other nodes.

While the device plugin is correctly **not** deployed on these labeled nodes, the `nvidia-operator-validator` pod fails during the plugin-validation phase. The `plugin-validation` container expects GPU resources to be present on the node and, failing to find them after its retries, exits with an error and enters a CrashLoopBackOff state.

```
kubectl get nodes -l nvidia.com/gpu.deploy.device-plugin=false
NAME          STATUS   ROLES    AGE   VERSION
collarquill   Ready    <none>   47m   v1.34.2+k0s
```

```
NAME                                                          READY   STATUS      RESTARTS        AGE   IP            NODE          NOMINATED NODE   READINESS GATES
nvidia-operator-validator-rv4qn                               0/1     Init:3/4    9 (6m52s ago)   45m   10.244.0.25   collarquill   <none>           <none>
```

**Container logs**:
```
$ kubectl logs -n gpu-operator nvidia-operator-validator-rv4qn -c plugin-validation
time="2026-06-15T12:43:28Z" level=info msg="version: 84601875-amd64, commit: 8460187"
time="2026-06-15T12:43:29Z" level=info msg="GPU resources are not yet discovered by the node, retry: 1"
time="2026-06-15T12:43:34Z" level=info msg="GPU resources are not yet discovered by the node, retry: 2"
time="2026-06-15T12:43:39Z" level=info msg="GPU resources are not yet discovered by the node, retry: 3"
time="2026-06-15T12:43:44Z" level=info msg="GPU resources are not yet discovered by the node, retry: 4"
time="2026-06-15T12:43:49Z" level=info msg="GPU resources are not yet discovered by the node, retry: 5"
```

**Container status**:

```
  plugin-validation:
    Container ID:  containerd://a0b48c5068bd7bf8e81c6405d2aa6e2f3ce9b66a2a19d7632f0f14b535e0ff40
    Image:         nvcr.io/nvidia/gpu-operator:v25.10.1
    Image ID:      nvcr.io/nvidia/gpu-operator@sha256:634471cdfedcc3bd6b4412a905a9fbc9a9bf91df7f436aa00454b088d087c60a
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 15 Jun 2026 14:40:05 +0200
      Finished:     Mon, 15 Jun 2026 14:42:37 +0200
```

**To Reproduce**
- Label a node with `nvidia.com/gpu.deploy.device-plugin=false`
- Install the NVIDIA GPU Operator
- Observe the `nvidia-operator-validator` pod failing on the labeled node

**Expected behavior**

The plugin validator should check whether the `nvidia.com/gpu.deploy.device-plugin` label is explicitly set to `false` on the node and, in that case, skip the validation without writing the status file.

**Proposed fix**

- In the [`validateGPUResource()` function](https://github.com/NVIDIA/gpu-operator/blob/main/cmd/nvidia-validator/main.go#L1456), just after the Node retrieval, we can check whether the node has the `nvidia.com/gpu.deploy.device-plugin` label explicitly set to `false`:
  ```
  if node.Labels["nvidia.com/gpu.deploy.device-plugin"] == "false" {
  	return pluginDisabledError
  }
  ```
- The [`validate()` function of the `Plugin` struct](https://github.com/NVIDIA/gpu-operator/blob/main/cmd/nvidia-validator/main.go#L1202) can check whether `pluginDisabledError` has been returned and, in that case, return `nil` without writing the status file:
  ```
  err = p.validateGPUResource()
  if err != nil {
  	if errors.Is(err, pluginDisabledError) {
  		log.Info("Device plugin is disabled, skipping GPU resource validation")
  		return nil
  	}
  	return err
  }
  ...
  err = createStatusFile(outputDirFlag + "/" + pluginStatusFile)
  ... 
  ```
If this behavior is confirmed as unintended, I am happy to submit a PR to update the validator logic to respect this label. Please let me know if this approach aligns with the project's goals.

