Environment:
- GPU Operator Version: 25.3.0 → 25.10.0
- OS: Ubuntu 24.04
- Kernel Version: 6.8.0-generic
Description:
When upgrading from a GPU Operator deployment using a pre-compiled driver image to a deployment using dynamic driver compilation while GPU workloads are running on the node, the driver pod may enter a CrashLoopBackOff state during the upgrade.
During this process, the node label:
nvidia.com/gpu-driver-upgrade-state=upgrade-failed
is applied by the upgrade controller.
After manually triggering the upgrade controller by setting:
nvidia.com/gpu-driver-upgrade-state=upgrade-required
the upgrade state progresses through:
upgrade-required
→ pod-deletion-required
→ upgrade-failed
→ upgrade-done
Eventually, the driver pod recovers and becomes Ready.
This behavior suggests an inconsistency in the upgrade controller state machine. Specifically, the name upgrade-failed implies a terminal failure state, yet the controller subsequently transitions the same node from upgrade-failed to upgrade-done without any explicit recovery action.
Reproduction Steps:
- Deploy GPU Operator v25.3.0 using a pre-compiled driver image.
- Run one or more GPU workloads on the node.
- Upgrade GPU Operator to v25.10.0 and disable the pre-compiled driver image (switch to dynamic driver compilation).
- Observe that the driver pod enters CrashLoopBackOff, typically due to the init container failing to unload the driver while workloads are active.
- Observe the node label:
nvidia.com/gpu-driver-upgrade-state=upgrade-failed
- Manually update the node label:
nvidia.com/gpu-driver-upgrade-state=upgrade-required
- Observe the following state transitions:
upgrade-required
→ pod-deletion-required
→ upgrade-failed
→ upgrade-done
- Observe that the driver pod eventually recovers and reaches the Ready state.
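The label inspection and manual re-trigger in the steps above can be scripted. The following is a minimal sketch, assuming `kubectl` is on the PATH and already configured for the cluster. The helper names (`get_state_cmd`, `set_state_cmd`, `retrigger_upgrade`) are hypothetical and not part of the GPU Operator; only the label key and state value come from this report.

```python
import subprocess

# Label key taken from this report.
UPGRADE_STATE_LABEL = "nvidia.com/gpu-driver-upgrade-state"


def get_state_cmd(node: str) -> list[str]:
    """Build the kubectl command that prints the node's upgrade-state label."""
    # Dots inside a label key must be escaped in a kubectl JSONPath expression.
    escaped_key = UPGRADE_STATE_LABEL.replace(".", "\\.")
    jsonpath = "jsonpath={.metadata.labels." + escaped_key + "}"
    return ["kubectl", "get", "node", node, "-o", jsonpath]


def set_state_cmd(node: str, state: str) -> list[str]:
    """Build the kubectl command that overwrites the upgrade-state label."""
    return [
        "kubectl", "label", "node", node,
        f"{UPGRADE_STATE_LABEL}={state}", "--overwrite",
    ]


def run(cmd: list[str]) -> str:
    """Run a command and return its trimmed stdout; raises on a non-zero exit."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def retrigger_upgrade(node: str) -> str:
    """Reset the label to upgrade-required and return the previous state."""
    previous = run(get_state_cmd(node))
    run(set_state_cmd(node, "upgrade-required"))
    return previous
```

Calling `retrigger_upgrade("<node-name>")` on a node labelled upgrade-failed corresponds to the manual label update in the steps above. Polling `run(get_state_cmd("<node-name>"))` afterwards is one way to record the sequence of states.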
Observed Behavior:
- Driver pod enters CrashLoopBackOff during upgrade.
- Upgrade controller marks the node as upgrade-failed.
- Re-triggering the upgrade controller causes the state machine to continue processing.
- The node transitions from upgrade-failed to upgrade-done.
- Driver pods recover successfully despite the node previously being marked as failed.
Expected Behavior:
- If upgrade-failed is intended to be a terminal state, no subsequent transition to upgrade-done should occur without an explicit recovery workflow.
- Alternatively, the state machine should support a clearly defined recovery path from upgrade-failed to a successful completion state.
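To make the first expectation concrete, here is a small sketch that checks a recorded sequence of label values for transitions out of a terminal state. It is not the controller's actual logic: treating upgrade-failed as the only terminal state is an assumption based on the state's name, and `find_terminal_violations` is a hypothetical helper.

```python
# Assumption: upgrade-failed is terminal. The real controller may define
# other states and transitions that are not listed in this report.
TERMINAL_STATES = {"upgrade-failed"}


def find_terminal_violations(observed: list[str]) -> list[tuple[str, str]]:
    """Return (from, to) pairs where a transition leaves a terminal state."""
    violations = []
    for prev, nxt in zip(observed, observed[1:]):
        if prev in TERMINAL_STATES and nxt != prev:
            violations.append((prev, nxt))
    return violations


# Sequence observed in this report after the manual re-trigger.
observed = [
    "upgrade-required",
    "pod-deletion-required",
    "upgrade-failed",
    "upgrade-done",
]
print(find_terminal_violations(observed))
# prints [('upgrade-failed', 'upgrade-done')]
```

Under the alternative expectation (a defined recovery path), upgrade-failed would simply not be listed in `TERMINAL_STATES`, and the same sequence would produce no violations.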
Notes:
- In some cases, the upgrade controller reports the node state as upgrade-done during the GPU Operator upgrade while the driver pods are still stuck in CrashLoopBackOff.
- The observed transition from upgrade-failed to upgrade-done indicates a potential issue in the upgrade controller state machine logic. Either:
  - upgrade-failed is being assigned prematurely for a recoverable condition, or
  - the state is not truly terminal despite its naming and intended semantics.
- This can leave the upgrade status inconsistent and may require unnecessary manual intervention.
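The mismatch described in the first note (label says upgrade-done, driver pod is crash-looping) can be detected mechanically. The sketch below assumes the pod object has been loaded from `kubectl get pod <driver-pod> -o json` into a Python dict. The function names and the sample pod data are illustrative placeholders, not output captured from the affected cluster.

```python
def crashlooping_containers(pod: dict) -> list[str]:
    """Names of init and regular containers waiting in CrashLoopBackOff."""
    status = pod.get("status") or {}
    statuses = (status.get("initContainerStatuses") or []) + (
        status.get("containerStatuses") or []
    )
    names = []
    for cs in statuses:
        waiting = (cs.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            names.append(cs["name"])
    return names


def is_inconsistent(upgrade_state: str, pod: dict) -> bool:
    """True when the node is labelled upgrade-done but the pod is crash-looping."""
    return upgrade_state == "upgrade-done" and bool(crashlooping_containers(pod))


# Placeholder pod status for illustration only.
sample_pod = {
    "status": {
        "initContainerStatuses": [
            {
                "name": "example-init",
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            }
        ],
        "containerStatuses": [],
    }
}
print(is_inconsistent("upgrade-done", sample_pod))
# prints True
```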