[Bug]: GPU Operator Upgrade Controller Incorrectly Transitions from upgrade-failed to upgrade-done #2549



**Environment:**
- GPU Operator Version: 25.3.0 → 25.10.0
- OS: Ubuntu 24.04
- Kernel Version: 6.8.0-generic


**Description:**
When a GPU Operator deployment that uses a pre-compiled driver image is upgraded to one that uses dynamic driver compilation while GPU workloads are running on the node, the driver pod may enter a CrashLoopBackOff state.

During this process, the node label:

`nvidia.com/gpu-driver-upgrade-state=upgrade-failed`

is applied by the upgrade controller.

After manually triggering the upgrade controller by setting:

`nvidia.com/gpu-driver-upgrade-state=upgrade-required`

the upgrade state progresses through:
upgrade-required
→ pod-deletion-required
→ upgrade-failed
→ upgrade-done

Eventually, the driver pod recovers and becomes Ready.

This behavior suggests an inconsistency in the upgrade controller state machine. Specifically, the upgrade-failed state appears to be treated as a terminal failure state, yet the controller subsequently transitions the same node to upgrade-done without an explicit recovery action.


 

**Reproduction Steps:**

1. Deploy GPU Operator v25.3.0 using a pre-compiled driver image.
2. Run one or more GPU workloads on the node.
3. Upgrade GPU Operator to v25.10.0 and disable the pre-compiled driver image (switch to dynamic driver compilation).
4. Observe that the driver pod enters CrashLoopBackOff, typically due to the init container failing to unload the driver while workloads are active.
5. Observe the node label:
     `nvidia.com/gpu-driver-upgrade-state=upgrade-failed`
6. Manually update the node label:
    `nvidia.com/gpu-driver-upgrade-state=upgrade-required`
7. Observe the following state transitions:
    upgrade-required
    → pod-deletion-required
    → upgrade-failed
    → upgrade-done
8. Observe that the driver pod eventually recovers and reaches the Ready state.
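
Steps 6 and 7 can be scripted. The following is a minimal sketch that only builds the `kubectl` command lines; the helper names and the node name `gpu-node-1` are placeholders introduced here for illustration, not part of the original report.

```python
import shlex

# Label key taken from the report above.
UPGRADE_STATE_LABEL = "nvidia.com/gpu-driver-upgrade-state"


def relabel_command(node, state="upgrade-required"):
    """Build the kubectl argv that overwrites the upgrade-state label (step 6)."""
    return [
        "kubectl", "label", "node", node,
        f"{UPGRADE_STATE_LABEL}={state}",
        "--overwrite",  # required because the label already exists on the node
    ]


def watch_command():
    """Build the kubectl argv that watches nodes with the label as a column (step 7)."""
    return ["kubectl", "get", "nodes", "-L", UPGRADE_STATE_LABEL, "--watch"]


if __name__ == "__main__":
    # "gpu-node-1" is a hypothetical node name; substitute the affected node.
    print(shlex.join(relabel_command("gpu-node-1")))
    print(shlex.join(watch_command()))
```

Running the printed commands in two terminals (watch first, then relabel) makes it easier to capture the exact order of the label transitions.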


**Observed Behavior:**

- Driver pod enters CrashLoopBackOff during upgrade.
- Upgrade controller marks the node as upgrade-failed.
- Re-triggering the upgrade controller causes the state machine to continue processing.
- The node transitions from upgrade-failed to upgrade-done.
- Driver pods recover successfully despite the node previously being marked as failed.

**Expected Behavior:**

- If upgrade-failed is intended to be a terminal state, no subsequent transition to upgrade-done should occur without an explicit recovery workflow.
- Alternatively, the state machine should support a clearly defined recovery path from upgrade-failed to a successful completion state.
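
The two expectations above can be stated as a small check over the observed label sequence. This is a hypothetical model written for this report, not the operator's actual transition table: `violations`, `terminal`, and `recovery_edges` are illustrative names, and which states count as terminal is an assumption.

```python
# Sequence of label values observed in the reproduction steps.
OBSERVED = [
    "upgrade-required",
    "pod-deletion-required",
    "upgrade-failed",
    "upgrade-done",
]


def violations(sequence, terminal, recovery_edges=frozenset()):
    """Return each (src, dst) step that leaves a terminal state
    without using an explicitly allowed recovery edge."""
    return [
        (src, dst)
        for src, dst in zip(sequence, sequence[1:])
        if src in terminal and (src, dst) not in recovery_edges
    ]


# Reading 1 (assumption): upgrade-failed is terminal, no recovery path defined.
print(violations(OBSERVED, terminal={"upgrade-failed"}))
# [('upgrade-failed', 'upgrade-done')]

# Reading 2 (assumption): a recovery edge from upgrade-failed is defined.
print(violations(
    OBSERVED,
    terminal={"upgrade-failed"},
    recovery_edges={("upgrade-failed", "upgrade-done")},
))
# []
```

Under the first reading the observed sequence is a violation; under the second it is valid only if the recovery edge is a documented part of the state machine.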

**Notes:**
- In some cases, the upgrade controller reports the node state as upgrade-done during the GPU Operator upgrade while the driver pods are still stuck in CrashLoopBackOff.
- The observed transition from upgrade-failed to upgrade-done indicates a potential issue in the upgrade controller state machine logic. Either:
  - upgrade-failed is being assigned prematurely for a recoverable condition, or
  - the state is not truly terminal despite its naming and intended semantics.
- This can leave the upgrade status inconsistent and may require unnecessary manual intervention.
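
The inconsistency described in these notes can be expressed as a simple predicate. This is a sketch under an assumed definition of "consistent" (upgrade-done implies a Ready driver pod, upgrade-failed implies a driver pod that is not Ready); the function name and parameters are hypothetical and the inputs would have to be collected from the node label and the driver pod status.

```python
def label_matches_pod(upgrade_state, pod_ready, waiting_reason=None):
    """Return True if the node's upgrade-state label agrees with the driver pod.

    upgrade_state:  value of nvidia.com/gpu-driver-upgrade-state on the node
    pod_ready:      whether the driver pod reports Ready
    waiting_reason: container waiting reason, e.g. "CrashLoopBackOff", if any
    """
    if upgrade_state == "upgrade-done":
        # A finished upgrade should not coexist with a crash-looping driver pod.
        return pod_ready and waiting_reason != "CrashLoopBackOff"
    if upgrade_state == "upgrade-failed":
        # A failed upgrade should not coexist with a healthy driver pod.
        return not pod_ready
    # Intermediate states are treated as consistent with any pod status.
    return True


# Case from the first note: label says done, pod is crash-looping.
print(label_matches_pod("upgrade-done", False, "CrashLoopBackOff"))  # False
# Case from the observed behavior: label says failed, pod has recovered.
print(label_matches_pod("upgrade-failed", True))  # False
```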


