[Bug]: Gpu-Operator Upgrade Controller Incorrectly Transitions from upgrade-failed to upgrade-done #2549

@anoopsinghnegi

Environment:

  • GPU Operator Version: 25.3.0 → 25.10.0
  • OS: Ubuntu 24.04
  • Kernel Version: 6.8.0-generic

Description:
When upgrading from a GPU Operator deployment using a pre-compiled driver image to a deployment using dynamic driver compilation while GPU workloads are running on the node, the driver pod may enter a CrashLoopBackOff state during the upgrade.

During this process, the node label:

nvidia.com/gpu-driver-upgrade-state=upgrade-failed

is applied by the upgrade controller.

After manually triggering the upgrade controller by setting:

nvidia.com/gpu-driver-upgrade-state=upgrade-required

the upgrade state progresses through:
upgrade-required
→ pod-deletion-required
→ upgrade-failed
→ upgrade-done

Eventually, the driver pod recovers and becomes Ready.

This behaviour suggests an inconsistency in the upgrade controller state machine. Specifically, the upgrade-failed state appears to be treated as a terminal failure state, yet the controller subsequently transitions the same node to upgrade-done without an explicit recovery action.
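To make the inconsistency concrete, here is a minimal sketch that checks the observed label sequence against a transition table in which upgrade-failed is terminal. The table is an assumption for illustration only: it is not taken from the GPU Operator source and it covers only the states named in this report.

```python
# Hypothetical transition table, NOT copied from the upgrade controller.
# It encodes the reading in which upgrade-failed is a terminal state.
ASSUMED_TRANSITIONS = {
    "upgrade-required": {"pod-deletion-required", "upgrade-failed"},
    "pod-deletion-required": {"upgrade-failed", "upgrade-done"},
    "upgrade-failed": set(),  # terminal under this assumption
    "upgrade-done": {"upgrade-required"},
}


def find_violations(observed, allowed=ASSUMED_TRANSITIONS):
    """Return the (from, to) pairs in `observed` that the table does not allow."""
    violations = []
    for src, dst in zip(observed, observed[1:]):
        if dst not in allowed.get(src, set()):
            violations.append((src, dst))
    return violations


# Sequence reported above after re-triggering the controller.
observed = [
    "upgrade-required",
    "pod-deletion-required",
    "upgrade-failed",
    "upgrade-done",
]
print(find_violations(observed))  # [('upgrade-failed', 'upgrade-done')]
```

If upgrade-failed is meant to be recoverable instead, the table would need an explicit edge out of that state, and the check would then pass.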

Reproduction Steps:

  1. Deploy GPU Operator v25.3.0 using a pre-compiled driver image.
  2. Run one or more GPU workloads on the node.
  3. Upgrade GPU Operator to v25.10.0 and disable the pre-compiled driver image (switch to dynamic driver compilation).
  4. Observe that the driver pod enters CrashLoopBackOff, typically due to the init container failing to unload the driver while workloads are active.
  5. Observe the node label:
    nvidia.com/gpu-driver-upgrade-state=upgrade-failed
  6. Manually update the node label:
    nvidia.com/gpu-driver-upgrade-state=upgrade-required
  7. Observe the following state transitions:
    upgrade-required
    → pod-deletion-required
    → upgrade-failed
    → upgrade-done
  8. Observe that the driver pod eventually recovers and reaches the Ready state.
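For steps 5 to 7, the label can be read and reset with kubectl. The sketch below only builds the command lines so they can be reviewed before running; the node name `gpu-node-1` and the helper names are placeholders, not part of the original report.

```python
import shlex

LABEL_KEY = "nvidia.com/gpu-driver-upgrade-state"


def relabel_command(node, state):
    """kubectl argv that overwrites the upgrade-state label on `node`."""
    return ["kubectl", "label", "node", node, f"{LABEL_KEY}={state}", "--overwrite"]


def watch_label_command(node):
    """kubectl argv that shows the label as an extra column and watches for changes."""
    return ["kubectl", "get", "node", node, "-L", LABEL_KEY, "--watch"]


node = "gpu-node-1"  # hypothetical node name
print(shlex.join(watch_label_command(node)))
print(shlex.join(relabel_command(node, "upgrade-required")))
```

Running the watch command in one terminal and the relabel command in another should show the transitions listed in step 7 as they happen.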

Observed Behavior:

  • Driver pod enters CrashLoopBackOff during upgrade.
  • Upgrade controller marks the node as upgrade-failed.
  • Re-triggering the upgrade controller causes the state machine to continue processing.
  • The node transitions from upgrade-failed to upgrade-done.
  • Driver pods recover successfully despite the node previously being marked as failed.

Expected Behavior:

  • If upgrade-failed is intended to be a terminal state, no subsequent transition to upgrade-done should occur without an explicit recovery workflow.
  • Alternatively, the state machine should support a clearly defined recovery path from upgrade-failed to a successful completion state.

Notes:

  • In some cases, the upgrade controller reports the node state as upgrade-done during the GPU Operator upgrade while the driver pods remain stuck in the CrashLoopBackOff state.
  • The observed transition from upgrade-failed to upgrade-done indicates a potential issue in the upgrade controller state machine logic. Either:
      • upgrade-failed is being assigned prematurely for a recoverable condition, or
      • the state is not truly terminal despite its naming and intended semantics.
  • This can leave the upgrade status inconsistent and may require unnecessary manual intervention.
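The mismatch described in the first note (label says upgrade-done, pod is crash-looping) can be detected from the node label plus the pod JSON returned by `kubectl get pod -o json`. This is a rough sketch, assuming the standard Pod status fields; the container name in the sample is made up, and the function names are illustrative.

```python
def crashlooping_containers(pod):
    """Names of init or regular containers currently waiting in CrashLoopBackOff."""
    status = pod.get("status", {})
    names = []
    for key in ("initContainerStatuses", "containerStatuses"):
        for cs in status.get(key, []):
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                names.append(cs["name"])
    return names


def is_inconsistent(upgrade_state, pod):
    """True when the node is labelled upgrade-done but the driver pod is crash-looping."""
    return upgrade_state == "upgrade-done" and bool(crashlooping_containers(pod))


# Hand-written sample shaped like a Pod object; not captured from a real cluster.
sample_pod = {
    "status": {
        "initContainerStatuses": [
            {"name": "driver-init", "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
        ],
        "containerStatuses": [],
    }
}
print(is_inconsistent("upgrade-done", sample_pod))  # True
```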

Labels: bug, needs-triage
