add support for default nvidiadriver and migration from clusterpolicy to nvidiadriver#2518
Conversation
cf23d79 to
4adfc71
Compare
8125f47 to
b21a20b
Compare
b21a20b to
a1dea10
Compare
rajathagasthya
left a comment
There was a problem hiding this comment.
Overall looks solid, just a few questions.
2e803be to
452812c
Compare
| if changed { | ||
| return ctrl.Result{RequeueAfter: time.Second}, nil | ||
| } | ||
| } |
There was a problem hiding this comment.
I am trying to reason about why we are doing this. Does the below statement (from the PR description) capture the reasoning?
ClusterPolicy status did not consistently move to notReady for early NVIDIADriver ownership failures. It now sets status.state to notReady and records the concrete conflict message in status conditions.
It feels wrong to call AssignOwners() here in the ClusterPolicy controller when it is already being called in the NVIDIADriver controller (and I would argue it should only be called there). If the goal is to mark ClusterPolicy as notReady when any NVIDIADriver CR is notReady could we not check the status of all the NVIDIADriver CRs in the cluster instead? And even then, I question whether we want the ClusterPolicy status to take the NVIDIADriver CR statuses into account.
There was a problem hiding this comment.
@rahulait it looks like my comment only got added to L164. I meant to attach this to the entire code block.
There was a problem hiding this comment.
These changes were for migration from clusterpolicy to nvidiadriver. I can see these are causing confusion, let me see if I can optimize them further and only let nvidiadriver add ownership labels.
There was a problem hiding this comment.
Regarding status field of clusterpolicy: how should we model ClusterPolicy status when driver management is moved to NVIDIADriver?
With NVIDIADriver enabled, ownership of the driver operand moves out of ClusterPolicy, while the remaining operands are still managed by ClusterPolicy. During a node driver upgrade, the driver pod is replaced first, and then other operands may restart or go through init again.
Do we want ClusterPolicy status to reflect only the operands it still owns, which may cause it to become Ready during driver uninstall/replacement and then flip back to NotReady when dependent operands restart? Or should ClusterPolicy continue to represent the combined health of the overall GPU Operator stack, including NVIDIADriver-managed driver state?
There was a problem hiding this comment.
As of today and even with this PR, the clusterpolicy state keeps fluctuating when driver is upgraded by nvidiadriver. I have filed a separate task to fix it in a separate followup PR.
| return false, fmt.Errorf("failed to list GPU nodes: %w", err) | ||
| } | ||
|
|
||
| for i := range nodes.Items { |
There was a problem hiding this comment.
The AssignOwners method will be called during every reconcile of a NVIDIADriver CR and each reconciler involves a loop through the entire list of gpu nodes. How will the performance be in large clusters ?
There was a problem hiding this comment.
The loop is O(number of GPU nodes × number of nvidiadriver CRs), but the CPU cost is not the main issue. The main cost at large scale is listing/copying thousands of Node objects and patching changed nodes. In an already converged/synced cluster, patches are skipped, so impact should be modest, cost is mostly large cache reads. The worst case is convergence after rollout or selector changes where multiple nodes will be patched together, API writes and watch fan-out dominate. We can mitigate by reusing a single GPU-node list across validation/assignment and rate-limiting node patches per reconcile. I would consider it an enhancement we can do to ease load in case of large clusters.
There was a problem hiding this comment.
Created a task for this so that we can do enhancements in future PRs.
There was a problem hiding this comment.
Anything we can add to the instrumentation so that we can closely monitor this? Or are the current metrics we expose sufficient?
There was a problem hiding this comment.
I think we should. controller-runtime metrics might tell us how much time a reconcile is taking, or if the queue is backing up, but it won't tell us how much time assignOwners is taking or any other call. We should at the very minimum atleast add logs which can be enabled to view time taken during each call. I'll add this requirement to the rate-limit request so that we do both at the same time. Let me know if you think we should add it to current PR and I'll add accordingly.
375e329 to
56f5a7d
Compare
56f5a7d to
66c5d29
Compare
cdesiniotis
left a comment
There was a problem hiding this comment.
LGTM. Thanks @rahulait for the hard work and patience on this one!
| return false, fmt.Errorf("failed to list GPU nodes: %w", err) | ||
| } | ||
|
|
||
| for i := range nodes.Items { |
There was a problem hiding this comment.
Anything we can add to the instrumentation so that we can closely monitor this? Or are the current metrics we expose sufficient?
66c5d29 to
cb154ca
Compare
| // included even for delete events so the NotFound reconcile path can clear | ||
| // stale node owner labels. | ||
| func (r *NVIDIADriverReconciler) enqueueNVIDIADriverReconcilers(ctx context.Context, driver *nvidiav1alpha1.NVIDIADriver) []reconcile.Request { | ||
| requests := r.enqueueAllNVIDIADrivers(ctx) |
There was a problem hiding this comment.
Instead of doing multiple list-alls of NVIDIADriver CR, how about creating a struct field (a map/set) in the NVIDIADriverReconciler struct which holds the unique set of all NVIDIADriver CRs? We'd add a new NVIDIADriver CR to the set if it's processed for the first time. We can remove the NVIDIADriver CR from this set when it is marked for deletion.
There was a problem hiding this comment.
This will create a second source of truth beside the controller-runtime cache. We'll have to deal with startup correctness, delete correctness, concurrency, etc. The current r.List(ctx, list) is normally served from the informer cache, not a live API-server request, so it is much cheaper than it looks. It is also much safer because it reflects the cache’s current view after watch/list reconciliation. Are you seeing any performance issues with current approach? Just want to be clear on the issue you might be seeing with current approach.
There was a problem hiding this comment.
While it may not be an issue in clusters with a few NVIDIADrivers. This becomes a problem in large clusters where it's also possible to have several NVIDIADriver CRs to loop through. In these scenarios, we are bound to hit scaling problems.
There was a problem hiding this comment.
I think we should move the assignOwners() method to its own controller, that way it can reconcile on its own when needed and add ownership label. Then, we don't need to redo those calculations again and again on each reconcile in nvidiadriver CR and it can focus just on driver enablement. I'll add ownership-reconciler in a follow up PR which will improve things a lot in large clusters with lots of nvidiadrivers. I'll open follow up PRs for this.
There was a problem hiding this comment.
@rahulait in addition to moving assignOwners() to its own controller, what are your thoughts on moving the other label operations, like labelNodesWithOrphanDriverPods(), to this separate controller? Would that be possible / desirable? I am also thinking about whether it makes sense to move all of the node labeling that GPU Operator does (primarily in the clusterpolicy controller) to a dedicated controller.
cb154ca to
7e5c414
Compare
7eec877 to
c7d6ba0
Compare
|
/ok to test c7d6ba0 |
… to nvidiadriver changes include: * added default nvidiadriver which doesn't conflict with other user-specified nvidiadrivers * added migration support from clusterpolicy to nvidiadriver * added/updated e2e tests for nvidiadriver workflow * use new field to mark a nvidiadriver as default instead of using a label * don't allow adding nodeselector to default nvidiadriver, it should cover entire cluster * move default NVIDIADriver ownership helpers into internal/nvidiadriver and re-use * requeue after node owner label changes so later reconcile steps read updated cache state * use patch instead of update for node label updates Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
c7d6ba0 to
53770fc
Compare
|
/ok to test 53770fc |
Fixes:
#2353
#2355
Summary
This PR adds support for a default
NVIDIADriverCR and enables controlled migration from legacyClusterPolicydriver management toNVIDIADriver-based driver management.When
driver.nvidiaDriverCRD.enabled=trueanddriver.nvidiaDriverCRD.deployDefaultCR=true, Helm renders a defaultNVIDIADriverCR. This CR is identified byspec.default=true:The CR name is not semantically important. Helm renders it as
default, but users can create an arbitrarily namedNVIDIADriverand mark it as the fallback driver by settingspec.default=true.Behavior
NVIDIADriverCRs own nodes selected by theirspec.nodeSelector.NVIDIADriveracts as fallback for GPU nodes that do not match any non-default driver.NVIDIADriverhasspec.default=true, fallback ownership is skipped.NVIDIADriverhasspec.default=true, reconciliation fails closed.NVIDIADrivercannot definespec.nodeSelector; the fallback always applies to remaining GPU nodes that do not match non-default drivers.spec.defaultis changed tofalse, that CR becomes a normal user-definedNVIDIADriverand participates in nodeSelector conflict detection.NVIDIADrivernode selectors cannot include the operator-managed owner labelnvidia.com/gpu-operator.driver.owner.NVIDIADrivermode,NVIDIADriverownership conflicts markClusterPolicyasnotReadyand surface the conflict in its status conditions.Migration Support
This PR supports migration from
ClusterPolicydriver management toNVIDIADriverdriver management by:NVIDIADriverfrom Helm values when CRD mode is enabledClusterPolicydriver fields whendriver.nvidiaDriverCRD.enabled=falseNVIDIADriverownership through the operator-managed node labelnvidia.com/gpu.driver.ownerClusterPolicy-owned driver daemonsets afterNVIDIADriverownership takes overClusterPolicydriver daemonsets beforeNVIDIADriverrolls replacement podsMigration requires the target GPU nodes to be matched by at least one
NVIDIADriverCR. Ifdriver.nvidiaDriverCRD.enabled=truebut no matchingNVIDIADriverexists, the controller has no driver owner to assign to those nodes, so fallback ownership is skipped until a matching CR is created.Upgrade Example
When upgrading from a release that does not contain this migration support, migration from
ClusterPolicydriver management toNVIDIADriverdriver management must be performed in two upgrades. This applies only to clusters that are currently usingClusterPolicyfor driver management and are switching toNVIDIADriver. It applies to releases older thanv26.7.0.Clusters that are already using
NVIDIADriverin an older release do not need this two-phase migration flow.Starting with
v26.7.0, the migration support is already present in the running controller. Clusters already runningv26.7.0or newer can switch toNVIDIADrivermode directly by upgrading or updating configuration withdriver.nvidiaDriverCRD.enabled=true, as long as the target GPU nodes are matched byNVIDIADriverCRs. The recommended path is to also setdriver.nvidiaDriverCRD.deployDefaultCR=trueso Helm renders the default fallback CR. Alternatively, users can create their ownNVIDIADriverCRs and mark one as the default fallback withspec.default=true. See theBehaviorsection for details on default and non-default ownership.Do not jump directly from an old release to the new release with
driver.nvidiaDriverCRD.enabled=true. The old controller does not know about the controlled migration flow. If the old controller exits while the new release is already configured forNVIDIADrivermode, it can tear down all existing driver pods immediately before the new controller has a chance to take ownership.Use this sequence instead:
Upgrade to the new release with
ClusterPolicydriver management still enabled.Example Helm settings:
This gets the new controller logic running while the existing
ClusterPolicy-owned driver pods remain in place.Upgrade again to the same new release, this time enabling
NVIDIADrivermode.Example Helm settings:
This renders the default
NVIDIADriverand performs the controlled migration fromClusterPolicyownership toNVIDIADriverownership.Issues Found And Fixed
During testing, we found several edge cases:
defaultas special was too restrictive and could conflict with user-created CRs. The default driver is now identified byspec.default=trueinstead.spec.nodeSelector.spec.nodeSelectorcould includenvidia.com/gpu-operator.driver.ownerand override the operator-managed owner selector used by driver daemonsets. The controller, admission validation, and daemonset nodeSelector construction now reject or defensively preserve that invariant.ClusterPolicyreconciliation previously attemptedNVIDIADriverowner assignment even whenNVIDIADrivermode was disabled. Owner assignment and orphaned-pod migration are now gated ondriver.enabled=trueanddriver.nvidiaDriverCRD.enabled=true.spec.default, so it bumps.metadata.generationand appears in thekubectl get nvidiadriverDefaultprint column.NVIDIADrivernames.E2E Coverage
The
NVIDIADrivere2e flow now covers:NVIDIADriveronly whendriver.nvidiaDriverCRD.enabled=trueanddriver.nvidiaDriverCRD.deployDefaultCR=true.deployDefaultCR=false.NVIDIADriverCRD mode.spec.default=true.NVIDIADrivercan take ownership of GPU nodes.NVIDIADriver.NVIDIADriver.spec.default=falsemakes that CR a normal driver and conflict detection applies.NVIDIADrivermode before teardown. It verifies the legacy driver daemonset is removed, the existing legacy pod is orphaned, the defaultNVIDIADrivertakes ownership, the orphaned pod is deleted by the upgrade flow within 5 minutes, and the replacement driver pod becomes ready.Checklist
make lint)make validate-generated-assets)make validate-modules)Testing
Added unit-tests, e2e tests and also did manual testing on a 3 node cluster with multiple nvidiadriver CRs. Everything worked as expected even with migration.