fix(k8s): cover EKS node-pool delete (K8S06-03) and fix CSI test-pool flakes#474
fix(k8s): cover EKS node-pool delete (K8S06-03) and fix CSI test-pool flakes#474abegnoche wants to merge 2 commits into
Conversation
Add the AWS EKS provider steps that exercise the node-pool delete leg the k8s suite already declares (K8sNodePoolCheck for K8S06-03). A throwaway CPU pool with its own terraform-delete.tfstate is created in setup and destroyed in the test phase so the matching check can assert it converges to zero nodes; a teardown safety-net destroy covers setup-only runs. Validated end-to-end against a live EKS cluster: the delete check polled the isv-test-delete-pool nodegroup selector after destroy and passed, with no node-count or GPU-capacity interference from the throwaway pool. Signed-off-by: Alexandre Begnoche <abegnoche@nvidia.com>
Two CSI provisioning failures surfaced on the EKS suite were both caused by the cluster-capability checks sweeping in the throwaway node pools that node-pool CRUD checks create/scale/delete within the same run: - Static EBS (K8S23-05 `static`): the hand-built static PV declared no topology, so its consumer pod could be scheduled cross-AZ from the zonal EBS volume and the attach hung until the bind timeout. Add an optional `static_pv.zone` that pins the PV to the volume's AZ via `spec.nodeAffinity`; the EKS setup now emits the volume's AZ and the suite wires it through. No-op for zone-agnostic backends (EFS). - EFS shared-fs/nfs (K8S23-04): provisioning succeeded but the mount pod could land on a freshly joined test-pool node whose CSI node-plugin DaemonSet pod was not yet running, hanging the mount. Keep CSI probe pods off nodes carrying the `isv.ncp.validation/pool` marker, mirroring the existing K8sNodeCountCheck exclusion. No-op where no test pools exist (single-node k3s/minikube/microk8s). Verified against the live EKS run's diagnostics (EFS provisioner logs showed ProvisioningSucceeded; the failures were at the node mount/attach stage on test-pool nodes). Adds unit coverage for both mutators. The K8sGpuStressWorkload failure on a fresh GPU test-pool node is a distinct container-level failure (not a scheduling hang) and is left for a separate fix once logs from a preserved cluster confirm the cause. Signed-off-by: Alexandre Begnoche <abegnoche@nvidia.com>
📝 WalkthroughWalkthroughTwo validation enhancements for EKS: a throwaway CPU node pool lifecycle is added to ChangesEKS Delete Node Pool and CSI Zone Pinning
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-06-17 20:56:21 UTC | Commit: ca01306 |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
isvtest/src/isvtest/validations/k8s_storage.py (1)
1982-1997: ⚡ Quick winAvoid clobbering existing affinity in
_set_mount_pod_fields.Line 1982 replaces the entire
spec.affinityobject, which can silently drop template-defined
podAffinity/podAntiAffinityor existing node constraints. Merge the exclusion rule instead of overwriting.Suggested patch
- spec["affinity"] = { - "nodeAffinity": { - "requiredDuringSchedulingIgnoredDuringExecution": { - "nodeSelectorTerms": [ - { - "matchExpressions": [ - { - "key": "isv.ncp.validation/pool", - "operator": "DoesNotExist", - } - ] - } - ] - } - } - } + affinity = spec.setdefault("affinity", {}) + node_affinity = affinity.setdefault("nodeAffinity", {}) + required = node_affinity.setdefault( + "requiredDuringSchedulingIgnoredDuringExecution", + {"nodeSelectorTerms": [{"matchExpressions": []}]}, + ) + terms = required.setdefault("nodeSelectorTerms", []) + if not terms: + terms.append({"matchExpressions": []}) + for term in terms: + exprs = term.setdefault("matchExpressions", []) + exprs.append( + { + "key": "isv.ncp.validation/pool", + "operator": "DoesNotExist", + } + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@isvtest/src/isvtest/validations/k8s_storage.py` around lines 1982 - 1997, The code in the _set_mount_pod_fields function is completely replacing the spec["affinity"] dictionary, which can lose any existing affinity settings like podAffinity or podAntiAffinity that were previously defined. Instead of directly assigning a new affinity structure, check if spec["affinity"] already exists and merge the nodeAffinity exclusion rule into it while preserving any existing affinity constraints. Only create the full affinity structure if it does not already exist.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@isvtest/src/isvtest/validations/k8s_storage.py`:
- Around line 1982-1997: The code in the _set_mount_pod_fields function is
completely replacing the spec["affinity"] dictionary, which can lose any
existing affinity settings like podAffinity or podAntiAffinity that were
previously defined. Instead of directly assigning a new affinity structure,
check if spec["affinity"] already exists and merge the nodeAffinity exclusion
rule into it while preserving any existing affinity constraints. Only create the
full affinity structure if it does not already exist.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 48a7c2d2-294a-4a2e-9682-040b009b6392
📒 Files selected for processing (5)
isvctl/configs/providers/aws/config/eks.yamlisvctl/configs/providers/aws/scripts/eks/setup.shisvctl/configs/suites/k8s.yamlisvtest/src/isvtest/validations/k8s_storage.pyisvtest/tests/test_k8s_storage.py
Summary
terraform-delete.tfstate) created in setup and deleted in the test phase, so the matchingK8sNodePoolCheckcan assert it converges to zero nodes; a teardown safety-net destroy covers setup-only runs.K8S23-05static): the hand-built static PV declared no topology, so its consumer pod could be scheduled cross-AZ from the zonal EBS volume and the attach hung until the bind timeout. Adds an optionalstatic_pv.zonethat pins the PV to the volume's AZ viaspec.nodeAffinity; the EKS setup now emits the volume's AZ and the suite wires it through. No-op for zone-agnostic backends (EFS).K8S23-04): provisioning succeeded but the mount pod could land on a freshly joined test-pool node whose CSI node-plugin DaemonSet pod wasn't running yet, hanging the mount. Keeps CSI probe pods off nodes carrying theisv.ncp.validation/poolmarker, mirroring the existingK8sNodeCountCheckexclusion. No-op where no test pools exist (single-node k3s/minikube/microk8s).Validation
Run end-to-end against a live EKS cluster (
ISVTEST_INCLUDE_UNRELEASED=1):K8sNodePoolCheckdelete leg (K8S06-03) → passed (pool converged to zero nodes after the test-phase delete).K8sCsiStorageTypesCheckpvc-binds[shared-fs]/pvc-binds[nfs]→ passed.K8sCsiProvisioningModesCheckstatic→ passed.k8s_storagemutators;isvtest/tests/test_k8s_storage.pygreen (make test),make validate-suites CHECK=1green, shellcheck/ruff clean.