Skip to content

fix(k8s): cover EKS node-pool delete (K8S06-03) and fix CSI test-pool flakes#474

Open
abegnoche wants to merge 2 commits into
NVIDIA:mainfrom
abegnoche:feat/k8s-eks-node-pool-delete
Open

fix(k8s): cover EKS node-pool delete (K8S06-03) and fix CSI test-pool flakes#474
abegnoche wants to merge 2 commits into
NVIDIA:mainfrom
abegnoche:feat/k8s-eks-node-pool-delete

Conversation

@abegnoche

@abegnoche abegnoche commented Jun 17, 2026

Copy link
Copy Markdown
Member

Summary

  • Wire the EKS node-pool delete leg (K8S06-03). Adds the AWS EKS provider steps that exercise the delete operation the k8s suite already declares: a throwaway CPU pool (own terraform-delete.tfstate) created in setup and deleted in the test phase, so the matching K8sNodePoolCheck can assert it converges to zero nodes; a teardown safety-net destroy covers setup-only runs.
  • Fix two CSI checks that failed on transient test-pool nodes. Both surfaced on the EKS suite and were caused by cluster-capability checks sweeping in the throwaway node pools that node-pool CRUD checks create/scale/delete within the same run:
    • Static EBS (K8S23-05 static): the hand-built static PV declared no topology, so its consumer pod could be scheduled cross-AZ from the zonal EBS volume and the attach hung until the bind timeout. Adds an optional static_pv.zone that pins the PV to the volume's AZ via spec.nodeAffinity; the EKS setup now emits the volume's AZ and the suite wires it through. No-op for zone-agnostic backends (EFS).
    • EFS shared-fs/nfs (K8S23-04): provisioning succeeded but the mount pod could land on a freshly joined test-pool node whose CSI node-plugin DaemonSet pod wasn't running yet, hanging the mount. Keeps CSI probe pods off nodes carrying the isv.ncp.validation/pool marker, mirroring the existing K8sNodeCountCheck exclusion. No-op where no test pools exist (single-node k3s/minikube/microk8s).

Validation

Run end-to-end against a live EKS cluster (ISVTEST_INCLUDE_UNRELEASED=1):

  • K8sNodePoolCheck delete leg (K8S06-03) → passed (pool converged to zero nodes after the test-phase delete).
  • K8sCsiStorageTypesCheck pvc-binds[shared-fs] / pvc-binds[nfs]passed.
  • K8sCsiProvisioningModesCheck staticpassed.
  • No node-count or GPU-capacity interference from the throwaway pools.
  • Unit: added regression tests for both k8s_storage mutators; isvtest/tests/test_k8s_storage.py green (make test), make validate-suites CHECK=1 green, shellcheck/ruff clean.

Add the AWS EKS provider steps that exercise the node-pool delete leg the
k8s suite already declares (K8sNodePoolCheck for K8S06-03). A throwaway
CPU pool with its own terraform-delete.tfstate is created in setup and
destroyed in the test phase so the matching check can assert it converges
to zero nodes; a teardown safety-net destroy covers setup-only runs.

Validated end-to-end against a live EKS cluster: the delete check polled
the isv-test-delete-pool nodegroup selector after destroy and passed,
with no node-count or GPU-capacity interference from the throwaway pool.

Signed-off-by: Alexandre Begnoche <abegnoche@nvidia.com>
Two CSI provisioning failures surfaced on the EKS suite were both caused
by the cluster-capability checks sweeping in the throwaway node pools that
node-pool CRUD checks create/scale/delete within the same run:

- Static EBS (K8S23-05 `static`): the hand-built static PV declared no
  topology, so its consumer pod could be scheduled cross-AZ from the
  zonal EBS volume and the attach hung until the bind timeout. Add an
  optional `static_pv.zone` that pins the PV to the volume's AZ via
  `spec.nodeAffinity`; the EKS setup now emits the volume's AZ and the
  suite wires it through. No-op for zone-agnostic backends (EFS).
- EFS shared-fs/nfs (K8S23-04): provisioning succeeded but the mount pod
  could land on a freshly joined test-pool node whose CSI node-plugin
  DaemonSet pod was not yet running, hanging the mount. Keep CSI probe
  pods off nodes carrying the `isv.ncp.validation/pool` marker, mirroring
  the existing K8sNodeCountCheck exclusion. No-op where no test pools
  exist (single-node k3s/minikube/microk8s).

Verified against the live EKS run's diagnostics (EFS provisioner logs
showed ProvisioningSucceeded; the failures were at the node mount/attach
stage on test-pool nodes). Adds unit coverage for both mutators.

The K8sGpuStressWorkload failure on a fresh GPU test-pool node is a
distinct container-level failure (not a scheduling hang) and is left for
a separate fix once logs from a preserved cluster confirm the cause.

Signed-off-by: Alexandre Begnoche <abegnoche@nvidia.com>
@abegnoche abegnoche requested a review from a team as a code owner June 17, 2026 20:55
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Two validation enhancements for EKS: a throwaway CPU node pool lifecycle is added to eks.yaml for the K8S06-03 delete scenario (create in setup, destroy in test, safety-net in teardown). Independently, setup.sh now queries AWS for the standalone EBS volume's availability zone, propagates it via the CSI inventory, and k8s_storage.py threads a zone parameter through the static PV provisioning path to conditionally set spec.nodeAffinity on the PV manifest. Mount-pod scheduling also gains a node anti-affinity to exclude transient validation node pools.

Changes

EKS Delete Node Pool and CSI Zone Pinning

Layer / File(s) Summary
EKS delete node pool lifecycle
isvctl/configs/providers/aws/config/eks.yaml
Adds setup, test, and teardown steps for a throwaway CPU node pool using a dedicated terraform-delete.tfstate, covering creation, destruction, and an idempotent safety-net.
EBS volume AZ detection
isvctl/configs/providers/aws/scripts/eks/setup.sh
Introduces STATIC_VOLUME_AZ, queries AWS EC2 for the standalone EBS volume's AZ after handle resolution, resets to empty on failure, and emits static_volume_az in the CSI inventory JSON.
Zone config field and signature threading
isvctl/configs/suites/k8s.yaml, isvtest/src/isvtest/validations/k8s_storage.py
Adds static_pv.zone to the suite config (Jinja-defaulted from the inventory), updates K8sCsiProvisioningModesCheck docs, and threads the zone parameter through _run_static, _apply_pv, and _set_pv_fields.
PV zone-affinity and mount-pod anti-affinity
isvtest/src/isvtest/validations/k8s_storage.py
_set_pv_fields conditionally writes or removes spec.nodeAffinity based on zone. _set_mount_pod_fields adds a DoesNotExist anti-affinity for isv.ncp.validation/pool nodes.
Unit tests
isvtest/tests/test_k8s_storage.py
Adds tests for no-zone / with-zone PV nodeAffinity behavior and for mount-pod exclusion of test-pool nodes.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely summarizes the main changes: fixing EKS node-pool delete (K8S06-03) and CSI test-pool flakes, which aligns with the changeset across configuration, setup script, and validation code.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@abegnoche abegnoche changed the title feat(k8s): cover EKS node-pool delete (K8S06-03) and fix CSI test-pool flakes fix(k8s): cover EKS node-pool delete (K8S06-03) and fix CSI test-pool flakes Jun 17, 2026
@github-actions

Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-06-17 20:56:21 UTC | Commit: ca01306

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
isvtest/src/isvtest/validations/k8s_storage.py (1)

1982-1997: ⚡ Quick win

Avoid clobbering existing affinity in _set_mount_pod_fields.

Line 1982 replaces the entire spec.affinity object, which can silently drop template-defined
podAffinity/podAntiAffinity or existing node constraints. Merge the exclusion rule instead of overwriting.

Suggested patch
-    spec["affinity"] = {
-        "nodeAffinity": {
-            "requiredDuringSchedulingIgnoredDuringExecution": {
-                "nodeSelectorTerms": [
-                    {
-                        "matchExpressions": [
-                            {
-                                "key": "isv.ncp.validation/pool",
-                                "operator": "DoesNotExist",
-                            }
-                        ]
-                    }
-                ]
-            }
-        }
-    }
+    affinity = spec.setdefault("affinity", {})
+    node_affinity = affinity.setdefault("nodeAffinity", {})
+    required = node_affinity.setdefault(
+        "requiredDuringSchedulingIgnoredDuringExecution",
+        {"nodeSelectorTerms": [{"matchExpressions": []}]},
+    )
+    terms = required.setdefault("nodeSelectorTerms", [])
+    if not terms:
+        terms.append({"matchExpressions": []})
+    for term in terms:
+        exprs = term.setdefault("matchExpressions", [])
+        exprs.append(
+            {
+                "key": "isv.ncp.validation/pool",
+                "operator": "DoesNotExist",
+            }
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@isvtest/src/isvtest/validations/k8s_storage.py` around lines 1982 - 1997, The
code in the _set_mount_pod_fields function is completely replacing the
spec["affinity"] dictionary, which can lose any existing affinity settings like
podAffinity or podAntiAffinity that were previously defined. Instead of directly
assigning a new affinity structure, check if spec["affinity"] already exists and
merge the nodeAffinity exclusion rule into it while preserving any existing
affinity constraints. Only create the full affinity structure if it does not
already exist.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@isvtest/src/isvtest/validations/k8s_storage.py`:
- Around line 1982-1997: The code in the _set_mount_pod_fields function is
completely replacing the spec["affinity"] dictionary, which can lose any
existing affinity settings like podAffinity or podAntiAffinity that were
previously defined. Instead of directly assigning a new affinity structure,
check if spec["affinity"] already exists and merge the nodeAffinity exclusion
rule into it while preserving any existing affinity constraints. Only create the
full affinity structure if it does not already exist.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 48a7c2d2-294a-4a2e-9682-040b009b6392

📥 Commits

Reviewing files that changed from the base of the PR and between ff7cdd4 and ca01306.

📒 Files selected for processing (5)
  • isvctl/configs/providers/aws/config/eks.yaml
  • isvctl/configs/providers/aws/scripts/eks/setup.sh
  • isvctl/configs/suites/k8s.yaml
  • isvtest/src/isvtest/validations/k8s_storage.py
  • isvtest/tests/test_k8s_storage.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant