Skip to content

Migrating Ipfailover test cases to images repo#243

Open
melvinjoseph86 wants to merge 3 commits into
openshift:masterfrom
melvinjoseph86:ipfailover_migration_e2e
Open

Migrating Ipfailover test cases to images repo#243
melvinjoseph86 wants to merge 3 commits into
openshift:masterfrom
melvinjoseph86:ipfailover_migration_e2e

Conversation

@melvinjoseph86

@melvinjoseph86 melvinjoseph86 commented Jun 25, 2026

Copy link
Copy Markdown

Summary by CodeRabbit

  • New Features
    • Added an ipfailover end-to-end test extension with Ginkgo suite registration and platform-aware selection.
    • Introduced Network_Edge coverage for MASTER/backup transitions, VIP migration, configuration validation, preemption timing, VRRP ID exclusion, VIP range handling, and unicast failover behavior.
  • Documentation
    • Added README with suite groupings, run commands, execution constraints, and expected runtimes.
  • Chores
    • Added a test build Makefile and a multi-stage Docker build to package the test extension; updated ignore rules to avoid committing generated test binaries.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 25, 2026
@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: melvinjoseph86
Once this PR has been reviewed and has the lgtm label, please assign mrunalp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@melvinjoseph86

Copy link
Copy Markdown
Author

/payload-job-with-prs pull-ci-openshift-images-master-e2e-metal-ipi-tests-ext openshift/release/pull/81103

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@melvinjoseph86: An error was encountered. No known errors were detected, please see the full error message for details.

Full error message. unable to get additional pr info from string: openshift/release/pull/81103: string: openshift/release/pull/81103 doesn't match expected format: org/repo#number

Please contact an administrator to resolve this issue.

@melvinjoseph86

Copy link
Copy Markdown
Author

/payload-job-with-prs pull-ci-openshift-images-master-e2e-metal-ipi-tests-ext openshift/release#81103

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@melvinjoseph86: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

@melvinjoseph86

Copy link
Copy Markdown
Author

/payload-job-with-prs periodic-ci-openshift-images-master-e2e-metal-ipi-tests-ext openshift/release#81103

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@melvinjoseph86: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

@redhat-chai-bot

Copy link
Copy Markdown

Investigation: Deserialization Error Root Cause

I investigated the Deserialization Error: invalid character 'I' looking for beginning of value failure reported in #wg-openshift-tests-extension. Here are the findings:

Root Cause: klog.SetOutput(g.GinkgoWriter) timing bug

The init() function in test/e2e/ipfailover.go captures a reference to g.GinkgoWriter at init-time, but at that point GinkgoWriter is still initialized to os.Stdout (set in Ginkgo's own init() in core_dsl.go). Later, OTE's configureGinkgo() replaces the GinkgoWriter variable with a new Writer pointing to os.Stderr, but klog still holds the old Writer object that writes to stdout.

This means all klog INFO lines (I0625...) go to stdout for the entire process lifetime, contaminating the JSON output stream.

The same timing issue affects rest.SetDefaultWarningHandler(rest.NewWarningWriter(g.GinkgoWriter, ...)).

Recommended fix for this PR

Replace the GinkgoWriter references in init() with direct os.Stderr:

func init() {
    // Write directly to stderr — no timing dependency on GinkgoWriter
    klog.SetOutput(os.Stderr)

    // Redirect REST client warnings directly to stderr
    rest.SetDefaultWarningHandler(
        rest.NewWarningWriter(os.Stderr, rest.WarningWriterOptions{}),
    )

    exutil.WithCleanup(func() {})
}

The flag.Set("ginkgo.output-interceptor-mode", "none") line can also be removed — OTE uses NoopOutputInterceptor{} unconditionally, so the flag has no effect.

Framework-side fix

A structural fix is also being prepared for openshift-eng/openshift-tests-extension to make newTestResultFromOutput() resilient to non-JSON stdout content (similar to how origin's extractJSON() works). This will prevent future occurrences regardless of the contamination source.

Investigation triggered from #wg-openshift-tests-extension discussion

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a1ac8ef8-af47-4dc6-b410-403cfb1c7a3d

📥 Commits

Reviewing files that changed from the base of the PR and between 4d59ec2 and c6a1662.

📒 Files selected for processing (1)
  • ipfailover/keepalived/Dockerfile
🚧 Files skipped from review as they are similar to previous changes (1)
  • ipfailover/keepalived/Dockerfile

Walkthrough

Adds an ipfailover keepalived test extension with build packaging, a Go entrypoint, shared test helpers, deployment testdata, six e2e scenarios, and usage documentation.

Changes

ipfailover OTE extension

Layer / File(s) Summary
Build packaging
ipfailover/keepalived/tests/go.mod, ipfailover/keepalived/tests/Makefile, ipfailover/keepalived/Dockerfile, .gitignore
Adds the Go module, dependency overrides, build targets, a multi-stage image build that packages the extension, and an ignore rule for the generated binary.
Entrypoint and suites
ipfailover/keepalived/tests/cmd/main.go
Adds the Cobra entrypoint that filters local e2e specs, applies platform selection, registers the extension, and defines the serial conformance, disruptive, and all suites.
Deployment template and helpers
ipfailover/keepalived/tests/e2e/testdata/router/ipfailover.yaml, ipfailover/keepalived/tests/e2e/util.go
Adds the deployment template and helper functions for namespace setup, template processing, pod/VIP polling, unicast peer setup, and IP stack detection.
E2E scenarios and docs
ipfailover/keepalived/tests/e2e/ipfailover.go, ipfailover/keepalived/tests/README.md
Adds the Ginkgo suite, six ipfailover scenarios, initialization wiring, and extension usage documentation.

Estimated code review effort: 4 (Complex) | ~60 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (3 errors, 3 warnings)

Check name Status Explanation Resolution
Stable And Deterministic Test Names ❌ Error All 6 g.It titles use long author/ticket-prefixed strings (e.g. 41030), making them overly-specific rather than concise stable names. Rename them to short static descriptions like should deploy ipfailover; keep issue/author metadata خارج test titles.
Container-Privileges ❌ Error The new Deployment manifest sets privileged: true and hostNetwork: true, which this check explicitly flags. Remove or justify the privileged/hostNetwork settings, or isolate them to the minimal test-only workload with clear documentation.
No-Sensitive-Data-In-Logs ❌ Error FAIL: util.go logs worker node names, pod IPs, and image pull specs, plus pod logs/describe output, which can expose internal hostnames and cluster data. Redact or remove these debug logs, and avoid dumping full pod logs/describe output or registry/pod IP details unless sanitized.
Docstring Coverage ⚠️ Warning Docstring coverage is 4.17% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Microshift Test Compatibility ⚠️ Warning New tests call config.openshift.io and network.operator APIs and require 2 worker nodes/failover behavior, with no MicroShift skip or apigroup tag. Add a MicroShift guard or [Skipped:MicroShift]/[apigroup:...] tag, or move these HA/API-dependent tests out of MicroShift presubmit coverage.
Ipv6 And Disconnected Network Test Compatibility ⚠️ Warning New Ginkgo tests hardcode IPv4 ranges (192.168.254.1-255/254) and derive VIPs by splitting pod IPs on '.', so they’re not IPv6-safe. Use IP-family detection and family-correct CIDRs/addresses instead of hardcoded IPv4 values; avoid dot-splitting IP logic and add IPv6 CI coverage.
✅ Passed checks (9 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main change: moving ipfailover test cases into the images repo.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Test Structure And Quality ✅ Passed The new Ginkgo tests are scenario-focused, use exutil.NewCLI’s built-in project cleanup, and all cluster waits/polls have bounded timeouts.
Single Node Openshift (Sno) Test Compatibility ✅ Passed All new Ginkgo tests are guarded by a BeforeEach that skips when ready worker nodes are <2, so they won’t run on SNO; no unguarded SNO assumptions remain.
Topology-Aware Scheduling Compatibility ✅ Passed The new Deployment targets worker nodes, but the tests gate execution on at least 2 ready worker nodes before creating it, so it avoids SNO/TNF/TNA/HyperShift incompatibility.
Ote Binary Stdout Contract ✅ Passed No main/init stdout prints were added; logging is routed to GinkgoWriter, and framework.Logf writes to GinkgoWriter too.
No-Weak-Crypto ✅ Passed No MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, custom crypto, or secret-comparison code appears in the touched non-vendor files.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 14

🧹 Nitpick comments (1)
ipfailover/keepalived/Dockerfile (1)

5-5: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win

Copy only the inputs this builder stage actually needs.

COPY . . pulls the whole repository into the build context even though this stage only builds ipfailover/keepalived/tests. That widens the accidental secret/file exposure surface and invalidates the cache on unrelated changes.

♻️ Proposed fix
 WORKDIR /go/src/github.com/openshift/ipfailover
-COPY . .
+COPY ipfailover/keepalived/tests ./ipfailover/keepalived/tests
 RUN cd ipfailover/keepalived/tests && \
     make build && \

As per path instructions, COPY specific files, not entire context.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/Dockerfile` at line 5, The builder stage in the
keepalived Dockerfile is over-copying the entire repository with COPY . ., even
though it only needs the inputs required to build ipfailover/keepalived/tests.
Replace the broad copy with explicit copies of the minimal needed files and
directories so the stage stays narrowly scoped, avoids exposing unrelated files,
and preserves cache reuse; use the Dockerfile stage and the tests build target
as the main references when locating the change.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ipfailover/keepalived/tests/cmd/main.go`:
- Around line 25-26: Move the framework.AfterReadingAllFlags call so it runs
after root.Execute() has parsed flags, because invoking it before parsing leaves
framework.TestContext at default values and ignores kube/e2e overrides. Update
the startup flow in main so util.InitStandardFlags still happens first, then
flags are parsed via root.Execute, and only then
framework.AfterReadingAllFlags(&framework.TestContext) is called.

In `@ipfailover/keepalived/tests/e2e/ipfailover.go`:
- Around line 168-171: After deleting the current primary, do not immediately
recompute MASTER/BACKUP from the pod list; in the ipfailover e2e flow around
getPodListByLabel, ensureIpfailoverMasterBackup, waitForPrimaryPod, and
futurePrimaryPod[0], add a replica-aware wait until the replacement pod set is
fully available and stable before reading pod names again. Apply the same fix in
both affected flows so the recalculation happens only after the rollout
completes and the pod list reflects the final state.
- Around line 75-84: The IPv6 single-stack check in ipfailover e2e is too narrow
because it only inspects the first clusterNetwork CIDR and can misclassify
IPv6-first dual-stack clusters. Update the logic in the test around the current
oc-based network lookup to reuse checkIPStackType(oc) or otherwise evaluate the
full status.clusterNetwork CIDR list before calling g.Skip, so only true IPv6
single-stack clusters are skipped.
- Around line 69-73: The remote-worker skip check in the ipfailover e2e test is
using the wrong node selector, so it can miss remote-worker clusters. Update the
logic in the test block around the remote-worker gate to query
`node.openshift.io/remote-worker` directly, matching the approach used by
`exactNodeDetails()`, and then keep the skip decision based on that targeted
result instead of searching generic `nodes -l kubernetes.io/hostname` output.

In `@ipfailover/keepalived/tests/e2e/testdata/router/ipfailover.yaml`:
- Around line 28-75: The openshift-ipfailover container definition in the
ipfailover manifest is missing an explicit resource envelope. Add cpu and memory
requests and limits to the openshift-ipfailover container alongside its existing
ports, securityContext, and env settings so it conforms to the resource policy
for every container.
- Around line 14-27: The ipfailover test template hard-codes a 2-replica
deployment with a worker-only node selector, which assumes a standard
multi-worker topology. Update the ipfailover.yaml template to derive replicas
from the number of schedulable nodes and make the e2e test setup topology-aware
before rendering it, so it can handle SingleReplica, DualReplica, and
arbiter-style clusters. Use the existing ipfailover template fields and related
test setup logic to cap replica count to available schedulable nodes and avoid
assuming node-role.kubernetes.io/worker is always present.
- Around line 23-44: The ipfailover test pod spec is effectively fully
host-privileged and needs to be tightened in the openshift-ipfailover workload
definition. Update the container and pod security settings to follow the
security guidance by adding non-root and no-escalation constraints, making the
root filesystem read-only, dropping all capabilities unless explicitly required,
and disabling service account token automount unless the test truly needs it.
Keep host access only if required for the test, but scope it to the minimum
necessary privileges and document any exception in the ipfailover template
rather than leaving the default manifest overly permissive.
- Around line 69-75: The keepalived test pod spec currently defines only
livenessProbe, which lets the pod report Ready too early and makes the e2e flow
race-prone. Add a readinessProbe alongside the existing livenessProbe in the
ipfailover manifest so readiness only flips after keepalived is actually
prepared to serve, keeping ensurePodWithLabelReady in sync with the downstream
assertions in ipfailover.go.
- Around line 22-25: Remove the invalid top-level privileged field from the pod
spec in the ipfailover test manifest. The fix is to delete
spec.template.spec.privileged from the ipfailover.yaml fixture and keep the
existing containers[].securityContext.privileged setting as the only privileged
declaration. Use the ipfailover pod template structure under spec.template.spec
to locate the change.

In `@ipfailover/keepalived/tests/e2e/util.go`:
- Around line 319-331: The helper waitForPodWithLabelReady currently only checks
Ready status, so it can succeed with a single surviving pod during rollout.
Update waitForPodWithLabelReady to also verify the number of pods returned by
the label selector in addition to readiness, and keep polling until both the
expected replica count and Ready condition are met. Use the existing wait.Poll
logic and the kubectl get call in waitForPodWithLabelReady to enforce the count
before returning success.
- Around line 29-31: The worker-counting logic in the e2e helper is parsing `oc
get nodes` table text with `strings.Count`, which double-counts `Ready` inside
`NotReady` and can underreport healthy nodes. Update the node readiness check in
the helper that builds `linuxWorkerDetails` and `nodeCount` to use structured
node data from the API or JSON output, inspecting node conditions and
unschedulable state instead of matching substrings in human-readable output.
- Around line 240-249: The unicast peer list in the Nutanix/none branch is built
from the first node address, but .status.addresses[0] is not guaranteed to be an
InternalIP. Update the logic in the keepalived e2e util path that calls
oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", ...) to filter node
addresses by type == "InternalIP" before assembling OPENSHIFT_HA_UNICAST_PEERS,
and keep the existing validation that enough worker IPs were found.
- Around line 160-176: The getVipOwnerPod helper is treating the literal
“command terminated with exit code 1” as a VIP miss, which can mask real oc exec
failures and return the wrong master or an empty result. Update the logic around
oc.AsAdmin().WithoutNamespace().Run("exec").Args(...).Output() to inspect err
first, only consider exit code 1 from the command as “VIP not present,” and
explicitly retry or fail for any other error conditions; keep the selection
logic in getVipOwnerPod and its primaryNode assignment tied to successful
command results.

In `@ipfailover/keepalived/tests/go.mod`:
- Line 3: The module version in ipfailover/keepalived/tests/go.mod is newer than
the Go toolchain used by the ipfailover/keepalived/Dockerfile builder stage, so
the build may auto-download a different compiler. Update the builder image in
Dockerfile to a Go 1.25-based image, or reduce the go directive in go.mod to
match the supported compiler version. Make sure the versions stay aligned across
go.mod and the Dockerfile so the tests build with the image’s local toolchain.

---

Nitpick comments:
In `@ipfailover/keepalived/Dockerfile`:
- Line 5: The builder stage in the keepalived Dockerfile is over-copying the
entire repository with COPY . ., even though it only needs the inputs required
to build ipfailover/keepalived/tests. Replace the broad copy with explicit
copies of the minimal needed files and directories so the stage stays narrowly
scoped, avoids exposing unrelated files, and preserves cache reuse; use the
Dockerfile stage and the tests build target as the main references when locating
the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

Comment thread ipfailover/keepalived/tests/cmd/main.go Outdated
Comment thread ipfailover/keepalived/tests/e2e/ipfailover.go
Comment thread ipfailover/keepalived/tests/e2e/ipfailover.go
Comment on lines +168 to +171
g.By("7. Verify the virtual IP is floated onto the new MASTER node")
newPodName1 := getPodListByLabel(oc, ns, "ipfailover=hello-openshift")
newMasterNode, _ := ensureIpfailoverMasterBackup(oc, ns, newPodName1)
waitForPrimaryPod(oc, ns, newMasterNode, virtualIP)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Wait for the replacement pod set before recalculating MASTER/BACKUP.

Both flows delete the current primary and then immediately read pod names again. There is a normal window where only the surviving replica exists, so ensureIpfailoverMasterBackup() / futurePrimaryPod[0] can race the rollout and fail nondeterministically. Add a replica-aware wait between the delete and the next pod-list read.

Also applies to: 346-353

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/tests/e2e/ipfailover.go` around lines 168 - 171, After
deleting the current primary, do not immediately recompute MASTER/BACKUP from
the pod list; in the ipfailover e2e flow around getPodListByLabel,
ensureIpfailoverMasterBackup, waitForPrimaryPod, and futurePrimaryPod[0], add a
replica-aware wait until the replacement pod set is fully available and stable
before reading pod names again. Apply the same fix in both affected flows so the
recalculation happens only after the rollout completes and the pod list reflects
the final state.

Comment on lines +14 to +27
replicas: 2
selector:
matchLabels:
ipfailover: hello-openshift
template:
metadata:
labels:
ipfailover: hello-openshift
spec:
serviceAccountName: ipfailover
privileged: true
hostNetwork: true
nodeSelector:
node-role.kubernetes.io/worker: ""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Avoid hard-coding a 2-replica, worker-only topology.

replicas: 2 plus node-role.kubernetes.io/worker assumes at least two schedulable workers exist. This template will stall on SingleReplica/DualReplica/arbiter-style topologies instead of adapting or being explicitly gated. Cap replicas to the schedulable node count and make the surrounding test logic topology-aware before rendering this template. As per coding guidelines, "check whether they introduce scheduling constraints assuming standard HA topology with 3+ control-plane nodes and dedicated workers" and "Cap replica counts to schedulable nodes."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/tests/e2e/testdata/router/ipfailover.yaml` around lines
14 - 27, The ipfailover test template hard-codes a 2-replica deployment with a
worker-only node selector, which assumes a standard multi-worker topology.
Update the ipfailover.yaml template to derive replicas from the number of
schedulable nodes and make the e2e test setup topology-aware before rendering
it, so it can handle SingleReplica, DualReplica, and arbiter-style clusters. Use
the existing ipfailover template fields and related test setup logic to cap
replica count to available schedulable nodes and avoid assuming
node-role.kubernetes.io/worker is always present.

Source: Coding guidelines

Comment thread ipfailover/keepalived/tests/e2e/util.go Outdated
Comment thread ipfailover/keepalived/tests/e2e/util.go
Comment on lines +240 to +249
if platformtype == "nutanix" || platformtype == "none" {
getPodListByLabel(oc, oc.Namespace(), "ipfailover=hello-openshift")
workerIPAddress, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", "--selector=node-role.kubernetes.io/worker=", "-ojsonpath={.items[*].status.addresses[0].address}").Output()
o.Expect(err).NotTo(o.HaveOccurred())
modifiedIPList := strings.Split(workerIPAddress, " ")
if len(modifiedIPList) < 2 {
e2e.Failf("There is not enough IP addresses to add as unicast peer")
}
ipList := strings.Join(modifiedIPList, ",")
cmd := fmt.Sprintf("OPENSHIFT_HA_UNICAST_PEERS=%v", ipList)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don’t assume .status.addresses[0] is the worker IP.

Kubernetes does not guarantee address order. On many clusters index 0 is Hostname or ExternalIP, so OPENSHIFT_HA_UNICAST_PEERS can be populated with unusable values and the Nutanix/none path never converges. Filter node addresses by type=="InternalIP" before building the peer list.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/tests/e2e/util.go` around lines 240 - 249, The unicast
peer list in the Nutanix/none branch is built from the first node address, but
.status.addresses[0] is not guaranteed to be an InternalIP. Update the logic in
the keepalived e2e util path that calls
oc.AsAdmin().WithoutNamespace().Run("get").Args("nodes", ...) to filter node
addresses by type == "InternalIP" before assembling OPENSHIFT_HA_UNICAST_PEERS,
and keep the existing validation that enough worker IPs were found.

Comment on lines +319 to +331
func waitForPodWithLabelReady(oc *exutil.CLI, ns, label string) error {
return wait.Poll(5*time.Second, 3*time.Minute, func() (bool, error) {
status, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("pod", "-n", ns, "-l", label, `-ojsonpath={.items[*].status.conditions[?(@.type=="Ready")].status}`).Output()
e2e.Logf("the Ready status of pod is %v", status)
if err != nil || status == "" {
e2e.Logf("failed to get pod status: %v, retrying...", err)
return false, nil
}
if strings.Contains(status, "False") {
e2e.Logf("the pod Ready status not met; wanted True but got %v, retrying...", status)
return false, nil
}
return true, nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

This readiness gate can pass before the replacement replica exists.

It only checks the Ready condition of the pods that already exist; it never verifies how many labeled pods are present. During a rollout, one surviving Ready pod is enough for this to return success, and callers then immediately assume a two-pod MASTER/BACKUP set. Make the helper validate the expected replica count too.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/tests/e2e/util.go` around lines 319 - 331, The helper
waitForPodWithLabelReady currently only checks Ready status, so it can succeed
with a single surviving pod during rollout. Update waitForPodWithLabelReady to
also verify the number of pods returned by the label selector in addition to
readiness, and keep polling until both the expected replica count and Ready
condition are met. Use the existing wait.Poll logic and the kubectl get call in
waitForPodWithLabelReady to enforce the count before returning success.

Comment thread ipfailover/keepalived/tests/go.mod
@melvinjoseph86

Copy link
Copy Markdown
Author

/payload-with-prs openshift/release#81231

@melvinjoseph86

Copy link
Copy Markdown
Author

/test-with-release openshift/release#81231

@melvinjoseph86

Copy link
Copy Markdown
Author

/payload-job-with-prs openshift/release#81231 e2e-openstack-ipi-tests-ext

@openshift-ci

openshift-ci Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

@melvinjoseph86: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@melvinjoseph86 melvinjoseph86 force-pushed the ipfailover_migration_e2e branch from 791cb2c to 4d59ec2 Compare July 3, 2026 10:50
@melvinjoseph86 melvinjoseph86 marked this pull request as ready for review July 3, 2026 10:50
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 3, 2026
@melvinjoseph86

Copy link
Copy Markdown
Author

/payload-with-prs openshift/release#81231

@openshift-ci openshift-ci Bot requested review from knobunc and rikatz July 3, 2026 10:51
@melvinjoseph86

Copy link
Copy Markdown
Author

/test-with-release openshift/release#81231

@melvinjoseph86

Copy link
Copy Markdown
Author

/payload-job-with-prs openshift/release#81231 e2e-openstack-ipi-tests-ext

@openshift-ci

openshift-ci Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

@melvinjoseph86: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
ipfailover/keepalived/tests/README.md (1)

49-49: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Specify a language for the fenced code block.

📝 Proposed fix
-```
+```text
 tests/
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/tests/README.md` at line 49, The fenced code block in
the README is missing a language tag, so update the example in the tests/
section to use an explicit fence language. Make the change in the README
markdown where the code block appears, and use the fenced block syntax
consistently so the snippet is clearly marked as text.

Source: Linters/SAST tools

ipfailover/keepalived/Dockerfile (2)

5-5: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Avoid copying the entire build context.

COPY . . pulls in the whole openshift/images repository context into the builder stage, which is unnecessary for building the keepalived test extension and increases build context size/cache invalidation. Scope the copy to only the required subtree.

As per path instructions, "COPY specific files, not entire context."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/Dockerfile` at line 5, The builder stage is copying the
entire repository context with COPY . ., which bloats the image build and
invalidates cache unnecessarily. Update the keepalived Dockerfile to copy only
the minimal required subtree or specific build inputs needed for the keepalived
test extension, replacing the broad copy in the build step.

Source: Path instructions


32-34: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Placement in /usr/bin/ looks unconventional for a .tar.gz.

Copying an archive (not an executable) into /usr/bin/ is atypical; consider extracting to a dedicated directory (e.g. /usr/local/ipfailover-tests/) if the archive is meant to be unpacked by consumers, to avoid confusion with executables in PATH.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/Dockerfile` around lines 32 - 34, The Dockerfile copy
step for the test extension archive is placing a .tar.gz into /usr/bin/, which
is misleading for a non-executable artifact. Update the COPY in the keepalived
Dockerfile so the archive is stored in a dedicated non-PATH location such as a
tests or artifacts directory, and keep the placement aligned with how the
ipfailover-test-extension.tar.gz is intended to be consumed or unpacked.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ipfailover/keepalived/Dockerfile`:
- Around line 6-11: The Dockerfile build step packages ipfailover-tests-ext into
ipfailover-test-extension.tar.gz, but the image never unpacks it and the tests
expect the binary itself. Update the keepalived Dockerfile build flow so the
artifact copied into /usr/bin/ is the executable ipfailover-tests-ext, or add an
explicit extraction step after the tarball is created; use the existing build
sequence around make build and the ipfailover-tests-ext output to keep the final
image runnable.

---

Nitpick comments:
In `@ipfailover/keepalived/Dockerfile`:
- Line 5: The builder stage is copying the entire repository context with COPY .
., which bloats the image build and invalidates cache unnecessarily. Update the
keepalived Dockerfile to copy only the minimal required subtree or specific
build inputs needed for the keepalived test extension, replacing the broad copy
in the build step.
- Around line 32-34: The Dockerfile copy step for the test extension archive is
placing a .tar.gz into /usr/bin/, which is misleading for a non-executable
artifact. Update the COPY in the keepalived Dockerfile so the archive is stored
in a dedicated non-PATH location such as a tests or artifacts directory, and
keep the placement aligned with how the ipfailover-test-extension.tar.gz is
intended to be consumed or unpacked.

In `@ipfailover/keepalived/tests/README.md`:
- Line 49: The fenced code block in the README is missing a language tag, so
update the example in the tests/ section to use an explicit fence language. Make
the change in the README markdown where the code block appears, and use the
fenced block syntax consistently so the snippet is clearly marked as text.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

Comment on lines +6 to +11
RUN cd ipfailover/keepalived/tests && \
make build && \
cd bin && \
tar -czvf ipfailover-test-extension.tar.gz ipfailover-tests-ext && \
rm -f ipfailover-tests-ext

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Look for anything that extracts ipfailover-test-extension.tar.gz (e.g. CI scripts, entrypoint, Makefile targets)
rg -n "ipfailover-test-extension" --hidden -g '!*.git*'

Repository: openshift/images

Length of output: 154


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the Dockerfile and related files for extraction/use of the tarball
git ls-files 'ipfailover/keepalived/*' 'ipfailover/**' | sed -n '1,200p'
echo
echo '--- Dockerfile ---'
cat -n ipfailover/keepalived/Dockerfile | sed -n '1,120p'
echo
echo '--- Search for tar extraction or the archive name ---'
rg -n "ipfailover-test-extension|tar -x|tarball|\.tar\.gz|ipfailover-tests-ext" ipfailover --hidden -g '!*.git*'

Repository: openshift/images

Length of output: 34534


Unpack the test extension before shipping it. The image copies ipfailover-test-extension.tar.gz into /usr/bin/, but nothing in this repo extracts it, and the tests reference ipfailover-tests-ext directly. Copy the binary itself or add an unpack step so it’s executable from the expected path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ipfailover/keepalived/Dockerfile` around lines 6 - 11, The Dockerfile build
step packages ipfailover-tests-ext into ipfailover-test-extension.tar.gz, but
the image never unpacks it and the tests expect the binary itself. Update the
keepalived Dockerfile build flow so the artifact copied into /usr/bin/ is the
executable ipfailover-tests-ext, or add an explicit extraction step after the
tarball is created; use the existing build sequence around make build and the
ipfailover-tests-ext output to keep the final image runnable.

melvinjoseph86 and others added 2 commits July 3, 2026 23:39
Creates tests-extension/ directory at repository root to match
OpenShift CI job expectations from openshift/release#81231.

The CI jobs run: cd tests-extension && make build && ./bin/ipfailover-tests-ext
This structure provides that interface while delegating to the actual
test implementation in ipfailover/keepalived/tests/.

Files created:
- tests-extension/Makefile: Delegates build to ipfailover/keepalived/tests/
- tests-extension/README.md: Documents structure and CI integration
- tests-extension/.gitignore: Excludes bin/ directory from git

The bin/ directory is created by make build (254MB binary) and should
not be committed. CI will build it fresh using the Makefile.

This fixes all 5 CI job failures (e2e-aws, images, e2e-metal-ipi-ovn-ipv6,
okd-scos-images, e2e-aws-upgrade) which were failing because the directory
structure didn't exist.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Change from rhel-9-golang-1.25-openshift-4.19 (not available) to
rhel-9-golang-1.23-openshift-4.18 (standard CI builder image).

The 1.25 image doesn't exist in the CI registry, causing build failures:
  dockerimage.image.openshift.io "quay.io/openshift/ci:ocp_builder_rhel-9-golang-1.25-openshift-4.19" not found

This fixes the CI build stage failure in e2e-aws and other jobs.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@melvinjoseph86

Copy link
Copy Markdown
Author

/payload-with-prs openshift/release#81231

@openshift-ci

openshift-ci Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

@melvinjoseph86: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@melvinjoseph86

Copy link
Copy Markdown
Author

/test-with-release openshift/release#81231

@melvinjoseph86

Copy link
Copy Markdown
Author

@coderabbitai can you check the required changes are made

@melvinjoseph86

Copy link
Copy Markdown
Author

/payload-job-with-prs openshift/release#81231 e2e-openstack-ipi-tests-ext

@openshift-ci

openshift-ci Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

@melvinjoseph86: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
fd ipfailover.go -p ipfailover/keepalived/tests
echo "---init---"
rg -n "func init" -A 20 ipfailover/keepalived/tests/e2e/ipfailover.go
echo "---output-interceptor---"
rg -n "output-interceptor|GinkgoWriter|klog.SetOutput|SetDefaultWarningHandler|os.Stderr" ipfailover/keepalived/tests/e2e/ipfailover.go

Length of output: 1672


@melvinjoseph86 I checked the current state of ipfailover/keepalived/tests/e2e/ipfailover.go, and the recommended fix from the investigation has not been applied yet. The init() function (Lines 19-34) still contains the original code:

func init() {
	flag.Set("ginkgo.output-interceptor-mode", "none")

	klog.SetOutput(g.GinkgoWriter)

	rest.SetDefaultWarningHandler(
		rest.NewWarningWriter(g.GinkgoWriter, rest.WarningWriterOptions{}),
	)

	exutil.WithCleanup(func() {})
}

This still uses g.GinkgoWriter, which is captured at init() time — before Ginkgo replaces it with the final stderr-backed writer. As explained in the earlier investigation, this timing issue causes klog INFO logs to leak to stdout and contaminate the JSON output.

To apply the suggested fix, update init() to write directly to os.Stderr and drop the now-ineffective flag.Set call:

func init() {
	// Write directly to stderr — no timing dependency on GinkgoWriter
	klog.SetOutput(os.Stderr)

	// Redirect REST client warnings directly to stderr
	rest.SetDefaultWarningHandler(
		rest.NewWarningWriter(os.Stderr, rest.WarningWriterOptions{}),
	)

	exutil.WithCleanup(func() {})
}

Note you'll also need to ensure os is imported and remove the now-unused flag import if it's not used elsewhere in the file. Let me know if you'd like me to plan out this code change.

@openshift-ci

openshift-ci Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

@melvinjoseph86: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws c6a1662 link true /test e2e-aws
ci/prow/images c6a1662 link true /test images
ci/prow/e2e-metal-ipi-ovn-ipv6 c6a1662 link true /test e2e-metal-ipi-ovn-ipv6
ci/prow/okd-scos-images c6a1662 link true /test okd-scos-images
ci/prow/e2e-aws-upgrade c6a1662 link true /test e2e-aws-upgrade

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants