Reach stellar-core HTTP via pod-exec when --core-http-via-pod-exec is set by tomerweller · Pull Request #400 · stellar/supercluster

tomerweller · 2026-06-18T17:33:57Z

Problem

Closes #399.

SSC reaches each stellar-core's admin HTTP endpoint through the ingress, whose hostname falls back to <nonce>.<ingress-internal-domain> (default <nonce>.local). On a runner that sits outside the cluster (e.g. a Namespace/nsc k3s runner — a local Docker container with a kubeconfig), that hostname is unreachable, and so is the per-pod *.svc.cluster.local name (cluster-internal DNS only). The result: missions exit non-zero even when core reached consensus — first observed as a teardown metrics-dump crash (#399), but it's the same DNS gap behind WaitUntilReady → GetInfo and every other HTTP call.

Fix

Add an opt-in flag --core-http-via-pod-exec (default false). All stellar-core admin-HTTP GETs now funnel through a single Peer.httpGet path query that selects the transport:

default — ingress (http://<ingress-host>/<pod>/core/<path>), unchanged. Preserves SDF behavior and keeps the in-mission polling hot path (WaitFor* poll every 1s/peer) off the k8s API.
--core-http-via-pod-exec — exec curl http://localhost:11626/<path>?<query> inside the pod via the Kubernetes API (the same mechanism DumpPeerDatabase already uses for sqlite3). Needs no DNS and works from any runner that can reach the API server.

This covers everything — GetInfo/GetState/GetMetrics/GetSorobanInfo, all WaitUntil*/WaitFor* loops, SetUpgrades, GenerateLoad, surveys, tx submit, and the teardown metrics dump — so nothing falls back to the ingress when the flag is on.

Details:

Pod-exec failures are surfaced as WebException, so the existing WebExceptionRetry wrappers retry transient errors (e.g. core still booting) exactly as they do for ingress.
JSON reads (info/metrics/sorobaninfo) go through fetchNonEmpty: a booting core can answer with an empty body (curl exits 0), which is treated as "not ready yet" and retried instead of crashing a JSON parser. Endpoints that legitimately return an empty body on success (e.g. upgrades) are unaffected.
DumpPeerMetrics stays best-effort (a failed scrape logs a warning, never fails the mission) and just calls the transport-aware GetRawMetrics.

Verification in CI

The complete job now runs BootAndSync twice:

the existing ingress run (--ingress-external-host localhost) — proves the default path still works;
a --core-http-via-pod-exec run with no resolvable ingress host — the ingress host falls back to <nonce>.local, so any HTTP that leaked back to the ingress would fail with a name-resolution error. A green run proves all core HTTP goes through pod exec. This mirrors an out-of-cluster nsc runner.

Both runs pass. Build, unit tests, and fantomas were also validated locally.

Usage

From an out-of-cluster runner (e.g. nsc/Namespace), add:

--core-http-via-pod-exec

Requires curl in the core image (present in the stock stellar-core image; already used by init containers).

🤖 Generated with Claude Code

The teardown metrics dump (DumpPeerMetrics -> GetRawMetrics -> Peer.fetch) scraped stellar-core's admin HTTP endpoint through the ingress hostname (<nonce>.<ingress-internal-domain>, defaulting to <nonce>.local). That name does not resolve in all cluster-DNS environments (e.g. non-SDF k3s clusters), so missions exited non-zero during teardown even when the consensus/loadgen logic succeeded. Scrape the raw metrics directly from the per-pod svc.cluster.local DNS name on the core HTTP port (11626) instead, bypassing the ingress. These per-pod service DNS names already work and are used elsewhere in SSC config generation. Fixes #399 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The initial hard switch to the per-pod svc.cluster.local DNS name regressed out-of-cluster runs (including CI's BootAndSync mission): SSC runs outside the cluster via kubeconfig, so cluster-internal DNS does not resolve from there and the teardown scrape failed with repeated name-resolution errors. Gate the behavior behind a new --metrics-via-cluster-dns flag (default false). By default the post-mission metrics scrape continues to use the ingress hostname as before; environments where the ingress hostname does not resolve but cluster DNS does (e.g. non-SDF k3s clusters) can opt in. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

fantomas --check flagged the multi-line concatenated HelpText. Collapse it to a single-line string like the other Option HelpText attributes in this file. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

Pull request overview

This PR adds an opt-in path for the post-mission raw metrics scrape to bypass the ingress hostname (which may be unresolvable in some environments, e.g. .local in certain k3s setups) and instead scrape metrics directly via the peer’s cluster DNS name on Stellar Core’s admin HTTP port.

Changes:

Extend MissionContext with a metricsViaClusterDns boolean to control metrics scraping behavior.
Add --metrics-via-cluster-dns CLI option (default false) and plumb it into mission context construction.
Update Peer.GetRawMetrics() to conditionally scrape either via ingress (default) or via cluster DNS (opt-in).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File	Description
`src/FSLibrary/StellarMissionContext.fs`	Adds `metricsViaClusterDns` to mission context to carry the opt-in behavior through the system.
`src/FSLibrary/StellarCoreHTTP.fs`	Implements cluster-DNS URL + fetch path and switches `GetRawMetrics()` based on `metricsViaClusterDns`.
`src/FSLibrary.Tests/Tests.fs`	Updates test `MissionContext` construction to include the new field.
`src/App/Program.fs`	Adds the `--metrics-via-cluster-dns` flag and wires it into the mission context used at runtime.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Replace the URL/ingress-based metrics scrape with a `kubectl exec`-style call that runs `curl http://localhost:11626/metrics` inside each stellar-core pod, mirroring how DumpPeerDatabase already exec's sqlite3. This works in every environment SSC runs in, including a runner outside a non-SDF k3s cluster (e.g. Namespace/nsc): pod exec only needs the Kubernetes API that SSC already uses to manage the cluster, so it does not depend on the ingress `.local` hostname or on cluster-internal DNS (`svc.cluster.local`) being resolvable from the runner -- neither of which resolves from an out-of-cluster runner. The scrape is also now best-effort: a failure logs a warning instead of throwing, so a transient or environment-specific scrape problem no longer makes an otherwise-successful mission exit non-zero (the actual impact in #399, which blocks automated pass/fail gating). This supersedes the earlier --metrics-via-cluster-dns approach, which only helped a runner that could already resolve cluster DNS (i.e. in-cluster), so those changes are reverted. Fixes #399 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…-exec set The teardown metrics dump was the only thing using pod-exec; in-mission health checks (WaitUntilReady -> GetInfo, WaitUntilSynced, WaitFor*) and actions (SetUpgrades, GenerateLoad, surveys, tx submit) still went through the ingress .local hostname. On an out-of-cluster runner (e.g. Namespace/nsc k3s) that hostname is unreachable, so SSC could not verify readiness even though core reached consensus -- the same DNS problem surfacing earlier than teardown. Funnel every core admin-HTTP GET through a single `Peer.httpGet path query` that selects the transport: - default: ingress (unchanged), preserving SDF behavior and the fast in-mission polling hot path (exec-per-poll would hammer the k8s API); - --core-http-via-pod-exec: exec `curl http://localhost:11626/<path>?<query>` inside the pod via the Kubernetes API, which needs no DNS and works from any runner that can reach the API server. Exec failures are surfaced as WebException so the existing WebExceptionRetry wrappers keep retrying transient errors (e.g. core still booting) uniformly. DumpPeerMetrics now just calls the transport-aware GetRawMetrics (still best-effort). All ~15 fetch/RequestString call sites route through httpGet, so nothing falls back to the ingress when the flag is on. CI: add a second BootAndSync run with --core-http-via-pod-exec and no resolvable --ingress-external-host, so any access that leaks to the ingress fails with name resolution -- proving full pod-exec coverage (mirrors an out-of-cluster nsc runner). Fixes #399 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

During WaitUntilReady, curl against a not-yet-ready core can return an empty body with exit 0, which previously reached Info.Parse and threw a non-retryable JSON error, crashing the mission. Surface an empty response as a WebException so the existing WebExceptionRetry retries until core is ready, matching how the ingress path retries on the 5xx it gets while core boots. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The previous empty-response guard lived in the transport and broke /upgrades, which legitimately returns an empty body on success (SetUpgrades only checks the body for "exception"). Move the empty-is-not-ready check into a fetchNonEmpty helper used by the JSON-parsing reads (info, metrics, sorobaninfo) only; action endpoints keep using fetch/httpGet and tolerate an empty body. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

tomerweller and others added 3 commits June 18, 2026 13:33

Format metrics-via-cluster-dns option to match house style

3324dbc

fantomas --check flagged the multi-line concatenated HelpText. Collapse it to a single-line string like the other Option HelpText attributes in this file. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

tomerweller requested a review from Copilot June 18, 2026 17:58

Copilot started reviewing on behalf of tomerweller June 18, 2026 17:58 View session

Copilot AI reviewed Jun 18, 2026

View reviewed changes

tomerweller changed the title ~~Scrape post-mission metrics via per-pod svc.cluster.local DNS~~ Scrape post-mission metrics via pod exec + curl (best-effort) Jun 18, 2026

tomerweller and others added 4 commits June 18, 2026 14:58

Apply fantomas formatting

b04dc2c

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

tomerweller changed the title ~~Scrape post-mission metrics via pod exec + curl (best-effort)~~ Reach stellar-core HTTP via pod-exec when --core-http-via-pod-exec is set Jun 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Reach stellar-core HTTP via pod-exec when --core-http-via-pod-exec is set#400

Reach stellar-core HTTP via pod-exec when --core-http-via-pod-exec is set#400
tomerweller wants to merge 8 commits into
mainfrom
fix-399-metrics-scrape-cluster-dns

tomerweller commented Jun 18, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

tomerweller commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Verification in CI

Usage

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tomerweller commented Jun 18, 2026 •

edited

Loading