Skip to content

Reach stellar-core HTTP via pod-exec when --core-http-via-pod-exec is set#400

Draft
tomerweller wants to merge 8 commits into
mainfrom
fix-399-metrics-scrape-cluster-dns
Draft

Reach stellar-core HTTP via pod-exec when --core-http-via-pod-exec is set#400
tomerweller wants to merge 8 commits into
mainfrom
fix-399-metrics-scrape-cluster-dns

Conversation

@tomerweller

@tomerweller tomerweller commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Problem

Closes #399.

SSC reaches each stellar-core's admin HTTP endpoint through the ingress, whose hostname falls back to <nonce>.<ingress-internal-domain> (default <nonce>.local). On a runner that sits outside the cluster (e.g. a Namespace/nsc k3s runner — a local Docker container with a kubeconfig), that hostname is unreachable, and so is the per-pod *.svc.cluster.local name (cluster-internal DNS only). The result: missions exit non-zero even when core reached consensus — first observed as a teardown metrics-dump crash (#399), but it's the same DNS gap behind WaitUntilReady → GetInfo and every other HTTP call.

Fix

Add an opt-in flag --core-http-via-pod-exec (default false). All stellar-core admin-HTTP GETs now funnel through a single Peer.httpGet path query that selects the transport:

  • default — ingress (http://<ingress-host>/<pod>/core/<path>), unchanged. Preserves SDF behavior and keeps the in-mission polling hot path (WaitFor* poll every 1s/peer) off the k8s API.
  • --core-http-via-pod-exec — exec curl http://localhost:11626/<path>?<query> inside the pod via the Kubernetes API (the same mechanism DumpPeerDatabase already uses for sqlite3). Needs no DNS and works from any runner that can reach the API server.

This covers everything — GetInfo/GetState/GetMetrics/GetSorobanInfo, all WaitUntil*/WaitFor* loops, SetUpgrades, GenerateLoad, surveys, tx submit, and the teardown metrics dump — so nothing falls back to the ingress when the flag is on.

Details:

  • Pod-exec failures are surfaced as WebException, so the existing WebExceptionRetry wrappers retry transient errors (e.g. core still booting) exactly as they do for ingress.
  • JSON reads (info/metrics/sorobaninfo) go through fetchNonEmpty: a booting core can answer with an empty body (curl exits 0), which is treated as "not ready yet" and retried instead of crashing a JSON parser. Endpoints that legitimately return an empty body on success (e.g. upgrades) are unaffected.
  • DumpPeerMetrics stays best-effort (a failed scrape logs a warning, never fails the mission) and just calls the transport-aware GetRawMetrics.

Verification in CI

The complete job now runs BootAndSync twice:

  1. the existing ingress run (--ingress-external-host localhost) — proves the default path still works;
  2. a --core-http-via-pod-exec run with no resolvable ingress host — the ingress host falls back to <nonce>.local, so any HTTP that leaked back to the ingress would fail with a name-resolution error. A green run proves all core HTTP goes through pod exec. This mirrors an out-of-cluster nsc runner.

Both runs pass. Build, unit tests, and fantomas were also validated locally.

Usage

From an out-of-cluster runner (e.g. nsc/Namespace), add:

--core-http-via-pod-exec

Requires curl in the core image (present in the stock stellar-core image; already used by init containers).

🤖 Generated with Claude Code

tomerweller and others added 3 commits June 18, 2026 13:33
The teardown metrics dump (DumpPeerMetrics -> GetRawMetrics -> Peer.fetch)
scraped stellar-core's admin HTTP endpoint through the ingress hostname
(<nonce>.<ingress-internal-domain>, defaulting to <nonce>.local). That name
does not resolve in all cluster-DNS environments (e.g. non-SDF k3s clusters),
so missions exited non-zero during teardown even when the consensus/loadgen
logic succeeded.

Scrape the raw metrics directly from the per-pod svc.cluster.local DNS name
on the core HTTP port (11626) instead, bypassing the ingress. These per-pod
service DNS names already work and are used elsewhere in SSC config
generation.

Fixes #399

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The initial hard switch to the per-pod svc.cluster.local DNS name regressed
out-of-cluster runs (including CI's BootAndSync mission): SSC runs outside the
cluster via kubeconfig, so cluster-internal DNS does not resolve from there and
the teardown scrape failed with repeated name-resolution errors.

Gate the behavior behind a new --metrics-via-cluster-dns flag (default false).
By default the post-mission metrics scrape continues to use the ingress
hostname as before; environments where the ingress hostname does not resolve
but cluster DNS does (e.g. non-SDF k3s clusters) can opt in.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fantomas --check flagged the multi-line concatenated HelpText. Collapse it to
a single-line string like the other Option HelpText attributes in this file.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an opt-in path for the post-mission raw metrics scrape to bypass the ingress hostname (which may be unresolvable in some environments, e.g. .local in certain k3s setups) and instead scrape metrics directly via the peer’s cluster DNS name on Stellar Core’s admin HTTP port.

Changes:

  • Extend MissionContext with a metricsViaClusterDns boolean to control metrics scraping behavior.
  • Add --metrics-via-cluster-dns CLI option (default false) and plumb it into mission context construction.
  • Update Peer.GetRawMetrics() to conditionally scrape either via ingress (default) or via cluster DNS (opt-in).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
src/FSLibrary/StellarMissionContext.fs Adds metricsViaClusterDns to mission context to carry the opt-in behavior through the system.
src/FSLibrary/StellarCoreHTTP.fs Implements cluster-DNS URL + fetch path and switches GetRawMetrics() based on metricsViaClusterDns.
src/FSLibrary.Tests/Tests.fs Updates test MissionContext construction to include the new field.
src/App/Program.fs Adds the --metrics-via-cluster-dns flag and wires it into the mission context used at runtime.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Replace the URL/ingress-based metrics scrape with a `kubectl exec`-style call
that runs `curl http://localhost:11626/metrics` inside each stellar-core pod,
mirroring how DumpPeerDatabase already exec's sqlite3.

This works in every environment SSC runs in, including a runner outside a
non-SDF k3s cluster (e.g. Namespace/nsc): pod exec only needs the Kubernetes
API that SSC already uses to manage the cluster, so it does not depend on the
ingress `.local` hostname or on cluster-internal DNS (`svc.cluster.local`)
being resolvable from the runner -- neither of which resolves from an
out-of-cluster runner.

The scrape is also now best-effort: a failure logs a warning instead of
throwing, so a transient or environment-specific scrape problem no longer
makes an otherwise-successful mission exit non-zero (the actual impact in #399,
which blocks automated pass/fail gating).

This supersedes the earlier --metrics-via-cluster-dns approach, which only
helped a runner that could already resolve cluster DNS (i.e. in-cluster), so
those changes are reverted.

Fixes #399

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@tomerweller tomerweller changed the title Scrape post-mission metrics via per-pod svc.cluster.local DNS Scrape post-mission metrics via pod exec + curl (best-effort) Jun 18, 2026
tomerweller and others added 4 commits June 18, 2026 14:58
…-exec set

The teardown metrics dump was the only thing using pod-exec; in-mission health
checks (WaitUntilReady -> GetInfo, WaitUntilSynced, WaitFor*) and actions
(SetUpgrades, GenerateLoad, surveys, tx submit) still went through the ingress
.local hostname. On an out-of-cluster runner (e.g. Namespace/nsc k3s) that
hostname is unreachable, so SSC could not verify readiness even though core
reached consensus -- the same DNS problem surfacing earlier than teardown.

Funnel every core admin-HTTP GET through a single `Peer.httpGet path query`
that selects the transport:
- default: ingress (unchanged), preserving SDF behavior and the fast in-mission
  polling hot path (exec-per-poll would hammer the k8s API);
- --core-http-via-pod-exec: exec `curl http://localhost:11626/<path>?<query>`
  inside the pod via the Kubernetes API, which needs no DNS and works from any
  runner that can reach the API server.

Exec failures are surfaced as WebException so the existing WebExceptionRetry
wrappers keep retrying transient errors (e.g. core still booting) uniformly.
DumpPeerMetrics now just calls the transport-aware GetRawMetrics (still
best-effort). All ~15 fetch/RequestString call sites route through httpGet, so
nothing falls back to the ingress when the flag is on.

CI: add a second BootAndSync run with --core-http-via-pod-exec and no
resolvable --ingress-external-host, so any access that leaks to the ingress
fails with name resolution -- proving full pod-exec coverage (mirrors an
out-of-cluster nsc runner).

Fixes #399

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
During WaitUntilReady, curl against a not-yet-ready core can return an empty
body with exit 0, which previously reached Info.Parse and threw a non-retryable
JSON error, crashing the mission. Surface an empty response as a WebException
so the existing WebExceptionRetry retries until core is ready, matching how the
ingress path retries on the 5xx it gets while core boots.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The previous empty-response guard lived in the transport and broke /upgrades,
which legitimately returns an empty body on success (SetUpgrades only checks
the body for "exception"). Move the empty-is-not-ready check into a
fetchNonEmpty helper used by the JSON-parsing reads (info, metrics,
sorobaninfo) only; action endpoints keep using fetch/httpGet and tolerate an
empty body.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@tomerweller tomerweller changed the title Scrape post-mission metrics via pod exec + curl (best-effort) Reach stellar-core HTTP via pod-exec when --core-http-via-pod-exec is set Jun 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Post-mission DumpData/GetRawMetrics crashes when cluster does not resolve .local TLD

2 participants