Reach stellar-core HTTP via pod-exec when --core-http-via-pod-exec is set#400
Draft
tomerweller wants to merge 8 commits into
Draft
Reach stellar-core HTTP via pod-exec when --core-http-via-pod-exec is set#400tomerweller wants to merge 8 commits into
tomerweller wants to merge 8 commits into
Conversation
The teardown metrics dump (DumpPeerMetrics -> GetRawMetrics -> Peer.fetch) scraped stellar-core's admin HTTP endpoint through the ingress hostname (<nonce>.<ingress-internal-domain>, defaulting to <nonce>.local). That name does not resolve in all cluster-DNS environments (e.g. non-SDF k3s clusters), so missions exited non-zero during teardown even when the consensus/loadgen logic succeeded. Scrape the raw metrics directly from the per-pod svc.cluster.local DNS name on the core HTTP port (11626) instead, bypassing the ingress. These per-pod service DNS names already work and are used elsewhere in SSC config generation. Fixes #399 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The initial hard switch to the per-pod svc.cluster.local DNS name regressed out-of-cluster runs (including CI's BootAndSync mission): SSC runs outside the cluster via kubeconfig, so cluster-internal DNS does not resolve from there and the teardown scrape failed with repeated name-resolution errors. Gate the behavior behind a new --metrics-via-cluster-dns flag (default false). By default the post-mission metrics scrape continues to use the ingress hostname as before; environments where the ingress hostname does not resolve but cluster DNS does (e.g. non-SDF k3s clusters) can opt in. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fantomas --check flagged the multi-line concatenated HelpText. Collapse it to a single-line string like the other Option HelpText attributes in this file. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR adds an opt-in path for the post-mission raw metrics scrape to bypass the ingress hostname (which may be unresolvable in some environments, e.g. .local in certain k3s setups) and instead scrape metrics directly via the peer’s cluster DNS name on Stellar Core’s admin HTTP port.
Changes:
- Extend
MissionContextwith ametricsViaClusterDnsboolean to control metrics scraping behavior. - Add
--metrics-via-cluster-dnsCLI option (defaultfalse) and plumb it into mission context construction. - Update
Peer.GetRawMetrics()to conditionally scrape either via ingress (default) or via cluster DNS (opt-in).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
src/FSLibrary/StellarMissionContext.fs |
Adds metricsViaClusterDns to mission context to carry the opt-in behavior through the system. |
src/FSLibrary/StellarCoreHTTP.fs |
Implements cluster-DNS URL + fetch path and switches GetRawMetrics() based on metricsViaClusterDns. |
src/FSLibrary.Tests/Tests.fs |
Updates test MissionContext construction to include the new field. |
src/App/Program.fs |
Adds the --metrics-via-cluster-dns flag and wires it into the mission context used at runtime. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Replace the URL/ingress-based metrics scrape with a `kubectl exec`-style call that runs `curl http://localhost:11626/metrics` inside each stellar-core pod, mirroring how DumpPeerDatabase already exec's sqlite3. This works in every environment SSC runs in, including a runner outside a non-SDF k3s cluster (e.g. Namespace/nsc): pod exec only needs the Kubernetes API that SSC already uses to manage the cluster, so it does not depend on the ingress `.local` hostname or on cluster-internal DNS (`svc.cluster.local`) being resolvable from the runner -- neither of which resolves from an out-of-cluster runner. The scrape is also now best-effort: a failure logs a warning instead of throwing, so a transient or environment-specific scrape problem no longer makes an otherwise-successful mission exit non-zero (the actual impact in #399, which blocks automated pass/fail gating). This supersedes the earlier --metrics-via-cluster-dns approach, which only helped a runner that could already resolve cluster DNS (i.e. in-cluster), so those changes are reverted. Fixes #399 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-exec set The teardown metrics dump was the only thing using pod-exec; in-mission health checks (WaitUntilReady -> GetInfo, WaitUntilSynced, WaitFor*) and actions (SetUpgrades, GenerateLoad, surveys, tx submit) still went through the ingress .local hostname. On an out-of-cluster runner (e.g. Namespace/nsc k3s) that hostname is unreachable, so SSC could not verify readiness even though core reached consensus -- the same DNS problem surfacing earlier than teardown. Funnel every core admin-HTTP GET through a single `Peer.httpGet path query` that selects the transport: - default: ingress (unchanged), preserving SDF behavior and the fast in-mission polling hot path (exec-per-poll would hammer the k8s API); - --core-http-via-pod-exec: exec `curl http://localhost:11626/<path>?<query>` inside the pod via the Kubernetes API, which needs no DNS and works from any runner that can reach the API server. Exec failures are surfaced as WebException so the existing WebExceptionRetry wrappers keep retrying transient errors (e.g. core still booting) uniformly. DumpPeerMetrics now just calls the transport-aware GetRawMetrics (still best-effort). All ~15 fetch/RequestString call sites route through httpGet, so nothing falls back to the ingress when the flag is on. CI: add a second BootAndSync run with --core-http-via-pod-exec and no resolvable --ingress-external-host, so any access that leaks to the ingress fails with name resolution -- proving full pod-exec coverage (mirrors an out-of-cluster nsc runner). Fixes #399 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
During WaitUntilReady, curl against a not-yet-ready core can return an empty body with exit 0, which previously reached Info.Parse and threw a non-retryable JSON error, crashing the mission. Surface an empty response as a WebException so the existing WebExceptionRetry retries until core is ready, matching how the ingress path retries on the 5xx it gets while core boots. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The previous empty-response guard lived in the transport and broke /upgrades, which legitimately returns an empty body on success (SetUpgrades only checks the body for "exception"). Move the empty-is-not-ready check into a fetchNonEmpty helper used by the JSON-parsing reads (info, metrics, sorobaninfo) only; action endpoints keep using fetch/httpGet and tolerate an empty body. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Closes #399.
SSC reaches each stellar-core's admin HTTP endpoint through the ingress, whose hostname falls back to
<nonce>.<ingress-internal-domain>(default<nonce>.local). On a runner that sits outside the cluster (e.g. a Namespace/nsc k3s runner — a local Docker container with a kubeconfig), that hostname is unreachable, and so is the per-pod*.svc.cluster.localname (cluster-internal DNS only). The result: missions exit non-zero even when core reached consensus — first observed as a teardown metrics-dump crash (#399), but it's the same DNS gap behindWaitUntilReady → GetInfoand every other HTTP call.Fix
Add an opt-in flag
--core-http-via-pod-exec(defaultfalse). All stellar-core admin-HTTP GETs now funnel through a singlePeer.httpGet path querythat selects the transport:http://<ingress-host>/<pod>/core/<path>), unchanged. Preserves SDF behavior and keeps the in-mission polling hot path (WaitFor*poll every 1s/peer) off the k8s API.--core-http-via-pod-exec— execcurl http://localhost:11626/<path>?<query>inside the pod via the Kubernetes API (the same mechanismDumpPeerDatabasealready uses forsqlite3). Needs no DNS and works from any runner that can reach the API server.This covers everything —
GetInfo/GetState/GetMetrics/GetSorobanInfo, allWaitUntil*/WaitFor*loops,SetUpgrades,GenerateLoad, surveys,txsubmit, and the teardown metrics dump — so nothing falls back to the ingress when the flag is on.Details:
WebException, so the existingWebExceptionRetrywrappers retry transient errors (e.g. core still booting) exactly as they do for ingress.info/metrics/sorobaninfo) go throughfetchNonEmpty: a booting core can answer with an empty body (curl exits 0), which is treated as "not ready yet" and retried instead of crashing a JSON parser. Endpoints that legitimately return an empty body on success (e.g.upgrades) are unaffected.DumpPeerMetricsstays best-effort (a failed scrape logs a warning, never fails the mission) and just calls the transport-awareGetRawMetrics.Verification in CI
The
completejob now runs BootAndSync twice:--ingress-external-host localhost) — proves the default path still works;--core-http-via-pod-execrun with no resolvable ingress host — the ingress host falls back to<nonce>.local, so any HTTP that leaked back to the ingress would fail with a name-resolution error. A green run proves all core HTTP goes through pod exec. This mirrors an out-of-cluster nsc runner.Both runs pass. Build, unit tests, and fantomas were also validated locally.
Usage
From an out-of-cluster runner (e.g. nsc/Namespace), add:
Requires
curlin the core image (present in the stock stellar-core image; already used by init containers).🤖 Generated with Claude Code