status: distinguish API down from unreachable #170

Open
jarugupj wants to merge 3 commits into main from hypeship/status-distinguish-api-down


@jarugupj jarugupj commented May 28, 2026

Summary

Splits the failure handling for kernel status so the message accurately reflects what happened, instead of collapsing both transport errors and non-2xx responses into one ambiguous "Could not reach Kernel API" line.

Three failure cases now:

  • Transport error on /status (network down, DNS, timeout) → Could not reach Kernel API. Check https://status.kernel.sh for updates.
  • Non-2xx from /status, /health also unhealthy → Kernel API is down. Check https://status.kernel.sh for updates.
  • Non-2xx from /status, /health returns 200 → Kernel API is responding but /status is unavailable. Check https://status.kernel.sh for updates.

The /health probe matters because /status has an upstream dependency on incident.io, so a 5xx from /status doesn't necessarily mean the API itself is down. /health is a dependency-free liveness check on the same server, so a 200 there confirms the API is up and only the status endpoint is broken.
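
For illustration only, the three-way decision above can be sketched as a small pure function. The PR itself adds no helper, so `failureMessage` and its parameters are hypothetical names; the message strings are copied from the list above, and a `healthCode` of 0 is an assumed convention for "the /health probe itself failed".

```go
package main

import "net/http"

// Messages as listed in the PR description.
const (
	msgUnreachable       = "Could not reach Kernel API. Check https://status.kernel.sh for updates."
	msgDown              = "Kernel API is down. Check https://status.kernel.sh for updates."
	msgStatusUnavailable = "Kernel API is responding but /status is unavailable. Check https://status.kernel.sh for updates."
)

// failureMessage is a hypothetical helper (not part of the PR) that maps the
// outcome of the /status request and the /health probe to a user-facing message.
//
//   transportFailed: the /status request returned no response (network, DNS, timeout)
//   statusCode:      HTTP status code from /status, ignored if transportFailed
//   healthCode:      HTTP status code from /health, 0 if the probe itself failed
//
// It returns "" when /status succeeded and no failure message is needed.
func failureMessage(transportFailed bool, statusCode, healthCode int) string {
	if transportFailed {
		return msgUnreachable
	}
	if statusCode >= 200 && statusCode < 300 {
		return ""
	}
	// /status answered with a non-2xx code: let /health decide which message applies.
	if healthCode == http.StatusOK {
		return msgStatusUnavailable
	}
	return msgDown
}
```

For example, `failureMessage(false, 503, 200)` yields the "/status is unavailable" message, while `failureMessage(false, 503, 500)` yields "Kernel API is down".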

No new helpers and no synthetic UI rendering: the dotted status UI stays reserved for real data from the API (the same convention the dashboard's status indicator follows).

Test plan

  • Healthy run against prod still renders the status UI as before
  • Transport error (unreachable host) prints "Could not reach Kernel API…"
  • /status 5xx + /health 5xx prints "Kernel API is down…"
  • /status 5xx + /health 200 prints "Kernel API is responding but /status is unavailable…"
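
One way to exercise the failure cases in this plan without a real outage is a local fake server. The sketch below is an assumption about how that could look, not the code in cmd/status.go: `checkAPI` and `fakeAPI` are hypothetical names, the messages are shortened, and the 10s and 3s timeouts mirror the values discussed in this PR.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"time"
)

// checkAPI is a hypothetical stand-in for the CLI's failure handling.
// It requests /status and, only on a non-2xx response, probes /health.
func checkAPI(baseURL string) string {
	statusClient := &http.Client{Timeout: 10 * time.Second}
	resp, err := statusClient.Get(baseURL + "/status")
	if err != nil {
		// Transport error: no HTTP response was received at all.
		return "Could not reach Kernel API"
	}
	resp.Body.Close()
	if resp.StatusCode >= 200 && resp.StatusCode < 300 {
		return "ok"
	}

	// Non-2xx from /status: probe /health with a shorter, dedicated timeout.
	healthClient := &http.Client{Timeout: 3 * time.Second}
	h, err := healthClient.Get(baseURL + "/health")
	if err == nil {
		h.Body.Close()
		if h.StatusCode == http.StatusOK {
			return "Kernel API is responding but /status is unavailable"
		}
	}
	return "Kernel API is down"
}

// fakeAPI starts a local test server that answers /status and /health
// with the given fixed status codes. The caller must Close it.
func fakeAPI(statusCode, healthCode int) *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(statusCode)
	})
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(healthCode)
	})
	return httptest.NewServer(mux)
}
```

Calling `checkAPI(fakeAPI(503, 200).URL)` covers the "/status 5xx + /health 200" item, and pointing `checkAPI` at the URL of a server that has already been closed approximates the unreachable-host item.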

🤖 Generated with Claude Code


Note

Low Risk
User-facing error strings and an extra HTTP GET on failure paths only; no change to successful status rendering or auth/data handling.

Overview
kernel status now reports three distinct failure modes instead of one generic “could not reach” message for every non-success case.

When /status returns a non-2xx response, the CLI probes /health (3s timeout). If /health is healthy, users see that the API is up but /status is unavailable (e.g. incident.io dependency). If /health is also unhealthy, the message says the API is down. Transport errors on /status still use the original “could not reach” wording.

Reviewed by Cursor Bugbot for commit 3d48cd0.

When /status returns a non-2xx response the API is reachable but
unhealthy, so report it as down rather than reusing the generic
'could not reach' message used for transport errors.
@jarugupj jarugupj force-pushed the hypeship/status-distinguish-api-down branch from 5c1365d to cfc2591 Compare May 29, 2026 14:10
…broken

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jarugupj jarugupj force-pushed the hypeship/status-distinguish-api-down branch from cfc2591 to ad44904 Compare May 29, 2026 15:54
@jarugupj jarugupj marked this pull request as ready for review May 29, 2026 15:57
@jarugupj jarugupj requested a review from masnwilliams May 29, 2026 16:01
@firetiger-agent

Monitoring Plan: Improve kernel status error message differentiation

What this PR does: Gives the kernel status command a more informative error message when the API's status page is unreachable — distinguishing a /status endpoint outage from a full API outage.

Intended effect:

  • CLI output correctness: No telemetry baseline exists (the CLI has no OTel instrumentation). Confirm by running kernel status post-deploy under normal conditions; output should be identical to pre-deploy (status table displayed). The new messages only appear during failure scenarios.

Risks:

  • Extended timeout during full outage — when the API is completely unreachable, the CLI now makes two sequential HTTP calls (each up to 10s timeout), potentially doubling the wait time before showing an error. No signal to alert on; first user invocation during an outage surfaces this.
  • /health endpoint stability — if https://api.onkernel.com/health is not a stable always-200 endpoint, the fallback message could mislead users. Verify /health reliability before release; alert if any non-2xx is observed from this endpoint during normal operation.
  • API 5xx regression (unrelated path) — overall API error rate baseline is 0.08–0.79% 5xx; alert if sustained above 2% for 2+ hours post-deploy (pre-existing infra noise pattern, not expected to be PR-related).

Status updates will be posted automatically on this PR as monitoring progresses.


@masnwilliams masnwilliams left a comment

approved — nicely distinguishes API-down from status-unavailable. verified /health exists (registered in the API and used as Railway's healthcheck path) and shares the same base url as /status, so the probe is apples-to-apples.

nit: cmd/status.go:53 — the secondary /health probe reuses the same 10s client timeout, so a fully-hung API could take ~20s before the CLI prints. consider a shorter timeout on the fallback probe.

Use a dedicated http.Client for the /health probe with a 3s timeout instead of
sharing the 10s client used for /status. Caps worst-case wait during a genuine
outage (both /status and /health hung) at ~13s instead of ~20s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

jarugupj commented Jun 1, 2026

addressed the timeout nit in 3d48cd0 - dedicated healthClient with 3s timeout for the probe, caps worst-case wait at ~13s. lmk if you want to take another look before i merge.
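
The ~13s figure is the sum of the two client timeouts. A minimal sketch of that arithmetic, assuming both requests run sequentially and each takes its full timeout; `newClients` and `worstCaseWait` are hypothetical names, not identifiers from the commit:

```go
package main

import (
	"net/http"
	"time"
)

const (
	statusTimeout = 10 * time.Second // client timeout used for /status, per the review
	healthTimeout = 3 * time.Second  // dedicated timeout for the /health probe
)

// newClients returns separate clients so the fallback probe does not
// inherit the longer /status timeout.
func newClients() (statusClient, healthClient *http.Client) {
	return &http.Client{Timeout: statusTimeout}, &http.Client{Timeout: healthTimeout}
}

// worstCaseWait is the upper bound on total wait when the two requests run
// one after the other and each one uses its entire timeout.
func worstCaseWait(statusClient, healthClient *http.Client) time.Duration {
	return statusClient.Timeout + healthClient.Timeout
}
```

With a shared 10s client the same sum would be 20s, which matches the ~20s the review pointed out.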
