status: distinguish API down from unreachable by jarugupj · Pull Request #170 · kernel/cli

jarugupj · 2026-05-28T17:18:16Z

Summary

Splits the failure handling for kernel status so the message accurately reflects what happened, instead of collapsing both transport errors and non-2xx responses into one ambiguous "Could not reach Kernel API" line.

Three failure cases now:

Transport error on /status (network down, DNS, timeout) → Could not reach Kernel API. Check https://status.kernel.sh for updates.
Non-2xx from /status, /health also unhealthy → Kernel API is down. Check https://status.kernel.sh for updates.
Non-2xx from /status, /health returns 200 → Kernel API is responding but /status is unavailable. Check https://status.kernel.sh for updates.

The /health probe matters because /status has an upstream dependency on incident.io, so a 5xx from /status doesn't necessarily mean the API itself is down. /health is a dependency-free liveness check on the same server, so a 200 there confirms the API is up and only the status endpoint is broken.

No new helpers, no synthetic UI rendering — reserves the dotted status UI for real data from the API (same convention the dashboard's status indicator follows).

Test plan

Healthy run against prod still renders the status UI as before
Transport error (unreachable host) prints "Could not reach Kernel API…"
/status 5xx + /health 5xx prints "Kernel API is down…"
/status 5xx + /health 200 prints "Kernel API is responding but /status is unavailable…"

🤖 Generated with Claude Code

Note

Low Risk
User-facing error strings and an extra HTTP GET on failure paths only; no change to successful status rendering or auth/data handling.

Overview
kernel status now reports three distinct failure modes instead of one generic “could not reach” message for every non-success case.

When /status returns a non-2xx response, the CLI probes /health (3s timeout). If /health is healthy, users see that the API is up but /status is unavailable (e.g. incident.io dependency). If /health is also unhealthy, the message says the API is down. Transport errors on /status still use the original “could not reach” wording.

Reviewed by Cursor Bugbot for commit 3d48cd0. Bugbot is set up for automated code reviews on this repo.

firetiger-agent · 2026-05-29T16:03:30Z

Monitoring Plan: Improve `kernel status` error message differentiation

What this PR does: Gives the kernel status command a more informative error message when the API's status page is unreachable — distinguishing a /status endpoint outage from a full API outage.

Intended effect:

CLI output correctness: No telemetry baseline exists (the CLI has no OTel instrumentation). Confirm by running `kernel status` post-deploy under normal conditions; the output should be identical to pre-deploy (status table displayed). The new messages only appear during failure scenarios.

Risks:

Extended timeout during full outage — when the API is completely unreachable, the CLI now makes two sequential HTTP calls (each up to 10s timeout), potentially doubling the wait time before showing an error. No signal to alert on; first user invocation during an outage surfaces this.
/health endpoint stability — if https://api.onkernel.com/health is not a stable always-200 endpoint, the fallback message could mislead users. Verify /health reliability before release; alert if any non-2xx is observed from this endpoint during normal operation.
API 5xx regression (unrelated path) — overall API error rate baseline is 0.08–0.79% 5xx; alert if sustained above 2% for 2+ hours post-deploy (pre-existing infra noise pattern, not expected to be PR-related).

Status updates will be posted automatically on this PR as monitoring progresses.

masnwilliams

approved — nicely distinguishes API-down from status-unavailable. verified /health exists (registered in the API and used as Railway's healthcheck path) and shares the same base url as /status, so the probe is apples-to-apples.

nit: cmd/status.go:53 — the secondary /health probe reuses the same 10s client timeout, so a fully-hung API could take ~20s before the CLI prints. consider a shorter timeout on the fallback probe.

Use a dedicated http.Client for the /health probe with a 3s timeout instead of sharing the 10s client used for /status. Caps worst-case wait during a genuine outage (both /status and /health hung) at ~13s instead of ~20s. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

jarugupj · 2026-06-01T19:29:13Z

addressed the timeout nit in 3d48cd0 - dedicated healthClient with 3s timeout for the probe, caps worst-case wait at ~13s. lmk if you want to take another look before i merge.

distinguish API down from unreachable in status command

42bdba0

When /status returns a non-2xx response the API is reachable but unhealthy, so report it as down rather than reusing the generic 'could not reach' message used for transport errors.

jarugupj force-pushed the hypeship/status-distinguish-api-down branch from 5c1365d to cfc2591 on May 29, 2026 14:10

fall back to /health on non-2xx to distinguish API down from /status …

ad44904

…broken Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

jarugupj force-pushed the hypeship/status-distinguish-api-down branch from cfc2591 to ad44904 on May 29, 2026 15:54

jarugupj marked this pull request as ready for review May 29, 2026 15:57

jarugupj requested a review from masnwilliams May 29, 2026 16:01

masnwilliams approved these changes Jun 1, 2026

jarugupj wants to merge 3 commits into main from hypeship/status-distinguish-api-down
