feat: fall back to the control plane when direct-to-VM routing hits a dead browser#109
Draft
rgarcia wants to merge 1 commit into
Draft
feat: fall back to the control plane when direct-to-VM routing hits a dead browser#109rgarcia wants to merge 1 commit into
rgarcia wants to merge 1 commit into
Conversation
Replace the prior broad-trigger fallback (fall back on any 5xx for any
idempotent GET) with a tighter, opt-in design that assumes the paired
metro-api change (kernel#2317): a routed request against a deleted/gone
browser returns HTTP 404 with body {"code":"browser_gone"}.
The routing layer now keeps a small registry of fallback-ELIGIBLE routed
paths (subresource + suffix), default-OFF for everything else. Only the
prospective GET /browsers/{id}/telemetry/events endpoint is pre-registered;
adding future eligible endpoints is a one-line edit.
Kernel.request / AsyncKernel.request fall back to the control plane IFF the
request was actually routed to the VM, the method is GET, the routed path is
eligible, and the VM returned a 404 whose JSON body code == "browser_gone".
On fallback we evict the cached route and re-issue the ORIGINAL request to
the control plane exactly once (CP URL, Authorization restored, jwt param
dropped). Transient 5xx, connection errors, other 4xx, success, non-eligible
paths, POSTs, and non-routed requests all propagate unchanged.
This PR intentionally does NOT modify the default routing subresource list.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fd1d2ec to
cfc1aa1
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Opt-in control-plane fallback for
browser_gone404sReworks the prior draft, which fell back to the control plane on any 5xx for any idempotent GET. That broad trigger meant a transient 502 from a healthy VM caused a retry of the dead VM and then a control-plane re-issue, adding latency for failures that should just be returned.
This replaces it with a tighter, opt-in, per-endpoint design that assumes the paired metro-api change kernel#2317 lands.
Depends on kernel#2317
When a routed request targets a deleted/gone browser, the VM proxy returns HTTP 404 with JSON body
{"code":"browser_gone","message":"browser not found"}. There is no special response header — we key off the bodycodeonly. A transient/real upstream failure still returns 5xx; a live VM's own 404 has nobrowser_gonecode.New fallback semantics
The routing layer keeps a small registry of fallback-eligible routed paths, expressed as
(subresource, suffix)and default-OFF for everything else.We fall back to the control plane iff ALL hold:
GET;code == "browser_gone".We do NOT fall back on: success, transient 5xx (502/503/504), connection/network errors, other 4xx, or a 404 whose body code is not
browser_gone. Those propagate unchanged.On fallback we evict the cached route for the session (it's authoritatively gone), then re-issue the original request to the control plane exactly once (original CP URL, Authorization restored,
jwtquery param dropped). Never loops.Prospective endpoint
Only the forthcoming pull endpoint
GET /browsers/{id}/telemetry/events(("telemetry", "/events")) is pre-registered, clearly commented as PROSPECTIVE — thetelemetry.events(...)method does not exist yet. This wires the opt-in so fallback works the moment that method ships. Adding any future eligible endpoint is a one-line registry edit.Scoping
This PR is focused on the fallback. It does NOT modify the default routing subresource list (that belongs to the separate telemetry-default-routing PR).
git diff origin/nextdoes not touch the default-subresources constant; tests enabletelemetryrouting locally.Verification
tests/test_browser_routing.py: 33 passing. Covers eligible+GET+browser_gone -> CP fallback (Authorization restored, jwt dropped, route evicted, exactly one CP re-issue), browser_gone+CP-also-errors -> returned as-is no loop, non-eligible path / 502 / connection error / 200 / POST / plain-404 -> no fallback, and non-routed requests untouched, plus a unit test of the eligibility registry. Sync and async parity covered.ruff checkclean on all changed files.api.onkernel.com; (B) after deleting the browser, the stale telemetry-stream route returned the VM 502 to the caller with ZERO control-plane re-issues — correct, since/streamis not eligible and 502 is notbrowser_gone.Files:
src/kernel/_client.py,src/kernel/lib/browser_routing/routing.py,tests/test_browser_routing.py.