[AI-864] fix: receive hosted-agent permission decisions via push (no held invocation)#158
Conversation
Code Review by Qodo
1. Ready despite reregister failures
|
PR Summary by QodoFix hosted-agent permission prompts via push-based resolution and readiness gating WalkthroughsDescription• Replace held SignalR permission invocation with requestId + server push decision. • Add pending permission correlation to survive reconnects and early decision races. • Gate readiness until daemon + agents re-register to prevent ownership-race denies. Diagramgraph TD
AO["AgentOrchestrator"] --> SC["ServerConnection"] --> RG["RegistrationGate"]
HA([Hosted agent]) --> SC --> HUB["SignalR hub"] --> SVR["kcap server"]
SVR --> HUB --> SC --> PPR["PendingPermissionRegistry"] --> HA
High-Level AssessmentThe following are alternative approaches to this PR: 1. Raise SignalR MaximumParallelInvocationsPerClient
2. Dedicated hub connection per hosted agent
3. Polling for permission decision (client queries by requestId)
Recommendation: Keep the push-based design (RequestPermission2 + PermissionResolved) as implemented: it removes the held-invocation bottleneck that starved DaemonPing under SignalR’s default single parallel invocation, and it cleanly tolerates reconnects by decoupling decision delivery from invocation lifetime. The added readiness bracketing via RunRegistrationAsync appropriately closes the remaining session-ownership race window. File ChangesBug fix (4)
Tests (2)
|
| public async Task RunRegistrationAsync(Func<Task> daemonConnect, Func<Task> reRegisterAgents) { | ||
| MarkUnregistered(); | ||
| await daemonConnect(); | ||
| await reRegisterAgents(); | ||
| MarkRegistered(); | ||
| } |
There was a problem hiding this comment.
2. Ready despite reregister failures 🐞 Bug ≡ Correctness
RegistrationGate.RunRegistrationAsync always calls MarkRegistered after reRegisterAgents returns, but AgentOrchestrator.ReRegisterAgentsAsync swallows per-agent registration failures, so ServerConnection.IsReady can become true even when session ownership was not restored for some agents. That reopens the documented “Caller is not the daemon owning session” HubException path for affected sessions, which the retry layer treats as fatal and turns into a deny.
Agent Prompt
## Issue description
`RunRegistrationAsync()` restores readiness unconditionally, even if per-agent re-registration did not actually succeed. Because `ReRegisterAgentsAsync()` catches and logs exceptions per agent and never fails the overall step, the connection can report `IsReady == true` while the server still has not re-established session ownership for some running agents, which is the exact failure mode the PR is trying to eliminate.
## Issue Context
- `ServerConnection.IsReady` gates permission retries.
- Comments in `RegistrationGate` describe the ownership-mismatch HubException and its deny consequence.
- `AgentOrchestrator.ReRegisterAgentsAsync` logs and skips failures, ensuring `RunRegistrationAsync` proceeds to `MarkRegistered()`.
## Fix Focus Areas
- src/Capacitor.Cli.Daemon/Services/RegistrationGate.cs[27-47]
- src/Capacitor.Cli.Daemon/Services/AgentOrchestrator.cs[888-920]
- src/Capacitor.Cli.Daemon/Services/ServerConnection.cs[292-306]
## Suggested fix approach
Pick one (or combine):
1) **Fail readiness when any required agent fails to re-register**
- Have `ReRegisterAgentsAsync` aggregate failures and throw at the end if any occurred (or at least for hosted/non-private running agents).
- Update `RunRegistrationAsync` to only `MarkRegistered()` after successful completion; leave unready on exception.
2) **Retry per-agent re-registration before restoring readiness**
- Add a bounded retry/backoff loop around `AgentRegisteredAsync`/`AgentStatusChangedAsync` for transient failures.
3) **Partial readiness model** (more work)
- Keep daemon-level readiness, but track per-agent ownership-restored state; block permission requests (or retry) until the specific session/agent has re-registered.
Ensure whichever approach you choose avoids bringing back the original reconnect-storm issue (i.e., do not hold an invocation open).
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
There was a problem hiding this comment.
Addressed in 6df3efe: ReRegisterAgentsAsync now retries each agent's re-registration a bounded number of times (ReRegisterMaxAttempts, 250ms backoff) before giving up, so a transient per-agent failure no longer leaves ownership unrestored while the daemon flips ready — it's restored within the registration pass for the realistic (transient) case.
I deliberately did not take the "fail readiness on any agent failure" option: one agent's persistent failure would then withhold readiness for the whole daemon and loop reconnects, blocking every other agent's permissions — a worse failure mode than today's graceful per-agent degradation. For the residual (a genuinely persistent single-agent failure), the bounded retry on the "Caller is not the daemon owning session" HubException in RequestPermissionAsync is the final safety net: the permission is retried briefly, then denied only if ownership truly can't be restored — rather than blocking the fleet.
(Separately, the unbounded _early buffer finding is fixed: PendingPermissionRegistry now bounds it with FIFO eviction.)
5242e16 to
a8ea944
Compare
|
Addressed the Qodo review:
|
a8ea944 to
336a437
Compare
… a held invocation Pairs with the kcap-server change: the daemon obtains a permission decision via RequestPermission2 (a SHORT invocation returning a requestId) plus a server->client "PermissionResolved" push, instead of holding a RequestPermission invocation open for the whole elicitation wait. That blocking invocation occupied the connection's single parallel-invocation slot and starved DaemonPing, causing reconnect storms / spurious denies whenever an elicitation appeared. - PendingPermissionRegistry: correlates requestId -> pushed decision, with a bounded (FIFO-evicted) early-arrival buffer so duplicate/late/unknown pushes can't grow memory unbounded over the daemon's lifetime. - ServerConnection: handle the "PermissionResolved" push (single PermissionResolution record, arity 1); RequestPermissionAsync calls RequestPermission2 with a single HostedPermissionRequest record and awaits the push. Both directions use records so the contract can evolve without breaking mixed-version servers. Bounded retry on the post-reconnect "Caller is not the daemon owning session" HubException so a brief ownership gap doesn't become a spurious deny. - ConnectionRetry: add bounded isRetriableServerError / maxServerErrorRetries (transient disconnects still retry unbounded; server errors are capped). - Readiness fix: fold per-agent re-registration into RegistrationGate.RunRegistrationAsync so IsReady can't flip true after DaemonConnect but before session ownership is restored. ReRegisterAgentsAsync retries each agent's re-registration a bounded number of times before giving up (narrows the "ready despite reregister failure" window without withholding readiness for the whole daemon on one agent's persistent failure). Replaces the fire-and-forget OnReconnectedCallback re-register path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
336a437 to
6df3efe
Compare
Problem
When a hosted agent raised a permission prompt, the daemon went into a reconnect storm (garbled terminal mirror) and the prompt was often auto-denied before the user could answer.
Root cause:
RequestPermissionAsyncobtained the decision as the return value of aRequestPermissionhub invocation held open for the entire elicitation wait. A daemon multiplexes all its agents over one connection, and SignalR's defaultMaximumParallelInvocationsPerClient=1meant that one pending invocation starvedDaemonPing→ 5s ping deadline → forced reconnect → retry re-blocks → storm.Changes (pairs with kcap-server#743)
PendingPermissionRegistry— correlates arequestIdto the server's later decision push, with an early-arrival buffer for the race where the push beats the await.ServerConnection— handle the new server→clientPermissionResolvedmessage;RequestPermissionAsyncnow callsRequestPermission2(a short invocation returning arequestId) and awaits the push. The short invoke staysConnectionRetry-gated; the await then survives reconnects on its own (the server re-resolves the daemon connection at push time).RegistrationGate.RunRegistrationAsync, soIsReadycan't flip true afterDaemonConnectbut before session ownership is re-established. This was the residual half of the silent-deny: a retry gated only on daemon registration could fire beforeAgentRegistered/AgentStatusChangedand get aHubException("Caller is not the daemon owning session") → treated as fatal → deny. Replaces the fire-and-forgetOnReconnectedCallbackre-register path.Tests
1512/1512 unit tests pass. New:
PendingPermissionRegistryTests(4 — resolve/await, early-arrival buffer, cancellation, cross-request isolation) and 3RegistrationGateTestsforRunRegistrationAsyncordering. AOT publish clean (no IL2026/IL3050).Rollout
Deploys after kcap-server#743 (server-before-CLI):
RequestPermission2/PermissionResolvedmust exist server-side first. No user-facing CLI surface change.🤖 Generated with Claude Code