Skip to content

[AI-864] fix: receive hosted-agent permission decisions via push (no held invocation)#158

Merged
alexeyzimarev merged 1 commit into
mainfrom
worktree-permission-resolved-push
Jun 14, 2026
Merged

[AI-864] fix: receive hosted-agent permission decisions via push (no held invocation)#158
alexeyzimarev merged 1 commit into
mainfrom
worktree-permission-resolved-push

Conversation

@alexeyzimarev

Copy link
Copy Markdown
Member

Problem

When a hosted agent raised a permission prompt, the daemon went into a reconnect storm (garbled terminal mirror) and the prompt was often auto-denied before the user could answer.

Root cause: RequestPermissionAsync obtained the decision as the return value of a RequestPermission hub invocation held open for the entire elicitation wait. A daemon multiplexes all its agents over one connection, and SignalR's default MaximumParallelInvocationsPerClient=1 meant that one pending invocation starved DaemonPing → 5s ping deadline → forced reconnect → retry re-blocks → storm.

Changes (pairs with kcap-server#743)

  • PendingPermissionRegistry — correlates a requestId to the server's later decision push, with an early-arrival buffer for the race where the push beats the await.
  • ServerConnection — handle the new server→client PermissionResolved message; RequestPermissionAsync now calls RequestPermission2 (a short invocation returning a requestId) and awaits the push. The short invoke stays ConnectionRetry-gated; the await then survives reconnects on its own (the server re-resolves the daemon connection at push time).
  • Readiness-race fix — fold per-agent re-registration into RegistrationGate.RunRegistrationAsync, so IsReady can't flip true after DaemonConnect but before session ownership is re-established. This was the residual half of the silent-deny: a retry gated only on daemon registration could fire before AgentRegistered/AgentStatusChanged and get a HubException ("Caller is not the daemon owning session") → treated as fatal → deny. Replaces the fire-and-forget OnReconnectedCallback re-register path.

Tests

1512/1512 unit tests pass. New: PendingPermissionRegistryTests (4 — resolve/await, early-arrival buffer, cancellation, cross-request isolation) and 3 RegistrationGateTests for RunRegistrationAsync ordering. AOT publish clean (no IL2026/IL3050).

Rollout

Deploys after kcap-server#743 (server-before-CLI): RequestPermission2 / PermissionResolved must exist server-side first. No user-facing CLI surface change.

🤖 Generated with Claude Code

@qodo-code-review

qodo-code-review Bot commented Jun 14, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0)

Grey Divider


Action required

1. Ready despite reregister failures 🐞 Bug ≡ Correctness
Description
RegistrationGate.RunRegistrationAsync always calls MarkRegistered after reRegisterAgents returns,
but AgentOrchestrator.ReRegisterAgentsAsync swallows per-agent registration failures, so
ServerConnection.IsReady can become true even when session ownership was not restored for some
agents. That reopens the documented “Caller is not the daemon owning session” HubException path for
affected sessions, which the retry layer treats as fatal and turns into a deny.
Code

src/Capacitor.Cli.Daemon/Services/RegistrationGate.cs[R42-47]

+    public async Task RunRegistrationAsync(Func<Task> daemonConnect, Func<Task> reRegisterAgents) {
+        MarkUnregistered();
+        await daemonConnect();
+        await reRegisterAgents();
+        MarkRegistered();
+    }
Evidence
The gate sets readiness true after the re-registration step regardless of success, and the current
hook implementation suppresses errors; the code comments themselves document that missing ownership
yields a HubException that becomes a deny.

src/Capacitor.Cli.Daemon/Services/RegistrationGate.cs[27-47]
src/Capacitor.Cli.Daemon/Services/AgentOrchestrator.cs[888-920]
src/Capacitor.Cli.Daemon/Services/ServerConnection.cs[292-306]
src/Capacitor.Cli.Daemon/Services/ServerConnection.cs[492-526]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`RunRegistrationAsync()` restores readiness unconditionally, even if per-agent re-registration did not actually succeed. Because `ReRegisterAgentsAsync()` catches and logs exceptions per agent and never fails the overall step, the connection can report `IsReady == true` while the server still has not re-established session ownership for some running agents, which is the exact failure mode the PR is trying to eliminate.

## Issue Context
- `ServerConnection.IsReady` gates permission retries.
- Comments in `RegistrationGate` describe the ownership-mismatch HubException and its deny consequence.
- `AgentOrchestrator.ReRegisterAgentsAsync` logs and skips failures, ensuring `RunRegistrationAsync` proceeds to `MarkRegistered()`.

## Fix Focus Areas
- src/Capacitor.Cli.Daemon/Services/RegistrationGate.cs[27-47]
- src/Capacitor.Cli.Daemon/Services/AgentOrchestrator.cs[888-920]
- src/Capacitor.Cli.Daemon/Services/ServerConnection.cs[292-306]

## Suggested fix approach
Pick one (or combine):
1) **Fail readiness when any required agent fails to re-register**
  - Have `ReRegisterAgentsAsync` aggregate failures and throw at the end if any occurred (or at least for hosted/non-private running agents).
  - Update `RunRegistrationAsync` to only `MarkRegistered()` after successful completion; leave unready on exception.

2) **Retry per-agent re-registration before restoring readiness**
  - Add a bounded retry/backoff loop around `AgentRegisteredAsync`/`AgentStatusChangedAsync` for transient failures.

3) **Partial readiness model** (more work)
  - Keep daemon-level readiness, but track per-agent ownership-restored state; block permission requests (or retry) until the specific session/agent has re-registered.

Ensure whichever approach you choose avoids bringing back the original reconnect-storm issue (i.e., do not hold an invocation open).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. Unbounded early decision buffer 🐞 Bug ☼ Reliability
Description
PendingPermissionRegistry.Resolve stores unmatched PermissionResolved pushes into _early with no
eviction, so late/duplicate/unknown requestIds can accumulate and grow memory over the daemon’s
lifetime. This is especially likely when a request await is canceled/removed and a later Resolve
arrives, which is explicitly routed into _early.
Code

src/Capacitor.Cli.Daemon/Services/PendingPermissionRegistry.cs[R74-79]

+    public void Resolve(string requestId, PermissionDecision decision) {
+        if (_pending.TryRemove(requestId, out var tcs))
+            tcs.TrySetResult(decision);
+        else
+            _early[requestId] = decision;
+    }
Evidence
The registry explicitly retains decisions when no waiter exists and provides no cleanup path, so
entries can persist indefinitely.

src/Capacitor.Cli.Daemon/Services/PendingPermissionRegistry.cs[24-26]
src/Capacitor.Cli.Daemon/Services/PendingPermissionRegistry.cs[54-66]
src/Capacitor.Cli.Daemon/Services/PendingPermissionRegistry.cs[68-79]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`PendingPermissionRegistry` buffers decisions in `_early` whenever a `PermissionResolved` push arrives with no registered waiter, and never evicts those entries. Over time (duplicates, late pushes after cancellation, or unexpected requestIds), `_early` can grow without bound.

## Issue Context
The current code path intentionally retains unmatched decisions:
- If `_pending` has no entry, `Resolve()` stores into `_early`.
- Cancellation removes `_pending`, so a later `Resolve()` will also store into `_early`.

## Fix Focus Areas
- src/Capacitor.Cli.Daemon/Services/PendingPermissionRegistry.cs[24-79]

## Suggested fix approach
- Change `_early` to store `(PermissionDecision decision, DateTimeOffset receivedAt)`.
- Add a bounded policy:
 - TTL eviction (e.g., drop entries older than N minutes), and/or
 - size cap (e.g., keep last N entries; drop oldest).
- Optionally: when `Resolve()` sees no pending waiter, log at Debug/Warn (with requestId) and *only* buffer if it’s plausibly “early” (e.g., buffer for a short TTL window).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

@qodo-code-review

Copy link
Copy Markdown

PR Summary by Qodo

Fix hosted-agent permission prompts via push-based resolution and readiness gating
🐞 Bug fix 🧪 Tests 🕐 40+ Minutes

Grey Divider

Walkthroughs

Description
• Replace held SignalR permission invocation with requestId + server push decision.
• Add pending permission correlation to survive reconnects and early decision races.
• Gate readiness until daemon + agents re-register to prevent ownership-race denies.
Diagram
graph TD
AO["AgentOrchestrator"] --> SC["ServerConnection"] --> RG["RegistrationGate"]
HA([Hosted agent]) --> SC --> HUB["SignalR hub"] --> SVR["kcap server"]
SVR --> HUB --> SC --> PPR["PendingPermissionRegistry"] --> HA
Loading
High-Level Assessment

The following are alternative approaches to this PR:

1. Raise SignalR MaximumParallelInvocationsPerClient
  • ➕ Minimal client/server code changes
  • ➕ Avoids ping starvation from a single long-running invocation
  • ➖ Server-side setting change and rollout coordination required
  • ➖ Long-running invocations can still accumulate; harder to reason about backpressure
  • ➖ Doesn’t address reconnect survivability/ownership races by design
2. Dedicated hub connection per hosted agent
  • ➕ Isolates long-running invocations per agent; avoids multiplexing contention
  • ➖ More connections (cost/limits), more lifecycle complexity
  • ➖ Still leaves the design coupling decision delivery to invocation lifetimes
3. Polling for permission decision (client queries by requestId)
  • ➕ No server→client push channel required
  • ➕ Reconnect-friendly if state is server-side
  • ➖ Adds latency and load (poll interval tuning)
  • ➖ More complex failure semantics vs a single definitive push

Recommendation: Keep the push-based design (RequestPermission2 + PermissionResolved) as implemented: it removes the held-invocation bottleneck that starved DaemonPing under SignalR’s default single parallel invocation, and it cleanly tolerates reconnects by decoupling decision delivery from invocation lifetime. The added readiness bracketing via RunRegistrationAsync appropriately closes the remaining session-ownership race window.

Grey Divider

File Changes

Bug fix (4)
AgentOrchestrator.cs Wire agent re-registration into ServerConnection as an awaited hook +31/-28

Wire agent re-registration into ServerConnection as an awaited hook

• Replaces the fire-and-forget reconnect callback with an async ReRegisterAgentsHook assigned on ServerConnection. Extracts/clarifies the agent re-registration routine and documents why it must be awaited before readiness is restored.

src/Capacitor.Cli.Daemon/Services/AgentOrchestrator.cs


PendingPermissionRegistry.cs Add requestId→decision registry with early-arrival buffering +80/-0

Add requestId→decision registry with early-arrival buffering

• Introduces a concurrent registry that bridges a server push (PermissionResolved) back to the awaiting permission request. Handles the race where the push arrives before the await is registered, and cleans up pending entries on cancellation.

src/Capacitor.Cli.Daemon/Services/PendingPermissionRegistry.cs


RegistrationGate.cs Bracket full daemon+agent registration before declaring readiness +22/-0

Bracket full daemon+agent registration before declaring readiness

• Adds RunRegistrationAsync to ensure IsReady only becomes true after both DaemonConnect and per-agent re-registration complete. Prevents permission retries from firing during the ownership-recovery gap.

src/Capacitor.Cli.Daemon/Services/RegistrationGate.cs


ServerConnection.cs Switch permission flow to short invoke + server push and tighten reconnect registration +61/-24

Switch permission flow to short invoke + server push and tighten reconnect registration

• Registers a PermissionResolved server→client handler and uses PendingPermissionRegistry to correlate decisions by requestId. Updates RequestPermissionAsync to call RequestPermission2 (short invocation) and await the pushed decision. Reworks registration so readiness is restored only after daemon connect and agent re-registration, removing the prior reconnected callback path.

src/Capacitor.Cli.Daemon/Services/ServerConnection.cs


Tests (2)
PendingPermissionRegistryTests.cs Add unit coverage for permission request correlation and races +77/-0

Add unit coverage for permission request correlation and races

• Tests resolve/await completion, early-arrival buffering, cancellation behavior, and cross-request isolation for PendingPermissionRegistry.

test/Capacitor.Cli.Tests.Unit/PendingPermissionRegistryTests.cs


RegistrationGateTests.cs Add tests for RunRegistrationAsync readiness ordering and failure handling +67/-0

Add tests for RunRegistrationAsync readiness ordering and failure handling

• Validates that readiness remains false until agent re-registration completes, enforces daemon-connect-before-reregister ordering, and ensures failures leave the gate unready while skipping re-registration.

test/Capacitor.Cli.Tests.Unit/RegistrationGateTests.cs


Grey Divider

Qodo Logo

Comment on lines +42 to +47
public async Task RunRegistrationAsync(Func<Task> daemonConnect, Func<Task> reRegisterAgents) {
MarkUnregistered();
await daemonConnect();
await reRegisterAgents();
MarkRegistered();
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

2. Ready despite reregister failures 🐞 Bug ≡ Correctness

RegistrationGate.RunRegistrationAsync always calls MarkRegistered after reRegisterAgents returns,
but AgentOrchestrator.ReRegisterAgentsAsync swallows per-agent registration failures, so
ServerConnection.IsReady can become true even when session ownership was not restored for some
agents. That reopens the documented “Caller is not the daemon owning session” HubException path for
affected sessions, which the retry layer treats as fatal and turns into a deny.
Agent Prompt
## Issue description
`RunRegistrationAsync()` restores readiness unconditionally, even if per-agent re-registration did not actually succeed. Because `ReRegisterAgentsAsync()` catches and logs exceptions per agent and never fails the overall step, the connection can report `IsReady == true` while the server still has not re-established session ownership for some running agents, which is the exact failure mode the PR is trying to eliminate.

## Issue Context
- `ServerConnection.IsReady` gates permission retries.
- Comments in `RegistrationGate` describe the ownership-mismatch HubException and its deny consequence.
- `AgentOrchestrator.ReRegisterAgentsAsync` logs and skips failures, ensuring `RunRegistrationAsync` proceeds to `MarkRegistered()`.

## Fix Focus Areas
- src/Capacitor.Cli.Daemon/Services/RegistrationGate.cs[27-47]
- src/Capacitor.Cli.Daemon/Services/AgentOrchestrator.cs[888-920]
- src/Capacitor.Cli.Daemon/Services/ServerConnection.cs[292-306]

## Suggested fix approach
Pick one (or combine):
1) **Fail readiness when any required agent fails to re-register**
   - Have `ReRegisterAgentsAsync` aggregate failures and throw at the end if any occurred (or at least for hosted/non-private running agents).
   - Update `RunRegistrationAsync` to only `MarkRegistered()` after successful completion; leave unready on exception.

2) **Retry per-agent re-registration before restoring readiness**
   - Add a bounded retry/backoff loop around `AgentRegisteredAsync`/`AgentStatusChangedAsync` for transient failures.

3) **Partial readiness model** (more work)
   - Keep daemon-level readiness, but track per-agent ownership-restored state; block permission requests (or retry) until the specific session/agent has re-registered.

Ensure whichever approach you choose avoids bringing back the original reconnect-storm issue (i.e., do not hold an invocation open).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 6df3efe: ReRegisterAgentsAsync now retries each agent's re-registration a bounded number of times (ReRegisterMaxAttempts, 250ms backoff) before giving up, so a transient per-agent failure no longer leaves ownership unrestored while the daemon flips ready — it's restored within the registration pass for the realistic (transient) case.

I deliberately did not take the "fail readiness on any agent failure" option: one agent's persistent failure would then withhold readiness for the whole daemon and loop reconnects, blocking every other agent's permissions — a worse failure mode than today's graceful per-agent degradation. For the residual (a genuinely persistent single-agent failure), the bounded retry on the "Caller is not the daemon owning session" HubException in RequestPermissionAsync is the final safety net: the permission is retried briefly, then denied only if ownership truly can't be restored — rather than blocking the fleet.

(Separately, the unbounded _early buffer finding is fixed: PendingPermissionRegistry now bounds it with FIFO eviction.)

@linear-code

linear-code Bot commented Jun 14, 2026

Copy link
Copy Markdown

AI-864

@alexeyzimarev alexeyzimarev changed the title fix: receive hosted-agent permission decisions via push (no held invocation) [AI-864] fix: receive hosted-agent permission decisions via push (no held invocation) Jun 14, 2026
@alexeyzimarev alexeyzimarev force-pushed the worktree-permission-resolved-push branch 2 times, most recently from 5242e16 to a8ea944 Compare June 14, 2026 19:00
@alexeyzimarev

Copy link
Copy Markdown
Member Author

Addressed the Qodo review:

  • Bug C2 (unbounded buffer)PendingPermissionRegistry._early is now bounded with FIFO eviction, so duplicate/late/unknown pushes can't grow memory over the daemon's lifetime. New test.
  • Bug C1 (ready despite reregister failures) — added a bounded retry on the post-reconnect "Caller is not the daemon owning session" HubException (via ConnectionRetry.isRetriableServerError/maxServerErrorRetries), so a brief ownership gap after reconnect no longer becomes a spurious deny, while a genuinely-permanent ownership error still surfaces. Chose this over withholding readiness on a single agent's failure, which would risk a fleet-wide reconnect loop. New ConnectionRetry tests.

@alexeyzimarev alexeyzimarev force-pushed the worktree-permission-resolved-push branch from a8ea944 to 336a437 Compare June 14, 2026 19:05
… a held invocation

Pairs with the kcap-server change: the daemon obtains a permission decision via
RequestPermission2 (a SHORT invocation returning a requestId) plus a server->client
"PermissionResolved" push, instead of holding a RequestPermission invocation open
for the whole elicitation wait. That blocking invocation occupied the connection's
single parallel-invocation slot and starved DaemonPing, causing reconnect storms /
spurious denies whenever an elicitation appeared.

- PendingPermissionRegistry: correlates requestId -> pushed decision, with a bounded
  (FIFO-evicted) early-arrival buffer so duplicate/late/unknown pushes can't grow
  memory unbounded over the daemon's lifetime.
- ServerConnection: handle the "PermissionResolved" push (single PermissionResolution
  record, arity 1); RequestPermissionAsync calls RequestPermission2 with a single
  HostedPermissionRequest record and awaits the push. Both directions use records so
  the contract can evolve without breaking mixed-version servers. Bounded retry on the
  post-reconnect "Caller is not the daemon owning session" HubException so a brief
  ownership gap doesn't become a spurious deny.
- ConnectionRetry: add bounded isRetriableServerError / maxServerErrorRetries
  (transient disconnects still retry unbounded; server errors are capped).
- Readiness fix: fold per-agent re-registration into RegistrationGate.RunRegistrationAsync
  so IsReady can't flip true after DaemonConnect but before session ownership is
  restored. ReRegisterAgentsAsync retries each agent's re-registration a bounded number
  of times before giving up (narrows the "ready despite reregister failure" window
  without withholding readiness for the whole daemon on one agent's persistent failure).
  Replaces the fire-and-forget OnReconnectedCallback re-register path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexeyzimarev alexeyzimarev force-pushed the worktree-permission-resolved-push branch from 336a437 to 6df3efe Compare June 14, 2026 20:48
@alexeyzimarev alexeyzimarev merged commit 1b3c869 into main Jun 14, 2026
5 checks passed
@alexeyzimarev alexeyzimarev deleted the worktree-permission-resolved-push branch June 14, 2026 20:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant