
Add periodic TEE registry refresh to detect stale gateways #16

Merged
adambalogh merged 1 commit into main from claude/veil-periodic-tees-refresh-9qhhev
Jun 26, 2026


@adambalogh


Summary

Adds a background refresh loop that periodically re-checks the on-chain TEE registry and drops the cached gateway when its TEE rotates out of the registry or rotates its cryptographic keys. This prevents the proxy from hammering a stale TEE endpoint until it is restarted.

Problem

Previously, the gateway reselected a TEE only reactively, when a request raised a network error. However, when a TEE rotates out of the registry or rotates its OHTTP/signing keys, the failures surface as RelayError/VerificationError rather than network errors, so the reactive retry path doesn't catch them. This left the proxy failing every request until it was manually restarted.
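To make the failure mode concrete, here is a minimal sketch of a reactive-only retry path. It is not the code in veil/gateway.py: the `Gateway` class, its `chat` method, the `select_tee` callable, and the local `RelayError` stand-in are all hypothetical names used for illustration, and `ConnectionError` stands in for whatever network error the real path catches.

```python
class RelayError(Exception):
    """Hypothetical stand-in for the relay failure described above."""


class Gateway:
    """Illustrative gateway that reselects a TEE only on network errors."""

    def __init__(self, select_tee):
        self._select_tee = select_tee  # callable returning a client callable
        self._client = None
        self.selections = 0

    def _get_client(self):
        # Select lazily and cache the result.
        if self._client is None:
            self._client = self._select_tee()
            self.selections += 1
        return self._client

    def chat(self, payload):
        for attempt in range(2):
            client = self._get_client()
            try:
                return client(payload)
            except ConnectionError:
                # Reactive path: drop the cache so the next attempt reselects.
                self._client = None
                if attempt == 1:
                    raise
            # RelayError is not caught here, so the cached client is kept.


def stale_client(payload):
    # Simulates a TEE whose keys no longer match the registry.
    raise RelayError("key configuration no longer accepted")


gw = Gateway(select_tee=lambda: stale_client)
failures = 0
for _ in range(3):
    try:
        gw.chat("hi")
    except RelayError:
        failures += 1
```

In this sketch every call fails with `RelayError` and the gateway never selects a second TEE, which is the stuck state the PR addresses.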

Changes

  • veil/gateway.py: Added background refresh loop infrastructure:

    • start_refresh_loop() / stop_refresh_loop() to manage the background thread
    • _refresh_loop() that periodically calls _refresh_once() with configurable interval
    • _refresh_once() that checks if the cached TEE is still active and unchanged
    • _tee_still_current() static method that validates both TEE ID and key material (OHTTP public key, key ID, and signing key) match the registry
  • veil/config.py: Added tee_refresh_interval configuration (default 300 seconds, 0 disables the loop; mirrors the SDK's RegistryTEEConnection refresh cadence)

  • veil/server.py: Integrated refresh loop lifecycle:

    • Calls gateway.start_refresh_loop() after gateway initialization
    • Calls gateway.stop_refresh_loop() in a finally block to clean up on shutdown
  • tests/test_gateway_refresh.py: Tests covering:

    • TEE staleness detection (rotation out, key rotation)
    • Refresh behavior (keeping vs. dropping cached client)
    • Loop lifecycle (enable/disable, thread management)
  • README.md: Updated documentation to explain the refresh behavior and new OG_VEIL_TEE_REFRESH_INTERVAL configuration option
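The staleness check listed under veil/gateway.py can be sketched as follows. This is an assumption-laden illustration, not the merged `_tee_still_current()`: the `TEEEntry` dataclass and its field names are hypothetical, and the real registry records may be shaped differently. The point is that a matching TEE ID alone is not enough; the OHTTP public key, key ID, and signing key must match as well.

```python
from dataclasses import dataclass, replace
from typing import Iterable


@dataclass(frozen=True)
class TEEEntry:
    """Hypothetical registry record; field names are assumptions."""
    tee_id: str
    ohttp_public_key: bytes
    ohttp_key_id: int
    signing_key: bytes


def tee_still_current(cached: TEEEntry, active: Iterable[TEEEntry]) -> bool:
    """Return True only if the cached TEE is still listed with identical keys."""
    for entry in active:
        if entry.tee_id == cached.tee_id:
            # Same TEE ID: the key material must also be unchanged.
            return (
                entry.ohttp_public_key == cached.ohttp_public_key
                and entry.ohttp_key_id == cached.ohttp_key_id
                and entry.signing_key == cached.signing_key
            )
    # The TEE ID is no longer in the active set: it rotated out.
    return False


cached = TEEEntry("tee-1", b"ohttp-pk", 1, b"sign-pk")
other = TEEEntry("tee-2", b"other-pk", 7, b"other-sign")
rotated = replace(cached, signing_key=b"new-sign-pk")  # same ID, new signing key

unchanged = tee_still_current(cached, [other, cached])
key_rotated = tee_still_current(cached, [other, rotated])
rotated_out = tee_still_current(cached, [other])
```

Here `unchanged` is true, while both `key_rotated` and `rotated_out` are false, so either kind of rotation would cause the cached client to be dropped.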

Implementation Details

  • The refresh loop is daemon-threaded and uses threading.Event.wait() for both sleeping and graceful shutdown signaling
  • Thread-safe cache clearing uses the existing _lock to avoid races with reactive resets
  • Errors raised during a refresh pass are caught, so transient registry failures don't kill the background loop
  • Matching on key material (not just TEE ID) catches silent key rotations that would break the cached client
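The points above can be sketched as a small, self-contained loop. This is an illustration under stated assumptions rather than the merged implementation: the `RefreshingCache` class, its `is_current` callable, and the `client` attribute are hypothetical. Only the method names and the use of `threading.Event.wait()`, a daemon thread, and a lock come from the description.

```python
import threading
from typing import Callable, Optional


class RefreshingCache:
    """Illustrative cache with a background staleness check."""

    def __init__(self, is_current: Callable[[], bool], interval: float):
        self._is_current = is_current  # stands in for the registry check
        self._interval = interval
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None
        self.client: Optional[object] = "cached-client"

    def start_refresh_loop(self) -> None:
        # An interval of 0 (or less) disables the loop entirely.
        if self._interval <= 0 or self._thread is not None:
            return
        self._stop.clear()
        self._thread = threading.Thread(target=self._refresh_loop, daemon=True)
        self._thread.start()

    def stop_refresh_loop(self) -> None:
        self._stop.set()  # wakes the loop immediately if it is sleeping
        if self._thread is not None:
            self._thread.join(timeout=5)
            self._thread = None

    def _refresh_loop(self) -> None:
        # wait() returns False on timeout (time to refresh) and True once
        # the stop event is set, so one call both sleeps and signals shutdown.
        while not self._stop.wait(self._interval):
            try:
                self._refresh_once()
            except Exception:
                # A transient failure must not end the loop; retry next tick.
                continue

    def _refresh_once(self) -> None:
        with self._lock:
            client = self.client
        if client is None or self._is_current():
            return
        with self._lock:
            # Clear only if nothing replaced the client in the meantime.
            if self.client is client:
                self.client = None


checked = threading.Event()


def registry_check() -> bool:
    checked.set()
    return False  # pretend the cached TEE rotated out


live = RefreshingCache(registry_check, interval=0.01)
try:
    live.start_refresh_loop()
    checked.wait(timeout=2)  # stands in for the server's run loop
finally:
    live.stop_refresh_loop()
```

The `try`/`finally` at the end mirrors the lifecycle described for veil/server.py: the loop starts after initialization and is always stopped on shutdown. Running the registry check outside the lock keeps request handling from blocking on registry I/O, at the cost of the identity re-check before clearing.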

https://claude.ai/code/session_018yT2skdLRrv9w4YVMxMpvX

The gateway cached its selected TEE + OHTTP relay client and only
reselected reactively, when a chat call raised a network error. When the
on-chain registry changed instead — a gateway rotated out, or rotated its
OHTTP/signing keys — the resulting failures surfaced as RelayError /
VerificationError, which the reactive path does not retry. So the proxy
kept hammering a stale TEE and every request failed until a manual restart.

Add a background loop (mirroring the SDK's RegistryTEEConnection refresh)
that re-checks the registry every OG_VEIL_TEE_REFRESH_INTERVAL seconds
(default 300, 0 to disable) and drops the cached gateway when the active
TEE is no longer present or its key material changed, so the next request
reselects a live one. Started/stopped around the server's run loop.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_018yT2skdLRrv9w4YVMxMpvX
@adambalogh adambalogh marked this pull request as ready for review June 26, 2026 19:21
@adambalogh adambalogh merged commit 6b3010b into main Jun 26, 2026
4 checks passed