Add periodic TEE registry refresh to detect stale gateways #16
Merged
The gateway cached its selected TEE + OHTTP relay client and only reselected reactively, when a chat call raised a network error. When the on-chain registry changed instead (a gateway rotated out, or rotated its OHTTP/signing keys), the resulting failures surfaced as RelayError / VerificationError, which the reactive path does not retry. So the proxy kept hammering a stale TEE and every request failed until a manual restart.

Add a background loop (mirroring the SDK's RegistryTEEConnection refresh) that re-checks the registry every OG_VEIL_TEE_REFRESH_INTERVAL seconds (default 300, 0 to disable) and drops the cached gateway when the active TEE is no longer present or its key material changed, so the next request reselects a live one. The loop is started and stopped around the server's run loop.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_018yT2skdLRrv9w4YVMxMpvX
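The interval semantics in the commit message (default 300 seconds, 0 to disable) can be sketched as follows. This is only an illustration: `read_refresh_interval` is a hypothetical helper, the actual parsing lives in veil/config.py and may differ, and rejecting negative values is an assumption not stated in the PR.

```python
import os

DEFAULT_REFRESH_INTERVAL = 300.0  # seconds, the default named in the commit message


def read_refresh_interval(environ=None):
    """Return the refresh interval in seconds, or None when refresh is disabled.

    Hypothetical helper for illustration; not the code in veil/config.py.
    """
    env = os.environ if environ is None else environ
    raw = env.get("OG_VEIL_TEE_REFRESH_INTERVAL")
    if raw is None or raw.strip() == "":
        return DEFAULT_REFRESH_INTERVAL
    value = float(raw)  # raises ValueError on non-numeric input
    if value < 0:
        # Assumption: negative intervals are treated as a configuration error.
        raise ValueError("OG_VEIL_TEE_REFRESH_INTERVAL must be >= 0")
    # 0 means "do not start the background loop".
    return None if value == 0 else value
```

Returning None for the disabled case lets the caller skip starting the loop with a single check.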
Summary
Adds a background refresh loop that periodically re-checks the on-chain TEE registry and drops the cached gateway when its TEE rotates out of the registry or rotates its cryptographic keys. This prevents the proxy from getting stuck hammering a stale TEE endpoint until restart.
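A minimal sketch of the staleness check described above, assuming a registry entry can be modelled as a small record. `TEERecord` and its field names are hypothetical; the real check is the `_tee_still_current()` method in veil/gateway.py, whose signature is not shown in this PR description.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TEERecord:
    """Hypothetical shape of a registry entry; field names are assumptions."""
    tee_id: str
    ohttp_public_key: bytes
    ohttp_key_id: int
    signing_key: bytes


def tee_still_current(cached: TEERecord, active: Iterable[TEERecord]) -> bool:
    """True only if the cached TEE is still listed with identical key material."""
    for entry in active:
        if entry.tee_id == cached.tee_id:
            # Dataclass equality compares every field, so a rotated OHTTP
            # public key, key ID, or signing key makes this False.
            return entry == cached
    return False  # the TEE is no longer present in the registry
```

Comparing key material as well as the ID matters because a TEE that keeps its ID but rotates keys would otherwise look healthy while every request fails.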
Problem
Previously, the gateway only reselected a TEE reactively when a request raised a network error. However, when a TEE rotates out of the registry or rotates its OHTTP/signing keys, the failures surface as RelayError/VerificationError rather than network errors, so the reactive retry path doesn't catch them. This left the proxy stuck failing every request indefinitely.
Changes
- veil/gateway.py: Added background refresh loop infrastructure:
  - start_refresh_loop()/stop_refresh_loop() to manage the background thread
  - _refresh_loop() that periodically calls _refresh_once() with configurable interval
  - _refresh_once() that checks if the cached TEE is still active and unchanged
  - _tee_still_current() static method that validates both TEE ID and key material (OHTTP public key, key ID, and signing key) match the registry
- veil/config.py: Added tee_refresh_interval configuration (default 300 seconds, mirrors SDK's RegistryTEEConnection refresh cadence)
- veil/server.py: Integrated refresh loop lifecycle:
  - gateway.start_refresh_loop() after gateway initialization
  - gateway.stop_refresh_loop() in a finally block to clean up on shutdown
- tests/test_gateway_refresh.py: Comprehensive test coverage for:
- README.md: Updated documentation to explain the refresh behavior and new OG_VEIL_TEE_REFRESH_INTERVAL configuration option
Implementation Details
- threading.Event.wait() for both sleeping and graceful shutdown signaling
- _lock to avoid races with reactive resets

https://claude.ai/code/session_018yT2skdLRrv9w4YVMxMpvX
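The two implementation notes above (one `threading.Event.wait()` call for both sleeping and shutdown, and a lock around the cached selection) might look roughly like the sketch below. `RefreshingGateway`, the `is_current` callable, and the `cached` attribute are illustrative stand-ins rather than the code in veil/gateway.py; swallowing exceptions inside the loop is also an assumption.

```python
import threading


class RefreshingGateway:
    """Illustrative stand-in for the gateway; not the code in veil/gateway.py."""

    def __init__(self, is_current, interval):
        self._is_current = is_current   # callable(cached) -> bool (assumed shape)
        self._interval = interval       # seconds between checks; 0/None disables
        self._lock = threading.Lock()   # guards the cached selection
        self._stop = threading.Event()
        self._thread = None
        self.cached = "tee-1"           # placeholder for the cached TEE + client

    def _refresh_once(self):
        # Hold the lock so a concurrent reactive reset cannot interleave.
        with self._lock:
            if self.cached is not None and not self._is_current(self.cached):
                self.cached = None      # the next request reselects

    def _refresh_loop(self):
        # wait() returns True once stop is set and False on timeout, so one
        # call serves as both the sleep and the shutdown check.
        while not self._stop.wait(self._interval):
            try:
                self._refresh_once()
            except Exception:
                pass  # assumption: a failed check should not kill the loop

    def start_refresh_loop(self):
        if not self._interval or self._thread is not None:
            return
        self._stop.clear()
        self._thread = threading.Thread(target=self._refresh_loop, daemon=True)
        self._thread.start()

    def stop_refresh_loop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join(timeout=5)
            self._thread = None
```

Following the lifecycle described for veil/server.py, a caller would invoke `start_refresh_loop()` after constructing the gateway and `stop_refresh_loop()` in a `finally` block around the server's run loop.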