fix: circuit breaker cold-start trap (closes #94 #95 #96)#150
fix: circuit breaker cold-start trap (closes #94 #95 #96)#150petersimmons1972 wants to merge 3 commits into
Conversation
Adds discovery.type: "fc" — Olla polls Flight Controller /registry every 15s
and atomically reconciles its backend set, meeting the <30s convergence SLO.
- internal/adapter/discovery/fc_repository.go: FCDiscoveryPoller with Poll()
and RunLoop(). Converts FC RegistryEntry{Host,Models[]{Name,Port}} into
EndpointConfig slices and calls LoadFromConfig (thread-safe atomic map swap
already in StaticEndpointRepository). Fail-open: FC unreachable → warn and
preserve existing endpoint set.
- internal/adapter/discovery/fc_repository_test.go: 3 tests — convert,
remove-stale (core acceptance criterion), fail-open on FC unavailability
- internal/config/types.go: FCDiscoveryConfig{RegistryURL, PollInterval};
DiscoveryConfig.FC added
- internal/app/services/discovery.go: "fc" switch case validates registry_url,
defaults poll_interval to 15s, runs initial blocking poll at startup,
launches background RunLoop goroutine; fcPollerCancel called on Stop()
ADV.1 advisory recommendation: Path A chosen over Path C (Reconciler) because
Olla has no hot-reload path — ConfigMap volume mount convergence is 30-60s+
making <30s SLO unreliable without an Olla fork change anyway.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mons1972/aifleet#94, #95) - Add CircuitBreakerTimeout/Threshold to EndpointConfig for per-endpoint overrides - Add global CircuitBreakerConfig stanza under proxy.circuit_breaker in config - Add OLLA_HEALTH_CIRCUIT_BREAKER_TIMEOUT / OLLA_HEALTH_CIRCUIT_BREAKER_THRESHOLD env vars - Wire config through factory → OllaConfig → circuitBreaker.timeout field - circuitBreaker.IsOpen() now uses per-endpoint timeout if set, else global, else 30s default - Service.SetEndpointCircuitBreakerConfig() API for runtime per-endpoint registration - All existing defaults preserved (threshold=5, timeout=30s unchanged unless overridden) - All tests pass Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WalkthroughThis PR introduces Flight Controller (FC) registry discovery polling and extends circuit breaker observability across the proxy service. The adapter periodically fetches a registry endpoint, converts discovered models to endpoint configurations, and reconciles the static repository. Circuit breaker enhancements add per-endpoint timeout/threshold overrides, configuration management through all config layers, and state inspection APIs exposed via the status handler. ChangesFlight Controller Discovery and Circuit Breaker Enhancement
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 1 | ❌ 4❌ Failed checks (3 warnings, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
internal/app/services/discovery.go (1)
121-127: 💤 Low valueDuplicate poll at startup.
Line 121 performs a blocking poll, then line 127 starts
RunLoopwhich immediately polls again (seefc_repository.go:95). This briefly doubles load on the FC registry at startup.Consider either removing the immediate poll inside
RunLoopor passing a flag to skip the first poll when the caller has already primed the repository.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/app/services/discovery.go` around lines 121 - 127, The startup code calls poller.Poll(ctx) then immediately starts poller.RunLoop(pollerCtx, pollInterval), which causes a duplicate immediate poll (see RunLoop's initial poll in fc_repository.go). Modify RunLoop to accept a flag (e.g., skipInitial bool) that prevents the first immediate Poll when true, or remove the initial Poll inside RunLoop and ensure callers always perform the first Poll; then update the startup call that sets s.fcPoller and s.fcPollerCancel to invoke RunLoop with skipInitial=true (or to rely on callers doing the initial Poll) so the FC registry is not polled twice at startup.internal/app/handlers/handler_status_endpoints.go (1)
155-160: 💤 Low valueConsider handling the
json.Marshalerror.Line 156 discards the error from
json.Marshal. Whilst this is unlikely to fail for a map of primitives, silently ignoring errors can mask unexpected issues during debugging.🔧 Suggested improvement
// Marshal the map into the CircuitBreakerState struct -data, _ := json.Marshal(stateMap) +data, err := json.Marshal(stateMap) +if err != nil { + return nil +} var cbs CircuitBreakerState if err := json.Unmarshal(data, &cbs); err != nil { return nil }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/app/handlers/handler_status_endpoints.go` around lines 155 - 160, The call to json.Marshal on stateMap currently ignores its error; update the block around json.Marshal/stateMap/CircuitBreakerState to capture and handle the marshal error (e.g., data, err := json.Marshal(stateMap); if err != nil { return nil } or log and return) before attempting json.Unmarshal into cbs, keeping the existing Unmarshal error handling intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/adapter/discovery/fc_repository.go`:
- Line 45: The constructed registryURL uses registryBaseURL + "/registry" which
produces a double-slash when registryBaseURL ends with '/', so update the
construction to trim any trailing slashes before appending the segment (e.g.,
use strings.TrimRight(registryBaseURL, "/") + "/registry") and add "strings" to
the imports; locate the registryURL assignment in fc_repository.go (the struct
literal or New function that sets registryURL) and replace the concatenation
with the trimmed form to ensure a single slash.
In `@internal/config/config.go`:
- Around line 425-433: The environment parsing currently accepts zero or
negative durations for Proxy.CircuitBreaker.Timeout because time.ParseDuration
allows them; update the env branch in internal/config/config.go to only assign
config.Proxy.CircuitBreaker.Timeout when parsed duration d > 0 (reject d <= 0)
and keep the existing positive-integer check for Proxy.CircuitBreaker.Threshold
as-is; additionally add the same invariant checks to Config.Validate() (e.g.,
verify Proxy.CircuitBreaker.Timeout > 0 and Proxy.CircuitBreaker.Threshold > 0,
returning validation error if not) so YAML/config-file values cannot bypass the
constraint.
---
Nitpick comments:
In `@internal/app/handlers/handler_status_endpoints.go`:
- Around line 155-160: The call to json.Marshal on stateMap currently ignores
its error; update the block around json.Marshal/stateMap/CircuitBreakerState to
capture and handle the marshal error (e.g., data, err := json.Marshal(stateMap);
if err != nil { return nil } or log and return) before attempting json.Unmarshal
into cbs, keeping the existing Unmarshal error handling intact.
In `@internal/app/services/discovery.go`:
- Around line 121-127: The startup code calls poller.Poll(ctx) then immediately
starts poller.RunLoop(pollerCtx, pollInterval), which causes a duplicate
immediate poll (see RunLoop's initial poll in fc_repository.go). Modify RunLoop
to accept a flag (e.g., skipInitial bool) that prevents the first immediate Poll
when true, or remove the initial Poll inside RunLoop and ensure callers always
perform the first Poll; then update the startup call that sets s.fcPoller and
s.fcPollerCancel to invoke RunLoop with skipInitial=true (or to rely on callers
doing the initial Poll) so the FC registry is not polled twice at startup.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1fdf8aa4-d797-4cca-b792-c644f308cd56
📒 Files selected for processing (11)
internal/adapter/discovery/fc_repository.gointernal/adapter/discovery/fc_repository_test.gointernal/adapter/proxy/config.gointernal/adapter/proxy/config/unified.gointernal/adapter/proxy/factory.gointernal/adapter/proxy/olla/service.gointernal/app/handlers/handler_status_endpoints.gointernal/app/services/discovery.gointernal/app/services/proxy.gointernal/config/config.gointernal/config/types.go
| func NewFCDiscoveryPoller(repo *StaticEndpointRepository, registryBaseURL string, log logger.StyledLogger) *FCDiscoveryPoller { | ||
| return &FCDiscoveryPoller{ | ||
| repo: repo, | ||
| registryURL: registryBaseURL + "/registry", |
There was a problem hiding this comment.
Double-slash possible when base URL has trailing slash.
If registryBaseURL ends with /, the constructed URL becomes http://host//registry. Some HTTP servers or reverse proxies treat double-slashes differently, potentially causing 404s.
🛠️ Proposed fix: normalise trailing slash
- registryURL: registryBaseURL + "/registry",
+ registryURL: strings.TrimSuffix(registryBaseURL, "/") + "/registry",Add "strings" to imports.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/adapter/discovery/fc_repository.go` at line 45, The constructed
registryURL uses registryBaseURL + "/registry" which produces a double-slash
when registryBaseURL ends with '/', so update the construction to trim any
trailing slashes before appending the segment (e.g., use
strings.TrimRight(registryBaseURL, "/") + "/registry") and add "strings" to the
imports; locate the registryURL assignment in fc_repository.go (the struct
literal or New function that sets registryURL) and replace the concatenation
with the trimmed form to ensure a single slash.
| if val := os.Getenv("OLLA_HEALTH_CIRCUIT_BREAKER_TIMEOUT"); val != "" { | ||
| if d, err := time.ParseDuration(val); err == nil { | ||
| config.Proxy.CircuitBreaker.Timeout = d | ||
| } | ||
| } | ||
| if val := os.Getenv("OLLA_HEALTH_CIRCUIT_BREAKER_THRESHOLD"); val != "" { | ||
| if n, err := strconv.Atoi(val); err == nil && n > 0 { | ||
| config.Proxy.CircuitBreaker.Threshold = n | ||
| } |
There was a problem hiding this comment.
Reject non-positive circuit-breaker timeout values.
time.ParseDuration accepts 0 and negative durations, so this path can set an invalid cooldown and destabilise breaker state transitions. Also add the same invariant in Config.Validate() so YAML/config-file values cannot bypass it.
Suggested fix
if val := os.Getenv("OLLA_HEALTH_CIRCUIT_BREAKER_TIMEOUT"); val != "" {
- if d, err := time.ParseDuration(val); err == nil {
+ if d, err := time.ParseDuration(val); err == nil && d > 0 {
config.Proxy.CircuitBreaker.Timeout = d
}
} func (c *Config) Validate() error {
+ if c.Proxy.CircuitBreaker.Timeout <= 0 {
+ return fmt.Errorf("proxy.circuit_breaker.timeout must be > 0, got %s", c.Proxy.CircuitBreaker.Timeout)
+ }
+ if c.Proxy.CircuitBreaker.Threshold <= 0 {
+ return fmt.Errorf("proxy.circuit_breaker.threshold must be > 0, got %d", c.Proxy.CircuitBreaker.Threshold)
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if val := os.Getenv("OLLA_HEALTH_CIRCUIT_BREAKER_TIMEOUT"); val != "" { | |
| if d, err := time.ParseDuration(val); err == nil { | |
| config.Proxy.CircuitBreaker.Timeout = d | |
| } | |
| } | |
| if val := os.Getenv("OLLA_HEALTH_CIRCUIT_BREAKER_THRESHOLD"); val != "" { | |
| if n, err := strconv.Atoi(val); err == nil && n > 0 { | |
| config.Proxy.CircuitBreaker.Threshold = n | |
| } | |
| if val := os.Getenv("OLLA_HEALTH_CIRCUIT_BREAKER_TIMEOUT"); val != "" { | |
| if d, err := time.ParseDuration(val); err == nil && d > 0 { | |
| config.Proxy.CircuitBreaker.Timeout = d | |
| } | |
| } | |
| if val := os.Getenv("OLLA_HEALTH_CIRCUIT_BREAKER_THRESHOLD"); val != "" { | |
| if n, err := strconv.Atoi(val); err == nil && n > 0 { | |
| config.Proxy.CircuitBreaker.Threshold = n | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/config/config.go` around lines 425 - 433, The environment parsing
currently accepts zero or negative durations for Proxy.CircuitBreaker.Timeout
because time.ParseDuration allows them; update the env branch in
internal/config/config.go to only assign config.Proxy.CircuitBreaker.Timeout
when parsed duration d > 0 (reject d <= 0) and keep the existing
positive-integer check for Proxy.CircuitBreaker.Threshold as-is; additionally
add the same invariant checks to Config.Validate() (e.g., verify
Proxy.CircuitBreaker.Timeout > 0 and Proxy.CircuitBreaker.Threshold > 0,
returning validation error if not) so YAML/config-file values cannot bypass the
constraint.
Completes the per-endpoint circuit-breaker config and status JSON exposure work.
Tests pass. Closes #94 #95 #96
Summary by CodeRabbit
Release Notes
New Features