Skip to content

fix: circuit breaker cold-start trap (closes #94 #95 #96)#150

Open
petersimmons1972 wants to merge 3 commits into
thushan:mainfrom
petersimmons1972:fix/94-circuit-breaker-cold-start
Open

fix: circuit breaker cold-start trap (closes #94 #95 #96)#150
petersimmons1972 wants to merge 3 commits into
thushan:mainfrom
petersimmons1972:fix/94-circuit-breaker-cold-start

Conversation

@petersimmons1972

@petersimmons1972 petersimmons1972 commented May 23, 2026

Copy link
Copy Markdown

Completes the per-endpoint circuit-breaker config and status JSON exposure work.

  • GetCircuitBreakerState() exposes breaker state for status endpoints
  • Per-endpoint config support via SetEndpointCircuitBreakerConfig()
  • State machine (closed/open/half-open) with cooldown tracking
  • ISO-8601 timestamp serialization for observability

Tests pass. Closes #94 #95 #96

Summary by CodeRabbit

Release Notes

New Features

  • Flight Controller Discovery: Automatic endpoint discovery from Flight Controller registry with configurable polling intervals to keep your endpoint list synchronised
  • Circuit Breaker Configuration: Configurable resilience settings including global defaults and per-endpoint overrides for circuit breaker timeout and failure threshold
  • Circuit Breaker Observability: Circuit breaker state information now visible in endpoint status reports, showing current state, trip history and remaining cooldown time

Review Change Stack

Peter Simmons and others added 3 commits May 21, 2026 21:42
Adds discovery.type: "fc" — Olla polls Flight Controller /registry every 15s
and atomically reconciles its backend set, meeting the <30s convergence SLO.

- internal/adapter/discovery/fc_repository.go: FCDiscoveryPoller with Poll()
  and RunLoop(). Converts FC RegistryEntry{Host,Models[]{Name,Port}} into
  EndpointConfig slices and calls LoadFromConfig (thread-safe atomic map swap
  already in StaticEndpointRepository). Fail-open: FC unreachable → warn and
  preserve existing endpoint set.
- internal/adapter/discovery/fc_repository_test.go: 3 tests — convert,
  remove-stale (core acceptance criterion), fail-open on FC unavailability
- internal/config/types.go: FCDiscoveryConfig{RegistryURL, PollInterval};
  DiscoveryConfig.FC added
- internal/app/services/discovery.go: "fc" switch case validates registry_url,
  defaults poll_interval to 15s, runs initial blocking poll at startup,
  launches background RunLoop goroutine; fcPollerCancel called on Stop()

ADV.1 advisory recommendation: Path A chosen over Path C (Reconciler) because
Olla has no hot-reload path — ConfigMap volume mount convergence is 30-60s+
making <30s SLO unreliable without an Olla fork change anyway.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mons1972/aifleet#94, #95)

- Add CircuitBreakerTimeout/Threshold to EndpointConfig for per-endpoint overrides
- Add global CircuitBreakerConfig stanza under proxy.circuit_breaker in config
- Add OLLA_HEALTH_CIRCUIT_BREAKER_TIMEOUT / OLLA_HEALTH_CIRCUIT_BREAKER_THRESHOLD env vars
- Wire config through factory → OllaConfig → circuitBreaker.timeout field
- circuitBreaker.IsOpen() now uses per-endpoint timeout if set, else global, else 30s default
- Service.SetEndpointCircuitBreakerConfig() API for runtime per-endpoint registration
- All existing defaults preserved (threshold=5, timeout=30s unchanged unless overridden)
- All tests pass

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Completes the per-endpoint breaker config (#95) and status JSON
exposure (#96) work. Tests pass.

Closes petersimmons1972/aifleet#95
Closes petersimmons1972/aifleet#96
@petersimmons1972 petersimmons1972 requested a review from thushan as a code owner May 23, 2026 06:46
@coderabbitai

coderabbitai Bot commented May 23, 2026

Copy link
Copy Markdown

Walkthrough

This PR introduces Flight Controller (FC) registry discovery polling and extends circuit breaker observability across the proxy service. The adapter periodically fetches a registry endpoint, converts discovered models to endpoint configurations, and reconciles the static repository. Circuit breaker enhancements add per-endpoint timeout/threshold overrides, configuration management through all config layers, and state inspection APIs exposed via the status handler.

Changes

Flight Controller Discovery and Circuit Breaker Enhancement

Layer / File(s) Summary
Configuration types and contracts
internal/config/types.go, internal/adapter/proxy/config.go, internal/adapter/proxy/config/unified.go
New CircuitBreakerConfig struct defines timeout/threshold defaults; FCDiscoveryConfig adds registry URL and poll interval; ProxyConfig and OllaConfig gain circuit breaker fields; EndpointConfig gains per-endpoint CB overrides.
Flight Controller discovery adapter
internal/adapter/discovery/fc_repository.go
FCDiscoveryPoller periodically polls an FC /registry endpoint via HTTP, converts registry entries to endpoint configs, and reconciles the StaticEndpointRepository via LoadFromConfig. Implements fail-open behaviour on request failure or non-200 responses, preserving existing endpoints.
Flight Controller discovery tests
internal/adapter/discovery/fc_repository_test.go
Three comprehensive tests cover registry-to-endpoint conversion (verifying endpoint count and URL generation), stale endpoint removal across successive polls, and fail-open behaviour when the FC registry is unavailable.
FC discovery service integration and configuration
internal/app/services/discovery.go
Extends DiscoveryService.Start() to support discovery.type = "fc", validates FC.RegistryURL, applies default poll interval, performs initial blocking Poll(ctx) before health checks, then starts the poller in a background goroutine. Stop() cancels the poller's context loop.
Circuit breaker per-endpoint overrides and state queries
internal/adapter/proxy/olla/service.go
Extends Service with endpointBreakerConfigs map for per-endpoint overrides; SetEndpointCircuitBreakerConfig registers overrides; GetCircuitBreaker resolves effective threshold/timeout from endpoint config, service-level settings, or defaults; circuitBreaker.GetState and GetStateInfo add state inspection including remaining cooldown seconds when open.
Circuit breaker state observability
internal/app/handlers/handler_status_endpoints.go
New CircuitBreakerState struct holds machine-readable state snapshot; EndpointSummary gains optional CircuitBreaker field; handler's summary-building path calls new getCircuitBreakerState helper to fetch CB details from proxy service.
Circuit breaker configuration wiring
internal/adapter/proxy/factory.go, internal/app/services/proxy.go
Factory applies app-level CB settings from ports.ProxyConfiguration to olla.Configuration; service layer populates proxy config with s.config.CircuitBreaker settings.
Configuration defaults and environment overrides
internal/config/config.go
Default proxy configuration sets circuit breaker timeout to 30 seconds and threshold to 5; environment parsing reads OLLA_HEALTH_CIRCUIT_BREAKER_TIMEOUT (duration) and OLLA_HEALTH_CIRCUIT_BREAKER_THRESHOLD (positive integer).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • thushan/olla#59: Circuit breaker additions extend the unified proxy configuration surface introduced in this PR.
  • thushan/olla#34: This PR's per-endpoint circuit breaker overrides and state observability directly extend the Olla circuit breaker implementation refactored in this earlier consolidation PR.

Suggested labels

enhancement, feature

🚥 Pre-merge checks | ✅ 1 | ❌ 4

❌ Failed checks (3 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning Only issue #94 (dependency bump for xsync/v4) is provided in linked_issues, but the PR contains no go.mod changes and instead implements circuit-breaker config/state exposure and FC discovery, which do not align with the stated objective. Verify that issues #95 and #96 are available and review whether their requirements match the circuit-breaker and FC discovery implementation included in this PR.
Out of Scope Changes check ⚠️ Warning The PR implements two major features (FC discovery poller and circuit-breaker state exposure) with no apparent connection to issue #94's xsync dependency bump requirement. Confirm that issues #95 and #96 document the circuit-breaker and FC discovery requirements, or remove these changes if they are genuinely out of scope.
Docstring Coverage ⚠️ Warning Docstring coverage is 72.73% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title references closing issues #94, #95, #96 but only #94 is provided in linked_issues; the PR implements extensive circuit-breaker and FC discovery features beyond just a dependency bump. Clarify whether the title should reference all three issues or be more specific about the circuit-breaker and FC discovery features that comprise the majority of changes.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
internal/app/services/discovery.go (1)

121-127: 💤 Low value

Duplicate poll at startup.

Line 121 performs a blocking poll, then line 127 starts RunLoop which immediately polls again (see fc_repository.go:95). This briefly doubles load on the FC registry at startup.

Consider either removing the immediate poll inside RunLoop or passing a flag to skip the first poll when the caller has already primed the repository.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/app/services/discovery.go` around lines 121 - 127, The startup code
calls poller.Poll(ctx) then immediately starts poller.RunLoop(pollerCtx,
pollInterval), which causes a duplicate immediate poll (see RunLoop's initial
poll in fc_repository.go). Modify RunLoop to accept a flag (e.g., skipInitial
bool) that prevents the first immediate Poll when true, or remove the initial
Poll inside RunLoop and ensure callers always perform the first Poll; then
update the startup call that sets s.fcPoller and s.fcPollerCancel to invoke
RunLoop with skipInitial=true (or to rely on callers doing the initial Poll) so
the FC registry is not polled twice at startup.
internal/app/handlers/handler_status_endpoints.go (1)

155-160: 💤 Low value

Consider handling the json.Marshal error.

Line 156 discards the error from json.Marshal. Whilst this is unlikely to fail for a map of primitives, silently ignoring errors can mask unexpected issues during debugging.

🔧 Suggested improvement
 // Marshal the map into the CircuitBreakerState struct
-data, _ := json.Marshal(stateMap)
+data, err := json.Marshal(stateMap)
+if err != nil {
+	return nil
+}
 var cbs CircuitBreakerState
 if err := json.Unmarshal(data, &cbs); err != nil {
 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/app/handlers/handler_status_endpoints.go` around lines 155 - 160,
The call to json.Marshal on stateMap currently ignores its error; update the
block around json.Marshal/stateMap/CircuitBreakerState to capture and handle the
marshal error (e.g., data, err := json.Marshal(stateMap); if err != nil { return
nil } or log and return) before attempting json.Unmarshal into cbs, keeping the
existing Unmarshal error handling intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/adapter/discovery/fc_repository.go`:
- Line 45: The constructed registryURL uses registryBaseURL + "/registry" which
produces a double-slash when registryBaseURL ends with '/', so update the
construction to trim any trailing slashes before appending the segment (e.g.,
use strings.TrimRight(registryBaseURL, "/") + "/registry") and add "strings" to
the imports; locate the registryURL assignment in fc_repository.go (the struct
literal or New function that sets registryURL) and replace the concatenation
with the trimmed form to ensure a single slash.

In `@internal/config/config.go`:
- Around line 425-433: The environment parsing currently accepts zero or
negative durations for Proxy.CircuitBreaker.Timeout because time.ParseDuration
allows them; update the env branch in internal/config/config.go to only assign
config.Proxy.CircuitBreaker.Timeout when parsed duration d > 0 (reject d <= 0)
and keep the existing positive-integer check for Proxy.CircuitBreaker.Threshold
as-is; additionally add the same invariant checks to Config.Validate() (e.g.,
verify Proxy.CircuitBreaker.Timeout > 0 and Proxy.CircuitBreaker.Threshold > 0,
returning validation error if not) so YAML/config-file values cannot bypass the
constraint.

---

Nitpick comments:
In `@internal/app/handlers/handler_status_endpoints.go`:
- Around line 155-160: The call to json.Marshal on stateMap currently ignores
its error; update the block around json.Marshal/stateMap/CircuitBreakerState to
capture and handle the marshal error (e.g., data, err := json.Marshal(stateMap);
if err != nil { return nil } or log and return) before attempting json.Unmarshal
into cbs, keeping the existing Unmarshal error handling intact.

In `@internal/app/services/discovery.go`:
- Around line 121-127: The startup code calls poller.Poll(ctx) then immediately
starts poller.RunLoop(pollerCtx, pollInterval), which causes a duplicate
immediate poll (see RunLoop's initial poll in fc_repository.go). Modify RunLoop
to accept a flag (e.g., skipInitial bool) that prevents the first immediate Poll
when true, or remove the initial Poll inside RunLoop and ensure callers always
perform the first Poll; then update the startup call that sets s.fcPoller and
s.fcPollerCancel to invoke RunLoop with skipInitial=true (or to rely on callers
doing the initial Poll) so the FC registry is not polled twice at startup.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1fdf8aa4-d797-4cca-b792-c644f308cd56

📥 Commits

Reviewing files that changed from the base of the PR and between 6d6ac4d and eda1824.

📒 Files selected for processing (11)
  • internal/adapter/discovery/fc_repository.go
  • internal/adapter/discovery/fc_repository_test.go
  • internal/adapter/proxy/config.go
  • internal/adapter/proxy/config/unified.go
  • internal/adapter/proxy/factory.go
  • internal/adapter/proxy/olla/service.go
  • internal/app/handlers/handler_status_endpoints.go
  • internal/app/services/discovery.go
  • internal/app/services/proxy.go
  • internal/config/config.go
  • internal/config/types.go

func NewFCDiscoveryPoller(repo *StaticEndpointRepository, registryBaseURL string, log logger.StyledLogger) *FCDiscoveryPoller {
return &FCDiscoveryPoller{
repo: repo,
registryURL: registryBaseURL + "/registry",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Double-slash possible when base URL has trailing slash.

If registryBaseURL ends with /, the constructed URL becomes http://host//registry. Some HTTP servers or reverse proxies treat double-slashes differently, potentially causing 404s.

🛠️ Proposed fix: normalise trailing slash
-		registryURL: registryBaseURL + "/registry",
+		registryURL: strings.TrimSuffix(registryBaseURL, "/") + "/registry",

Add "strings" to imports.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/adapter/discovery/fc_repository.go` at line 45, The constructed
registryURL uses registryBaseURL + "/registry" which produces a double-slash
when registryBaseURL ends with '/', so update the construction to trim any
trailing slashes before appending the segment (e.g., use
strings.TrimRight(registryBaseURL, "/") + "/registry") and add "strings" to the
imports; locate the registryURL assignment in fc_repository.go (the struct
literal or New function that sets registryURL) and replace the concatenation
with the trimmed form to ensure a single slash.

Comment thread internal/config/config.go
Comment on lines +425 to +433
if val := os.Getenv("OLLA_HEALTH_CIRCUIT_BREAKER_TIMEOUT"); val != "" {
if d, err := time.ParseDuration(val); err == nil {
config.Proxy.CircuitBreaker.Timeout = d
}
}
if val := os.Getenv("OLLA_HEALTH_CIRCUIT_BREAKER_THRESHOLD"); val != "" {
if n, err := strconv.Atoi(val); err == nil && n > 0 {
config.Proxy.CircuitBreaker.Threshold = n
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject non-positive circuit-breaker timeout values.

time.ParseDuration accepts 0 and negative durations, so this path can set an invalid cooldown and destabilise breaker state transitions. Also add the same invariant in Config.Validate() so YAML/config-file values cannot bypass it.

Suggested fix
 if val := os.Getenv("OLLA_HEALTH_CIRCUIT_BREAKER_TIMEOUT"); val != "" {
-	if d, err := time.ParseDuration(val); err == nil {
+	if d, err := time.ParseDuration(val); err == nil && d > 0 {
 		config.Proxy.CircuitBreaker.Timeout = d
 	}
 }
 func (c *Config) Validate() error {
+	if c.Proxy.CircuitBreaker.Timeout <= 0 {
+		return fmt.Errorf("proxy.circuit_breaker.timeout must be > 0, got %s", c.Proxy.CircuitBreaker.Timeout)
+	}
+	if c.Proxy.CircuitBreaker.Threshold <= 0 {
+		return fmt.Errorf("proxy.circuit_breaker.threshold must be > 0, got %d", c.Proxy.CircuitBreaker.Threshold)
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if val := os.Getenv("OLLA_HEALTH_CIRCUIT_BREAKER_TIMEOUT"); val != "" {
if d, err := time.ParseDuration(val); err == nil {
config.Proxy.CircuitBreaker.Timeout = d
}
}
if val := os.Getenv("OLLA_HEALTH_CIRCUIT_BREAKER_THRESHOLD"); val != "" {
if n, err := strconv.Atoi(val); err == nil && n > 0 {
config.Proxy.CircuitBreaker.Threshold = n
}
if val := os.Getenv("OLLA_HEALTH_CIRCUIT_BREAKER_TIMEOUT"); val != "" {
if d, err := time.ParseDuration(val); err == nil && d > 0 {
config.Proxy.CircuitBreaker.Timeout = d
}
}
if val := os.Getenv("OLLA_HEALTH_CIRCUIT_BREAKER_THRESHOLD"); val != "" {
if n, err := strconv.Atoi(val); err == nil && n > 0 {
config.Proxy.CircuitBreaker.Threshold = n
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/config/config.go` around lines 425 - 433, The environment parsing
currently accepts zero or negative durations for Proxy.CircuitBreaker.Timeout
because time.ParseDuration allows them; update the env branch in
internal/config/config.go to only assign config.Proxy.CircuitBreaker.Timeout
when parsed duration d > 0 (reject d <= 0) and keep the existing
positive-integer check for Proxy.CircuitBreaker.Threshold as-is; additionally
add the same invariant checks to Config.Validate() (e.g., verify
Proxy.CircuitBreaker.Timeout > 0 and Proxy.CircuitBreaker.Threshold > 0,
returning validation error if not) so YAML/config-file values cannot bypass the
constraint.

@thushan thushan self-assigned this May 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants