fix(flags): Retry flag requests on transient network errors#198
fix(flags): Retry flag requests on transient network errors#198haacked wants to merge 5 commits into
Conversation
|
Reviews (1): Last reviewed commit: "fix: Retry flag requests on transient ne..." | Re-trigger Greptile |
f582c77 to
75b8626
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds bounded retry-with-backoff behavior to the feature flag request path (PostHog::FeatureFlagsPoller#_request) so transient network errors (e.g., read/write timeouts, connection resets) don’t immediately surface as customer-visible failures, aligning behavior more closely with the event-capture transport.
Changes:
- Introduces
Defaults::FeatureFlags::FLAG_REQUEST_MAX_RETRIES(default1) to bound retries for flag-related HTTP requests. - Adds retry logic with
BackoffPolicyfor a defined set of transient, retry-safe network exceptions inFeatureFlagsPoller#_request. - Updates and adds specs to validate retry behavior and adjusts existing expectations for the additional attempt.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
lib/posthog/feature_flags.rb |
Adds retry-with-backoff around transient network errors in _request and introduces retryable error set. |
lib/posthog/defaults.rb |
Adds a default max-retries constant for feature-flag requests. |
spec/posthog/flags_spec.rb |
Adds new test cases covering retry success/failure behavior for transient errors. |
spec/posthog/feature_flag_spec.rb |
Updates timeout-related assertions to account for one retry and avoids real sleeps. |
public_api_snapshot.txt |
Records the newly added public constant in the API snapshot. |
.changeset/retry-flag-requests.md |
Adds a changeset entry for patch release of posthog-ruby and posthog-rails. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Reviews (2): Last reviewed commit: "Mask tokens in non-retryable error failu..." | Re-trigger Greptile |
Removes redundant error entries from RETRYABLE_REQUEST_ERRORS that are already covered by Timeout::Error (Net::ReadTimeout, Net::WriteTimeout). Updates the comment to clarify that Timeout::Error covers all timeout types. Improves test coverage by: - Stubbing sleep() calls to prevent backoff delays in tests - Adding expectation to verify retry sleep is called - Adding test case for non-retryable network errors
|
This retry happens synchronously in the feature flag request path, so callers of get_feature_flag is_feature_enabled will block during the backoff sleep and retry. That may be okay since it’s bounded to one retry, but it does increase worst-case latency under failures. Should we document this latency tradeoff or make the retry count configurable? (or even opt-in/out?) the other point: Per-call retry is simple and acceptable because it is bounded to one retry, but under a broad connectivity issue every request thread will retry independently. If we want stronger protection, a circuit breaker or retry throttle would be safer than making retry attempts/backoff global, since global mutable attempts can cause cross-request interference and thread-safety issues. Example behavior:
That avoids this bad outage behavior: |
|
i have a few drafts open to port this fix but didnt test/review yet, will wait this one to land + resolve this comment if needed
|
Adds a new `feature_flag_request_max_retries` option that lets callers control how many times to retry feature flag requests after a transient network error. Defaults to 1 (retry once) and can be set to 0 to disable retrying. Each retry sleeps on the calling thread before retrying, so higher values add to worst-case latency. Includes tests for the new configurable behavior.
|
Yeah, a configurable retry makes sense. I added a Regarding the circuit breaker, I agree that would offer stronger protection. But I think that should be a separate issue. Its more than I want to take on in this PR. 😄 |
💡 Motivation and Context
Fixes #195.
The feature flag request path (
FeatureFlagsPoller#_request, used by_request_feature_flag_evaluationforPOST /flags/?v=2and the definitions/remote-config reads) opened a fresh connection per request with a short timeout and no retry. On any transient network error it rescued and immediately re-raised, so a single blip in the client to CDN to gateway path (packet loss plus TCP retransmit, TLS setup jitter, an edge or proxy hiccup) that exceeded the timeout budget became a customer-visibleNet::ReadTimeout, even when the flags service itself was healthy and responding in single-digit milliseconds.The event-capture path (
Transport) already retries with backoff, so the flags path was the inconsistent one.This adds a bounded retry (default 1 retry) with backoff to
_requestfor the transient, idempotent network errors it already rescues:Timeout::Error(includesNet::OpenTimeout),Net::ReadTimeout,Net::WriteTimeout,Errno::ECONNRESET, andEOFError. Flag evaluation is a side-effect-free read, so retrying is safe, and the existingBackoffPolicyis reused. The other rescued errors (Errno::EINVAL,Net::HTTPBadResponse,Net::HTTPHeaderSyntaxError,Net::ProtocolError) and connection-refused (Errno::ECONNREFUSED) still re-raise immediately, so upstream$feature_flag_errortracking is unchanged. After retries are exhausted the original exception still propagates. The retry count lives inDefaults::FeatureFlags::FLAG_REQUEST_MAX_RETRIES.💚 How did you test it?
New
flags_speccases cover the three paths: retries once and succeeds after a transientNet::ReadTimeout, retries onErrno::ECONNRESETthen re-raises once retries are exhausted, and does not retry onErrno::ECONNREFUSED. The existing timeout test infeature_flag_specwas updated to account for the extra retry request, with the result stillnilso the intent is preserved. Full suite is green (553 examples) and rubocop is clean.📝 Checklist
If releasing new changes
pnpm changesetto generate a changeset file🤖 Agent context
Autonomy: Human-driven (agent-assisted)
Phil directed this fix from the GitHub issue. Implemented with Claude Code (session).
Decisions along the way: the retry is exception-only and deliberately does not copy
Transport's status-code retry, since_requestreturns the status code inside the parsed body and the upstream error-tracking layer turns 4xx/5xx intoapi_error_*. Adding status-code retry would have changed that behavior. The retryable set mirrors the transient subset already rescued and excludesErrno::ECONNREFUSEDso a refused connection still fails fast. Placement is in_requestrather than only the evaluation call because all three callers are idempotent reads.