Skip to content

fix(flags): Retry flag requests on transient network errors#198

Open
haacked wants to merge 5 commits into
mainfrom
haacked/195-retry-flags-requests
Open

fix(flags): Retry flag requests on transient network errors#198
haacked wants to merge 5 commits into
mainfrom
haacked/195-retry-flags-requests

Conversation

@haacked

@haacked haacked commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

💡 Motivation and Context

Fixes #195.

The feature flag request path (FeatureFlagsPoller#_request, used by _request_feature_flag_evaluation for POST /flags/?v=2 and the definitions/remote-config reads) opened a fresh connection per request with a short timeout and no retry. On any transient network error it rescued and immediately re-raised, so a single blip in the client to CDN to gateway path (packet loss plus TCP retransmit, TLS setup jitter, an edge or proxy hiccup) that exceeded the timeout budget became a customer-visible Net::ReadTimeout, even when the flags service itself was healthy and responding in single-digit milliseconds.

The event-capture path (Transport) already retries with backoff, so the flags path was the inconsistent one.

This adds a bounded retry (default 1 retry) with backoff to _request for the transient, idempotent network errors it already rescues: Timeout::Error (includes Net::OpenTimeout), Net::ReadTimeout, Net::WriteTimeout, Errno::ECONNRESET, and EOFError. Flag evaluation is a side-effect-free read, so retrying is safe, and the existing BackoffPolicy is reused. The other rescued errors (Errno::EINVAL, Net::HTTPBadResponse, Net::HTTPHeaderSyntaxError, Net::ProtocolError) and connection-refused (Errno::ECONNREFUSED) still re-raise immediately, so upstream $feature_flag_error tracking is unchanged. After retries are exhausted the original exception still propagates. The retry count lives in Defaults::FeatureFlags::FLAG_REQUEST_MAX_RETRIES.

💚 How did you test it?

New flags_spec cases cover the three paths: retries once and succeeds after a transient Net::ReadTimeout, retries on Errno::ECONNRESET then re-raises once retries are exhausted, and does not retry on Errno::ECONNREFUSED. The existing timeout test in feature_flag_spec was updated to account for the extra retry request, with the result still nil so the intent is preserved. Full suite is green (553 examples) and rubocop is clean.

📝 Checklist

  • I reviewed the submitted code.
  • I added tests to verify the changes.
  • I updated the docs if needed.
  • No breaking change or entry added to the changelog.

If releasing new changes

  • Ran pnpm changeset to generate a changeset file

🤖 Agent context

Autonomy: Human-driven (agent-assisted)

Phil directed this fix from the GitHub issue. Implemented with Claude Code (session).

Decisions along the way: the retry is exception-only and deliberately does not copy Transport's status-code retry, since _request returns the status code inside the parsed body and the upstream error-tracking layer turns 4xx/5xx into api_error_*. Adding status-code retry would have changed that behavior. The retryable set mirrors the transient subset already rescued and excludes Errno::ECONNREFUSED so a refused connection still fails fast. Placement is in _request rather than only the evaluation call because all three callers are idempotent reads.

@haacked haacked self-assigned this Jun 24, 2026
@greptile-apps

greptile-apps Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Reviews (1): Last reviewed commit: "fix: Retry flag requests on transient ne..." | Re-trigger Greptile

Comment thread spec/posthog/flags_spec.rb
Comment thread lib/posthog/feature_flags.rb Outdated
@haacked haacked force-pushed the haacked/195-retry-flags-requests branch from f582c77 to 75b8626 Compare June 24, 2026 22:16
@haacked haacked requested a review from Copilot June 24, 2026 22:16

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds bounded retry-with-backoff behavior to the feature flag request path (PostHog::FeatureFlagsPoller#_request) so transient network errors (e.g., read/write timeouts, connection resets) don’t immediately surface as customer-visible failures, aligning behavior more closely with the event-capture transport.

Changes:

  • Introduces Defaults::FeatureFlags::FLAG_REQUEST_MAX_RETRIES (default 1) to bound retries for flag-related HTTP requests.
  • Adds retry logic with BackoffPolicy for a defined set of transient, retry-safe network exceptions in FeatureFlagsPoller#_request.
  • Updates and adds specs to validate retry behavior and adjusts existing expectations for the additional attempt.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
lib/posthog/feature_flags.rb Adds retry-with-backoff around transient network errors in _request and introduces retryable error set.
lib/posthog/defaults.rb Adds a default max-retries constant for feature-flag requests.
spec/posthog/flags_spec.rb Adds new test cases covering retry success/failure behavior for transient errors.
spec/posthog/feature_flag_spec.rb Updates timeout-related assertions to account for one retry and avoids real sleeps.
public_api_snapshot.txt Records the newly added public constant in the API snapshot.
.changeset/retry-flag-requests.md Adds a changeset entry for patch release of posthog-ruby and posthog-rails.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/posthog/feature_flags.rb Outdated
Comment thread lib/posthog/feature_flags.rb Outdated
@haacked haacked marked this pull request as ready for review June 24, 2026 22:30
@haacked haacked requested a review from a team as a code owner June 24, 2026 22:30
@haacked haacked changed the title fix: Retry flag requests on transient network errors fix(flags): Retry flag requests on transient network errors Jun 24, 2026
@greptile-apps

greptile-apps Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Reviews (2): Last reviewed commit: "Mask tokens in non-retryable error failu..." | Re-trigger Greptile

Removes redundant error entries from RETRYABLE_REQUEST_ERRORS that are
already covered by Timeout::Error (Net::ReadTimeout, Net::WriteTimeout).
Updates the comment to clarify that Timeout::Error covers all timeout types.

Improves test coverage by:
- Stubbing sleep() calls to prevent backoff delays in tests
- Adding expectation to verify retry sleep is called
- Adding test case for non-retryable network errors
@haacked haacked requested a review from marandaneto June 24, 2026 22:49
@marandaneto

marandaneto commented Jun 25, 2026

Copy link
Copy Markdown
Member

This retry happens synchronously in the feature flag request path, so callers of get_feature_flag is_feature_enabled will block during the backoff sleep and retry. That may be okay since it’s bounded to one retry, but it does increase worst-case latency under failures. Should we document this latency tradeoff or make the retry count configurable? (or even opt-in/out?)

the other point:

Per-call retry is simple and acceptable because it is bounded to one retry, but under a broad connectivity issue every request thread will retry independently. If we want stronger protection, a circuit breaker or retry throttle would be safer than making retry attempts/backoff global, since global mutable attempts can cause cross-request interference and thread-safety issues.

Example behavior:

  • Each call still gets its own normal attempt.
  • On transient failure, allow bounded retry.
  • Track consecutive transient failures globally.
  • After threshold, open circuit for e.g. 5–30s.
  • While circuit is open, fail fast / return default / local evaluation instead of sleeping and retrying.
  • After cooldown, allow one “half-open” probe.
  • If probe succeeds, close circuit and reset failures.

That avoids this bad outage behavior:

  100 incoming requests
  => 100 initial /flags calls
  => 100 sleeps
  => 100 retries

Adds a new `feature_flag_request_max_retries` option that lets callers
control how many times to retry feature flag requests after a transient
network error. Defaults to 1 (retry once) and can be set to 0 to disable
retrying. Each retry sleeps on the calling thread before retrying, so
higher values add to worst-case latency.

Includes tests for the new configurable behavior.
@haacked

haacked commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Yeah, a configurable retry makes sense. I added a feature_flag_request_max_retries option (defaults to 1, set to 0 to opt out entirely) and documented the latency tradeoff on the option.

Regarding the circuit breaker, I agree that would offer stronger protection. But I think that should be a separate issue. Its more than I want to take on in this PR. 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature flag evaluation request has no retry on transient network errors

3 participants