fix(flags): Retry flag requests on transient network errors by haacked · Pull Request #198 · PostHog/posthog-ruby

haacked · 2026-06-24T21:31:45Z

💡 Motivation and Context

Fixes #195.

The feature flag request path (FeatureFlagsPoller#_request, used by _request_feature_flag_evaluation for POST /flags/?v=2 and the definitions/remote-config reads) opened a fresh connection per request with a short timeout and no retry. On any transient network error it rescued and immediately re-raised, so a single blip in the client to CDN to gateway path (packet loss plus TCP retransmit, TLS setup jitter, an edge or proxy hiccup) that exceeded the timeout budget became a customer-visible Net::ReadTimeout, even when the flags service itself was healthy and responding in single-digit milliseconds.

The event-capture path (Transport) already retries with backoff, so the flags path was the inconsistent one.

This adds a bounded retry (default 1 retry) with backoff to _request for the transient, idempotent network errors it already rescues: Timeout::Error (includes Net::OpenTimeout), Net::ReadTimeout, Net::WriteTimeout, Errno::ECONNRESET, and EOFError. Flag evaluation is a side-effect-free read, so retrying is safe, and the existing BackoffPolicy is reused. The other rescued errors (Errno::EINVAL, Net::HTTPBadResponse, Net::HTTPHeaderSyntaxError, Net::ProtocolError) and connection-refused (Errno::ECONNREFUSED) still re-raise immediately, so upstream $feature_flag_error tracking is unchanged. After retries are exhausted the original exception still propagates. The retry count lives in Defaults::FeatureFlags::FLAG_REQUEST_MAX_RETRIES.

💚 How did you test it?

New flags_spec cases cover the three paths: retries once and succeeds after a transient Net::ReadTimeout, retries on Errno::ECONNRESET then re-raises once retries are exhausted, and does not retry on Errno::ECONNREFUSED. The existing timeout test in feature_flag_spec was updated to account for the extra retry request, with the result still nil so the intent is preserved. Full suite is green (553 examples) and rubocop is clean.

📝 Checklist

I reviewed the submitted code.
I added tests to verify the changes.
I updated the docs if needed.
No breaking change or entry added to the changelog.

If releasing new changes

Ran pnpm changeset to generate a changeset file

🤖 Agent context

Autonomy: Human-driven (agent-assisted)

Phil directed this fix from the GitHub issue. Implemented with Claude Code (session).

Decisions along the way: the retry is exception-only and deliberately does not copy Transport's status-code retry, since _request returns the status code inside the parsed body and the upstream error-tracking layer turns 4xx/5xx into api_error_*. Adding status-code retry would have changed that behavior. The retryable set mirrors the transient subset already rescued and excludes Errno::ECONNREFUSED so a refused connection still fails fast. Placement is in _request rather than only the evaluation call because all three callers are idempotent reads.

greptile-apps · 2026-06-24T21:35:25Z

_{Reviews (1): Last reviewed commit: "fix: Retry flag requests on transient ne..." | Re-trigger Greptile}

Copilot

Pull request overview

This PR adds bounded retry-with-backoff behavior to the feature flag request path (PostHog::FeatureFlagsPoller#_request) so transient network errors (e.g., read/write timeouts, connection resets) don’t immediately surface as customer-visible failures, aligning behavior more closely with the event-capture transport.

Changes:

Introduces Defaults::FeatureFlags::FLAG_REQUEST_MAX_RETRIES (default 1) to bound retries for flag-related HTTP requests.
Adds retry logic with BackoffPolicy for a defined set of transient, retry-safe network exceptions in FeatureFlagsPoller#_request.
Updates and adds specs to validate retry behavior and adjusts existing expectations for the additional attempt.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`lib/posthog/feature_flags.rb`	Adds retry-with-backoff around transient network errors in `_request` and introduces retryable error set.
`lib/posthog/defaults.rb`	Adds a default max-retries constant for feature-flag requests.
`spec/posthog/flags_spec.rb`	Adds new test cases covering retry success/failure behavior for transient errors.
`spec/posthog/feature_flag_spec.rb`	Updates timeout-related assertions to account for one retry and avoids real sleeps.
`public_api_snapshot.txt`	Records the newly added public constant in the API snapshot.
`.changeset/retry-flag-requests.md`	Adds a changeset entry for patch release of `posthog-ruby` and `posthog-rails`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Fixes #195

greptile-apps · 2026-06-24T22:34:08Z

_{Reviews (2): Last reviewed commit: "Mask tokens in non-retryable error failu..." | Re-trigger Greptile}

Removes redundant error entries from RETRYABLE_REQUEST_ERRORS that are already covered by Timeout::Error (Net::ReadTimeout, Net::WriteTimeout). Updates the comment to clarify that Timeout::Error covers all timeout types. Improves test coverage by: - Stubbing sleep() calls to prevent backoff delays in tests - Adding expectation to verify retry sleep is called - Adding test case for non-retryable network errors

marandaneto · 2026-06-25T07:02:56Z

This retry happens synchronously in the feature flag request path, so callers of get_feature_flag is_feature_enabled will block during the backoff sleep and retry. That may be okay since it’s bounded to one retry, but it does increase worst-case latency under failures. Should we document this latency tradeoff or make the retry count configurable? (or even opt-in/out?)

the other point:

Per-call retry is simple and acceptable because it is bounded to one retry, but under a broad connectivity issue every request thread will retry independently. If we want stronger protection, a circuit breaker or retry throttle would be safer than making retry attempts/backoff global, since global mutable attempts can cause cross-request interference and thread-safety issues.

Example behavior:

Each call still gets its own normal attempt.
On transient failure, allow bounded retry.
Track consecutive transient failures globally.
After threshold, open circuit for e.g. 5–30s.
While circuit is open, fail fast / return default / local evaluation instead of sleeping and retrying.
After cooldown, allow one “half-open” probe.
If probe succeeds, close circuit and reset failures.

That avoids this bad outage behavior:

  100 incoming requests
  => 100 initial /flags calls
  => 100 sleeps
  => 100 retries

marandaneto · 2026-06-25T07:06:12Z

i have a few drafts open to port this fix but didnt test/review yet, will wait this one to land + resolve this comment if needed

posthog-js: fix(flags): retry flag requests on transient network errors posthog-js#3961
posthog-rs: fix(flags): retry flag requests on transient network errors posthog-rs#156
posthog-dotnet: fix(flags): retry flag requests on transient network errors posthog-dotnet#245
posthog-go: fix(flags): retry flag requests on transient network errors posthog-go#240
posthog-php: fix(flags): retry flag requests on transient network errors posthog-php#184
posthog-android: fix(flags): retry flag requests on transient network errors posthog-android#587
posthog-ios: fix(flags): retry flag requests on transient network errors posthog-ios#674
posthog-unity: fix(flags): retry flag requests on transient network errors posthog-unity#101

Adds a new `feature_flag_request_max_retries` option that lets callers control how many times to retry feature flag requests after a transient network error. Defaults to 1 (retry once) and can be set to 0 to disable retrying. Each retry sleeps on the calling thread before retrying, so higher values add to worst-case latency. Includes tests for the new configurable behavior.

haacked · 2026-06-25T16:11:34Z

Yeah, a configurable retry makes sense. I added a feature_flag_request_max_retries option (defaults to 1, set to 0 to opt out entirely) and documented the latency tradeoff on the option.

Regarding the circuit breaker, I agree that would offer stronger protection. But I think that should be a separate issue. Its more than I want to take on in this PR. 😄

haacked self-assigned this Jun 24, 2026

greptile-apps Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread spec/posthog/flags_spec.rb

Comment thread lib/posthog/feature_flags.rb Outdated

Retry flag requests on transient network errors

75b8626

haacked force-pushed the haacked/195-retry-flags-requests branch from f582c77 to 75b8626 Compare June 24, 2026 22:16

haacked requested a review from Copilot June 24, 2026 22:16

Copilot started reviewing on behalf of haacked June 24, 2026 22:16 View session

Copilot AI reviewed Jun 24, 2026

View reviewed changes

Comment thread lib/posthog/feature_flags.rb Outdated

Comment thread lib/posthog/feature_flags.rb Outdated

haacked added 2 commits June 24, 2026 15:21

Mask tokens in exhausted-retry failure log

7d37e25

Fixes #195

Mask tokens in non-retryable error failure log

2ea40f4

haacked marked this pull request as ready for review June 24, 2026 22:30

haacked requested a review from a team as a code owner June 24, 2026 22:30

haacked changed the title ~~fix: Retry flag requests on transient network errors~~ fix(flags): Retry flag requests on transient network errors Jun 24, 2026

haacked requested a review from marandaneto June 24, 2026 22:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(flags): Retry flag requests on transient network errors#198

fix(flags): Retry flag requests on transient network errors#198
haacked wants to merge 5 commits into
mainfrom
haacked/195-retry-flags-requests

haacked commented Jun 24, 2026

Uh oh!

greptile-apps Bot commented Jun 24, 2026

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 24, 2026

Uh oh!

marandaneto commented Jun 25, 2026 •

edited

Loading

Uh oh!

marandaneto commented Jun 25, 2026

Uh oh!

haacked commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

haacked commented Jun 24, 2026

💡 Motivation and Context

💚 How did you test it?

📝 Checklist

If releasing new changes

🤖 Agent context

Uh oh!

greptile-apps Bot commented Jun 24, 2026

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 24, 2026

Uh oh!

marandaneto commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

marandaneto commented Jun 25, 2026

Uh oh!

haacked commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

marandaneto commented Jun 25, 2026 •

edited

Loading