Skip to content

BB-782: Fix bootstrap consumer stealing messages#2752

Open
anurag4DSB wants to merge 2 commits into
development/9.0from
bugfix/BB-782-deflake-consumer-metrics-test
Open

BB-782: Fix bootstrap consumer stealing messages#2752
anurag4DSB wants to merge 2 commits into
development/9.0from
bugfix/BB-782-deflake-consumer-metrics-test

Conversation

@anurag4DSB
Copy link
Copy Markdown
Contributor

Intent: why does this change exist?

The "read messages … and publish topic metrics" functional test fails CI with a bare 30s
timeout on roughly half the runs on loaded runners — it blocked PR #2749 twice and fails on
pristine development/9.0. Root cause, confirmed by tracing: the test-only bootstrap mechanism
issues a kafka consume request every 200ms while each request waits up to 1s inside the client,
so a backlog of stale requests outlives bootstrap and steals the first real messages a test
produces — bootstrap's callback silently drops anything that isn't a bootstrap sentinel.

System impact: what's affected, including downstream?

Only _bootstrapConsumer — a test-only path (bootstrap: true is set exclusively from
tests/; the option is documented "TEST ONLY"). Every bootstrap-using suite (lib, replication
queueProcessor) inherits the fix. No production data path changes.

Preserved behavior: what explicitly stays the same?

The bootstrap protocol is unchanged: sentinel every 5s, commit past the sentinel, unsubscribe,
hand off to the real loop. The stacked requests were not useful parallelism — bootstrap fetches
one sentinel per 5s, and the regular consume loop already enforces the same one-request-in-flight
rule via its _nConsumePendingRequests gate. Production consumers never enter this code.

Intended change: what's different after this PR?

At most one in-flight bootstrap consume request, so the final sentinel necessarily arrives on
the only outstanding request and none survive bootstrap to steal messages — plus a warn log if
bootstrap ever drops a non-bootstrap message again. In the test itself: the ZK offsets check
polls with bounded attempts instead of a fixed 5s sleep + one-shot read, and the prometheus
check is chained into done (it was a floating promise whose assertion failures were lost as
unhandled rejections).

Verification: how do we know this worked, or how would we know if it didn't?

Reproduced locally at ~10% by CPU-throttling the CI kafka image; tracing captured the theft
directly (3 messages produced, the pipeline received 2, the commit skipped over the eaten one).
With the fix: 40 consecutive local passes under the same throttle and a green CI matrix. For
the before/after on CI itself: an empty commit on this branch failed lib tests with zero code
changes (run 27012249826), the
fix went green (run 27014743787).
If bootstrap ever eats a message again, the new warn makes it visible in the logs instead of
silent.

The test slept a fixed 5 seconds and then read the published offsets
from zookeeper exactly once, and its prometheus check was a floating
promise whose assertion failures were lost as unhandled rejections.

Poll the offsets instead (10 attempts, 1s apart, 2s cap per attempt),
failing with a message that names the offsets, and chain the
prometheus check into done() so its assertions count. This hardens
the test; the root cause of the flake is the bootstrap consumer,
fixed in the next commit.
The test-only bootstrap mechanism issued a consume request every
200ms unconditionally, while each request waits up to one second in
the kafka client. Bootstrap takes several seconds, so a backlog of
stale one-shot requests kept draining long after bootstrap completed.
The first messages produced by a test then landed in a stale request,
whose callback drops anything without a bootstrapId: the message was
consumed but never delivered to the processing pipeline, and the test
timed out waiting for it. Slow CI runners stretch bootstrap, widening
the theft window, hence the flake.

Allow only one in-flight bootstrap consume request, the same rule the
regular consume loop enforces with its pending-requests gate, so the
final bootstrap message arrives on the only outstanding request and
none survive bootstrap. Also warn if a non-bootstrap message is ever
dropped here again instead of staying silent.

Reproduced at ~10% locally against a CPU-throttled broker before the
fix; 40 consecutive passes after.
@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Jun 5, 2026

Hello anurag4dsb,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Jun 5, 2026

Incorrect fix version

The Fix Version/s in issue BB-782 contains:

  • None

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.0.27

  • 9.1.12

  • 9.2.7

  • 9.3.5

  • 9.4.1

  • 9.5.0

Please check the Fix Version/s of BB-782, or the target
branch of this pull request.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 5, 2026

LGTM — clean fix scoped to the test-only bootstrap path. The one-request-in-flight gate mirrors the regular consume loop's existing _nConsumePendingRequests pattern, and the test improvements (retry-based ZK offset check, properly chained promise for the Prometheus assertion) are correct. No production data paths changed.

Review by Claude Code

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.34%. Comparing base (7e5b5f6) to head (65f98cc).

Additional details and impacted files

Impacted file tree graph

Files with missing lines Coverage Δ
lib/BackbeatConsumer.js 94.08% <100.00%> (+0.08%) ⬆️

... and 2 files with indirect coverage changes

Components Coverage Δ
Bucket Notification 80.20% <ø> (ø)
Core Library 81.12% <100.00%> (+0.64%) ⬆️
Ingestion 70.30% <ø> (ø)
Lifecycle 78.63% <ø> (ø)
Oplog Populator 85.06% <ø> (ø)
Replication 58.54% <ø> (ø)
Bucket Scanner 85.76% <ø> (ø)
@@                 Coverage Diff                 @@
##           development/9.0    #2752      +/-   ##
===================================================
+ Coverage            74.07%   74.34%   +0.26%     
===================================================
  Files                  201      201              
  Lines                13480    13485       +5     
===================================================
+ Hits                  9985    10025      +40     
+ Misses                3485     3450      -35     
  Partials                10       10              
Flag Coverage Δ
api:retry 9.47% <0.00%> (-0.01%) ⬇️
api:routes 9.28% <0.00%> (-0.01%) ⬇️
bucket-scanner 85.76% <ø> (ø)
ft_test:queuepopulator 10.84% <0.00%> (+1.91%) ⬆️
ingestion 12.56% <0.00%> (-0.01%) ⬇️
lib 7.53% <100.00%> (+0.03%) ⬆️
lifecycle 18.59% <83.33%> (+0.02%) ⬆️
notification 1.03% <0.00%> (-0.01%) ⬇️
replication 18.58% <83.33%> (+0.02%) ⬆️
unit 50.05% <0.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@anurag4DSB anurag4DSB requested a review from nicolas2bert June 5, 2026 13:09
@anurag4DSB anurag4DSB marked this pull request as ready for review June 5, 2026 13:54
@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Jun 5, 2026

Incorrect fix version

The Fix Version/s in issue BB-782 contains:

  • 9.0.27

  • 9.1.12

  • 9.2.7

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.0.27

  • 9.1.12

  • 9.2.7

  • 9.3.5

  • 9.4.1

  • 9.5.0

Please check the Fix Version/s of BB-782, or the target
branch of this pull request.

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Jun 5, 2026

Incorrect fix version

The Fix Version/s in issue BB-782 contains:

  • 9.0.27

  • 9.1.12

  • 9.2.7

  • 9.3.5

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.0.27

  • 9.1.12

  • 9.2.7

  • 9.3.5

  • 9.4.1

  • 9.5.0

Please check the Fix Version/s of BB-782, or the target
branch of this pull request.

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Jun 5, 2026

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants