BB-782: Fix bootstrap consumer stealing messages#2752
Conversation
The test slept a fixed 5 seconds and then read the published offsets from zookeeper exactly once, and its prometheus check was a floating promise whose assertion failures were lost as unhandled rejections. Poll the offsets instead (10 attempts, 1s apart, 2s cap per attempt), failing with a message that names the offsets, and chain the prometheus check into done() so its assertions count. This hardens the test; the root cause of the flake is the bootstrap consumer, fixed in the next commit.
The test-only bootstrap mechanism issued a consume request every 200ms unconditionally, while each request waits up to one second in the kafka client. Bootstrap takes several seconds, so a backlog of stale one-shot requests kept draining long after bootstrap completed. The first messages produced by a test then landed in a stale request, whose callback drops anything without a bootstrapId: the message was consumed but never delivered to the processing pipeline, and the test timed out waiting for it. Slow CI runners stretch bootstrap, widening the theft window, hence the flake. Allow only one in-flight bootstrap consume request, the same rule the regular consume loop enforces with its pending-requests gate, so the final bootstrap message arrives on the only outstanding request and none survive bootstrap. Also warn if a non-bootstrap message is ever dropped here again instead of staying silent. Reproduced at ~10% locally against a CPU-throttled broker before the fix; 40 consecutive passes after.
Hello anurag4dsb,My role is to assist you with the merge of this Available options
Available commands
Status report is not available. |
|
LGTM — clean fix scoped to the test-only bootstrap path. The one-request-in-flight gate mirrors the regular consume loop's existing |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files
... and 2 files with indirect coverage changes
@@ Coverage Diff @@
## development/9.0 #2752 +/- ##
===================================================
+ Coverage 74.07% 74.34% +0.26%
===================================================
Files 201 201
Lines 13480 13485 +5
===================================================
+ Hits 9985 10025 +40
+ Misses 3485 3450 -35
Partials 10 10
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Incorrect fix versionThe
Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:
Please check the |
Incorrect fix versionThe
Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:
Please check the |
Request integration branchesWaiting for integration branch creation to be requested by the user. To request integration branches, please comment on this pull request with the following command: Alternatively, the |
Intent: why does this change exist?
The "read messages … and publish topic metrics" functional test fails CI with a bare 30s
timeout on roughly half the runs on loaded runners — it blocked PR #2749 twice and fails on
pristine
development/9.0. Root cause, confirmed by tracing: the test-only bootstrap mechanismissues a kafka consume request every 200ms while each request waits up to 1s inside the client,
so a backlog of stale requests outlives bootstrap and steals the first real messages a test
produces — bootstrap's callback silently drops anything that isn't a bootstrap sentinel.
System impact: what's affected, including downstream?
Only
_bootstrapConsumer— a test-only path (bootstrap: trueis set exclusively fromtests/; the option is documented "TEST ONLY"). Every bootstrap-using suite (lib, replicationqueueProcessor) inherits the fix. No production data path changes.
Preserved behavior: what explicitly stays the same?
The bootstrap protocol is unchanged: sentinel every 5s, commit past the sentinel, unsubscribe,
hand off to the real loop. The stacked requests were not useful parallelism — bootstrap fetches
one sentinel per 5s, and the regular consume loop already enforces the same one-request-in-flight
rule via its
_nConsumePendingRequestsgate. Production consumers never enter this code.Intended change: what's different after this PR?
At most one in-flight bootstrap consume request, so the final sentinel necessarily arrives on
the only outstanding request and none survive bootstrap to steal messages — plus a warn log if
bootstrap ever drops a non-bootstrap message again. In the test itself: the ZK offsets check
polls with bounded attempts instead of a fixed 5s sleep + one-shot read, and the prometheus
check is chained into
done(it was a floating promise whose assertion failures were lost asunhandled rejections).
Verification: how do we know this worked, or how would we know if it didn't?
Reproduced locally at ~10% by CPU-throttling the CI kafka image; tracing captured the theft
directly (3 messages produced, the pipeline received 2, the commit skipped over the eaten one).
With the fix: 40 consecutive local passes under the same throttle and a green CI matrix. For
the before/after on CI itself: an empty commit on this branch failed
lib testswith zero codechanges (run 27012249826), the
fix went green (run 27014743787).
If bootstrap ever eats a message again, the new warn makes it visible in the logs instead of
silent.