BB-780: Configure CRR queue processor poll interval by anurag4DSB · Pull Request #2749 · scality/backbeat

anurag4DSB · 2026-06-05T09:51:27Z

Note to reviewers: the target branch is for another fix, unrelated to this, we are debating if we add that or not. But it was needed for a green CI.

Intent: why does this change exist?

Lowering mpu_parts_concurrency (to stop large MPUs overloading Metadata) makes a single large-MPU replication run longer than Kafka's 5-min max.poll.interval.ms, so the queue processor's consumer gets evicted mid-transfer — the partition is redelivered and you get duplicate work, orphan sproxyd parts, and ghost processes. This exposes that timeout so an operator can raise it and let long transfers finish. It's the short-term mitigation from the "CRR on large MPUs" root-cause plan; the real fix (keep polling while a slow task runs) is separate, larger work.

System impact: what's affected, including downstream?

Only the replication queue processor's Kafka consumer. The value rides through to BackbeatConsumer, which already accepted maxPollIntervalMs — so there are no lib/ changes. Downstream: federation has to render the key for S3C to use it (S3C-11270); Zenko sets it through config directly.

Preserved behavior: what explicitly stays the same?

Unset behaves exactly as today — BackbeatConsumer's existing 300000 default applies, so deployments that set nothing are unchanged. Every other consumer (status processor, lifecycle, GC, notification, metrics, queue populator) is deliberately untouched: this is the shortest, lowest-risk fix for the incident, and the queue processor is the only consumer with the documented slow-task problem. Making the timeout global, or tuning the other consumers, is intentionally left for a separate consistent methodology rather than bolted on here.

Intended change: what's different after this PR?

extensions.replication.queueProcessor.maxPollIntervalMs is now an optional config key (minimum 45000 — librdkafka's session.timeout.ms floor, below which the client refuses to start) wired into the consumer, plus a KAFKA_..._MAX_POLL_INTERVAL_MS entrypoint mapping for container deployments.

Verification: how do we know this worked, or how would we know if it didn't?

New unit tests assert the value passes through from config, stays optional, and is rejected below 45000. Built TDD-style: the first commit adds the failing tests, the second implements the schema + wiring — check out the test commit alone and the spec goes red, proving the tests actually bite.

bert-e · 2026-06-05T09:51:30Z

Hello anurag4dsb,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options

name	description	privileged	authored
`/after_pull_request`	Wait for the given pull request id to be merged before continuing with the current one.
`/bypass_author_approval`	Bypass the pull request author's approval	⭐
`/bypass_build_status`	Bypass the build and test status	⭐
`/bypass_commit_size`	Bypass the check on the size of the changeset `TBA`	⭐
`/bypass_incompatible_branch`	Bypass the check on the source branch prefix	⭐
`/bypass_jira_check`	Bypass the Jira issue check	⭐
`/bypass_peer_approval`	Bypass the pull request peers' approval	⭐
`/bypass_leader_approval`	Bypass the pull request leaders' approval	⭐
`/approve`	Instruct Bert-E that the author has approved the pull request.		✍️
`/create_pull_requests`	Allow the creation of integration pull requests.
`/create_integration_branches`	Allow the creation of integration branches.
`/no_octopus`	Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
`/unanimity`	Change review acceptance criteria from `one reviewer at least` to `all reviewers`
`/wait`	Instruct Bert-E not to run until further notice.

Available commands

name	description	privileged
`/help`	Print Bert-E's manual in the pull request.
`/status`	Print Bert-E's current status in the pull request `TBA`
`/clear`	Remove all comments from Bert-E from the history `TBA`
`/retry`	Re-start a fresh build `TBA`
`/build`	Re-start a fresh build `TBA`
`/force_reset`	Delete integration branches & pull requests, and restart merge process from the beginning.
`/reset`	Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

bert-e · 2026-06-05T09:51:35Z

Incorrect fix version

The Fix Version/s in issue BB-780 contains:

9.0.4.3

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

9.0.27
9.1.12
9.2.7
9.3.5
9.4.1
9.5.0

Please check the Fix Version/s of BB-780, or the target
branch of this pull request.

claude · 2026-06-05T09:54:11Z

LGTM - clean, well-scoped change. The new config key wires correctly through the existing BackbeatConsumer path (which already validates and defaults maxPollIntervalMs). Schema validation, test coverage, and docker-entrypoint mapping are all solid.

Review by Claude Code

bert-e · 2026-06-05T09:54:18Z

Incorrect fix version

The Fix Version/s in issue BB-780 contains:

None

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

9.0.27
9.1.12
9.2.7
9.3.5
9.4.1
9.5.0

Please check the Fix Version/s of BB-780, or the target
branch of this pull request.

codecov · 2026-06-05T10:12:28Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.06%. Comparing base (65f98cc) to head (39325b7).

Additional details and impacted files

Files with missing lines	Coverage Δ
...tensions/replication/ReplicationConfigValidator.js	`69.23% <100.00%> (+1.23%)`	⬆️
...sions/replication/queueProcessor/QueueProcessor.js	`72.96% <ø> (ø)`

... and 3 files with indirect coverage changes

Components	Coverage Δ
Bucket Notification	`80.20% <ø> (ø)`
Core Library	`80.44% <ø> (-0.69%)`	⬇️
Ingestion	`70.30% <ø> (ø)`
Lifecycle	`78.63% <ø> (ø)`
Oplog Populator	`85.06% <ø> (ø)`
Replication	`58.56% <100.00%> (+0.01%)`	⬆️
Bucket Scanner	`85.76% <ø> (ø)`

@@                               Coverage Diff                               @@
##           bugfix/BB-782-deflake-consumer-metrics-test    #2749      +/-   ##
===============================================================================
- Coverage                                        74.34%   74.06%   -0.28%     
===============================================================================
  Files                                              201      201              
  Lines                                            13485    13486       +1     
===============================================================================
- Hits                                             10025     9988      -37     
- Misses                                            3450     3488      +38     
  Partials                                            10       10

Flag	Coverage Δ
api:retry	`9.47% <100.00%> (+<0.01%)`	⬆️
api:routes	`9.28% <100.00%> (+<0.01%)`	⬆️
bucket-scanner	`85.76% <ø> (ø)`
ft_test:queuepopulator	`8.93% <100.00%> (-1.92%)`	⬇️
ingestion	`12.56% <100.00%> (+<0.01%)`	⬆️
lib	`7.52% <100.00%> (-0.02%)`	⬇️
lifecycle	`18.61% <100.00%> (+0.02%)`	⬆️
notification	`1.03% <0.00%> (-0.01%)`	⬇️
replication	`18.59% <100.00%> (+<0.01%)`	⬆️
unit	`50.05% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

TDD: specify the contract first. The replication queue processor accepts an optional maxPollIntervalMs in its config block; tests/config.json sets a distinctive value (350000) so the pass-through is provable, the value is optional, and values below 45000 are rejected. These tests fail until the next commit adds the schema and wiring (both pushed together so CI sees the green tip).

max.poll.interval.ms was pinned to its 5 minute default. Lowering mpu_parts_concurrency to protect Metadata makes a large-MPU replication exceed it, which evicts the queue processor's consumer and redelivers the partition - losing the partial transfer and leaving orphan parts. Let the replication queue processor set an optional maxPollIntervalMs in its config block (min 45000, the librdkafka session.timeout.ms floor); unset falls back to BackbeatConsumer's existing 300000 default. Scoped to the queue processor - the only consumer with the slow-task problem (the status processor does fast metadata writes). Adds a matching EXTENSIONS_REPLICATION_QUEUE_PROCESSOR_MAX_POLL_INTERVAL_MS mapping.

claude · 2026-06-05T13:18:37Z

docker-entrypoint.sh:245 — unquoted env var breaks the file's quoting convention; a non-numeric value would silently produce an empty config (see inline suggestion)

Config wiring, Joi validation, BackbeatConsumer passthrough, and tests all look correct. The bootstrap consume-guard (BB-782) and warning log are solid improvements.

Review by Claude Code

claude · 2026-06-05T13:24:15Z

LGTM — wiring is correct end-to-end: env var (docker-entrypoint.sh) -> Joi-validated config (ReplicationConfigValidator.js) -> QueueProcessor -> BackbeatConsumer. The property name maxPollIntervalMs matches BackbeatConsumer's existing interface, which already accepted it with a 300000 default, so unset deployments are unchanged. The 45000 floor matches librdkafka's session.timeout.ms constraint. Tests cover pass-through, optional-when-unset, and below-minimum rejection.

Review by Claude Code

anurag4DSB · 2026-06-05T14:16:54Z

docker-entrypoint.sh:245 — unquoted env var breaks the file's quoting convention; a non-numeric value would silently produce an empty config (see inline suggestion)Config wiring, Joi validation, BackbeatConsumer passthrough, and tests all look correct. The bootstrap consume-guard (BB-782) and warning log are solid improvements.Review by Claude Code

This is handled

BourgoisMickael · 2026-06-05T16:56:06Z

        groupId: joi.string().required(),
        retry: qpRetryJoi,
        concurrency: joi.number().greater(0).default(10),
+        maxPollIntervalMs: joi.number().min(MAX_POLL_INTERVAL_MS_MIN),


Should you also validate a maximum ?

What happens if we put 1 hour, or 1 day ? I suppose on a crash of the consumer, kafka will not notice it until the full interval is elasped, so we might wait a long time before another consumer can take its place

nicolas2bert

let's add a max limit to 30 minutes which let 1 hour for a task to finish (drain window) which should be more the enough for all our cases.

anurag4DSB mentioned this pull request Jun 5, 2026

BB-782: Fix bootstrap consumer stealing messages #2752

Open

anurag4DSB added 2 commits June 5, 2026 15:14

anurag4DSB force-pushed the improvement/BB-780-crr-poll-interval branch from 7ceacca to efc1331 Compare June 5, 2026 13:14

claude Bot reviewed Jun 5, 2026

View reviewed changes

Comment thread docker-entrypoint.sh Outdated

anurag4DSB changed the base branch from development/9.0 to bugfix/BB-782-deflake-consumer-metrics-test June 5, 2026 13:19

fixup! BB-780: Configure CRR queue processor poll interval

39325b7

anurag4DSB marked this pull request as ready for review June 5, 2026 13:54

BourgoisMickael approved these changes Jun 5, 2026

View reviewed changes

BourgoisMickael reviewed Jun 5, 2026

View reviewed changes

nicolas2bert approved these changes Jun 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BB-780: Configure CRR queue processor poll interval#2749

BB-780: Configure CRR queue processor poll interval#2749
anurag4DSB wants to merge 3 commits into
bugfix/BB-782-deflake-consumer-metrics-testfrom
improvement/BB-780-crr-poll-interval

anurag4DSB commented Jun 5, 2026 •

edited

Loading

Uh oh!

bert-e commented Jun 5, 2026

Uh oh!

bert-e commented Jun 5, 2026

Uh oh!

claude Bot commented Jun 5, 2026

Uh oh!

bert-e commented Jun 5, 2026

Uh oh!

codecov Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

Uh oh!

claude Bot commented Jun 5, 2026

Uh oh!

claude Bot commented Jun 5, 2026

Uh oh!

anurag4DSB commented Jun 5, 2026 •

edited by atlassian Bot

Loading

Uh oh!

BourgoisMickael Jun 5, 2026

Uh oh!

nicolas2bert left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

anurag4DSB commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Intent: why does this change exist?

System impact: what's affected, including downstream?

Preserved behavior: what explicitly stays the same?

Intended change: what's different after this PR?

Verification: how do we know this worked, or how would we know if it didn't?

Uh oh!

bert-e commented Jun 5, 2026

Hello anurag4dsb,

Uh oh!

bert-e commented Jun 5, 2026

Incorrect fix version

Uh oh!

claude Bot commented Jun 5, 2026

Uh oh!

bert-e commented Jun 5, 2026

Incorrect fix version

Uh oh!

codecov Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

claude Bot commented Jun 5, 2026

Uh oh!

claude Bot commented Jun 5, 2026

Uh oh!

anurag4DSB commented Jun 5, 2026 • edited by atlassian Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BourgoisMickael Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

nicolas2bert left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

anurag4DSB commented Jun 5, 2026 •

edited

Loading

codecov Bot commented Jun 5, 2026 •

edited

Loading

anurag4DSB commented Jun 5, 2026 •

edited by atlassian Bot

Loading