Skip to content

BB-780: Configure CRR queue processor poll interval#2749

Open
anurag4DSB wants to merge 3 commits into
bugfix/BB-782-deflake-consumer-metrics-testfrom
improvement/BB-780-crr-poll-interval
Open

BB-780: Configure CRR queue processor poll interval#2749
anurag4DSB wants to merge 3 commits into
bugfix/BB-782-deflake-consumer-metrics-testfrom
improvement/BB-780-crr-poll-interval

Conversation

@anurag4DSB
Copy link
Copy Markdown
Contributor

@anurag4DSB anurag4DSB commented Jun 5, 2026

Note to reviewers: the target branch is for another fix, unrelated to this, we are debating if we add that or not. But it was needed for a green CI.

Intent: why does this change exist?

Lowering mpu_parts_concurrency (to stop large MPUs overloading Metadata) makes a single large-MPU replication run longer than Kafka's 5-min max.poll.interval.ms, so the queue processor's consumer gets evicted mid-transfer — the partition is redelivered and you get duplicate work, orphan sproxyd parts, and ghost processes. This exposes that timeout so an operator can raise it and let long transfers finish. It's the short-term mitigation from the "CRR on large MPUs" root-cause plan; the real fix (keep polling while a slow task runs) is separate, larger work.

System impact: what's affected, including downstream?

Only the replication queue processor's Kafka consumer. The value rides through to BackbeatConsumer, which already accepted maxPollIntervalMs — so there are no lib/ changes. Downstream: federation has to render the key for S3C to use it (S3C-11270); Zenko sets it through config directly.

Preserved behavior: what explicitly stays the same?

Unset behaves exactly as today — BackbeatConsumer's existing 300000 default applies, so deployments that set nothing are unchanged. Every other consumer (status processor, lifecycle, GC, notification, metrics, queue populator) is deliberately untouched: this is the shortest, lowest-risk fix for the incident, and the queue processor is the only consumer with the documented slow-task problem. Making the timeout global, or tuning the other consumers, is intentionally left for a separate consistent methodology rather than bolted on here.

Intended change: what's different after this PR?

extensions.replication.queueProcessor.maxPollIntervalMs is now an optional config key (minimum 45000 — librdkafka's session.timeout.ms floor, below which the client refuses to start) wired into the consumer, plus a KAFKA_..._MAX_POLL_INTERVAL_MS entrypoint mapping for container deployments.

Verification: how do we know this worked, or how would we know if it didn't?

New unit tests assert the value passes through from config, stays optional, and is rejected below 45000. Built TDD-style: the first commit adds the failing tests, the second implements the schema + wiring — check out the test commit alone and the spec goes red, proving the tests actually bite.

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Jun 5, 2026

Hello anurag4dsb,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Jun 5, 2026

Incorrect fix version

The Fix Version/s in issue BB-780 contains:

  • 9.0.4.3

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.0.27

  • 9.1.12

  • 9.2.7

  • 9.3.5

  • 9.4.1

  • 9.5.0

Please check the Fix Version/s of BB-780, or the target
branch of this pull request.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 5, 2026

LGTM - clean, well-scoped change. The new config key wires correctly through the existing BackbeatConsumer path (which already validates and defaults maxPollIntervalMs). Schema validation, test coverage, and docker-entrypoint mapping are all solid.

Review by Claude Code

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Jun 5, 2026

Incorrect fix version

The Fix Version/s in issue BB-780 contains:

  • None

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.0.27

  • 9.1.12

  • 9.2.7

  • 9.3.5

  • 9.4.1

  • 9.5.0

Please check the Fix Version/s of BB-780, or the target
branch of this pull request.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.06%. Comparing base (65f98cc) to head (39325b7).

Additional details and impacted files

Impacted file tree graph

Files with missing lines Coverage Δ
...tensions/replication/ReplicationConfigValidator.js 69.23% <100.00%> (+1.23%) ⬆️
...sions/replication/queueProcessor/QueueProcessor.js 72.96% <ø> (ø)

... and 3 files with indirect coverage changes

Components Coverage Δ
Bucket Notification 80.20% <ø> (ø)
Core Library 80.44% <ø> (-0.69%) ⬇️
Ingestion 70.30% <ø> (ø)
Lifecycle 78.63% <ø> (ø)
Oplog Populator 85.06% <ø> (ø)
Replication 58.56% <100.00%> (+0.01%) ⬆️
Bucket Scanner 85.76% <ø> (ø)
@@                               Coverage Diff                               @@
##           bugfix/BB-782-deflake-consumer-metrics-test    #2749      +/-   ##
===============================================================================
- Coverage                                        74.34%   74.06%   -0.28%     
===============================================================================
  Files                                              201      201              
  Lines                                            13485    13486       +1     
===============================================================================
- Hits                                             10025     9988      -37     
- Misses                                            3450     3488      +38     
  Partials                                            10       10              
Flag Coverage Δ
api:retry 9.47% <100.00%> (+<0.01%) ⬆️
api:routes 9.28% <100.00%> (+<0.01%) ⬆️
bucket-scanner 85.76% <ø> (ø)
ft_test:queuepopulator 8.93% <100.00%> (-1.92%) ⬇️
ingestion 12.56% <100.00%> (+<0.01%) ⬆️
lib 7.52% <100.00%> (-0.02%) ⬇️
lifecycle 18.61% <100.00%> (+0.02%) ⬆️
notification 1.03% <0.00%> (-0.01%) ⬇️
replication 18.59% <100.00%> (+<0.01%) ⬆️
unit 50.05% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

TDD: specify the contract first. The replication queue processor
accepts an optional maxPollIntervalMs in its config block;
tests/config.json sets a distinctive value (350000) so the
pass-through is provable, the value is optional, and values below
45000 are rejected. These tests fail until the next commit adds the
schema and wiring (both pushed together so CI sees the green tip).
max.poll.interval.ms was pinned to its 5 minute default. Lowering
mpu_parts_concurrency to protect Metadata makes a large-MPU
replication exceed it, which evicts the queue processor's consumer
and redelivers the partition - losing the partial transfer and
leaving orphan parts. Let the replication queue processor set an
optional maxPollIntervalMs in its config block (min 45000, the
librdkafka session.timeout.ms floor); unset falls back to
BackbeatConsumer's existing 300000 default. Scoped to the queue
processor - the only consumer with the slow-task problem (the status
processor does fast metadata writes). Adds a matching
EXTENSIONS_REPLICATION_QUEUE_PROCESSOR_MAX_POLL_INTERVAL_MS mapping.
@anurag4DSB anurag4DSB force-pushed the improvement/BB-780-crr-poll-interval branch from 7ceacca to efc1331 Compare June 5, 2026 13:14
Comment thread docker-entrypoint.sh Outdated
@claude
Copy link
Copy Markdown

claude Bot commented Jun 5, 2026

  • docker-entrypoint.sh:245 — unquoted env var breaks the file's quoting convention; a non-numeric value would silently produce an empty config (see inline suggestion)

    Config wiring, Joi validation, BackbeatConsumer passthrough, and tests all look correct. The bootstrap consume-guard (BB-782) and warning log are solid improvements.

    Review by Claude Code

@anurag4DSB anurag4DSB changed the base branch from development/9.0 to bugfix/BB-782-deflake-consumer-metrics-test June 5, 2026 13:19
@claude
Copy link
Copy Markdown

claude Bot commented Jun 5, 2026

LGTM — wiring is correct end-to-end: env var (docker-entrypoint.sh) -> Joi-validated config (ReplicationConfigValidator.js) -> QueueProcessor -> BackbeatConsumer. The property name maxPollIntervalMs matches BackbeatConsumer's existing interface, which already accepted it with a 300000 default, so unset deployments are unchanged. The 45000 floor matches librdkafka's session.timeout.ms constraint. Tests cover pass-through, optional-when-unset, and below-minimum rejection.

Review by Claude Code

@anurag4DSB anurag4DSB marked this pull request as ready for review June 5, 2026 13:54
@anurag4DSB
Copy link
Copy Markdown
Contributor Author

anurag4DSB commented Jun 5, 2026

  • docker-entrypoint.sh:245 — unquoted env var breaks the file's quoting convention; a non-numeric value would silently produce an empty config (see inline suggestion)Config wiring, Joi validation, BackbeatConsumer passthrough, and tests all look correct. The bootstrap consume-guard (BB-782) and warning log are solid improvements.Review by Claude Code

This is handled

groupId: joi.string().required(),
retry: qpRetryJoi,
concurrency: joi.number().greater(0).default(10),
maxPollIntervalMs: joi.number().min(MAX_POLL_INTERVAL_MS_MIN),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you also validate a maximum ?

What happens if we put 1 hour, or 1 day ? I suppose on a crash of the consumer, kafka will not notice it until the full interval is elasped, so we might wait a long time before another consumer can take its place

Copy link
Copy Markdown
Contributor

@nicolas2bert nicolas2bert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add a max limit to 30 minutes which let 1 hour for a task to finish (drain window) which should be more the enough for all our cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants