Skip to content

nri-kafka: fix sendSync hanging forever on delivery failure#159

Open
omnibs wants to merge 1 commit into
trunkfrom
fix-sendsync-delivery-failure
Open

nri-kafka: fix sendSync hanging forever on delivery failure#159
omnibs wants to merge 1 commit into
trunkfrom
fix-sendsync-delivery-failure

Conversation

@omnibs

@omnibs omnibs commented Jun 13, 2026

Copy link
Copy Markdown
Member

Problem

Kafka.sendSync blocked on a TMVar that the delivery callback only signalled on the success branch. When a message was enqueued but the broker ultimately failed to deliver it (delivery.timeout.ms exceeded, retries exhausted, a non-retriable broker error, or no available partition leader), the TMVar was never written and the calling request handler parked forever — and the failure was muted from observability.

(The synchronous ImmediateError enqueue-refused path already failed correctly; this was specifically the asynchronous, post-enqueue delivery-report path.)

Fix

  • Add a pure Internal.deliveryReportToResult :: DeliveryReport -> Result Error () that maps every delivery report to a result, and signal the terminator on all branches.
  • A failed delivery now surfaces as a descriptive Task.fail:
    • DeliveryFailed (ProducerRecord, KafkaError) — broker-side delivery failure (carries the record).
    • NoMessageDelivered KafkaError — the message-less NoMessageError report (librdkafka's delivery callback fired with a null message pointer; error read from errno).
  • sendHelperAsync's callback now receives the whole DeliveryReport and fires on every report (was success-only). sendAsync keeps its prior success-only callback contract.
  • The exactly-once-signal contract is preserved (librdkafka invokes the delivery callback exactly once per message).

Error lives in the unexposed Kafka.Internal module, so the new constructors are not a public-API change.

Testing

  • New pure unit tests in test/Spec/Kafka.hs cover all three report branches (developed test-first: watched them fail against a stub, then pass).
  • Full worker integration suite passes against a live broker (12/12), confirming the rewiring doesn't regress sendSync's success path.
  • The broker-side delivery-failure path has no automated integration test by design — a deterministic delivery failure is inherently flaky to reproduce, which is why the dispatch logic was extracted into the pure, unit-tested function.

Docs

  • docs/known-issues.md entry marked resolved (original write-up kept for context).
  • CHANGELOG.md entry added under a new # Unreleased heading.

🤖 Generated with Claude Code

sendSync blocked on a TMVar that the delivery callback only signalled on
the success branch, so a broker-side delivery failure (delivery.timeout.ms
exceeded, retries exhausted, non-retriable error, no partition leader) left
the caller parked forever and muted the error from observability.

Add a pure Internal.deliveryReportToResult that maps every DeliveryReport to
a Result Error (), and signal the terminator on all branches. A failed
delivery now surfaces as a descriptive Task.fail: DeliveryFailed for a
broker-side failure, NoMessageDelivered for the message-less NoMessageError
report. sendAsync keeps its prior success-only callback contract.

The dispatch is unit-tested in test/Spec/Kafka.hs; the full worker
integration suite passes against a live broker.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 13, 2026 05:12

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes Kafka.sendSync potentially hanging forever when a message is enqueued successfully but later fails delivery by ensuring the delivery callback always signals completion and propagates a failure result.

Changes:

  • Add Internal.deliveryReportToResult and new internal Error constructors to represent all delivery outcomes.
  • Rewire sendHelperAsync/sendSync to receive the full DeliveryReport and signal the TMVar on both success and failure.
  • Add unit tests for delivery-report mapping and update docs/changelog accordingly.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
nri-kafka/test/Spec/Kafka.hs Adds pure unit tests for deliveryReportToResult.
nri-kafka/test/Main.hs Registers the new Spec.Kafka tests in the test runner.
nri-kafka/src/Kafka/Internal.hs Adds DeliveryFailed/NoMessageDelivered and implements deliveryReportToResult.
nri-kafka/src/Kafka.hs Updates async/sync send logic so sync send terminates on all delivery reports; updates helper callback shape.
nri-kafka/nri-kafka.cabal Adds Spec.Kafka to the test-suite module list.
nri-kafka/docs/known-issues.md Marks the sendSync hang issue as resolved with an explanation.
nri-kafka/CHANGELOG.md Adds an Unreleased entry describing the fix.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

exampleRecord :: Producer.ProducerRecord
exampleRecord =
Producer.ProducerRecord
{ Producer.prTopic = "the-topic",
Comment on lines 74 to 75
errorToText :: Error -> Text
errorToText err = Text.fromList (Prelude.show err)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants