Skip to content

Durability: ack-after-delivery + drain on shutdown, proven by an e2e suite#16

Merged
cjimti merged 3 commits into
masterfrom
1-durability-ack-after-delivery
Jun 17, 2026
Merged

Durability: ack-after-delivery + drain on shutdown, proven by an e2e suite#16
cjimti merged 3 commits into
masterfrom
1-durability-ack-after-delivery

Conversation

@cjimti

@cjimti cjimti commented Jun 17, 2026

Copy link
Copy Markdown
Member

Closes #1.

Problem

rtbeat acknowledged the inbound HTTP request before the event was delivered to the output, and closed the publisher on shutdown without draining. An upstream sender that treats the 200 as durable (e.g. rxtx, which drops a batch from its on-disk queue on a 200) would discard its only copy while the event was still just in rtbeat's in-memory pipeline — so a normal pod restart (rolling deploy, node drain, OOM) could silently lose in-flight events. Observed in production upstream of a financial-reporting index.

Fix

Ack-after-delivery. POST /in now publishes the batch with GuaranteedSend and waits for the output to acknowledge it before responding. Per-batch correlation uses a batchAck token carried in each event's Private field, surfaced via acker.EventPrivateReporter:

  • ack within timeout seconds → 200 ("delivered")
  • otherwise → 504, so the sender keeps its durable copy and retries (pair with a deterministic downstream document id to dedupe a retried batch)

Drain on shutdown. ClientConfig gains WaitClose; on SIGTERM the shutdown is ordered correctly — stop HTTP intake → let in-flight handlers finish → drain outstanding events (up to shutdown_timeout) → close. (Closing before intake stops would let a late batch publish into a closing pipeline and be lost — an ordering bug caught and fixed during development.)

Also:

  • Removed the stray leading zero-value event that was published with every batch (a blank doc per batch; also incompatible with clean ack correlation).
  • Non-positive timeout / shutdown_timeout clamp to defaults (a zero ack timeout would 504 everything; a zero shutdown timeout would silently disable the drain).

Config

Key Default Meaning
timeout 5 Seconds POST /in waits for the output to ack before returning 504. Previously defined but unused.
shutdown_timeout 30 Seconds Stop() waits to drain in-flight events before closing the publisher. New.

rtbeat.yml, rtbeat.reference.yml defaults, and docs/configuration.md (new "Delivery semantics" section) are updated.

⚠️ Behavior change — operator-visible

This changes the delivery contract of a production path. Worth a staged rollout:

  • /in latency now tracks delivery (was fire-and-forget). A slow/down output makes requests block up to timeout, then 504.
  • A stalled/down output now yields 504 + retries instead of a silent 200 and lost data.
  • The published event stream no longer contains the per-batch blank document.

Testing

Unit (beater ~92%, -race): delivered / timeout(504) / empty / bad-JSON handler paths; partial and multi-batch ack resolution; the New→Run→Stop lifecycle through a fake pipeline; config defaults, overrides, and clamping.

End-to-end (make e2e / go test -tags e2e ./test/e2e/...) — a black-box suite that builds and runs the real rtbeat binary shipping to a real libbeat Logstash output backed by a controllable go-lumber server (real TCP, real lumberjack ack protocol, real SIGTERM):

Test Proves
AckAfterDelivery 200 after a genuine over-the-wire ack
WaitsForAckBeforeOK the 200 waits for the ack (delayed-ack, asserts elapsed ≥ delay) — positive ordering proof, not a proxy
StallReturns504 a stalled output → 504, and the event still reached the output (sender retries)
MultiMessageFragmentedAck per-batch correlation when one /in batch is acked in fragments across the wire (bulk_max_size: 1)
DrainsInFlightOnShutdown an in-flight request is delivered on SIGTERM (graceful drain); clean exit

The e2e is tag-gated (default go test / make verify stay fast and offline) and runs as a dedicated CI job. go-lumber was promoted to a direct dependency.

Verification

  • make verify ✅ (lint clean, go test -race, tidy-check, SHA-pinned actions)
  • make e2e ✅ — all 5 e2e tests pass, flake-free across repeated runs
  • Two rounds of adversarial review (the change, then the e2e harness); both confirmed the durability guarantee holds and the e2e genuinely exercises it — their findings (shutdown ordering, config clamp, ack-correlation contract tests, a cmd.Wait() race, positive-ordering + fragmented-ack coverage) are applied.

Out of scope (explicitly not proven here)

Durability across a hard crash (SIGKILL/panic/power loss) or process restart — rtbeat holds in-flight events in libbeat's in-memory queue. Crash-safety relies on the sender retrying on a non-2xx and, optionally, configuring libbeat's disk queue/spool (noted in docs/configuration.md). This PR covers graceful delivery/ack/drain only.

cjimti added 3 commits June 16, 2026 18:56
Close the durability gap: rtbeat previously returned HTTP 200 before the
event was delivered and closed the publisher without draining, so an
upstream sender that treats the 200 as durable (e.g. rxtx removing a
batch from its on-disk queue) could lose in-flight events on a restart.

Delivery:
- POST /in now publishes the batch with GuaranteedSend and waits for the
  output to acknowledge it before responding. Per-batch correlation uses
  a batchAck token carried in each event's Private field, surfaced via
  acker.EventPrivateReporter. On ack -> 200; on timeout -> 504 so the
  sender keeps its durable copy and retries (pair with a deterministic
  doc id downstream to dedupe). The ack budget is the previously-unused
  config "timeout" (default 5s).

Shutdown:
- ClientConfig gains WaitClose; Stop signals and Run drains in the right
  order: stop HTTP intake -> let in-flight handlers finish -> close the
  client (drains outstanding events up to the new "shutdown_timeout",
  default 30s). Closing before intake stops would let a late batch
  publish into a closing pipeline and be lost.

Also:
- Remove the stray leading zero-value event published with every batch.
- Clamp non-positive timeout / shutdown_timeout to defaults (a zero ack
  timeout would 504 everything; a zero shutdown timeout would silently
  disable the drain).
- Tests cover delivered/timeout/empty/bad-JSON handler paths, partial
  and multi-batch ack resolution, the lifecycle, and config clamping
  (beater 92%); config example, docs, and CHANGELOG updated.

Closes #1
Prove the #1 durability guarantees against real components instead of
fakes: a black-box suite (build tag e2e) builds and runs the actual
rtbeat binary shipping to a real libbeat logstash output backed by a
controllable go-lumber lumberjack server.

Tests:
- AckAfterDelivery: 200 only after a real over-the-wire ack.
- WaitsForAckBeforeOK: delayed ack proves the 200 waits for delivery
  (asserts elapsed >= ack delay), not acked on receipt.
- StallReturns504: a stalled output yields 504, and the event still
  reached the output (so the sender retries rather than dropping it).
- MultiMessageFragmentedAck: with bulk_max_size=1 a single /in batch is
  acked in fragments across multiple callbacks over the wire, exercising
  the per-batch ack correlation.
- DrainsInFlightOnShutdown: an in-flight request is delivered on SIGTERM
  (graceful drain) and the process exits cleanly.

Out of scope (documented in the package doc): hard-crash and
cross-restart durability, which rely on sender retry-on-non-2xx and an
optional libbeat disk queue.

Also: a dedicated e2e CI job, a `make e2e` target, and go-lumber
promoted to a direct dependency. The suite is tag-gated, so the default
`go test` / `make verify` stay fast and offline.
Stop hand-maintaining a Keep-a-Changelog file; GoReleaser already
generates per-release notes on GitHub Releases. CHANGELOG.md now just
points there.
@codecov

codecov Bot commented Jun 17, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.30%. Comparing base (1cef2bf) to head (24878d5).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #16      +/-   ##
==========================================
+ Coverage   86.99%   90.30%   +3.31%     
==========================================
  Files           2        2              
  Lines         123      165      +42     
==========================================
+ Hits          107      149      +42     
  Misses         10       10              
  Partials        6        6              
Files with missing lines Coverage Δ
beater/rtbeat.go 91.97% <100.00%> (+2.80%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@cjimti cjimti merged commit c3fe7d2 into master Jun 17, 2026
10 checks passed
@cjimti cjimti deleted the 1-durability-ack-after-delivery branch June 17, 2026 02:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Durability gap: HTTP 200 returned before delivery + no drain on Stop() can lose in-flight events on restart

1 participant