Durability: ack-after-delivery + drain on shutdown, proven by an e2e suite#16
Merged
Conversation
Close the durability gap: rtbeat previously returned HTTP 200 before the event was delivered and closed the publisher without draining, so an upstream sender that treats the 200 as durable (e.g. rxtx removing a batch from its on-disk queue) could lose in-flight events on a restart. Delivery: - POST /in now publishes the batch with GuaranteedSend and waits for the output to acknowledge it before responding. Per-batch correlation uses a batchAck token carried in each event's Private field, surfaced via acker.EventPrivateReporter. On ack -> 200; on timeout -> 504 so the sender keeps its durable copy and retries (pair with a deterministic doc id downstream to dedupe). The ack budget is the previously-unused config "timeout" (default 5s). Shutdown: - ClientConfig gains WaitClose; Stop signals and Run drains in the right order: stop HTTP intake -> let in-flight handlers finish -> close the client (drains outstanding events up to the new "shutdown_timeout", default 30s). Closing before intake stops would let a late batch publish into a closing pipeline and be lost. Also: - Remove the stray leading zero-value event published with every batch. - Clamp non-positive timeout / shutdown_timeout to defaults (a zero ack timeout would 504 everything; a zero shutdown timeout would silently disable the drain). - Tests cover delivered/timeout/empty/bad-JSON handler paths, partial and multi-batch ack resolution, the lifecycle, and config clamping (beater 92%); config example, docs, and CHANGELOG updated. Closes #1
Prove the #1 durability guarantees against real components instead of fakes: a black-box suite (build tag e2e) builds and runs the actual rtbeat binary shipping to a real libbeat logstash output backed by a controllable go-lumber lumberjack server. Tests: - AckAfterDelivery: 200 only after a real over-the-wire ack. - WaitsForAckBeforeOK: delayed ack proves the 200 waits for delivery (asserts elapsed >= ack delay), not acked on receipt. - StallReturns504: a stalled output yields 504, and the event still reached the output (so the sender retries rather than dropping it). - MultiMessageFragmentedAck: with bulk_max_size=1 a single /in batch is acked in fragments across multiple callbacks over the wire, exercising the per-batch ack correlation. - DrainsInFlightOnShutdown: an in-flight request is delivered on SIGTERM (graceful drain) and the process exits cleanly. Out of scope (documented in the package doc): hard-crash and cross-restart durability, which rely on sender retry-on-non-2xx and an optional libbeat disk queue. Also: a dedicated e2e CI job, a `make e2e` target, and go-lumber promoted to a direct dependency. The suite is tag-gated, so the default `go test` / `make verify` stay fast and offline.
Stop hand-maintaining a Keep-a-Changelog file; GoReleaser already generates per-release notes on GitHub Releases. CHANGELOG.md now just points there.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #16 +/- ##
==========================================
+ Coverage 86.99% 90.30% +3.31%
==========================================
Files 2 2
Lines 123 165 +42
==========================================
+ Hits 107 149 +42
Misses 10 10
Partials 6 6
🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #1.
Problem
rtbeat acknowledged the inbound HTTP request before the event was delivered to the output, and closed the publisher on shutdown without draining. An upstream sender that treats the 200 as durable (e.g. rxtx, which drops a batch from its on-disk queue on a 200) would discard its only copy while the event was still just in rtbeat's in-memory pipeline — so a normal pod restart (rolling deploy, node drain, OOM) could silently lose in-flight events. Observed in production upstream of a financial-reporting index.
Fix
Ack-after-delivery.
POST /innow publishes the batch withGuaranteedSendand waits for the output to acknowledge it before responding. Per-batch correlation uses abatchAcktoken carried in each event'sPrivatefield, surfaced viaacker.EventPrivateReporter:timeoutseconds → 200 ("delivered")Drain on shutdown.
ClientConfiggainsWaitClose; on SIGTERM the shutdown is ordered correctly — stop HTTP intake → let in-flight handlers finish → drain outstanding events (up toshutdown_timeout) → close. (Closing before intake stops would let a late batch publish into a closing pipeline and be lost — an ordering bug caught and fixed during development.)Also:
timeout/shutdown_timeoutclamp to defaults (a zero ack timeout would 504 everything; a zero shutdown timeout would silently disable the drain).Config
timeout5POST /inwaits for the output to ack before returning 504. Previously defined but unused.shutdown_timeout30Stop()waits to drain in-flight events before closing the publisher. New.rtbeat.yml,rtbeat.reference.ymldefaults, anddocs/configuration.md(new "Delivery semantics" section) are updated.This changes the delivery contract of a production path. Worth a staged rollout:
/inlatency now tracks delivery (was fire-and-forget). A slow/down output makes requests block up totimeout, then 504.Testing
Unit (
beater~92%,-race): delivered / timeout(504) / empty / bad-JSON handler paths; partial and multi-batch ack resolution; the New→Run→Stop lifecycle through a fake pipeline; config defaults, overrides, and clamping.End-to-end (
make e2e/go test -tags e2e ./test/e2e/...) — a black-box suite that builds and runs the real rtbeat binary shipping to a real libbeat Logstash output backed by a controllable go-lumber server (real TCP, real lumberjack ack protocol, real SIGTERM):AckAfterDeliveryWaitsForAckBeforeOKStallReturns504MultiMessageFragmentedAck/inbatch is acked in fragments across the wire (bulk_max_size: 1)DrainsInFlightOnShutdownThe e2e is tag-gated (default
go test/make verifystay fast and offline) and runs as a dedicated CI job.go-lumberwas promoted to a direct dependency.Verification
make verify✅ (lint clean,go test -race, tidy-check, SHA-pinned actions)make e2e✅ — all 5 e2e tests pass, flake-free across repeated runscmd.Wait()race, positive-ordering + fragmented-ack coverage) are applied.Out of scope (explicitly not proven here)
Durability across a hard crash (SIGKILL/panic/power loss) or process restart — rtbeat holds in-flight events in libbeat's in-memory queue. Crash-safety relies on the sender retrying on a non-2xx and, optionally, configuring libbeat's disk queue/spool (noted in
docs/configuration.md). This PR covers graceful delivery/ack/drain only.