fix(storage): serve the write buffer on read so acked records are consumable (read-after-ack) by kamir · Pull Request #149 · KafScale/platform

kamir · 2026-06-04T15:38:48Z

Summary

Make PartitionLog.Read serve the in-memory write buffer in addition to flushed
segments, so an acknowledged-but-not-yet-flushed record is consumable
(Kafka read-after-ack).

When this matters (scope)

This affects two cases: configurations where the per-acknowledgement flush is
disabled (KAFSCALE_PRODUCE_SYNC_FLUSH=false), and acks=0 produces. There, an
acknowledged record can sit in the in-memory WriteBuffer until a size or
interval threshold flushes it. Before this change, Read served only flushed
segments and returned ErrOffsetOutOfRange for the buffered tail, so a consumer
reading up to the high-watermark could miss just-acknowledged records.

In the default configuration (KAFSCALE_PRODUCE_SYNC_FLUSH=true) every
acknowledged produce is flushed immediately, so this change is a no-op there. It
is a correctness hardening for the flush-disabled / acks=0 path.

Root cause

AppendBatch appends to the in-memory WriteBuffer and returns the assigned
offset (the ack basis). A flush to a segment happens on a WriteBuffer
threshold, or per acknowledged produce in the default path.
Read scanned only flushed segments and returned ErrOffsetOutOfRange on a
miss; it never consulted the buffer.

Fix

Add WriteBuffer.RecordsFrom(offset, maxBytes): a non-destructive read of the
buffered batch bytes from a given offset onward.
Read falls back to it when the offset is not in a flushed segment.

No wire-protocol change; the fetch handler calls PartitionLog.Read unchanged.
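
The fallback can be sketched with a simplified model. The names PartitionLog, WriteBuffer, RecordsFrom, AppendBatch, Read, and ErrOffsetOutOfRange are taken from this PR; the struct layouts, the return types, the Flush helper, and the "always return at least one batch" rule are assumptions made to keep the example self-contained, and may differ from the real code.

```go
package main

import (
	"errors"
	"sort"
)

// ErrOffsetOutOfRange mirrors the error name used in the PR.
var ErrOffsetOutOfRange = errors.New("offset out of range")

// batch is a hypothetical encoded record batch covering
// offsets [base, base+count).
type batch struct {
	base  int64
	count int64
	data  []byte
}

// recordsFrom copies batch bytes starting at the batch that contains offset.
// It stops before exceeding maxBytes, but always returns at least one batch
// so a reader makes progress. It never mutates batches.
func recordsFrom(batches []batch, offset int64, maxBytes int) ([]byte, bool) {
	// First batch whose end lies beyond the requested offset.
	i := sort.Search(len(batches), func(i int) bool {
		return batches[i].base+batches[i].count > offset
	})
	if i == len(batches) || batches[i].base > offset {
		return nil, false
	}
	var out []byte
	for _, b := range batches[i:] {
		if len(out) > 0 && len(out)+len(b.data) > maxBytes {
			break
		}
		out = append(out, b.data...)
	}
	return out, true
}

// WriteBuffer holds acked-but-unflushed batches in offset order.
type WriteBuffer struct{ batches []batch }

// RecordsFrom is the non-destructive buffer read added by the fix.
func (w *WriteBuffer) RecordsFrom(offset int64, maxBytes int) ([]byte, bool) {
	return recordsFrom(w.batches, offset, maxBytes)
}

// PartitionLog models flushed segments as a flat slice of batches.
type PartitionLog struct {
	flushed []batch
	buf     WriteBuffer
	next    int64
}

// AppendBatch buffers the batch and returns its base offset (the ack basis).
func (l *PartitionLog) AppendBatch(data []byte, count int64) int64 {
	base := l.next
	l.buf.batches = append(l.buf.batches, batch{base: base, count: count, data: data})
	l.next += count
	return base
}

// Flush moves everything buffered into the flushed set.
func (l *PartitionLog) Flush() {
	l.flushed = append(l.flushed, l.buf.batches...)
	l.buf.batches = nil
}

// Read tries flushed data first, then falls back to the write buffer.
func (l *PartitionLog) Read(offset int64, maxBytes int) ([]byte, error) {
	if data, ok := recordsFrom(l.flushed, offset, maxBytes); ok {
		return data, nil
	}
	if data, ok := l.buf.RecordsFrom(offset, maxBytes); ok {
		return data, nil
	}
	return nil, ErrOffsetOutOfRange
}
```

In this model, removing the second branch of Read reproduces the pre-fix behaviour: any offset that has been appended but not flushed returns ErrOffsetOutOfRange.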

Tests

TestPartitionLogReadAfterAckBeforeFlush (new): appends with flush thresholds
set so nothing flushes, asserts every acked offset is readable. Fails before,
passes after.
TestPartitionLogMultiFlushAllOffsetsReadable (new): every acked offset stays
readable across many flush rotations.
go test ./pkg/storage/... ./pkg/broker/... ./cmd/broker/... all green.
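
The shape of both tests can be approximated with a toy log. This is not the repository's test code: toyLog, its fields, and the unreadable helper are hypothetical, with one record per batch and a flush that is evaluated only inside append once maxBatches batches are buffered. The serveBuffer flag switches between the pre-fix and post-fix read behaviour.

```go
package main

import "errors"

var errOutOfRange = errors.New("offset out of range")

// toyLog is a hypothetical stand-in for PartitionLog.
type toyLog struct {
	maxBatches  int       // flush threshold, checked only inside append
	segments    [][]int64 // offsets held by each flushed segment
	buffer      []int64   // acked but unflushed offsets
	next        int64
	serveBuffer bool // true models the behaviour after this PR
}

// append buffers one record, flushes if the threshold is reached,
// and returns the acked offset.
func (l *toyLog) append() int64 {
	off := l.next
	l.next++
	l.buffer = append(l.buffer, off)
	if len(l.buffer) >= l.maxBatches {
		l.segments = append(l.segments, l.buffer)
		l.buffer = nil
	}
	return off
}

// read succeeds if the offset is in a flushed segment, or in the buffer
// when serveBuffer is enabled.
func (l *toyLog) read(off int64) error {
	for _, seg := range l.segments {
		if off >= seg[0] && off <= seg[len(seg)-1] {
			return nil
		}
	}
	if l.serveBuffer {
		for _, b := range l.buffer {
			if b == off {
				return nil
			}
		}
	}
	return errOutOfRange
}

// unreadable appends n records and returns every acked offset that
// cannot be read back afterwards.
func unreadable(maxBatches, n int, serveBuffer bool) []int64 {
	l := &toyLog{maxBatches: maxBatches, serveBuffer: serveBuffer}
	var acked, missing []int64
	for i := 0; i < n; i++ {
		acked = append(acked, l.append())
	}
	for _, off := range acked {
		if l.read(off) != nil {
			missing = append(missing, off)
		}
	}
	return missing
}
```

Calling unreadable(1000, 10, false) models the read-after-ack test before the fix (nothing flushes, so all ten offsets are missing), while unreadable(3, 30, true) models the multi-flush case, where every offset should be readable across rotations.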

Correction to an earlier version of this description

An earlier draft attributed a large "acknowledged but unreadable at volume"
effect to a flush/segment data-loss path. That was wrong. The effect was traced
to a test that produced one record per produce request against the flush-on-ack
broker. That yields one record per segment, which is expensive to read back one
segment at a time, so the consumer hit its read deadline after reading only a
fraction of the records. The data was durable and complete; a batched producer
round-trips byte-clean. This PR is scoped only to the read-after-ack consistency
described above.

AppendBatch returns an AppendResult (the produce ACK basis) as soon as the batch is buffered, but flush-to-segment only happens when a WriteBuffer threshold trips, and flushing is evaluated only inside AppendBatch (no background flusher). Read serves flushed segments only and returns ErrOffsetOutOfRange for buffered offsets. So a just-acked record whose partition then goes quiet stays unreadable (and is lost on broker restart, since the buffer is in-memory), violating Kafka read-after-ack.

This test appends 10 batches with flush thresholds set so nothing flushes, then asserts every acked offset is readable. It FAILS on the current code (offset 0 -> ErrOffsetOutOfRange) and must pass once Read serves the buffer (or produce flushes before acking under acks=all).

Existing tests use MaxBytes:1 so every append flushes immediately, which is why this path was never exercised.

Refs: scalytics UPSTREAM/2026-06-04-kafscale-consume-readpath.md

…sumable

PartitionLog.Read served only flushed segments and returned ErrOffsetOutOfRange for any offset still in the in-memory WriteBuffer. Because flush is append-triggered (ShouldFlush is evaluated only inside AppendBatch; there is no background flusher), a partition that goes quiet below the flush threshold keeps its just-acked tail in the buffer, where it was unreadable, breaking Kafka's read-after-ack contract (observed end-to-end as 1015 acked -> 588 readable on v1.6.0).

Read now falls back to the buffer when the offset is not in a flushed segment: new WriteBuffer.RecordsFrom(offset, maxBytes) returns the buffered batch bytes for the requested offset onward, non-destructively. The fetch handler (cmd/broker fetch -> plog.Read) picks this up unchanged.

Makes TestPartitionLogReadAfterAckBeforeFlush pass; full pkg/storage, pkg/broker and cmd/broker suites stay green.

Note: this fixes READABILITY (read-after-ack). Durability-on-restart is separate: the buffer is still in-memory, so acked-but-unflushed records are lost if the broker restarts before flush. A complete acks=all guarantee additionally needs flush-before-ack or a WAL; tracked separately.

Refs: scalytics UPSTREAM/2026-06-04-kafscale-consume-readpath.md

…ross rotations

Appends 30 batches with MaxBatches=3 (frequent flush rotations) and asserts every acked offset stays readable. Passes over MemoryS3, isolating the end-to-end '1019 acked -> ~32 readable' loss OUT of the pkg/storage state machine (no segment overwrite, no drop across rotations). The live loss is therefore in the real S3 client / proxy fetch-forward / concurrency, not the storage logic.

Scalytics added 3 commits June 4, 2026 16:48

kamir wants to merge 3 commits into KafScale:main from kamir:fix/broker-ack-but-lost-v1.6.0
