feat(lake): validate replay completion (NF-6) #496
Merged
A drained replay job was unconditionally marked COMPLETED on a short read, so a job could reach COMPLETED with replayedEvents < totalEvents — a silent partial replay. Fold a completion check into nextReplayBatch's finish path: on drain, COMPLETED only when replayedEvents >= totalEvents, otherwise FAILED with a reason stamped on ReplayJob.error. An over-count (lake grew after create) still satisfies >= and completes cleanly. Org scoping (withOrgTx) and the short-read drain signal are unchanged; no migration (reuses ReplayJob.error). Tests: nextReplayBatch marks FAILED with a reason on a shortfall and COMPLETED (clearing error) on a full/over-count drain.
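The completion rule described above can be sketched as a pure function. This is a minimal illustration, not the code in `nextReplayBatch`: the `ReplayProgress` and `FinishResult` shapes and the name `resolveDrainedStatus` are hypothetical, and only the `>=` comparison, the two terminal statuses, and the error-message wording are taken from this PR.

```typescript
// Hypothetical, simplified shapes; the real ReplayJob model carries more fields.
type TerminalStatus = "COMPLETED" | "FAILED";

interface ReplayProgress {
  replayedEvents: number;
  totalEvents: number;
}

interface FinishResult {
  status: TerminalStatus;
  error: string | null;
}

// Decide the terminal state once a short read has signalled that the job drained.
function resolveDrainedStatus(job: ReplayProgress): FinishResult {
  if (job.replayedEvents >= job.totalEvents) {
    // Full drain, or an over-count because the lake grew after create:
    // finish cleanly and clear any previously stamped error.
    return { status: "COMPLETED", error: null };
  }
  // Shortfall: fail loudly instead of reporting a silent partial replay.
  return {
    status: "FAILED",
    error: `Replay drained after ${job.replayedEvents} of ${job.totalEvents} expected events`,
  };
}
```

Keeping the decision separate from the database write makes the shortfall and over-count branches easy to unit-test without a transaction.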
TerrifiedBug added a commit that referenced this pull request on Jun 8, 2026
* feat(release): gate canary broaden on replay error-budget (NF-6)

  The plan's NF-6 (replay-driven promotion validation) was only partially shipped by #496, which added replay COMPLETION validation but no promotion gate. This adds the actual gate: a canary -> full-fleet broaden can be gated on how the candidate behaved over a replayed sample of past lake events.

  - sli-evaluator: extract evaluateSliOverWindow + rollUpSliStatus (rolling health behavior unchanged) so a fixed [startedAt, completedAt] window can be scored.
  - replay-validation service: score a completed replay against the target pipeline's OWN SLIs (error_rate/discard_rate/latency_mean) over the replay window -> PASS/FAIL/NO_DATA. throughput_floor excluded (time-rate gate, meaningless over an artificial replay window).
  - replay.validate query (VIEWER, target-scoped) surfaces the verdict.
  - canary.broaden: opt-in replayJobId gate. Absent -> byte-identical prior behavior. FAIL blocks unless force (override audited). Job must target the rollout's pipeline.

  Unit-tested (verdict logic, window scoping, gate branches incl. override + pipeline-mismatch). Live canary-replay E2E is infra-dependent.

* fix(sli): preserve empty-window 'breached' contract for rolling health

  Reviewer P1: extracting evaluateSliOverWindow flipped evaluatePipelineHealth's zero-metric-row case from breached(value:0) to no_data, so a dark pipeline stopped reporting degraded and diverged from batch-health.ts / fleet-metrics.ts. Add an emptyWindowStatus param: rolling health passes 'breached' (its documented contract); replay validation keeps 'no_data' (an unscored window is no opinion). Covered by a new dark-pipeline test.
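The verdict and gate branches in the commit message above can be illustrated with a small sketch. Everything named here (`scoreSli`, `rollUpVerdict`, `evaluateBroadenGate`, the `"met"` status, the input shapes) is hypothetical and is not the sli-evaluator or canary.broaden code. It also assumes lower-is-better SLIs scored by their mean, that a mix of met and unscored SLIs rolls up to PASS, and that a NO_DATA verdict does not block; the commit message does not settle those points.

```typescript
type SliStatus = "met" | "breached" | "no_data"; // "met" is an assumed name
type Verdict = "PASS" | "FAIL" | "NO_DATA";

// Score one lower-is-better SLI over a fixed window. An empty window returns
// whatever the caller chose: rolling health passes "breached", replay
// validation passes "no_data".
function scoreSli(
  samples: number[],
  threshold: number,
  emptyWindowStatus: "breached" | "no_data",
): SliStatus {
  if (samples.length === 0) return emptyWindowStatus;
  const mean = samples.reduce((sum, v) => sum + v, 0) / samples.length;
  return mean <= threshold ? "met" : "breached";
}

// Roll per-SLI statuses up into a single replay verdict (assumed precedence).
function rollUpVerdict(statuses: SliStatus[]): Verdict {
  if (statuses.includes("breached")) return "FAIL";
  if (statuses.includes("met")) return "PASS";
  return "NO_DATA";
}

interface GateInput {
  rolloutPipelineId: string;
  replayJobId?: string; // opt-in: absent means the gate is skipped
  jobPipelineId?: string; // pipeline the replay job targeted
  verdict?: Verdict;
  force?: boolean;
}

type GateDecision =
  | { allowed: true; overridden: boolean }
  | { allowed: false; reason: string };

function evaluateBroadenGate(input: GateInput): GateDecision {
  // No replayJobId: behave exactly as before the gate existed.
  if (input.replayJobId === undefined) return { allowed: true, overridden: false };
  // The replay job must target the rollout's own pipeline.
  if (input.jobPipelineId !== input.rolloutPipelineId) {
    return { allowed: false, reason: "replay job targets a different pipeline" };
  }
  if (input.verdict === "FAIL") {
    // A forced broaden goes through; the caller is expected to audit the override.
    if (input.force) return { allowed: true, overridden: true };
    return { allowed: false, reason: "replay validation failed" };
  }
  return { allowed: true, overridden: false };
}
```

Passing `emptyWindowStatus` explicitly lets the same scoring function keep both contracts: a dark pipeline still reads as breached for rolling health, while an unscored replay window stays an abstention.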
NF-6: replay completion validation

Problem. A drained replay job was unconditionally marked `COMPLETED` on a short read (`fetched < batchSize`). A job could therefore reach `COMPLETED` with `replayedEvents < totalEvents`: a silent partial replay.

Change. Folded a completion check into `nextReplayBatch`'s finish path (`src/server/services/lake/replay.ts`):

- `COMPLETED` only when `replayedEvents >= totalEvents`; otherwise `FAILED` with a clear reason stamped on `ReplayJob.error` (e.g. `Replay drained after 3 of 10 expected events`).
- An over-count (lake grew after create) still satisfies `>=` and completes cleanly; `error` is cleared (`null`) on a clean finish.
- `done` stays the short-read drain signal; org scoping (`withOrgTx`) is unchanged.
- No migration: reuses the `ReplayJob.error String?` column. `FAILED` was already a declared `REPLAY_STATUS`; `getReplayJob` / `listReplayJobs` already surface `status` + `error`, so the reason is observable with no extra wiring.

UX-2-replay: already implemented on main

The agent replay-batch endpoint already exists at `src/app/api/agent/replay/route.ts` (agent auth via `resolveAgentOrg` + `authenticateAgentInOrg`, org + environment pipeline scoping, calls `nextReplayBatch`) with a passing route test. It returns NDJSON + `X-VF-Replay-*` headers (the deliberate, documented Vector `http_client` decoding integration) rather than a JSON `{events,cursor,done}` envelope. Per direction, this route is left untouched: rewriting it to JSON would break the Vector integration and its test for no functional gain. UX-2-replay is considered satisfied by the existing implementation; this PR adds no route changes.

Tests

- `src/server/services/lake/__tests__/replay.test.ts`: added two NF-6 cases (FAILED-with-reason on a shortfall; COMPLETED-clearing-error on a full/over-count drain). `pnpm exec vitest run` → 19 passed.
- `src/app/api/agent/replay/__tests__/route.test.ts` → 8 passed (confirms UX-2 endpoint behaviour).
- `tsc --noEmit` clean for the changed files.