feat(lake): validate replay completion (NF-6) by TerrifiedBug · Pull Request #496 · TerrifiedBug/vectorflow

TerrifiedBug · 2026-06-08T12:13:12Z

NF-6 — replay completion validation

Problem. A drained replay job was unconditionally marked COMPLETED on a short read (fetched < batchSize). A job could therefore reach COMPLETED with replayedEvents < totalEvents — a silent partial replay.

Change. Folded a completion check into nextReplayBatch's finish path (src/server/services/lake/replay.ts):

- On drain: COMPLETED only when replayedEvents >= totalEvents; otherwise FAILED with a clear reason stamped on ReplayJob.error (e.g. Replay drained after 3 of 10 expected events).
- An over-count (lake grew after create) still satisfies >= and completes cleanly; error is cleared (null) on a clean finish.
- done stays the short-read drain signal; org scoping (withOrgTx) is unchanged.
- No migration: reuses the existing ReplayJob.error String? column. FAILED was already a declared REPLAY_STATUS; getReplayJob/listReplayJobs already surface status + error, so the reason is observable with no extra wiring.
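
The drain rule above can be sketched as a pure function. This is a minimal illustration, not the code in replay.ts: the names resolveDrainOutcome, isDrained and DrainOutcome are hypothetical, and the real nextReplayBatch also performs the fetch and the org-scoped update, which are omitted here.

```typescript
// Hypothetical, simplified shapes; the real ReplayJob model has more fields.
type ReplayStatus = "RUNNING" | "COMPLETED" | "FAILED";

interface DrainOutcome {
  status: ReplayStatus;
  error: string | null;
}

/** A short read (fewer rows than requested) is the drain signal. */
function isDrained(fetched: number, batchSize: number): boolean {
  return fetched < batchSize;
}

/**
 * Decide the terminal state of a drained replay job.
 * replayedEvents: events delivered so far, including the final batch.
 * totalEvents: the count expected when the job was created.
 */
function resolveDrainOutcome(
  replayedEvents: number,
  totalEvents: number,
): DrainOutcome {
  if (replayedEvents >= totalEvents) {
    // Full or over-count drain (the lake may have grown after create):
    // finish cleanly and clear any stale error.
    return { status: "COMPLETED", error: null };
  }
  // Shortfall: surface the reason instead of completing silently.
  return {
    status: "FAILED",
    error: `Replay drained after ${replayedEvents} of ${totalEvents} expected events`,
  };
}
```

Keeping the decision separate from the fetch makes the shortfall and over-count branches testable without a ClickHouse mock.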

UX-2-replay — already implemented on main

The agent replay-batch endpoint already exists at src/app/api/agent/replay/route.ts (agent auth via resolveAgentOrg + authenticateAgentInOrg, org + environment pipeline scoping, calls nextReplayBatch) with a passing route test. It returns NDJSON + X-VF-Replay-* headers — the deliberate, documented Vector http_client decoding integration — rather than a JSON {events,cursor,done} envelope. Per direction, this route is left untouched (rewriting it to JSON would break the Vector integration + its test for no functional gain). UX-2-replay is considered satisfied by the existing implementation; this PR adds no route changes.
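
For readers unfamiliar with the NDJSON framing mentioned above, the sketch below shows the general encoding: one JSON document per line. It is illustrative only. toNdjson and fromNdjson are hypothetical helpers, not functions from route.ts, and the X-VF-Replay-* headers are deliberately not reproduced because their exact names are not given here.

```typescript
// Hypothetical event shape; real replay events are whatever the lake stores.
type ReplayEvent = Record<string, unknown>;

/** Serialize events as NDJSON: one JSON document per line, newline-terminated. */
function toNdjson(events: readonly ReplayEvent[]): string {
  // JSON.stringify escapes embedded newlines, so each event stays on one line.
  return events.map((event) => JSON.stringify(event) + "\n").join("");
}

/** Parse an NDJSON body back into events, skipping blank lines. */
function fromNdjson(body: string): ReplayEvent[] {
  return body
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as ReplayEvent);
}
```

Because every line is an independent document, a consumer can decode events as they arrive instead of waiting for a closing bracket, which is what a JSON envelope would require.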

Tests

src/server/services/lake/__tests__/replay.test.ts — added two NF-6 cases (FAILED-with-reason on a shortfall; COMPLETED-clearing-error on a full/over-count drain). pnpm exec vitest run → 19 passed.
Existing route test src/app/api/agent/replay/__tests__/route.test.ts → 8 passed (confirms UX-2 endpoint behaviour).
Filtered tsc --noEmit clean for the changed files.

The lake/clickhouse + agent paths are not locally/visually verifiable (@clickhouse/client is not installed in this env; clickhouse is mocked in tests). Unit tests + types are the bar here.

A drained replay job was unconditionally marked COMPLETED on a short read, so a job could reach COMPLETED with replayedEvents < totalEvents — a silent partial replay. Fold a completion check into nextReplayBatch's finish path: on drain, COMPLETED only when replayedEvents >= totalEvents, otherwise FAILED with a reason stamped on ReplayJob.error. An over-count (lake grew after create) still satisfies >= and completes cleanly. Org scoping (withOrgTx) and the short-read drain signal are unchanged; no migration (reuses ReplayJob.error). Tests: nextReplayBatch marks FAILED with a reason on a shortfall and COMPLETED (clearing error) on a full/over-count drain.

* feat(release): gate canary broaden on replay error-budget (NF-6)

The plan's NF-6 (replay-driven promotion validation) was only partially shipped by #496, which added replay COMPLETION validation but no promotion gate. This adds the actual gate: a canary -> full-fleet broaden can be gated on how the candidate behaved over a replayed sample of past lake events.

- sli-evaluator: extract evaluateSliOverWindow + rollUpSliStatus (rolling health behavior unchanged) so a fixed [startedAt, completedAt] window can be scored.
- replay-validation service: score a completed replay against the target pipeline's OWN SLIs (error_rate/discard_rate/latency_mean) over the replay window -> PASS/FAIL/NO_DATA. throughput_floor excluded (time-rate gate, meaningless over an artificial replay window).
- replay.validate query (VIEWER, target-scoped) surfaces the verdict.
- canary.broaden: opt-in replayJobId gate. Absent -> byte-identical prior behavior. FAIL blocks unless force (override audited). Job must target the rollout's pipeline.

Unit-tested (verdict logic, window scoping, gate branches incl. override + pipeline-mismatch). Live canary-replay E2E is infra-dependent.

* fix(sli): preserve empty-window 'breached' contract for rolling health

Reviewer P1: extracting evaluateSliOverWindow flipped evaluatePipelineHealth's zero-metric-row case from breached(value:0) to no_data, so a dark pipeline stopped reporting degraded and diverged from batch-health.ts / fleet-metrics.ts. Add an emptyWindowStatus param: rolling health passes 'breached' (its documented contract); replay validation keeps 'no_data' (an unscored window is no opinion). Covered by a new dark-pipeline test.
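
The gate branches described in this commit message can be sketched roughly as follows. This is an assumption-laden illustration, not the code from #504: rollUpVerdict, checkBroadenGate and all type names are hypothetical, the "met" status name is invented for the sketch, and letting NO_DATA through the gate is an assumption, since the message only states that FAIL blocks.

```typescript
// 'breached' and 'no_data' appear in the commit message; "met" is an assumed name.
type SliStatus = "met" | "breached" | "no_data";
type Verdict = "PASS" | "FAIL" | "NO_DATA";

/** Roll per-SLI statuses over the replay window into one verdict (assumed rule). */
function rollUpVerdict(statuses: readonly SliStatus[]): Verdict {
  if (statuses.some((s) => s === "breached")) return "FAIL";
  if (statuses.length === 0 || statuses.every((s) => s === "no_data")) {
    return "NO_DATA"; // an unscored window is no opinion
  }
  return "PASS";
}

interface BroadenGateInput {
  rolloutPipelineId: string;
  // Present only when the caller opted in with a replayJobId.
  replay?: { targetPipelineId: string; verdict: Verdict };
  force?: boolean;
}

type GateDecision =
  | { allowed: true; overridden: boolean }
  | { allowed: false; reason: string };

function checkBroadenGate(input: BroadenGateInput): GateDecision {
  const { replay } = input;
  // No replay job supplied: behave exactly as before the gate existed.
  if (!replay) return { allowed: true, overridden: false };
  // The job must target the rollout's pipeline.
  if (replay.targetPipelineId !== input.rolloutPipelineId) {
    return { allowed: false, reason: "Replay job targets a different pipeline" };
  }
  if (replay.verdict === "FAIL") {
    // force overrides the block; the caller is expected to audit the override.
    return input.force
      ? { allowed: true, overridden: true }
      : { allowed: false, reason: "Replay validation failed" };
  }
  // PASS proceeds; NO_DATA also proceeds here, which is an assumption.
  return { allowed: true, overridden: false };
}
```

Returning an explicit overridden flag gives the caller a single place to decide whether an audit entry is needed.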

github-actions Bot added the feature label Jun 8, 2026

TerrifiedBug merged commit 2a4ff13 into main Jun 8, 2026
17 checks passed

TerrifiedBug deleted the feat/nf-6-replay-validation branch June 8, 2026 12:39

TerrifiedBug mentioned this pull request Jun 8, 2026

feat(release): gate canary broaden on replay error-budget (NF-6) #504

Merged

TerrifiedBug merged 1 commit into main from feat/nf-6-replay-validation
