feat(lake): validate replay completion (NF-6) #496

Merged
TerrifiedBug merged 1 commit into main from feat/nf-6-replay-validation
Jun 8, 2026

@TerrifiedBug

NF-6 — replay completion validation

Problem. A drained replay job was unconditionally marked COMPLETED on a short read (fetched < batchSize). A job could therefore reach COMPLETED with replayedEvents < totalEvents — a silent partial replay.

Change. Folded a completion check into nextReplayBatch's finish path (src/server/services/lake/replay.ts):

  • On drain: COMPLETED only when replayedEvents >= totalEvents; otherwise FAILED with a clear reason stamped on ReplayJob.error (e.g. Replay drained after 3 of 10 expected events).
  • An over-count (lake grew after create) still satisfies >= and completes cleanly; error is cleared (null) on a clean finish.
  • done stays the short-read drain signal; org scoping (withOrgTx) is unchanged.
  • No migration — reuses the existing ReplayJob.error String? column. FAILED was already a declared REPLAY_STATUS; getReplayJob/listReplayJobs already surface status + error, so the reason is observable with no extra wiring.
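
A minimal sketch of the drain decision described above, not the actual code in replay.ts. The names `ReplayStatus`, `DrainOutcome`, and `resolveDrainOutcome` are hypothetical and exist only for illustration; the real check lives inside nextReplayBatch's finish path and runs under withOrgTx.

```typescript
// Hypothetical types and helper; only the status values and the error text
// format are taken from the PR description.
type ReplayStatus = "COMPLETED" | "FAILED";

interface DrainOutcome {
  status: ReplayStatus;
  // Mirrors the nullable ReplayJob.error column: null on a clean finish.
  error: string | null;
}

// Decide the terminal state of a replay job once the short read signals drain.
function resolveDrainOutcome(replayedEvents: number, totalEvents: number): DrainOutcome {
  // An over-count (lake grew after create) still satisfies >= and completes.
  if (replayedEvents >= totalEvents) {
    return { status: "COMPLETED", error: null };
  }
  // A shortfall is a partial replay, so fail it with an observable reason.
  return {
    status: "FAILED",
    error: `Replay drained after ${replayedEvents} of ${totalEvents} expected events`,
  };
}
```

Keeping the decision a pure function of the two counters would make both NF-6 cases (shortfall and full/over-count) testable without a database, though the PR itself tests them through nextReplayBatch.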

UX-2-replay — already implemented on main

The agent replay-batch endpoint already exists at src/app/api/agent/replay/route.ts (agent auth via resolveAgentOrg + authenticateAgentInOrg, org + environment pipeline scoping, calls nextReplayBatch) with a passing route test. It returns NDJSON + X-VF-Replay-* headers — the deliberate, documented Vector http_client decoding integration — rather than a JSON {events,cursor,done} envelope. Per direction, this route is left untouched (rewriting it to JSON would break the Vector integration + its test for no functional gain). UX-2-replay is considered satisfied by the existing implementation; this PR adds no route changes.

Tests

  • src/server/services/lake/__tests__/replay.test.ts: added two NF-6 cases (FAILED-with-reason on a shortfall; COMPLETED-clearing-error on a full/over-count drain). pnpm exec vitest run: 19 passed.
  • Existing route test src/app/api/agent/replay/__tests__/route.test.ts: 8 passed (confirms UX-2 endpoint behaviour).
  • Filtered tsc --noEmit clean for the changed files.

The lake/clickhouse + agent paths are not locally/visually verifiable (@clickhouse/client is not installed in this env; clickhouse is mocked in tests). Unit tests + types are the bar here.

A drained replay job was unconditionally marked COMPLETED on a short read,
so a job could reach COMPLETED with replayedEvents < totalEvents — a silent
partial replay. Fold a completion check into nextReplayBatch's finish path:
on drain, COMPLETED only when replayedEvents >= totalEvents, otherwise FAILED
with a reason stamped on ReplayJob.error. An over-count (lake grew after
create) still satisfies >= and completes cleanly. Org scoping (withOrgTx) and
the short-read drain signal are unchanged; no migration (reuses ReplayJob.error).

Tests: nextReplayBatch marks FAILED with a reason on a shortfall and COMPLETED
(clearing error) on a full/over-count drain.

TerrifiedBug merged commit 2a4ff13 into main on Jun 8, 2026
17 checks passed
TerrifiedBug deleted the feat/nf-6-replay-validation branch on June 8, 2026 12:39
TerrifiedBug added a commit that referenced this pull request on Jun 8, 2026

* feat(release): gate canary broaden on replay error-budget (NF-6)

The plan's NF-6 (replay-driven promotion validation) was only partially shipped by #496, which added replay COMPLETION validation but no promotion gate. This adds the actual gate: a canary -> full-fleet broaden can be gated on how the candidate behaved over a replayed sample of past lake events.

- sli-evaluator: extract evaluateSliOverWindow + rollUpSliStatus (rolling health behavior unchanged) so a fixed [startedAt, completedAt] window can be scored.

- replay-validation service: score a completed replay against the target pipeline's OWN SLIs (error_rate/discard_rate/latency_mean) over the replay window -> PASS/FAIL/NO_DATA. throughput_floor excluded (time-rate gate, meaningless over an artificial replay window).

- replay.validate query (VIEWER, target-scoped) surfaces the verdict.

- canary.broaden: opt-in replayJobId gate. Absent -> byte-identical prior behavior. FAIL blocks unless force (override audited). Job must target the rollout's pipeline.

Unit-tested (verdict logic, window scoping, gate branches incl. override + pipeline-mismatch). Live canary-replay E2E is infra-dependent.
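
The gate branches listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the canary.broaden implementation: `ReplayGate`, `GateDecision`, and `decideBroaden` are hypothetical names, and two behaviours are assumed because the commit message does not spell them out: NO_DATA does not block, and force cannot override a pipeline mismatch.

```typescript
// Verdict values come from the commit message; everything else is hypothetical.
type ReplayVerdict = "PASS" | "FAIL" | "NO_DATA";

interface ReplayGate {
  verdict: ReplayVerdict;
  jobPipelineId: string; // pipeline the replay job targeted
}

type GateDecision =
  | { allowed: true; overridden: boolean }
  | { allowed: false; reason: string };

function decideBroaden(
  rolloutPipelineId: string,
  gate?: ReplayGate, // absent when no replayJobId was supplied
  force = false,
): GateDecision {
  // Opt-in: without a gate, behave exactly as before.
  if (gate === undefined) {
    return { allowed: true, overridden: false };
  }
  // The job must target the rollout's pipeline (assumed not overridable).
  if (gate.jobPipelineId !== rolloutPipelineId) {
    return { allowed: false, reason: "replay job targets a different pipeline" };
  }
  if (gate.verdict === "FAIL") {
    // FAIL blocks unless forced; the caller is expected to audit the override.
    return force
      ? { allowed: true, overridden: true }
      : { allowed: false, reason: "replay validation failed" };
  }
  // PASS proceeds; NO_DATA proceeding is an assumption of this sketch.
  return { allowed: true, overridden: false };
}
```

Returning `overridden` separately from `allowed` gives the caller a single flag to drive the audit entry for a forced broaden.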

* fix(sli): preserve empty-window 'breached' contract for rolling health

Reviewer P1: extracting evaluateSliOverWindow flipped evaluatePipelineHealth's zero-metric-row case from breached(value:0) to no_data, so a dark pipeline stopped reporting degraded and diverged from batch-health.ts / fleet-metrics.ts. Add an emptyWindowStatus param: rolling health passes 'breached' (its documented contract); replay validation keeps 'no_data' (an unscored window is no opinion). Covered by a new dark-pipeline test.
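
A rough sketch of how an emptyWindowStatus parameter can keep both contracts, assuming a simplified error-rate SLI. `MetricRow`, `SliResult`, `scoreErrorRate`, the "met" status name, and the null value for no_data are all hypothetical; only the 'breached' (value 0) and 'no_data' outcomes for an empty window come from the commit message. Treating a zero event total like an empty window is also an assumption of this sketch.

```typescript
type SliStatus = "met" | "breached" | "no_data";
type EmptyWindowStatus = "breached" | "no_data";

interface MetricRow {
  errors: number;
  total: number;
}

interface SliResult {
  status: SliStatus;
  value: number | null;
}

// Score an error-rate SLI over whatever rows fall inside the window.
function scoreErrorRate(
  rows: MetricRow[],
  maxErrorRate: number,
  emptyWindowStatus: EmptyWindowStatus,
): SliResult {
  const total = rows.reduce((sum, row) => sum + row.total, 0);
  if (rows.length === 0 || total === 0) {
    // The caller chooses what an unscored window means:
    // rolling health -> "breached" (dark pipeline is degraded),
    // replay validation -> "no_data" (no opinion).
    return emptyWindowStatus === "breached"
      ? { status: "breached", value: 0 }
      : { status: "no_data", value: null };
  }
  const errors = rows.reduce((sum, row) => sum + row.errors, 0);
  const rate = errors / total;
  return { status: rate > maxErrorRate ? "breached" : "met", value: rate };
}
```

Making the empty-window outcome an explicit argument, rather than a default, forces each caller to state which contract it relies on.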