feat(release): gate canary broaden on replay error-budget (NF-6) #504

Merged: TerrifiedBug merged 2 commits into main from feat/nf-6-replay-promotion-gate on Jun 8, 2026

@TerrifiedBug

Owner

NF-6 — replay-driven promotion gating

The plan's NF-6 ('prove the new config is safe before fleet-wide rollout') was only partially shipped: #496 added replay completion validation in nextReplayBatch, but there was no promotion gate anywhere in release/. This adds the actual gate.

What

  • sli-evaluator — extracted evaluateSliOverWindow(pipelineId, sli, since, until?) + rollUpSliStatus(). Rolling evaluatePipelineHealth behavior is unchanged (open-ended window); the new fixed-window form lets a replay be scored over exactly [startedAt, completedAt].
  • replay-validation service: evaluateReplayValidation scores a completed replay against the target pipeline's own PipelineSli thresholds (no parallel 'budget' concept) over the replay window → PASS / FAIL / NO_DATA. Only error_rate/discard_rate/latency_mean are scored; throughput_floor is excluded (a time-rate gate is meaningless over an artificial replay window). NO_DATA is never treated as approval.
  • replay.validate query (VIEWER, target-scoped) surfaces the verdict for the UI.
  • canary.broaden — opt-in replayJobId gate. Absent → byte-identical prior behavior (zero risk to existing callers). A FAIL verdict blocks the broaden unless force is set; a forced override is recorded in the audit log. The replay job must target the rollout's pipeline (else BAD_REQUEST).
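
For reviewers who want the verdict logic in one place, here is a minimal sketch of how a replay could be scored over its fixed window. It is not the code in this PR: the `Sli`, `MetricSample` and `ReplayWindow` shapes, the helper names `scoreSliOverWindow` and `replayVerdict`, the mean-versus-threshold comparison, and the handling of a mix of met and unscored SLIs are all assumptions made for illustration.

```typescript
// Hypothetical shapes; the real PipelineSli model and metric store are not shown in this PR.
type SliMetric = "error_rate" | "discard_rate" | "latency_mean" | "throughput_floor";
type SliStatus = "met" | "breached" | "no_data";
type Verdict = "PASS" | "FAIL" | "NO_DATA";

interface Sli { metric: SliMetric; threshold: number; } // threshold assumed to be an upper bound
interface MetricSample { metric: SliMetric; at: number; value: number; } // at: epoch millis
interface ReplayWindow { startedAt: number; completedAt: number; }

// Only these metrics are scored; throughput_floor is a time-rate gate and is skipped.
const SCORED_METRICS: ReadonlySet<SliMetric> = new Set<SliMetric>([
  "error_rate", "discard_rate", "latency_mean",
]);

// Score one SLI over exactly [startedAt, completedAt]; an empty window is "no_data".
function scoreSliOverWindow(sli: Sli, samples: MetricSample[], w: ReplayWindow): SliStatus {
  const values = samples
    .filter((s) => s.metric === sli.metric && s.at >= w.startedAt && s.at <= w.completedAt)
    .map((s) => s.value);
  if (values.length === 0) return "no_data";
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  return mean <= sli.threshold ? "met" : "breached";
}

// Roll the per-SLI statuses up into a single replay verdict.
function replayVerdict(slis: Sli[], samples: MetricSample[], w?: ReplayWindow): Verdict {
  if (!w) return "NO_DATA"; // no window: the replay never produced a scoreable span
  const applicable = slis.filter((s) => SCORED_METRICS.has(s.metric));
  if (applicable.length === 0) return "NO_DATA"; // no applicable SLIs
  const statuses = applicable.map((s) => scoreSliOverWindow(s, samples, w));
  if (statuses.some((s) => s === "breached")) return "FAIL";
  // Assumption: at least one SLI must actually be met for a PASS.
  if (!statuses.some((s) => s === "met")) return "NO_DATA";
  return "PASS";
}
```

Under these assumptions a sample that falls outside the replay window never influences the verdict, which is the window-scoping property the tests below exercise.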

Tests

  • replay-validation.test.ts — PASS/FAIL/NO_DATA (no window, no applicable SLIs, no metrics), window scoping, metric-filter (throughput_floor excluded).
  • release-canary.test.ts — gate branches: no-gate passthrough, FAIL blocks (broaden not called), force override (audited), PASS records verdict, pipeline-mismatch rejection.
  • sli-evaluator + cross-org walker regression: green.

Verification limits

Unit-tested + typechecked + lint-clean. The live end-to-end path (a real canary agent replaying a sample → metrics → gate) is infra-dependent and not exercised here. The gate is opt-in and additive, so it cannot affect existing promotions.
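
The opt-in, additive nature of the gate can be sketched as a pure decision function. This is an illustration under stated assumptions, not the canary.broaden implementation: `decideBroaden`, `ReplayJob` and `GateDecision` are hypothetical names, the job is assumed to arrive with its verdict already computed, and because the PR does not say what the gate does with NO_DATA, the sketch blocks only on FAIL and passes any other verdict through for recording.

```typescript
type Verdict = "PASS" | "FAIL" | "NO_DATA";

// Hypothetical view of a completed replay job with its verdict attached.
interface ReplayJob { id: string; targetPipelineId: string; verdict: Verdict; }

type GateDecision =
  | { action: "proceed"; verdict: Verdict | null; auditOverride: boolean }
  | { action: "block"; verdict: "FAIL" }
  | { action: "reject"; code: "BAD_REQUEST" };

function decideBroaden(rolloutPipelineId: string, force: boolean, job?: ReplayJob): GateDecision {
  // No replay job supplied: the gate is not engaged and the prior path runs unchanged.
  if (!job) return { action: "proceed", verdict: null, auditOverride: false };

  // The replay must have targeted the pipeline being rolled out.
  if (job.targetPipelineId !== rolloutPipelineId) {
    return { action: "reject", code: "BAD_REQUEST" };
  }

  // A FAIL blocks unless the caller forces it; a forced override is flagged for the audit log.
  if (job.verdict === "FAIL") {
    return force
      ? { action: "proceed", verdict: "FAIL", auditOverride: true }
      : { action: "block", verdict: "FAIL" };
  }

  // Assumption: any non-FAIL verdict proceeds and is recorded as-is (NO_DATA stays NO_DATA).
  return { action: "proceed", verdict: job.verdict, auditOverride: false };
}
```

Keeping the decision separate from the side effects (the broaden call and the audit write) is one way to make each gate branch testable without a live canary agent.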

The plan's NF-6 (replay-driven promotion validation) was only partially shipped by #496, which added replay COMPLETION validation but no promotion gate. This adds the actual gate: a canary -> full-fleet broaden can be gated on how the candidate behaved over a replayed sample of past lake events.

- sli-evaluator: extract evaluateSliOverWindow + rollUpSliStatus (rolling health behavior unchanged) so a fixed [startedAt, completedAt] window can be scored.

- replay-validation service: score a completed replay against the target pipeline's OWN SLIs (error_rate/discard_rate/latency_mean) over the replay window -> PASS/FAIL/NO_DATA. throughput_floor excluded (time-rate gate, meaningless over an artificial replay window).

- replay.validate query (VIEWER, target-scoped) surfaces the verdict.

- canary.broaden: opt-in replayJobId gate. Absent -> byte-identical prior behavior. FAIL blocks unless force (override audited). Job must target the rollout's pipeline.

Unit-tested (verdict logic, window scoping, gate branches incl. override + pipeline-mismatch). Live canary-replay E2E is infra-dependent.

Reviewer P1: extracting evaluateSliOverWindow flipped the zero-metric-row case in evaluatePipelineHealth from breached (value: 0) to no_data, so a dark pipeline stopped reporting as degraded and diverged from batch-health.ts / fleet-metrics.ts. Fix: add an emptyWindowStatus param. Rolling health passes 'breached' (its documented contract); replay validation keeps 'no_data' (an unscored window is no opinion). Covered by a new dark-pipeline test.
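
The empty-window fix can be pictured with a small sketch. The function name `evaluateWindow`, the `SliResult` shape and the mean-versus-threshold comparison are hypothetical; only the emptyWindowStatus parameter and its two values come from the commit message above.

```typescript
type SliStatus = "met" | "breached" | "no_data";
interface SliResult { status: SliStatus; value: number | null; }

// emptyWindowStatus decides what a window with zero metric rows means to the caller.
function evaluateWindow(
  values: number[],
  threshold: number,
  emptyWindowStatus: "breached" | "no_data",
): SliResult {
  if (values.length === 0) {
    // Rolling health: a dark pipeline is degraded, reported as breached with value 0.
    // Replay validation: an unscored window carries no opinion.
    return emptyWindowStatus === "breached"
      ? { status: "breached", value: 0 }
      : { status: "no_data", value: null };
  }
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  return { status: mean <= threshold ? "met" : "breached", value: mean };
}
```

With this shape the rolling caller would pass 'breached' and the replay caller 'no_data', so both share one evaluator without sharing an empty-window policy.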
@github-actions github-actions Bot added feature and removed feature labels Jun 8, 2026
@TerrifiedBug TerrifiedBug merged commit 61ad32e into main Jun 8, 2026
18 checks passed
@TerrifiedBug TerrifiedBug deleted the feat/nf-6-replay-promotion-gate branch June 8, 2026 22:19