feat(release): gate canary broaden on replay error-budget (NF-6) #504
Merged
The plan's NF-6 (replay-driven promotion validation) was only partially shipped by #496, which added replay COMPLETION validation but no promotion gate. This adds the actual gate: a canary -> full-fleet broaden can be gated on how the candidate behaved over a replayed sample of past lake events.
- sli-evaluator: extract evaluateSliOverWindow + rollUpSliStatus (rolling health behavior unchanged) so a fixed [startedAt, completedAt] window can be scored.
- replay-validation service: score a completed replay against the target pipeline's OWN SLIs (error_rate/discard_rate/latency_mean) over the replay window -> PASS/FAIL/NO_DATA. throughput_floor is excluded (it is a time-rate gate, meaningless over an artificial replay window).
- replay.validate query (VIEWER, target-scoped) surfaces the verdict.
- canary.broaden: opt-in replayJobId gate. Absent -> byte-identical prior behavior. FAIL blocks unless force is set (the override is audited). The job must target the rollout's pipeline.
Unit-tested (verdict logic, window scoping, gate branches incl. override + pipeline-mismatch). Live canary-replay E2E is infra-dependent.
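For readers skimming the commit, here is a minimal sketch of how a fixed-window verdict of this kind could be computed. It is not the code in this PR: the `Sli` and `MetricRow` shapes, the `scoreSli` and `replayVerdict` names, the use of a mean as the aggregate, and the "value above threshold means breached" rule are all assumptions made for illustration. The roll-up rule (any breach fails, at least one scored SLI passes, otherwise no data) is likewise an assumption.

```typescript
// Hypothetical shapes; the real PipelineSli and metric-row types are not shown in this PR.
type SliMetric = "error_rate" | "discard_rate" | "latency_mean" | "throughput_floor";
type SliStatus = "ok" | "breached" | "no_data";
type ReplayVerdict = "PASS" | "FAIL" | "NO_DATA";

interface Sli { metric: SliMetric; threshold: number; }
interface MetricRow { at: number; metric: SliMetric; value: number; }

// throughput_floor is a time-rate gate, so it is not scored over a replay window.
const REPLAY_SCORED: ReadonlySet<SliMetric> =
  new Set<SliMetric>(["error_rate", "discard_rate", "latency_mean"]);

// Score one SLI over the closed window [startedAt, completedAt].
// Assumption: the aggregate is a plain mean and a value above the threshold is a breach.
function scoreSli(sli: Sli, rows: MetricRow[], startedAt: number, completedAt: number): SliStatus {
  const inWindow = rows.filter(
    (r) => r.metric === sli.metric && r.at >= startedAt && r.at <= completedAt,
  );
  if (inWindow.length === 0) return "no_data"; // an unscored window is no opinion
  const mean = inWindow.reduce((sum, r) => sum + r.value, 0) / inWindow.length;
  return mean > sli.threshold ? "breached" : "ok";
}

// Roll the per-SLI statuses up into a single verdict.
function replayVerdict(
  slis: Sli[],
  rows: MetricRow[],
  startedAt?: number,
  completedAt?: number,
): ReplayVerdict {
  // No window (replay never started or never completed) cannot be scored.
  if (startedAt === undefined || completedAt === undefined) return "NO_DATA";
  const statuses = slis
    .filter((s) => REPLAY_SCORED.has(s.metric))
    .map((s) => scoreSli(s, rows, startedAt, completedAt));
  if (statuses.some((s) => s === "breached")) return "FAIL";
  if (statuses.some((s) => s === "ok")) return "PASS";
  return "NO_DATA"; // no applicable SLIs, or no metrics in the window
}
```

Returning `NO_DATA` rather than `PASS` for an empty result keeps "nothing was measured" distinct from "everything was within threshold".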
Reviewer P1: extracting evaluateSliOverWindow flipped evaluatePipelineHealth's zero-metric-row case from breached(value:0) to no_data, so a dark pipeline stopped reporting degraded and diverged from batch-health.ts / fleet-metrics.ts. Add an emptyWindowStatus param: rolling health passes 'breached' (its documented contract); replay validation keeps 'no_data' (an unscored window is no opinion). Covered by a new dark-pipeline test.
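The empty-window behavior described in this fix can be illustrated with a small sketch. This is not the evaluator from the PR: `evaluateWindow`, `SliResult`, the pre-fetched `values` array, and the mean-versus-threshold comparison are hypothetical simplifications. Only the `emptyWindowStatus` parameter name and the two outcomes (`breached` with value 0 for rolling health, `no_data` for replay validation) come from the commit message.

```typescript
type SliStatus = "ok" | "breached" | "no_data";
type EmptyWindowStatus = "breached" | "no_data";

// Hypothetical result shape for illustration.
interface SliResult { status: SliStatus; value: number | null; }

// Assumption: metric values for the window have already been fetched,
// and a mean above the threshold counts as a breach.
function evaluateWindow(
  values: number[],
  threshold: number,
  emptyWindowStatus: EmptyWindowStatus,
): SliResult {
  if (values.length === 0) {
    // Zero metric rows: the caller decides what silence means.
    return emptyWindowStatus === "breached"
      ? { status: "breached", value: 0 } // rolling health: a dark pipeline is degraded
      : { status: "no_data", value: null }; // replay validation: no opinion
  }
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  return { status: mean > threshold ? "breached" : "ok", value: mean };
}

// Rolling health passes "breached"; replay validation passes "no_data".
const darkPipeline = evaluateWindow([], 0.05, "breached");
const unscoredReplay = evaluateWindow([], 0.05, "no_data");
```

Making the empty-window outcome an explicit argument lets both callers share one evaluator while keeping their differing contracts visible at the call site.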
NF-6 — replay-driven promotion gating
The plan's NF-6 ('prove the new config is safe before fleet-wide rollout') was only partially shipped: #496 added replay completion validation in `nextReplayBatch`, but there was no promotion gate anywhere in `release/`. This adds the actual gate.
What
- `sli-evaluator`: extracted `evaluateSliOverWindow(pipelineId, sli, since, until?)` + `rollUpSliStatus()`. Rolling `evaluatePipelineHealth` behavior is unchanged (open-ended window); the new fixed-window form lets a replay be scored over exactly `[startedAt, completedAt]`.
- `replay-validation` service: `evaluateReplayValidation` scores a completed replay against the target pipeline's own `PipelineSli` thresholds (no parallel 'budget' concept) over the replay window → `PASS` / `FAIL` / `NO_DATA`. Only `error_rate` / `discard_rate` / `latency_mean` are scored; `throughput_floor` is excluded (a time-rate gate is meaningless over an artificial replay window). `NO_DATA` is never treated as approval.
- `replay.validate` query (VIEWER, target-scoped) surfaces the verdict for the UI.
- `canary.broaden`: opt-in `replayJobId` gate. Absent → byte-identical prior behavior (zero risk to existing callers). A `FAIL` verdict blocks the broaden unless `force` is set; a forced override is recorded in the audit log. The replay job must target the rollout's pipeline (else `BAD_REQUEST`).
Tests
- `replay-validation.test.ts`: PASS/FAIL/NO_DATA (no window, no applicable SLIs, no metrics), window scoping, metric filter (`throughput_floor` excluded).
- `release-canary.test.ts`: gate branches: no-gate passthrough, FAIL blocks (broaden not called), force override (audited), PASS records verdict, pipeline-mismatch rejection.
Verification limits
Unit-tested + typechecked + lint-clean. The live end-to-end path (a real canary agent replaying a sample → metrics → gate) is infra-dependent and not exercised here. The gate is opt-in and additive, so it cannot affect existing promotions.
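To make the opt-in claim concrete, the gate branches exercised by the unit tests can be sketched as a pure decision function. This is an illustration under stated assumptions, not the `canary.broaden` handler itself: `decideBroaden`, `ReplayJob`, and `GateDecision` are hypothetical names, and the replay job is assumed to arrive with its verdict already computed. The PR text only says that `NO_DATA` is never treated as approval; handling it exactly like `FAIL` (blocked unless forced) is an assumption of this sketch.

```typescript
type ReplayVerdict = "PASS" | "FAIL" | "NO_DATA";

// Hypothetical shapes for illustration.
interface ReplayJob { id: string; pipelineId: string; verdict: ReplayVerdict; }

type GateDecision =
  | { action: "broaden"; verdict?: ReplayVerdict; overridden: boolean }
  | { action: "block"; verdict: ReplayVerdict }
  | { action: "reject"; code: "BAD_REQUEST" };

// job is undefined when the caller did not supply a replayJobId.
function decideBroaden(
  rolloutPipelineId: string,
  job: ReplayJob | undefined,
  force: boolean,
): GateDecision {
  // No gate requested: behave exactly as before.
  if (job === undefined) return { action: "broaden", overridden: false };

  // The replay must have exercised the pipeline being rolled out.
  if (job.pipelineId !== rolloutPipelineId) return { action: "reject", code: "BAD_REQUEST" };

  if (job.verdict === "PASS") {
    return { action: "broaden", verdict: "PASS", overridden: false };
  }

  // FAIL (and, by assumption here, NO_DATA) blocks unless explicitly forced.
  // A real handler would write an audit-log entry when overridden is true.
  return force
    ? { action: "broaden", verdict: job.verdict, overridden: true }
    : { action: "block", verdict: job.verdict };
}
```

Keeping the decision separate from the side effects (the broaden call and the audit write) is one way to cover every branch in unit tests without live canary infrastructure.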