feat(staged-rollout): SLO-gated canary broadening (IF-7) #495
Merged
broadenRollout previously advanced a canary forward without consulting any error-budget/SLO signal — the manual broaden path had no gate at all and the auto-broaden path only checked the per-pipeline success criteria. Add an explicit error-budget gate inside broadenRollout (covering both the manual tRPC mutation and the auto-broaden path). It reuses the SAME signal the health-check already uses (getAggregateErrorRate over NodePipelineStatus rows) — no new Prometheus/metric client. When the canary's aggregate error ratio exceeds VF_ROLLOUT_ERROR_BUDGET (a 0..1 ratio, default 0.05) the broaden is held: the reason is recorded on the rollout (reviewNote), a warning is logged, and the call aborts with TRPCError PRECONDITION_FAILED. Status is left in HEALTH_CHECK. Conservative by design: missing metrics (null signal) never block, and this gate never triggers an auto-rollback (out of scope).
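The hold decision described above can be sketched roughly as follows. This is a dependency-free illustration, not the code from this PR: `checkErrorBudget`, `holdIfBudgetBurned`, `RolloutRecord`, `BudgetVerdict`, and `PreconditionFailedError` are hypothetical names, the error class stands in for a `TRPCError` with code `PRECONDITION_FAILED`, and the signal is assumed to arrive as a percentage (or `null` when no metrics exist).

```typescript
// Illustrative sketch only. Apart from identifiers quoted from the PR text,
// every name here is hypothetical and chosen for this example.

// 0..1 ratio; the stated default for VF_ROLLOUT_ERROR_BUDGET.
const DEFAULT_ERROR_BUDGET = 0.05;

// Stand-in for TRPCError({ code: "PRECONDITION_FAILED" }) so the sketch needs no imports.
class PreconditionFailedError extends Error {
  readonly code = "PRECONDITION_FAILED";
}

interface RolloutRecord {
  status: "HEALTH_CHECK" | "BROADENED";
  reviewNote: string | null;
}

interface BudgetVerdict {
  burned: boolean;
  errorRatio: number | null; // null when no metrics were available
  reason: string | null; // set only when the budget is burned
}

// Assumption: the aggregate signal is a percentage (0..100), or null when absent.
function checkErrorBudget(
  errorRatePercent: number | null,
  budget: number = DEFAULT_ERROR_BUDGET,
): BudgetVerdict {
  // Conservative by design: a missing signal never blocks the broaden.
  if (errorRatePercent === null) {
    return { burned: false, errorRatio: null, reason: null };
  }
  const errorRatio = errorRatePercent / 100;
  // Strict comparison: a ratio exactly equal to the budget still passes.
  if (errorRatio > budget) {
    const reason =
      `Broaden held: error budget exceeded ` +
      `(ratio ${errorRatio.toFixed(4)} > budget ${budget})`;
    return { burned: true, errorRatio, reason };
  }
  return { burned: false, errorRatio, reason: null };
}

// Records the hold reason, logs a warning, and aborts. Status is never changed here,
// so a held rollout stays in HEALTH_CHECK.
function holdIfBudgetBurned(
  rollout: RolloutRecord,
  errorRatePercent: number | null,
  budget: number = DEFAULT_ERROR_BUDGET,
): void {
  const verdict = checkErrorBudget(errorRatePercent, budget);
  if (!verdict.burned || verdict.reason === null) return;
  rollout.reviewNote = verdict.reason;
  console.warn(verdict.reason);
  throw new PreconditionFailedError(verdict.reason);
}
```

With the default budget, a canary at 10% (ratio 0.10) is held, while 1% or a `null` signal lets the broaden proceed. Exactly 5% also proceeds, because the comparison is strict.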
IF-7 — SLO-gated canary broadening
broadenRollout previously advanced a canary without consulting any error-budget/SLO signal. The manual broaden path (the release.canary.broaden tRPC mutation) had no gate at all; the auto-broaden path only consulted the per-pipeline successCriteria.

Change
Added an explicit error-budget gate inside broadenRollout, so it protects both the manual tRPC mutation and the auto-broaden path in checkHealthWindows.
- Reuses getAggregateErrorRate(pipelineId) from auto-rollback.ts (aggregate errorsTotal / eventsIn across NodePipelineStatus rows). No new Prometheus/metric client.
- evaluateErrorBudget(pipelineId) converts that percentage to a 0..1 ratio and compares it against VF_ROLLOUT_ERROR_BUDGET.
- Budget exceeded (errorRatio > budget): the broaden is held. Status stays HEALTH_CHECK (never advanced to BROADENED), the reason is recorded on the rollout (Release.reviewNote), a warnLog is emitted, and the call aborts with TRPCError(PRECONDITION_FAILED) so the manual caller gets a clear error (and the deploy.staged_broadened audit log is not written).
- Missing metrics (getAggregateErrorRate → null): treated as not burned, so broadening is never blocked on an absent signal.

Config
VF_ROLLOUT_ERROR_BUDGET: added to the Zod env schema (src/lib/env.ts). A 0..1 ratio (z.coerce.number().min(0).max(1)), default 0.05 (5%). The default matches the default successCriteria.maxErrorRatePercent (5%), so the auto-broaden path doesn't double-block; the gate adds protection for the manual path and for pipelines whose success criteria are looser than the org SLO.

No migration
Reuses the existing Release.reviewNote free-text field to record the hold reason, so no Prisma schema change or migration is required.

Tests
Extended src/server/services/__tests__/staged-rollout.test.ts (mock prisma + the nodePipelineStatus metric source):
- Over budget (100/1000, ratio 0.10) vs 0.05 budget: broaden rejects with an /error budget/i message, no relayPush to remaining nodes, reviewNote reason recorded, status not advanced to BROADENED, no canary_broadened SSE.
- Within budget (10/1000, ratio 0.01): broaden completes, pushes to all 3 remaining nodes, status → BROADENED, canary_broadened SSE fired.

pnpm exec vitest run src/server/services/__tests__/staged-rollout.test.ts → 22 passed (20 existing + 2 new; the existing happy-path broaden, which provides no metric data, confirms the conservative null → proceed behavior). Filtered tsc --noEmit on changed files: clean.

Verification notes
Cloud/agent UI paths aren't locally verifiable; unit tests + types are the bar here.
@clickhouse/client is not installed locally (known gap); not touched by this change.
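For reference, the validation rule stated under Config (coerce to a number, require 0 ≤ value ≤ 1, default 0.05) behaves roughly like the dependency-free sketch below. It approximates the z.coerce.number().min(0).max(1) entry and is not the actual src/lib/env.ts; `parseErrorBudget` is a hypothetical name, and how the real schema reports invalid values is not shown in this PR.

```typescript
// Hypothetical, dependency-free approximation of the VF_ROLLOUT_ERROR_BUDGET rule.
// The real project validates this through its Zod env schema.
const FALLBACK_ERROR_BUDGET = 0.05; // stated default (5%)

function parseErrorBudget(raw: string | undefined): number {
  // An unset variable falls back to the default.
  if (raw === undefined) return FALLBACK_ERROR_BUDGET;

  // Coerce with Number(), then reject NaN and anything outside 0..1.
  const value = Number(raw);
  if (Number.isNaN(value) || value < 0 || value > 1) {
    throw new RangeError(
      `VF_ROLLOUT_ERROR_BUDGET must be a ratio between 0 and 1, got "${raw}"`,
    );
  }
  return value;
}
```

Under these assumptions a value such as "5" is rejected rather than read as 5%, which guards against configuring a percentage where a ratio is expected.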