feat(staged-rollout): SLO-gated canary broadening (IF-7) #495

Merged: TerrifiedBug merged 1 commit into main from feat/if-7-slo-gated-rollout on Jun 8, 2026

@TerrifiedBug (Owner)

IF-7 — SLO-gated canary broadening

broadenRollout advanced a canary forward without consulting any error-budget/SLO signal. The manual broaden path (release.canary.broaden tRPC mutation) had no gate at all; the auto-broaden path only consulted the per-pipeline successCriteria.

Change

Added an explicit error-budget gate inside broadenRollout, so it protects both the manual tRPC mutation and the auto-broaden path in checkHealthWindows.

  • Reuses the same signal the health-check uses: getAggregateErrorRate(pipelineId) from auto-rollback.ts (aggregate errorsTotal / eventsIn across NodePipelineStatus rows). No new Prometheus/metric client.
  • New private helper evaluateErrorBudget(pipelineId) converts that percentage to a 0..1 ratio and compares it against VF_ROLLOUT_ERROR_BUDGET.
  • When the budget is clearly burned (errorRatio > budget): the broaden is held — status stays HEALTH_CHECK (never advanced to BROADENED), the reason is recorded on the rollout (Release.reviewNote), a warnLog is emitted, and the call aborts with TRPCError(PRECONDITION_FAILED) so the manual caller gets a clear error (and the deploy.staged_broadened audit log is not written).
  • Conservative by design:
    • Missing metrics (getAggregateErrorRate → null) → treated as not burned → broadening is never blocked on an absent signal.
    • This gate never triggers an auto-rollback (explicitly out of scope) — it only holds the canary.
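
The gate described above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the Rollout shape and the broadenWithGate name are hypothetical, evaluateErrorBudget here takes the error-rate percentage directly instead of a pipelineId so the block stays self-contained, and a plain Error stands in for TRPCError(PRECONDITION_FAILED).

```typescript
// Hypothetical, trimmed-down rollout shape; the real Release model has more fields.
type RolloutStatus = "HEALTH_CHECK" | "BROADENED";
interface Rollout {
  status: RolloutStatus;
  reviewNote: string | null;
}

interface BudgetResult {
  burned: boolean;
  errorRatio: number | null;
}

// errorRatePercent mirrors what getAggregateErrorRate is described as returning:
// a percentage, or null when no metrics are available.
function evaluateErrorBudget(errorRatePercent: number | null, budget: number): BudgetResult {
  if (errorRatePercent === null) {
    // Conservative: an absent signal is treated as "not burned".
    return { burned: false, errorRatio: null };
  }
  const errorRatio = errorRatePercent / 100; // percentage -> 0..1 ratio
  return { burned: errorRatio > budget, errorRatio };
}

function broadenWithGate(rollout: Rollout, errorRatePercent: number | null, budget: number): void {
  const { burned, errorRatio } = evaluateErrorBudget(errorRatePercent, budget);
  if (burned) {
    // Hold the canary: record the reason, warn, and abort. Status is not advanced.
    rollout.reviewNote = `Broaden held: error budget exceeded (${errorRatio} > ${budget})`;
    console.warn(rollout.reviewNote);
    // Stand-in for TRPCError with code PRECONDITION_FAILED.
    throw new Error(rollout.reviewNote);
  }
  rollout.status = "BROADENED";
}
```

Because the comparison is strict (errorRatio > budget), an error ratio exactly equal to the budget still proceeds in this sketch, which matches the "clearly burned" wording above.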

Config

  • VF_ROLLOUT_ERROR_BUDGET — added to the Zod env schema (src/lib/env.ts). A 0..1 ratio (z.coerce.number().min(0).max(1)), default 0.05 (5%). With the default it matches the default successCriteria.maxErrorRatePercent (5%), so the auto-broaden path doesn't double-block; the gate adds protection for the manual path and for pipelines whose success criteria are looser than the org SLO.
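
To make the constraints on VF_ROLLOUT_ERROR_BUDGET concrete, here is a dependency-free sketch of the same rules (a number in 0..1, defaulting to 0.05). It is a hand-rolled stand-in, not the Zod schema in src/lib/env.ts; parseErrorBudget and DEFAULT_ERROR_BUDGET are hypothetical names, and edge cases such as an empty string may behave differently under Zod.

```typescript
// Hypothetical constant mirroring the documented default of 0.05 (5%).
const DEFAULT_ERROR_BUDGET = 0.05;

// Parses the raw environment value into a 0..1 ratio.
// Unset -> default; non-numeric or out-of-range -> RangeError.
function parseErrorBudget(raw: string | undefined): number {
  if (raw === undefined) {
    return DEFAULT_ERROR_BUDGET;
  }
  const value = Number(raw); // coerce the string to a number
  if (Number.isNaN(value) || value < 0 || value > 1) {
    throw new RangeError(
      `VF_ROLLOUT_ERROR_BUDGET must be a number between 0 and 1, got "${raw}"`
    );
  }
  return value;
}
```

Note that the value is a ratio, not a percentage: a 2% budget would be written as 0.02, and a value such as 5 is rejected as out of range.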

No migration

Reuses the existing Release.reviewNote free-text field to record the hold reason — no Prisma schema change / migration required.

Tests

Extended src/server/services/__tests__/staged-rollout.test.ts (mock prisma + the nodePipelineStatus metric source):

  • Burned budget is blocked — 10% error ratio (100/1000) vs 0.05 budget: broaden rejects with an /error budget/i message, no relayPush to remaining nodes, reviewNote reason recorded, status not advanced to BROADENED, no canary_broadened SSE.
  • Within budget proceeds — 1% error ratio (10/1000): broaden completes, pushes to all 3 remaining nodes, status → BROADENED, canary_broadened SSE fired.
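
The arithmetic behind the two scenarios can be worked through with a small stand-in for the aggregate signal. This is only a sketch: aggregateErrorRatePercent is a hypothetical helper, the per-node row values are illustrative (the description gives only the totals 100/1000 and 10/1000), and returning null when eventsIn sums to zero is an assumption about how a missing signal looks.

```typescript
// Hypothetical subset of a NodePipelineStatus row.
interface NodePipelineStatusRow {
  errorsTotal: number;
  eventsIn: number;
}

// Sums errorsTotal and eventsIn across rows and returns a percentage,
// or null when there is nothing to measure (assumed behavior).
function aggregateErrorRatePercent(rows: NodePipelineStatusRow[]): number | null {
  let errors = 0;
  let events = 0;
  for (const row of rows) {
    errors += row.errorsTotal;
    events += row.eventsIn;
  }
  if (events === 0) {
    return null;
  }
  return (errors * 100) / events;
}

const budget = 0.05; // default VF_ROLLOUT_ERROR_BUDGET

// Burned case: 100 errors over 1000 events, split across two canary nodes.
const burnedPercent = aggregateErrorRatePercent([
  { errorsTotal: 60, eventsIn: 500 },
  { errorsTotal: 40, eventsIn: 500 },
]); // 10 (percent)

// Healthy case: 10 errors over 1000 events.
const healthyPercent = aggregateErrorRatePercent([
  { errorsTotal: 10, eventsIn: 1000 },
]); // 1 (percent)

const burnedBlocked = burnedPercent !== null && burnedPercent / 100 > budget;   // true: 0.10 > 0.05
const healthyBlocked = healthyPercent !== null && healthyPercent / 100 > budget; // false: 0.01 <= 0.05
```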

pnpm exec vitest run src/server/services/__tests__/staged-rollout.test.ts → 22 passed (20 existing + 2 new; the existing happy-path broaden, which provides no metric data, confirms the conservative null → proceed behavior). Filtered tsc --noEmit on changed files: clean.

Verification notes

Cloud/agent UI paths aren't locally verifiable; unit tests + types are the bar here. @clickhouse/client is not installed locally (known gap) — not touched by this change.

broadenRollout previously advanced a canary forward without consulting any
error-budget/SLO signal — the manual broaden path had no gate at all and the
auto-broaden path only checked the per-pipeline success criteria.

Add an explicit error-budget gate inside broadenRollout (covering both the
manual tRPC mutation and the auto-broaden path). It reuses the SAME signal the
health-check already uses (getAggregateErrorRate over NodePipelineStatus rows)
— no new Prometheus/metric client. When the canary's aggregate error ratio
exceeds VF_ROLLOUT_ERROR_BUDGET (a 0..1 ratio, default 0.05) the broaden is
held: the reason is recorded on the rollout (reviewNote), a warning is logged,
and the call aborts with TRPCError PRECONDITION_FAILED. Status is left in
HEALTH_CHECK.

Conservative by design: missing metrics (null signal) never block, and this
gate never triggers an auto-rollback (out of scope).

@TerrifiedBug merged commit 4295a17 into main on Jun 8, 2026
17 checks passed
@TerrifiedBug deleted the feat/if-7-slo-gated-rollout branch on June 8, 2026 12:39