Skip to content

fix(schedulers): re-check leadership in scheduler ticks (SC-3)#497

Merged
TerrifiedBug merged 2 commits into
mainfrom
feat/sc-3-leader-guards
Jun 8, 2026
Merged

fix(schedulers): re-check leadership in scheduler ticks (SC-3)#497
TerrifiedBug merged 2 commits into
mainfrom
feat/sc-3-leader-guards

Conversation

@TerrifiedBug

Copy link
Copy Markdown
Owner

SC-3 — de-SPOF schedulers

Problem

Background schedulers start based on leadership at boot (src/instrumentation.node.tsstartSingletonServices) but then run forever via setInterval/node-cron and never re-check leadership. When a leader is demoted (LeaderElection flips _isLeader=false after 3 failed Redis renewals), the old instance's timers keep firing for up to one TTL (~15s) while the newly-elected leader is already running — producing duplicate alerts, backups, anomaly runs, cost-analysis passes, delivery retries and telemetry heartbeats.

Change

Add an early leadership guard at the top of each scheduler's periodic work so a demoted instance becomes a no-op:

if (!isLeader()) {            // from @/server/services/leader-election
  debugLog("<tag>", "Skipping … — instance is no longer leader");
  return;
}

Guard is guard-only — timers/cron tasks are left registered, so each scheduler resumes cleanly if the instance re-acquires leadership (no teardown/re-init churn).

Guards added to:

  • fleet-alert-service.ts — top of tick()
  • anomaly-detection-job.ts — top of tick()
  • lake/lake-alerts.ts — top of module-level tick()
  • backup-scheduler.ts — top of the per-org cron.schedule callback
  • cost-optimizer-scheduler.ts — top of the shared scheduleJob callback (covers both the daily and continuous cron jobs, which share that one callback)
  • telemetry-scheduler.ts — top of the daily heartbeat cron callback
  • retry-service.ts — top of processRetries() (also covers processOutboundRetries(), which it calls)

cert-expiry-checker.ts intentionally gets no guard: it has no own timer/cron and its only production caller is checkCertificateExpiry() inside the now-guarded fleet-alert tick (confirmed by searching all callers).

Tests

Extended the two critical scheduler specs (fleet-alert-service.test.ts, anomaly-detection-job.test.ts) with a leadership guard block that mocks @/server/services/leader-election + Prisma and asserts:

  • isLeader() === false → tick is a no-op: no org scan (organization.findMany), no rule/pipeline evaluation, no channel delivery.
  • isLeader() === true → tick proceeds and evaluates per-org as before.

Both specs default the mock to leader so the existing tick-driven tests keep doing work.

Verification

  • pnpm exec vitest run on fleet-alert-service.test.ts + anomaly-detection-job.test.ts30 passed (incl. 4 new guard tests).
  • Re-ran the 5 other touched schedulers' existing specs (lake-alerts, backup-scheduler-per-org, cost-optimizer-scheduler[-per-org], telemetry-scheduler, retry-service) → 38 passed, no regressions (test env has no REDIS_URL, so the real isLeader() returns true and the guards pass through).
  • Filtered tsc --noEmit: no type errors in any changed file. The only diagnostic is the known local @clickhouse/client gap in lake/clickhouse.ts (package not installed locally) — ignored per task guidance.

Notes

  • No Prisma migration required.
  • License boundary respected: all new imports are @/… (src-internal); no @cloud/*.
  • The lake/clickhouse path can't be exercised locally (no @clickhouse/client); the guard sits before any ClickHouse access and is covered structurally by the lake-alerts spec. Failover/demotion behaviour itself isn't locally reproducible without a multi-instance Redis setup — unit tests + types are the bar here.

Background schedulers start based on leadership at boot
(instrumentation.node.ts startSingletonServices) but then run forever
via setInterval/cron and never re-check leadership. When a leader is
demoted (3 failed Redis renewals) its timers keep firing for up to one
TTL (~15s) while the new leader also runs — duplicating alerts, backups,
anomaly runs, cost analysis, retries and telemetry heartbeats.

Add an early `if (!isLeader()) return;` guard (from
@/server/services/leader-election) at the top of each scheduler's
periodic work so a demoted instance becomes a no-op, with a logger.debug
on skip. Guard only — timers/cron tasks are left registered so work
resumes cleanly if leadership is re-acquired.

Guards added to: fleet-alert-service, anomaly-detection-job, lake-alerts,
backup-scheduler (per-org cron callback), cost-optimizer-scheduler
(shared daily + continuous cron callback), telemetry-scheduler and
retry-service (processRetries, which also covers processOutboundRetries).
cert-expiry-checker needs no guard — it has no own timer and only runs
inside the now-guarded fleet-alert tick.

Tests: extend fleet-alert-service and anomaly-detection-job specs to
assert the tick does no work (no org scan / rule eval / pipeline eval /
channel delivery) when isLeader() is false, and proceeds when true.
@github-actions github-actions Bot added the fix label Jun 8, 2026
…ewer SC-3)

Reviewer flagged two leader-gated singletons started in the same boot block that
were missed: git-sync-retry (P2 — duplicate git commits/pushes on the ~15s
demotion window via a non-atomic job claim) and metrics-rollup (lower-risk,
hourly+idempotent). Add the same isLeader() tick guard to both + a git-sync-retry
guard test (with a global isLeader=true default so existing tick tests are
unaffected).
@github-actions github-actions Bot added fix and removed fix labels Jun 8, 2026
@TerrifiedBug TerrifiedBug merged commit 256aa69 into main Jun 8, 2026
18 checks passed
@TerrifiedBug TerrifiedBug deleted the feat/sc-3-leader-guards branch June 8, 2026 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant