fix(schedulers): re-check leadership in scheduler ticks (SC-3)#497
Merged
Conversation
Background schedulers start based on leadership at boot (instrumentation.node.ts startSingletonServices) but then run forever via setInterval/cron and never re-check leadership. When a leader is demoted (3 failed Redis renewals) its timers keep firing for up to one TTL (~15s) while the new leader also runs — duplicating alerts, backups, anomaly runs, cost analysis, retries and telemetry heartbeats. Add an early `if (!isLeader()) return;` guard (from @/server/services/leader-election) at the top of each scheduler's periodic work so a demoted instance becomes a no-op, with a logger.debug on skip. Guard only — timers/cron tasks are left registered so work resumes cleanly if leadership is re-acquired. Guards added to: fleet-alert-service, anomaly-detection-job, lake-alerts, backup-scheduler (per-org cron callback), cost-optimizer-scheduler (shared daily + continuous cron callback), telemetry-scheduler and retry-service (processRetries, which also covers processOutboundRetries). cert-expiry-checker needs no guard — it has no own timer and only runs inside the now-guarded fleet-alert tick. Tests: extend fleet-alert-service and anomaly-detection-job specs to assert the tick does no work (no org scan / rule eval / pipeline eval / channel delivery) when isLeader() is false, and proceeds when true.
…ewer SC-3) Reviewer flagged two leader-gated singletons started in the same boot block that were missed: git-sync-retry (P2 — duplicate git commits/pushes on the ~15s demotion window via a non-atomic job claim) and metrics-rollup (lower-risk, hourly+idempotent). Add the same isLeader() tick guard to both + a git-sync-retry guard test (with a global isLeader=true default so existing tick tests are unaffected).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
SC-3 — de-SPOF schedulers
Problem
Background schedulers start based on leadership at boot (
src/instrumentation.node.ts→startSingletonServices) but then run forever viasetInterval/node-cronand never re-check leadership. When a leader is demoted (LeaderElectionflips_isLeader=falseafter 3 failed Redis renewals), the old instance's timers keep firing for up to one TTL (~15s) while the newly-elected leader is already running — producing duplicate alerts, backups, anomaly runs, cost-analysis passes, delivery retries and telemetry heartbeats.Change
Add an early leadership guard at the top of each scheduler's periodic work so a demoted instance becomes a no-op:
Guard is guard-only — timers/cron tasks are left registered, so each scheduler resumes cleanly if the instance re-acquires leadership (no teardown/re-init churn).
Guards added to:
fleet-alert-service.ts— top oftick()anomaly-detection-job.ts— top oftick()lake/lake-alerts.ts— top of module-leveltick()backup-scheduler.ts— top of the per-orgcron.schedulecallbackcost-optimizer-scheduler.ts— top of the sharedscheduleJobcallback (covers both the daily and continuous cron jobs, which share that one callback)telemetry-scheduler.ts— top of the daily heartbeat cron callbackretry-service.ts— top ofprocessRetries()(also coversprocessOutboundRetries(), which it calls)cert-expiry-checker.tsintentionally gets no guard: it has no own timer/cron and its only production caller ischeckCertificateExpiry()inside the now-guarded fleet-alert tick (confirmed by searching all callers).Tests
Extended the two critical scheduler specs (
fleet-alert-service.test.ts,anomaly-detection-job.test.ts) with aleadership guardblock that mocks@/server/services/leader-election+ Prisma and asserts:isLeader() === false→ tick is a no-op: no org scan (organization.findMany), no rule/pipeline evaluation, no channel delivery.isLeader() === true→ tick proceeds and evaluates per-org as before.Both specs default the mock to leader so the existing tick-driven tests keep doing work.
Verification
pnpm exec vitest runonfleet-alert-service.test.ts+anomaly-detection-job.test.ts→ 30 passed (incl. 4 new guard tests).REDIS_URL, so the realisLeader()returnstrueand the guards pass through).tsc --noEmit: no type errors in any changed file. The only diagnostic is the known local@clickhouse/clientgap inlake/clickhouse.ts(package not installed locally) — ignored per task guidance.Notes
@/…(src-internal); no@cloud/*.@clickhouse/client); the guard sits before any ClickHouse access and is covered structurally by the lake-alerts spec. Failover/demotion behaviour itself isn't locally reproducible without a multi-instance Redis setup — unit tests + types are the bar here.