fix(schedulers): re-check leadership in scheduler ticks (SC-3) by TerrifiedBug · Pull Request #497 · TerrifiedBug/vectorflow

TerrifiedBug · 2026-06-08T12:13:45Z

SC-3 — de-SPOF schedulers

Problem

Background schedulers start based on leadership at boot (src/instrumentation.node.ts → startSingletonServices) but then run forever via setInterval/node-cron and never re-check leadership. When a leader is demoted (LeaderElection flips _isLeader=false after 3 failed Redis renewals), the old instance's timers keep firing for up to one TTL (~15s) while the newly-elected leader is already running — producing duplicate alerts, backups, anomaly runs, cost-analysis passes, delivery retries and telemetry heartbeats.

Change

Add an early leadership guard at the top of each scheduler's periodic work so a demoted instance becomes a no-op:

if (!isLeader()) {            // from @/server/services/leader-election
  debugLog("<tag>", "Skipping … — instance is no longer leader");
  return;
}

Guard is guard-only — timers/cron tasks are left registered, so each scheduler resumes cleanly if the instance re-acquires leadership (no teardown/re-init churn).

Guards added to:

fleet-alert-service.ts — top of tick()
anomaly-detection-job.ts — top of tick()
lake/lake-alerts.ts — top of module-level tick()
backup-scheduler.ts — top of the per-org cron.schedule callback
cost-optimizer-scheduler.ts — top of the shared scheduleJob callback (covers both the daily and continuous cron jobs, which share that one callback)
telemetry-scheduler.ts — top of the daily heartbeat cron callback
retry-service.ts — top of processRetries() (also covers processOutboundRetries(), which it calls)

cert-expiry-checker.ts intentionally gets no guard: it has no own timer/cron and its only production caller is checkCertificateExpiry() inside the now-guarded fleet-alert tick (confirmed by searching all callers).

Tests

Extended the two critical scheduler specs (fleet-alert-service.test.ts, anomaly-detection-job.test.ts) with a leadership guard block that mocks @/server/services/leader-election + Prisma and asserts:

isLeader() === false → tick is a no-op: no org scan (organization.findMany), no rule/pipeline evaluation, no channel delivery.
isLeader() === true → tick proceeds and evaluates per-org as before.

Both specs default the mock to leader so the existing tick-driven tests keep doing work.

Verification

pnpm exec vitest run on fleet-alert-service.test.ts + anomaly-detection-job.test.ts → 30 passed (incl. 4 new guard tests).
Re-ran the 5 other touched schedulers' existing specs (lake-alerts, backup-scheduler-per-org, cost-optimizer-scheduler[-per-org], telemetry-scheduler, retry-service) → 38 passed, no regressions (test env has no REDIS_URL, so the real isLeader() returns true and the guards pass through).
Filtered tsc --noEmit: no type errors in any changed file. The only diagnostic is the known local @clickhouse/client gap in lake/clickhouse.ts (package not installed locally) — ignored per task guidance.

Notes

No Prisma migration required.
License boundary respected: all new imports are @/… (src-internal); no @cloud/*.
The lake/clickhouse path can't be exercised locally (no @clickhouse/client); the guard sits before any ClickHouse access and is covered structurally by the lake-alerts spec. Failover/demotion behaviour itself isn't locally reproducible without a multi-instance Redis setup — unit tests + types are the bar here.

Background schedulers start based on leadership at boot (instrumentation.node.ts startSingletonServices) but then run forever via setInterval/cron and never re-check leadership. When a leader is demoted (3 failed Redis renewals) its timers keep firing for up to one TTL (~15s) while the new leader also runs — duplicating alerts, backups, anomaly runs, cost analysis, retries and telemetry heartbeats. Add an early `if (!isLeader()) return;` guard (from @/server/services/leader-election) at the top of each scheduler's periodic work so a demoted instance becomes a no-op, with a logger.debug on skip. Guard only — timers/cron tasks are left registered so work resumes cleanly if leadership is re-acquired. Guards added to: fleet-alert-service, anomaly-detection-job, lake-alerts, backup-scheduler (per-org cron callback), cost-optimizer-scheduler (shared daily + continuous cron callback), telemetry-scheduler and retry-service (processRetries, which also covers processOutboundRetries). cert-expiry-checker needs no guard — it has no own timer and only runs inside the now-guarded fleet-alert tick. Tests: extend fleet-alert-service and anomaly-detection-job specs to assert the tick does no work (no org scan / rule eval / pipeline eval / channel delivery) when isLeader() is false, and proceeds when true.

…ewer SC-3) Reviewer flagged two leader-gated singletons started in the same boot block that were missed: git-sync-retry (P2 — duplicate git commits/pushes on the ~15s demotion window via a non-atomic job claim) and metrics-rollup (lower-risk, hourly+idempotent). Add the same isLeader() tick guard to both + a git-sync-retry guard test (with a global isLeader=true default so existing tick tests are unaffected).

github-actions Bot added the fix label Jun 8, 2026

github-actions Bot added fix and removed fix labels Jun 8, 2026

TerrifiedBug merged commit 256aa69 into main Jun 8, 2026
18 checks passed

TerrifiedBug deleted the feat/sc-3-leader-guards branch June 8, 2026 12:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(schedulers): re-check leadership in scheduler ticks (SC-3)#497

fix(schedulers): re-check leadership in scheduler ticks (SC-3)#497
TerrifiedBug merged 2 commits into
mainfrom
feat/sc-3-leader-guards

TerrifiedBug commented Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

TerrifiedBug commented Jun 8, 2026

SC-3 — de-SPOF schedulers

Problem

Change

Tests

Verification

Notes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant