fix: claim job-end watcher actions before resubmit #84

Open
Ramlaoui wants to merge 1 commit into main from fix/watcher-job-end-action-claim
@Ramlaoui

Owner

Summary

  • Add a per-watcher/per-action DB claim before executing terminal job-end actions, so concurrent watcher loops cannot both call the resubmit path for the same watcher action.
  • Preserve the existing success/completed markers, and release the claim when an action fails with a retryable error so that a later watcher loop can attempt it again.
  • Add a regression test that runs two concurrent timeout handlers and verifies only one resubmit action executes.
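The claim described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes SQLite as the store, and the table `action_claims`, the functions `try_claim` and `handle_job_end`, the `RetryableError` class, and the `resubmit` callback are all hypothetical names. The idea is that a primary key on (watcher, action) lets exactly one concurrent handler insert the claim row, and only that handler goes on to resubmit.

```python
import os
import sqlite3
import tempfile
import threading


class RetryableError(Exception):
    """Hypothetical marker for failures that should release the claim."""


def connect(db_path):
    # Autocommit mode: every statement is its own atomic transaction.
    return sqlite3.connect(db_path, timeout=30, isolation_level=None)


def init_db(db_path):
    conn = connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS action_claims ("
        " watcher_id INTEGER NOT NULL,"
        " action_key TEXT NOT NULL,"
        " status TEXT NOT NULL,"
        " PRIMARY KEY (watcher_id, action_key))"
    )
    conn.close()


def try_claim(conn, watcher_id, action_key):
    # The primary key makes the insert succeed for one caller only.
    cur = conn.execute(
        "INSERT OR IGNORE INTO action_claims VALUES (?, ?, 'claimed')",
        (watcher_id, action_key),
    )
    return cur.rowcount == 1


def handle_job_end(db_path, watcher_id, action_key, resubmit):
    """Run resubmit() only if this caller wins the claim. Returns True on success."""
    conn = connect(db_path)
    key = (watcher_id, action_key)
    try:
        if not try_claim(conn, watcher_id, action_key):
            return False  # another loop owns or already completed this action
        try:
            resubmit()
        except RetryableError:
            # Release the claim so a later loop can try again.
            conn.execute(
                "DELETE FROM action_claims"
                " WHERE watcher_id = ? AND action_key = ? AND status = 'claimed'",
                key,
            )
            return False
        conn.execute(
            "UPDATE action_claims SET status = 'completed'"
            " WHERE watcher_id = ? AND action_key = ?",
            key,
        )
        return True
    finally:
        conn.close()


def run_two_handlers():
    """Start two handlers at the same moment; return how many resubmits ran."""
    calls = []
    barrier = threading.Barrier(2)
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "claims.db")
        init_db(db_path)

        def worker():
            barrier.wait()  # line both threads up before they race for the claim
            handle_job_end(db_path, 1, "timeout:resubmit", lambda: calls.append(1))

        threads = [threading.Thread(target=worker) for _ in range(2)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return len(calls)
```

`run_two_handlers()` mirrors the shape of the regression test in the last bullet: two handlers race on the same watcher action, and the count of executed resubmits is what the test would assert on.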

Deferred design

This PR intentionally fixes the observed live race with a narrow DB-level claim. A more complete crash-safe design would pair claims with leases/heartbeats plus a recoverable resubmit idempotency key, then reconcile whether Slurm already accepted a child job before retrying a stale claim. That broader design is deferred because it needs a clearer metadata and Slurm lookup contract; the immediate failure mode was duplicate submissions from concurrently running watcher loops.
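For orientation only, the lease part of that deferred design might look something like the sketch below. Nothing here is implemented by this PR, and every name (`leased_claims`, `acquire`, `heartbeat`, the `owner` and `now` parameters) is an assumption made for illustration. A claim carries an expiry time and a stable idempotency key; a stale claim can be taken over once its lease lapses, and the key survives the takeover so a reconciler could check whether a child job was already submitted under it. How that key would be attached to or looked up from a Slurm job is the open contract mentioned above, so it is not sketched.

```python
import sqlite3
import uuid


def init_lease_db(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS leased_claims ("
        " watcher_id INTEGER NOT NULL,"
        " action_key TEXT NOT NULL,"
        " owner TEXT NOT NULL,"
        " lease_expires_at REAL NOT NULL,"
        " idempotency_key TEXT NOT NULL,"
        " PRIMARY KEY (watcher_id, action_key))"
    )


def acquire(conn, watcher_id, action_key, owner, now, lease_seconds=60.0):
    """Return the idempotency key if `owner` now holds the claim, else None."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO leased_claims VALUES (?, ?, ?, ?, ?)",
        (watcher_id, action_key, owner, now + lease_seconds, uuid.uuid4().hex),
    )
    if cur.rowcount == 0:
        # A claim exists: take it over only if its lease has expired.
        # The idempotency key is deliberately left unchanged.
        cur = conn.execute(
            "UPDATE leased_claims SET owner = ?, lease_expires_at = ?"
            " WHERE watcher_id = ? AND action_key = ? AND lease_expires_at <= ?",
            (owner, now + lease_seconds, watcher_id, action_key, now),
        )
        if cur.rowcount == 0:
            return None
    row = conn.execute(
        "SELECT idempotency_key FROM leased_claims"
        " WHERE watcher_id = ? AND action_key = ?",
        (watcher_id, action_key),
    ).fetchone()
    return row[0]


def heartbeat(conn, watcher_id, action_key, owner, now, lease_seconds=60.0):
    """Extend the lease; returns False if `owner` no longer holds the claim."""
    cur = conn.execute(
        "UPDATE leased_claims SET lease_expires_at = ?"
        " WHERE watcher_id = ? AND action_key = ? AND owner = ?",
        (now + lease_seconds, watcher_id, action_key, owner),
    )
    return cur.rowcount == 1
```

Passing `now` explicitly keeps the sketch deterministic; a real implementation would also have to decide which clock to trust across hosts.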

Tests

  • uv run --no-sync pytest -q tests/unit/test_watcher_engine.py tests/unit/test_watcher_service.py tests/unit/test_run_watchers.py tests/unit/test_watcher_daemon.py
