fix(compute_worker): make submission lifecycle idempotent against bro…#2434

Open
AybH26 wants to merge 1 commit into codalab:develop from AybH26:fix/compute-worker-redelivery-idempotency

@AybH26 AybH26 commented Jun 18, 2026

Contributor

Closes #2433

Summary

compute_worker uses task_acks_late = True, so any worker that dies before acking causes RabbitMQ to redeliver the run payload. The redelivered worker used to re-execute the full lifecycle, with three visible consequences:

  1. scoring_worker_hostname was overwritten by the second worker.
  2. The submission status was flipped backwards through Running → Scoring → Finished again.
  3. upload_submission_scores inserted a second SubmissionScore row per leaderboard column, which then crashed Submission.calculate_scores() with MultipleObjectsReturned.

This PR makes the lifecycle idempotent on both sides of the wire (defense-in-depth, Stripe-style: worker short-circuits, server refuses to double-mutate).

What changed

Worker

  • compute_worker/compute_worker.py: run_wrapper now probes GET /api/submissions/<id>/worker_state/ on entry. If the submission is already terminal, it logs the redelivery and returns {"skipped": True}, letting Celery ack the message. The new Run._fetch_submission_state helper is best-effort: any API error falls through to the legacy path, so a transient blip cannot block a legitimate run.
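
The short-circuit described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `fetch_state` and `execute` callables are hypothetical stand-ins for the HTTP probe and the legacy run path, and only the `is_terminal` key and the `{"skipped": True}` return value are taken from the description.

```python
from typing import Callable, Optional


def fetch_submission_state(fetch_state: Callable[[], dict]) -> Optional[dict]:
    """Best-effort probe: return the state payload, or None if the call fails."""
    try:
        return fetch_state()
    except Exception:
        # Any API error is swallowed so the caller falls through to the legacy path.
        return None


def run_wrapper(run_args: dict,
                fetch_state: Callable[[], dict],
                execute: Callable[[dict], dict]) -> dict:
    """Skip an already-terminal submission, otherwise run it as before."""
    state = fetch_submission_state(fetch_state)
    if state is not None and state.get("is_terminal"):
        # Redelivered message for a finished run: return normally so it is acked.
        return {"skipped": True}
    return execute(run_args)
```

Because a failed probe yields `None`, an unreachable API behaves exactly like a non-terminal submission and the run proceeds.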

API

  • src/apps/api/views/submissions.py
    • New @action worker_state on SubmissionViewSet: authenticated by the submission secret query param, returns {status, is_terminal, has_scoring_result, has_prediction_result, worker_attempt_count}, atomically bumps the counter via F('worker_attempt_count') + 1.
    • check_object_permissions no longer writes ingestion_worker_hostname / scoring_worker_hostname once the row is terminal.
    • upload_submission_scores upserts one SubmissionScore per (submission, column) instead of blindly creating duplicates.
  • src/apps/api/serializers/submissions.py: SubmissionCreationSerializer.update refuses status mutations on terminal rows (returns 200 OK with the unchanged instance so the redelivered worker acks cleanly).
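
The two server-side guards (one score row per (submission, column), and no status changes on terminal rows) can be illustrated with a small in-memory sketch. Everything here is hypothetical and framework-free: `ScoreStore`, `apply_status_update`, and the contents of `TERMINAL_STATUSES` are assumptions for illustration, and the actual change operates on Django models rather than a dict.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Assumption: the description does not list which statuses count as terminal.
TERMINAL_STATUSES = {"Finished", "Failed", "Cancelled"}


@dataclass
class Submission:
    id: int
    status: str = "Submitted"


@dataclass
class ScoreStore:
    """In-memory stand-in for the SubmissionScore table."""
    rows: Dict[Tuple[int, str], float] = field(default_factory=dict)

    def upsert(self, submission_id: int, column: str, value: float) -> bool:
        """Keep one row per (submission, column); return True if newly created."""
        key = (submission_id, column)
        created = key not in self.rows
        self.rows[key] = value  # overwrite instead of appending a duplicate
        return created


def apply_status_update(submission: Submission, new_status: str) -> Submission:
    """Ignore status changes on terminal rows; return the instance either way."""
    if submission.status in TERMINAL_STATUSES:
        return submission
    submission.status = new_status
    return submission
```

Returning the unchanged instance rather than raising mirrors the 200 OK behavior described above: the redelivered worker sees a successful response and has no reason to retry.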

Model

Development

Successfully merging this pull request may close these issues.

compute_worker re-executes submissions on broker redelivery, causing duplicate scores, hostname overwrite, and status flipping