fix(scoring): re-enqueue scoring after commit to avoid stuck SCORING …#2420
Open
AybH26 wants to merge 1 commit into
…rows

When the compute worker PATCHes a submission to status=SCORING, the API serializer used to call run_submission() synchronously inside the same DB transaction. If the broker (RabbitMQ) was unreachable at that exact moment, the status row would commit but the scoring task would never be published, leaving the submission stuck in SCORING forever (no recovery: the 24h cleanup only rescues RUNNING rows).

Move the enqueue into transaction.on_commit so the task is only published after the SCORING status is durably committed, and explicitly mark the submission as Failed (with a clear status_details) if the publish still fails, so the row never stays in a non-terminal limbo state.

Wrap update() in @transaction.atomic to make the commit boundary explicit.
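A minimal sketch of the pattern the commit message describes, not the actual serializer code. `ToyTransaction` is a hypothetical stand-in for Django's `transaction.atomic` / `transaction.on_commit` so the example runs without a Django project; `Submission`, `BrokerDown`, `update_to_scoring`, and the `publish` callable are likewise illustrative names, not taken from the codebase.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class BrokerDown(Exception):
    """Hypothetical error raised when the task cannot be published."""


@dataclass
class Submission:
    status: str = "RUNNING"
    status_details: str = ""


@dataclass
class ToyTransaction:
    """Hypothetical stand-in: collects callbacks and runs them on commit."""
    _callbacks: List[Callable[[], None]] = field(default_factory=list)

    def on_commit(self, callback: Callable[[], None]) -> None:
        self._callbacks.append(callback)

    def commit(self) -> None:
        callbacks, self._callbacks = self._callbacks, []
        for callback in callbacks:
            callback()

    def rollback(self) -> None:
        # Callbacks registered in a rolled-back transaction never run.
        self._callbacks.clear()


def update_to_scoring(submission: Submission, txn: ToyTransaction,
                      publish: Callable[[Submission], None]) -> None:
    submission.status = "SCORING"

    def _enqueue() -> None:
        try:
            publish(submission)
        except BrokerDown as exc:
            # Publish failed after the commit: move to a terminal state
            # instead of leaving the row in SCORING.
            submission.status = "Failed"
            submission.status_details = f"Could not enqueue scoring task: {exc}"

    # Defer the publish until the status change is committed.
    txn.on_commit(_enqueue)


def unreachable_broker(_submission: Submission) -> None:
    raise BrokerDown("broker unreachable")


sub = Submission()
txn = ToyTransaction()
update_to_scoring(sub, txn, unreachable_broker)
status_before_commit = sub.status
txn.commit()
status_after_commit = sub.status
```

In this toy run the submission is `SCORING` before the commit and `Failed` afterwards, because the publish attempt only happens in the on-commit callback and its failure is handled there.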
@ mention of reviewers
@ @
A brief description of the purpose of the changes contained in this PR
Submissions could get stuck in `SCORING` indefinitely because the Celery task that runs the next phase was enqueued inside the outer Django transaction, before the row was committed. The `compute_worker` could dequeue and execute the task before the new status (and updated FKs like `queue` / `celery_task_id`) were visible in the database, then bail out or operate on stale data.

This PR moves the `send_task` call into a `transaction.on_commit()` callback so the message hits RabbitMQ only after the surrounding transaction is committed. Behaviour is otherwise unchanged.

Issues this PR resolves
Closes #2419
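The ordering problem behind this issue can be illustrated with a toy model. This is a sketch under stated assumptions, not the project's code: `ToyDatabase` is a hypothetical in-memory model in which writes stay invisible to other readers until `commit()`, the worker is assumed to run the instant the task is delivered, and the starting status `RUNNING` is only an example value.

```python
from typing import Callable, Dict, List


class ToyDatabase:
    """Hypothetical model: pending writes become visible only on commit()."""

    def __init__(self) -> None:
        self.committed: Dict[str, str] = {"status": "RUNNING"}
        self._pending: Dict[str, str] = {}
        self._on_commit: List[Callable[[], None]] = []

    def write(self, key: str, value: str) -> None:
        self._pending[key] = value

    def on_commit(self, callback: Callable[[], None]) -> None:
        self._on_commit.append(callback)

    def commit(self) -> None:
        self.committed.update(self._pending)
        self._pending.clear()
        callbacks, self._on_commit = self._on_commit, []
        for callback in callbacks:
            callback()


def worker_sees(db: ToyDatabase) -> str:
    # A separate worker connection can only read committed data.
    return db.committed["status"]


def enqueue_eagerly(db: ToyDatabase, seen: List[str]) -> None:
    db.write("status", "SCORING")
    seen.append(worker_sees(db))  # task delivered before the commit
    db.commit()


def enqueue_after_commit(db: ToyDatabase, seen: List[str]) -> None:
    db.write("status", "SCORING")
    db.on_commit(lambda: seen.append(worker_sees(db)))  # delivered after commit
    db.commit()


eager_view: List[str] = []
enqueue_eagerly(ToyDatabase(), eager_view)

deferred_view: List[str] = []
enqueue_after_commit(ToyDatabase(), deferred_view)
```

With eager enqueueing the simulated worker reads the old status; with the deferred variant it reads `SCORING`, which is the ordering the PR aims for.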
Symptoms reported:
- `SCORING` with no worker activity.
- `Submission.DoesNotExist` / stale-read errors in `compute_worker` logs right after a status transition.
- `celery_task_id` occasionally `NULL` on rows that did get picked up.

Root cause: `app.send_task(...)` was called from inside the outer `@transaction.atomic` scope of `_run_submission`, so the broker received the task before PostgreSQL committed the writes. Under load (or with a fast worker / slow commit), the worker won the race.

Fix: wrap the enqueue + `celery_task_id` write in a `_enqueue_after_commit()` closure and register it via `transaction.on_commit(...)`. The closure runs only when the outer transaction commits successfully, and is silently dropped on rollback (no orphaned messages on the broker).

A checklist for hand testing
- … `Finished` (no stuck `SCORING`).
- … `Finished`.
- … `compute-worker` (RabbitMQ management UI shows no dangling delivery).
- `submission.celery_task_id` after enqueue → not `NULL`.
- … `compute_worker` mid-submission lifecycle → submission still completes (does not regress M6 idempotency).
- … `SUBMITTED` submission before its commit completes → `celery_app.control.revoke(...)` still works because the `celery_task_id` is set inside the same `on_commit` callback.

Any relevant files for testing
- … (`_run_submission` → `_enqueue_after_commit` closure + `transaction.on_commit(...)`).
- … `from django.db import transaction` (already present).

Checklist