Fix submission cleanup: recover all non-terminal states, not just Running #2414

Open
hanane-ca wants to merge 1 commit into codalab:develop from hanane-ca:fix/cleanup-all-non-terminal-states
hanane-ca commented Jun 16, 2026

Reviewers

@codalab/maintainers

Description

Fixes a bug where submissions stuck in non-terminal states (Submitted, Preparing, Scoring) would hang forever instead of being recovered by the cleanup task.

Problem: The submission_status_cleanup() task only recovered submissions stuck in the Running state. Submissions that never reached Running (stuck in Submitted or Preparing) or that stalled after it (stuck in Scoring) were never cleaned up.

Root cause:

  • Cleanup only checked for Running status
  • No fallback for submissions without started_when (those that never reached Running)

Solution:

  1. Extend cleanup to cover all non-terminal states: Submitted, Preparing, Running, Scoring
  2. Use created_when as fallback when started_when is null
  3. All non-terminal submissions are now recovered once 24h + execution_time_limit has elapsed since that reference time
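
The deadline rule described above can be sketched in plain Python. This is a minimal illustration, not the code in tasks.py: the function name `is_stuck`, the constant names, and the assumption that execution_time_limit is expressed in seconds are all hypothetical, and the status strings are simply the labels used in this description.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Status labels as written in this PR description; the real constants may differ.
NON_TERMINAL_STATUSES = ("Submitted", "Preparing", "Running", "Scoring")
GRACE_PERIOD = timedelta(hours=24)


def is_stuck(
    status: str,
    created_when: datetime,
    started_when: Optional[datetime],
    execution_time_limit_s: int,  # assumed to be in seconds
    now: datetime,
) -> bool:
    """Return True if a submission is past its cleanup deadline."""
    if status not in NON_TERMINAL_STATUSES:
        return False  # terminal submissions are never touched
    # Fall back to created_when for submissions that never started running.
    reference_time = started_when if started_when else created_when
    deadline = reference_time + GRACE_PERIOD + timedelta(seconds=execution_time_limit_s)
    return now > deadline
```

With this rule, a submission created 30 hours ago that never started and has a 10-minute limit is past its deadline, while one that started running 2 hours ago is not.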

Code changes:

  • src/apps/competitions/tasks.py:

    • Extended non_terminal_statuses list to include all non-terminal states (Submitted, Preparing, Running, Scoring)
    • Added created_when fallback: reference_time = started_when if started_when else created_when
    • Cleaned up inline comments per Codabench guidelines
  • src/apps/competitions/tests/test_submissions.py:

    • Added 4 unit tests for new states (Submitted, Preparing, Scoring)
    • Added negative test for recent submissions
    • All tests pass
  • tests/k6/:

    • Integration test suite with full orchestration
    • All test documentation included
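
As a rough picture of what the unit tests check, the sketch below exercises one old submission per non-terminal state plus one recent submission. It is self-contained and does not use the project's models: `FakeSubmission` and `cleanup` are hypothetical stand-ins for the Submission model and submission_status_cleanup(), and the seconds unit for execution_time_limit is an assumption.

```python
import unittest
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

NON_TERMINAL = ("Submitted", "Preparing", "Running", "Scoring")


@dataclass
class FakeSubmission:
    """Hypothetical in-memory stand-in for the Submission model."""
    status: str
    created_when: datetime
    started_when: Optional[datetime] = None
    execution_time_limit: int = 600  # seconds; the unit is an assumption


def cleanup(submissions, now):
    """Stand-in for submission_status_cleanup(): fail submissions past the deadline."""
    for sub in submissions:
        if sub.status not in NON_TERMINAL:
            continue
        reference_time = sub.started_when or sub.created_when
        deadline = reference_time + timedelta(hours=24, seconds=sub.execution_time_limit)
        if now > deadline:
            sub.status = "Failed"


class CleanupTests(unittest.TestCase):
    NOW = datetime(2026, 6, 19, 12, 0, tzinfo=timezone.utc)

    def test_old_non_terminal_submissions_are_failed(self):
        # One case per non-terminal state, all created well past the deadline.
        for status in NON_TERMINAL:
            with self.subTest(status=status):
                sub = FakeSubmission(status, created_when=self.NOW - timedelta(hours=48))
                cleanup([sub], self.NOW)
                self.assertEqual(sub.status, "Failed")

    def test_recent_submission_is_left_alone(self):
        # Negative case: a recent submission must keep its status.
        sub = FakeSubmission("Preparing", created_when=self.NOW - timedelta(hours=1))
        cleanup([sub], self.NOW)
        self.assertEqual(sub.status, "Preparing")
```

Saved as a module, the sketch runs with `python -m unittest`; the actual tests in test_submissions.py run against the Django models instead.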

Issues this PR resolves

Fixes #2413

Background

This bug was discovered during the EEG Foundation Challenge incident analysis where submissions were observed stuck in non-Running states for extended periods with no recovery mechanism.

Checklist for hand testing

  • Create a competition with at least one phase
  • Stop compute_worker, make several submissions, and verify they stay stuck in a non-terminal state
  • Age the submissions beyond the cleanup deadline (24h + execution_time_limit)
  • Run cleanup task: docker compose exec django python manage.py shell -c "from competitions.tasks import submission_status_cleanup; submission_status_cleanup()"
  • Verify that all stuck submissions are marked as Failed

Relevant files for testing

Integration test suite in tests/k6/:

  • run_cleanup_test.sh — End-to-end orchestrator
  • test_stuck_submissions.js — K6 recovery verification
  • test_cleanup_conservation.js — K6 conservation harness
  • README_cleanup_tests.md — Test documentation

Run tests:

cd tests/k6
./run_cleanup_test.sh

Checklist

  • Code review by me
  • Hand tested by me
  • I'm proud of my work
  • Code review by reviewer
  • Hand tested by reviewer
  • CircleCI tests are passing
  • Ready to merge

…ning

Problem:
- submission_status_cleanup() only recovered Running submissions
- Submissions stuck in Submitted, Preparing, or Scoring would hang forever
- No fallback for submissions that never reached Running (started_when null)

Solution:
- Extend cleanup to cover all non-terminal states: Submitted, Preparing, Running, Scoring
- Use created_when as fallback when started_when is null
- All non-terminal submissions now recovered after 24h + execution_time_limit

Changes:
- src/apps/competitions/tasks.py:
  * Extended non_terminal_statuses list to include all states
  * Added created_when fallback logic for reference_time
  * Cleaned up comments per Codabench guidelines

- src/apps/competitions/tests/test_submissions.py:
  * Added 4 unit tests covering Submitted, Preparing, Scoring states
  * Added negative test for recent non-terminal submissions
  * Cleaned up docstrings (removed M3 references)

- tests/k6/:
  * run_cleanup_test.sh: End-to-end orchestrator
  * test_stuck_submissions.js: K6 recovery verification
  * test_cleanup_conservation.js: K6 conservation harness
  * README_cleanup_tests.md: Test documentation
  * All files cleaned up (removed M3 references per guidelines)

Tests validate:
- All non-terminal states recovered after deadline
- Recent submissions NOT cleaned up
- 100% conservation rate

Fixes codalab#2413

hanane-ca force-pushed the fix/cleanup-all-non-terminal-states branch from fdc11a7 to 1e6bba4 on June 19, 2026 08:26

hanane-ca changed the title from "fix: submission_status_cleanup recovers all non-terminal states" to "Fix submission cleanup: recover all non-terminal states, not just Running" on Jun 19, 2026

Didayolo requested a review from ObadaS on June 19, 2026 11:34
Development

Successfully merging this pull request may close these issues.

Submission cleanup only recovers Running submissions, not Submitted/Preparing/Scoring
