Fix misleading preemption logs in checkpointing #4124

Open
lukebaumann wants to merge 1 commit into AI-Hypercomputer:main from lukebaumann:fix-preemption-logs
@lukebaumann

Collaborator

Resolve misleading "Job is preempted" error logs in MaxText checkpointing.

Changes

  1. Raise JaxRuntimeError directly: Instead of catching JaxRuntimeError and raising StopTraining("Job is preempted.") when elasticity is disabled, we now let the exception bubble up. The process then crashes normally, which prevents silent failures and ensures the logs show the actual cause of the crash (e.g., OOM).
  2. Reword SIGTERM preemption message: Updated the StopTraining message when reached_preemption is true to "Job received termination signal (SIGTERM)." since SIGTERM is sent by GKE/Kueue for various non-preemption events (e.g., updates, scaling down).
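
For reviewers, the two changes above can be sketched roughly as follows. This is a minimal illustration of the non-elastic path only, not the actual code in `checkpointing.py`: `FakeJaxRuntimeError`, `save_checkpoint_step`, `failing_save`, and `describe` are hypothetical names, the local `StopTraining` class is a stand-in, and the order of the save call and the termination check is an assumption.

```python
class StopTraining(Exception):
    """Stand-in for the exception a training loop treats as a clean stop."""


class FakeJaxRuntimeError(RuntimeError):
    """Hypothetical stand-in for JaxRuntimeError, so the sketch runs without JAX."""


def save_checkpoint_step(save_fn, reached_preemption):
    """Non-elastic path only: run one save, then check for a termination signal."""
    # No try/except here: a runtime failure (e.g., OOM) propagates unchanged,
    # so the traceback in the logs names the real cause.
    save_fn()
    if reached_preemption():
        # SIGTERM can come from updates or scale-downs, so the message
        # does not claim preemption.
        raise StopTraining("Job received termination signal (SIGTERM).")


def failing_save():
    """Simulates a save that hits a runtime failure."""
    raise FakeJaxRuntimeError("RESOURCE_EXHAUSTED: out of memory")


def describe(save_fn, reached_preemption):
    """Return the exception type name and message the caller would see."""
    try:
        save_checkpoint_step(save_fn, reached_preemption)
    except Exception as exc:
        return type(exc).__name__, str(exc)
    return None
```

Calling `describe(failing_save, lambda: False)` surfaces the runtime error itself rather than a `StopTraining`, while `describe(lambda: None, lambda: True)` surfaces the reworded SIGTERM message.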

Fixes: b/516962538

- Raise JaxRuntimeError directly instead of masking it as StopTraining('Job is preempted.') when elasticity is disabled. This prevents hiding the true cause of crashes in logs and allows proper crash handling.

- Reword StopTraining message when reached_preemption is true to 'Job received termination signal (SIGTERM).' as SIGTERM is sent for various termination events (scaling, updates), not just preemption.

Fixes: b/516962538
@codecov

codecov Bot commented Jun 9, 2026


Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/maxtext/common/checkpointing.py | 0.00% | 3 Missing ⚠️ |


2 participants