Optimize distillation perf for qwen3-30b (20%→26% MFU) and gpt-oss-20b (17%→19%) #4118
Open
gagika wants to merge 1 commit into
Codecov Report: ✅ All modified and coverable lines are covered by tests.
🤖 Hi @gagika, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
🤖 I'm sorry @gagika, but I was unable to process your request. Please see the logs for more details.
c03fbae to 44f5e6f
vlad-karp approved these changes Jun 9, 2026
@@ -1,4 +1,4 @@
-# Qwen3-30b-a3b-base distillation, pdbs=8 + activation offload. ~22% MFU on v7x.
+# Qwen3-30b-a3b-base distillation, pdbs=8 + activation offload. ~24% MFU on v7x.
Collaborator
What is the difference versus the previous config file?
Description
Performance optimization of the qwen3-30b-a3b and gpt-oss-20b distillation configs on TPU v7x (tpu7x-4x4x4).

MFU on v7x:
- qwen3-30b: 20% → 26%
- gpt-oss-20b: 17% → 19%
- variant: ~22% → ~24%
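To put the MFU figures in throughput terms: MFU is achieved model FLOP/s divided by peak hardware FLOP/s, so on fixed hardware tokens per second scale linearly with MFU. The sketch below is illustrative only. It assumes the model FLOPs per token are identical in the baseline and optimized runs, and the function name is hypothetical rather than anything in this PR.

```python
def throughput_speedup(mfu_before: float, mfu_after: float) -> float:
    """Tokens/s ratio implied by an MFU change.

    Assumes the same model FLOPs per token and the same hardware peak
    in both runs (an assumption, not a measurement from the PR).
    """
    if mfu_before <= 0 or mfu_after <= 0:
        raise ValueError("MFU values must be positive")
    return mfu_after / mfu_before


# MFU figures quoted in the PR title and description (percent).
runs = {
    "qwen3-30b": (20.0, 26.0),
    "gpt-oss-20b": (17.0, 19.0),
    "variant": (22.0, 24.0),
}

for name, (before, after) in runs.items():
    speedup = throughput_speedup(before, after)
    # Time per token is the inverse of throughput.
    time_saved = 1.0 - 1.0 / speedup
    print(f"{name}: {speedup:.2f}x throughput, {time_saved:.1%} less time per token")
```

Under that assumption, the qwen3-30b change corresponds to roughly 1.3x the baseline throughput.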
All knobs are profile-derived (xplane) and documented inline:
- context=device + custom remat: keep attention outputs on device so the backward pass skips the splash-forward re-runs (per-layer forward call count halves in the profile). The dominant win; the configs note the HBM frontiers (qwen fits to pdbs=6 with the teacher resident, gpt-oss is pdbs=1 at seq 32k).
- moe-mlp 768), ~+10% together with the layout change; gpt-oss raises the batch-seq m-tile 512 → 1024 (its k/n dims are already full), ~+3%.
- sa_q_layout: SEQ_MINOR on qwen; kv-compute sub-blocks 2048 → 1024 on gpt-oss (~+2%; uniformly smaller blocks regress).
- dropping host offload; gpt-oss mesh moves to dp2 × fsdp64.
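The first knob trades device memory for recomputation. The toy model below, in plain Python rather than the training framework, shows why the per-layer forward call count halves: with full rematerialization, attention runs once in the forward pass and again when the backward pass recomputes it, while saving the output removes the second call. All names and the layer count are hypothetical; the real configs set a remat policy and do not contain code like this.

```python
from dataclasses import dataclass


@dataclass
class LayerStats:
    attention_forward_calls: int = 0
    saved_outputs: int = 0  # attention outputs held in device memory


def train_step(num_layers: int, save_attention_output: bool) -> LayerStats:
    """Count attention forward calls in one forward + backward pass."""
    stats = LayerStats()

    # Forward pass: every layer runs attention once.
    for _ in range(num_layers):
        stats.attention_forward_calls += 1
        if save_attention_output:
            stats.saved_outputs += 1

    # Backward pass: layers whose output was not saved must recompute it.
    for _ in range(num_layers):
        if not save_attention_output:
            stats.attention_forward_calls += 1

    return stats


# Layer count chosen arbitrarily for illustration.
recompute = train_step(num_layers=48, save_attention_output=False)
keep = train_step(num_layers=48, save_attention_output=True)
print(recompute.attention_forward_calls, keep.attention_forward_calls)
```

The saved outputs are what consume HBM, which is consistent with the description tying this knob to a per-device batch size limit.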
Tests
30-step runs on tpu7x-4x4x4 (xpk), steady-state step times, each delta measured against the baselines. Loss and perplexity are unchanged from baseline.
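One way to reduce a 30-step run to a steady-state step time and a delta is sketched below. The warmup length, the choice of the median, and the step times are all assumptions made for illustration; the PR does not state how its numbers were reduced.

```python
from statistics import median


def steady_state_step_time(step_times_s, warmup_steps=10):
    """Median step time after discarding warmup steps (e.g. compilation)."""
    steady = step_times_s[warmup_steps:]
    if not steady:
        raise ValueError("no steps left after warmup")
    return median(steady)


def relative_delta(baseline_s, candidate_s):
    """Fractional change in step time; negative means faster."""
    return (candidate_s - baseline_s) / baseline_s


# Hypothetical 30-step traces in seconds, not measurements from the PR.
baseline = [40.0] * 3 + [10.0] * 27
candidate = [42.0] * 3 + [8.0] * 27

base_t = steady_state_step_time(baseline)
cand_t = steady_state_step_time(candidate)
print(f"baseline {base_t:.2f}s, candidate {cand_t:.2f}s, "
      f"delta {relative_delta(base_t, cand_t):+.1%}")
```

The median is used here so that a single slow step in the steady window does not skew the comparison; a mean over the same window would be an equally valid choice.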
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.