Optimize distillation perf for qwen3-30b (20%→26% MFU) and gpt-oss-20b (17%→19%)#4118

Open
gagika wants to merge 1 commit into main from agagik-distill-perf

@gagika gagika commented Jun 9, 2026

Collaborator

Description

Performance optimization of the qwen3-30b-a3b and gpt-oss-20b distillation configs on
TPU v7x (tpu7x-4x4x4).

MFU on v7x:

  • qwen3-30b-a3b: ~20% → ~26% for the default config (now pdbs=6);
    ~22% → ~24% for the pdbs=8 + activation-offload variant
  • gpt-oss-20b: ~17% → ~19%
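
A rough way to read these numbers: if the model, the hardware peak, and the FLOP accounting are the same before and after (an assumption, the description does not spell out the accounting), MFU is proportional to tokens per second, so the MFU ratio approximates the throughput gain. The sketch below only does that arithmetic; `throughput_gain` is an illustrative name, not something from this PR.

```python
def throughput_gain(mfu_before: float, mfu_after: float) -> float:
    """Tokens/s ratio implied by an MFU change.

    Assumes identical model, hardware peak, and FLOP accounting in both
    runs (an assumption made for this sketch).
    """
    if mfu_before <= 0:
        raise ValueError("mfu_before must be positive")
    return mfu_after / mfu_before


# Approximate MFU figures quoted in the description above.
reported = {
    "qwen3-30b-a3b (pdbs=6)": (0.20, 0.26),
    "qwen3-30b-a3b (pdbs=8 + activation offload)": (0.22, 0.24),
    "gpt-oss-20b": (0.17, 0.19),
}

gains = {name: throughput_gain(b, a) for name, (b, a) in reported.items()}
for name, gain in gains.items():
    print(f"{name}: ~{gain:.2f}x tokens/s")
```

Since the quoted MFU values are themselves rounded, the resulting ratios (roughly 1.30x, 1.09x, and 1.12x) should be read as estimates only.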

All knobs are profile-derived (xplane) and documented inline:

  • context=device + custom remat — keep attention outputs on device so the
    backward pass skips the splash-forward re-runs (per-layer forward call count
    halves in the profile). The dominant win; the configs note the HBM frontiers
    (qwen fits to pdbs=6 with the teacher resident, gpt-oss is pdbs=1 at seq 32k).
  • Megablox grouped-matmul tiles — qwen at the full dims (emb 2048 /
    moe-mlp 768), ~+10% together with the layout change; gpt-oss raises the
    batch-seq m-tile 512 → 1024 (its k/n dims are already full), ~+3%.
  • Splash attention: qwen sets sa_q_layout: SEQ_MINOR; gpt-oss lowers the
    kv-compute sub-blocks 2048 → 1024 (~+2%; uniformly smaller blocks regress).
  • Batch / mesh — qwen default moves pdbs 4 → 6 in the headroom freed by
    dropping host offload; gpt-oss mesh moves to dp2 × fsdp64.
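
To illustrate why keeping attention outputs on device halves the per-layer forward call count, here is a toy model in plain Python. It is not MaxText or JAX code: `ToyLayer`, `train_step`, and `forward_calls_per_layer` are hypothetical names, and the "layer" is a placeholder multiply. It only shows the counting argument: without a saved output, the backward pass re-runs the forward for each layer; with it, the forward runs once.

```python
from dataclasses import dataclass


@dataclass
class ToyLayer:
    """Stand-in for one layer's attention kernel; counts forward invocations."""
    forward_calls: int = 0

    def forward(self, x: float) -> float:
        self.forward_calls += 1
        return 2.0 * x  # placeholder compute


def train_step(layers, x, save_outputs):
    """Run a forward pass, then walk the layers in reverse as a backward pass."""
    saved = []
    h = x
    for layer in layers:
        inp = h
        h = layer.forward(inp)
        # Keep the output only when the remat policy says to save it.
        saved.append((inp, h if save_outputs else None))
    for layer, (inp, out) in zip(reversed(layers), reversed(saved)):
        if out is None:
            out = layer.forward(inp)  # rematerialize: forward runs again
        # A real backward pass would consume `out` here to compute gradients.
    return h


def forward_calls_per_layer(num_layers, save_outputs):
    layers = [ToyLayer() for _ in range(num_layers)]
    train_step(layers, 1.0, save_outputs)
    return [layer.forward_calls for layer in layers]


print(forward_calls_per_layer(4, save_outputs=False))  # recompute in backward
print(forward_calls_per_layer(4, save_outputs=True))   # outputs kept resident
```

The trade-off is the one the description points at: saved outputs cost HBM, which is why the achievable pdbs is bounded once the teacher is also resident.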

Tests

30-step runs on tpu7x-4x4x4 (xpk), steady-state step times, each
delta measured against the baselines. Loss and perplexity unchanged from
baseline.
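
For anyone reproducing the comparison, one possible way to get a steady-state step time from a short run is to drop the first few steps (compilation and warm-up) and take the median of the rest. This is a sketch under assumptions: the description does not say how steady state was computed, the warm-up count of 5 is arbitrary, and the function names are illustrative. The numbers in the usage lines are synthetic, not measurements from this PR.

```python
from statistics import median


def steady_state_step_time(step_times_s, warmup_steps=5):
    """Median step time after discarding the first `warmup_steps` steps.

    The warm-up count is an assumption; pick it from the actual run logs.
    """
    if len(step_times_s) <= warmup_steps:
        raise ValueError("need more steps than warmup_steps")
    return median(step_times_s[warmup_steps:])


def relative_delta(baseline_s, candidate_s):
    """Fractional step-time change versus baseline; negative means faster."""
    return (candidate_s - baseline_s) / baseline_s


# Synthetic 30-step runs: two slow compile steps, then steady state.
baseline_run = [30.0] * 2 + [10.0] * 28
candidate_run = [30.0] * 2 + [8.0] * 28

base = steady_state_step_time(baseline_run)
cand = steady_state_step_time(candidate_run)
print(f"baseline {base:.1f}s, candidate {cand:.1f}s, "
      f"delta {relative_delta(base, cand):+.0%}")
```

Step times are only directly comparable at the same global batch size; when pdbs changes between runs, comparing tokens per second (or MFU) is the safer metric.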

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@gagika gagika changed the title Optimize distillation perf for qwen3-30b (20%→26% MFU) and gpt-oss-20b (17%→18%) Optimize distillation perf for qwen3-30b (20%→26% MFU) and gpt-oss-20b (17%→19%) distillation Jun 9, 2026

codecov Bot commented Jun 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

github-actions Bot commented Jun 9, 2026

🤖 Hi @gagika, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions Bot commented Jun 9, 2026

🤖 I'm sorry @gagika, but I was unable to process your request. Please see the logs for more details.

@gagika gagika changed the title Optimize distillation perf for qwen3-30b (20%→26% MFU) and gpt-oss-20b (17%→19%) distillation Optimize distillation perf for qwen3-30b (20%→26% MFU) and gpt-oss-20b (17%→19%) Jun 9, 2026
@gagika gagika marked this pull request as ready for review June 9, 2026 17:50
@gagika gagika force-pushed the agagik-distill-perf branch from c03fbae to 44f5e6f Compare June 9, 2026 17:54
@@ -1,4 +1,4 @@
-# Qwen3-30b-a3b-base distillation, pdbs=8 + activation offload. ~22% MFU on v7x.
+# Qwen3-30b-a3b-base distillation, pdbs=8 + activation offload. ~24% MFU on v7x.

Collaborator

What is the difference vs. the previous config file?
