Reject CUDA BERT EmbedLayerNorm/SkipLayerNorm shapes exceeding 32-bit output indexing by titaiwangms · Pull Request #29264 · microsoft/onnxruntime

titaiwangms · 2026-06-25T17:48:12Z

Summary

The CUDA EmbedLayerNormalization and SkipLayerNormalization kernels compute output write offsets (row_index * hidden_size) using 32-bit arithmetic. For very large output tensors the element count can exceed INT32_MAX, at which point the offset is no longer representable in 32 bits.

Every output write index in these kernels is a pure function of the launch grid and hidden_size — there is no data-dependent write indexing — so the maximum index is exactly output_element_count - 1, which the host knows from the input shapes before launch. This PR adds a host-side guard in each op's ComputeInternal that computes the output element count in 64-bit arithmetic and returns a clear error when it exceeds the supported 32-bit indexing range.
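To make the representability argument concrete, here is a small host-side sketch in plain C++ (not the kernel code). It computes the largest write index in 64-bit arithmetic and checks whether it fits in a 32-bit signed int. The helper names and the example shapes are illustrative assumptions, not part of onnxruntime.

```cpp
#include <cstdint>
#include <limits>

// Largest output element index written by a kernel that covers
// row_count rows of hidden_size elements each: row_count * hidden_size - 1.
// The product is formed in 64-bit so it does not wrap at the 32-bit boundary.
// (Hypothetical helper for illustration only.)
inline int64_t MaxOutputIndex(int64_t row_count, int64_t hidden_size) {
  return row_count * hidden_size - 1;
}

// True when the given index is representable as a 32-bit signed int.
inline bool FitsInInt32(int64_t max_index) {
  return max_index <= std::numeric_limits<int32_t>::max();
}
```

For example, 512 rows of 768 elements give a maximum index of 393,215, well inside the range, while 2,097,153 rows of 1,024 elements give 2,147,484,671, which a 32-bit signed offset cannot hold.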

Design

EmbedLayerNormalization (embed_layer_norm.cc): output_element_count = (int64)batch_size * sequence_length * hidden_size, guarded with ORT_RETURN_IF_NOT(... <= INT32_MAX, ...).
SkipLayerNormalization (skip_layer_norm.cc): output_element_count = input->Shape().Size() (output shares the input shape), same guard.
Kernels are unchanged — they keep the original int32 indexing, so there is no extra register/occupancy cost in the hot path. This is pure host-side validation.
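The guard described above can be sketched roughly as follows. This is a standalone approximation, not the PR's code: the `Status` struct and the `CheckOutputIndexable` helper are hypothetical stand-ins for onnxruntime's status type and the `ORT_RETURN_IF_NOT` macro, and the error text is invented.

```cpp
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical stand-in for a status type; onnxruntime's real Status differs.
struct Status {
  bool ok;
  std::string message;
};

// Sketch of the host-side check: widen to int64 before the first multiply,
// then compare the element count against INT32_MAX.
inline Status CheckOutputIndexable(int batch_size, int sequence_length, int hidden_size) {
  const int64_t output_element_count =
      static_cast<int64_t>(batch_size) * sequence_length * hidden_size;
  if (output_element_count > std::numeric_limits<int32_t>::max()) {
    return {false, "output element count " + std::to_string(output_element_count) +
                       " exceeds the 32-bit indexing range"};
  }
  return {true, ""};
}
```

Because the check runs once on the host before launch, the kernel can keep its 32-bit offsets and its register footprint stays the same.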

Behavior

This rejects (rather than silently attempting) single-op LayerNorm outputs with more than INT32_MAX (2³¹ − 1) elements, a regime no real BERT-family model produces (it would require a multi-GB single-op activation). For all supported shapes there is no behavior or numeric change.
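As a rough check on the "multi-GB" figure, the sketch below computes the bytes needed by the smallest rejected output (2³¹ elements) at a given element width. The helper and constant names are illustrative, and element sizes of 2 bytes (fp16) and 4 bytes (fp32) are assumed.

```cpp
#include <cstdint>

// Bytes occupied by a tensor with element_count elements of element_size bytes.
// (Illustrative helper, not part of onnxruntime.)
constexpr int64_t TensorBytes(int64_t element_count, int64_t element_size) {
  return element_count * element_size;
}

// Smallest element count the guard rejects: INT32_MAX + 1 = 2^31.
constexpr int64_t kFirstRejectedCount = int64_t{1} << 31;
// One gibibyte in bytes.
constexpr int64_t kGiB = int64_t{1} << 30;
```

Under these assumptions the smallest rejected output occupies 4 GiB in fp16 and 8 GiB in fp32.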

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR addresses integer overflow in CUDA BERT LayerNorm-family kernels by widening the global element write offset (row * hidden_size) from 32-bit to 64-bit, preventing wrapped output indexing for very large tensors.

Changes:

Widen LayerNorm device helper offset/index parameters to int64_t in layer_norm.cuh.
Compute per-row offsets/indices in 64-bit in skip_layer_norm_impl.cu kernels.
Compute output_offset in 64-bit in embed_layer_norm_impl.cu and pass it through to LayerNorm.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| onnxruntime/contrib_ops/cuda/bert/skip_layer_norm_impl.cu | Uses `int64_t` for per-row `offset`/`idx` to avoid overflow when indexing large output tensors. |
| onnxruntime/contrib_ops/cuda/bert/layer_norm.cuh | Updates LayerNorm helpers to accept 64-bit offsets and use 64-bit indices for global element access. |
| onnxruntime/contrib_ops/cuda/bert/embed_layer_norm_impl.cu | Uses 64-bit `output_offset` for writing/normalizing large outputs in EmbedLayerNorm. |

tianleiwu · 2026-06-25T20:41:12Z

Is it needed? The typical max sequence length for a BERT model is 512, so an int32 offset is enough.
You could just check in host code, e.g. sequence_length * hidden_size < int32_max, with no need to change the CUDA kernel. Using int64 in the CUDA kernel will use more registers.

… output indexing

The CUDA EmbedLayerNormalization and SkipLayerNormalization kernels compute output write offsets (row_index * hidden_size) using 32-bit arithmetic. For very large output tensors the element count can exceed INT32_MAX and the offset would no longer be representable in 32 bits.

Every output write index in these kernels is a pure function of the launch grid and hidden_size (no data-dependent write indexing), so the maximum index is exactly output_element_count - 1, which the host knows from the input shapes before launch.

Add a host-side guard in each ComputeInternal that computes the output element count in 64-bit arithmetic and returns a clear error when it exceeds the supported 32-bit indexing range, instead of silently relying on the int32 kernels for shapes they cannot index.

Kernels are unchanged (int32 baseline); no numeric behavior change for supported shapes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

tianleiwu

Thanks — the host-side guard approach is sound and keeps the hot kernel untouched (zero cost in the inference path).

SkipLayerNormalization (skip_layer_norm.cc) — looks complete. The guard is fully sound: the output, sum_output, and the skip[idx % skip_size] read in skip_layer_norm_impl.cu are all bounded by row_count * hidden_size, which equals input->Shape().Size() — exactly what the guard checks. Max device index is output_element_count - 1, so every write and read site is covered.

EmbedLayerNormalization (embed_layer_norm.cc). The 64-bit accumulation is correct (static_cast<int64_t>(batch_size) * sequence_length * hidden_size promotes before any multiply), and the guard correctly bounds the output-write offset.

One residual gap worth noting (same root issue the automated reviewer raised): in embed_layer_norm_impl.cu the embedding reads word_offset = word_id * hidden_size, segment_offset, and position_offset are still 32-bit int. These index into the embedding tables (word_embedding is [vocab_size, hidden_size], etc.), whose sizes are independent of the output element count. A model with a large embedding table (vocab_size * hidden_size > 2^31) but a modest output would still overflow these reads silently — the new guard does not cover that path. Suggest either widening those three offsets to int64_t (with the pointer arithmetic that uses them), or narrowing the PR framing to explicitly cover only output-write indexing.

See the gap described in the comment above.

tianleiwu

Suggest adding a check of vocab_size * hidden_size too in this PR or a follow-up PR.
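One way such a check could look, as a hedged sketch only: the helper below is hypothetical and is not taken from the PR or from onnxruntime. It assumes each embedding table is a 2-D `[rows, hidden_size]` tensor and rejects tables whose element count exceeds INT32_MAX, mirroring the output guard.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: true when every element of a [rows, hidden_size] table
// can be addressed with a 32-bit signed offset such as word_id * hidden_size + i.
inline bool EmbeddingTableIndexable(int64_t rows, int64_t hidden_size) {
  // The largest offset read is rows * hidden_size - 1, so requiring the element
  // count to be at most INT32_MAX is sufficient (and slightly conservative).
  return rows * hidden_size <= std::numeric_limits<int32_t>::max();
}
```

The same predicate would apply to the word, position, and segment tables, since each is indexed by a row id multiplied by hidden_size.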

titaiwangms requested a review from Copilot June 25, 2026 18:21

Copilot started reviewing on behalf of titaiwangms June 25, 2026 18:22

Copilot AI reviewed Jun 25, 2026

Comment thread onnxruntime/contrib_ops/cuda/bert/embed_layer_norm_impl.cu Outdated

titaiwangms force-pushed the standalone-int64-bert-layernorm-write-offsets branch from 1379258 to 0b9d5e2 Compare June 25, 2026 22:12

titaiwangms changed the title ~~Use 64-bit element offsets in CUDA BERT LayerNorm/SkipLayerNorm write index~~ Reject CUDA BERT EmbedLayerNorm/SkipLayerNorm shapes exceeding 32-bit output indexing Jun 25, 2026

tianleiwu reviewed Jun 26, 2026

tianleiwu previously approved these changes Jun 26, 2026

tianleiwu approved these changes Jun 26, 2026

titaiwangms wants to merge 1 commit into microsoft:main from titaiwangms:standalone-int64-bert-layernorm-write-offsets
