Current Scaling Group Quantization + Enabling Varying Last/Both Dims in Group Quantize by vthumbe1503 · Pull Request #3114 · NVIDIA/TransformerEngine

vthumbe1503 · 2026-06-10T13:25:28Z

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Route grouped Float8CurrentScalingQuantizer through the existing grouped quantize entry point, prepare per-group current-scaling metadata with existing amax/scale helpers, and add focused tests plus a GB200 bandwidth benchmark. Orchestra-Work-Order: wo_aea2e337b06582111bba66a6d6158a9e Orchestra-Task: task_5507e814ee50f9ff304a4ce708d19768 Orchestra-Run: run_516e1e26891f4ce7d4cde07147c10862

Use wider vectorized grouped FP8 cast-transpose tiles and vectorized masked stores for rowwise and columnwise outputs. Capture all benchmark modes in a single post-warmup profiler range. Orchestra-Work-Order: wo_aea2e337b06582111bba66a6d6158a9e Orchestra-Task: task_3d6e33eab11e293d72eb4394bad76a81 Orchestra-Run: run_a6e2c31d5fdf850594f71438e53148da

Route non-MXFP8 grouped-linear bias backward through group_quantize plus grouped dbias while keeping MXFP8 bgrad_group_quantize fusion intact. Add focused zero-row grouped FP8 coverage and a current-scaling GroupedLinear bias-backward regression. Orchestra-Work-Order: wo_aea2e337b06582111bba66a6d6158a9e Orchestra-Task: task_ab566800d87047635cd27f9e64661abe Orchestra-Run: run_5f9bfef17ccd854232c54d56268ef9e8

Use packed FP8 conversion and reduce columnwise transpose staging register and synchronization overhead in group_cast_fp8_kernel. Orchestra-Work-Order: wo_aea2e337b06582111bba66a6d6158a9e Orchestra-Task: task_7a830e018ceac8de0018280bd0740a54 Orchestra-Run: run_d2f1df4ffc2265d9cfa5ed01028ee476

Match the grouped FP8 conversion helper's element-count template parameter to Vec's uint32_t parameter so rowwise, columnwise, and activation instantiations can build. Orchestra-Work-Order: wo_aea2e337b06582111bba66a6d6158a9e Orchestra-Task: task_30c4b6ddb896e5ea3ca5b54731d2c819 Orchestra-Run: run_e95cdbb445943304622b95736f0eca49

Use cached grouped offsets to avoid launching FP8 quantization over unused overallocated rows, permit larger grouped backing buffers when split metadata is present, and tighten full-tile vector paths in the grouped FP8 cast kernel. Orchestra-Work-Order: wo_aea2e337b06582111bba66a6d6158a9e Orchestra-Task: task_c5db93823dc101838cb1323e283cd6e9 Orchestra-Run: run_063e2e4c724e132612aa5597d6765c9b

Use the FP8 grouped output logical shape when computing the tensor-scaling launch grid so overallocated buffers with active metadata avoid empty tail-row launches while preserving the allocated-shape fallback. Orchestra-Work-Order: wo_aea2e337b06582111bba66a6d6158a9e Orchestra-Task: task_b4abb47c990404d73142342a19996a3f Orchestra-Run: run_8f09e7b9d7af9754ef505f2e2ce3cf90

Use larger grouped FP8 tiles with 8-warp CTAs and 16-row columnwise store fragments. Treat uniform overallocated FP8 grouped outputs as same-shape wrappers during output reuse so the timed path avoids varying-shape metadata overlaunch. Add overallocated current-scaling coverage for all grouped FP8 direction modes. Orchestra-Work-Order: wo_aea2e337b06582111bba66a6d6158a9e Orchestra-Task: task_3f98ac9c5b82192ec289d8d2a9816c7f Orchestra-Run: run_83f3b99cc950024cf06ee836337fbf72

Stage columnwise transpose fragments through shared-memory vectors with smaller columnwise row tiles to reduce register pressure and barrier overhead while preserving the larger rowwise-only store path. Orchestra-Work-Order: wo_aea2e337b06582111bba66a6d6158a9e Orchestra-Task: task_495cc57eef84749103aded403a508d99 Orchestra-Run: run_53e038e90f83186bc6c12cb722c986b5

Add fast grouped FP8 rowwise and full-tile columnwise paths for uniform active groups while preserving the general fallback for varying grouped metadata. Orchestra-Work-Order: wo_aea2e337b06582111bba66a6d6158a9e Orchestra-Task: task_4c33e88776c8a7148e9da5cc2bae84ea Orchestra-Run: run_2caaff219394eb5d59b7be38ab2bf346

Add a same-shape bidirectional full-tile kernel with wider input vectors and rowwise stores while preserving the existing rowwise-only, columnwise-only, and fallback grouped paths. Orchestra-Work-Order: wo_aea2e337b06582111bba66a6d6158a9e Orchestra-Task: task_87cec01d94f053b53e3c79377ad379ab Orchestra-Run: run_ed48db00a730a4bf56530d551ecd350e

Route same-shape rowwise+columnwise grouped FP8 tensor-scaling quantization through the compact full-tile transpose schedule instead of the wide dynamic-shared-memory variant, preserving the existing single-direction and fallback paths. Orchestra-Work-Order: wo_aea2e337b06582111bba66a6d6158a9e Orchestra-Task: task_fdddd228a620039c024b4ecf43f3ab42 Orchestra-Run: run_30a2753eea9c893cb0fadb8233da8ce6

Hint the rowwise stores in the full-tile rowwise+columnwise grouped FP8 path as streaming global stores to reduce cache/writeback pressure without changing single-direction launch geometry. Orchestra-Work-Order: wo_aea2e337b06582111bba66a6d6158a9e Orchestra-Task: task_bf82020032e68276f4e47c65f62d97ae Orchestra-Run: run_754ea4c864f329c6f2003b413b723c43

Add graph-safe grouped FP8 tensor-scaling metadata, support varying last dimensions, preserve same-shape fast paths, adjust grouped FP8 columnwise allocation by architecture, and expand benchmark/test coverage for the reviewed shape cases. Orchestra-Work-Order: wo_9d18259ce6c2833da1178606e08d251a Orchestra-Task: task_d104e74844fbc3d3b1a98a8d96d76037 Orchestra-Run: run_1314e997c61ffb92ff7120b0b26f0318

Map varying-last columnwise tiles per group to avoid tile-alignment device errors, expand nonaligned boundary coverage, and restore same-shape benchmark baseline criteria. Orchestra-Work-Order: wo_9d18259ce6c2833da1178606e08d251a Orchestra-Task: task_14e0e7973300d26f69550bc0aee21acc Orchestra-Run: run_2f42b8ba138ed8b2b4d9dc90b92caf85

Add grouped FP8 benchmark support for baseline-ref same-session reports and update the benchmark request to enforce same-shape baseline regression checks alongside the per-mode throughput thresholds. Orchestra-Work-Order: wo_9d18259ce6c2833da1178606e08d251a Orchestra-Task: task_d0cada957a4aafdce9d52be86520e182 Orchestra-Run: run_4da74e9bdb4f4a4c72304a385692b6c9

Update the grouped FP8 benchmark driver so same-session baseline checks out and builds the baseline ref into an isolated PyTorch install, verifies the baseline subprocess loads those shared objects, and preserves the required same-shape baseline comparisons. Orchestra-Work-Order: wo_9d18259ce6c2833da1178606e08d251a Orchestra-Task: task_4fd88b172872f547f2f2d0053dce73d1 Orchestra-Run: run_6a44ee0467ffff47d4b278de6127354d

Preserve grouped delayed-FP8 amax metadata and keep unsupported FP8 tensor-scaling quantizers out of the grouped GEMM path. Orchestra-Work-Order: wo_9d18259ce6c2833da1178606e08d251a Orchestra-Task: task_2aa8e6bf11ae356f4b34d4540b508031 Orchestra-Run: run_302681098d7f4e05b0ad96450f2d9826

Set NVTE_GROUPED_LINEAR_SINGLE_PARAM inside the targeted state-dict tests so they exercise the gated single grouped parameter path without relying on external environment setup. Orchestra-Work-Order: wo_9d18259ce6c2833da1178606e08d251a Orchestra-Task: task_261900f987bdc9397965019983a77c41 Orchestra-Run: run_c6624e34717cbe121b3e0edcf490e3d3

Add a segmented flat rowwise kernel for varying-first grouped FP8 tensor-scaling outputs while preserving the existing same-shape fast path. Orchestra-Work-Order: wo_9d18259ce6c2833da1178606e08d251a Orchestra-Task: task_c1b7020b27290318848ef6ac9048dd5f Orchestra-Run: run_5c257b8a5d2e7e4aa95e67aa16436166

Omit the last_dims keyword when absent so the same-session baseline can run against the base extension, and refresh the benchmark request to include direct varying-last current-scaling coverage. Orchestra-Work-Order: wo_9d18259ce6c2833da1178606e08d251a Orchestra-Task: task_c20a3c94fdc798e741a469bd7bb9c4df Orchestra-Run: run_457448e6cba80fc63ac72b3db71c5fd0

Dispatch varying-first tensor-scaling work per group to reduce inactive-tail CTAs and offset lookup overhead while preserving same-shape fast paths and graph-safe device metadata handling. Orchestra-Work-Order: wo_9d18259ce6c2833da1178606e08d251a Orchestra-Task: task_d84e1fefef8641e558df064452f4689b Orchestra-Run: run_a361ca2f93fcec53ddd60dd99f4639e5

Add a no-tail rowwise flat kernel for aligned varying-first grouped FP8 tensor-scaling quantization and keep same-shape and varying-last dispatch isolated. Tighten benchmark profiler timing so post-warmup measured ranges exclude profiler start overhead. Orchestra-Work-Order: wo_9d18259ce6c2833da1178606e08d251a Orchestra-Task: task_2e478be1fb38195f36d25c51320dc01f Orchestra-Run: run_9a133a75fa3d98dc3b1a63b0ff4d84af

Write grouped FP8 benchmark reports to a sidecar path by default and label script reports as benchmark_raw_report/v1 so regular 100-iteration measurements are fetched instead of the wrapper command report. Orchestra-Work-Order: wo_9d18259ce6c2833da1178606e08d251a Orchestra-Task: task_27770b2e1d490b1a3053244d4b4ce248 Orchestra-Run: run_214052d0c1316e231443d645183a2675

Write the grouped FP8 benchmark JSON once and mirror the completed sidecar to ORCHESTRA_BENCHMARK_RAW_REPORT when running under Orchestra so the benchmark fetch path can parse the emitted measurements. Orchestra-Work-Order: wo_9d18259ce6c2833da1178606e08d251a Orchestra-Task: task_b2e2747371204088c8e3f7cf10263164 Orchestra-Run: run_1d4ea38266807c8acb59143ee74ba241

Allow the grouped FP8 benchmark to use ORCHESTRA_BENCHMARK_RAW_REPORT as its primary output so the benchmark wrapper can fetch canonical measurements directly. Orchestra-Work-Order: wo_9d18259ce6c2833da1178606e08d251a Orchestra-Task: task_10fdcfef6b70de4676b7843e4bbfac31 Orchestra-Run: run_4ce57df9e86d6d03a26f7aa95ac252cc

Write canonical grouped FP8 benchmark measurements to ORCHESTRA_BENCHMARK_RAW_REPORT in a small schema-shaped payload so the benchmark wrapper can materialize per-mode threshold evidence. Orchestra-Work-Order: wo_9d18259ce6c2833da1178606e08d251a Orchestra-Task: task_3e862eebd585c74f2a58497fedea3511 Orchestra-Run: run_3770ab3dbbf51329d0839b3d10a91b5c

Write candidate_results and nonempty measurements into the Orchestra raw report path, and fail fast if the benchmark cannot produce threshold-ready evidence. Orchestra-Work-Order: wo_9d18259ce6c2833da1178606e08d251a Orchestra-Task: task_aa587a7b0d35aa9c2b715ec1b7c8bec3 Orchestra-Run: run_b42870e5d5e142a6cbf53bb5a3cafc2e

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

for more information, see https://pre-commit.ci

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

for more information, see https://pre-commit.ci

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

…TransformerEngine into current_scaling_group_quant

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

vthumbe1503 · 2026-06-13T01:04:08Z

/te-ci pytorch

greptile-apps · 2026-06-13T01:36:27Z

Greptile Summary

This PR adds grouped FP8 current-scaling quantization (amax-on-the-fly, per-group scale/scale_inv) and extends the existing grouped-quantize infrastructure to support varying last dim and varying both dims, in addition to the previously supported varying first dim.

New CUDA kernels: flat vectorized grouped-amax kernel (group_amax_fp8.cuh) with 4-way unrolled load + atomicMaxFloat, per-group scale/scale_inv update kernel, and a new grouped FP8 cast kernel (group_quantize_fp8.cuh) with a Blackwell (sm100+) fast path using mul_cvt_4x.
Infrastructure: nvte_splits_to_offsets_2d for bidimensional offset computation; all create_grouped_tensor overloads extended with last_dims; Python make_grouped_tensor updated to encode varying-both as (1, total_elements) logical shape; CUDA-graph-capture guard unified across all varying-dim paths.
C++ / Python integration: group_quantize and nvfp4_group_quantize_with_amax accept the new last_dims / noop_flag kwargs; grouped_mlp.py updated to pass amax tensors as keyword args following the signature change.
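
To illustrate why the `grouped_mlp.py` call site had to move to keyword arguments, here is a minimal sketch using toy Python functions. The function names and signatures are hypothetical stand-ins, not the real Transformer Engine bindings; they only show what happens when an optional `last_dims` parameter is inserted ahead of the amax arguments.

```python
# Toy stand-ins for a binding before and after `last_dims` was inserted.
# These are NOT the real Transformer Engine signatures; they are assumptions
# made purely to illustrate the positional-argument shift.

def group_quantize_old(tensor, first_dims, rowwise_amax, columnwise_amax):
    """'Before' signature: the amax tensors are the 3rd and 4th positional args."""
    return {
        "last_dims": None,
        "rowwise_amax": rowwise_amax,
        "columnwise_amax": columnwise_amax,
    }


def group_quantize_new(tensor, first_dims, last_dims=None,
                       rowwise_amax=None, columnwise_amax=None):
    """'After' signature: `last_dims` now sits in front of the amax tensors."""
    return {
        "last_dims": last_dims,
        "rowwise_amax": rowwise_amax,
        "columnwise_amax": columnwise_amax,
    }


tensor, first_dims = [1.0, 2.0], [2]
row_amax, col_amax = [0.5], [0.25]

# A positional call written against the old signature silently misbinds:
# row_amax lands in `last_dims` and col_amax lands in `rowwise_amax`.
shifted = group_quantize_new(tensor, first_dims, row_amax, col_amax)

# Passing the amax tensors by keyword is immune to the inserted parameter.
fixed = group_quantize_new(tensor, first_dims,
                           rowwise_amax=row_amax, columnwise_amax=col_amax)
```

Keyword arguments keep the call correct regardless of where new optional parameters are added later.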

Confidence Score: 4/5

Safe to merge for the FP8 current-scaling and MXFP8 varying-dim paths; the NVFP4 code path needs a missing guard before last_dims can be safely exposed to callers.

nvfp4_group_quantize_with_amax now accepts a last_dims argument, but the NVFP4 kernel was not updated to handle varying last dim. Unlike bgrad_group_quantize (which immediately rejects a non-None last_dims), there is no equivalent guard here. A caller passing a non-None last_dims would silently propagate the metadata into the NVFP4 dispatch path, which only knows about varying-first-dim layouts and would produce incorrect output.

transformer_engine/pytorch/csrc/extensions/cast.cpp: specifically the nvfp4_group_quantize_with_amax function body, which is missing an NVTE_CHECK(!last_dims.has_value(), …) guard.

Important Files Changed

Filename	Overview
transformer_engine/pytorch/csrc/extensions/cast.cpp	Adds grouped FP8 current-scaling quantize path: amax compute → optional NCCL reduce → scale/scale_inv kernel, plus `last_dims` and `noop_flag` threading through `group_quantize` / `nvfp4_group_quantize_with_amax` / `bgrad_group_quantize`. `bgrad_group_quantize` correctly guards against `last_dims.has_value()`, but `nvfp4_group_quantize_with_amax` lacks an equivalent guard.
transformer_engine/pytorch/csrc/extensions/pybind.cpp	Inserts `last_dims=py::none()` before required `rowwise_amax`/`columnwise_amax` in `nvfp4_group_quantize_with_amax` binding; the caller in `grouped_mlp.py` was updated to use keyword arguments, resolving the positional-arg shift flagged in a previous review thread.
transformer_engine/pytorch/csrc/quantizer.cpp	Adds `last_dims` parameter to all `create_grouped_tensor` overloads; introduces `build_grouped_tensor_offsets` 2D path (first+last, first-only, last-only); correctly adds `is_varying_both` scale allocation for MXFP8; FP8CurrentScaling guard rejects varying-both-dims; no issues found.
transformer_engine/pytorch/tensor/storage/grouped_tensor_storage.py	Removes hard-coded `last_dims=None` assumption; adds VARYING_LAST_DIM and VARYING_BOTH_DIMS offset/shape computation; unified CUDA-graph-capture guard; allocates `scale` and `amax` buffers for current-scaling Python path. Logic for varying-both encoding as `(1, total_elements)` is consistent with C++ path.
transformer_engine/common/cast/fp8/group_amax_fp8.cuh	New file: flat vectorized grouped amax kernel with 4-way unrolled inner loop, shared-memory offsets, atomicMaxFloat reduction; plus per-group scale/scale_inv kernel. Both kernels honor the noop flag for CUDA-graph-safe weight skipping.
transformer_engine/common/cast/fp8/group_quantize_fp8.cuh	New file: grouped FP8 current-scaling quantize kernels for SAME/VARYING_FIRST/VARYING_LAST shape representations; fast path for Blackwell (sm100+) with 4x-vectorized CVT; 2D grid dispatch with per-block bounds checks for varying last dim.
transformer_engine/common/recipe/current_scaling.cu	Adds `group_compute_amax_impl` that dispatches the new flat grouped-amax kernel; `nvte_group_compute_amax_with_config` and `nvte_group_compute_scale_from_amax` public APIs implemented correctly.
transformer_engine/common/util/splits_to_offsets.cu	Adds `splits_to_offsets_2d_kernel` that computes `cumsum(first_dims[i] * last_dims[i])`; single-block with chunked inner loop for `num_tensors > kThreadsPerBlock`; `nvte_splits_to_offsets_2d` API wrapper with full input validation.
transformer_engine/common/cast/dispatch/quantize.cuh	Adds `NVTE_DELAYED_TENSOR_SCALING` case to `group_quantize_fwd_helper`, dispatching to the new `fp8::group_quantize`; only reached from the FP8 current-scaling path since delayed FP8 is rejected upstream.
transformer_engine/pytorch/ops/fused/grouped_mlp.py	Switches `rowwise_amax` and `columnwise_amax` to keyword arguments in the `nvfp4_group_quantize_with_amax` call, correctly addressing the positional-arg shift introduced by inserting `last_dims` before them.
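
The 2D offset computation described for `splits_to_offsets.cu`, `cumsum(first_dims[i] * last_dims[i])`, can be sketched on the host as follows. This is a reference illustration only: the function name is hypothetical, and the leading zero in the returned list is an assumption about the offset convention rather than something the summary states.

```python
from itertools import accumulate


def splits_to_offsets_2d_reference(first_dims, last_dims):
    """Host-side reference for per-group element offsets.

    Group i holds first_dims[i] * last_dims[i] elements; the offsets are the
    running sum of those sizes. The leading 0 is an assumed convention.
    """
    if len(first_dims) != len(last_dims):
        raise ValueError("first_dims and last_dims must have the same length")
    sizes = [f * l for f, l in zip(first_dims, last_dims)]
    return [0] + list(accumulate(sizes))


# Three groups with shapes (2, 4), (0, 8) and (3, 2); the middle group is empty.
offsets = splits_to_offsets_2d_reference([2, 0, 3], [4, 8, 2])
total_elements = offsets[-1]

# The summary says varying-both tensors are encoded with a
# (1, total_elements) logical shape.
logical_shape = (1, total_elements)
```

An empty group simply repeats the previous offset, so zero-row groups need no special casing in this formulation.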

Sequence Diagram

sequenceDiagram
    participant PY as Python (group_quantize)
    participant CPP as cast.cpp
    participant AMAX as nvte_group_compute_amax_with_config
    participant NCCL as NCCL allreduce (optional)
    participant SCALE as nvte_group_compute_scale_from_amax
    participant QUANT as nvte_group_quantize (FP8 cast kernel)

    PY->>CPP: group_quantize(tensor, Float8CurrentScalingQuantizer, first_dims, last_dims, noop_flag)
    CPP->>CPP: create_grouped_tensor (allocates data, amax, scale, scale_inv)
    CPP->>AMAX: compute per-group amax from input data
    AMAX-->>CPP: amax[0..N-1] written
    alt "with_amax_reduction=True"
        CPP->>NCCL: allreduce(amax, MAX)
    end
    CPP->>SCALE: "derive scale = fp8_max / amax, scale_inv = 1/scale"
    SCALE-->>CPP: scale[0..N-1], scale_inv[0..N-1] written
    CPP->>QUANT: cast input to FP8 using per-group scale
    QUANT-->>CPP: grouped FP8 output tensor
    CPP-->>PY: grouped output Python object
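
The amax, scale, and cast steps in the diagram can be sketched as a pure-Python reference. This is a simplified sketch under stated assumptions, not the kernels in this PR: `FP8_E4M3_MAX = 448.0` is the largest finite E4M3 magnitude, the zero-amax fallback to a scale of 1.0 is an assumption, all function names are hypothetical, and the "cast" only scales and clamps without rounding to the FP8 grid.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def group_amax(flat, offsets):
    """Per-group max(|x|) over a flat buffer; empty groups yield 0.0."""
    return [
        max((abs(x) for x in flat[offsets[g]:offsets[g + 1]]), default=0.0)
        for g in range(len(offsets) - 1)
    ]


def scale_from_amax(amax, fp8_max=FP8_E4M3_MAX):
    """scale = fp8_max / amax and scale_inv = 1 / scale, per group.

    Assumption: a zero amax falls back to scale 1.0 to avoid dividing by zero.
    """
    scale = [fp8_max / a if a > 0.0 else 1.0 for a in amax]
    scale_inv = [1.0 / s for s in scale]
    return scale, scale_inv


def group_scale_and_clamp(flat, offsets, scale, fp8_max=FP8_E4M3_MAX):
    """Multiply each group by its scale and clamp to the FP8 range.

    A real cast would also round to the nearest FP8 value; that is omitted.
    """
    out = []
    for g, s in enumerate(scale):
        for x in flat[offsets[g]:offsets[g + 1]]:
            out.append(max(-fp8_max, min(fp8_max, x * s)))
    return out


# Two groups stored back to back: [1.0, -2.0, 0.5] and [4.0, -8.0].
flat = [1.0, -2.0, 0.5, 4.0, -8.0]
offsets = [0, 3, 5]

amax = group_amax(flat, offsets)                      # [2.0, 8.0]
scale, scale_inv = scale_from_amax(amax)              # scale == [224.0, 56.0]
scaled = group_scale_and_clamp(flat, offsets, scale)  # largest value per group hits +/-448
```

With `with_amax_reduction=True`, the diagram places a MAX allreduce between the first and second steps; in this sketch that would amount to taking the elementwise maximum of each rank's `amax` list before calling `scale_from_amax`.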

Comments Outside Diff (1)

transformer_engine/pytorch/csrc/extensions/cast.cpp, line 383-395 (link)

nvfp4_group_quantize_with_amax missing last_dims guard

bgrad_group_quantize immediately rejects a non-None last_dims with an explicit NVTE_CHECK. nvfp4_group_quantize_with_amax has no equivalent check. The NVFP4 group-quantize kernel was not updated in this PR to handle VARYING_LAST_DIM or VARYING_BOTH_DIMS layouts. If a caller passes a non-None last_dims, the metadata is forwarded into the NVFP4 dispatch path where it would produce incorrect output or an internal assertion failure. Adding the same guard as bgrad_group_quantize makes the limitation explicit.
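
The suggested guard belongs in C++ (`NVTE_CHECK`), but the intended behavior can be sketched in Python for clarity. The wrapper below is hypothetical and does not call any real Transformer Engine function; it only shows the fail-fast pattern of rejecting a non-None `last_dims` before it reaches a path that supports varying-first-dim layouts only.

```python
def nvfp4_group_quantize_guarded(tensor, first_dims, last_dims=None):
    """Hypothetical wrapper that mirrors the proposed guard.

    Rejects `last_dims` up front instead of forwarding unsupported metadata
    into a dispatch path that only understands varying-first-dim layouts.
    """
    if last_dims is not None:
        raise NotImplementedError(
            "NVFP4 grouped quantize does not support varying last dims"
        )
    # Placeholder for the real dispatch; returns its inputs for illustration.
    return ("varying_first_dim", tensor, first_dims)


# Supported call: no last_dims metadata.
ok = nvfp4_group_quantize_guarded([1.0, 2.0], [2])

# Unsupported call: fails loudly rather than producing incorrect output.
try:
    nvfp4_group_quantize_guarded([1.0, 2.0], [2], last_dims=[1])
    rejected = False
except NotImplementedError:
    rejected = True
```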

Reviews (5): Last reviewed commit: "Merge branch 'current_scaling_group_quan..."

… specific Signed-off-by: Varun Thumbe <vthumbe@vthumbe-mlt.client.nvidia.com>

vthumbe1503 · 2026-06-14T19:42:29Z

/te-ci

vthumbe1503 · 2026-06-14T19:55:39Z

Pipeline: 54747206

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

…scale from amax Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

for more information, see https://pre-commit.ci

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

…TransformerEngine into current_scaling_group_quant

Removed duplicate brief comment about scaled prefix-sum offsets. Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

Orchestra and others added 30 commits May 14, 2026 18:42

changes to improve amax kernel

79efcca

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

cleanups

0c3f0c7

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

vthumbe1503 and others added 2 commits June 12, 2026 08:01

address review comments

587ba3d

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

6986fc8

for more information, see https://pre-commit.ci

Oleg-Goncharov reviewed Jun 12, 2026

Comment thread transformer_engine/common/cast/fp8/group_amax_fp8.cuh Outdated

Oleg-Goncharov reviewed Jun 12, 2026

Comment thread transformer_engine/pytorch/csrc/extensions/cast.cpp Outdated

vthumbe1503 and others added 6 commits June 12, 2026 15:57

Merge branch 'main' into current_scaling_group_quant

5c7d1d6

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

address review comments, fix lint errors

d6856a4

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

624f94c

for more information, see https://pre-commit.ci

no need for type_trait

59e06f2

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Merge branch 'current_scaling_group_quant' of github.com:vthumbe1503/…

34b6201

…TransformerEngine into current_scaling_group_quant

varying last dims can also be overallocated

b42af46

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

vthumbe1503 marked this pull request as ready for review June 13, 2026 01:03

vthumbe1503 requested review from ksivaman and ptrendx as code owners June 13, 2026 01:03

greptile-apps Bot reviewed Jun 13, 2026

Comment thread transformer_engine/pytorch/tensor/storage/grouped_tensor_storage.py

Varun Thumbe and others added 2 commits June 14, 2026 12:21

split common into layout and tma files for arch specific vs non arch…

d7d6628

… specific Signed-off-by: Varun Thumbe <vthumbe@vthumbe-mlt.client.nvidia.com>

Merge branch 'main' into current_scaling_group_quant

032dc15

greptile-apps Bot reviewed Jun 14, 2026

Comment thread transformer_engine/common/cast/fp8/group_quantize_fp8.cuh

address review comment

7946edd

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

greptile-apps Bot reviewed Jun 14, 2026

Comment thread transformer_engine/pytorch/csrc/extensions/cast.cpp Outdated

vthumbe1503 and others added 2 commits June 14, 2026 21:56

no need to depend on multi tensor impl.. nvte API to compute grouped …

cce8d9f

…scale from amax Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

53470ea

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed Jun 14, 2026

Comment thread transformer_engine/pytorch/csrc/extensions/pybind.cpp

vthumbe1503 added 2 commits June 14, 2026 22:13

fix test

c65faa6

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Merge branch 'current_scaling_group_quant' of github.com:vthumbe1503/…

1afb576

…TransformerEngine into current_scaling_group_quant

vthumbe1503 requested a review from timmoon10 as a code owner June 14, 2026 22:13

Remove duplicate comment in transformer_engine.h

c73a0cc

Removed duplicate brief comment about scaled prefix-sum offsets. Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

Current Scaling Group Quantization + Enabling Varying Last/Both Dims in Group Quantize #3114
vthumbe1503 wants to merge 75 commits into NVIDIA:main from vthumbe1503:current_scaling_group_quant

2 participants
