Skip to content

compiler: Avoid int32 overflow in linearized host-device transfer size#2939

Open
gaoflow wants to merge 2 commits into
devitocodes:mainfrom
gaoflow:fix-2777-transfer-size-overflow
Open

compiler: Avoid int32 overflow in linearized host-device transfer size#2939
gaoflow wants to merge 2 commits into
devitocodes:mainfrom
gaoflow:fix-2777-transfer-size-overflow

Conversation

@gaoflow
Copy link
Copy Markdown
Contributor

@gaoflow gaoflow commented May 29, 2026

Description

Fixes #2777.

When a host↔device data transfer is linearized, its array section size is emitted as a product of the Function's per-dimension sizes, for example:

#pragma acc enter data copyin(u[0:u_vec->size[0]*u_vec->size[1]*u_vec->size[2]*u_vec->size[3]])

The u_vec->size[i] fields are 32-bit C ints, so the product size[0]*size[1]*size[2]*size[3] is evaluated in 32-bit arithmetic. For a Function with more than ~2**31 elements (e.g. the reporter's 1295**3 ≈ 2.17e9 points, ~24.5 GB) the product overflows int before it is used as the transfer bound, producing a bogus size (the reporter saw 18446744065653020036) and a corrupt / failed device transfer — independent of index-mode=int64/linearize=True, because those control the kernel index type, not the type of the transfer-clause arithmetic.

As @mloubout noted on the issue, the fix is to perform the size multiplication in 64-bit. This casts each factor of a product section bound to a 64-bit integer:

#pragma acc enter data copyin(u[0:(long)(u_vec->size[0])*(long)(u_vec->size[1])*(long)(u_vec->size[2])*(long)(u_vec->size[3])])

Casting the whole product ((long)(a*b*c)) would be too late — the overflow would already have happened in 32-bit — so each factor is cast individually, which forces every multiplication to be 64-bit regardless of operand ordering.

The change lives in PragmaTransfer._generate, so it is scoped to host-device transfer clauses only. Non-product bounds (a single dimension size, an offset, a constant) cannot overflow and are left untouched, addressing the concern that there is "no reason to use long for all of those". Non-transfer expressions (e.g. free-space guards, TMA descriptors) are unaffected.

Reproduction

from devito import Eq, Grid, Operator, TimeFunction

grid = Grid(shape=(4, 5, 6))
u = TimeFunction(name='u', grid=grid)
op = Operator(Eq(u.forward, u + 1), platform='nvidiaX', language='openacc',
              opt=('advanced', {'linearize': True}))
print(op.body.maps[0].ccode.value)

Before:

acc enter data copyin(u[0:u_vec->size[0]*u_vec->size[1]*u_vec->size[2]*u_vec->size[3]])

After:

acc enter data copyin(u[0:(long)(u_vec->size[0])*(long)(u_vec->size[1])*(long)(u_vec->size[2])*(long)(u_vec->size[3])])

Verification

  • The fix applies to both backends (openacc copyin/copyout/delete and openmp map(to:/release:)) and to 2D/3D Functions.
  • The non-linearized transfer path (separate per-dimension sections [0:s0][0:s1]...) is unchanged — there is no product there, hence no overflow.
  • Added TestPassesOptional::test_linearize_transfer_no_overflow asserting each size[i] factor of a linearized transfer is cast to long and that no bare 32-bit product remains.
  • Updated the existing test_gpu_openmp.py expectations (test_basic, test_multiple_eqns) whose OpenMP transfers use the flattened product form.
  • Host (CPU) operators, including linearize=True, build and run unchanged (no device transfers emitted). flake8 clean on the changed files.

Note: the GPU test modules are skipif(['nodevice']), so the codegen assertions run on the GPU CI runners. They were validated locally by forcing platform='nvidiaX'.

When a host-device data transfer is linearized, its array section size is
emitted as a product of the Function's per-dimension sizes, e.g.
`copyin(u[0:u_vec->size[0]*u_vec->size[1]*u_vec->size[2]*u_vec->size[3]])`.
The `size[i]` fields are 32-bit C ints, so for a Function with more than
~2**31 elements the product overflows `int` before it is used as the
transfer bound, yielding a bogus size and a corrupt/failed device transfer.

Cast each factor of the product to a 64-bit integer so the multiplication is
carried out in 64-bit arithmetic. Casting the whole product would be too late
(the overflow would already have occurred), so each factor is cast
individually. Non-product bounds (a single size, an offset, a constant) cannot
overflow and are left untouched, as are non-transfer expressions.

Fixes devitocodes#2777
Comment thread devito/passes/iet/parpragma.py Outdated
return self._make_parallel(graph, sync_mapper=graph.sync_mapper)


def _avoid_overflow(expr):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's an as_long helper in symbolics/manipulation.py that you should use instead of this function (which, btw, wouldn't even logically belong to this module -- besides its implementation being quite hacky)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Done — switched to as_long and dropped _avoid_overflow.

One thing worth flagging: as_long as written was a no-op here. The transfer size factors (vec->size[i]) are IndexedPointers, but as_long substituted only what retrieve_symbols returns (plain Symbols), so the product came out un-cast and the overflow remained. I changed as_long to use retrieve_terminals instead, so Indexed/IndexedPointer leaves are cast too (single-Symbol callers are unaffected, and numeric literals aren't touched). I kept the call scoped to Mul products in PragmaTransfer._generate so non-linearized multi-dimensional sections aren't needlessly upcast. test_symbolics/test_gpu_common green locally; the OpenMP exact-string expectations were updated for the new factor order.

Address review: replace the ad-hoc _avoid_overflow helper with the existing
as_long. as_long only substituted plain Symbols (retrieve_symbols), so it was
a no-op on the IndexedPointer size factors (vec->size[i]) of a linearized
transfer bound; extend it to retrieve_terminals so Indexed/IndexedPointer
leaves are cast too. Keep the cast scoped to Mul products in PragmaTransfer so
non-linearized multi-dimensional sections are not needlessly upcast.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Overflow in _C_make_dataobj due to c_int type

3 participants