
GH-36388: [C++][Python] Return error from MakeArrayFromScalar on offset overflow#50024

Open
Sriniketh24 wants to merge 1 commit into apache:main from Sriniketh24:fix/repeat-offset-overflow


Sriniketh24 commented May 23, 2026

Rationale

pyarrow.repeat (backed by MakeArrayFromScalar in C++) silently created an invalid array with negative offsets when the total data size (value_size * repetition_count) exceeded INT32_MAX for 32-bit offset types (StringType, BinaryType). The array was created without error but later failed validation with a cryptic "Negative offsets in binary array" or "non-monotonic offset" message.
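To make the failure mode concrete, here is a minimal pure-Python model of a running offset kept in a 32-bit signed integer. It is only an illustration, not Arrow's code: the helper name `wrapped_int32_offsets` is hypothetical, and `ctypes.c_int32` stands in for two's-complement wraparound of the offset type.

```python
import ctypes


def wrapped_int32_offsets(value_size, count):
    """Model offsets accumulated in a 32-bit signed integer (hypothetical helper)."""
    offsets = []
    running = 0
    for _ in range(count + 1):
        # c_int32 truncates to 32 bits, mimicking two's-complement wraparound.
        offsets.append(ctypes.c_int32(running).value)
        running += value_size
    return offsets


# Three repetitions of a 1 GiB value: the true total (3 GiB) exceeds INT32_MAX.
offsets = wrapped_int32_offsets(2**30, 3)
print(offsets)
```

In this model the third offset is already negative and smaller than the one before it, which is the kind of array that the "Negative offsets" and "non-monotonic offset" validation messages describe.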

What changed

Added an early overflow check in RepeatedArrayFactory::CreateOffsetsBuffer that computes the total data size in int64_t and returns Status::Invalid with an actionable error message when it would exceed the offset type's maximum. The error message suggests using large_* types (e.g. large_string, large_binary) for data exceeding 2 GB.
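The idea behind the check can be sketched in a few lines of Python. This is a model under stated assumptions rather than the actual C++ change: `checked_total_size` and its message wording are hypothetical, and `ValueError` stands in for `Status::Invalid` (pyarrow's `ArrowInvalid` is a `ValueError` subclass).

```python
INT32_MAX = 2**31 - 1  # largest offset for string / binary
INT64_MAX = 2**63 - 1  # largest offset for large_string / large_binary


def checked_total_size(value_size, count, offset_max=INT32_MAX):
    """Return value_size * count, or raise if it does not fit the offset type.

    Hypothetical helper: a model of the early check, not Arrow's code.
    """
    # Python ints are arbitrary precision, so this product cannot wrap.
    total = value_size * count
    if total > offset_max:
        raise ValueError(
            f"repeated data size {total} exceeds the maximum offset {offset_max}; "
            "consider a large_* type such as large_string or large_binary"
        )
    return total


# 2 GiB of data does not fit in 32-bit offsets...
try:
    checked_total_size(2**30, 2)
    fits_32 = True
except ValueError:
    fits_32 = False

# ...but the same request fits in the 64-bit offsets of the large_* types.
total_64 = checked_total_size(2**30, 2, offset_max=INT64_MAX)
```

Doing the comparison once, before any offsets are written, means the caller gets an error up front instead of a buffer that only fails later under validation.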

Are these changes tested?

Yes.

  • C++ test: TestMakeArrayFromScalarOffsetOverflow in array_test.cc — tests string, binary, and large_string scalars
  • Python test: test_repeat_offset_overflow in test_array.py — verifies pa.repeat raises ArrowInvalid on overflow

Are there any user-facing changes?

Yes. MakeArrayFromScalar now returns Status::Invalid (raised as ArrowInvalid by pyarrow.repeat) with a clear error message instead of silently returning a corrupt array. The failure surfaces at creation time rather than at a later validation step.

Closes: #36388


This is AI-assisted work by Claude.

…n offset overflow

MakeArrayFromScalar silently created an invalid array with negative
offsets when the total data size (value_size * repetition_count)
exceeded the maximum value of the offset type. For 32-bit offset types
like StringType and BinaryType, this threshold is INT32_MAX (~2 GB).

The root cause was in CreateOffsetsBuffer where the running offset
accumulated via OffsetType addition without checking for overflow,
wrapping around to negative values.

Added an early overflow check in CreateOffsetsBuffer that computes the
total size in int64_t and compares against the offset type's maximum.
On overflow, a Status::Invalid error is returned with a message
suggesting the use of large_* types.

This is AI-assisted work by Claude.
Development

Successfully merging this pull request may close these issues.

[Python][C++] pyarrow.repeat returns an invalid array when a chunked array is required.
