GH-36388: [C++][Python] Return error from MakeArrayFromScalar on offset overflow#50024
Open
Sriniketh24 wants to merge 1 commit into
…n offset overflow

MakeArrayFromScalar silently created an invalid array with negative offsets when the total data size (value_size * repetition_count) exceeded the maximum value of the offset type. For 32-bit offset types like StringType and BinaryType, this threshold is INT32_MAX (~2 GB).

The root cause was in CreateOffsetsBuffer, where the running offset accumulated via OffsetType addition without checking for overflow, wrapping around to negative values.

Added an early overflow check in CreateOffsetsBuffer that computes the total size in int64_t and compares it against the offset type's maximum. On overflow, a Status::Invalid error is returned with a message suggesting the use of large_* types.

This is AI-assisted work by Claude.
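The failure mode and the early check described in the commit message can be illustrated with a small standalone sketch. This is not the Arrow implementation: the function names `naive_offsets_int32` and `checked_total_size` are hypothetical, `ctypes.c_int32` is used only to mimic 32-bit wraparound, and Python's arbitrary-precision integers stand in for the `int64_t` arithmetic mentioned above.

```python
import ctypes

INT32_MAX = 2**31 - 1


def naive_offsets_int32(value_size, length):
    """Build length + 1 offsets with unchecked 32-bit accumulation (hypothetical helper)."""
    offsets = []
    running = 0
    for _ in range(length + 1):
        offsets.append(running)
        # c_int32 truncates to 32 bits without raising, mimicking wraparound.
        running = ctypes.c_int32(running + value_size).value
    return offsets


def checked_total_size(value_size, length, offset_max=INT32_MAX):
    """Compute the total data size in wide arithmetic and reject it up front if too large."""
    total = value_size * length  # Python ints do not overflow; stands in for int64_t
    if total > offset_max:
        raise ValueError(
            f"total data size {total} exceeds the offset maximum {offset_max}; "
            "consider a large_* type"
        )
    return total


if __name__ == "__main__":
    # Three 1 GiB values: the third and fourth offsets wrap to negative values.
    print(naive_offsets_int32(2**30, 3))
    try:
        checked_total_size(2**30, 3)
    except ValueError as exc:
        print("rejected:", exc)
```

The point of checking before building the buffer is that the error surfaces at creation time, rather than later as a validation failure on an already-constructed array.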
Rationale
pyarrow.repeat (backed by MakeArrayFromScalar in C++) silently created an invalid array with negative offsets when the total data size (value_size * repetition_count) exceeded INT32_MAX for 32-bit offset types (StringType, BinaryType). The resulting array passed creation without error but failed validation with a cryptic "Negative offsets in binary array" or "non-monotonic offset" message.

What changed
Added an early overflow check in RepeatedArrayFactory::CreateOffsetsBuffer that computes the total data size in int64_t and returns Status::Invalid with an actionable error message when it would exceed the offset type's maximum. The error message suggests using large_* types (e.g. large_string, large_binary) for data exceeding 2 GB.

Are these changes tested?
Yes.
- TestMakeArrayFromScalarOffsetOverflow in array_test.cc: tests string, binary, and large_string scalars
- test_repeat_offset_overflow in test_array.py: verifies pa.repeat raises ArrowInvalid on overflow

Are there any user-facing changes?
Yes. MakeArrayFromScalar (and pyarrow.repeat) now raises ArrowInvalid early with a clear error message instead of silently returning a corrupt array. This is a strictly better user experience.

Closes: #36388
This is AI-assisted work by Claude.
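For callers, the error message's suggestion to use large_* types can also be applied up front. The sketch below is only an illustration of that idea: `LARGE_VARIANT` and `type_name_for_repeat` are hypothetical names, not part of pyarrow, and the helper works on plain type-name strings so that it runs without Arrow installed.

```python
INT32_MAX = 2**31 - 1

# Hypothetical mapping from 32-bit-offset type names to 64-bit-offset counterparts.
LARGE_VARIANT = {"string": "large_string", "binary": "large_binary"}


def type_name_for_repeat(type_name, value, count):
    """Return type_name, or its large_* variant if repeating value would exceed INT32_MAX bytes."""
    total_bytes = len(value) * count
    if type_name in LARGE_VARIANT and total_bytes > INT32_MAX:
        return LARGE_VARIANT[type_name]
    return type_name


if __name__ == "__main__":
    payload = b"x" * 1024
    print(type_name_for_repeat("string", payload, 1024))   # 1 MiB in total
    print(type_name_for_repeat("string", payload, 2**21))  # 2 GiB in total
```

In pyarrow itself, a caller could presumably act on the result by building the scalar with an explicit type, for example `pa.scalar(value, type=pa.large_string())`, before passing it to `pa.repeat`.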