[fix](fe) Reject lone UTF-16 surrogates in JSONB literals (RFC 8259 §8.2)#63255
[fix](fe) Reject lone UTF-16 surrogates in JSONB literals (RFC 8259 §8.2)#63255morrySnow wants to merge 2 commits into
Conversation
…8.2) ### What problem does this PR solve? Issue Number: close #DORIS-25576 Problem Summary: Nereids `JsonLiteral` and legacy `analysis.JsonLiteral` silently accepted lone UTF-16 surrogates (e.g. `"\uD800"`) in JSONB literal values. Jackson and Gson both parse such inputs without error by default, but RFC 8259 §8.2 forbids unpaired surrogates in JSON strings. Silent acceptance causes data-correctness issues: the invalid value is stored in BE and only surfaces as an error during export, cross-system transfer, or UTF-8 serialization. Fix: add a recursive `validateNoLoneSurrogate` post-parse walk in both `JsonLiteral` constructors that throws `AnalysisException` for any string node containing a lone high or low surrogate. ### Release note JSONB literal expressions now reject strings containing lone UTF-16 surrogates (e.g. `'"\uD800"'::JSONB`) with an AnalysisException, conforming to RFC 8259 §8.2. ### Check List (For Author) - Test: Unit Test (JsonLiteralTest — lone-surrogate rejection + valid surrogate-pair acceptance) - Behavior changed: Yes — lone surrogates in JSONB literals now throw AnalysisException instead of being silently accepted - Does this need documentation: No Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
run buildall |
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
The PR addresses the intended JSONB literal validation path and keeps the change focused, but it misses one JSON string location that can still carry invalid UTF-16.
Critical checkpoint conclusions:
- Goal/test coverage: The goal is to reject lone UTF-16 surrogates in JSONB literals. Tests cover string values in arrays/objects and valid pairs, but object member names are not covered and still bypass validation.
- Scope/clarity: The JSON literal changes are small and localized.
- Concurrency/lifecycle/config/compatibility: No new concurrency, lifecycle, configuration, persistence, or wire/storage format compatibility concerns found.
- Parallel paths: Both Nereids and legacy literal constructors were updated, but the same field-name gap applies to both parsers.
- Error handling: Parse errors and validation failures are surfaced as AnalysisException; the missing field-name validation is the blocking correctness issue.
- Tests/results: Unit tests were added for Nereids value strings only; legacy constructor and object key cases are not covered. No test execution was performed in this review.
- Observability/performance: No additional observability is needed; the recursive validation cost is acceptable for literal analysis.
User focus: No additional user-provided focus points were present.
TPC-H: Total hot run time: 29782 ms |
TPC-DS: Total hot run time: 170130 ms |
Extend the RFC 8259 §8.2 surrogate validation to cover object field
names in addition to string values. Previously only string values were
walked; a JSON literal like {"\uD800": "value"} could bypass the check.
Both the Nereids (Jackson) and legacy analysis (Gson) paths now call a
shared validateNoLoneSurrogateInString helper for every field name and
every string value in the JSON tree.
Also add unit tests for lone surrogates in object keys.
Fixes review feedback on apache#63255.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Thanks for the review! Fixed in the latest commit:
|
|
run buildall |
|
/review |
There was a problem hiding this comment.
Review result: no additional blocking issues found in the current PR head.
Critical checkpoint conclusions:
- Goal and proof: The PR aims to reject JSONB literals containing lone UTF-16 surrogates in both Nereids/Jackson and legacy Gson literal paths. The current implementation recursively validates string values and object field names, addressing the previously raised object-key gap. A Nereids unit test covers top-level strings, arrays, object values, object keys, low/high lone surrogates, and valid surrogate pairs.
- Scope: The change is small and focused on post-parse validation in the two JsonLiteral constructors plus targeted tests.
- Concurrency/lifecycle: No new shared mutable state, locks, threads, or non-trivial lifecycle behavior are introduced. Existing static ObjectMapper usage is unchanged.
- Configuration/compatibility: No new configuration, storage format, wire protocol, or persisted metadata compatibility concern is introduced. Behavior intentionally changes invalid literal acceptance to analysis failure.
- Parallel paths: Both relevant FE JSON literal paths are updated. The prior inline review concern about object field names is already fixed in the current head and was not duplicated.
- Error handling: Parser syntax errors and surrogate validation errors are surfaced as AnalysisException consistently with surrounding code.
- Tests: Added Nereids unit coverage for the changed behavior, including object keys. I attempted to run
./run-fe-ut.sh org.apache.doris.nereids.trees.expressions.literal.JsonLiteralTest, but the runner is missingthirdparty/installed/bin/protoc, so the test could not start. - Observability/performance: No new observability appears necessary for constructor-time literal validation. The recursive tree walk is proportional to literal size and not a hot-path concern beyond parsing itself.
- Transaction/data correctness: The change improves data correctness by rejecting invalid JSON strings before they can be stored; no transaction or visibility-version path is modified.
User focus: no additional user-provided review focus was specified.
TPC-H: Total hot run time: 31388 ms |
TPC-DS: Total hot run time: 169617 ms |
FE Regression Coverage ReportIncrement line coverage |
Summary
Problem fixed:
JsonLiteral(Nereids/Jackson path) andanalysis.JsonLiteral(legacy/Gson path) silently accepted lone UTF-16 surrogates (e.g.'"\uD800"'::JSONB) as valid JSONB literals. RFC 8259 §8.2 explicitly forbids unpaired surrogates in JSON strings because they cannot be represented as valid UTF-8.How it was fixed: Added a recursive
validateNoLoneSurrogatepost-parse check in bothJsonLiteralconstructors. After Jackson/Gson parses the JSON tree, the method walks all string nodes and immediately throwsAnalysisExceptionon any lone high or low surrogate.What problem does this PR solve?
Issue Number: close #DORIS-25576
Before this fix: Passing a lone surrogate like
'"\uD800"'::JSONBwas silently accepted at the FE layer. The invalid value would be stored in the BE JSONB column. The error would only surface later — duringEXPORT,SELECT INTO OUTFILE, or cross-system transfer — making it hard to diagnose. This is a data-correctness (SEV-2) issue.After this fix: Constructing a
JsonLiteralwith a lone surrogate immediately throwsAnalysisException: Invalid jsonb literal: JSON string contains lone high surrogate(orlone low surrogate), giving the user a clear error at write time.Behavior change
'"\uD800"'::JSONBINSERT INTO t VALUES (1, '"\uD800"')'"\uD83D\uDE00"'::JSONB(valid pair 😀)'"hello"'::JSONB(plain ASCII)Why both paths?
Doris has two
JsonLiteralimplementations:fe-core): uses JacksonObjectMapper.readTree— Jackson accepts lone surrogates by defaultfe-catalog,analysis): uses GsonJsonParser.parse— Gson also accepts lone surrogates by defaultBoth needed the same fix to ensure consistent behavior regardless of which query path is used.
Release note
JSONB literal expressions now reject strings containing lone UTF-16 surrogates (e.g.
'"\uD800"'::JSONB) with anAnalysisExceptionat parse time, conforming to RFC 8259 §8.2. Previously such literals were silently accepted, which could cause errors during export or cross-system data transfer.Check List (For Author)
JsonLiteralTest— 10 test cases covering lone-high, lone-low, nested in object/array, valid surrogate pairs, plain strings)AnalysisExceptioninstead of being silently accepted