Skip to content

[fix](fe) Reject lone UTF-16 surrogates in JSONB literals (RFC 8259 §8.2)#63255

Open
morrySnow wants to merge 2 commits into
apache:masterfrom
morrySnow:fix-25576
Open

[fix](fe) Reject lone UTF-16 surrogates in JSONB literals (RFC 8259 §8.2)#63255
morrySnow wants to merge 2 commits into
apache:masterfrom
morrySnow:fix-25576

Conversation

@morrySnow
Copy link
Copy Markdown
Contributor

@morrySnow morrySnow commented May 14, 2026

Summary

Problem fixed: JsonLiteral (Nereids/Jackson path) and analysis.JsonLiteral (legacy/Gson path) silently accepted lone UTF-16 surrogates (e.g. '"\uD800"'::JSONB) as valid JSONB literals. RFC 8259 §8.2 explicitly forbids unpaired surrogates in JSON strings because they cannot be represented as valid UTF-8.

How it was fixed: Added a recursive validateNoLoneSurrogate post-parse check in both JsonLiteral constructors. After Jackson/Gson parses the JSON tree, the method walks all string nodes and immediately throws AnalysisException on any lone high or low surrogate.

What problem does this PR solve?

Issue Number: close #DORIS-25576

Before this fix: Passing a lone surrogate like '"\uD800"'::JSONB was silently accepted at the FE layer. The invalid value would be stored in the BE JSONB column. The error would only surface later — during EXPORT, SELECT INTO OUTFILE, or cross-system transfer — making it hard to diagnose. This is a data-correctness (SEV-2) issue.

After this fix: Constructing a JsonLiteral with a lone surrogate immediately throws AnalysisException: Invalid jsonb literal: JSON string contains lone high surrogate (or lone low surrogate), giving the user a clear error at write time.

Behavior change

Scenario Before After
'"\uD800"'::JSONB Accepted silently AnalysisException at parse time
INSERT INTO t VALUES (1, '"\uD800"') Stored in BE, may fail on export AnalysisException at FE
'"\uD83D\uDE00"'::JSONB (valid pair 😀) Accepted Still accepted (no change)
'"hello"'::JSONB (plain ASCII) Accepted Still accepted (no change)

Why both paths?

Doris has two JsonLiteral implementations:

  • Nereids (fe-core): uses Jackson ObjectMapper.readTree — Jackson accepts lone surrogates by default
  • Legacy (fe-catalog, analysis): uses Gson JsonParser.parse — Gson also accepts lone surrogates by default

Both needed the same fix to ensure consistent behavior regardless of which query path is used.

Release note

JSONB literal expressions now reject strings containing lone UTF-16 surrogates (e.g. '"\uD800"'::JSONB) with an AnalysisException at parse time, conforming to RFC 8259 §8.2. Previously such literals were silently accepted, which could cause errors during export or cross-system data transfer.

Check List (For Author)

  • Test: Unit Test (JsonLiteralTest — 10 test cases covering lone-high, lone-low, nested in object/array, valid surrogate pairs, plain strings)
  • Behavior changed: Yes — lone surrogates in JSONB literals now throw AnalysisException instead of being silently accepted
  • Does this need documentation: No

…8.2)

### What problem does this PR solve?

Issue Number: close #DORIS-25576

Problem Summary: Nereids `JsonLiteral` and legacy `analysis.JsonLiteral` silently
accepted lone UTF-16 surrogates (e.g. `"\uD800"`) in JSONB literal values.
Jackson and Gson both parse such inputs without error by default, but RFC 8259 §8.2
forbids unpaired surrogates in JSON strings. Silent acceptance causes data-correctness
issues: the invalid value is stored in BE and only surfaces as an error during export,
cross-system transfer, or UTF-8 serialization.

Fix: add a recursive `validateNoLoneSurrogate` post-parse walk in both
`JsonLiteral` constructors that throws `AnalysisException` for any string node
containing a lone high or low surrogate.

### Release note

JSONB literal expressions now reject strings containing lone UTF-16 surrogates
(e.g. `'"\uD800"'::JSONB`) with an AnalysisException, conforming to RFC 8259 §8.2.

### Check List (For Author)

- Test: Unit Test (JsonLiteralTest — lone-surrogate rejection + valid surrogate-pair acceptance)
- Behavior changed: Yes — lone surrogates in JSONB literals now throw AnalysisException instead of being silently accepted
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@morrySnow
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@morrySnow
Copy link
Copy Markdown
Contributor Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR addresses the intended JSONB literal validation path and keeps the change focused, but it misses one JSON string location that can still carry invalid UTF-16.

Critical checkpoint conclusions:

  • Goal/test coverage: The goal is to reject lone UTF-16 surrogates in JSONB literals. Tests cover string values in arrays/objects and valid pairs, but object member names are not covered and still bypass validation.
  • Scope/clarity: The JSON literal changes are small and localized.
  • Concurrency/lifecycle/config/compatibility: No new concurrency, lifecycle, configuration, persistence, or wire/storage format compatibility concerns found.
  • Parallel paths: Both Nereids and legacy literal constructors were updated, but the same field-name gap applies to both parsers.
  • Error handling: Parse errors and validation failures are surfaced as AnalysisException; the missing field-name validation is the blocking correctness issue.
  • Tests/results: Unit tests were added for Nereids value strings only; legacy constructor and object key cases are not covered. No test execution was performed in this review.
  • Observability/performance: No additional observability is needed; the recursive validation cost is acceptable for literal analysis.

User focus: No additional user-provided focus points were present.

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29782 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b3b378885ce45857c8b015ef5aa45977485d899b, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17649	3999	3957	3957
q2	q3	10733	877	617	617
q4	4659	467	362	362
q5	7443	1324	1163	1163
q6	190	173	146	146
q7	932	964	744	744
q8	9302	1441	1275	1275
q9	5607	5436	5384	5384
q10	6249	2104	1815	1815
q11	482	270	260	260
q12	629	432	304	304
q13	18044	3415	2781	2781
q14	293	285	268	268
q15	q16	909	884	802	802
q17	998	1042	759	759
q18	6486	5657	5632	5632
q19	1315	1280	1030	1030
q20	519	403	258	258
q21	4907	2326	1917	1917
q22	434	356	308	308
Total cold run time: 97780 ms
Total hot run time: 29782 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4282	4222	4168	4168
q2	q3	4646	4778	4177	4177
q4	2149	2199	1392	1392
q5	4957	5002	5277	5002
q6	192	171	137	137
q7	2032	1976	1729	1729
q8	3641	3222	3324	3222
q9	8573	8512	8512	8512
q10	4520	4563	4286	4286
q11	632	457	417	417
q12	715	753	533	533
q13	3500	3585	2916	2916
q14	307	313	282	282
q15	q16	774	814	703	703
q17	1377	1395	1335	1335
q18	8101	7203	7215	7203
q19	1174	1168	1181	1168
q20	2303	2259	1970	1970
q21	6269	5628	5095	5095
q22	556	509	424	424
Total cold run time: 60700 ms
Total hot run time: 54671 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 170130 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b3b378885ce45857c8b015ef5aa45977485d899b, data reload: false

query5	4300	654	518	518
query6	344	217	198	198
query7	4298	590	287	287
query8	319	230	216	216
query9	8830	4032	4022	4022
query10	489	344	307	307
query11	6040	2476	2271	2271
query12	201	129	128	128
query13	1314	601	433	433
query14	6805	5359	5040	5040
query14_1	4369	4382	4362	4362
query15	216	206	186	186
query16	1016	459	442	442
query17	1398	774	647	647
query18	2728	494	370	370
query19	342	209	167	167
query20	151	136	131	131
query21	223	143	125	125
query22	13642	13992	14489	13992
query23	17249	16555	16157	16157
query23_1	16369	16253	16349	16253
query24	7450	1765	1394	1394
query24_1	1355	1318	1349	1318
query25	560	487	427	427
query26	1308	312	163	163
query27	2651	563	341	341
query28	4259	1963	1925	1925
query29	976	640	502	502
query30	307	232	196	196
query31	1112	1051	929	929
query32	88	71	72	71
query33	552	345	278	278
query34	1178	1104	631	631
query35	752	773	660	660
query36	1277	1322	1191	1191
query37	144	104	87	87
query38	3164	3134	3065	3065
query39	926	934	900	900
query39_1	895	884	866	866
query40	237	153	135	135
query41	64	62	60	60
query42	110	105	105	105
query43	325	326	291	291
query44	
query45	209	202	187	187
query46	1053	1202	745	745
query47	2347	2270	2246	2246
query48	404	425	290	290
query49	639	528	434	434
query50	708	291	219	219
query51	4378	4308	4242	4242
query52	105	104	92	92
query53	248	270	206	206
query54	313	266	247	247
query55	93	86	91	86
query56	310	301	301	301
query57	1466	1383	1289	1289
query58	293	288	262	262
query59	1539	1643	1374	1374
query60	335	333	325	325
query61	158	151	154	151
query62	668	618	549	549
query63	243	204	218	204
query64	2317	808	686	686
query65	
query66	1709	516	383	383
query67	30076	30075	29219	29219
query68	
query69	462	348	307	307
query70	1002	1038	961	961
query71	305	279	265	265
query72	2934	2689	2466	2466
query73	866	731	440	440
query74	5061	4862	4750	4750
query75	2780	2688	2339	2339
query76	2288	1121	748	748
query77	416	426	344	344
query78	13004	12966	12331	12331
query79	1588	952	739	739
query80	1331	601	494	494
query81	511	285	239	239
query82	1330	166	124	124
query83	362	286	248	248
query84	256	136	114	114
query85	901	516	448	448
query86	434	347	303	303
query87	3426	3369	3229	3229
query88	3527	2685	2658	2658
query89	437	385	333	333
query90	1786	184	181	181
query91	180	167	144	144
query92	78	75	71	71
query93	957	950	555	555
query94	690	349	292	292
query95	679	472	336	336
query96	1073	794	340	340
query97	2720	2715	2603	2603
query98	243	234	226	226
query99	1083	1110	980	980
Total cold run time: 255618 ms
Total hot run time: 170130 ms

Extend the RFC 8259 §8.2 surrogate validation to cover object field
names in addition to string values. Previously only string values were
walked; a JSON literal like {"\uD800": "value"} could bypass the check.

Both the Nereids (Jackson) and legacy analysis (Gson) paths now call a
shared validateNoLoneSurrogateInString helper for every field name and
every string value in the JSON tree.

Also add unit tests for lone surrogates in object keys.

Fixes review feedback on apache#63255.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@morrySnow
Copy link
Copy Markdown
Contributor Author

Thanks for the review! Fixed in the latest commit:

  • Refactored validateNoLoneSurrogate to handle object nodes explicitly via node.fields() (Jackson) / entrySet() (Gson), so both field names and values are checked.
  • Extracted a shared validateNoLoneSurrogateInString helper to avoid duplication.
  • Added unit tests for lone surrogates in object keys (testLoneHighSurrogateInObjectKey, testLoneLowSurrogateInObjectKey, testValidSurrogatePairInObjectKey).

@morrySnow
Copy link
Copy Markdown
Contributor Author

run buildall

@morrySnow
Copy link
Copy Markdown
Contributor Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review result: no additional blocking issues found in the current PR head.

Critical checkpoint conclusions:

  • Goal and proof: The PR aims to reject JSONB literals containing lone UTF-16 surrogates in both Nereids/Jackson and legacy Gson literal paths. The current implementation recursively validates string values and object field names, addressing the previously raised object-key gap. A Nereids unit test covers top-level strings, arrays, object values, object keys, low/high lone surrogates, and valid surrogate pairs.
  • Scope: The change is small and focused on post-parse validation in the two JsonLiteral constructors plus targeted tests.
  • Concurrency/lifecycle: No new shared mutable state, locks, threads, or non-trivial lifecycle behavior are introduced. Existing static ObjectMapper usage is unchanged.
  • Configuration/compatibility: No new configuration, storage format, wire protocol, or persisted metadata compatibility concern is introduced. Behavior intentionally changes invalid literal acceptance to analysis failure.
  • Parallel paths: Both relevant FE JSON literal paths are updated. The prior inline review concern about object field names is already fixed in the current head and was not duplicated.
  • Error handling: Parser syntax errors and surrogate validation errors are surfaced as AnalysisException consistently with surrounding code.
  • Tests: Added Nereids unit coverage for the changed behavior, including object keys. I attempted to run ./run-fe-ut.sh org.apache.doris.nereids.trees.expressions.literal.JsonLiteralTest, but the runner is missing thirdparty/installed/bin/protoc, so the test could not start.
  • Observability/performance: No new observability appears necessary for constructor-time literal validation. The recursive tree walk is proportional to literal size and not a hot-path concern beyond parsing itself.
  • Transaction/data correctness: The change improves data correctness by rejecting invalid JSON strings before they can be stored; no transaction or visibility-version path is modified.

User focus: no additional user-provided review focus was specified.

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 31388 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit db190b7a97eec80b806e8ff3b5d47c68d7b56781, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17703	3941	3878	3878
q2	q3	10832	1376	799	799
q4	4685	473	343	343
q5	7570	2311	2136	2136
q6	242	185	142	142
q7	933	759	628	628
q8	9356	1644	1661	1644
q9	5120	4908	4905	4905
q10	6375	2081	1760	1760
q11	445	274	254	254
q12	630	422	302	302
q13	18122	3330	2828	2828
q14	263	252	233	233
q15	q16	822	766	703	703
q17	970	918	981	918
q18	6815	5671	5456	5456
q19	1316	1258	1075	1075
q20	509	397	265	265
q21	6684	2922	2802	2802
q22	454	379	317	317
Total cold run time: 99846 ms
Total hot run time: 31388 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4808	4496	4617	4496
q2	q3	4915	5293	4654	4654
q4	2151	2219	1426	1426
q5	4865	4635	4654	4635
q6	239	190	133	133
q7	1923	1808	1582	1582
q8	2374	2078	2071	2071
q9	7715	7212	7174	7174
q10	4446	4400	3925	3925
q11	536	378	345	345
q12	710	729	520	520
q13	3047	3384	2833	2833
q14	267	280	248	248
q15	q16	679	710	628	628
q17	1264	1261	1239	1239
q18	7116	6899	6636	6636
q19	1103	1087	1082	1082
q20	2211	2208	1960	1960
q21	5471	4743	4617	4617
q22	522	459	404	404
Total cold run time: 56362 ms
Total hot run time: 50608 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 169617 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit db190b7a97eec80b806e8ff3b5d47c68d7b56781, data reload: false

query5	4311	642	521	521
query6	319	224	203	203
query7	4248	592	294	294
query8	338	241	226	226
query9	8842	4037	3982	3982
query10	450	340	300	300
query11	5810	2351	2231	2231
query12	178	128	124	124
query13	1281	616	438	438
query14	5989	5358	5074	5074
query14_1	4342	4360	4350	4350
query15	206	219	186	186
query16	1065	436	424	424
query17	1143	710	584	584
query18	2709	484	344	344
query19	211	195	158	158
query20	140	131	130	130
query21	219	141	115	115
query22	13659	13497	13378	13378
query23	17137	16368	16003	16003
query23_1	16167	16133	16139	16133
query24	7542	1765	1281	1281
query24_1	1295	1301	1286	1286
query25	572	477	403	403
query26	1310	328	169	169
query27	2692	556	332	332
query28	4402	1953	1953	1953
query29	1028	647	490	490
query30	301	238	198	198
query31	1110	1062	952	952
query32	91	74	74	74
query33	530	346	292	292
query34	1154	1141	654	654
query35	771	791	674	674
query36	1311	1346	1129	1129
query37	152	106	94	94
query38	3223	3144	3030	3030
query39	940	917	888	888
query39_1	890	873	874	873
query40	236	146	128	128
query41	71	64	64	64
query42	111	112	116	112
query43	321	322	289	289
query44	
query45	213	200	201	200
query46	1094	1232	727	727
query47	2323	2367	2200	2200
query48	395	416	287	287
query49	635	499	385	385
query50	998	365	252	252
query51	4273	4335	4229	4229
query52	111	107	97	97
query53	268	287	206	206
query54	312	290	253	253
query55	94	92	85	85
query56	292	303	295	295
query57	1397	1391	1276	1276
query58	302	281	272	272
query59	1588	1701	1379	1379
query60	322	325	316	316
query61	155	160	162	160
query62	681	650	558	558
query63	241	194	208	194
query64	2410	822	649	649
query65	
query66	1675	478	355	355
query67	30041	30028	29745	29745
query68	
query69	463	336	306	306
query70	1015	1007	1028	1007
query71	307	280	266	266
query72	3028	2719	2516	2516
query73	866	783	452	452
query74	5068	4908	4726	4726
query75	2703	2618	2281	2281
query76	2303	1145	755	755
query77	418	430	353	353
query78	12100	11965	11655	11655
query79	1227	1010	695	695
query80	638	582	502	502
query81	453	293	254	254
query82	248	165	125	125
query83	283	300	270	270
query84	269	151	113	113
query85	955	619	528	528
query86	355	337	326	326
query87	3402	3388	3212	3212
query88	3486	2696	2620	2620
query89	430	381	346	346
query90	2189	182	181	181
query91	179	167	139	139
query92	78	75	75	75
query93	1423	1463	923	923
query94	534	344	263	263
query95	663	464	344	344
query96	1051	758	335	335
query97	2701	2733	2588	2588
query98	244	226	234	226
query99	1090	1100	972	972
Total cold run time: 251731 ms
Total hot run time: 169617 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/22) 🎉
Increment coverage report
Complete coverage report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants