fix: handle ujson surrogate parse exceptions in JSON imports#1046
Open
He-Pin wants to merge 4 commits into
Open
fix: handle ujson surrogate parse exceptions in JSON imports#1046He-Pin wants to merge 4 commits into
He-Pin wants to merge 4 commits into
Conversation
22fe76f to
7ca9536
Compare
He-Pin
added a commit
to He-Pin/sjsonnet
that referenced
this pull request
Jun 25, 2026
Motivation: JSON imports and std.parseJson relied on ujson behavior for Unicode surrogate escapes. ujson can throw plain Exception for some malformed escapes and silently drop a lone high surrogate, so catching an exception message only fixed one symptom. Modification: Add a JsonUnicodeValidator that scans JSON string contents for invalid \uXXXX surrogate sequences before invoking ujson. Reuse it for .json import fast path and std.parseJson while preserving import fallback to normal Jsonnet parsing. Add regression coverage for lone high/low surrogates, reversed pairs, valid pairs, and escaped backslash literals. Result: Malformed surrogate escapes no longer surface as Internal Error or corrupt imported strings. Valid pairs and literal \u text continue to parse correctly across JVM and JS targeted tests. References: databricks#1046
7ca9536 to
85788f3
Compare
Motivation: Some malformed Unicode surrogate escapes in imported .json files make ujson throw a plain java.lang.Exception, which previously escaped as an Internal Error. Avoid adding a pre-parse scan on the normal import path. Modification: Extend the .json import fast-path fallback to recognize ujson's known surrogate parse exception messages and fall back to normal Jsonnet parsing. Keep validation delegated to ujson and do not change std.parseJson. Result: Malformed surrogate escapes that ujson reports no longer leak Internal Error stack traces. Valid surrogate pairs still parse on the JSON fast path. Cases ujson silently accepts or corrupts remain a parser limitation rather than being fixed with an extra scan. References: databricks#1046
85788f3 to
15a8173
Compare
He-Pin
commented
Jun 25, 2026
| message != null && ( | ||
| message.startsWith("Unexpected character following high surrogate") || | ||
| message.startsWith("Duplicate high surrogate") || | ||
| message.startsWith("Un-paired low surrogate") |
Contributor
Author
There was a problem hiding this comment.
this depends on ujson's exception message not that bad I think.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Importing some malformed Unicode surrogate escapes from
.jsonfiles can make ujson throw a plainjava.lang.Exception, which previously escaped as anInternal Errorinstead of falling back to normal Jsonnet parsing.This keeps the fast path performance profile: no pre-parse validation scan is added, and normal
.jsonimports still go through a singleujson.ByteArrayParserpass.Modification
Exceptionsurrogate parse failures in the.jsonimport fast path and returnNone, preserving the existing fallback to normal Jsonnet parsing.java.lang.Exceptionclass plus known surrogate error message prefixes.std.parseJsonunchanged.Tradeoff
This intentionally does not fix cases that ujson silently accepts or corrupts, such as some lone high-surrogate escapes. Fixing those without changing ujson would require an external pre-scan or replacing/fixing the parser, which is outside this performance-focused PR.
Verification
./mill __.checkFormat./mill 'sjsonnet.jvm[3.3.8].test.testOnly' sjsonnet.JsonImportFastPathTests sjsonnet.StdParseTests./mill 'sjsonnet.jvm[3.3.8].test'./mill 'sjsonnet.jvm[2.12.21].compile' 'sjsonnet.jvm[2.13.18].compile'./mill 'sjsonnet.js[3.3.8].test.testOnly' sjsonnet.JsonImportFastPathTests/tmp/project-qoder-review.log)