Skip to content

fix: handle ujson surrogate parse exceptions in JSON imports#1046

Open
He-Pin wants to merge 4 commits into
databricks:masterfrom
He-Pin:fix/json-import-surrogate
Open

fix: handle ujson surrogate parse exceptions in JSON imports#1046
He-Pin wants to merge 4 commits into
databricks:masterfrom
He-Pin:fix/json-import-surrogate

Conversation

@He-Pin

@He-Pin He-Pin commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Motivation

Importing some malformed Unicode surrogate escapes from .json files can make ujson throw a plain java.lang.Exception, which previously escaped as an Internal Error instead of falling back to normal Jsonnet parsing.

This keeps the fast path performance profile: no pre-parse validation scan is added, and normal .json imports still go through a single ujson.ByteArrayParser pass.

Modification

  • Catch ujson's known bare-Exception surrogate parse failures in the .json import fast path and return None, preserving the existing fallback to normal Jsonnet parsing.
  • Keep the catch narrow by requiring the exact java.lang.Exception class plus known surrogate error message prefixes.
  • Add regression coverage for low-surrogate, reversed-surrogate, high-before-non-low-escape, high-before-high-escape failures, and valid surrogate pairs.
  • Leave std.parseJson unchanged.

Tradeoff

This intentionally does not fix cases that ujson silently accepts or corrupts, such as some lone high-surrogate escapes. Fixing those without changing ujson would require an external pre-scan or replacing/fixing the parser, which is outside this performance-focused PR.

Verification

  • ./mill __.checkFormat
  • ./mill 'sjsonnet.jvm[3.3.8].test.testOnly' sjsonnet.JsonImportFastPathTests sjsonnet.StdParseTests
  • ./mill 'sjsonnet.jvm[3.3.8].test'
  • ./mill 'sjsonnet.jvm[2.12.21].compile' 'sjsonnet.jvm[2.13.18].compile'
  • ./mill 'sjsonnet.js[3.3.8].test.testOnly' sjsonnet.JsonImportFastPathTests
  • Qoder stdout review: No must-fix findings (/tmp/project-qoder-review.log)

@He-Pin He-Pin force-pushed the fix/json-import-surrogate branch from 22fe76f to 7ca9536 Compare June 25, 2026 14:58
@He-Pin He-Pin marked this pull request as ready for review June 25, 2026 15:13
@He-Pin He-Pin marked this pull request as draft June 25, 2026 15:13
He-Pin added a commit to He-Pin/sjsonnet that referenced this pull request Jun 25, 2026
Motivation:
JSON imports and std.parseJson relied on ujson behavior for Unicode surrogate escapes. ujson can throw plain Exception for some malformed escapes and silently drop a lone high surrogate, so catching an exception message only fixed one symptom.

Modification:
Add a JsonUnicodeValidator that scans JSON string contents for invalid \uXXXX surrogate sequences before invoking ujson. Reuse it for .json import fast path and std.parseJson while preserving import fallback to normal Jsonnet parsing. Add regression coverage for lone high/low surrogates, reversed pairs, valid pairs, and escaped backslash literals.

Result:
Malformed surrogate escapes no longer surface as Internal Error or corrupt imported strings. Valid pairs and literal \u text continue to parse correctly across JVM and JS targeted tests.

References: databricks#1046
@He-Pin He-Pin force-pushed the fix/json-import-surrogate branch from 7ca9536 to 85788f3 Compare June 25, 2026 15:31
@He-Pin He-Pin changed the title fix: JSON import handles unpaired surrogate escapes without crashing fix: validate JSON surrogate escapes before parsing Jun 25, 2026
Motivation:
Some malformed Unicode surrogate escapes in imported .json files make ujson throw a plain java.lang.Exception, which previously escaped as an Internal Error. Avoid adding a pre-parse scan on the normal import path.

Modification:
Extend the .json import fast-path fallback to recognize ujson's known surrogate parse exception messages and fall back to normal Jsonnet parsing. Keep validation delegated to ujson and do not change std.parseJson.

Result:
Malformed surrogate escapes that ujson reports no longer leak Internal Error stack traces. Valid surrogate pairs still parse on the JSON fast path. Cases ujson silently accepts or corrupts remain a parser limitation rather than being fixed with an extra scan.

References: databricks#1046
@He-Pin He-Pin force-pushed the fix/json-import-surrogate branch from 85788f3 to 15a8173 Compare June 25, 2026 15:52
@He-Pin He-Pin changed the title fix: validate JSON surrogate escapes before parsing fix: handle ujson surrogate parse exceptions in JSON imports Jun 25, 2026
message != null && (
message.startsWith("Unexpected character following high surrogate") ||
message.startsWith("Duplicate high surrogate") ||
message.startsWith("Un-paired low surrogate")

@He-Pin He-Pin Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this depends on ujson's exception message not that bad I think.

@He-Pin He-Pin marked this pull request as ready for review June 25, 2026 15:55
@He-Pin He-Pin closed this Jun 25, 2026
@He-Pin He-Pin reopened this Jun 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant