fix: use StringDecoder to handle UTF-8 chunk boundaries in setEncoding #5035
Open
398651434 wants to merge 1 commit into nodejs:main
When setEncoding('utf8') was called, each chunk was converted to
a string individually, which corrupted multi-byte UTF-8 characters that
span chunk boundaries.
This fix:
- Initializes a StringDecoder when setEncoding is called
- Uses StringDecoder.write() in consumePush to properly handle
incomplete UTF-8 sequences at chunk boundaries
- Resets the decoder in consumeFinish to allow garbage collection
Closes nodejs#5002
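The three changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the BodySketch class and its parts array are hypothetical stand-ins for the real body consumer, and the call to decoder.end() before dropping the decoder is an assumption added here so that any buffered trailing bytes are flushed.

```javascript
const { StringDecoder } = require('node:string_decoder')

// Hypothetical stand-in for the body consumer; not the project's real class.
class BodySketch {
  constructor () {
    this.decoder = null
    this.parts = []
  }

  // Step 1: create the decoder when an encoding is set.
  setEncoding (encoding) {
    this.decoder = new StringDecoder(encoding)
  }

  // Step 2: go through the decoder so trailing partial bytes are held back
  // until the next chunk arrives; without a decoder, keep the raw buffer.
  consumePush (chunk) {
    this.parts.push(this.decoder ? this.decoder.write(chunk) : chunk)
  }

  // Step 3: flush (assumption, see lead-in), then drop the decoder so it
  // can be garbage collected.
  consumeFinish () {
    if (this.decoder === null) return Buffer.concat(this.parts)
    const text = this.parts.join('') + this.decoder.end()
    this.decoder = null
    return text
  }
}

// Usage: a 3-byte character (e2 82 ac) delivered as 1 byte + 2 bytes.
const body = new BodySketch()
body.setEncoding('utf8')
const euro = Buffer.from('€', 'utf8')
body.consumePush(euro.subarray(0, 1))
body.consumePush(euro.subarray(1))
const result = body.consumeFinish()
console.log(result === '€') // true
```

Keeping the decoder on the instance, rather than creating one per chunk, is what lets the incomplete bytes survive from one consumePush call to the next.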
Description
Fixes a bug where response.body.setEncoding('utf8') corrupts multi-byte UTF-8 characters that span chunk boundaries.
Root Cause
Each chunk was being individually converted to a string via buffer.utf8Slice() (or toString()). When a multi-byte UTF-8 character (e.g., a Chinese character = 3 bytes) is split across two HTTP response chunks, the incomplete byte sequence at the end of the first chunk is converted to garbage, and the remainder in the second chunk becomes a separate corrupted character.
Fix
Use Node.js's built-in StringDecoder (from node:string_decoder), which properly buffers incomplete byte sequences between write() calls:
- setEncoding(encoding): Initialize a StringDecoder when encoding is set
- consumePush: When a decoder exists, use decoder.write(chunk) instead of storing the raw buffer; this accumulates incomplete UTF-8 bytes internally
- consumeFinish: Reset the decoder to allow garbage collection
Testing
The bug manifests when:
- setEncoding('utf8') is called on the body
After the fix, characters are correctly reassembled across chunk boundaries.
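The difference between per-chunk conversion and StringDecoder can be reproduced in isolation. The snippet below is a standalone sketch rather than a test from this PR: the sample string and the split point are chosen here for illustration, and it uses only Buffer and node:string_decoder.

```javascript
const { StringDecoder } = require('node:string_decoder')

// '你好' is 6 bytes in UTF-8 (3 per character). Split inside the first
// character to simulate a chunk boundary in the middle of a sequence.
const bytes = Buffer.from('你好', 'utf8')
const chunks = [bytes.subarray(0, 2), bytes.subarray(2)]

// Per-chunk conversion: incomplete sequences decode to U+FFFD.
const naive = chunks.map((chunk) => chunk.toString('utf8')).join('')

// StringDecoder holds back the incomplete bytes until the next write().
const decoder = new StringDecoder('utf8')
const decoded = chunks.map((chunk) => decoder.write(chunk)).join('') + decoder.end()

console.log(naive === '你好')   // false
console.log(decoded === '你好') // true
```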
Closes #5002