
fix: use StringDecoder to handle UTF-8 chunk boundaries in setEncoding #5035

Open
398651434 wants to merge 1 commit into nodejs:main from 398651434:main


Description

Fixes a bug where response.body.setEncoding('utf8') corrupts multi-byte UTF-8 characters that span chunk boundaries.

Root Cause

Each chunk was converted to a string on its own via buffer.utf8Slice() (or toString()). When a multi-byte UTF-8 character (e.g., a Chinese character, typically 3 bytes) is split across two HTTP response chunks, the incomplete bytes at the end of the first chunk decode to U+FFFD replacement characters, and the leftover continuation bytes at the start of the second chunk decode to replacement characters as well, so the original character is lost.
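
The corruption can be reproduced without any HTTP traffic. The following is a minimal sketch, assuming a hand-picked split point inside a 3-byte character; the variable names are illustrative and not taken from the patched code.

```javascript
// Illustration only: the split point is chosen by hand to land inside a character.
const text = '你好';                      // two characters, 3 bytes each in UTF-8
const bytes = Buffer.from(text, 'utf8');  // 6 bytes in total

// 2 bytes in the first chunk, the remaining 4 bytes in the second.
const chunks = [bytes.subarray(0, 2), bytes.subarray(2)];

// Decoding each chunk independently loses the character that straddles the split.
const naive = chunks.map((chunk) => chunk.toString('utf8')).join('');

console.log(naive === text);            // false
console.log(naive.includes('\uFFFD'));  // true
```

Only the first character is damaged here; the second one lies entirely inside the second chunk and decodes normally.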

Fix

Use Node.js's built-in StringDecoder (from node:string_decoder), which buffers incomplete byte sequences between write() calls:

  1. setEncoding(encoding): Initialize a StringDecoder when encoding is set
  2. consumePush: When a decoder exists, store the result of decoder.write(chunk) instead of the raw buffer. The decoder holds back any incomplete trailing UTF-8 bytes and prepends them to the next chunk
  3. consumeFinish: Reset the decoder to allow garbage collection
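
The three steps can be sketched roughly as follows. BodySketch is a hypothetical stand-in, not the project's actual class, and its fields are invented for illustration. The sketch also calls decoder.end() before dropping the decoder so that buffered bytes are flushed; that call is an assumption of the sketch and is not described in this PR.

```javascript
const { StringDecoder } = require('node:string_decoder');

// Hypothetical stand-in for the body consumer; names and fields are illustrative.
class BodySketch {
  constructor() {
    this.decoder = null;
    this.parts = [];
  }

  // Step 1: create the decoder when an encoding is set.
  setEncoding(encoding) {
    this.decoder = new StringDecoder(encoding);
  }

  // Step 2: write() returns only complete characters and keeps a trailing
  // partial sequence inside the decoder until the next chunk arrives.
  consumePush(chunk) {
    this.parts.push(this.decoder ? this.decoder.write(chunk) : chunk);
  }

  // Step 3: build the result, then drop the decoder so it can be collected.
  consumeFinish() {
    const result = this.decoder
      ? this.parts.join('') + this.decoder.end() // assumption: flush leftovers
      : Buffer.concat(this.parts);
    this.decoder = null;
    this.parts = [];
    return result;
  }
}

// Usage: the first character is split across the two chunks.
const bytes = Buffer.from('你好', 'utf8');
const body = new BodySketch();
body.setEncoding('utf8');
body.consumePush(bytes.subarray(0, 2));
body.consumePush(bytes.subarray(2));
const decoded = body.consumeFinish();

console.log(decoded === '你好'); // true
```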

Testing

The bug manifests when:

  • HTTP response contains multi-byte UTF-8 text (e.g., Chinese characters, emoji)
  • setEncoding('utf8') is called on the body
  • The text spans multiple TCP packets/chunks

After the fix, characters are correctly reassembled across chunk boundaries.
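
One way to exercise these conditions without a server is to split a sample string at every possible byte offset and check that decoding round-trips. This is a sketch of the idea only, using StringDecoder directly; decodeChunks and roundTripsAtEverySplit are hypothetical helpers, and the sample text is arbitrary.

```javascript
const { StringDecoder } = require('node:string_decoder');

// Hypothetical helper: decode a list of chunks with one shared decoder.
function decodeChunks(chunks) {
  const decoder = new StringDecoder('utf8');
  let out = '';
  for (const chunk of chunks) out += decoder.write(chunk);
  return out + decoder.end();
}

// Hypothetical helper: true if the text survives a split at every byte offset.
function roundTripsAtEverySplit(text) {
  const bytes = Buffer.from(text, 'utf8');
  for (let split = 0; split <= bytes.length; split++) {
    const chunks = [bytes.subarray(0, split), bytes.subarray(split)];
    if (decodeChunks(chunks) !== text) return false;
  }
  return true;
}

// Mixes ASCII, 3-byte CJK characters, and a 4-byte emoji.
console.log(roundTripsAtEverySplit('你好, 🌍!')); // true
```

A regression test against the real body consumer would feed the same split chunks through an HTTP response instead of calling the decoder directly.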


Closes #5002

When setEncoding('utf8') is called, each chunk was being converted to
a string individually, which corrupts multi-byte UTF-8 characters that
span chunk boundaries.

This fix:
- Initializes a StringDecoder when setEncoding is called
- Uses StringDecoder.write() in consumePush to properly handle
  incomplete UTF-8 sequences at chunk boundaries
- Resets the decoder in consumeFinish to allow garbage collection

Closes nodejs#5002

Development

Successfully merging this pull request may close these issues.

setEncoding('utf8') on response body corrupts multi-byte UTF-8 characters at chunk boundaries
