Fix multi-byte Unicode input (CJK/emoji) #6

Open
keyuchen21 wants to merge 1 commit into Sentdex:master from keyuchen21:fix/unicode-input

Conversation

keyuchen21 commented Jun 30, 2026

Summary

  • Fix garbled input when typing Chinese/Japanese/Korean characters or emoji in the chatbox — _raw_read_key() was reading one byte at a time and immediately decoding, corrupting multi-byte UTF-8 sequences into replacement characters (������)
  • Fix display width calculations in the input editor — CJK characters occupy 2 terminal columns but were counted as 1, causing cursor misalignment and broken box borders
  • _raw_read_available() (used for escape sequences and paste) now accumulates raw bytes before decoding, fixing pasted CJK text
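
As an illustration of the lead-byte approach described above, here is a minimal sketch in Python. It is not the code from this PR: `utf8_sequence_length` and `read_char` are hypothetical names, and an `io.BytesIO` object stands in for the raw stdin that `_raw_read_key()` would read from.

```python
import io


def utf8_sequence_length(lead: int) -> int:
    """Return the total byte length of the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: single-byte ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx: one continuation byte follows
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx: two continuation bytes follow
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx: three continuation bytes follow
    return 1  # stray continuation or invalid lead byte: decode it on its own


def read_char(stream) -> str:
    """Read one complete UTF-8 character from a binary stream (hypothetical helper)."""
    first = stream.read(1)
    if not first:
        return ""  # end of input
    buf = first
    remaining = utf8_sequence_length(first[0]) - 1
    while remaining > 0:
        nxt = stream.read(1)
        if not nxt:
            break  # truncated sequence: decode what we have
        buf += nxt
        remaining -= 1
    # Decode only after the whole sequence has been collected.
    return buf.decode("utf-8", errors="replace")


if __name__ == "__main__":
    stream = io.BytesIO("a你😀".encode("utf-8"))
    print([read_char(stream) for _ in range(3)])  # ['a', '你', '😀']
```

The key design point is that decoding happens once per character rather than once per byte, so a three-byte CJK character or a four-byte emoji is never split across decode calls.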

Test plan

  • Type Chinese characters (你好世界) — should appear correctly, not as replacement chars
  • Type Japanese (こんにちは) and Korean (안녕하세요) — same
  • Paste multi-line Chinese text — should insert correctly
  • Verify cursor positioning is correct when moving left/right through CJK text
  • Verify the input box border stays aligned with wide characters
  • Verify normal ASCII input still works as before
  • Test emoji input (😀🎉) — should appear correctly

The raw input reader was reading stdin one byte at a time and immediately
decoding, which corrupted multi-byte UTF-8 characters into replacement
characters (U+FFFD). Now inspects the lead byte to determine continuation
byte count before decoding.

Also fixes display width calculations — CJK characters occupy 2 terminal
columns but were treated as 1, causing cursor misalignment and broken
box-drawing in the input editor.
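
The width fix can be sketched with the standard library's `unicodedata` module. This is an assumption about one reasonable approach, not the PR's actual implementation; `char_width` and `display_width` are hypothetical names.

```python
import unicodedata


def char_width(ch: str) -> int:
    """Approximate the number of terminal columns one character occupies."""
    if unicodedata.combining(ch):
        return 0  # combining marks attach to the previous character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # Wide and Fullwidth characters take two columns
    return 1


def display_width(text: str) -> int:
    """Sum of per-character column widths, for cursor and border math."""
    return sum(char_width(ch) for ch in text)


if __name__ == "__main__":
    print(display_width("hello"))  # 5
    print(display_width("你好"))    # 4
```

Using `display_width(text)` instead of `len(text)` when padding the input box keeps the right-hand border aligned. This sketch ignores control characters and multi-codepoint emoji sequences, which a terminal may render differently.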

keyuchen21 commented Jun 30, 2026

Verification

Tested manually — Chinese input now works correctly:

Before fix: typing "你好" produced "������" (each UTF-8 byte decoded separately as replacement characters)
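
The "before" behavior can be reproduced in isolation. The sketch below assumes the old reader decoded each byte with `errors="replace"`; that detail is an assumption, and `decode_per_byte` is a hypothetical name used only for illustration.

```python
def decode_per_byte(raw: bytes) -> str:
    """Mimic the buggy path: decode every byte as if it were a full character."""
    return "".join(bytes([b]).decode("utf-8", errors="replace") for b in raw)


if __name__ == "__main__":
    raw = "你好".encode("utf-8")
    print(len(raw))                                # 6 bytes for 2 characters
    print(decode_per_byte(raw) == "\ufffd" * 6)    # True: six replacement characters
    print(raw.decode("utf-8"))                     # 你好
```

Each of the six bytes is either an incomplete lead byte or a bare continuation byte, so every one of them decodes to U+FFFD on its own.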

After fix: "你好" displays correctly in the input box with proper cursor positioning and box alignment

Test results:

  • ✅ Chinese input (你好世界) renders correctly
  • ✅ Input box borders stay aligned with wide characters
  • ✅ Model responds to Chinese input properly
  • ✅ ASCII input continues to work as before

@Sentdex Sentdex self-assigned this Jul 1, 2026
2 participants