Unicode and String Processing in JavaScript: Code Points, Surrogate Pairs, and Normalization Algorithms
1. Description
JavaScript strings are sequences of UTF-16 code units: each character is stored internally as one or two 16-bit code units. Characters in the range U+10000 to U+10FFFF need two code units, known as a "surrogate pair," and this is the source of many string-processing bugs. Understanding how JavaScript handles code points, surrogate pairs, and normalization is key to building robust internationalized and localized applications.
2. Core Concepts Explained
2.1 Code Points and Code Units
- Code Point: A unique numeric identifier assigned to each character by the Unicode standard, ranging from 0x0 to 0x10FFFF. For example, the code point for the letter "A" is 0x0041.
- Code Unit: The smallest addressable unit in a text encoding system. In UTF-16, a code unit is 16 bits. For characters in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF), one code point corresponds to one code unit; for characters in the supplementary planes (U+10000 to U+10FFFF), one code point corresponds to two code units (a surrogate pair).
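To make the distinction concrete, the following minimal sketch lists the same string once as UTF-16 code units and once as code points. The helper name describe is hypothetical and exists only for this illustration.

```javascript
// Hypothetical helper: list a string's UTF-16 code units and its code points as hex.
function describe(str) {
  const toHex = (n) => n.toString(16).toUpperCase().padStart(4, "0");

  const codeUnits = [];
  for (let i = 0; i < str.length; i++) {
    codeUnits.push(toHex(str.charCodeAt(i))); // one entry per 16-bit code unit
  }

  // Array.from iterates by code point, so a surrogate pair arrives as one element.
  const codePoints = Array.from(str, (ch) => toHex(ch.codePointAt(0)));

  return { codeUnits, codePoints };
}

console.log(describe("A"));  // code units: 0041       | code points: 0041
console.log(describe("😀")); // code units: D83D, DE00 | code points: 1F600
```

For a BMP character the two lists are identical; for a supplementary-plane character the code-unit list is one entry longer.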
2.2 Surrogate Pairs
- When a code point exceeds 0xFFFF, UTF-16 encodes it using two code units. The rules are as follows:
- High surrogate code unit: range 0xD800 to 0xDBFF.
- Low surrogate code unit: range 0xDC00 to 0xDFFF.
- For example, the character "😀" (grinning face emoji) has a code point of 0x1F600 and is encoded in UTF-16 as the high surrogate code unit 0xD83D and the low surrogate code unit 0xDE00.
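The arithmetic behind the ranges above can be sketched as follows: subtract 0x10000 from the code point to get a 20-bit value, put its top 10 bits into the high surrogate and its bottom 10 bits into the low surrogate. The function names toSurrogatePair and fromSurrogatePair are hypothetical; in practice String.fromCodePoint() and codePointAt() do this work for you.

```javascript
// Encode a supplementary-plane code point as a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("expected a code point in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Decode a surrogate pair back into a code point.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16));            // d83d de00
console.log(String.fromCharCode(high, low) === "😀");        // true
console.log(fromSurrogatePair(0xD83D, 0xDE00).toString(16)); // 1f600
```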
3. String Processing in JavaScript
3.1 String Length Issues
console.log("😀".length); // Outputs 2, because the string "😀" is stored internally as two code units
console.log("A".length); // Outputs 1
- The .length property returns the number of code units in a string, not the number of characters. This can yield unexpected results for strings containing surrogate pairs.
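The same mismatch affects any operation that works on code-unit offsets, such as slice(). As a minimal sketch, the hypothetical helpers below count and truncate by code point instead. Note the assumption that "character" here means one code point; a base letter followed by a combining mark still counts as two.

```javascript
// Hypothetical helpers: count and truncate by code point rather than by code unit.
function codePointLength(str) {
  let count = 0;
  for (const _ch of str) count++; // string iteration yields whole code points
  return count;
}

function truncateCodePoints(str, max) {
  return Array.from(str).slice(0, max).join("");
}

const text = "a😀b";
console.log(text.length);                 // 4 (code units)
console.log(codePointLength(text));       // 3 (code points)
console.log(text.slice(0, 2));            // "a" followed by a lone high surrogate
console.log(truncateCodePoints(text, 2)); // "a😀"
```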
3.2 Correct Ways to Iterate Over Strings
- Incorrect iteration (using indexing):
let emoji = "😀";
for (let i = 0; i < emoji.length; i++) {
  console.log(emoji[i]); // Outputs: � � (garbled, because the surrogate pair is split)
}
- Correct iteration methods:
- Using for...of loops (supports Unicode):
for (let char of emoji) {
  console.log(char); // Outputs: 😀
}
- Using the spread operator (...):
[...emoji].forEach(char => console.log(char)); // Outputs: 😀
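Array.from() uses the same string iterator, so it also splits a string by code point. When the original code-unit offset of each character is needed as well (for example, to map back into the string with slice()), a small generator can track it. The name codePointEntries is hypothetical; this is a sketch, not a standard API.

```javascript
// Hypothetical generator: yield each code point together with its code-unit offset.
function* codePointEntries(str) {
  let offset = 0;
  for (const char of str) {
    yield { offset, char, codePoint: char.codePointAt(0) };
    offset += char.length; // 1 for BMP characters, 2 for surrogate pairs
  }
}

console.log(Array.from("a😀b")); // three elements: "a", "😀", "b"

for (const { offset, char, codePoint } of codePointEntries("a😀b")) {
  console.log(offset, char, "U+" + codePoint.toString(16).toUpperCase());
}
// 0 a U+61
// 1 😀 U+1F600
// 3 b U+62
```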
4. Unicode Normalization
- Unicode allows certain characters to have multiple representations. For example, the letter "é" can be represented as:
- A single code point U+00E9 (Latin small letter e with acute accent).
- A decomposed sequence: the letter "e" (U+0065) followed by the combining acute accent (U+0301).
- This can cause issues in string comparison and sorting because the two representations look identical but have different code points.
4.1 Normalization Forms
The Unicode standard defines four normalization forms:
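- NFC (canonical composition): Decomposes the text canonically, then recombines it into precomposed characters where possible. This is usually the most compact form.
- NFD (canonical decomposition): Decomposes precomposed characters into base characters and combining marks.
- NFKC and NFKD: Work like NFC and NFD, but also replace compatibility characters (e.g., full-width letters) with their plain equivalents. This conversion loses information and cannot be reversed.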
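The differences between the four forms are easiest to see by listing the code points each one produces. The following is a minimal sketch; codePointsByForm is a hypothetical helper built on the standard String.prototype.normalize() method.

```javascript
const FORMS = ["NFC", "NFD", "NFKC", "NFKD"];

// Hypothetical helper: show the code points of a string under each normalization form.
function codePointsByForm(str) {
  const result = {};
  for (const form of FORMS) {
    result[form] = Array.from(str.normalize(form), (ch) =>
      "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
    );
  }
  return result;
}

console.log(codePointsByForm("\u00E9")); // precomposed "é"
// NFC and NFKC: U+00E9; NFD and NFKD: U+0065 U+0301

console.log(codePointsByForm("\uFB01")); // the "fi" ligature, a compatibility character
// NFC and NFD: U+FB01 (unchanged); NFKC and NFKD: U+0066 U+0069

console.log("\uFF21".normalize("NFKC")); // full-width "Ａ" becomes plain "A"
```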
4.2 Normalization in JavaScript
- ES6 introduced the String.prototype.normalize() method:
let s1 = '\u00E9'; // "é"
let s2 = '\u0065\u0301'; // "é" (decomposed form: "e" + combining acute accent)
console.log(s1 === s2); // false
console.log(s1.normalize() === s2.normalize()); // true (using NFC)
console.log(s1.normalize('NFD') === s2.normalize('NFD')); // true
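In application code it is often convenient to wrap this pattern so that every comparison or lookup key goes through the same normalization form. The sketch below assumes NFC as the default; equalsNormalized and dedupeNormalized are hypothetical helper names.

```javascript
// Compare two strings after bringing both to the same normalization form.
function equalsNormalized(a, b, form = "NFC") {
  return a.normalize(form) === b.normalize(form);
}

// Remove entries that are canonically equivalent to an earlier entry,
// keeping the first spelling that was seen.
function dedupeNormalized(values) {
  const seen = new Map();
  for (const value of values) {
    const key = value.normalize("NFC");
    if (!seen.has(key)) seen.set(key, value);
  }
  return [...seen.values()];
}

console.log(equalsNormalized("caf\u00E9", "cafe\u0301")); // true
console.log(equalsNormalized("caf\u00E9", "cafe"));       // false
console.log(dedupeNormalized(["caf\u00E9", "cafe\u0301", "cafe"]).length); // 2
```

Normalizing once at the point where text enters the system (user input, file import) usually avoids having to normalize at every comparison.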
5. Utility Methods for Handling Code Points
- ES6 provides methods for handling code points:
- String.fromCodePoint(): Creates a string from one or more code points (supports all of Unicode).
console.log(String.fromCodePoint(0x1F600)); // "😀"
- String.prototype.codePointAt(): Returns the code point that starts at the given code-unit index. A surrogate pair is read as one code point when the index points at its high surrogate.
console.log("😀".codePointAt(0)); // 128512 (0x1F600)
- String.prototype.at(): Standardized in ES2022. It returns the UTF-16 code unit at the given index (negative indices count from the end) and does not combine surrogate pairs, so it is not code-point aware.
console.log("😀".at(0)); // "\uD83D" (a lone high surrogate, not "😀")
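These two code point methods combine naturally into a round trip between a string and an array of code points. The sketch below uses hypothetical helper names (toCodePoints, fromCodePoints) and also shows what codePointAt() returns when the index lands in the middle of a surrogate pair.

```javascript
// Convert a string to an array of numeric code points.
function toCodePoints(str) {
  return Array.from(str, (ch) => ch.codePointAt(0));
}

// Rebuild a string from an array of code points.
function fromCodePoints(codePoints) {
  return String.fromCodePoint(...codePoints);
}

const points = toCodePoints("Hi😀");
console.log(points);                           // 72, 105, 128512
console.log(fromCodePoints(points) === "Hi😀"); // true

// The index passed to codePointAt() is a code-unit index. Index 1 of "😀"
// is the low surrogate, so only that half of the pair is returned.
console.log("😀".codePointAt(1).toString(16)); // de00
```

Spreading a very large array into String.fromCodePoint() can exceed the engine's argument limit, so long inputs are better processed in chunks.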
6. Regular Expressions and Unicode
- ES6 introduced the u flag, enabling Unicode mode in regular expressions:
console.log(/^.$/.test("😀")); // false (default matches code units)
console.log(/^.$/u.test("😀")); // true (matches code points)
- Using \p{...} to match Unicode properties (available since ES2018, only in Unicode mode):
console.log(/\p{Emoji}/u.test("😀")); // true
console.log(/\p{Script=Greek}/u.test("α")); // true
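Property escapes are useful for script-independent text handling, but the property has to be chosen with care. As a sketch: the hypothetical helper extractWords below uses \p{L} (any letter) to pull words out of mixed-script text, and the last three lines show that \p{Emoji} also matches ASCII digits, because digits carry the Emoji property in the Unicode data. \p{Emoji_Presentation} is one stricter alternative.

```javascript
// Hypothetical helper: extract runs of letters from text in any script.
function extractWords(text) {
  return text.match(/\p{L}+/gu) || [];
}

console.log(extractWords("Hello, αβγ мир!")); // "Hello", "αβγ", "мир"

// \p{Emoji} is broader than it looks: digits, "#" and "*" have this property.
console.log(/\p{Emoji}/u.test("7"));               // true
console.log(/\p{Emoji_Presentation}/u.test("7"));  // false
console.log(/\p{Emoji_Presentation}/u.test("😀")); // true
```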
7. Practical Recommendations
- When processing strings that may contain surrogate pairs, avoid using .length to count characters; use the spread operator or iterators instead.
- Normalize strings before comparing or sorting internationalized text.
- Use for...of loops or Array.from() to safely iterate over strings.
- Use the u flag in regular expressions to correctly handle Unicode characters.
Understanding these Unicode processing mechanisms will help you write more robust and internationalized JavaScript code.