Unicode and String Processing in JavaScript: Code Points, Surrogate Pairs, and Normalization Algorithms

1. Description

JavaScript strings are sequences of UTF-16 code units: each character is stored internally as one or two 16-bit code units. Characters in the range U+10000 to U+10FFFF need two code units, known as a "surrogate pair," and this is the source of most string-handling surprises. Understanding how JavaScript handles code points, surrogate pairs, and normalization is key to building robust internationalized and localized applications.

2. Core Concepts Explained

2.1 Code Points and Code Units

Code Point: A unique numeric identifier assigned to each character by the Unicode standard, ranging from 0x0 to 0x10FFFF. For example, the code point for the letter "A" is 0x0041.
Code Unit: The smallest addressable unit in a text encoding system. In UTF-16, a code unit is 16 bits. For characters in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF), one code point corresponds to one code unit; for characters in the supplementary planes (U+10000 to U+10FFFF), one code point corresponds to two code units (a surrogate pair).

2.2 Surrogate Pairs

When a code point exceeds 0xFFFF, UTF-16 encodes it using two code units. The rules are as follows:
1. High surrogate code unit: range 0xD800 to 0xDBFF.
2. Low surrogate code unit: range 0xDC00 to 0xDFFF.
For example, the character "😀" (grinning face emoji) has a code point of 0x1F600 and is encoded in UTF-16 as the high surrogate code unit 0xD83D and the low surrogate code unit 0xDE00.
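
The arithmetic behind these ranges can be sketched as follows. The helper names `toSurrogatePair` and `fromSurrogatePair` are hypothetical and chosen only for illustration; the sketch assumes its input is a valid supplementary code point.

```javascript
// Hypothetical helper: split a supplementary code point into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("expected a code point in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits select the high surrogate
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits select the low surrogate
  return [high, low];
}

// Hypothetical inverse: rebuild the code point from a surrogate pair.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16));       // d83d de00
console.log(fromSurrogatePair(high, low).toString(16)); // 1f600
console.log(String.fromCharCode(high, low) === "😀");   // true
```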

3. String Processing in JavaScript

3.1 String Length Issues

console.log("😀".length); // Outputs 2, because the string "😀" is stored internally as two code units
console.log("A".length);  // Outputs 1

The .length property returns the number of code units in a string, not the number of characters. This can yield unexpected results for strings containing surrogate pairs.
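
A minimal sketch of counting code points instead of code units is shown below. The helper name `countCodePoints` is an assumption made for this example. Note that a code point is still not always what a reader perceives as one character: a base letter followed by a combining mark counts as two.

```javascript
// Hypothetical helper: count code points rather than UTF-16 code units.
function countCodePoints(str) {
  let count = 0;
  // The string iterator yields one full code point per step,
  // so a surrogate pair is counted once.
  for (const _ of str) {
    count++;
  }
  return count;
}

console.log("A😀B".length);               // 4 (code units)
console.log(countCodePoints("A😀B"));     // 3 (code points)
console.log(countCodePoints("e\u0301"));  // 2 (base letter + combining mark)
```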

3.2 Correct Ways to Iterate Over Strings

Incorrect iteration (using indexing):

let emoji = "😀";
for (let i = 0; i < emoji.length; i++) {
  console.log(emoji[i]); // Outputs: � � (garbled, because the surrogate pair is split)
}

Correct iteration methods:

Using for...of loops (supports Unicode):

for (let char of emoji) {
  console.log(char); // Outputs: 😀
}

Using the spread operator (...):

[...emoji].forEach(char => console.log(char)); // Outputs: 😀

4. Unicode Normalization

Unicode allows certain characters to have multiple representations. For example, the letter "é" can be represented as:
- A single code point U+00E9 (Latin small letter e with acute accent).
- A composite form: letter "e" (U+0065) followed by the combining acute accent (U+0301).
This can cause issues in string comparison and sorting because the two representations look identical but have different code points.

4.1 Normalization Forms

The Unicode standard defines four normalization forms:

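NFC: Canonical decomposition followed by canonical composition; yields precomposed characters where they exist and is usually the most compact form.
NFD: Canonical decomposition; splits precomposed characters into base characters and combining marks.
NFKC and NFKD: Like NFC and NFD, but they also replace compatibility characters (e.g., full-width letters, ligatures) with their plain equivalents, which can discard formatting distinctions.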
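
To make the difference between the forms concrete, the sketch below prints the code points produced by each form for a few inputs. The helper `describe` is hypothetical and exists only to format the output.

```javascript
// Hypothetical helper: list the code points of a string as "U+XXXX" labels.
function describe(str) {
  return Array.from(str, ch =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");
}

// Canonical forms: precomposed "é" versus "e" + combining acute accent.
console.log(describe("\u00E9".normalize("NFD")));  // U+0065 U+0301
console.log(describe("e\u0301".normalize("NFC"))); // U+00E9

// Compatibility forms: full-width "Ａ" (U+FF21) and the "ﬁ" ligature (U+FB01).
console.log(describe("\uFF21".normalize("NFC")));  // U+FF21 (unchanged)
console.log(describe("\uFF21".normalize("NFKC"))); // U+0041
console.log(describe("\uFB01".normalize("NFKC"))); // U+0066 U+0069
```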

4.2 Normalization in JavaScript

ES6 introduced the String.prototype.normalize() method:

let s1 = '\u00E9';        // "é"
let s2 = '\u0065\u0301';  // "é" (composite form)
console.log(s1 === s2);                 // false
console.log(s1.normalize() === s2.normalize()); // true (using NFC)
console.log(s1.normalize('NFD') === s2.normalize('NFD')); // true
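
The same idea extends to collections. The following is a sketch, under the assumption that NFC is the form you want to store, of removing entries that differ only in their canonical representation. The function name `uniqueByCanonicalForm` and the sample names are invented for illustration.

```javascript
// Hypothetical helper: de-duplicate strings that are canonically equivalent.
function uniqueByCanonicalForm(values) {
  // Normalizing to NFC first makes equivalent strings compare equal inside the Set.
  return [...new Set(values.map(value => value.normalize("NFC")))];
}

const names = ["Jos\u00E9", "Jose\u0301", "Zoe"]; // the first two both render as "José"
console.log(new Set(names).size);                 // 3 (raw comparison sees three strings)
console.log(uniqueByCanonicalForm(names).length); // 2
```

Compatibility variants such as full-width letters are not merged by NFC; that would need NFKC, which is a lossier choice.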

5. Utility Methods for Handling Code Points

ES6 provides methods for handling code points:
1. String.fromCodePoint(): Creates a string from a code point (supports all Unicode).
```
console.log(String.fromCodePoint(0x1F600)); // "😀"
```
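2. String.prototype.codePointAt(): Returns the code point that starts at the specified code unit index. It decodes a surrogate pair only when the index points at the high surrogate; an index that points at the low surrogate returns that lone surrogate value.
```
console.log("😀".codePointAt(0)); // 128512 (0x1F600)
```
3. String.prototype.at(): Returns the UTF-16 code unit at the specified position and accepts negative indexes. It was standardized in ES2022 and is not code-point aware, so indexing into a surrogate pair yields a lone surrogate.
```
console.log("😀".at(0) === "\uD83D"); // true (lone high surrogate, not "😀")
console.log([..."😀"].at(0));         // "😀" (spread splits by code point first)
```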
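
Because codePointAt() takes a code unit index, an index-based loop has to skip the second half of each surrogate pair itself. A minimal sketch follows; the helper name `codePointsByIndex` is an assumption for this example.

```javascript
// Hypothetical helper: collect code points with an index-based loop.
function codePointsByIndex(str) {
  const result = [];
  for (let i = 0; i < str.length; ) {
    const codePoint = str.codePointAt(i);
    result.push(codePoint);
    // Supplementary code points occupy two code units, so advance by 2.
    i += codePoint > 0xFFFF ? 2 : 1;
  }
  return result;
}

console.log(codePointsByIndex("A😀B"));            // [65, 128512, 66]
console.log("😀".codePointAt(1).toString(16));     // de00 (lone low surrogate)
```

In most code a for...of loop is simpler; the index-based form is mainly useful when the code unit offsets themselves are needed.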

6. Regular Expressions and Unicode

ES6 introduced the u flag, enabling Unicode mode in regular expressions:

console.log(/^.$/.test("😀")); // false (default matches code units)
console.log(/^.$/u.test("😀")); // true (matches code points)

Using \p{...} to match Unicode properties (added in ES2018; requires the u flag):

console.log(/\p{Emoji}/u.test("😀")); // true
console.log(/\p{Script=Greek}/u.test("α")); // true
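
Property escapes combine naturally with the g flag for extraction. The sketch below uses an invented sample string. It also shows one caveat worth knowing: the Emoji property includes ASCII digits, so Emoji_Presentation may be closer to what is usually meant by "emoji".

```javascript
const text = "α + β = γ, x = 😀";

// Extract every Greek-script character, one code point per match.
const greek = text.match(/\p{Script=Greek}/gu);
console.log(greek.join("")); // αβγ

// With the u flag, "." consumes a whole code point, so this counts code points.
console.log("😀".match(/./gu).length); // 1

// Caveat: digits carry the Emoji property because of keycap sequences.
console.log(/\p{Emoji}/u.test("7"));               // true
console.log(/\p{Emoji_Presentation}/u.test("7"));  // false
console.log(/\p{Emoji_Presentation}/u.test("😀")); // true
```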

7. Practical Recommendations

When processing strings that may contain surrogate pairs, avoid using .length to count characters; use the spread operator or iterators instead.
Normalize strings before comparing or sorting internationalized text.
Use for...of loops or Array.from() to safely iterate over strings.
Use the u flag in regular expressions to correctly handle Unicode characters.

Understanding these Unicode processing mechanisms will help you write more robust and internationalized JavaScript code.