String Reversal and Unicode Security Issues in JavaScript

I. Problem Description

Reversing a string is a common JavaScript interview question. It looks simple, but it hides Unicode pitfalls that can silently corrupt text. Many developers reach for str.split('').reverse().join(''), which produces incorrect results for strings containing characters outside the Basic Multilingual Plane (such as most emoji), combining characters, or other sequences made of several code points.

II. Core Problem Analysis

JavaScript String Encoding Issues
- JavaScript uses UTF-16 encoding to represent strings.
- Each code point occupies 1 or 2 code units.
- Code points in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF): 1 code unit (e.g., English letters and most common Chinese characters).
- Supplementary-plane code points (U+10000 and above): 2 code units forming a surrogate pair (e.g., most emoji).
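
The difference between code units and code points can be checked directly. The following is a minimal sketch; the helper name inspectString is made up for illustration and is not part of any library.

```javascript
// Hypothetical helper: report how a string looks at the code unit and code point level.
function inspectString(str) {
  return {
    codeUnits: str.length,          // .length counts UTF-16 code units
    codePoints: [...str].length,    // the string iterator walks code points
    hex: [...str].map(ch =>
      'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')),
  };
}

console.log(inspectString('a'));   // BMP character: 1 code unit, 1 code point
console.log(inspectString('中'));  // BMP character: 1 code unit, 1 code point
console.log(inspectString('🐘'));  // supplementary character: 2 code units, 1 code point
```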

Defects of the Simple Reversal Method

// Common incorrect approach
function reverseString(str) {
  return str.split('').reverse().join('');
}

// Testing the problem
console.log(reverseString('hello'));  // 'olleh' ✅
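console.log(reverseString('🐘🐍'));  // '\uDC0D🐘\uD83D' (renders as '�🐘�')  ❌ Surrogate pairs broken
console.log(reverseString('café'));  // 'éfac'  ✅ (only because this 'é' is the single code point U+00E9)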
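
One way to see what went wrong is to check the result for lone surrogates. The sketch below uses a plain regular expression so it runs on older runtimes; hasLoneSurrogate and naive are illustrative names. Newer engines also offer String.prototype.isWellFormed(), which is not used here.

```javascript
// Matches a high surrogate not followed by a low one, or a low surrogate
// not preceded by a high one. No `u` flag, so the regex sees code units.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/;

function hasLoneSurrogate(str) {
  return LONE_SURROGATE.test(str);
}

// The code-unit-based reversal discussed above.
const naive = s => s.split('').reverse().join('');

console.log(hasLoneSurrogate('🐘🐍'));         // false: the input is well-formed
console.log(hasLoneSurrogate(naive('🐘🐍')));  // true: reversal split the pairs
console.log(hasLoneSurrogate(naive('hello'))); // false: BMP-only text is unaffected
```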

III. Step-by-Step Problem Resolution

Step 1: Understanding Differences in String Iteration

// Comparing different string iteration methods
const str = '🐘hello';

// Iterate by code unit (incorrect)
console.log(str.split(''));  // ['\uD83D', '\uDC18', 'h', 'e', 'l', 'l', 'o']

// Iterate by code point (correct)
console.log([...str]);  // ['🐘', 'h', 'e', 'l', 'l', 'o']
console.log(Array.from(str));  // ['🐘', 'h', 'e', 'l', 'l', 'o']

Step 2: Handling Basic Unicode Characters

function reverseStringBasic(str) {
  // Spread syntax (or Array.from) iterates by code point, so surrogate pairs stay intact
  return [...str].reverse().join('');
}

// Test
console.log(reverseStringBasic('🐘🐍'));  // '🐍🐘' ✅
console.log(reverseStringBasic('café'));  // 'éfac' ✅

Step 3: Handling More Complex Unicode Scenarios

Problem: Combining characters (e.g., letters with accents)

// Issue with combining characters
const decomposed = 'caf\u0065\u0301';  // Decomposed 'café': 'e' followed by U+0301 COMBINING ACUTE ACCENT
console.log([...decomposed]);  // ['c', 'a', 'f', 'e', '\u0301']
console.log(reverseStringBasic(decomposed));  // '\u0301efac' ❌ The accent is detached from 'e'

Step 4: Handling Combining Characters (Normalization)

function reverseStringWithNormalization(str) {
  // Normalize first so that combining sequences with a precomposed equivalent become single code points
  const normalized = str.normalize('NFC');  // Canonical decomposition followed by composition
  return [...normalized].reverse().join('');
}

// Test combining characters
const cafe1 = 'caf\u00E9';  // Single code point
const cafe2 = 'caf\u0065\u0301';  // Combined form
console.log(reverseStringWithNormalization(cafe1));  // 'éfac' ✅
console.log(reverseStringWithNormalization(cafe2));  // 'éfac' ✅
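
Normalization only helps when a precomposed code point exists. The sketch below illustrates the limitation with 'q' followed by a combining acute accent, on the assumption that Unicode defines no precomposed form for that pair; reverseNFC is an illustrative name.

```javascript
// Same idea as the normalization-based reversal above, in one line.
const reverseNFC = s => [...s.normalize('NFC')].reverse().join('');

const composable = 'e\u0301';  // NFC maps this to the single code point U+00E9
const stuck = 'q\u0301';       // assumption: no precomposed "q with acute" exists

console.log([...composable.normalize('NFC')].length);  // 1
console.log([...stuck.normalize('NFC')].length);       // 2, the mark stays separate

// The accent ends up after 'z' instead of 'q'.
console.log(reverseNFC(stuck + 'z') === 'z\u0301q');   // true
```

For input like this, grapheme-aware segmentation is needed rather than normalization alone.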

Step 5: Handling Zero-Width Joiners and Directional Characters

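// More complex cases: zero-width joiner (ZWJ) sequences and directional characters
// Note: directional (bidi) control characters are moved like any other cluster, not interpreted
function safeReverseString(str) {
  // Use Intl.Segmenter (part of ECMA-402, where available) to split by grapheme cluster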
  if ('Segmenter' in Intl) {
    const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
    const segments = Array.from(segmenter.segment(str), s => s.segment);
    return segments.reverse().join('');
  }
  
  // Fallback: NFC normalization + code point split (ZWJ sequences can still be broken apart)
  return [...str.normalize('NFC')].reverse().join('');
}

// Test
const complexStr = '👨‍👩‍👧‍👦';  // Family emoji (contains zero-width joiners)
console.log(safeReverseString(complexStr));  // Unchanged when Intl.Segmenter is used: the whole family emoji is one grapheme cluster
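
To make the three levels concrete, the family emoji can be measured in code units, code points, and grapheme clusters. This is a small sketch; countGraphemes is a hypothetical helper that returns null when Intl.Segmenter is missing.

```javascript
// Family emoji written with escapes: man, ZWJ, woman, ZWJ, girl, ZWJ, boy.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

function countGraphemes(str, locale = 'en') {
  // Returns null when Intl.Segmenter is unavailable, so callers can tell.
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') return null;
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  return [...segmenter.segment(str)].length;
}

console.log(family.length);           // 11 code units
console.log([...family].length);      // 7 code points (4 emoji + 3 ZWJ)
console.log(countGraphemes(family));  // 1 where Intl.Segmenter is supported

// Reversing by code point reorders the family members, so the string changes.
console.log([...family].reverse().join('') === family);  // false
```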

Step 6: Performance Considerations and Alternatives

// Method 1: Spread operator (recommended, clear and readable)
function reverse1(str) {
  return [...str].reverse().join('');
}

// Method 2: for...of loop that prepends each code point (no intermediate array)
// Avoid arr.unshift() here: it shifts every element on each call, giving O(n²) overall
function reverse2(str) {
  let result = '';
  for (const char of str) {
    result = char + result;
  }
  return result;
}

// Method 3: Recursion (not recommended for production, risk of stack overflow)
function reverse3(str) {
  if (str === '') return '';
  const [first] = str;  // First code point (str[0] would split a surrogate pair)
  return reverse3(str.slice(first.length)) + first;
}
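
Performance claims are easiest to judge by measuring. The following is a rough timing sketch, not a rigorous benchmark; timeIt, viaSpread, and viaPrepend are illustrative names, and the numbers depend heavily on the engine and the input.

```javascript
// Two code-point-aware candidates to compare.
const viaSpread = s => [...s].reverse().join('');
const viaPrepend = s => {
  let out = '';
  for (const ch of s) out = ch + out;
  return out;
};

// Hypothetical helper: average wall-clock milliseconds per call.
function timeIt(fn, input, runs = 10) {
  const start = Date.now();
  for (let i = 0; i < runs; i++) fn(input);
  return (Date.now() - start) / runs;
}

const sample = 'ab🐘'.repeat(20000);
console.log('spread :', timeIt(viaSpread, sample), 'ms');
console.log('prepend:', timeIt(viaPrepend, sample), 'ms');
```

Date.now() has millisecond resolution, so small inputs may report 0; increase the input size or the number of runs for a clearer picture.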

IV. Complete Safe Reversal Function

/**
 * Safe string reversal function
 * Reverses by grapheme cluster when Intl.Segmenter is available, so surrogate pairs,
 * combining characters, and complex emoji stay intact
 */
function safeStringReverse(str) {
  if (typeof str !== 'string') {
    throw new TypeError('Expected a string');
  }
  
  // Use Intl.Segmenter (part of ECMA-402) for grapheme segmentation where available
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    try {
      const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
      const segments = Array.from(segmenter.segment(str), s => s.segment);
      return segments.reverse().join('');
    } catch (e) {
      // Fallback to regular method
    }
  }
  
  // Fallback: Unicode normalization + code point split
  // Note: normalization changes the code point sequence and only helps when a precomposed form exists;
  // ZWJ sequences and marks without a precomposed form can still be separated
  const normalized = str.normalize('NFC');
  return [...normalized].reverse().join('');
}

// Comprehensive testing
const testCases = [
  'hello',
  '🐘🐍',  // Animal emojis
  'café',  // Accented characters
  '👨‍👩‍👧‍👦',  // Family emoji (zero-width joiners)
  'नमस्ते',  // Hindi in Devanagari script (combining vowel signs and virama)
  'A\u0301',  // 'A' followed by a combining acute accent
  '🔥🌟✨',  // Multiple emojis
];

testCases.forEach(str => {
  console.log(`Original string: ${str}`);
  console.log(`Reversed result: ${safeStringReverse(str)}`);
  console.log('---');
});
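
Where Intl.Segmenter is missing, a different fallback from the NFC approach used above is to keep each base character together with its trailing combining marks, using a Unicode property escape. This is a rough sketch under that assumption; reverseKeepingMarks is a hypothetical name, and the pattern does not keep ZWJ emoji sequences or flag emoji together.

```javascript
// A base code point plus any following combining marks (General_Category M),
// or a run of marks that has no base in front of it.
const BASE_WITH_MARKS = /\P{M}\p{M}*|\p{M}+/gu;

function reverseKeepingMarks(str) {
  const clusters = str.match(BASE_WITH_MARKS) ?? [];  // match() returns null for ''
  return clusters.reverse().join('');
}

console.log(reverseKeepingMarks('caf\u0065\u0301') === '\u0065\u0301fac');  // true
console.log(reverseKeepingMarks('q\u0301z') === 'zq\u0301');                // true
console.log(reverseKeepingMarks('🐘🐍') === '🐍🐘');                        // true
```

Unlike normalization, this leaves the code points themselves untouched and only changes how they are grouped before reversing.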

V. Interview Key Points Summary

Basic Pitfall: split('') splits by UTF-16 code units, which breaks surrogate pairs.
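Unicode Normalization: normalize('NFC') merges combining sequences only where a precomposed code point exists; it is not a general fix.
Code Point Iteration: Prefer [...str] or Array.from(str) over split('') so that surrogate pairs stay intact.
Grapheme Segmentation: Intl.Segmenter with granularity 'grapheme' keeps combining sequences and ZWJ emoji together.
Performance Considerations: The spread-based version is fine for most inputs; on very large strings, avoid O(n²) patterns such as repeated arr.unshift().
Edge Cases: Empty strings, non-string inputs, special Unicode characters (e.g., lone surrogates, directional controls).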

Understanding these details not only helps write correct string reversal functions but also demonstrates a deep understanding of JavaScript's Unicode handling mechanisms.