KMP String Matching Algorithm

KMP String Matching Algorithm

Problem Description
The KMP algorithm is used to solve the string matching problem: given a text string S and a pattern string P, determine whether P appears in S and return the position(s) where it occurs. The core idea of the KMP algorithm is to leverage already matched information when a mismatch occurs, avoiding restarting the comparison from the beginning, thereby optimizing the time complexity to O(n+m).

Solution Process

1. Drawbacks of Brute-Force Matching
The brute-force approach compares characters one by one:

  • When S[i] matches P[j], increment both i and j.
  • When a mismatch occurs, reset i back to the next character after the start of the current matching attempt, and reset j to 0.
    This backtracking leads to repeated comparisons, resulting in a worst-case time complexity of O(n*m).

2. Core Idea of KMP
When a mismatch occurs, the KMP algorithm keeps the text pointer i unchanged (no backtracking) and only moves the pattern pointer j to a specific position. This position is determined by the "Longest Common Prefix and Suffix".

  • Prefix: Any contiguous substring starting from the first character (excluding the entire string).
  • Suffix: Any contiguous substring ending at the last character (excluding the entire string).
  • Longest Common Prefix and Suffix (LPS): The longest identical part between a prefix and a suffix.
    Example: For P="ababc", prefixes include "a", "ab", "aba", "abab"; suffixes include "c", "bc", "abc", "babc". The longest common prefix and suffix is "ab" (length 2).

3. Constructing the next Array
The next array (also called the partial match table) stores the length of the longest common prefix and suffix for the substring ending before each position in the pattern string. Define next[j] as the position j should jump to when P[j] fails to match.

  • Steps to compute the next array (using P="ababc" as an example):
    • j=0: Substring "a" has no proper prefix or suffix, next[0]=-1 (a special marker indicating a shift of the entire pattern).
    • j=1: Substring "ab": Prefixes ["a"], Suffixes ["b"], no common part, next[1]=0.
    • j=2: Substring "aba": Prefixes ["a", "ab"], Suffixes ["a", "ba"], common part "a" length 1, next[2]=1.
    • j=3: Substring "abab": Prefixes ["a", "ab", "aba"], Suffixes ["b", "ab", "bab"], common part "ab" length 2, next[3]=2.
    • j=4: Substring "ababc": Prefixes ["a", "ab", "aba", "abab"], Suffixes ["c", "bc", "abc", "babc"], no common part, next[4]=0.
      The final next array is [-1, 0, 1, 2, 0].

4. Matching Process
Assume S="abababc", P="ababc", next=[-1,0,1,2,0]:

  • i=0, j=0: S[0]='a' matches P[0]='a', i=1, j=1.
  • i=1, j=1: S[1]='b' matches P[1]='b', i=2, j=2.
  • i=2, j=2: S[2]='a' matches P[2]='a', i=3, j=3.
  • i=3, j=3: S[3]='b' matches P[3]='b', i=4, j=4.
  • i=4, j=4: S[4]='a' mismatches P[4]='c', check next[4]=0, j jumps to 0.
  • i=4, j=0: S[4]='a' matches P[0]='a', i=5, j=1.
  • Continue matching until j=5 (out of bounds), match successful, starting position is i-j=5-5=0.

5. Key Points Summary

  • Both the construction of the next array and the matching process are simulated using two pointers, avoiding backtracking.
  • Understanding the Longest Common Prefix and Suffix is key to mastering KMP.
  • In actual coding, pay attention to boundary handling for the next array (e.g., when j=-1, shift the entire pattern).