KMP Algorithm (String Matching)

KMP Algorithm (String Matching)

The KMP algorithm (Knuth-Morris-Pratt algorithm) is an efficient string matching algorithm. Its core idea is: when a mismatch occurs between the main string and the pattern string at a certain position, the pattern string can utilize the information from previously successful matches to slide to a specific position and continue matching, rather than moving back by only one position and restarting as in brute-force matching. This avoids backtracking of the main string pointer, thereby optimizing the time complexity to O(n+m).

1. Introduction: Why KMP?

Imagine you have a very long text (main string, length n) and a relatively short word (pattern string, length m). Your task is to find the first occurrence of this word in the text.

The most intuitive method is brute-force matching:

Align the starting position of the pattern string with every possible starting position in the main string.
Starting from the aligned position, compare the pattern string and the main string character by character.
If a mismatch is found, move the pattern string back by one position, then start comparing from the beginning again.

Drawbacks of Brute-Force Matching: When a mismatch occurs, the pointer (index) of the main string backtracks to the next position of the current match's starting point. For example, main string "AAAAAAB", pattern string "AAAB". When matching fails at the last character ('A' != 'B'), brute-force matching causes the main string pointer to backtrack, leading to many unnecessary repeated comparisons. Its worst-case time complexity is O(n*m).

The KMP algorithm was born to solve this "pointer backtracking" problem.

2. The Core Idea of KMP: Utilizing Known Information

The brilliance of the KMP algorithm lies in its pre-calculation of a special array for the pattern string itself—the Prefix Table (often called the next array). This array records information about the "maximum length of common prefixes and suffixes" within the pattern string.

Prefix: Refers to all head combinations of a string except for the last character.
Suffix: Refers to all tail combinations of a string except for the first character.
Longest Common (Equal) Prefix-Suffix: The length of the longest, equal prefix and suffix within a string.

Example to Understand "Longest Common Prefix-Suffix":

Pattern string "ABABC":
- Prefixes: "A", "AB", "ABA", "ABAB"
- Suffixes: "C", "BC", "ABC", "BABC"
- All prefixes and suffixes are unequal, so the longest common prefix-suffix length is 0.
Pattern string "ABABA":
- Prefixes: "A", "AB", "ABA", "ABAB"
- Suffixes: "A", "BA", "ABA", "BABA"
- Common prefixes and suffixes are: "A" (length 1) and "ABA" (length 3). The longest is "ABA", so the length is 3.

Role of the next Array: next[j] represents the length of the "longest common prefix-suffix" for the substring of the pattern string from index 0 to j (inclusive). When a mismatch occurs between the main string and the pattern string at position j, the value of next[j] tells us to which position we should slide the pattern string, aligning the next[j]-th position of the pattern string with the current main string pointer position, and then continue the comparison.

3. Constructing the next Array (Preprocessing Stage)

The process of constructing the next array is itself a "self-matching" process of the pattern string. We use two pointers: i (end of the suffix) and j (end of the prefix, also representing the current length of the longest common prefix-suffix).

Algorithm steps:

Initialize: next[0] = 0 (because a single character has no true prefix/suffix, length 0). Let j = 0, i = 1.
Start the loop, i traverses from 1 to the end of the pattern string.
- Case 1: pattern[i] == pattern[j]. This means we have found a longer common prefix-suffix. Execute j++, then set next[i] = j, and finally i++.
- Case 2: pattern[i] != pattern[j]. This means the characters do not match.
  - If j > 0, it means there is a common part ahead. We need to fall back j. Let j = next[j - 1]. Then compare pattern[i] and pattern[j] again. (This step is the most core fallback logic of KMP, which can be understood as matching within the pattern string).
  - If j == 0, it means we have fallen back to the beginning, and there is indeed no common part. Set next[i] = 0, then i++.

Example: Constructing the next array for pattern string "aabaaf":

p = ['a','a','b','a','a','f']
next[0] = 0 (initialization)
i=1, j=0: p[1]('a') == p[0]('a') -> j=1, next[1]=1, i=2
i=2, j=1: p[2]('b') != p[1]('a') -> j>0, fallback j = next[0] = 0
i=2, j=0: p[2]('b') != p[0]('a') -> j==0, next[2]=0, i=3
i=3, j=0: p[3]('a') == p[0]('a') -> j=1, next[3]=1, i=4
i=4, j=1: p[4]('a') == p[1]('a') -> j=2, next[4]=2, i=5
i=5, j=2: p[5]('f') != p[2]('b') -> j>0, fallback j = next[1] = 1
i=5, j=1: p[5]('f') != p[1]('a') -> j>0, fallback j = next[0] = 0
i=5, j=0: p[5]('f') != p[0]('a') -> j==0, next[5]=0, end.

Final next array: [0, 1, 0, 1, 2, 0]

4. Using the next Array for Matching

The matching process is very similar to constructing the next array. We use two pointers: i traverses the main string, j traverses the pattern string.

Algorithm steps:

Initialize: i = 0 (main string pointer), j = 0 (pattern string pointer).
Loop, i traverses from 0 to the end of the main string.
- Case 1: text[i] == pattern[j]. Match successful, i++, j++.
- Case 2: text[i] != pattern[j]. Match failed.
  - If j > 0, it means part of the pattern has been matched. Fallback the pattern string pointer according to the next array: j = next[j - 1]. (Note: The main string pointer i does NOT backtrack!)
  - If j == 0, it means the pattern string has fallen back to the beginning and cannot fall back further. Then advance the main string pointer i by one step: i++.
Determine the result: If j equals the pattern string length m, the match is successful, return the position i - j. Otherwise, the match fails.

Example: Main string "aabaabaaf", pattern string "aabaaf" (next=[0,1,0,1,2,0]):

i=0, j=0: 'a'=='a' -> i=1, j=1
i=1, j=1: 'a'=='a' -> i=2, j=2
i=2, j=2: 'b'=='b' -> i=3, j=3
i=3, j=3: 'a'=='a' -> i=4, j=4
i=4, j=4: 'a'=='a' -> i=5, j=5
i=5, j=5: 'b' != 'f' -> j>0, fallback j = next[4] = 2
i=5, j=2: 'b'=='b' -> i=6, j=3
i=6, j=3: 'a'=='a' -> i=7, j=4
i=7, j=4: 'a'=='a' -> i=8, j=5
i=8, j=5: 'f'=='f' -> i=9, j=6
j(6) == pattern.length(6), match successful! Starting position is i - j = 9 - 6 = 3.

Summary: The KMP algorithm preprocesses the pattern string to obtain the next array. When a mismatch occurs, it intelligently slides the pattern string using information from this array, avoiding backtracking of the main string pointer and improving the efficiency of string matching to a linear level O(n+m). Understanding the meaning and construction process of the next array is key to mastering the KMP algorithm.