preface
What is the KMP algorithm? The main problem to be solved is to quickly find whether the substring pattern exists in the template given a string template.
The KMP algorithm is an improved string matching algorithm proposed by D.E.K Nuth, J.H.Morris and V.R. Prate, so it is named after three predecessors with one letter each. In some versions of the data Structures course, there is the string chapter.
The KMP algorithm is notoriously difficult, and its idea is probably understandable to most students, but the key is to find the next array, and many students don’t understand why a few lines of code cut short can achieve magic effects.
KMP has abused me thousands of times, and I treat KMP like my first love. I have been writing code for 7-8 years, and I have never seen such an incomprehensible algorithm. I can’t help but sigh: how many predecessors’ philosophy and wisdom are contained in these few lines of code.
1. Brute force matching algorithm
As usual, before introducing THE KMP algorithm, we have to mention the brute force matching algorithm, because we can compare the advantages of the KMP algorithm by understanding the brute force matching algorithm.
Code first:
function subString(text,pattern){
let m = text.length;
let n = pattern.length;
for(let i = 0; i < m - n; ++i){let j = 0;
while(j < n && text[i+j] == pattern[j]){
++j;
}
if(j == n){
returni; }}return -1;
}
Copy the code
The reason why it is a brute force matching algorithm is that two Pointers are set, the pointer represents the position on the main string, and the j pointer represents the position on the target string, which is to compare one by one. If the match fails, the J pointer returns to 0, and the I pointer moves backwards by one.
The general process is as follows: When a mismatch is encountered, the I pointer moves back one bit and the j pointer backtracksThe problem of brute force matching lies in the backtracking of the I and J Pointers, so the core of KMP algorithm is to solve the backtracking.
2. KMP algorithm
When the brute force algorithm fails to match, let’s think about the situationAt this point, is it possible for the pattern string to match one bit later? The answer isCertainly notWhy? Obviously, if the prefix “abcdab” fails, only a common prefix is likely to match. Otherwise, you have to keep looking back until the moment when the largest common prefix is found.Therefore, the key of KMP algorithm is to find the longest common prefix and suffix at this time when the matching fails. The POINTER I does not backtrack, and directly moves the pointer J to the position of the largest common prefix and suffix to start the next matching.
At this point, we introduce the concept of prefixes and suffixes.
Prefix: All header combinations of a string except the last character.
Suffix: Refers to all tail combinations of a string except the first character.
For our pattern string “abcdabc”, All the prefix strings are “A “,”ab”,” ABC “,”abcd”,”abcda”,”abcdab”, all the suffix strings are “bcdabc”,”cdabc”,”dabc”,” ABC “,” BC “,”c”, so for the string abcdabc, The longest common prefix and suffix is 3.
We can see that because the pattern string is given, it can fail to match at any location. Is it wise to calculate the maximum common prefix and suffix for each position that fails to match? Those of you who have taken the chapter on hash tables of data structures know that the efficiency of searching by hash is O(1). If we calculate this mapping in advance, then we can know directly where to start a match if the match fails.
The above process is the process of solving the next array in KMP algorithm.
Let’s look at our pattern string: “abcdabc”
For the string “a”, the prefix set: [], the suffix set: [], and the longest common prefix: 0
For the string “ab”, the prefix set: [“b”], the suffix set: [“a”], and the longest common prefix: 0
For the string “ABC”, prefix collection: [” a “, “ab”], suffix collection: [” BC “, “c”), the longest suffix before public: 0
For string “abcd” prefix collection: [” a “, “ab”, “ABC”], suffix collection: [” BCD “, “CD” and “d”], the longest suffix before public: 0
For string “abcda,” prefix collection: [” a “, “ab”, “ABC”, “abcd”], suffix collection: [” bcda “, “cda”, “da” and “a”], the longest suffix before public: 1
For string “abcdab,” prefix collection: [” a “, “ab”, “ABC”, “abcd”, “abcda”], suffix collection: [” bcdab “, “cdab”, “dab”, “ab”, “b”), the longest suffix before public: 2
For string “abcdabc,” prefix collection: [” a “, “ab”, “ABC”, “abcd”, “abcda”, “abcdab”], suffix collection: [” bcdabc “, “cdabc”, “dabc”, “ABC”, “BC”, “c”), the longest suffix before public: 3
The above results are sorted into the following table:
a | b | c | d | a | b | c |
---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 1 | 2 | 3 |
The above derivation process is our thinking process, but it is not an easy thing to convert it into code, so we directly give the code of the NEXT array in the KMP algorithm, and understand it by analyzing the execution process of the code.
Find the next array code as follows:
/** * Generates the next array *@param {String} pattern
* @param {Number[]} next
*/
function genNext(pattern) {
let m = pattern.length
let next = []
// Since the first string has no prefix or suffix, we can assign 0 directly
next[0] = 0;
// When taking a character, it must have no prefix or suffix
for (let i = 1, j = 0; i < m; ++i) {
// If no match is found, recurse to find the previous maximum prefix
// Exit the loop if k is greater than 0 and the string at the current position is the same
while (j > 0&& pattern[i] ! = pattern[j]) {// Find the last maximum prefix suffix
j = next[j - 1];
}
// If a match is found, the maximum prefix and suffix is +1
if (pattern[i] == pattern[j]) {
j++;
}
// Find the maximum common prefix and suffix of the current string
next[i] = j;
}
return next
}
Copy the code
This is one of the most difficult pieces of code I’ve seen so far, and at first glance, it doesn’t feel like it has anything to do with what we did before, but why does it get the next array right? Please be patient and keep reading
First, we need to know that as the length of the string increases by 1, the length of the largest common prefix and suffix can only increase by 1. Let’s draw the execution of this code as follows:
Step1: Step2: Next [1]=0, where I points to 2Step3: Next [2]=0 because the prefix and suffix are not equal, then I points to 3Step4: Because the suffix is not equal, next[3]=0, at which point I points to 4Step5: Because the current string is the same as pattern[0], next[4]=1, where I points to 5Step6: Because the current string is the same as pattern[1], next[5]=2 and I points to 6Step7: Because the current string is the same as pattern[2], next[6]=3 and I points to 7Step8: I pointer is out of bounds, next array generation is complete.
In this process, we can see that the j pointer always points to the maximum common suffix of the string before the current character, because we mentioned earlier that the length of the string increases by 1 and the length of the maximum common suffix increases by 1.
The above test case doesn’t cover all the possible cases, if for the string “abCDabcaa”, we can see that when I =7, j=2, and when I =8, j=3, because d and A don’t match, then j will fall back, where should J fall back, remember, Before our next array but remember the current string string before the biggest public suffix, if equal, then back to this location, or continue to recursively until the current character and character is equal or j = 0, so that j should be back to this place, to continue to produce cross. You don’t call functions inside functions, but the idea is purely recursive.
The j pointer in this process always points to the largest common prefix and suffix.
2.1 Complete code of KMP algorithm
/** * Generates the next array *@param {String} pattern
* @param {Number[]} next
*/
function genNext(pattern) {
let m = pattern.length
let next = []
// Since the first string has no prefix or suffix, we can assign 0 directly
next[0] = 0;
// When taking a character, it must have no prefix or suffix
for (let i = 1, j = 0; i < m; ++i) {
// If no match is found, recurse to find the previous maximum prefix
// Exit the loop if k is greater than 0 and the string at the current position is the same
while (j > 0&& pattern[i] ! = pattern[j]) {// Find the last maximum prefix suffix
j = next[j - 1];
}
// If a match is found, the maximum prefix and suffix is +1
if (pattern[i] == pattern[j]) {
j++;
}
// Find the maximum common prefix and suffix of the current string
next[i] = j;
}
return next
}
/**
* KMP-Search
* @param {String} tpl
* @param {String} pattern
* @returns * /
function kmpSearch(tpl, pattern) {
let n = tpl.length, m = pattern.length;
let pos = -1;
let next = genNext(pattern);
for (let i = 0, q = 0; i < n; ++i) {
// If the match does not fail when the pattern pointer points to 0, the maximum public prefix before the current position is read from the next array, and the pattern string pointer moves to the maximum public prefix position.
if (q > 0&& pattern[q] ! = tpl[i]){ q = next[q -1];
}
// If the current character is equal to the character on the pattern string pointer bit, the pattern pointer moves back one bit
if (pattern[q] == tpl[i]) {
q++;
}
/* * The function will not work properly if the two if's cannot be swapped, and the match must be checked to see if the match fails. If swapped, the function will not work properly if the q pointer moves back one bit, and the loop does not end. * /
// If the pattern string pointer goes to the last bit, the match is successful
if (q == m) {
// Because the current matching position is actually in the length-1 position of pattern
pos = i - m + 1
break}}return pos
}
Copy the code
3, summarize
The essence of KMP algorithm is the process of solving the next array. The code is concise but the logic is complex. But most people don’t know how to solve the next array. I spent a lot of time trying to understand how to solve the next array. If you take the time to read it carefully, you will learn something. The process of finding next is described in natural language as follows: initialization; Suffixes equal; Antecedents and suffixes; Dispose of the Next array; One of the most important points is that in the solution, it’s always recursing (while), not just recursing once (if). Another important point is that during the string traversal, we need to determine whether the character pointed to by the current template string pointer is equal to the character pointed to by the pattern pointer, and then move the pattern string pointer back, otherwise we read the next[pattern string pointer-1] array.
Due to the limited level of the author, it is inevitable that there will be mistakes in the writing process. If there are any mistakes, please correct them. Please contact the author at [email protected], your opinions will help me make better progress. This article is the author’s original, if reproduced please contact the author.