One: Background TOC
Given a main string (replaced by S) and a pattern string (replaced by P), it is required to find the position of P in S, that is, the pattern matching problem of the string. Today I’m going to introduce one of the most common algorithms for solving this problem, the Knuth-Morris-Pratt algorithm, or KMP, which was conceived in 1974 by Donald Ervin Knuth and Vaughn Pratt, In the same year, James H. Morris independently designed the algorithm, which was finally jointly published by three people in 1977. Before we move on, it’s worth introducing two concepts here: prefixes and suffixes.
Two: Simple string matching algorithm TOC
When we first encounter the pattern matching problem of string, the first reaction in our mind must be naive string matching (violent matching), that is, we traverse every character of S, compare it with P starting with the character, and output all matches; Otherwise, until the end of S. The code is as follows:
/* String subscript starts at 0 */ int NaiveStringSearch(string S, string P) {int I = 0; // int j = 0; // int s_len = s.size (); int p_len = P.size(); While (I < s_len && j < p_len) {if (S = = P) [j] [I] / / if equal, before further {i++; j++; } else // not equal {I = I - j + 1; j = 0; }} if (j == p_len) return i-j; return -1; }Copy the code
The time complexity of the above algorithm is O(nm), where N is the length of S and m is the length of P. This time complexity is difficult to meet our needs, so let’s get to the topic: KMP algorithm with θ (n+ M) time complexity.
Three: KMP string matching algorithm TOC
3.1 Algorithm Flow TOC
(1)
First, the first character of the main string “BBC ABCDAB ABCDABCDABDE” is compared with the first character of the pattern string “ABCDABD”. Because B does not match A, the pattern string moves back one bit.
(2)
Because B and A do not match again, the pattern string moves further back.
(3)
This is done until the main string has one character that is the same as the first character of the pattern string.
(4)
Then compare the next character of the main string and the pattern string, again the same.
(5)
Until the main string has a character different from the corresponding character of the pattern string.
(6)
At this point, the most natural reaction is to move the whole pattern string back one bit and compare it from scratch. This works, but it’s inefficient, because you have to move the “search location” to the location you’ve already compared, and do the comparison again.
(7)
One basic fact is that when Spaces don’t match D, you actually know that the first six characters are “ABCDAB”. The idea of the KMP algorithm is that it is more efficient to try to use this known information by not moving the “search position” back to the one it has already been compared to, but by moving it further back.
(8)
j | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
Pattern string | A | B | C | D | A | B | D | ‘\ 0’ |
next[j] | – 1 | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
How do you do that? You can set a jump array for pattern stringsint next[ ]
I’ll show you how to compute this array later, but you only need to know how to use it.
(9)
The first six characters “ABCDAB” are known to match Spaces that do not match D. According to the jump array, the next value of D at the mismatch is 2, so nextThe match starts at position 2 in the pattern string.
(10)
Because whitespace does not match C, the next value at C is 0, so the following pattern string matches from subscript 0.
(11)
Because the space does not match A, the next value is -1, which means that the first character of the pattern string does not match, so move it one bit further.
(12)
Bit by bit, until C and D do not match. So, the next step is to start matching at subscript 2.
(13)
Bit-by-bit comparison until the last bit of the pattern string finds an exact match, and the search is complete.
3.2 How to calculate TOC for the next array
The next array is evaluated based on “prefix” and “suffix”, i.e. next[j] equals P[0]… P[J-1] the longest length of the same prefix and suffix (please ignore j equals 0 for the moment, as explained below). Again, let’s use the table above as an example, which I’ve reproduced below for ease of reading.
j | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
Pattern string | A | B | C | D | A | B | D | ‘\ 0’ |
next[j] | – 1 | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
(1) j=0, for the first character of the pattern string, we are unified as next[0]=-1; (2) j=1, the preceding string is A, the longest length of the same prefix and suffix is 0, that is, next[1]=0; (3) j=2, the preceding string is AB, and the longest length of the same prefix and suffix is 0, that is, next[2]=0; (4) j=3, the preceding string is ABC, and its longest same prefix and suffix length is 0, that is, next[3]=0; (5) j=4, the preceding string is ABCD, and the longest length of the same prefix and suffix is 0, that is, next[4]=0; (6) j=5, the preceding string is ABCDA, the longest same prefix and suffix is A, that is, next[5]=1; (7) j=6, the preceding string is ABCDAB, its longest same prefix and suffix is AB, that is, next[6]=2; (8) j=7, the preceding string is ABCDABD, and the longest length of the same prefix and suffix is 0, that is, next[7]=0. So why is it possible to jump to a mismatch based on the length of the longest identical prefix or suffix? Here’s a typical example: If j=6 does not match, then we know that the string before the position is ABCDAB. If we look at the string carefully, there is an AB at the beginning and the end of the string. Since the D at j=6 does not match, why don’t we just take the C at j=2 and continue the comparison? This AB is the longest identical prefix and suffix of ABCDAB, whose length 2 is the subscript position of the jump. Some readers may have a question, if the match fails when j=5, according to the train of thought I explained, at this time, the character at j=1 should be taken to continue to compare, but the characters at these two positions are the same, both are B, since the same, it is useless to compare. In fact, it is not my explanation that there is a problem, nor is there a problem with the algorithm, but the algorithm has not been optimized, about this problem will be detailed in the following, but I suggest readers do not tangle here, skip this, you will naturally understand. The idea is so simple, the next problem is the code implementation, as follows:
*/ void GetNext(string P, int next[]) {int p_len = p.size (); int i = 0; // int j = -1; // next[0] = -1; while (i < p_len) { if (j == -1 || P[i] == P[j]) { i++; j++; next[i] = j; } else j = next[j]; }}Copy the code
Look confused, isn’t it… The code above evaluates the next value for each position, that is, the length of the longest identical prefix or suffix of the string that precedes each position. For specific analysis, THE code is divided into three parts: (1) What is the function of I? I is the subscript of the pattern string P. Starting at 0, we find the value of next[I]. This is very simple. (2) What is the role of J? From the next [I] = j; It is easy to infer that j represents the length of the element with the longest common prefix and suffix. (3) the if… else… What does the statement do? First of all, we must recognize the fact that if I =3, then the next thing we want to solve for is P[0]… The longest length of the same prefix for p[3] is next[4], not next[3], as evidenced by the following code:
i++;
j++;
next[i] = j;Copy the code
With this fact in mind, let’s break it down:
Suppose the positions of I and j are as shown above, given bynext[i]=j
Omega, which is the segment for position I0 to I – 1The longest identical prefix and suffix are respectively0 to j – 1 和I – j to I – 1That is, the two sections have the same content.
According to the algorithm,if(P[i]==P[j])
,i++; j++; next[i]=j;
; If not, thenj=next[j]
, as shown below:
next[j]
Is the meaning of0 to j – 1The length of the longest same prefix and suffix in the section is shown in the figure. The two ellipses on the left represent the longest same prefix and suffix, that is, the two ellipses represent the same content of the section. Similarly, I have the same two ellipses on the right-hand side. So the else statement just uses the first ellipse and the fourth ellipse to speed things up0 to I – 1The length of the same prefix and suffix for the section.
Careful friends will ask if statementj==-1
What is the meaning of existence? First, when the program first runs, j is initialized with -1 and goes straight aheadP[i]==P[j]
Judgment will doubtless be wrong; Second, else statementj=next[j]
, j is continuously backward, if j is assigned -1 in the backward (i.e. J =next[0]), inP[i]==P[j]
Judgment can also be wrong. In summary, the significance of the above two points is for special boundary judgment.
Four: Complete code TOC
/** ** author * date 2017-03-05 * mode C++ */ #include<iostream> #include<string> using namespace STD; */ void GetNext(string P, int next[]) {int p_len = p.size (); int i = 0; // int j = -1; // next[0] = -1; while (i < p_len) { if (j == -1 || P[i] == P[j]) { i++; j++; next[i] = j; } else j = next[j]; Int KMP(string S, string P, int next[]) {GetNext(P, next); int i = 0; // int j = 0; // int s_len = s.size (); int p_len = P.size(); While (I < s_len && j < p_len) {if (j = = | | S = = P) [j] [I] / / P of the first character does not match or S [I] = = P [j] {i++; j++; } else j = next[j]; } if (j == p_len) return i-j; return -1; } int main() { int next[100] = { 0 }; cout << KMP("bbc abcdab abcdabcdabde", "abcdabd", next) << endl; //15 return 0; }Copy the code
Five: KMP optimizes TOC
j | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
Pattern string | A | B | C | D | A | B | D | ‘\ 0’ |
next[j] | – 1 | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
Taking the table 3.2 as an example (copied above), if the match fails when j=5, according to the code 3.2, the characters at j=1 should be taken to continue the comparison, but the characters at these two positions are the same, both are B. Since they are the same, it is useless to take the characters for comparison. As I explained in 3.2, this happens because KMP is not perfect. So how do you rewrite the code to solve this problem? Very simple.
*/ void GetNextval(string P, int nextval[]) {int p_len = p.size (); int i = 0; // int j = -1; Nextval [0] = -1; while (i < p_len) { if (j == -1 || P[i] == P[j]) { i++; j++; if (P[i] ! = P[j]) nextval[i] = j; else nextval[i] = nextval[j]; } else j = nextval[j]; }}Copy the code
Here is also a reminder to all readers, KMP algorithm is strictly divided into KMP algorithm (not optimized version) and KMP algorithm (optimized version), so it is recommended that readers in the expression of KMP algorithm, the best to tell your version, because the two are very different in some cases, here is a simple talk. KMP algorithm (not optimized version) : The next array represents the longest length of the same prefix and suffix, we can not only use next to solve the pattern string matching problem, but also can be used to solve similar string repetition problems, etc., this kind of problem we can find in the major OJ, not to make too much description here. KMP algorithm (optimized version) : Easy to know from the code (name also changed to Nextval), optimized Next simply represents the length of the same prefix, but not necessarily the longest (I personally call it “optimal identical prefix”). At this point, we can get our answer to the pattern string matching problem much faster with the optimized next version than with the unoptimized version, but the optimized version doesn’t do anything about the string duplication problem mentioned above. So, which version you go for depends on the actual problems you have in the real world.
[1] Yan Weimin. Data Structure (C language version) [2] Ruan Yifeng. String matching KMP algorithm
– the ヾ (^ del ^ *)))