Understand KMP
The KMP algorithm is more convoluted in the string matching algorithm. It is mainly necessary to understand the necessity of solving KMP next array and the basis of j backtracking; It’s easy to get bald when it comes to understanding KMP algorithms. This algorithm can be understood a few more times, in the process of understanding more thoroughly;
KMP algorithm is also a famous pattern matching algorithm. It was created by D.E.Knuth,J.H. Morris and VR.Pratt. Published a pattern matching algorithm. Repeated traversal can be greatly avoided;
Principle of KMP pattern matching algorithm
Case 1:
For example, suppose you now have a main string S = “abcdefgab”; Mode string T = “abcdex”; If the storm algorithm is used, the first five letters are exactly equal until the sixth letter.’f’ is not equal to ‘x’; The following figure;
Next, according to the outbreak algorithm, we need to perform the following procedures (②③④). That is, when the main string I =2,3,4,5,6, the first letter is not equal to the first letter of the substring T;
This is how we execute the outbreak algorithm. But for the matching substring T, the initial letter “a” of “abcdex” is not equal to any character in the following string “bcdex”;
That is, since “a” is not equal to any character in the substring that follows it. So for ① in the figure above, the first five characters are equal. That means that the substring T and the initial character “a” cannot match the second through fifth digits of the S string. Then the judgment of figure ②, ③ and ④ is redundant;
The idea of the KMP algorithm is to try to use this known information, not to move the “search location” back to the location where it has been compared, but to move it further back, thus increasing efficiency
== The key to understanding KMP! = =
- If we know that the first letter “A” in a string of T is not equal to the following characters in the string of T (note that this is the premise, how to determine this will be explained later);
- And the second “B” of the T string is equal to the second “B” of the S string in the first figure.
- So that means that the first character “A” in the T string and the second character “b” in the S string are undecided. And that they cannot be equal; So figure 2 can be omitted!
- Similarly, if we know that the first letter “A” in T string is not equal to the following characters in T string, “A” in T string and “C”, “D” and “e” in S string can also be combined in figure 1 to determine that they are not equal. ②③④⑤ is not necessary.
- Just keep ①⑥! The following figure;
- The reason for reserving the sixth judgment is that it is in ①
T[6] ! = S[6]
Even though we already know thatT[1] ! = T[6]
. But it is not certain that T[1] is not equal to S [6]Therefore, step 6 needs to be retained.
Case 2: What if the T string retains the initial character “A”?
For example, suppose you now have a main string S = “abcababca”; Mode string T = “abcdex”;
- For the beginning, the first five characters are exactly equal. The sixth character is not equal; If the picture below is ①.
- The first character of T, “a”, is not equal to the second character, “b”, or the third character, “C”. So there’s no judgment to be made. Then steps ② and ③ are unnecessary;
- Because the first “A” of T is equal to the fourth “A” of T, and the second “B” is equal to the fifth “B”;
- And by the first comparison, the fourth “A” and the fifth “B” have already been compared with the corresponding positions in the main string S. Is equal;
- Therefore, the first character “A” and the second character “b” of T need not be compared with the fourth letter “A” and the fifth letter “b” of S. It has to be the same.
- Therefore, ④ / ⑤ is unnecessary.
- In other words, some unnecessary judgment steps can be omitted for characters equal to the first character in the substring.
- As shown in the following picture, the first two “a” and “b” in the right picture T string are omitted to match the 4,5 position characters in the S string!
- Comparing these two examples, we can see that at ①, our I value, that is, the current position of the main string is 6;
- ②③④⑤, the value of I is 2,3,4,5. By ⑥, the value of I is back to 6.
- In the storm matching algorithm, the I value of the main string is constantly backtracked;
- In this analysis, there was a lot of unnecessary backtracking;
- The KMP algorithm is designed to make unnecessary backtracking happen!
Since I can’t go back, it can’t go down; So the change you’re considering is the j value;
Through the analysis just understood, we have repeatedly mentioned the comparison between the first character of T string and the character after itself; It is found that if there are equal characters, the change of j value will be different. In other words, the change in the value of j in relation to the pivot string. It depends more on whether there is duplication in the structure of the T string;
In the figure above, since T = “abcdex” has no repeating digits, j changes from 6 to 1;
In the figure below, since T = “abcabx”, the prefix “ab” is equal to the suffix “ab” in the string before the final “x”. So j goes from 6 to 3;
Therefore, the law is drawn: the number of j value depends on the similarity of the string prefix and suffix before the current character;
Derivation of _next array value in KMP matching algorithm
Case 1: There are no duplicate characters in the pattern string
In the case of example 1, there are no repeating characters in the pattern string T; So in this case, the next array, the derivation works in case 1 and case 3 of the formula. So the next array is = {011111};
Case 2: Pattern string similar to “abcabx”
In this case, when j = 1,2,3,4, it’s the same as in the previous case;
-
If j = 1, next[1] = 0;
-
Next [2…4] = {0111}; next[2…4] = {0111};
-
When j = 5, j has the character “abCA” from 1 to J-1, and the prefix “A” is equal to the suffix “a”.
- Belongs to Case 2:
P [1,2-1] and P [5-2+1,5-1] to P1 = P4. So k = 2, and next[5] = 2
;
- Belongs to Case 2:
-
When j = 6, j from 1 to j-1 has the character “abcab”, and the prefix “ab” is equal to the suffix “ab”
- Belongs to Case 2:
P [1,3-1] and p[6-3+1,6-1]. So k = 3, then next[6] = 3
;
- Belongs to Case 2:
Rule of thumb: if the prefix and suffix are equal by one character, the K value is 2; If two characters are equal, it is 3; K equals n plus 1;
Case 3: Pattern string similar to “ababaaaba”
According to the rule for finding K summarized earlier! K can be converted from equal characters;
Also note that:
next
Arrays compare contiguous prefix and suffix characters; Such as whenj=6
When, string”ababa
“, the largest prefix is”aba
“, the suffix is also”aba
“.- when
j = 7
When, string”ababaa
“, where the prefix is”a
“, the suffix is”a
“. Don’t mistake it for”ab
“;
Case 4: Pattern string like “aaaaAAaab”
In this case, you just need to pay attention to the overlap that just happened.
Realization of KMP pattern matching algorithm
Evaluate the next array of substring T: Evaluating the next array is essentially evaluating the traceback position of the string;
KMP Next array backtracking position solution process simulation
-
Suppose that the main string S = “abcababca”; Mode string T = “abcdex”
-
According to the derivation process of the formula above, we should actually be 011111 for the next array. But how do we use this code to solve for the next array?
-
What does 011111 mean?
-
This means that abcdex doesn’t have a traceable address in the pattern string at all. That is, when the main string and the pattern string do not match, both need to be compared from the first position;
-
For example, abcabAbca compares abcdex to the fourth position and finds a mismatch. The index of the main string stays the same;
-
The index j of the pattern string is equal to 4 at that time. If a mismatch is found, a backtrace is performed. So where do we go back?
-
j = next[j] ; j = 1; The fourth character of abcbabca is compared to the first character of the pattern string;
-
Next [1] = 0; Means that it needs to be traversed from zero;
-
Next, design two index subscripts I = 0, j = 1;
-
I is used to solve the address of the traceback, while j is used for traversal of the pattern string
-
If I = 0 appears, it means that the same character does not appear in the pattern string at this time. It is necessary to record the backtracking address at this time. The address is 1. next[j] = 1; Indicates the need to compare from the first character of the pattern string;
-
If T[I] == T[j] = next[j] = I; if T[I] == T[j] == T[j] = next[j] = I;
-
If neither I = 0 and T[I] == T[j] indicates that there is no backtracking adjustment from the range [I,j], we backtrack I to next[I];
-
This process we analyze by legend! This is the KMP algorithm. It is also one of the most difficult matching algorithms to understand;
- At I = 0, and
j = 1
That is, we need to find[0, 1]
The range of characters a stores cases that need to be traced. - And this is just the letter A, and the current
i = 0
, so if the main string and pattern string match, the second character does not match. Then the main string should be moved back 1 bit, and the recalculation of the first bit of the pattern string begins; - But at this time
i=0,j = 1
; So we shouldi++,j++;
- And will be
next[j] = i
; So ‘next[2] = 1; - Said that when
j=2
Character mismatch is found when. The subscript index of the control mode string comparison needs to be traced back to 1.
- At this time
i = 1,j = 2
; Understand that there is no reasonable position to go back to within the range of [0,2]; - So will
i
Back to thenext[1]
Is the.i = next[i];
- So what if
i = 1,next[1] = 0
; That’s where 0 is; 0 means we need the next character to compare, we need the head to start the comparison; - So if it’s
next[i]
Is equal to some other position, which means I goes back to some other position. And only if I has a possibility to go back to something else.
- At this time due to
i = 0
, so that means if you need to start from scratch; - So the position of the head refers to
1
. Soi++,j++
; Because this is whenj = 3
If there is no match, go back to the starting position. So j is also ➕➕; - At the same time
next[j] = i; next[3] = 1
;
- At this time
i = 1, j = 3
; So we’re going to see if there’s a possibility of backtracking within that range; - At a time when
T[i] ! = T[j], a ! = c
, so you should go back to where you started; - That is
i = next[i]
; So I is equal to 0. - Start from the beginning next time.
- At this time
i = 0
Which means whenj = 4
Is the need to start from scratch; - then
i++,j++
And at the same timenext[j] = i
; - At this point,
i = 1,j = 4
.next[4] = 1
;
- At this point in
[1, 4]
Range to see if there is a position that can reduce the backtrace length - but
T[i]! =T[j], a ! = d
, so this means that when we need to start the comparison again; - then
i = next[i]
; At this timei = next[i] ; i = 0;
- At this time
i = 0
Which means whenj = 5
Is the need to start from scratch; - then
i++,j++
And at the same timenext[j] = i
; - At this point,
i = 1,j = 5
.next[5] = 1
;
- At this point in
(1, 5)
Range to see if there is a position that can reduce the backtrace length - but
T[i]! =T[j], a ! = e
, so this means that when we need to start the comparison again; - then
i = next[i]
; At this timei = next[i] ; i = 0;
- At this time
i = 0
Which means whenj = 6
Is the need to start from scratch; - then
i++,j++
And at the same timenext[j] = i
; - At this point,
i = 1,j = 6
.next[6] = 1
;
- At this time
j = 6
. If the loop condition cannot be met, the loop will exit;
conclusion
There are four cases in solving the next array:
- The default
next[1] = 0;
- when
i=0
, indicating that the current ratio should start from scratch. thei++,j++,next[j] = i
; - when
T[i] == T[j]
Indicates that the two characters are equali++,j++
. At the same timenext[j] = i
; - when
T[i] ! = T[j]
Is not equal, you need to return I to a reasonable position. thei = next[i]
;
thennext
How are arrays applied to KMP lookups?
When character comparison does not match, if it is a violent search method;
-
Main string and pattern string do not match when I = 4 and j = 4. So the next comparison
-
I = 2,j = 1, start the comparison again;
-
When KMP is used, a mismatch is found when I = 4 and j = 4, and the next comparison is made.
-
I retains the position of 4 without rollback; J = next[J];
-
Next [j] = 1, next[j] = 1, next[j] = 1
Next array solution of KMP algorithm
//----KMP pattern matching algorithm -- //1. // Note that string T[0] is the length of the stored string; The real character content starts at T[1]; void get_next(String T,int *next){ int i,j; j = 1; i = 0; next[1] = 0; //abcdex // traverses the pattern string T, where T[0] is the length of the pattern string T;printf("length = %d\n",T[0]);
while (j < T[0]) {
printf("i = %d j = %d\n",i,j);
if(I = = 0 | | T = = [I] T [j]) {/ / T [I] said the suffix of a single character; //T[j]; ++i; ++j; next[j] = i;printf("next[%d]=%d\n",j,next[j]);
}else{// If the characters are different, the I value is backdated; i = next[i]; }}}Copy the code
KMP Search Algorithm (1)
int count = 0; //KMP matching algorithm // return the position of the substring T after the pos character in the main string S, or 0 if it does not exist; Int Index_KMP(String S,String T,int pos){// I = pos; int j = 1; // Define an empty next array; int next[MAXSIZE]; // Next array; get_next(T, next); count = 0; T[0] and S[0] store the length of string T and string S; // If I is less than the length of S and j is less than the length of T, the cycle continues;while(I <= S[0] &&j <= T[0]) {// Continue if two letters are equal, and j++, I ++if(j == 0 || S[i] == T[j]){
i++;
j++;
}else{// if j does not match,j falls back to the appropriate position and I remains unchanged; j = next[j]; }}if (j > T[0]) {
return i-T[0];
}else{
return0; }}Copy the code
Optimization of KMP pattern matching algorithm
The KMP algorithm is also flawed. For example, if our primary string S = “aaaABCde” and the pattern string T = “aaAAax “. Where the next array is 012345;
- When you start matching, when
i = 5, j = 5
When we find characters"b"
With the characters"a"
When not equal, as shown in figure ①j = next[5] = 4
.
- Then the character is shown in figure ②
"b"
With the fourth position"a"
Still unequal;j = next[4] = 3;
As shown in figure 3.
- As shown in FIG. ④⑤; when
j = next[1] = 0
According to the algorithm, at this timei++,j++
; geti = 6,j = 1
. As shown in figure 6;
In fact, in the process of comparison, the backtracking comparison of steps ②③④⑤ is redundant judgment;
Since the second, third, fourth, and fifth positions of the T string all correspond to the first character"a"
Equal, so we can use the first digitnext[1]
To replace the subsequent character of its equivalentnext[j]
So let’s pairnext
Function to improve;
Next array =,1,2,3,4,5 {0} if you want to save just (2) (3) (4) (5) invalid comparison requires the next array =,0,0,0,0,5 {0}
Optimization of KMP pattern matching next array
-
NextVal = 0 when j = 1; Continue with the logic of next[1];
-
When j = 2, that is, when j = 2 a matching error occurs, then since the next value of the second character ‘b’ is 1 and the first character is ‘a’. Nextval [2] = nextval[2] = nextval[2] = nextval[2];
- when
j = 3
At this time because the3
A character'a'
的next
The value is 1, so we know the first digit'a'
And the third'a'
Is equal, thennextVal[3] = nextVal[1] = 0;
- when
j = 4
Because of the fourth character"b"
的next = 2
; So we know that it corresponds to the second character"b"
If is equal, thennextVal[4] = nextVal[2] = 1;
- when
j = 5
When,next
The value is 3, the fifth character"A"
And the third character"A"
Equal,nextVal[5] = nextVal[3] = 0
; - when
j = 6
When,next
The value is 4, the sixth character"A"
And the fourth character"B"
If not, thennextVal[6] = 4;
- when
j = 7
When,next
The value is 2, the seventh character"A"
And the second character"B"
If not, thennextVal[7] = 2;
- when
j = 8
When,next
The value is 2, the eighth character"B"
And the second character"B"
Equal,nextVal[6] = nextVal[2] = 1
; - when
j = 9
When,next
The value is 3, the ninth character"A"
And the third character"A"
Equal,nextVal[9] = nextVal[3] = 0;
KMP_NextVal array logic
There are 5 cases in which the nextVal array is evaluated:
- The default
nextval[1] = 0
; T[i] == T[j]
且++i,++j
后T[i]
Still is equal to theT[j]
则nextval[i] = nextval[j]
i = 0
, means to start from scratchi++,j++
Later, andT[i] ! = T[j], nextVal = j
;T[i] == T[j]
且++i,++j
后T[i] != T[j] ,则nextVal = j;
- when
T[i] ! = T[j]
Is not equal, you need to return I to a reasonable position. thei = next[i];
Code to implement the KMP_NextVal array
void get_nextVal(String T,int *nextVal){
int i,j;
i = 1;
j = 0;
nextVal[1] = 0;
while (i < T[0]) {
if(j == 0 || T[i] == T[j]) { ++j; ++i; // If the current character is different from the prefix, the current j is the value of nextVal at Iif(T[i] ! = T[j]) nextVal[i] = j;elseNextVal [I] = nextVal[j]; nextVal[I] = nextVal[j]; }else{ j = nextVal[j]; }}}Copy the code