BF algorithmAnalysis of the

If the primary string is “abcdefgab” and the pattern string is “abcdex” (each bit of the pattern string is different), then according to the BF algorithm, when comparing the “F” of the primary string, the match fails, the pattern string falls back to the starting position “A”, the primary string falls back to the character “B”, the match fails, the primary string is subscript to “C”, and so on…

In fact, because each bit of the pattern string is different, when the first match fails, the part “A” to “E” that has already been matched successfully, so the “B” to “E” of the main string cannot match the first character “A” of the pattern string. Therefore, we can omit this part of the loop. The diagram below:

Explore the principle of KMP pattern matching algorithm

We then use I to represent traversal of the primary string (I starts at 1) and j to represent traversal of the pattern string (j starts at 1).

The essence of KMP algorithm is to make use of known information (pattern string), modify the new value according to its repeated value (the main string does not go back), and get the most appropriate position to compare with the main string in the next round:

Let’s look at another example: suppose the main string is “abcababca” and the pattern string is “abcabx”. How do we compare? According to the above experience, when I =6&&j=6 is compared in the first round, the matching fails, because the first three ABCs are different from each other, so I =2&&j=1 I =3&&j=1 is a meaningless comparison, as shown in the figure below:

So, let’s think about, is the fourth step necessary? The first two “ab” values are the same as the fourth and fifth “ab” values in the main string. The first two “ab” values are the same as the fourth and fifth “ab” values in the main string. This step can also be improved (directly from I =6 j=3) :

KMP pattern matching algorithm — Next array derivation

We define the j-value variation at each position of the pattern string as a next array, so the length of next is the length of the pattern string, and the subscript of next also starts from 1, with the first digit being 0 by default, i.e. next[1] = 0. In KMP, I doesn’t go backwards, I either doesn’t move, or it moves backwards with J.

Derivation of next with mode string “abcedx” :

  • J =1 matches conflict, next[1] = 0
  • J =2, j from 1 to j-1, only character “A”, next[2]=1
  • When j=3, there is only character “AB” in the range from j 1 to J-1, and there is no equality. Next [3]=1
  • When j=4, there is only character “ABC” in the range from j 1 to J-1, and there is no equality. Next [4]=1
  • When j=5, there is only character “abcd” in the range from j 1 to J-1, and there is no equality. Next [5]=1
  • When j=6, there is a match conflict, and there is only character “abcde” in the range from j 1 to J-1. There is no equality, and next[6]=1

Derivation of next when mode string is “abcabx” :

  • J =1 matches conflict, next[1] = 0
  • J =2, j from 1 to j-1, only character “A”, next[2]=1
  • When j=3, there is only character “AB” in the range from j 1 to J-1, and there is no equality. Next [3]=1
  • When j=4, there is only character “ABC” in the range from j 1 to J-1, and there is no equality. Next [4]=1
  • When j=5, there is a matching conflict, and there is a character “ABCA” in the range from j 1 to J-1, and there is a repetition. We find the repeated pre-character and post-character, and get the pre-character “A” and post-character “A”. At this time, it is unnecessary to backtrace J to A, because A must exist, so it is directly backtraced to B, that is, next[5]=2
  • When j=6, there is a matching conflict, and there is a character “abCAB” in the range of J from 1 to J-1. There is a repetition, and the leading character “AB” and the trailing character “ab” are obtained. At this time, it is unnecessary to backtrace J to A and B, and it is directly backtraced to C to get next[6]=3

If two characters are equal, then next is equal to 3. If n characters are equal, then next is equal to n+1

Compute the next array of pattern string “ababaaaba” as shown above:

  • J <4,next[1]=0,next[2]=1,next[3]=1;
  • J =4,next[4]=2;
  • J =5,next[5]=3;
  • J =6, note that the same characters are “ABA” and “ABA” respectively (the longest equal character when the front and back are not exactly equal), the middle a is both before and after,next[6]=4;
  • J =7,next[7]=2;
  • J =8,next[8]=2;
  • J =9, the same characters are “ab” and “ab”,next[9]=3;

Try “aaaaaaaab” again:

  • J =1, next[1] = 0
  • J =2, only “a”,next[2]=1;
  • J =3,next[3]=2;
  • J =4,next[4]=3;
  • J =5,next[5]=4;
  • J =6,next[6]=5;
  • J =7,next[7]=6;
  • J =8,next[8]=7;
  • J =9, the same characters are “aaaaAAA” and “aaaaAAA” respectively,next[9]=8;

Evaluate the next array with code

Suppose the main string is “abcababca”, the pattern string is “abcdex”, and the next array is {0,1,1,1,1,1}. When traversal reaches a match failure, j is backtraced to next[j]

Let’s simulate the execution process:

  • In the first round of comparison, when I =4 and j=4, the matching fails, and j is reversed to the position of next[4], that is,j= next[4]=1.

  • Continue traversal, I =6,j=3 matching failure, back to j=next[3]=1 to continue comparison

  • And so on……
So how do you evaluate the next array in code?

Suppose that the primary string S=’abcababca’ and the pattern string T=’abcdex’

  1. Default next[1] = 0;

2, I =0,j=1 start traversal, when j< t. length, traversal the string from 1-length

3, if I =0, the same string is not found in the range [0,j], so I should be backtraced to 1, indicating next[j]=1

4, if T[I]=T[j], next[j]= I;

5. If neither of the above conditions is met, then I is reversed to the previous recorded position of next[I]

Simulation calculation steps (at the beginning I =0,j=1) :

1, I = 0, j = 1, T [I]! I++,j++,next[2]=1; At this point I = 1, j = 2

2, T[i]! =T[j] && i! =0, in this case, I needs to be reversed to the position of next[I], that is, I =0, I =0,j=2

3, T[i]! = T [j] && I = 0, said there was no repeated characters within the scope of the I – j, i++, j++, next [3] = 1, I = 1, j = 3

4, T[i]! =T[j] && i! =0, in this case, I needs to be backtraced to the position of next[I], that is, I =0, I =0,j=3

5, T[i]! = T [j] && I = 0, said there was no repeated characters within the scope of the I – j, i++, j++, next [4] = 1, I = 1, j = 4

6, T[i]! =T[j] && i! =0, in this case, I needs to be back to the position of next[I], that is, I =0, I =0,j=4

7, T[i]! = T [j] && I = 0, said there was no repeated characters within the scope of the I – j, i++, j++, next [5] = 1, this time I = 1, j = 5

8, T[i]! =T[j] && i! =0, in this case, I needs to be reversed to the position of next[I], that is, I =0, I =0,j=5

9, T[i]! = T [j] && I = 0, said there was no repeated characters within the scope of the I – j, i++, j++, next [6] = 1, I = 1, j = 6

10. Break out of the loop

The next array is [0,1,1,1,1,1]

There are four cases when evaluating the next array:

1, default next[1]=0;

I ++,j++,next[j]=1;

3, when T[I]==T[j], the same character is found, ready to start the next loop I ++, j++, and next[j]= I; (Because the next array starts at 1 and the traversal starts at 0, do ++ first and then assign)

4, when T [I]! =next[I] =next[I] =next[I]

The following code is used to calculate:

// Return the next array of substring T by calculating; // Note that string T[0] is the length of the stored string; The real character content starts at T[1]; void get_next(String T,int *next){ int i,j; j = 1; i = 0; next[1] = 0; //abcdex // traverses the pattern string T, where T[0] is the length of the pattern string T; //printf("length = %d\n",T[0]); while (j < T[0]) { //printf("i = %d j = %d\n",i,j); If (I = = 0 | | T = = [I] T [j]) {/ / T [I] said the suffix of a single character; //T[j]; ++i; ++j; next[j] = i; //printf("next[%d]=%d\n",j,next[j]); }else {// If characters are different, the value of I is backloaded; i = next[i]; }}}Copy the code

Use of the next array

KMP ideas:

1, traverses primary string S, I is the index used to mark primary string, traverses pattern string T, j is the index used to mark pattern string;

2, when I > S.l ength | | j > T.l ength; I > s.length, j< t.length, indicating that the main string traversal did not find a match; There is only one possibility, which is when j> t. length means a matching string has been found

3. When j = 0, it means that the mode string needs to be compared from position 1 to position I +1

4, if S[I] == T[j], the position character of mode string j and main string I is equal, I ++,j++;

5, when j>0 and S[I]! =T[j], indicating that this matter needs to move the j of the pattern string, so we let j=next[j], in order to reduce the number of comparisons

//KMP matching algorithm // return the position of the substring T after the pos character in the main string S, or 0 if it does not exist; Int Index_KMP(String S,String T,int pos){// I = pos; int j = 1; // Define an empty next array; int next[MAXSIZE]; // Next array; get_next(T, next); count = 0; T[0] and S[0] store the length of string T and string S; // If I is less than the length of S and j is less than the length of T, the cycle continues; While (I < = S [0] && < = j T [0]) {/ / if equal to two letters, and j++ i++ if (j = = 0 | | S [I] T [j]) = = {i++; j++; }else{// if there is no match,j falls back to the appropriate position, and I remains unchanged; j = next[j]; } } if (j > T[0]) { return i-T[0]; }else{ return -1; }}Copy the code