KMP pattern matching algorithm

Ordinary string matching algorithm is very inefficient, I will explain the KMP fast matching algorithm for you.

Before I talk about THE KMP algorithm, let me talk about the general matching algorithm

To put it simply, a normal algorithm to match a string would loop through S until the string is matched or all is iterated.

char *S = '12351234';
char *T = '1234';
Copy the code

The complete code

#include <stdio.h>
#include <string.h>
int sel(char * S,char *T){
    int i=0,j=0;
    while (i<strlen(S) && j<strlen(T)) {// The loop starts
        if (S[i]==T[j]) {// First character comparison
            i++;
            j++;
        }else{
            i=i-j+1;// The match fails and returns to the next digit of the last match
            j=0;// The match continues from the first character of T and the next character of S}}if (j==strlen(T)) {// If j is equal to the length of T, the match is successful
        return i-strlen(T)+1;
    }
    return 0;
}
int main(a) {
    int add=sel("12351234"."1234");
    printf("%d",add);
    return 0;
}
// Run result: 5
Copy the code

General algorithm flow

S: 1, 2, 3, 5, 1, 2, 3, 4 // The first time I =1, j=1

T: 1, 2, 3, 4

S: 1 2 3 5 1 2 3 4 // Second time I =2, j=2

T: 1, 2, 3, 4

S: 1 2 3 5 1 2 3 4 // the third time I =3,j=3

T: 1, 2, 3, 4

S: 1 2 3 5 1 2 3 4 // The fourth failure I =1, j=0

T: 1, 2, 3, 4

The ordinary algorithm will be brainless cyclic matching, in fact, the efficiency is very low, have you noticed that, as long as the matching fails, it will start from the ith bit of T to match the first bit of S.

S: 1, 2, 3, 5, 1, 2, 3, 4 // The first time I =1, j=1

T: 1, 2, 3, 4

When T [0] = 1, T [1] = 2, S [1] = 2, T [0] clearly indicates T [1], T [1] = S [1], this is the key to the KMP algorithm, remove excess judgment.

The advantage of THE KMP algorithm compared with the ordinary matching algorithm is that on the premise of ensuring that the I pointer does not backtrack, when the matching fails, the matching string can be moved to the right by the maximum distance.

In order to ensure that the I pointer does not backtrack and function, only the J pointer can be backtracked. The backtracking distance of the j pointer is the distance that the T string moves to the right. The position of the J pointer corresponding to each character in T can be calculated by the algorithm, and the result will be stored in an array (array named next).

For each matching string (T), the first corresponding value is 0 and the second corresponding value is 1.

For example, find next for Y (1, 2, 3, 1, 2, 4, 5). The first two characters correspond to a fixed 0 and 1.

The third character “3” extracts the character “12”, “1” and “2” are not equal, the same number is 0,0 +1=1, so “3” corresponding to next is 1.

First, “1” and “3” do not want to wait, the same number is 0,0 +1=2, so “1” corresponding to next is 1.

The fifth character “2” extracts the character “1231”, the first “1” is the same as the last “1”, the same number is 1, 1+1=2, so “2” corresponds to the next is 2.

The sixth character “4” extracts the character “12312”. The first two “12” are the same as the last two “12”. The same number is 2,2 +1=3, so the corresponding “4” is 3.

The seventh character “5” extracts the character “123124”, the first character “1” and the last character “4” are not equal, the same number is 0,0 +1=1, so “5” corresponds to next is 1.

So we end up with the next array and the value is (0,1,1,1,2,3,1)

The specific algorithm is as follows

one

Y(subscript starting from 1) : "1231245" next array (subscript starting from 1) : 01 Because next[1]=0, end. "3" corresponds to the next value of 1. (As long as we loop until next[1]=0, next is 1 for this character.)Copy the code

two

Y: "1231245" next array: 011 the fourth character is "1" : because the first character "3" next value is 1, take Y[1]= "1" and "3" compare, continue; Because next[1]=0, end. The value of next corresponding to "1" is 1.Copy the code

three

Y: "1231245" next array: 0111 fifth character "2" : Y: "1231245" next array: 0111 fifth character "2" : Y: "1231245" next array: 0111 fifth character "2" : Y: "1231245" next array: 0111 fifth character "2" : Y: "1231245" next array: 0111 fifth character "2" : Y: "1231245" next array: 0111 fifth character "2" : Y: "1231245" next array: 0111 fifth character "2" : "2" corresponds to the next value of 2.Copy the code

four

Y: "1231245" next array: 01112 the sixth character is "4" : because the previous character "2" next value is 2, take Y[2]= "2" and "2" comparison, equal, 2 (the previous character "2" next)+1=3, "4" corresponding next value is 3.Copy the code

five

Y: "1231245" next array: 011123 the 7th character is "5" : because the last character "4" next value is 3, take Y[3]= "3" and "4" compare, continue, because next[3]=1, take Y[1]= "1" and "4" compare, end; "2" corresponds to the next value of 2.Copy the code

So we end up with the next array and the value is (0,1,1,1,2,3,1)

The algorithm implements the next array

#include <stdio.h>
#include <string.h>
void Next(char*T,int *next){
    int i=1;
    next[1] =0;
    int j=0;
    while (i<strlen(T)) {
        if (j==0||T[i- 1]==T[j- 1]) {
            i++;
            j++;
            next[i]=j;
        }else{ j=next[j]; }}}Copy the code

KMP algorithm

int KMP(char * S,char * T){
    int next[10];
    Next(T,next);// Initialize the next array according to the pattern string T
    int i=1;
    int j=1;
    while (i<=strlen(S)&&j<=strlen(T)) {
    
        if (j==0 || S[i- 1]==T[j- 1]) {// if j==0 then there is no match and the pointer moves down
        // If the match is successful, the pointer moves down
            i++;
            j++;
        }
        else{ j=next[j]; }}if (j>strlen(T)) {
        return i-(int)strlen(T);
    }
    return - 1;
}
/ / the main body
int main(a) {
    int i=KMP("ababcabcacbab"."abcac");
    printf("%d",i);
    return 0;
}

Copy the code

Running result: 6

Optimize the Next array

Such as:

T: A b C A C

next 0 1 1 1 2

In T, there are two characters “a”, and we assume that the first is A1 and the second is A2. In the program matching process, if the j pointer to A2 fails to match, then at this point, the I pointer does not move, j pointer to A1, obviously, because a1==a2, and a2! Is equal to S[I], so a1 is definitely not equal to S[I].

To avoid unnecessary judgments, we need to simplify the next array. For T, since T[4] == T[next[4]], we can change the next array to:

T: A b C A C

Next: 0, 1, 1, 0, 2 // If j==0 then there is no match and the j and I Pointers move down

void Next(char*T,int *next){
    int i=1;
    next[1] =0;
    int j=0;
    while (i<strlen(T)) {
        if (j==0||T[i- 1]==T[j- 1]) {
            i++;
            j++;
            if (T[i- 1]! =T[j- 1]) {
               next[i]=j;
            }
            else{ next[i]=next[j]; }}else{ j=next[j]; }}}Copy the code

KMP pattern matching algorithm