String pattern matching algorithm, do you know?

The storage structure of a string

Like linear tables, strings have two basic storage structures: sequential storage and chained storage. But considering the storage efficiency and the convenience of the algorithm, the sequential storage structure is adopted.

Sequential storage of 1 string

Char char [MAXSTRLEN+1]; char char [MAXSTRLEN+1]; int length; }SString;Copy the code

This definition is static. The size of the string space is determined at compile time. In most cases, the operation of the string is in the form of the whole string, and the length of the string variable varies greatly during the operation. Exist in C, this one is called a “Heap” (Heap) of free storage, can be generated for each new dynamic allocation of a string of a string of actual storage space needed for long, if the distribution is successful, it returns a pointer to the starting address, as base of string, the string stored in strings of pile type sequential storage structure, defined as follows:

typedef struct{
      char *ch;
      int length;
  }HString;Copy the code

String pattern matching algorithm

The location operation of substrings is usually called pattern matching or string matching. This operation is widely used, such as in search engines, spell checking, language translation, data compression and other applications, all need to carry out string matching.

Let S be a substring and also a text string. Set T as a substring, also known as the pattern, in the main string S to find the substring matching pattern T, if the match is successful, determine the position of the first character in the matched substring in the main string S. The famous pattern matching algorithm has BF algorithm and KMP algorithm

BF algorithm

C + + language

#include<cstring>
#include<iostream>
using namespace std;

#define OK 1
#define ERROR 0
#define OVERFLOW -2
typedef int Status;
#define MAXSTRLEN 255 #define MAXSTRLEN 255Char char [MAXSTRLEN+1]; char char [MAXSTRLEN+1]; int length; }SString; Status StrAssign(SString &T, char *chars) {// Generate a string T int I whose value is chars;if (strlen(chars) > MAXSTRLEN)
        return ERROR;
    else {
        T.length = strlen(chars);
        for (i = 1; i <= T.length; i++)
            T.ch[i] = *(chars + i - 1);
        return OK;
    }
}
int Index(SString S, SString T, int pos)
{
    int i,j;
    i=pos;
    j=1;
    while(i<=S.length && j<=T.length){
        if(S.ch[i]==T.ch[j]){
            ++i;
            ++j;
        }
        else{ i=i-j+2; j=1; }}if(j>T.length){
        return i-T.length;
    }
    else return 0;    
}
int main()
{
    SString S;
    StrAssign(S,"ababcabcacbab");
    SString T;
    StrAssign(T,"abcac");
    cout<<"Primary string and substring at no."<<Index(S,T,1)<<"First match at character \n";
    return 0;
}Copy the code

The Java language

package one;

public class BF {
    public static int find(String str1,String str2){
        char[] c1=str1.toCharArray();
        char[] c2=str2.toCharArray();
        int i=0;
        int j=0;
        while(i<str1.length() && j<str2.length()){
            if(c1[i]==c2[j]){
                i++;
                j++;
            }
            else{ i=i-j+1; j=0; }}if(j>=str2.length()) return i-str2.length();
        else return- 1; } public static void main(String[] args) { String str1="ababcabcacbab";
        String str2="abcac";
        int n=find(str1,str2);
        System.out.println(str2+"In"+str1+The position appearing in "is:+n); }}Copy the code

The javascript language

function find(str1,str2){
    var c1=str1.split("");
    var c2=str2.split(""); var i=0; j=0;while(i<str1.length&&j<str2.length){
        if(c1[i]==c2[j]){
            i++;
            j++;
        }
        else{ i=i-j+1; j=0; }}if(j>=str2.length){
        return i-str2.length;
    }
    else return- 1; } var str1="ababcabcacbab";
var str2="abcac";
console.log(find(str1,str2));Copy the code

Think of the javascript indexOf() method for string

function find2(str1,str2){
    return str1.indexOf(str2);
}
var str1="ababcabcacbab";
var str2="abcac";
console.log(find2(str1,str2));Copy the code

Wouldn’t that make it easier? (1) In the best case, every unsuccessful match occurs when the first character in the pattern string is compared with the corresponding character in the main string. For example: S=”aaaaaba” T=”ba” Let the main string length be n and the substring length be m if the number of successful character comparisons in the ith run is M, then the total number of comparisons is i-1+m. For the main string that is successfully matched, its starting position is from 1 to n-M +1. Assume that the probability of successful matches at the n-M +1 starting position is equal. The average comparison times of successful matching in this case can be calculated as

O(n+m)

(i-1)*m
i*m

O(n*m)

KMP algorithm

Before I start this algorithm, I’ll introduce two concepts: prefixes and suffixes

As shown in the figure above, “prefix” refers to all combinations of the heads of a string except the last character; Suffixes refer to all combinations of the ends of a string except the first character.

To better illustrate this algorithm, let’s look at an example: The primary string S=”abcdefgab” the pattern string T=”abcdex” the initial letter “A” of the pattern string T is not equal to any character in the following string “bcdex”, which means that the first character “A” of the pattern string T cannot be equal to the second through fifth characters of the S string, I don’t have to go back to I minus j plus 2 like BF did, so I is equal to 6. But what if the T string starts with “A”?

S=”abcabcabc” T=”abcabx” T=”abcabx” T=”abcabx” T=”abcabx” T=” ABcabx “T=” ABcabx” T=” ABcabx “T=” ABcabx” Since the first “a” of T is equal to the fourth “a” of T and the second “B” is equal to the fifth “B”, the first character “a” of T and the second character “b” are also equal to the fourth and fifth characters of S without any further comparison. We just have to directly compare the values at I =6 and j=3. There is no need to backtrack the value of I. The change of j has nothing to do with the main string. The key is whether there are repeated substrings in the structure of the T string.

Since there are no duplicate characters in T=”abcdex”, the j value changes from 6 to 1, and at T=”abcabx”, the prefix “ab” is equal to the suffix “ab” before the final “x”, so the j changes from 6 to 3. Therefore, we can draw the rule that the value of j depends on the similarity of the prefix and suffix of the string before the current character. We define the variation of j values at various positions in the T string as an array next.

Next Array value derivation

Example:

j	123456789
Pattern string T	ababaaaba
next[j]	011234223

j	123456789
Pattern string T	aaaaaaaab
next[j]	012345678

j	123456
Pattern string T	abcabx
next[j]	011123

C++ code implementation

#include<cstring>
#include<iostream>
using namespace std;

#define OK 1
#define ERROR 0
#define OVERFLOW -2
typedef int Status;
#define MAXSTRLEN 255 #define MAXSTRLEN 255typedef char SString[MAXSTRLEN+1]; Status StrAssign(SString T,char *chars){int I;if(strlen(chars)>MAXSTRLEN)
        return ERROR;
    else{
        T[0]=strlen(chars);
        for(i=1; i<=T[0]; i++){ T[i]=*(chars+i-1); }return OK;
    }
}
void get_next(SString T,int next[]){
    int i=1,j=0;
    next[1]=0;
    while(i<T[0]){
        if(j = = 0 | | T = = [I] T [j]) {/ / T [I] said the suffix of a single character + + I; //T[j] indicates a single character prefix ++j; next[i]=j; }elsej=next[j]; }} int Index_KMP(SString S,SString T,int pos,int next[]) {int I =pos; Int j=1; if pos is not 1, int j=1;while(i<=S[0]&&j<=T[0]){
        if(j==0||S[i]==T[j]){
            ++i;
            ++j;
        }
        else j=next[j];
    } 
    if(j>T[0])
        return i-T[0];
    else return 0;    
}
int main(){
    SString S;
    StrAssign(S,"aaabaaaab");
    SString T;
    StrAssign(T,"aaaab");
    int *p=new int[T[0]+1];
    get_next(T,p);
    cout<<"Primary string and substring at no."<<Index_KMP(S,T,1,p)<<"First match at character \n";
    return 0; 
}Copy the code

KMP pattern matching algorithm improved

If the primary string is S=”aaaabcde” T=”aaaaax”, the next array value is 01234. At the beginning, I =5,j=5,b is not equal to z,j= next[5]=4, “b” is still equal to the fourth “a”,j= next[4]=3; . J =next[1]=0, according to the algorithm,++ I,++j, I =6,j=1. So there are four extra steps in this process. Next [j] = next[j] = next[j] = next[j] = next[j]

Void get_next(SString T, int next[]) {void get_next(SString T, int next[]); next[1] = 0;while (i < T[0]){
        if (j == 0 || T[i] == T[j])
        {
            ++i;
            ++j;
            if(T[i]! =T[j]) next[i]=j;else next[i]=next[j];
        }
        elsej = next[j]; }}Copy the code

Improved nexT1 array value derivation

Example:

j	123456789
Pattern string T	ababaaaba
next[j]	011234223
next1[j]	010104210

j	123456789
Pattern string T	aaaaaaaab
next[j]	012345678
next1[j]	000000008

Java version of KMP algorithm

package one; public class KMP { public static int[] getNext(String b) { int len=b.length(); int j=0; int next[]=new int[len+1]; //next represents the longest common part of the prefix and suffix of a string of length I, starting at 1 next[0]=next[1]=0;for(int i=1; i<len; I ++)// I represents the subscript of the string, starting at 1 {//j represents the value of next[I] at the beginning of each loop, as well as the next position to comparewhile(j>0&&b.charAt(i)! =b.charAt(j)) j=next[j];if(b.charAt(i)==b.charAt(j))
                j++;  
            next[i+1]=j;  
        }  

        return next;  
    }  
    public static void search(String original, String find, int next[]) {  
        int j = 0;  
        for (int i = 0; i < original.length(); i++) {  
            while(j > 0 && original.charAt(i) ! = find.charAt(j)) j = next[j];if (original.charAt(i) == find.charAt(j))  
                j++;  
            if (j == find.length()) {                 
                System.out.println("find at position " + (i - j+1));  
                System.out.println(original.subSequence(i - j + 1, i + 1));  
                j = next[j];
            }  
        }  
    }
    public static void main(String[] args){
        String s1="abcaababcabacc";
        String s2="ababcab"; int[] next=getNext(s2); search(s1,s2,next); }}Copy the code

Javascript version of the interested children’s shoes can write their own, not here.