The storage structure of a string
Like linear tables, strings have two basic storage structures: sequential storage and chained storage. But considering the storage efficiency and the convenience of the algorithm, the sequential storage structure is adopted.
Sequential storage of 1 string
Char char [MAXSTRLEN+1]; char char [MAXSTRLEN+1]; int length; }SString;Copy the code
This definition is static. The size of the string space is determined at compile time. In most cases, the operation of the string is in the form of the whole string, and the length of the string variable varies greatly during the operation. Exist in C, this one is called a “Heap” (Heap) of free storage, can be generated for each new dynamic allocation of a string of a string of actual storage space needed for long, if the distribution is successful, it returns a pointer to the starting address, as base of string, the string stored in strings of pile type sequential storage structure, defined as follows:
typedef struct{
char *ch;
int length;
}HString;Copy the code
String pattern matching algorithm
The location operation of substrings is usually called pattern matching or string matching. This operation is widely used, such as in search engines, spell checking, language translation, data compression and other applications, all need to carry out string matching.
Let S be a substring and also a text string. Set T as a substring, also known as the pattern, in the main string S to find the substring matching pattern T, if the match is successful, determine the position of the first character in the matched substring in the main string S. The famous pattern matching algorithm has BF algorithm and KMP algorithm
BF algorithm
C + + language
#include<cstring>
#include<iostream>
using namespace std;
#define OK 1
#define ERROR 0
#define OVERFLOW -2
typedef int Status;
#define MAXSTRLEN 255 #define MAXSTRLEN 255Char char [MAXSTRLEN+1]; char char [MAXSTRLEN+1]; int length; }SString; Status StrAssign(SString &T, char *chars) {// Generate a string T int I whose value is chars;if (strlen(chars) > MAXSTRLEN)
return ERROR;
else {
T.length = strlen(chars);
for (i = 1; i <= T.length; i++)
T.ch[i] = *(chars + i - 1);
return OK;
}
}
int Index(SString S, SString T, int pos)
{
int i,j;
i=pos;
j=1;
while(i<=S.length && j<=T.length){
if(S.ch[i]==T.ch[j]){
++i;
++j;
}
else{ i=i-j+2; j=1; }}if(j>T.length){
return i-T.length;
}
else return 0;
}
int main()
{
SString S;
StrAssign(S,"ababcabcacbab");
SString T;
StrAssign(T,"abcac");
cout<<"Primary string and substring at no."<<Index(S,T,1)<<"First match at character \n";
return 0;
}Copy the code
The Java language
package one;
public class BF {
public static int find(String str1,String str2){
char[] c1=str1.toCharArray();
char[] c2=str2.toCharArray();
int i=0;
int j=0;
while(i<str1.length() && j<str2.length()){
if(c1[i]==c2[j]){
i++;
j++;
}
else{ i=i-j+1; j=0; }}if(j>=str2.length()) return i-str2.length();
else return- 1; } public static void main(String[] args) { String str1="ababcabcacbab";
String str2="abcac";
int n=find(str1,str2);
System.out.println(str2+"In"+str1+The position appearing in "is:+n); }}Copy the code
The javascript language
function find(str1,str2){
var c1=str1.split("");
var c2=str2.split(""); var i=0; j=0;while(i<str1.length&&j<str2.length){
if(c1[i]==c2[j]){
i++;
j++;
}
else{ i=i-j+1; j=0; }}if(j>=str2.length){
return i-str2.length;
}
else return- 1; } var str1="ababcabcacbab";
var str2="abcac";
console.log(find(str1,str2));Copy the code
Think of the javascript indexOf() method for string
function find2(str1,str2){
return str1.indexOf(str2);
}
var str1="ababcabcacbab";
var str2="abcac";
console.log(find2(str1,str2));Copy the code
Wouldn’t that make it easier? (1) In the best case, every unsuccessful match occurs when the first character in the pattern string is compared with the corresponding character in the main string. For example: S=”aaaaaba” T=”ba” Let the main string length be n and the substring length be m if the number of successful character comparisons in the ith run is M, then the total number of comparisons is i-1+m. For the main string that is successfully matched, its starting position is from 1 to n-M +1. Assume that the probability of successful matches at the n-M +1 starting position is equal. The average comparison times of successful matching in this case can be calculated as
O(n+m)
(i-1)*m
i*m
O(n*m)
KMP algorithm
Before I start this algorithm, I’ll introduce two concepts: prefixes and suffixes
As shown in the figure above, “prefix” refers to all combinations of the heads of a string except the last character; Suffixes refer to all combinations of the ends of a string except the first character.
To better illustrate this algorithm, let’s look at an example: The primary string S=”abcdefgab” the pattern string T=”abcdex” the initial letter “A” of the pattern string T is not equal to any character in the following string “bcdex”, which means that the first character “A” of the pattern string T cannot be equal to the second through fifth characters of the S string, I don’t have to go back to I minus j plus 2 like BF did, so I is equal to 6. But what if the T string starts with “A”?
S=”abcabcabc” T=”abcabx” T=”abcabx” T=”abcabx” T=”abcabx” T=” ABcabx “T=” ABcabx” T=” ABcabx “T=” ABcabx” Since the first “a” of T is equal to the fourth “a” of T and the second “B” is equal to the fifth “B”, the first character “a” of T and the second character “b” are also equal to the fourth and fifth characters of S without any further comparison. We just have to directly compare the values at I =6 and j=3. There is no need to backtrack the value of I. The change of j has nothing to do with the main string. The key is whether there are repeated substrings in the structure of the T string.
Since there are no duplicate characters in T=”abcdex”, the j value changes from 6 to 1, and at T=”abcabx”, the prefix “ab” is equal to the suffix “ab” before the final “x”, so the j changes from 6 to 3. Therefore, we can draw the rule that the value of j depends on the similarity of the prefix and suffix of the string before the current character. We define the variation of j values at various positions in the T string as an array next.
Next Array value derivation
Example:
j | 123456789 |
---|---|
Pattern string T | ababaaaba |
next[j] | 011234223 |
j | 123456789 |
---|---|
Pattern string T | aaaaaaaab |
next[j] | 012345678 |
j | 123456 |
---|---|
Pattern string T | abcabx |
next[j] | 011123 |
C++ code implementation
#include<cstring>
#include<iostream>
using namespace std;
#define OK 1
#define ERROR 0
#define OVERFLOW -2
typedef int Status;
#define MAXSTRLEN 255 #define MAXSTRLEN 255typedef char SString[MAXSTRLEN+1]; Status StrAssign(SString T,char *chars){int I;if(strlen(chars)>MAXSTRLEN)
return ERROR;
else{
T[0]=strlen(chars);
for(i=1; i<=T[0]; i++){ T[i]=*(chars+i-1); }return OK;
}
}
void get_next(SString T,int next[]){
int i=1,j=0;
next[1]=0;
while(i<T[0]){
if(j = = 0 | | T = = [I] T [j]) {/ / T [I] said the suffix of a single character + + I; //T[j] indicates a single character prefix ++j; next[i]=j; }elsej=next[j]; }} int Index_KMP(SString S,SString T,int pos,int next[]) {int I =pos; Int j=1; if pos is not 1, int j=1;while(i<=S[0]&&j<=T[0]){
if(j==0||S[i]==T[j]){
++i;
++j;
}
else j=next[j];
}
if(j>T[0])
return i-T[0];
else return 0;
}
int main(){
SString S;
StrAssign(S,"aaabaaaab");
SString T;
StrAssign(T,"aaaab");
int *p=new int[T[0]+1];
get_next(T,p);
cout<<"Primary string and substring at no."<<Index_KMP(S,T,1,p)<<"First match at character \n";
return 0;
}Copy the code
KMP pattern matching algorithm improved
If the primary string is S=”aaaabcde” T=”aaaaax”, the next array value is 01234. At the beginning, I =5,j=5,b is not equal to z,j= next[5]=4, “b” is still equal to the fourth “a”,j= next[4]=3; . J =next[1]=0, according to the algorithm,++ I,++j, I =6,j=1. So there are four extra steps in this process. Next [j] = next[j] = next[j] = next[j] = next[j]
Void get_next(SString T, int next[]) {void get_next(SString T, int next[]); next[1] = 0;while (i < T[0]){
if (j == 0 || T[i] == T[j])
{
++i;
++j;
if(T[i]! =T[j]) next[i]=j;else next[i]=next[j];
}
elsej = next[j]; }}Copy the code
Improved nexT1 array value derivation
Example:
j | 123456789 |
---|---|
Pattern string T | ababaaaba |
next[j] | 011234223 |
next1[j] | 010104210 |
j | 123456789 |
---|---|
Pattern string T | aaaaaaaab |
next[j] | 012345678 |
next1[j] | 000000008 |
Java version of KMP algorithm
package one; public class KMP { public static int[] getNext(String b) { int len=b.length(); int j=0; int next[]=new int[len+1]; //next represents the longest common part of the prefix and suffix of a string of length I, starting at 1 next[0]=next[1]=0;for(int i=1; i<len; I ++)// I represents the subscript of the string, starting at 1 {//j represents the value of next[I] at the beginning of each loop, as well as the next position to comparewhile(j>0&&b.charAt(i)! =b.charAt(j)) j=next[j];if(b.charAt(i)==b.charAt(j))
j++;
next[i+1]=j;
}
return next;
}
public static void search(String original, String find, int next[]) {
int j = 0;
for (int i = 0; i < original.length(); i++) {
while(j > 0 && original.charAt(i) ! = find.charAt(j)) j = next[j];if (original.charAt(i) == find.charAt(j))
j++;
if (j == find.length()) {
System.out.println("find at position " + (i - j+1));
System.out.println(original.subSequence(i - j + 1, i + 1));
j = next[j];
}
}
}
public static void main(String[] args){
String s1="abcaababcabacc";
String s2="ababcab"; int[] next=getNext(s2); search(s1,s2,next); }}Copy the code
Javascript version of the interested children’s shoes can write their own, not here.