The introduction

String location operation is usually called string pattern matching, which is one of the most important operations in various string processing systems. This paper introduces four matching algorithms including Hash, KMP, BM and Sunday.

String Hash

String Hash is a Hash of a string, which is commonly understood as converting a string to an integer and constructing an injective of an integer corresponding to a string in an ideal state.

Given a string S, we specify:


Natural overflow method

The natural overflow Hash formula is:


The hash array here makes use of the natural overflow pair of unsigned long longModulus.

A single Hash method

The single hash formula is:


P and mod are both prime numbers, and p is less than mod. To reduce hash conflicts, make p and mod as large as possible.

Hash conflicts occur when a hash value corresponds to two or more strings

Choosing prime numbers can effectively avoid conflicts, as shown in the following example:

Let’s say that the function expression is zero

When N = 8 and y = 2:

When N = 7 and y = 2:

You can see that there’s less conflict in choosing prime numbers.

Optional prime numbers

Double Hash method

Take a string with different mod hashes twice, and then represent the two results as a binary to form oneTo the ideal injective of a string, the formula is:



Get the hash of the string

According to the definition:

And by the

It can be concluded that:

So the general formula is:

In addition, since we need to take modulo every time, and negative numbers may appear when we do subtraction, we can modify it after mod, and get the final formula for string hash:


KMP

String search algorithm (Knuth-Morris-Pratt), referred to as KMP algorithm, is often used to find the occurrence position of a pattern string P within a text string S. The time complexity is O(n+ M), where N is the length of the text string and M is the length of the pattern string.

The KMP algorithm first looks for the longest equal prefix and the suffix LPS in the pattern string.

Suppose the pattern string P is: “bcdebcd”

The available prefixes are: “b”, “BC”, “BCD”, “bcde”

The suffixes available are: “d”, “CD”, “BCD”, “ebcd”

LPS can be concluded as: “BCD”

So we can set up the position where the dp array stores the string again according to the LPS of the pattern string:

step1:

step2:

step5:

step6:

step7:

step8:

step9:

step11:

step12:

step19:

step20:

If match transition and mismatch transition are expressed in the form of finite-state automaton (FSM), the following state transition diagram can be obtained according to the dp array of pattern string bcbce [0,0,1,2,0] :

Note that the I used for text string traversal is different from the I used for BF, which refers to the starting position, and the I used for KMP, which refers to the current matching position.

BM

BM(Boyer-Moore) algorithm is different from KMP algorithm. The pattern string of KMP algorithm is traversed from front to back, while that of BM algorithm is traversed from back to front. The time complexity of BM algorithm is O(n/m) at best and O(nm) at worst, where N is the length of text string and M is the length of pattern string. BM algorithm has two rules, namely Bad character rule and Good prefix rule, according to which the two rules compete with each other for string matching.

Bad string rule

When a character in a text string with pattern of some characters do not match, we call this mismatch characters in a text string of bad character, pattern string need to move to the right, at this time of bits equal to the bad character in the pattern string corresponding to the location of the minus bad character in the pattern string the most right before the location (pay attention to the position of the character to the mismatch of the character before). In addition, if the “bad character” is not included in the pattern string, align the pattern string as a whole directly behind the character and continue the comparison. Bad characters are for text strings.

Let the text string S be “GCTTCTGCTACCTTTTGCGC” and the pattern string P be “CCTTTTGC”. The bad string matching rules are as follows:

When there is a bad character “C” in the S string, align the right-most character “C” in the P string with the bad character in the S string, and the P string moves 3 bits to the right accordingly (the position of the bad character “C” in the P string 4 minus the position of the bad character in the P string 1 is equal to 3).

step1:

step2:

step3:

Good suffixes rule

When characters are mismatched, the trailing digit is equal to the position of the good suffix in the pattern string minus the last position of the good suffix in the pattern string, and -1 if the good suffix does not appear again in the pattern string. Good suffixes are for pattern strings.

Set the text string S as “CGTGCCTACTTACTTACTTA” and the pattern string P as “CTTACTTAC”. The good suffix matching rules are as follows:

When P series already exist in the good suffix “TAC” match, because the target string (a “TAC”) on the alignment to the position of good suffix now, so P series of bits to the right for the good suffix position in P series 6 minus the good suffix in the string of the position of the last occurrence of 2 equal to 4 P, P series of the tail element and then start and matching.

step1:

step2:

step3:

BM algorithm is a combination of bad string (BC) and the good suffix rules (gs) to traverse, each step contrast two rules can skip the number of characters, skip the number of characters more strategy is used for this step, the next step pattern string in turn from right to left again and text string characters match the corresponding position, pay attention to the text string from left to right traversal, pattern string from right to left traversal.

“GTTATAGCTGATCGCGGCGTAGCGGCGAA”, for a text string S pattern string of P as “GTAGCGGCG”, BM algorithm matching process is as follows:

At the initial moment, the last character of the P string is mismatched with the corresponding character of the S string. If BC is used, six characters can be skipped, but if GS is used, characters cannot be skipped. Therefore, BC is used to move the P string to the right, so that the characters in the pattern string that match the bad character are aligned with the bad character.

step1:

step2:

step3:

step4:

The bad string rule and the suffix rule can be adopted separately, and the bad string rule is generally removed because the memory overhead of using the bad string rule is very high if the string is very long.

Sunday

The idea of Sunday algorithm is very similar to BM algorithm, except that the pattern string in Sunday algorithm is matched from left to right. When the matching fails, it pays attention to the next character of the last character in the main string to be matched. The rule is as follows:

  • If the character does not appear in the pattern string it is skipped, and the number of moves is equal to the pattern string length plus 1
  • If the character appears in a pattern string, the number of moves is equal to the pattern string length minus the character’s last position in the pattern string (the pattern string position is accumulated from 0).

The Sunday algorithm requires the creation of an offset table, which stores the offset of each character. Set P as pattern string, m as pattern string length, and w as character, then the offset can be calculated by:


For example, P = “leetcode” :

m=8

Shift [l] = 8-max (l position) = 8-0 = 8

Shift [e] = 8 – Max = 8-2 = 6

Shift [t] = 8 – Max = 8-3 = 5

Shift [c] = 8-max (c) = 8-4 = 4

Shift [o] = 8-max (o position) = 8-5 = 3

Shift [d] = 8-max (d position) = 8-6 = 2

Shift = m + 1 = 8 + 1 = 9

Set the text string S as “ATTAAGGCACATAC”, the pattern string P as “ACAT”, and the offset of the characters in the pattern string as:

Shift [A] = 4 – Max (A) = 4 – Max (A) = 2

Shift [C] = 4 – Max (C) = 4-1 = 3

Shift [T] = 4 – Max (T position) = 4-3 = 1

The matching process of Sunday algorithm is as follows:

step1:

step2:

step3:

step4:

END

String matching tries to write KMP first. If KMP is stuck, write Hash. If Hash doesn’t work, try BM. Although Sunday algorithm is fast, the most commonly used are KMP and Hash.