This is the fourth day of my participation in the August More text Challenge. For details, see: August More Text Challenge

We learned BF algorithm and RK algorithm in string matching algorithm (1), there is no more efficient string matching algorithm. Let’s talk about BM algorithm today.

BM algorithm

We can think of the pattern string matching the main string, as fixing the main string, and then the pattern string sliding back and forth. When mismatched characters are encountered, the BF and RK algorithms do this by sliding the pattern string back one bit and rematching from the first part of the pattern string. See the figure below.

Because of BF algorithm and RK algorithm, when it encounters mismatched characters, the pattern string just slides back one bit, so the time complexity is relatively high, then is there any algorithm that can slide a few more bits at a time? For example, if you encounter the character D in the main string A, since D is not in the pattern string, as long as d and the pattern string overlap, it will definitely not match. So we can just slide a few more bits, straight behind the D, and then continue matching, and that would be more efficient, wouldn’t it?

The BM algorithm that I’m going to talk about today is essentially looking for this pattern. With the help of this rule, when the pattern string and the main string encounter mismatched characters, it can skip some certain mismatched cases and slide back a few more bits.

BM algorithm principle

BM algorithm consists of two parts, namely bad character rule and suffix rule.

1. Bad character rules

In the BF algorithm and RK algorithm, in the process of matching pattern string and main string, we all match according to the order of the subscript of pattern string from small to large. The matching order of BM algorithm is the opposite, from large to small. As shown below.

Matches backwards from the end of the pattern string, and when a character in the main string is found not to match, it is called a bad character. We take the bad character D and look in the pattern string and find that D is not in the pattern string. At this point, we can slide the pattern string directly by 3 bits, to the end of the d character, and then start the comparison from the end of the pattern string.

At this point, we find that b in the main string does not match C in the pattern string. At this time, because the bad character B exists in the pattern string, and the position with subscript 1 in the pattern string is also the character B, we can slide the pattern string backward by 1 bit, so that the b in the main string and the B in the pattern string are relatively uniform. It is then rematched starting at the end of the pattern string.

From the above example, we can conclude the rule. When a mismatch occurs, we mark the characters in the pattern string corresponding to the bad character as Ai. If the bad character exists in the pattern string, we mark it as Bi below the pattern string. (If the bad character occurs multiple times in the pattern string, we mark the lower part of the pattern string as Bi to prevent the pattern string from sliding back too much and missing possible matches.) The number of bits that the pattern string slides backwards is ai-BI.

But it’s not enough to just use the bad character rule. Because the number of moving bits calculated by AI-BI may be negative. For example, the main string is aaaaaa, and the mode string is baaa. Therefore, the BM algorithm also needs to use the “good suffix rule”.

2. Good suffix rules

The good suffix rule is similar to the bad character rule. See the figure below.

When the pattern string slides to its position in the graph, the pattern string and the main string match two characters, and the third to last character does not match. We call the matched ab the good suffix, which we call {u}. We use it to search the pattern string for another substring {u*} that matches {u}. Let’s slide the pattern string to where the substring {u*} is aligned with the main string {u}.

If we can’t find another substring equal to {u} in the pattern string, we just slide the pattern string directly after the main string {u}. Because any of the previous slidebacks didn’t match the main {u} string. However, if there is no substring equal to {u} in the pattern string, we simply slide the pattern string to the end of {u} to see if we miss a possible match. As shown below, ab here is a good suffix, although there is no other matching substring {u*} in the pattern string, if we move the pattern string to the end of {u}, we miss the case where the pattern string matches the main string.

So, when the pattern string slides to where the prefix partially overlaps the {u} suffix in the main string, and the overlaps are equal, it is possible to have an exact match. In this case, we should not only look at whether the suffix in the pattern string has another matching substring. We also want to see if the suffix substring of the good suffix matches the prefix substring of the pattern string.

Here we will explain the suffix substring and prefix substring again. The suffix substring of string A is A substring whose last character is aligned with A. For example, the suffix substring of string ABC is c and BC. A prefix substring is A substring whose beginning character is aligned with A. For example, the string ABC is prefixed with a, ab. From the good suffix substring, we find the longest one that matches the prefix substring of the pattern string, say {v}. Then slide to the position shown in the figure.

So far, we’re done with bad characters and good suffixes, so let’s think about this next question. When the pattern string does not match a character in the main string, do we choose the good suffix rule or the bad character rule to calculate the number of bits to slide backwards?

We can calculate the number of digits that the bad character rule and the suffix rule slide back, respectively, and then take the largest of the two numbers as the number of digits that the pattern string slides back.

Let’s stop talking about it today, and give me a like if you like it. For more hardcore knowledge, please pay attention.