preface

When it comes to two string matching, it is natural to think of two layers of loop matching. In this way, we can realize whether a string contains another string. This algorithm is called BF algorithm.

BF algorithm

BF, also known as Brute Force, is a common pattern matching algorithm. The idea of BF is to match the first character of the target string S with the first character of the pattern string T. If they are equal, the second character of S is compared with the second character of T. If not, the second character of S is compared with the first character of T, and so on, until the final match is obtained.

BF algorithm implementation

public static int bruteforce(String s,String t){
    int length = s.length();// The length of the target string
    int plength = t.length();// The length of the pattern string

    // Loop over the target string
    for(int i=0; i<length-plength; i++){// Loop mode string
        int j=0;
        while((j < plength) && (s.charAt(i+j) == t.charAt(j))){
            j++;
        }
        if(j == plength){
            returni; }}return -1;
}
Copy the code

BF algorithm is a brute force algorithm without any optimization. It just uses two-layer cyclic comparison. When the string is large, the execution efficiency is very low and it is not suitable for comparing very large strings.

The algorithm is worst-caseThe time complexity is. Therefore, if we use force to search for ‘n’ strings in a string of ‘m’ characters, then we need to try n*m times.

Although brute force search is easy to implement and will always find a solution if it exists, its cost is proportional to the number of alternatives, and because of this, in many practical problems the cost of consumption increases rapidly with the size of the problem. Therefore, brute force search is often used when the problem size is limited, or when there are heuristic algorithms for a particular problem that can be used to reduce the set of candidate solutions to manageable sizes. It is also used when the simplicity of the implementation method is more important than the efficiency of the computation.

We saw that the brute force search algorithm, although it does not require preprocessing strings, is inefficient because it does a lot of unnecessary matching, so we need a more efficient algorithm.

AC automata algorithm

Aho — Corasick algorithm is a string search algorithm invented by Alfred V. Aho and Margaret J.Corasick for matching substrings in a preconstructed Trie tree in a string of input strings. It differs from regular string matching in that it matches all dictionary strings at once. The amortized algorithm has an approximate linear time complexity of about the length of the string plus the number of matches.

Suppose m is the length of the pattern, n is the length of the string to be searched, and k is the length of the alphabet. The algorithm needs to build the Trie tree of variation in advance, so it needsThe pretreatment time of. But the time complexity of the true search string is, greatly improve the string search time.

The algorithm mainly relies on constructing a finite state machine (similar to adding mismatched Pointers to a Trie tree). These additional mismatched Pointers allow for a rollback in the event of a string lookup failure (for example, if cat fails to match the word in the Trie tree, but there is another word cart in the Trie tree, the mismatched pointer will point to the prefix CA) and move to other branches of a prefix, avoiding repeated prefix matching and improving algorithm efficiency.

When a dictionary string collection is known (such as a computer virus library), the automaton can be extracted and stored offline for later use. In this case, the algorithm’s time complexity is the sum of the length of the input string and the number of matches, which is often the case with AC automaton algorithms.

structure

1. A Trie tree based on words or content. This is also called a goto table.

2. At the end of a word, there will be a character pointing to a Trie root or other branches that are the same as it. This is the back pointer, which avoids repeated prefix matching and improves algorithm efficiency. For example, the s at the end of the word his in the figure below points to s of she, so we can continue to search down on the original basis to reduce the number of prefix matches. This is also called a FAIL table.

To find the

The “child node” of the current node is searched first. If no match is found, the child of its suffix node is searched. If there is still no match, the child of the suffix node of the corresponding FAIL table is searched, and so on until the root node is reached.

When the algorithm finds a node, it prints all dictionary entries that end at the current location. The output step is to first find the dictionary suffix for the node, and then perform recursively until the node has no dictionary prefix. Also, if the node is a dictionary node, the node itself is printed.

implementation

There is an open source implementation on GitHub for those interested to peruse. Github.com/robert-bor/…

Is there an algorithm that doesn’t have to do so much up front, but can match quickly? The answer is the KMP algorithm.

KMP algorithm

It’s named after three inventors, starting with the famous scientist Donald Knuth.

The partial match table (PMT) is a partial match table. Let’s look at the partial matching table of pattern string “abababca” :

char: | a | b | a | b | a | b | c | a | index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | value: | | 0 0 | 1 | 2 | 3 | 4 | | 1 | 0Copy the code

The corresponding value is the partial matching table of pattern string “abababca”.

To understand partial match tables, we need to know prefixes and suffixes. “Prefix” means all header combinations of a string except the last character; “Suffix” refers to all combinations of the ends of a string except the first character.

Character string: Knuth Prefix: K, Kn, Knu, Knut Suffix: nuth, uth, th, HCopy the code

Now we find the partial match table of pattern string “abababca” based on “prefix” and “suffix”.

1. "a"The prefix and suffix of are empty sets with no common elements and length 0. 2."ab"The prefix is [a], the suffix is [b], there is no common element, length is 0; 3."aba"Is prefixed by [a, ab] and suffixed by [ba, a]"a", length 1; 4."abab"The prefix is [a, ab, ABA] and the suffix is [bab, ab, b]"ab", length 2; . 7."abababc"The prefix of is [A, ab, aba, abab, ababa, ababab], and the suffix is [bababc, ababc, babc, BC, c]. There is no common element and the length is 0. 8."abababca"Is prefixed with [A, ab, aba, abab, ababa, ababab, ababABC] and suffixed with [bababca, ababca, babca, bca, ca, a]"a", length 1;Copy the code

How do I look up a partial match table?

When we find a partial match, we can use the values in the partial match table to skip over (rather than redoing unnecessary old comparisons).

Move digit = number of matched characters - corresponding partial match table valueCopy the code

We give an example of matching the pattern string “abababca” with the string “bacbababaabcbab”.

bacbababaabcbab
 |
 abababca
Copy the code

First, since our pattern string starts with “A “, and then loops through the string, the second character is” A “, which means there’s a character match. Because the next character doesn’t match, we need to move. The number of bits moved is equal to the partial match table [number of characters matched -1], that is, the value of 0 in the lower table of the partial match table, which is 0. So we need to skip back 0 characters and continue the loop.

bacbababaabcbab
    |||||
    abababca
Copy the code

Now that 5 characters are matched, the number of characters to skip is 5-3 = 2, and we need to move back 2 bits.

Skip the characters / / x said bacbababaabcbab xx | | | abababcaCopy the code

Now that 3 characters are matched, the number of characters to skip is 3-1 = 2, and we need to move back 2 bits.

bacbababaabcbab
      xx|
        abababca
Copy the code

At this point, our pattern is longer than the rest of the characters in the text, so we know there is no match.

conclusion

BF algorithm belongs to the method of brute force cracking, using exhaustive search to realize the string, when the string is too long, the search speed is very slow.

Then there is the AC automata algorithm, which first builds a structure similar to the Trie tree and searches the corresponding string according to the Trie tree. Its time complexity is almost linear, but the Trie tree needs to be built in advance.

Is there an algorithm that is not exhaustive and does not require the Trie tree to be built in advance?

That’s the famous KMP algorithm, which avoids rechecking previously matched characters, our partial match table (PMT table), by using the fact that the word itself contains enough information at the time of the mismatch to determine where the next match will start.

PS: Clear mountains and clear waters begin with dust, knowledge is more valuable than diligence. Follow the wechat official account for the latest articles. Wechat official account: “Clean the dust chat”. Welcome to chat and talk about code.