The original link

preface

String matching algorithm is also frequently used in daily development. Of course, we can use regular matching to accomplish string matching, but there are many benefits to learning and understanding the relevant string matching algorithm for our technical growth.

define

String matching algorithm is a common problem in practical engineering, and also a common topic in written interview of various companies. This algorithm usually takes the original string (string) and substring (pattern) as inputs, and requires that the position of the first occurrence of the substring in the original string be returned.

1. The BF algorithm

BF (Brute Force) is the best algorithm to think of and implement. First, align the left end of the original string with the left end of the substring and compare them one by one. If the first character does not match, the substring moves back one bit to continue the comparison; If the first character matches, subsequent characters are compared until all match. Time complexity: O(nm). Where n is the length of the original string and m is the length of the substring.

function BF (src, dest) {
    var len1 = src.length,
        len2 = dest.length;
    var i = 0,
        j = 0;
    while (i < len1 && j < len2) {
        if (src[i] === dest[j]) {
            i++;
            j++;
        } else {
            i = i - j + 1;
            j = 0; }}if (j === len2) {
        return i - j;
    }
    return - 1;
}
Copy the code

2. Fairly RK algorithm

RK (Robin-Karp), hash search algorithm is an improvement on BF algorithm: in BF algorithm, every character needs to be compared, and we still need to compare all the remaining characters when we find the first character match. In the RK algorithm, a single comparison is attempted to determine whether the two are equal. RK algorithm can also carry out multi-pattern matching, and it is generally used in practical applications such as duplicate checking in papers.

First compute the HASH value of the substring, then compute the HASH value of the substring length of the original string respectively, and compare whether the two are equal: If the HASH value is different, the two must not match; if they are the same, due to the existence of HASH conflicts, they also need to determine again according to the BF algorithm.

In this example, the Hash value of substring “DEF” is Hd, and then the Hash value of “ABC”, “BCD”, “CDE”, and “DEF” is Ha, Hb, Hc, and Hd respectively. When Hd is equal, The substring “DEF” is still compared to the original string “DEF”. Time complexity: O(nm) (usually faster in practical applications, the expected time is O(n+m)).

To implement the RK algorithm, the most important thing is how to select the Hash function. Here we use the “division and remainder method” mentioned in the previous chapter “JavaScript Hashing.”

function hash (data) {
    var total = 0;
    for (var i = 0, len = data.length; i < len; i++) {
        total += 37 * total + data.charCodeAt(i);
    }
    total = total % 144451;
    return parseInt(total);
}

function isMatch (str, dest) {
    if(str.length ! == dest.length) {return false;
    }
    for (var i = 0; i < str.length; i++) {
        if(str[i] ! == dest[i]) {return false; }}return true;
}

function RK (src, dest) {
    if(! src || ! dest) { retunr- 1;
    }
    var len1 = src.length,
        len2 = dest.length;
    var destHash = hash(dest),
        index = - 1;
    for (var i = 0; i <= len1 - len2; i++) {
        var subStr = src.substr(i, len2);
        if (hash(subStr) === destHash && isMatch(subStr, dest)) {
            index = i;
            break; }}return index;
}
Copy the code

3. The KMP algorithm

KMP (Knuth-Morris-Pratt) algorithm, is one of the most classic string matching algorithm, is also the textbook of the top ten, has been voted as one of the greatest algorithms in the world today, Ruan Yifeng teacher also wrote a blog for KMP algorithm: “String matching KMP algorithm”; But the algorithm is widely considered obscure and difficult to implement. Below, I will not explain KMP, because ruan Yifeng teacher’s blog is easy to understand.

Before completing the KMP algorithm, we need to solve the most core problem: the generation of partial matching table. Partial matching table, commonly known as the analysis of the number of prefix and suffix matches of all strings in dest.

function getNext (str) {
    var res = [];
    var k = 0;
    for (var i = 0, len = str.length; i < len; i++) {
        if (i === 0) {
            res.push(0);
            continue;
        }
        while (k > 0&& str[i] ! == str[k]) { k = res[k -1];
        }
        if (str[i] === str[k]) {
            k++;
        }
        res[i] = k;
    }
    return res;
}
Copy the code

The implementation of this code is using DP (dynamic programming) ideas, more difficult to understand. To illustrate the legend, take the string ABCDABD as an example:

KMP algorithm implementation

function KMP (src, dest) {
    var next = getNext(dest);
    var len1 = src.length,
        len2 = dest.length;
    var i = 0,
        j = 0;
    while (i < len1 && j < len2) {
        if (src[i] === dest[j]) {
            i++;
            j++;
        } else {
            i = i - next[j] + 1;
            j = 0; }}if (j === len2) {
        return i - j;
    }
    return - 1;
}
Copy the code

4. The BM algorithm

The KMP algorithm mentioned above is not the most efficient algorithm and is not used much in practice. Most text editors use the Boyer-Moore algorithm for their “find” function (Ctrl+F).

The Boyer-Moore algorithm is not only efficient, but also clever and easy to understand. I recommend teacher Ruan Yifeng’s blog tutorial “Boyer-Moore Algorithm for String Matching” to help you understand, and there is no discussion here.

BM algorithm is more complex than other matching algorithms, and it is like the combination of KMP algorithm and Sunday algorithm.

5. Sunday algorithm

BM algorithm is not the most efficient algorithm, and Sunday algorithm is faster and easier to understand, which is a bit like a truncated version of BM algorithm.

If the SRC character does not match the dest character, check whether the next character (space after THIS word) in the SRC character appears in the dest character. If it exists, it is offset by the position that first appears from right to left; If not, offset the length of dest as a whole.

function getMoveLengthObj (str) {
    var resObj = {},
        len = str.length;
    for (var i = 0; i < len; i++) {
        resObj[str[i]] = len - i;
    }
    return resObj;
}

function Sunday (src, dest) {
    var moveObj = getMoveLengthObj(dest);
    var len1 = src.length,
        len2 = dest.length;
    var i = 0,
        j = 0;
    while (i < len1 && j < len2) {
        if (src[i] === dest[j]) {
            i++;
            j++;
        } else {
            i = i - j;
            var offset = moveObj[src[i + len2]];
            if (offset) {
                i += offset;
            } else {
                i += len2;
            }
            j = 0; }}if (j === len2) {
        return i - j;
    }
    return - 1;
}
Copy the code

Algorithm performance comparison

The name Spatial complexity Best time complexity Worst time complexity
BF algorithm T(1) O(nm) O(nm)
Fairly RK algorithm T(1) O(n + m) O(nm)
KMP algorithm T(m) O(n + m) O(nm)
BM algorithm T(2m) O(n) O(nm)
Sunday algorithm T(m) O(n) O(nm)

In addition to the string matching algorithm described above, there is a faster algorithm:

  • Python version (original) : FlashText
  • JavaScript version: flashtext.js

Refer to the link

  • Overview of string matching algorithms
  • BF, KMP, BM, Sunday algorithm explanation
  • Flashtext: a powerful tool for large-scale data cleaning