Here we go. String matching algorithm

Live each day like it could be your last.

Live each day as if it was your last.

The introduction

Avenue to simple, my ideal is to use the simplest code to achieve the most beautiful algorithm. String matching is widely used, such as Java’s indexOf(), JS’s whole-family bucket set (find, indexOf,include…). And so on. This article mainly shares the string matching algorithms that they depend on at the bottom. Two simple, two complicated. Without further comment, all source code has been uploaded to Github: link

Ps: BF algorithm is the most simple string matching algorithm, need to know and know why. The RK algorithm understands it to be BF + HASH. As for the BM and KMP behind, it is ok to understand its principle. The predecessors have already built the wheel, and they can think of using it when needed, but it is quite difficult to write it out.

BF algorithm

Bf algorithm commonly known as naive matching algorithm, why is it called this name, because it is very violent, in the main string, check the starting position are 0, 1, 2… N-m substrings of length m, n-m+1, and see if there’s anything that matches the pattern string.

parsing

Here I loop is the trace main string TXT,j is the trace pattern string pattern, first peripheral first determine the access times, tlen-plen.

J loop for comparison. Maybe the only thing that’s hard to understand here is I + j. Look at the test results.

    private int bfSearch(String txt, String pattern) {
        int tLen = txt.length();
        int pLen = pattern.length();
        if (tLen < pLen) return- 1;for (int i = 0; i <= tLen - pLen; i++) {
            int j = 0;
            for (; j < pLen; j++) {
                System.out.println(txt.charAt(i + j) + "--" + pattern.charAt(j));
                if(txt.charAt(i + j) ! = pattern.charAt(j))break;
            }
            if (j == pLen) return i;
        }
        return- 1; }Copy the code

variant

The function of I and j is the same as that of normal I +j. However, when a mismatch is found, the two Pointers of I and j need to be backdated. J returns to the beginning and I points to the next character.

    private int bfSearchT(String txt, String pattern) {
        int tLen = txt.length();
        int i = 0;

        int pLen = pattern.length();
        int j = 0;

        for (; i < tLen && j < pLen; i++) {
            System.out.println(txt.charAt(i) + "--" + pattern.charAt(j));
            if (txt.charAt(i) == pattern.charAt(j)) {
                ++j;
            } else{ i -= j; j = 0; }}if (j == pLen) {
            System.out.println("end... i = " + i + ",plen = " + pLen);
            return i - pLen;
        }
        return- 1; }Copy the code

The test code

Ps: hello world

    public static void main(String[] args) {
        BFArithmetic bf = new BFArithmetic();
        String txt = "hello world";
        String pattern = "world";
        int res = bf.bfSearch(txt, pattern);
        System.out.println(BF algorithm matching result: + res);
//        int resT = bf.bfSearchT(txt, pattern);
//        System.out.println("BF algorithm (display rollback) matching results :" + resT);
    }Copy the code

The test results

Fairly RK algorithm

Rk algorithm is equivalent to an advanced version of BF algorithm, which mainly introduces the hash algorithm. Reduced time complexity. The n- M +1 substrings in the main string are hashed by the hash algorithm, and then the size is compared with the hash value of the pattern string one by one. If the hash value of a substring is equal to the pattern string, then the corresponding substring matches the pattern string. Because the hash value is a number, comparing numbers for equality is very fast, so the efficiency of pattern string and substring comparisons is improved.

Initialize the

Here we prefab the pattern string, generate the corresponding hash value, and then randomly generate a large prime number for subsequent use.

    private RKArithmetic(String pattern) {
        this.pattern = pattern;
        pLen = pattern.length();
        aLen = 256;
        slat = longRandomPrime();
        System.out.println("Random primes:" + slat);
        aps = 1;
        // aLen^(pLen - 1) % slat
        for (int i = 1; i <= pLen - 1; i++) {
            aps = (aLen * aps) % slat;
        }
        patHash = hash(pattern, pLen);
        System.out.println("patHash = " + patHash);
    }Copy the code

To prepare

Randomly generate a large prime number

    private static long longRandomPrime() {
        BigInteger prime = BigInteger.probablePrime(31, new Random());
        return prime.longValue();
    }Copy the code

The hash algorithm

    private long hash(String txt, int i) {
        long h = 0;
        for (int j = 0; j < i; j++) {
            h = (aLen * h + txt.charAt(j)) % slat;
        }
        return h;
    }Copy the code

Verify that the string matches

    private boolean check(String txt, int i) {
        for (int j = 0; j < pLen; j++)
            if(pattern.charAt(j) ! = txt.charAt(i + j))return false;
        return true;
    }Copy the code

implementation

The implementation is still relatively easy to read, but instead of comparing hash values, the comparison is made.

    private int rkSearch(String txt) {
        int n = txt.length();
        if (n < pLen) return- 1; long txtHash =hash(txt, pLen);
        if ((patHash == txtHash) && check(txt, 0))
            return 0;

        for (int i = pLen; i < n; i++) {
            txtHash = (txtHash + slat - aps * txt.charAt(i - pLen) % slat) % slat;
            txtHash = (txtHash * aLen + txt.charAt(i)) % slat;
            int offset = i - pLen + 1;
            System.out.println("The first" + offset + "Sub-txthash =" + txtHash);
            if ((patHash == txtHash) && check(txt, offset))
                return offset;
        }
        return- 1; }Copy the code

The test code

    public static void main(String[] args) {
        String pat = "world";
        String txt = "hello world";
        RKArithmetic searcher = new RKArithmetic(pat);
        int res = searcher.rkSearch(txt);
        System.out.println("RK algorithm matching result:" + res);
    }Copy the code

The test results

BM algorithm

The BM algorithm wheel has been built. It is said to be the most efficient and commonly used string matching algorithm.

Analysis of the

The core idea is to greatly reduce the number of comparisons and reduce the time complexity by sliding the pattern string backward along the main string in big strides. The key of the algorithm lies in how to balance the steps of enough and no omission, and at the same time to improve the efficiency of execution. This requires the pattern string to follow both the bad character rule and the good suffix rule as it slides backwards, along with a few tricks.
Bad character rule: The characters in the pattern string and the main string are compared bit by bit from back to front. When the bad character that does not match is found, the subscript value si of the pattern string is recorded, and the nearest position xi before the subscript si of the bad character in the pattern string is found (if there is no bad character, it is marked as -1). Si-xi is the backward sliding distance. (PS: I think plus xi must be in front of si, that is smaller than si, so we don’t have to worry about the calculated distance being negative). But the bad character rule doesn’t slide back long enough, so it needs the good suffix rule.
Good suffix rule: Compares the characters in the pattern string to the main string bit by bit, stopping when a bad character occurs. If there is a substring {u} that has been successfully matched, the nearest {u} is found before {u} in the pattern string and is denoted as {u’}. The pattern string is then moved back so that {u’} of the pattern string overlaps {u} of the main string. If {u’} does not exist, the pattern string is moved directly after {u} in the main string. In order not to miss anything, you need to find the longest suffix substring of good suffixes that matches the prefix substring of the pattern string (which is also the suffix substring of the pattern string). Then move the pattern string to the right to its left edge to align the suffix substring of this good suffix with the left edge of the main string.

To prepare

Construct bad character hash table

private void generateBC(char[] patChars, int[] records) {
    for (int i = 0; i < aLen; i++) {
        records[i] = -1;
    }
    for(int i = 0; i < patChars.length; Int ASCII = (int) patChars[I]; int patChars[I]; records[ascii] = i; } System.out.println("Bad character hash table:");
    print(records);
}Copy the code

Good suffix

private void generateGS(char[] patChars, int[] suffix, boolean[] prefix) {
    int pLen = patChars.length;
    for(int i = 0; i < pLen; ++ I) {// suffix[I] = -1; prefix[i] =false;
    }
    for(int i = 0; i < pLen - 1; ++i) { int j = i; Int k = 0;while(j >= 0 && patChars[j] == patChars[pLen - 1 - k]) { --j; ++k; // suffix[k] = j+1; // Suffix [k] = j+1; } // If the common suffix substring is also the prefix substring of the pattern stringif (j == -1) prefix[k] = true; }}Copy the code

mobile

private int moveByGS(int index, int pLen, int[] suffix, boolean[] prefix) { int k = pLen - 1 - index; // Good suffix lengthif (suffix[k] != -1) return index - suffix[k] + 1;
    for (int i = index + 2; i <= pLen - 1; i++) {
        if (prefix[pLen - i])
            return i;
    }
    return- 1; }Copy the code

implementation

Suffix In the pattern string, look for another substring that matches the good suffix
Prefix Records whether the suffix substring of a pattern string matches the prefix substring of a pattern string

Private int bmSearch(String TXT, String pattern) {// record the last position of each character in the String int[] records = new int[aLen]; char[] txtChars = txt.toCharArray(); int tLen = txtChars.length; char[] patChars = pattern.toCharArray(); int pLen = patChars.length; generateBC(patChars, records); int[] suffix = new int[pLen]; boolean[] prefix = new boolean[pLen]; generateGS(patChars, suffix, prefix); Int index = 0; // int index = 0;while(index <= tLen - pLen) { int i = pLen - 1; // The pattern string matches from the back to the frontfor(; i >= 0; -- I) {// The subscript in the pattern string for bad characters is Iif(txtChars[index + i] ! = patChars[i])break;
        }
        if (i < 0) {
            return index;
        }
        int x = i - records[(int) txtChars[index + i]];
        int y = 0;
        if (i < pLen - 1) {
            y = moveByGS(i, pLen, suffix, prefix);
        }
        System.out.println("x = " + x + ",y = " + y);
        index = index + Math.max(x, y);
    }
    return- 1; }Copy the code

The test code

    public static void main(String[] args) {
        BMArithmetic bmArithmetic = new BMArithmetic();
        String txt = "hello world";
        String pattern = "world";
        int res = bmArithmetic.bmSearch(txt, pattern);
        System.out.println("BM algorithm matching result:" + res);
    }
Copy the code

The test results

BM Heuristic (Lite)

Analysis of the

BM algorithm is worthy of the so-called linear level of calculation, it is said that the efficiency of KMP algorithm is 3~4 times, have time must check.

If you have a problem and it doesn’t work, try to get√ new ideas.

Initialize the

private BoyerMoore(String pattern) { this.pattern = pattern; pLen = pattern.length(); int aLen = 256; records = new int[aLen]; // Initialize the record array. The default is -1for(int i = 0; i < aLen; i++) { records[i] = -1; } // The right-most position in which the character in the pattern string appearsfor(int j = 0; j < pLen; j++) { records[pattern.charAt(j)] = j; }}Copy the code

implementation

According to the name skip, two keywords can also be analyzed in reverse order and skip.

    private int bmSearch(String txt) {
        int tLen = txt.length();
        int skip;
        for (int i = 0; i <= tLen - pLen; i += skip) {
            skip = 0;
            for (int j = pLen - 1; j >= 0; --j) {
                System.out.println(txt.charAt(i + j) + "--" + pattern.charAt(j));
                if(txt.charAt(i + j) ! = pattern.charAt(j)) { skip = j - records[txt.charAt(i + j)];if (skip < 1) skip = 1;
                    break; }}if (skip == 0) return i;
        }
        return- 1; }Copy the code

The test code

    public static void main(String[] args) {
        String txt = "hello world";
        String pattern = "world";
        BoyerMoore bm = new BoyerMoore(pattern);
        int res = bm.bmSearch(txt);
        System.out.println("BM algorithm matching result:" + res);
    }Copy the code

The test results

KMP algorithm

Analysis of the

The KMP algorithm introduces an invalidation function — the next array. The key to this algorithm is how the next function is calculated. Amazing? No, it’s pins and needles. It’s hard to understand. Only debug step by step with.

To prepare

K = next[k]. Since the next character in the longest string of the previous one is not equal to the last one, we need to find the second long string of the previous one. The problem becomes finding the longest string from 0 to next(k). If the next character is not the same as the last one, we continue to find the second long string, that is, the next next(k), until we find it, or none at all.

    private int[] getNext(char[] patChars, int pLen) {
        int[] next = new int[pLen];
        next[0] = -1;
        int k = -1;
        for (int i = 1; i < pLen; i++) {
            while(k ! = -1 && patChars[k + 1] ! = patChars[i]) { k = next[k]; }if (patChars[k + 1] == patChars[i])
                ++k;
            next[i] = k;
        }
        System.out.println("Good prefix:");
        print(next);
        return next;
    }Copy the code

implementation

    private int kmpSearch(String txt, String pattern) {
        char[] txtChars = txt.toCharArray();
        int tLen = txtChars.length;
        char[] patChars = pattern.toCharArray();
        int pLen = patChars.length;
        int[] next = getNext(patChars, pLen);
        int index = 0;
        for (int i = 0; i < tLen; i++) {
            while(index > 0 && txtChars[i] ! = patChars[index]) { index = next[index - 1] + 1; } System.out.println(txtChars[i] +"--" + patChars[index]);
            if (txtChars[i] == patChars[index])
                ++index;
            if (index == pLen)
                return i - pLen + 1;
        }
        return- 1; }Copy the code

The test code

    public static void main(String[] args) {
        KMPArithmetic kmpArithmetic = new KMPArithmetic();
        String txt = "hello world";
        String pattern = "world";
        int res = kmpArithmetic.kmpSearch(txt, pattern);
        System.out.println("KMP algorithm matching result:" + res);
    }Copy the code

The test results

KMP algorithm (based on dFA of determined Priority State Automata)

Initialize the

    private KMPByDFA(String pattern) {
        this.pattern = pattern;
        this.pLen = pattern.length();
        int aLen = 256;
        dfa = new int[aLen][pLen];
        dfa[pattern.charAt(0)][0] = 1;
        int i = 0;
        for (int j = 1; j < pLen; j++) {
            for(int k = 0; k < aLen; Dfa [k][j] = dfa[k][I]; } dfa[pattern. CharAt (j)][j] = j + 1; I = dfa[pattern.charat (j)][I]; }}Copy the code

implementation

    private int kmpSearch(String txt) {
        int i = 0;
        int j = 0;
        int tLen = txt.length();
        for(; i < tLen && j < pLen; i++) { j = dfa[txt.charAt(i)][j]; } // Find a match and reach the end of the pattern stringif (j == pLen)
            return i - pLen;
        return- 1; }Copy the code

The test code

    public static void main(String[] args) {
        String txt = "hello world";
        String pattern = "world";
        KMPByDFA kmp = new KMPByDFA(pattern);
        int res = kmp.kmpSearch(txt);
        System.out.println("BM algorithm matching result:" + res);
    }Copy the code

The test results

end

Your likes and attention are the biggest support for me, thank you!

Here we go. String matching algorithm

The introduction

BF algorithm

parsing

variant

The test code

The test results

Fairly RK algorithm

Initialize the

To prepare

implementation

The test code

The test results

BM algorithm

Analysis of the

To prepare

implementation

The test code

The test results

BM Heuristic (Lite)

Analysis of the

Initialize the

implementation

The test code

The test results

KMP algorithm

Analysis of the

To prepare

implementation

The test code

The test results

KMP algorithm (based on dFA of determined Priority State Automata)

Initialize the

implementation

The test code

The test results

end

Related Posts

What is CPU virtualization technology? What are the advantages

MySQL > import SQL file from Navicat } mistake? Speed to solve

Stop brute force matching strings, efficient KMP is really sweet!