Naive string comparison

A naive string comparison is a comparison between the main string T and the matching string P, character by character, assuming that the main string has m characters and the matching string has N characters, then (m-n+1) N (m>n) are compared.

public static int simple(String t, String p) {
    if (t == null || p == null) {
        return -1;
    }
    if (t.length() < p.length()) {
        return -1;
    }
    char[] ct = t.toCharArray();
    char[] cp = p.toCharArray();
    for (int i = 0; i < ct.length - cp.length + 1; i++) {
        for (int j = 0; j < cp.length; j++) {
            if (ct[i + j] == cp[j]) {
                if (j == cp.length - 1) {
                    // It is fully matched
                    returni; }}else {
                break; }}}return -1;
}
Copy the code

The conjecture of simple algorithm

Each character of the main string is compared to the matching string, which is not necessary

A and B do not match, the matching string moves one position at a time, which is referred to as sliding. It is already known that the character B at the first position of the matching string does not match the character A at the first and second positions of the main string. If the character B matches directly, the main string does not match the character A.

Prefixes and suffixes

So how much does the matching string need to slide to match the current main string? Before you understand, first understand the concept of prefix and suffix

Prefix: The combination of all implementations of a string except the last letter.

Suffix: The entire implementation of a string except for the first letter

For example: abababa

Prefix: A, ab, ABA, abab, ababa, ababab

Suffixes: A, ba, aba, baba, ababa, bababa

Add a main string T: babaabababada, matching string P: abababa

First T[0] does not match P[0], and the next T[1] matches P[1], which, according to human thinking, is to move P one bit to the right.

T [1] = = P [0], T [2] = = P [1], T [3] = = P [2], [4] but T indicates P [3]

So with our naive algorithm, P is going to move one bit to the right, and the I pointer is going back, and we’re doing the same thing.

If the I pointer doesn’t have to fall back, let the J pointer fall back. In the figure above, T[4] ≠ P[3], but the one before P[3] must match, namely ABA, whose prefix is A and AB. The suffixes are BA and A, and only a can be overlapped with the prefix, so the maximum number of overlapped elements is 1.

Why do we calculate this maximum overlap number?

Suppose you have a string, abab, prefixed with a, ab, aba; Suffixes bab, ab, b; The prefix and suffix can overlap if you move abab two places to the right. It’s very important to understand this, and to understand this is to understand half of the KMP algorithm.

To determine the number of digits to move, you simply need to know all the prefix tables for a string

If only the last character of abababa, a, is not matched, then ababab is matched.

Ababab prefix: A, ab, aba, abab, ababa; Suffixes: babab, abab, baba, ab, b; Maximum overlap number: 4

Meaning of this operation:

P[6] does not match T[I], j goes back to 4, T[4] does not match P[I], abab can match the main string to the maximum number of matches 4

Don’t get it? Let me give you another example

Let’s say I have a matching string abad

There is no match in T[3], find the preceding string ABA, aba prefix: A, ab; Suffixes BA, a; The maximum number of overlaps is 1 (a at the beginning and the end). J is compared with I after backtracking to 1. At this point, the number of substrings in front of j has reached the maximum number of overlaps with T. Notice I is not backtracking.

That is, if j doesn’t match, look for the maximum number of overlapses in the substring before j, and then trace j back to that maximum. At this point, you need to rebuild a prefix list

The prefix table next

Next, add the prefix list

Suppose there is a matching string abaabbabaa

If j = 0 does not match, there is no need to compare, I increases forward, note that this is an important jump condition, next[0]=-1, -1 is a default given value;

If j = 1 does not match, the maximum value of prefix and suffix overlap of substring A is 0, next[1]=0;

If j = 2 does not match, the maximum number of overlaps between the prefix and suffix of the substring AB is 0, and next[1]=0;

If there is a mismatch at j =3, the maximum overlap between prefix and suffix of substring ABA is 1 next[2]=1;

If j =4 does not match, the maximum number of overlaps between the prefix and suffix of substring ABaa is 1 next[3]=1;

If j =5 does not match, the maximum number of overlaps between the prefix and suffix of the substring abaab is 2 next[4]=2;

.

How to write good code for this logic? Need to find a rule, in fact, 2 minutes:

  1. P[j + 1 ] == P[next[j]]

    Next [j-1] = next[2] = 1,P[next[2]] =P[1] = b; Ask next [j]

    Next [j] = next[j-1] + 1

    Since the maximum number of overlaps of ABA is 1, that is, at the position of B, now add ab after it, which has the same value as B of P[1] and will inevitably increase the number of overlaps by 1.

  2. P[j + 1] ≠ next[j]

    • P[0] == P[j + 1]

      Next [j] = 0. Next [j] = 0

    • P[0] ≠ P[j + 1]

      Next [j] = 0

After the above explanation, we derive the following from the previous next value.

If j =6 does not match, P[5] = b; Next [4] = 2, P = a, [2] because [5] P indicates P [2], and [0] P indicates P [5], the next [5] = 0;

If j =7 does not match, P[6] = a; Next [5] = 0, P = a, [0] because P = = P [0] [6], the next [6] = next [5] + 1 = 1;

If j =8 does not match, P[7] = b; Next [6] = 1, P = b [1], because [7] = P = P [1], the next [7] = next [6] + 1 = 2;

If j =9 does not match, P[8] = a; Next [7] = 2, P = a, [2] for P = = P [8], [2], the next [8] = next [7] + 1 = 3;

public static int[] getNext(String ps) {
    char[] p = ps.toCharArray();
    int[] next = new int[p.length];
    next[0] = -1;// Defaults to the given value
    // j represents the position of the current unmatched character
    // For example, abaabbabaa, if j == 2, then the mismatch character is a, and the substring before j is ab
    // j starts at 1, because 0 is already -1, 0 means the current character a does not match, the first character does not match, and the whole string does not match
    // 1 indicates that the calculation starts from b, which is preceded by a
    // There are two special positions in the calculation process, one is j== 0, the other is j== 1
    Next [j] == -1 next[j] == -1 next[j] == -1 next[j] == -1 next[j] == -1
    Next [1] = 0; // next[1] = 0; // next[1] = 0
    int j = 1;
    while (j < p.length) {
        int k = next[j-1]; // The previous values are required for calculation
        if (k == -1 || p[j-1] == p[k]) {
            // the reason why k == -1 is explained above
            next[j] = next[j-1] + 1; // k++
        } else {
            if (p[0] == p[j-1]) {
                next[j] = 1;
            } else {
                next[j] = 0;
            }
        }
        j++;
    }
    return next;
}
Copy the code

KMP implementation

With prefix lists, the KMP algorithm is relatively simple

 public static int kmp(String t, String p) {
        if (t == null || p == null) {
            return -1;
        }
        if (t.length() < p.length()) {
            return -1;
        }
        char[] ct = t.toCharArray();
        char[] cp = p.toCharArray();

        int[] next = getNext(p);

        int i = 0;
        int j = 0;
        while (i < ct.length && j < cp.length) {
            if (j == -1 || ct[i] == cp[j]) {
                if (j == cp.length - 1) {
                    // It is fully matched
                    return i - j;
                }
                // j == -1 indicates that the match starts with the first character
                // if the first character does not match, I moves to the right, j++ is still 0, and next time the match starts from the first character of the matching string
                // If j! = -1, so the characters at I and j are equal, compare the next one
                i++;
                j++;
            } else {
                // string j does not match, find the position in the prefix table where j needs to be backtrackedj = next[j]; }}return -1;
    }
Copy the code