Learn more about Java basics


This series of notes is based on the Geek Time course "The Beauty of Data Structures and Algorithms".

Contents

  • Breadth-first and depth-first search algorithms
  • BF and RK algorithms
  • BM algorithm
  • KMP algorithm
  • Sunday algorithm
  • Trie tree
  • AC automata

Breadth-first and depth-first search algorithms

Breadth First Search (BFS)

Breadth-First Search (breadth-first-search) is essentially a "carpet-style", layer-by-layer search strategy: first visit the vertices closest to the starting vertex, then the next closest layer, and keep searching outward.

There are three important auxiliary variables: visited, queue, and prev.

  • visited records which vertices have already been visited, to avoid visiting a vertex repeatedly. If vertex q has been visited, visited[q] is set to true.
  • queue stores vertices that have been visited but whose adjacent vertices have not yet been explored. Because breadth-first search proceeds layer by layer, vertices in layer k+1 can be visited only after all vertices in layer k have been visited; while visiting layer k we need to record its vertices so that layer k+1 can be reached from them.
  • prev records the search path. When we search breadth-first from vertex s to vertex t, the prev array stores the path, but in reverse: prev[w] stores the predecessor vertex from which w was reached. A code sketch using all three variables follows this list.
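A minimal BFS sketch built on these three variables, assuming an adjacency-list Graph class in the spirit of the course's code (the exact class layout, field names, and the print() helper here are assumptions):

```java
import java.util.LinkedList;
import java.util.Queue;

public class Graph { // undirected graph stored as adjacency lists
  private int v;                          // number of vertices
  private LinkedList<Integer>[] adj;      // adjacency lists

  @SuppressWarnings("unchecked")
  public Graph(int v) {
    this.v = v;
    adj = new LinkedList[v];
    for (int i = 0; i < v; ++i) adj[i] = new LinkedList<>();
  }

  public void addEdge(int s, int t) { // undirected edge s-t
    adj[s].add(t);
    adj[t].add(s);
  }

  // BFS from s to t; prints one shortest path if t is reachable.
  public void bfs(int s, int t) {
    if (s == t) return;
    boolean[] visited = new boolean[v];   // vertices already visited
    int[] prev = new int[v];              // prev[w] = predecessor of w on the search path
    for (int i = 0; i < v; ++i) prev[i] = -1;
    Queue<Integer> queue = new LinkedList<>(); // visited vertices whose neighbors are not yet explored
    visited[s] = true;
    queue.add(s);
    while (!queue.isEmpty()) {
      int w = queue.poll();
      for (int q : adj[w]) {
        if (!visited[q]) {
          prev[q] = w;
          if (q == t) {
            print(prev, s, t);
            return;
          }
          visited[q] = true;
          queue.add(q);
        }
      }
    }
  }

  // Recursively print the path from s to t; prev stores the path in reverse.
  private void print(int[] prev, int s, int t) {
    if (prev[t] != -1 && t != s) print(prev, s, prev[t]);
    System.out.print(t + " ");
  }
}
```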

Time complexity of BFS

Time complexity: in the worst case, the terminating vertex t is far from the starting vertex s and the whole graph must be traversed to find it. Every vertex enters and leaves the queue once, and every edge is visited once.

So the time complexity of breadth-first search is O(V+E), where V is the number of vertices and E the number of edges. For a connected graph, in which all vertices are reachable from each other, E must be greater than or equal to V-1, so the complexity of breadth-first search can also be abbreviated as O(E).

Space complexity of BFS

Space complexity: the space consumed by breadth-first search goes mainly into the auxiliary variables: the visited array, the queue, and the prev array. None of these three takes more space than the number of vertices, so the space complexity is O(V).

Depth-first Search (DFS)

Depth-First Search is abbreviated as DFS. The most intuitive example is walking through a maze. Depth-first search uses the well-known algorithmic idea of backtracking, and this style of problem solving is a natural fit for recursion.

The depth-first search implementation also uses the prev and visited variables and the print() function, just like the breadth-first search implementation. But the depth-first search implementation has one special variable called found: once the terminating vertex t has been found, we stop recursing further. Note that depth-first search does not find the shortest path from vertex s to vertex t. A sketch is shown below.
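A matching DFS sketch under the same assumptions (the v and adj fields and the print() helper from the BFS sketch above), showing how found cuts off the recursion:

```java
private boolean found = false; // becomes true once the terminating vertex t is reached

public void dfs(int s, int t) {
  found = false;
  boolean[] visited = new boolean[v];
  int[] prev = new int[v];
  for (int i = 0; i < v; ++i) prev[i] = -1;
  recurDfs(s, t, visited, prev);
  print(prev, s, t); // same print() as in the BFS sketch; the path is not necessarily shortest
}

private void recurDfs(int w, int t, boolean[] visited, int[] prev) {
  if (found) return;            // stop recursing once t has been found
  visited[w] = true;
  if (w == t) {
    found = true;
    return;
  }
  for (int q : adj[w]) {
    if (!visited[q]) {
      prev[q] = w;
      recurDfs(q, t, visited, prev);
    }
  }
}
```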

Time complexity of DFS

Time complexity: each edge is visited at most twice, once when traversing forward and once when backtracking. So the time complexity of DFS is O(E), where E is the number of edges.

Space complexity of DFS

Space complexity: the space consumed by depth-first search goes mainly into the visited and prev arrays and the recursive call stack. The sizes of visited and prev are proportional to the number of vertices V, and the maximum depth of the recursion stack does not exceed the number of vertices, so the total space complexity is O(V).

Further reading:

Basic algorithms for graphs (BFS and DFS)

BF and RK algorithms

String matching algorithms are the basic building block behind the string search functions provided by every programming language. They fall into single-pattern matching algorithms and multi-pattern matching algorithms.

Single-pattern matching: the BF algorithm and the RK algorithm. The RK algorithm is an improvement on the BF algorithm; it cleverly uses hashing to improve matching efficiency.

BF algorithm

  1. BF is short for Brute Force; the BF algorithm is also known as the naive matching algorithm.
  2. Two concepts: the main string and the pattern string

If we look for string B in string A, then A is the main string and B is the pattern string. The length of the main string is called n and the length of the pattern string is called m; n > m, because we are looking for the pattern string inside the main string.

  3. The idea of the BF algorithm can be summarized as follows: in the main string, check the n-m+1 substrings of length m whose starting positions are 0, 1, 2, …, n-m, and see whether any of them matches the pattern string (a code sketch follows this list).

  4. In the extreme case, for example when the main string is "aaaa…aaaa" and the pattern string is "aaaab", we compare m characters each time and do so n-m+1 times, so the worst-case time complexity is O(n*m).

  5. Although the BF algorithm has a high worst-case time complexity, it is very common in practical development.
    • Reason 1: in most real software development scenarios, neither the pattern string nor the main string is very long, and each attempt to match the pattern string against a substring of the main string can stop as soon as a mismatching character is encountered, without comparing all m characters. So although the worst-case time is O(n*m), statistically the algorithm usually performs far better than that.
    • Reason 2: the naive string matching algorithm is simple in idea and very simple to implement, which means it is not prone to bugs. In engineering, as long as performance requirements are met, simplicity comes first; this is the often-quoted KISS (Keep It Simple, Stupid) design principle.
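A minimal sketch of the BF idea described in point 3, using the same char-array/length style as the BM code later in these notes (the function name bf is illustrative):

```java
// Brute-force (BF) match: return the first index where pattern b occurs in main string a, or -1.
public static int bf(char[] a, int n, char[] b, int m) {
  for (int i = 0; i <= n - m; ++i) {        // n - m + 1 candidate starting positions
    int j = 0;
    while (j < m && a[i + j] == b[j]) ++j;  // stop as soon as a character mismatches
    if (j == m) return i;                   // all m characters matched
  }
  return -1;
}
```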

RK algorithm

  1. The full name of the RK algorithm is the Rabin-Karp algorithm, a concatenation of the two inventors' names. It is an upgraded version of the BF algorithm.
  2. The problem with the BF algorithm is that every check of whether a substring of the main string matches the pattern string compares the characters one by one, so its time complexity is relatively high. With the introduction of hashing, the time complexity drops immediately.
  3. The idea of the RK algorithm: hash the n-m+1 substrings of the main string with a hash algorithm, then compare each substring's hash value with the hash value of the pattern string. If the hash values are equal, the corresponding substring (very likely) matches the pattern string.
  4. If computing a substring's hash value requires traversing every character of the substring, hashing only speeds up the comparison between the pattern string and a substring; it does not improve the overall efficiency.
  5. To compute the substring hash values efficiently, we design the hash function accordingly: assuming the character set of the strings contains only K characters, treat a substring as a K-base number and convert it to decimal; that decimal number is the substring's hash value. For example, if the strings contain only the lowercase letters a-z, then K = 26 and we map 'a'-'z' to 0-25.
  6. This hash function has a useful property: in the main string, the hash values of two adjacent substrings are related by a simple formula.
  7. Writing h[i] for the hash value of the length-m substring starting at index i, the relation is h[i] = 26 * (h[i-1] - 26^(m-1) * (a[i-1] - 'a')) + (a[i+m-1] - 'a'), so each hash value can be computed in O(1) from the previous one.
  8. The 26^(m-1) term can be looked up instead of recomputed: precompute 26^0, 26^1, 26^2, …, 26^(m-1), store them in an array of length m, and let the "power" in the formula be the array index. Reading the value directly from the array saves computation time.
  9. Time complexity of the RK algorithm: it consists of two parts, computing the substring hash values and comparing the pattern string's hash value with the substrings' hash values. For the first part, a single scan of the main string computes all the substring hash values, which is O(n). Each comparison between the pattern string's hash value and a substring's hash value is O(1), and there are n-m+1 of them, so this part is also O(n). Overall, the RK algorithm runs in O(n).
  10. If the pattern string is very long, the corresponding substrings of the main string are also long, and the hash values computed as above may exceed the range an integer type can represent. How do we handle that? One answer: add up the numeric codes of the letters in the string and use the sum as the hash value. This hash function produces a much smaller range of hash values, at the cost of allowing collisions.
  11. How do we resolve hash collisions when they occur? If two hash values are equal, compare the substring and the pattern string character by character.

Therefore, the hash function should have a low collision probability. If there are many collisions, the time complexity of the RK algorithm degrades. In the extreme case, if the hashes collide a lot and the substring has to be compared with the pattern string character by character every time, the time complexity degrades to O(n*m).
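A minimal RK sketch for lowercase-letter strings. To sidestep the overflow issue it uses the simpler "sum of letter codes" hash from point 10 rather than the 26-base hash, so collisions are possible and are resolved by a direct character comparison as in point 11 (the function names are illustrative):

```java
// RK sketch: rolling "sum of letter codes" hash plus verification on hash equality.
public static int rk(char[] a, int n, char[] b, int m) {
  if (n < m) return -1;
  int target = 0, h = 0;
  for (int i = 0; i < m; ++i) {
    target += b[i] - 'a';                 // hash of the pattern string
    h += a[i] - 'a';                      // hash of the first substring a[0, m-1]
  }
  for (int i = 0; i <= n - m; ++i) {
    if (i > 0) h = h - (a[i - 1] - 'a') + (a[i + m - 1] - 'a'); // rolling update in O(1)
    if (h == target && matches(a, i, b, m)) return i;           // equal hashes: verify
  }
  return -1;
}

private static boolean matches(char[] a, int start, char[] b, int m) {
  for (int j = 0; j < m; ++j) {
    if (a[start + j] != b[j]) return false;
  }
  return true;
}
```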

BM algorithm

The BM (Boyer-Moore) algorithm is a very efficient string matching algorithm; experimental statistics put its performance at 3 to 4 times that of the well-known KMP algorithm. The principle of the BM algorithm is relatively complex.

The core idea of BM algorithm

Think of the matching process between the pattern string and the main string as the pattern string constantly sliding backward along the main string. When a character does not match, the BF and RK algorithms simply slide the pattern string back one position and restart the match from the first character of the pattern string.

The BM algorithm, in essence, looks for rules that allow the pattern string to slide back several positions at once when a mismatch occurs, improving matching efficiency.

BM algorithm principle analysis

The BM algorithm consists of two parts, namely:

  • the bad character rule
  • the good suffix rule (good suffix shift)

Bad character rule:

  1. In the BF and RK algorithms, during matching, characters of the main string are matched against the pattern string in order of increasing pattern-string index. This matching order fits our habits of thinking.
  2. The BM algorithm uses a special matching order: it matches in reverse, in order of decreasing pattern-string index.

  3. We match backward from the end of the pattern string. When we find a character that cannot be matched, the mismatching character in the main string is called the bad character.
  4. If the bad character c does not appear in the pattern string at all, i.e. c cannot match any character of the pattern string, we can slide the pattern string directly past c (three positions in the course's example) and compare again from the last character of the pattern string.
  5. Suppose that after sliding, the last character d of the pattern string fails to match the character a in the main string. This time the pattern string cannot be slid back three positions, because the bad character a does appear in the pattern string: the character at index 0 of the pattern string is also a.

  6. When a mismatch occurs, let si be the index in the pattern string of the character aligned with the bad character. If the bad character appears in the pattern string, let xi be its index in the pattern string (choosing the rightmost occurrence if it appears more than once, to avoid sliding too far); if it does not appear, let xi be -1. The number of positions the pattern string slides backward is si - xi.

  7. Using only the bad character rule, the BM algorithm has a very good best-case time complexity of O(n/m).
  8. However, the bad character rule alone is not enough, because the shift si - xi can be negative, for example with main string "aaaaaaaaaaaaaaaa" and pattern string "baaaa": instead of sliding the pattern string backward, it would move it forward. That is why the BM algorithm also needs the good suffix rule.

Good suffix rule

  1. The good suffix rule actually works in a similar way to the bad character rule.

  2. Call the matched suffix {u}. If we can find another substring {u*} in the pattern string that equals {u}, we slide the pattern string so that {u*} aligns with {u} in the main string.
  3. If there is no substring equal to {u} in the pattern string, we still cannot simply slide the pattern string past {u} in the main string; doing so could skip over positions where the pattern string and the main string would match.
  4. More precisely, if the pattern string contains no other substring matching the good suffix, then while sliding the pattern string backward one step at a time, a complete match is impossible as long as {u} in the main string overlaps the middle of the pattern string. But a complete match becomes possible once a prefix of the pattern string overlaps a suffix of {u} in the main string and the overlapping parts are equal. Therefore we must check not only whether the pattern string contains another substring matching the good suffix, but also whether some suffix substring of the good suffix matches a prefix substring of the pattern string.

  5. A suffix substring of a string s is a substring that ends at the last character of s. For example, the suffix substrings of "abc" are "c" and "bc". We find the longest suffix substring of the good suffix that matches a prefix substring of the pattern string, call it {v}, and slide the pattern string to the corresponding position (shown in the course's figure).

 

Good suffixes and bad characters

  1. We can compute the shift given by the good suffix rule and the shift given by the bad character rule separately, and take the larger of the two as the number of positions the pattern string slides backward.
  2. If neither of the situations in the good suffix rule applies, we slide the pattern string backward until it is past the good suffix.

BM algorithm code implementation

  1. The bad character rule itself is not hard to understand. When a bad character is encountered, we need to compute the backward shift si - xi, where finding xi is the key step: how do we find the position of the bad character in the pattern string?
  2. If we searched the pattern string sequentially for the bad character, the inefficiency would hurt the whole algorithm. Instead, use a hash table that maps every character of the pattern string to its index in the pattern string. This lets us locate a bad character in the pattern string quickly (a sketch of such a table appears at the end of this section).

  3. The core of the good suffix processing rule:
    • in the pattern string, find another substring matching the good suffix;
    • among the suffix substrings of the good suffix, find the longest one that matches a prefix substring of the pattern string.
  4. To support this, introduce a suffix array, whose index is the length of a suffix substring and whose value is the starting position in the pattern string of another substring matching that suffix substring. Also introduce a boolean array prefix that records whether a suffix substring of the pattern string matches a prefix substring of the pattern string.

  5. We take the substring with indices from 0 to i (i ranges from 0 to m-2) and the whole pattern string, and compute their common suffix substring. If the length of the common suffix substring is k, we record suffix[k] = j, where j is the starting index of the common suffix substring. If j equals 0, which means the common suffix substring is also a prefix substring of the pattern string, we record prefix[k] = true.

The computation of the suffix and prefix arrays can be implemented as follows:

```java
// b is the pattern string, m is its length; suffix and prefix are the arrays to fill in
private void generateGS(char[] b, int m, int[] suffix, boolean[] prefix) {
  for (int i = 0; i < m; ++i) { // initialize
    suffix[i] = -1;
    prefix[i] = false;
  }
  for (int i = 0; i < m - 1; ++i) { // b[0, i]
    int j = i;
    int k = 0; // length of the common suffix substring
    while (j >= 0 && b[j] == b[m - 1 - k]) { // find the common suffix substring with b[0, m-1]
      --j;
      ++k;
      suffix[k] = j + 1; // j+1 is the starting index of the common suffix substring
    }
    if (j == -1) prefix[k] = true; // the common suffix substring is also a prefix of the pattern string
  }
}
```
  6. During matching, follow these steps:
    • First, check whether the whole good suffix appears elsewhere in the pattern string, by testing whether suffix[k] is not equal to -1; if so, slide the pattern string to align with that occurrence.

    • If the whole good suffix does not appear elsewhere, check whether some suffix substring of the good suffix matches a prefix substring of the pattern string, by testing whether prefix[k] equals true.

    • If neither rule finds a match for the good suffix or its suffix substrings, slide the whole pattern string backward by m positions.

The code implementation is as follows:

```java
// a is the main string, b is the pattern string; n and m are their lengths
public int bm(char[] a, int n, char[] b, int m) {
  int[] bc = new int[SIZE];          // bad character hash table
  generateBC(b, m, bc);
  int[] suffix = new int[m];
  boolean[] prefix = new boolean[m];
  generateGS(b, m, suffix, prefix);
  int i = 0; // position in the main string aligned with the first character of the pattern string
  while (i <= n - m) {
    int j;
    for (j = m - 1; j >= 0; --j) {   // match the pattern string from back to front
      if (a[i + j] != b[j]) break;   // the bad character corresponds to index j of the pattern string
    }
    if (j < 0) {
      return i; // match found: return the starting position of the pattern string in the main string
    }
    int x = j - bc[(int) a[i + j]];  // shift from the bad character rule
    int y = 0;
    if (j < m - 1) {                 // there is a good suffix
      y = moveByGS(j, m, suffix, prefix); // shift from the good suffix rule
    }
    i = i + Math.max(x, y);          // slide back by the larger of the two shifts
  }
  return -1;
}

// j is the index in the pattern string corresponding to the bad character; k is the length of the good suffix
private int moveByGS(int j, int m, int[] suffix, boolean[] prefix) {
  int k = m - 1 - j;
  if (suffix[k] != -1) return j - suffix[k] + 1;
  for (int r = j + 2; r <= m - 1; ++r) {
    if (prefix[m - r] == true) {
      return r;
    }
  }
  return m;
}
```
  7. Ignoring efficiency, both of these lookups could be done with brute-force searches. But to make the BM algorithm fast, we preprocess the pattern string before the actual matching begins: since the good suffix is itself a suffix substring of the pattern string, we can precompute, for every suffix substring of the pattern string, the position of another matching substring (which is exactly what generateGS above does).
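The bm() code above calls a generateBC() helper that is not shown in these notes. A sketch of how that bad-character table might be built, assuming an ASCII character set of size 256 (the SIZE constant used by bm()):

```java
private static final int SIZE = 256; // assume an ASCII character set

// b is the pattern string, m its length; bc is the bad character hash table:
// bc[c] stores the rightmost index at which character c appears in the pattern string, or -1.
private void generateBC(char[] b, int m, int[] bc) {
  for (int i = 0; i < SIZE; ++i) {
    bc[i] = -1;
  }
  for (int i = 0; i < m; ++i) {
    int ascii = (int) b[i];
    bc[ascii] = i; // later occurrences overwrite earlier ones, keeping the rightmost index
  }
}
```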

Further reading:

No need to look elsewhere: for learning the BM algorithm, this is enough (idea + detailed annotated code)

KMP algorithm

For KMP, the two lectures in the course are easier to follow if read from start to finish. Recommended reading: How do you understand the next array in the KMP algorithm?
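As a quick reference, here is a sketch of KMP using the next (failure) array; this is a reconstruction in the spirit of the course's code, not the lecture's exact listing:

```java
// a is the main string (length n), b the pattern string (length m); returns the first match index or -1.
public static int kmp(char[] a, int n, char[] b, int m) {
  int[] next = getNexts(b, m);
  int j = 0;
  for (int i = 0; i < n; ++i) {
    while (j > 0 && a[i] != b[j]) { // mismatch: fall back using the next array
      j = next[j - 1] + 1;
    }
    if (a[i] == b[j]) ++j;
    if (j == m) return i - m + 1;   // the whole pattern string matched
  }
  return -1;
}

// next[i]: end index of the longest proper prefix of b[0, i] that is also a suffix of b[0, i], or -1.
private static int[] getNexts(char[] b, int m) {
  int[] next = new int[m];
  next[0] = -1;
  int k = -1;
  for (int i = 1; i < m; ++i) {
    while (k != -1 && b[k + 1] != b[i]) {
      k = next[k];
    }
    if (b[k + 1] == b[i]) ++k;
    next[i] = k;
  }
  return next;
}
```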

Sunday algorithm

The idea of the Sunday algorithm is very similar to the bad character idea in the BM algorithm. The difference is that after a mismatch, the Sunday algorithm takes the character of the main string immediately after the part currently aligned with the pattern string and uses that character for the bad character shift.
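A minimal sketch of the Sunday idea just described (the sunday() function and its String-based signature are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Sunday sketch: on a mismatch, look at the main-string character just after the current
// window and shift so that its rightmost occurrence in the pattern aligns with it,
// or jump past the whole window if it does not occur in the pattern at all.
public static int sunday(String a, String b) {
  int n = a.length(), m = b.length();
  Map<Character, Integer> last = new HashMap<>(); // rightmost index of each pattern character
  for (int i = 0; i < m; ++i) last.put(b.charAt(i), i);
  int i = 0;
  while (i <= n - m) {
    int j = 0;
    while (j < m && a.charAt(i + j) == b.charAt(j)) ++j;
    if (j == m) return i;                  // full match
    if (i + m >= n) return -1;             // no character after the window
    char next = a.charAt(i + m);           // character just after the current window
    i += m - last.getOrDefault(next, -1);  // align it, or jump past the window
  }
  return -1;
}
```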

BM algorithm and Sunday fast string matching algorithm
String matching algorithm: Sunday algorithm
【Pattern matching】- Sunday algorithm

Trie tree

What is a Trie tree?

  1. It is a tree structure, a data structure specialized for string matching; it solves the problem of quickly finding a string within a set of strings.
  2. The essence of a Trie tree is to use the common prefixes between strings to merge repeated prefixes. The root node contains no information; every other node represents a character of a string, and a path from the root node to a red node (a node marked as the end of a string) represents a string (red nodes are not necessarily leaf nodes).
  3. When looking up a string such as "her" in a Trie tree, we split it into the individual characters h, e, r and match them along a path starting from the root of the Trie tree. If instead the string to look up is "he", we use the same method, starting from the root node and matching along a path, but we find that the last node of the path, 'e', is not red; that is, "he" is only a prefix of some string and does not exactly match any stored string.

How to implement a Trie tree?

A Trie tree has two main operations:

  • constructing a Trie tree from a set of strings, which decomposes into repeatedly inserting individual strings into the Trie tree;
  • querying whether a given string exists in the Trie tree.

How do I store a Trie tree

  1. A Trie tree is a multiway tree, so each node needs to store pointers to all of its children.
  2. A classic storage method borrows the idea of a hash table: an array in each node maps characters to array subscripts and stores the pointers to the child nodes.

Suppose the strings contain only the 26 lowercase letters a through z. Then array index 0 stores the pointer to child node 'a', index 1 stores the pointer to child node 'b', and so on, with index 25 storing the pointer to child node 'z'. If the child for some character does not exist, null is stored at the corresponding index. When looking up a string in the Trie tree, we compute the character's ASCII code minus the ASCII code of 'a' to quickly find the pointer to the matching child node (see the sketch below).
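A minimal Trie sketch assuming only the lowercase letters a-z, following the array-of-children storage scheme just described (class and field names are illustrative):

```java
public class Trie {
  private TrieNode root = new TrieNode('/'); // the root node stores no real character

  // One node per character; children indexed by (ch - 'a'), assuming lowercase a-z only.
  private static class TrieNode {
    char data;
    TrieNode[] children = new TrieNode[26];
    boolean isEndingChar = false; // true for the "red" end-of-string nodes
    TrieNode(char data) { this.data = data; }
  }

  // Insert a string into the Trie tree.
  public void insert(char[] text) {
    TrieNode p = root;
    for (char c : text) {
      int index = c - 'a'; // map the character to an array subscript
      if (p.children[index] == null) {
        p.children[index] = new TrieNode(c);
      }
      p = p.children[index];
    }
    p.isEndingChar = true;
  }

  // Return true only if the Trie contains the whole string (not just as a prefix).
  public boolean find(char[] pattern) {
    TrieNode p = root;
    for (char c : pattern) {
      int index = c - 'a';
      if (p.children[index] == null) return false;
      p = p.children[index];
    }
    return p.isEndingChar; // e.g. "he" reaches a non-ending node if only "her" was inserted
  }
}
```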

Time complexity of Trie tree lookups

Trie trees are very efficient when we frequently query strings within the same set of strings. Building a Trie tree requires scanning all the strings, which takes O(n) time, where n is the total length of all the strings in the set. But once the tree is built, subsequent queries are very efficient.

For each query, if the length of the string to look up is k, we only need to compare about k nodes to complete the query. This is independent of the length or number of strings in the original set. So once the Trie tree is built, the time complexity of looking up a string in it is O(k), where k is the length of the string being looked up.

Are Trie trees memory intensive?

A Trie tree uses an array in each node to store pointers to the node's children. Even if a node has very few children, far fewer than 26, say 2 or 3, it still maintains an array of length 26. The point of a Trie is to avoid repeatedly storing the common prefixes of a set of strings, but each character (one node) now takes far more space than one byte. If the strings contain not only lowercase letters but also uppercase letters, digits, or even Chinese characters, even more storage is required. So a Trie tree does not necessarily save memory; if there are not many repeated prefixes, it can actually waste more memory.

Optimization scheme of Trie tree

  • Sacrifice a little query efficiency and replace the array in each node with another data structure for storing the child pointers, such as an ordered array, a skip list, a hash table, or a red-black tree. With an ordered array, for example, the pointers are sorted by the characters of the children they point to, and during a query we can binary-search for the pointer to the child that a character should match.
  • Node-merging optimization: a node that has only one child and is not the ending node of a string can be merged with its child. This saves space but makes the implementation more complex.

 

Trie tree versus hash table and red-black tree

  1. String matching problems are, generally speaking, data lookup problems.
  2. For looking up specific strings in a set of strings, the Trie tree actually does not perform very well, and it places strict requirements on the strings it handles:
    • The character set of the strings cannot be too large; otherwise a lot of storage space is wasted, and even the optimizations above come at the expense of query and insertion efficiency.
    • The strings need to share many common prefixes; otherwise the space consumption grows considerably.
    • To solve a problem with a Trie tree, you have to implement the Trie tree from scratch and make sure it is bug-free, which in engineering terms is complicating a simple problem.
    • Data linked by pointers is not contiguous in memory, and Trie trees rely heavily on pointers, so they are not cache-friendly and take a performance hit.

To sum up: Trie trees are not suitable for exact-match lookups; that problem is better solved with hash tables or red-black trees. Trie trees are well suited to finding strings that match a given prefix.

This feature of the Trie tree can be extended to a much broader application: auto-completion of input.

AC automata

The video "Easy to master AC automata" is very good; an AC automaton feels like a Trie tree + KMP.

Single-pattern string matching:

  1. BF: simple scenarios where neither the main string nor the pattern string is too long, O(m*n).
  2. RK: the character set must not be too large and the pattern string not too long, otherwise hash values may collide; O(n).
  3. naive-BM: the pattern string should not be too long (the preprocessing is heavy), e.g. the search feature in an IDE or editor; preprocessing is O(m*m) and matching is O(n); the implementation is more complicated and needs more extra space. PS: the pattern-string length bottleneck can be greatly relieved if the preprocessing is optimized with dynamic programming ideas.
  4. KMP: suitable for all scenarios, and the overall implementation is simpler than BM; O(n+m), with only the O(m) extra space of a next array. Statistically, though, BM seems to be faster, and it is not entirely clear why.
  5. In addition, there is the Sunday algorithm, which is faster than BM/KMP and easier to implement and understand.

String matching algorithm Sunday algorithm

Multi-pattern string matching:

  1. Naive-Trie: suitable when the strings share many common prefixes (building is O(n*k)) or for prefix-based lookups (O(k)), such as auto-complete suggestions in a search box.
  2. AC automaton: suitable for exactly matching multiple pattern strings against a large amount of text, reaching O(n).