Hello, I am Xiao Huang, a Java development engineer of Unicorn Enterprise. Thank you for meeting us in the vast sea of people. As the saying goes: When your talent and ability are not enough to support your dream, please calm down and study. I hope you can study with me and work hard together to realize your own dream.
One, the introduction
In the company’s weekly meeting last week, it was mentioned that for the risk control industry, keyword matching is one of the most important businesses, and the efficiency of keyword matching is also an important indicator that many risk controllers pay attention to.
How to use a convenient and fast scheme for keyword matching? The common practice of risk control is Trie tree. Today, we will look at the overlord of keyword matching — Trie tree (prefix tree).
Second, what is keyword matching
For risk control, risk control information is processed every day, one of the more important business direction – keyword matching
In simple terms, if the user enters silly force, oh my God, I want to bomb the school….. And so on a series of illegal words, we need to go to our keyword table for matching, and finally return the result: this is an illegal word
It seems that this may not be humanized enough, we can label the keyword table, put our keyword table, hit a series of labels related to politics, pornography, terrorism and so on.
In this way, our whole process looks very human
What is a Trie tree
A dictionary tree, as the name suggests, is a tree about “dictionaries”. That is, it is a way of storing dictionaries (and therefore a data structure rather than an algorithm). Each “word” in the dictionary is a path from the root node to some destination node, and the letters on each side of the path are joined together to form a word.
- (The orange nodes are “destination nodes,” meaning that all the letters on the path from the root node to the destination node form a word.)
As we can see from this diagram, a dictionary tree is just a tree with a letter on each edge, and some nodes of the tree are designated as marker nodes (target nodes).
That’s the idea of a dictionary tree. Combined with the above concepts, the dictionary tree shown in the figure above contains the words: A, ABC, BAC, BBC, ca
The Trie tree has the following functions:
- Maintain a collection of strings (that is, dictionaries)
- Inserting strings into a string collection (that is, a tree)
- Is there a string in the collection of query strings (that is, a query)
- Count the number of occurrences of a string in a collection (that is, statistics)
- Sort a collection of strings lexicographically (that is, lexicographically)
- If (j, j, j, j, j, j, j, j, j, j, j, j, j, j, j, j)
Why not just use HashMap
First of all, we need to be clear about our purpose, and the best way to do that is to pursue the appropriate technology for the business
To be clear, for a HashMap, the query efficiency of O(1) requires that the size of a single sample be ignored
For this, let’s look at the following code:
The String class
// Calculates the hashCode of the string
public int hashCode(a) {
int h = hash;
if (h == 0 && value.length > 0) {
char val[] = value;
for (int i = 0; i < value.length; i++) {
h = 31 * h + val[i];
}
hash = h;
}
return h;
}
Copy the code
A HashMap class
public boolean containsKey(Object key) {
returngetNode(hash(key), key) ! =null;
}
static final int hash(Object key) {
int h;
return (key == null)?0 : (h = key.hashCode()) ^ (h >>> 16);
}
Copy the code
We can see that when we put a string into a HashMap, it iterates through the string to get its hashCode
So, if we don’t ignore the sample size, then our time complexity is actually O(K), where K is the size of the string
If we are currently querying the frequency of occurrences of the string, technically using a HashMap is not much different than using a Trie tree, but when we land our business, for data as large as a keyword, the space occupied by using HashMap is wasted
If I want to query for strings prefixed with “ABC”, our HashMap does not support this, and the Trie tree supports this nicely
In general, for keyword matching, in order to avoid space waste and business diversification, using trie tree is much more efficient than HashMap
Five, Trie tree code implementation
Initialization of Trie
Our Trie tree code in this article assumes that all occurrences of strings are lowercase
- Pass: The number of strings that pass the point
- End: Number of strings to end the point
- Tries: subtree
class Trie {
int pass;
int end;
Trie[] tries;
public Trie(a) {
pass = 0;
end = 0;
tries = new Trie[26]; }}Copy the code
2. Trie adds strings
public void insert(String word) {
if (word == null) {
return;
}
char[] str = word.toCharArray();
Trie trie = this;
int path = 0;
for (int i = 0; i < str.length; i++) {
path = str[i] - 'a';
// See if the tree already exists. If not, create it
if (trie.tries[path] == null) {
trie.tries[path] = new Trie();
}
trie.pass++;
trie = trie.tries[path];
}
trie.end++;
}
Copy the code
The frequency of occurrences of the Trie query string
/** ** How many times does the string appear **@param word
* @return* /
public int search(String word) {
if (word == null) {
return 0;
}
char[] str = word.toCharArray();
Node node = root;
int path = 0;
for (int i = 0; i < str.length; i++) {
path = str[i] - 'a';
if (node.nodes[path] == null) {
return 0;
}
node = node.nodes[path];
}
return node.end;
}
Copy the code
4. Trie deletes the string
- Java player is directly set to
null
, GC can be recycled, C++ players to manual recycling
public void delete(String word) {
if (search(word) == 0) {
return;
}
char[] str = word.toCharArray();
Node node = root;
int path = 0;
for (int i = 0; i < str.length; i++) {
path = str[i] - 'a';
if (--node.nodes[path].pass == 0) {
// Java garbage collection is directly collected
node.nodes[path] = null;
return;
}
node = node.nodes[path];
}
node.end--;
}
Copy the code
Six, summarized
That’s the end of the Trie tree. Think about the difference between a HashMap and a Trie tree, and think about which is the best solution. After all, bloggers can write wrong.
Also, the Trie tree code needs to be written several times. When writing this blog, I forgot how to write it again
- 208. Implement Trie (prefix tree)
- Title 2:211. Add and Search words – Data structure Design [some KNOWLEDGE of DFS required]
Interested in the source code of small partners, you can pay attention to love knocking code yellow public number, reply: algorithm source code can be obtained algorithm source
That’s it for this installment, and the next installment will definitely, definitely cover minimum spanning tree (Prim, Kruskal) algorithms.
I am a unicorn enterprise Java development engineer, hope you can point attention ah, have a question you can leave a message or private message to add my wechat: HLS1793929520, we see you next time!