Hello, I am Xiao Huang, a Java development engineer of Unicorn Enterprise. Thank you for meeting us in the vast sea of people. As the saying goes: When your talent and ability are not enough to support your dream, please calm down and study. I hope you can study with me and work hard together to realize your own dream.

One, the introduction

In the company’s weekly meeting last week, it was mentioned that for the risk control industry, keyword matching is one of the most important businesses, and the efficiency of keyword matching is also an important indicator that many risk controllers pay attention to.

How to use a convenient and fast scheme for keyword matching? The common practice of risk control is Trie tree. Today, we will look at the overlord of keyword matching — Trie tree (prefix tree).

Second, what is keyword matching

For risk control, risk control information is processed every day, one of the more important business direction – keyword matching

In simple terms, if the user enters silly force, oh my God, I want to bomb the school….. And so on a series of illegal words, we need to go to our keyword table for matching, and finally return the result: this is an illegal word

It seems that this may not be humanized enough, we can label the keyword table, put our keyword table, hit a series of labels related to politics, pornography, terrorism and so on.

In this way, our whole process looks very human

What is a Trie tree

A dictionary tree, as the name suggests, is a tree about “dictionaries”. That is, it is a way of storing dictionaries (and therefore a data structure rather than an algorithm). Each “word” in the dictionary is a path from the root node to some destination node, and the letters on each side of the path are joined together to form a word.

  • (The orange nodes are “destination nodes,” meaning that all the letters on the path from the root node to the destination node form a word.)

As we can see from this diagram, a dictionary tree is just a tree with a letter on each edge, and some nodes of the tree are designated as marker nodes (target nodes).

That’s the idea of a dictionary tree. Combined with the above concepts, the dictionary tree shown in the figure above contains the words: A, ABC, BAC, BBC, ca

The Trie tree has the following functions:

  • Maintain a collection of strings (that is, dictionaries)
  • Inserting strings into a string collection (that is, a tree)
  • Is there a string in the collection of query strings (that is, a query)
  • Count the number of occurrences of a string in a collection (that is, statistics)
  • Sort a collection of strings lexicographically (that is, lexicographically)
  • If (j, j, j, j, j, j, j, j, j, j, j, j, j, j, j, j)

Why not just use HashMap

First of all, we need to be clear about our purpose, and the best way to do that is to pursue the appropriate technology for the business

To be clear, for a HashMap, the query efficiency of O(1) requires that the size of a single sample be ignored

For this, let’s look at the following code:

The String class

	// Calculates the hashCode of the string
	public int hashCode(a) {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }
Copy the code

A HashMap class

	public boolean containsKey(Object key) {
        returngetNode(hash(key), key) ! =null;
    }
    
    static final int hash(Object key) {
        int h;
        return (key == null)?0 : (h = key.hashCode()) ^ (h >>> 16);
    }
Copy the code

We can see that when we put a string into a HashMap, it iterates through the string to get its hashCode

So, if we don’t ignore the sample size, then our time complexity is actually O(K), where K is the size of the string

If we are currently querying the frequency of occurrences of the string, technically using a HashMap is not much different than using a Trie tree, but when we land our business, for data as large as a keyword, the space occupied by using HashMap is wasted

If I want to query for strings prefixed with “ABC”, our HashMap does not support this, and the Trie tree supports this nicely

In general, for keyword matching, in order to avoid space waste and business diversification, using trie tree is much more efficient than HashMap

Five, Trie tree code implementation

Initialization of Trie

Our Trie tree code in this article assumes that all occurrences of strings are lowercase

  • Pass: The number of strings that pass the point
  • End: Number of strings to end the point
  • Tries: subtree
class Trie {
    int pass;
    int end;
    Trie[] tries;

    public Trie(a) {
        pass = 0;
        end = 0;
        tries = new Trie[26]; }}Copy the code

2. Trie adds strings

	public void insert(String word) {
        if (word == null) {
            return;
        }
        char[] str = word.toCharArray();
        Trie trie = this;
        int path = 0;
        for (int i = 0; i < str.length; i++) {
            path = str[i] - 'a';
            // See if the tree already exists. If not, create it
            if (trie.tries[path] == null) {
                trie.tries[path] = new Trie();
            }
            trie.pass++;
            trie = trie.tries[path];
        }
        trie.end++;
    }
Copy the code

The frequency of occurrences of the Trie query string

		/** ** How many times does the string appear **@param word
         * @return* /
        public int search(String word) {
            if (word == null) {
                return 0;
            }
            char[] str = word.toCharArray();
            Node node = root;
            int path = 0;
            for (int i = 0; i < str.length; i++) {
                path = str[i] - 'a';
                if (node.nodes[path] == null) {
                    return 0;
                }
                node = node.nodes[path];
            }
            return node.end;
        }
Copy the code

4. Trie deletes the string

  • Java player is directly set tonull, GC can be recycled, C++ players to manual recycling
	public void delete(String word) {
            if (search(word) == 0) {
                return;
            }
            char[] str = word.toCharArray();
            Node node = root;
            int path = 0;
            for (int i = 0; i < str.length; i++) {
                path = str[i] - 'a';
                if (--node.nodes[path].pass == 0) {
                    // Java garbage collection is directly collected
                    node.nodes[path] = null;
                    return;
                }
                node = node.nodes[path];
            }
            node.end--;
        }
Copy the code

Six, summarized

That’s the end of the Trie tree. Think about the difference between a HashMap and a Trie tree, and think about which is the best solution. After all, bloggers can write wrong.

Also, the Trie tree code needs to be written several times. When writing this blog, I forgot how to write it again

  • 208. Implement Trie (prefix tree)
  • Title 2:211. Add and Search words – Data structure Design [some KNOWLEDGE of DFS required]

Interested in the source code of small partners, you can pay attention to love knocking code yellow public number, reply: algorithm source code can be obtained algorithm source

That’s it for this installment, and the next installment will definitely, definitely cover minimum spanning tree (Prim, Kruskal) algorithms.

I am a unicorn enterprise Java development engineer, hope you can point attention ah, have a question you can leave a message or private message to add my wechat: HLS1793929520, we see you next time!