• preface
  • The theoretical knowledge
    • [What is a Trie tree](# What is a Trie tree)
    • [Advantages and disadvantages of Trie](# Advantages and disadvantages of Trie)
    • [Application scenario of Trie](# Application scenario of Trie)
  • coded
  • Refer to the article
  • To contact me

preface

In the process of user query comprehension, there are many processes that require the use of dictionaries to “identify”. In the meantime, you can’t avoid using the Trie tree data structure.

So today we are going to in-depth study Trie tree related theoretical knowledge, and hands-on coding implementation.

The theoretical knowledge

What is a Trie tree

The following definition is taken from Wikipedia.

In computer science, a trie, also known as a prefix tree or dictionary tree, is an ordered tree used to hold associative arrays in which the keys are usually strings. Unlike binary lookup trees, keys are not stored directly in a node, but are determined by the node’s position in the tree. All descendants of a node have the same prefix, which is the string corresponding to the node, and the root node corresponds to the empty string. In general, not all nodes have corresponding values. Only leaf nodes and some internal nodes have corresponding keys with relevant values.

A simple Trie structure is shown below:

From the figure above, we can see some of the Trie’s features.

  • The root node contains no characters, and each child node except the root node contains one character.
  • The characters in the path from the root node to a node are connected to the string corresponding to that node.
  • The common prefix for each word is saved as a character node.

When implemented, a flag is placed in the node structure to indicate whether the node constitutes a word (keyword) or stores some other related value.

As you can see, the keys of a Trie tree are generally strings, and the Trie tree stores each key on a path rather than in a node. In addition, two keywords with common prefixes have the same path in the Prefix part of the Trie Tree, so the Trie Tree is also called a Prefix Tree.

The children of each node in the Trie tree are a collection of single characters, which makes it easy to sort all strings lexicographically. Just print the lexicographical order first, and walk through all the children in lexicographical order. So a Trie tree is also called a dictionary tree.

Advantages and disadvantages of Trie

The core idea of Trie tree is to exchange space for time and use the common prefix of string to reduce the cost of query time to improve efficiency.

Of course, in the case of large data volume, the Trie tree may not be larger than the hash table. As long as the space saved by sharing prefixes covers the extra overhead of the object.

The power of Trie lies in its time complexity. Both inserts and queries are O(N), where N is the length of the string to be inserted/queried, regardless of how many elements are stored in the Trie.

In terms of queries, some people would say that hash time O(1) is not faster? However, the efficiency of a hash search usually depends on how good the hash function is, and if a bad hash function causes a lot of collisions, it is not necessarily more efficient than a Trie tree.

The different keywords in the Trie tree do not conflict. It can only have hash like collisions if it allows multiple values to be associated with a single keyword.

In addition, Trie trees are faster for short strings without hashing values. Because in general, finding a hash value also requires traversing the string.

In other words, the time complexity of Trie trees is theoretically stable, while the time complexity of hash tables is unstable, depending on the hash function and the stored set of strings.

For industrial use, I recommend: If you don’t need the Trie tree prefix matching feature, just use the hash table.

The reasons are as follows:

  1. Hash table implementations are extremely simple, and most languages have well-established internal libraries. Easy to use.
  2. Most of the time in terms of K-V storage, hashing is due to the Trie tree, especially the hash tables in the language library that have been optimized by all the big names.
  3. Trie trees have to be implemented themselves and go through all kinds of logical testing, coverage assurance, pressure testing, etc., which is too expensive to put into use.

Application scenarios of Trie

As an engineer, the most important place for me to learn something is to understand its application scenarios. I have dabbled in all the technologies that only exist in books but have not been fully applied.

In learning Trie tree, I also spent a lot of time to find, record its application scenarios, listed here, if you have other application scenarios, please leave a message for discussion.

K-v storage and retrieval

This is the primitive use of the Trie tree mouth, where you need to compete with the hash table.

Word frequency statistics

We can modify the implementation of the Trie tree to change the number of words formed here to the number of words formed here for each node. So we can use it to search for common word frequency statistics in the scene.

Of course, this requirement hash table can also be implemented.

Lexicographic sort

All sets to be sorted are added to the Trie tree one by one, and all values are printed in order of precedence. When traversing all children of a node, output in lexicographical order.

Prefix matching

For example, find all strings in a collection of strings that begin with ab. All we need to do is construct a trie tree with all the strings and print it out asThe keyword on the path at the beginning is ok.

Trie tree prefix matches are often used in search prompts. For example, the second half of the automatic association function on various search engines.

Longest public prefix To find the longest public prefix of a set of strings, just build the set of strings into a Trie tree and traverse from the following nodes until multiple nodes appear (that is, forks occur).

An auxiliary structure used as an auxiliary structure for other data structures, such as suffix trees, AC automata, etc

coded

First implement Trie tree nodes:

package com.huyan.trie;

import java.util.*;

/** * Created by pfliu on 2019/12/06. */
public class TNode {
    /** * The current node character */
    private char c;
    /** * The number corresponding to the current node */
    int count = 0;

    private TNode[] children;

    private static int hash(char c) {
        return c;
    }

    @Override
    public String toString(a) {
        return "TNode{" +
                "c=" + c +
                ", count=" + count +
                ", children=" + Arrays.toString(children) +
                '} ';
    }

    TNode(char c) {
        this.c = c;
    }

    /** * adds the given character to the given list. *@paramNodes Specifies a list of nodes *@paramC The given character *@returnThe inserted node */
    private static TNode add(final TNode[] nodes, char c) {
        int hash = hash(c);
        int mask = nodes.length - 1;

        for (int i = hash; i < hash + mask + 1; i++) {
            int idx = i & mask;
            if (nodes[idx] == null) {
                TNode node = new TNode(c);
                nodes[idx] = node;
                return node;
            } else if (nodes[idx].c == c) {
                returnnodes[idx]; }}return null;
    }

    /** * Puts the current node into the given node list. * Used to transfer the node list when resize *@paramNodes List *@paramNode Specifies the node */
    private static void add(final TNode[] nodes, TNode node) {
        int hash = hash(node.c);
        int len = nodes.length - 1;

        for (int i = hash; i < hash + len + 1; i++) {
            int idx = i & len;
            if (nodes[idx] == null) {
                nodes[idx] = node;
                return;
            } else if (nodes[idx].c == node.c) {
                throw new IllegalStateException("Node not expected for "+ node.c); }}throw new IllegalStateException("Node not added");
    }

    /** * inserts the given character into the children of the current node. *@paramC The given character *@returnThe inserted node */
    TNode addChild(char c) {
        // Initialize the child node list
        if (children == null) {
            children = new TNode[2];
        }

        // Try to insert
        TNode node = add(children, c);
        if(node ! =null)
            return node;

        // resize
        // Move the node list to the new child node list
        TNode[] tmp = new TNode[children.length * 2];
        for (TNode child : children) {
            if(child ! =null) {
                add(tmp, child);
            }
        }

        children = tmp;
        return add(children, c);
    }

    /** * Find the list of children of the current node where char equals the given character *@paramC Specifies a char *@returnThe corresponding node */
    TNode findChild(char c) {
        final TNode[] nodes = children;
        if (nodes == null) return null;

        int hash = hash(c);
        int len = nodes.length - 1;

        for (int i = hash; i < hash + len + 1; i++) {
            int idx = i & len;
            TNode node = nodes[idx];
            if (node == null) {
                return null;
            } else if (node.c == c) {
                returnnode; }}return null; }}Copy the code

Then implement the Trie tree.

package com.huyan.trie;

import java.util.*;

/** * Created by pfliu on 2019/12/06. */
public class Trie {

    /** * Root node */
    final private TNode root = new TNode('\ 0');

    /** * Add a word to Trie **@paramWord Word to be added *@paramValue corresponds to value */
    public void addWord(String word, int value) {
        if (word == null || word.length() == 0) return;
        TNode node = root;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            // Add the current char to the trie and get the node corresponding to the current char
            node = node.addChild(c);
        }
        node.count = value;
    }

    /** * find word corresponding int value. * *@paramWord Given word *@returnInt. */ stored on the last node
    public int get(String word) {
        TNode node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.findChild(word.charAt(i));
            if (node == null) {
                return 0; }}return node.count;
    }

    private int get(char[] buffer, int offset, int length) {
        TNode node = root;
        for (int i = 0; i < length; i++) {
            node = node.findChild(buffer[offset + i]);
            if (node == null) {
                return 0; }}return node.count;
    }

    /** * starts at offset for the given string. * Finds the first int value that is the largest match. * *@paramSTR the given string *@paramOffset Specifies the offset from the start of the search@returnThe first matching string is the int value of the last node. * /
    public String maxMatch(String str, int offset) {
        TNode node = root;
        int lastMatchIdx = offset;

        for (int i = offset; i < str.length(); i++) {
            char c = str.charAt(i);
            node = node.findChild(c);
            if (node == null) {
                break;
            } else if(node.count ! =0) { lastMatchIdx = i; }}return lastMatchIdx == offset ? null : str.substring(offset, lastMatchIdx + 1);
    }

    /** * starts with the offset <b> reverse </b> of the given string. * Finds the first int value that is the largest match. * *@paramSTR the given string *@paramOffset Specifies the offset from the start of the search@returnThe first matching string is the int value of the last node. * /
    public int maxMatchBack(String str, int offset) {
        TNode node = root;
        int lastMatchIdx = offset;

        for (int i = offset; i >= 0; i--) {
            char c = str.charAt(i);
            node = node.findChild(c);
            if (node == null) {
                break;
            } else if(node.count ! =0) { lastMatchIdx = i; }}return offset - lastMatchIdx + 1;
    }

    /** * starts at offset for the given string. Check the length. * Finds the first int value that is the largest match. * *@paramBuffer Specifies the string *@paramOffset Specifies the offset from the start of the search@returnThe first matching string is the int value of the last node. * /
    public int maxMatch(char[] buffer, int offset, int length) {
        TNode node = root;
        int lastMatchIdx = offset;

        for (int i = offset; i < offset + length; i++) {
            char c = buffer[i];
            node = node.findChild(c);
            if (node == null) {
                break;
            } else if(node.count ! =0) { lastMatchIdx = i; }}return lastMatchIdx - offset + 1;
    }

    public static void main(String[] args) {
        Trie trie = new Trie();

        for (String s : Arrays.asList("HuYan"."Huyan twenty")) {
            trie.addWord(s, 1);
        }

        String input = "Yan Shi is writing an article.";

        System.out.println(trie.maxMatch(input, 0)); }}Copy the code

The basic functions of Trie are basically realized in the code, but there are many application methods of Trie, such as matching prefix, such as finding the length of the longest matching prefix. That’s not always going to happen.

Refer to the article

www.cnblogs.com/huangxinche…

zh.wikipedia.org/wiki/Trie


To the end.

To contact me

Finally, welcome to follow my personal public account [Huyanshi], I will update many backend engineers’ study notes from time to time. Also welcome direct public number private letter or mailbox to contact me, must know all without reserve, endless.


ChangeLog


All the above are personal thoughts, if there is any mistake welcome to comment.

Welcome to reprint, please sign and keep the original link.

Contact email: [email protected]

For more study notes, see personal blog or follow wechat official account < huyan10 >——>HuYan ten