Trie tree

The name Trie comes from the word for retrieval because it allows you to find the word you want in a dictionary using only a prefix. Although the pronunciation is the same as “Tree”, to distinguish the dictionary Tree from ordinary binary trees, programmer Xiao Wu usually stresses the end of “Trie”, which can be interpreted as “TreeE”.

Trie tree, also known as “dictionary tree”. As the name implies, it is a tree structure. It is a data structure that deals specifically with string matching to solve the problem of quickly finding a string in a collection of strings.

In addition, Trie trees are also called prefix trees (because the descendants of a node have a common prefix, such as pan for panda).

Its keys are all strings, which can efficiently query and insert. The time complexity is O(k), and K is the length of the string. The disadvantage is that it consumes memory if a large number of strings have no common prefix.

Its core idea is to make the query efficient by minimizing unnecessary string comparisons, i.e., “trade space for time”, and use common prefixes to improve the query efficiency.

Trie tree characteristics

Suppose you have five strings: code, cook, five, file, fat. Now you need to look inside several times to see if a certain string exists. Is there a more efficient way to match each of these five strings with the same string?

If you organize these five strings into the structure shown below, would it be faster to scan them visually than to look them up?

From the figure above, you can see three characteristics of the Trie tree:

  • The root node contains no characters, and each node except the root node contains only one character
  • From the root node to a node, the characters passing through the path are concatenated to be the string corresponding to the node
  • All children of each node contain different characters

Understand the Trie tree construction process through animation. Each step in the construction process is equivalent to inserting a string into the Trie tree. When all strings have been inserted, the Trie tree is constructed.

Insert operation of Trie tree

Inserting words into a Trie tree is a simple operation. Each letter of a word is inserted into the tree one by one. Before inserting the letter, check whether the node corresponding to the letter exists. If the node exists, share the node. If the node does not exist, create the corresponding node. For example, to insert the new word “cook”, here are the steps:

  • Insert the first lettercAnd found thatrootThere are child nodes below the nodec, the node is sharedc
  • Insert the second letteroAnd found thatcThere are child nodes below the nodeo, the node is sharedo
  • Insert the third letteroAnd found thatoThere is no child node below the nodeo, creates child nodeso
  • Insert the third letterkAnd found thatoThere is no child node below the nodek, creates child nodesk
  • So far, the wordcookAll letters have been inserted into the Trie tree, and then the node is setkThe flag bit marks the path inroot->c->o->o->kThe characters of all nodes on this path can form a single wordcook

Trie tree query operations

When looking for a string in the Trie tree, for example looking for the string code, you can split the string into individual characters C, O, D, e, and start matching from the root of the Trie tree. As shown, the green path is the path that matches in the Trie tree.

What if you’re looking for the string COD? You can use the same method as above, starting from the root node and matching along some path, as shown in the figure. The green path is the path where the string COD matches. However, the last node of the path “D” is not orange and is not a word flag bit, so the COD string does not exist. That is, COD is a prefix substring for a string, but it doesn’t exactly match any string.

Don’t be a salt fish, be a cook 🙂

The Trie tree deletion operation

The deletion operation of Trie tree is similar to that of binary tree. The location of the deleted node needs to be considered. The analysis is divided into three situations:

Delete whole words (e.ghi)

  • Look for the first character from the root nodeh
  • findhAfter the child node, continue the searchhThe next child node ofi
  • iIs the wordhiTo remove the flag bit
  • iNode ishiTo remove the leaf node
  • Discovered after deletionhThe node is a leaf node and is not a word flag bit, which is also deleted
  • And you’re donehiWord deletion operation

Delete prefix words (e.gcod)

cod
d

Delete branch words (e.gcook)

Delete whole words
cook
o
cook

Application of Trie tree

In fact, Trie trees are used everywhere in everyday life, like this one:

Specifically, it is often used to count and sort large numbers of strings (but not limited to strings), so it is often used by search engine systems for text word frequency statistics. Its advantages are that it minimizes unnecessary string comparisons and improves query efficiency over hash tables.

1. Match the prefix

For example: Find all strings in a collection of strings that begin with five minutes. We simply construct a trie tree from all the strings and output the keywords on the path starting with five −> minutes −> minutes.

Trie tree prefix matches are often used in search prompts. For example, when entering a web address, it can automatically search for possible choices. When no exact match is found, the most similar possible prefix can be returned

2. String retrieval

Give a list of N words and an article written in all lowercase English. Write down all the new words that are not in the list in the order they appear first.

The retrieval/query function is the original Trie tree function. Given a list of strings, find if a string has been used before, and compare it character by character, starting at the root node:

  • If a different character is found along the way, the string does not exist in the collection.
  • If all the characters are compared and all are the same, you need to determine the flag bit of the last node (which marks whether the node represents a keyword).

Limitations of Trie trees

As mentioned above, the core idea of Trie is to swap space for time, using the common prefix of string to reduce the overhead of query time and achieve the purpose of efficiency.

Suppose there are m characters, and several strings of length N form a Trie tree, then the degree of each node is M (that is, the number of possible child nodes of each node is M), and the height of the Trie tree is N. It is clear that we are wasting a lot of space to store characters, and the worst space complexity of the Trie tree is O(m^n). It is also because each node has an out degree of M that we can efficiently search down character by character along each branch of the tree, instead of iterating through all the strings. In this case, the Trie tree has a worst-case complexity of O(n).

This is the embodiment of space for time and the use of common prefixes to reduce the cost of query time.