What is a prefix tree?
Prefix trees are a special form of n-fork trees. In general, a prefix tree is used to store strings. Each node in the prefix tree represents a string (prefix). Each node has multiple children, and the paths to different children have different characters. The string represented by the child node is made up of the original string of the node itself, along with all the characters on the path to the child node.
Here is an example of a prefix tree:
In the example above, the value we mark in the node is the string represented by the node. For example, we start at the root node, select the second path ‘B’, then select its first child ‘A’, then continue to select the child ‘D’, and we end up at the leaf node ‘bad’. The value of a node is formed from the characters in the path from the root node.
Note that the root node represents an empty string.
An important feature of prefix trees is that all descendants of a node share a common prefix with strings associated with that node. That’s where the prefix tree name comes from.
Let’s look at this example again. For example, strings represented by nodes in a subtree rooted in node “B” all have a common prefix of “b”. Vice versa, strings with the common prefix “b” are all in the subtree with “b” as the root, and strings with different prefixes come from different branches.
Prefix trees have a wide range of applications, such as auto completion, spell checking and so on. I will describe practical application scenarios in later chapters.
How to represent a prefix tree?
In previous articles, we introduced the concept of prefix trees. In this article, we will discuss how to represent this data structure in code.
Before reading, briefly review the node structure of an N-fork tree.
//Definition for a Node.
class Node {
public int val;
public List<Node> children;
public Node(a) {}
public Node(int _val) {
val = _val;
}
public Node(int _val, List<Node> _children) { val = _val; children = _children; }};Copy the code
What makes prefix trees special is the correspondence between characters and child nodes. There are many different ways to represent the nodes of a prefix tree, and we’ll cover only two of them here.
Method one – array
The first method is to store child nodes in arrays.
For example, if we store only strings containing the letters A through Z, we can declare an array of size 26 in each node to store its children. For a particular character c, we can use c – ‘a’ as an index to find the appropriate child node in the array.
class TrieNode {
// change this value to adapt to different cases
public static final int N = 26;
public TrieNode[] children = new TrieNode[N];
// you might need some extra values according to different cases
};
/** Usage: * Initialization: TrieNode root = new TrieNode(); * Return a specific child node with char c: root.children[c - 'a'] */
Copy the code
Accessing child nodes is very quick. Accessing a particular child node is easier because, in most cases, it is easy to convert a character to an index. But not all child nodes need to do this, so this can lead to wasted space.
Method 2: Map
The second approach is to use a Hashmap to store child nodes.
We can declare a Hashmap in each node. The keys of a Hashmap are characters, and the values are the corresponding child nodes.
class TrieNode {
public Map<Character, TrieNode> children = new HashMap<>();
// you might need some extra values according to different cases
};
/** Usage: * Initialization: TrieNode root = new TrieNode(); * Return a specific child node with char c: root.children.get(c) */
Copy the code
It is easier to access a particular child node with the corresponding character. But it might be a little slower than using arrays. However, since we only store the child nodes we need, we save space. This approach is also more flexible because we are not constrained by fixed lengths and fixed ranges.
supplement
We have already mentioned how to represent child nodes in a prefix tree. In addition, we need to use some other values as well.
For example, we know that each node of a prefix tree represents a string, but not all strings represented by a prefix tree are meaningful. If we only want to store words in the prefix tree, we might need to declare a Boolean value in each node as a flag indicating whether the string represented by that node is a word.
Prefix tree insertion
We have already discussed this in another article.
Lift to ask:
Do you remember how to insert a new node into a binary search tree?
When we insert a target value into a binary search tree, in each node we need to determine which child node the target value needs to go to based on the relationship between the node value and the target value. Similarly, when we insert a target value into the prefix tree, we also need to determine our path based on the inserted target value.
More specifically, if we insert a string S into the prefix tree, we start at the root node. We will select a child node or add a new child node based on S[0] (the first character in S). The second node is then reached and the choice is made according to S[1]. Go to the third node, and so on. Finally, we iterate through all the characters in S in turn and reach the end. The end node will be the node representing the string S.
Here’s an example:
Let’s summarize the above strategy in pseudocode:
1. Initialize: cur = root
2. for each char c in target string S:
3. if cur does not have a child c:
4. cur.children[c] = new Trie node
5. cur = cur.children[c]
6. cur is the node which represents the string S
Copy the code
In general, you need to build your own prefix tree. Building the prefix tree is essentially calling the insert function multiple times. But remember to initialize the root node before inserting the string.
Prefix tree search
Search the prefix
As we mentioned in the introduction to prefix trees, all descendants of nodes share a common prefix with the corresponding string of that node. Therefore, it is easy to search for any word that begins with a particular prefix.
Similarly, we can search down the tree for a given prefix. Once we can’t find the child node we want, the search ends with failure. Otherwise, the search succeeds.
Let’s summarize the above strategy in pseudocode:
1. Initialize: cur = root
2. for each char c in target string S:
3. if cur does not have a child c:
4. search fails
5. cur = cur.children[c]
6. search successes
Copy the code
Search for words
You might also want to know how to search for specific words, rather than prefixes. We can use this word as a prefix and do the same search as above.
- If the search fails, it means that no word begins with the target word, and the target word never exists in the prefix tree.
- If the search is successful, we need to check if the target word is a prefix of a word in the prefix tree, or if it is a word in itself. To further solve this problem, you may need to modify the structure of the nodes slightly.
Tip: Adding booleans to each node may help you solve this problem effectively.
application
A Trie (pronounced “try”) or prefix tree is a tree data structure used to retrieve keys in a string data set. This efficient data structure has several applications:
1. Automatic completion
Figure 1. Google’s search suggestions
Spell check
Figure 2. Spell checking in a word processor
3. IP route (longest prefix match)
Figure 3. The longest prefix matching algorithm using the Trie tree, using the forwarding table to select the path in Internet Protocol (IP) routing.
4. T9 typing prediction
Figure 4. T9 (nine grid input), which was commonly used for mobile phone input in the 1990s
5. Word games
Figure 5. Trie tree can efficiently solve Boggle word game by pruning search space
There are other data structures, such as balanced trees and hash tables, that allow us to search for words in string datasets. Why do we need a Trie tree? Although a hash table can find a key in O(1) time, it cannot efficiently perform the following operations:
- Find all key values with the same prefix.
- A data set that enumerates strings in lexicographical order.
Another reason Trie trees are superior to hash tables is that as the size of the hash table increases, a large number of conflicts occur and the time complexity can increase to O(n), where n is the number of inserted keys. A Trie tree can use less space than a hash table when storing multiple keys with the same prefix. In this case, the Trie tree only needs O(m) time complexity, where mm is the key length. Finding a key in a balanced tree takes O*(mlog*n) time.
Complete implementation
The node structure of Trie tree
A Trie tree is a rooted tree whose nodes have the following fields:.
- Up to R links to children, each of which corresponds to a letter in the alphabet data set. In this article, R is assumed to be 26, the number of lowercase Latin letters.
- Boolean field to specify whether the node is the end of the corresponding key or just the key prefix.
Figure 6. Representation of the word “leet” in the Trie tree
class TrieNode {
// R links to node children
private TrieNode[] links;
private final int R = 26;
private boolean isEnd;
public TrieNode(a) {
links = new TrieNode[R];
}
public boolean containsKey(char ch) {
return links[ch -'a'] != null;
}
public TrieNode get(char ch) {
return links[ch -'a'];
}
public void put(char ch, TrieNode node) {
links[ch -'a'] = node;
}
public void setEnd(a) {
isEnd = true;
}
public boolean isEnd(a) {
returnisEnd; }}Copy the code
Insert a key into the Trie tree
We insert a key by searching the Trie tree. We start at the root and search for its links corresponding to the first key character. There are two cases:
- Links exist. Move along the link to the next child of the tree. The algorithm continues to search for the next key character.
- The link does not exist. Create a new node and connect it to the parent node’s link that matches the current key character.
Repeat the above steps until the last character of the key is reached, then mark the current node as the end node, and the algorithm is complete.
Figure 7. Inserting a key into the Trie tree
class Trie {
private TrieNode root;
public Trie(a) {
root = new TrieNode();
}
// Inserts a word into the trie.
public void insert(String word) {
TrieNode node = root;
for (int i = 0; i < word.length(); i++) {
char currentChar = word.charAt(i);
if(! node.containsKey(currentChar)) { node.put(currentChar,newTrieNode()); } node = node.get(currentChar); } node.setEnd(); }}Copy the code
Complexity analysis
- Time complexity: O(m), where MmAs the bond length. In each iteration of the algorithm, we either check or create a node until we reach the end of the key. You just need to mmTime operation.
- Space complexity: O(m). In the worst case, newly inserted keys and existing keys in the Trie tree do not have a common prefix. So I’m going to add m nodes and use O(m) space.
Find the key in the Trie tree
Each key is represented in the TRIe as a path from the root to the internal node or leaf. We start at the root with the first key character. Checks links corresponding to key characters in the current node. There are two cases:
-
Links exist. We move to the next node in the path following the link and continue searching for the next key character.
-
There is no link. If there are no key characters and the current node is marked isEnd
, returns true. Otherwise, there are two possibilities, both returning false:
- I still have key characters left, but I can’t follow the key path of the Trie tree. I can’t find the key.
- No key characters left, but the current node is not marked as
isEnd
. That is, the key to be found is just a prefix for another key in the Trie tree.
Figure 8. Finding a key in the Trie tree
class Trie {...// search a prefix or whole key in trie and
// returns the node where search ends
private TrieNode searchPrefix(String word) {
TrieNode node = root;
for (int i = 0; i < word.length(); i++) {
char curLetter = word.charAt(i);
if (node.containsKey(curLetter)) {
node = node.get(curLetter);
} else {
return null; }}return node;
}
// Returns if the word is in the trie.
public boolean search(String word) {
TrieNode node = searchPrefix(word);
returnnode ! =null&& node.isEnd(); }}Copy the code
Complexity analysis
- Time complexity: O(m). Each step of the algorithm searches for the next key character. In the worst case, it takes m operations.
- Space complexity: O(1).
Find the key prefix in the Trie tree
This approach is very similar to that used when searching for keys in a Trie tree. We traverse the Trie tree from root until there are no characters in the key prefix, or we cannot continue the path in the Trie with the current key character. The only difference from the “search key” algorithm mentioned above is that it always returns true when the end of the key prefix is reached. We do not need to consider whether the current Trie node is marked “isend” because we are searching for the prefix of the key, not the whole key.
Figure 9. Finding the key prefix in the Trie tree
class Trie {...// Returns if there is any word in the trie
// that starts with the given prefix.
public boolean startsWith(String prefix) {
TrieNode node = searchPrefix(prefix);
returnnode ! =null; }}Copy the code
Complexity analysis
- Time complexity: O(m).
- Space complexity: O(1).
The complete code
class Trie {
private final int ALPHABET_SIZE = 26;
private Trie[] children = new Trie[ALPHABET_SIZE];
boolean isEndOfWord = false;
public Trie(a) {}
public void insert(String word) {
Trie tmp = this;
for (char i : word.toCharArray()) {
if (tmp.children[i-'a'] = =null) {
tmp.children[i-'a'] = new Trie();
}
tmp = tmp.children[i-'a'];
}
tmp.isEndOfWord = true;
}
public boolean search(String word) {
Trie tmp = this;
for (char i : word.toCharArray()) {
if (tmp.children[i-'a'] = =null) {
return false;
}
tmp = tmp.children[i-'a'];
}
return tmp.isEndOfWord ? true : false;
}
public boolean startsWith(String prefix) {
Trie tmp = this;
for (char i : prefix.toCharArray()) {
if (tmp.children[i-'a'] = =null) {
return false;
}
tmp = tmp.children[i-'a'];
}
return true; }}Copy the code