What is a prefix tree?
Prefix tree, also known as word lookup tree, Trie tree, is a tree structure, is a variant of the hash tree. It is typically used to count, sort and save large numbers of strings (but not limited to strings), so it is often used for text word frequency statistics by search engine systems.
Its advantages are that it reduces query time by using the common prefix of strings, minimizes unnecessary string comparisons, and improves query efficiency over hash trees.
What does a prefix tree look like?
Prefix tree itself is a multi-fork tree, except root node does not contain characters, other nodes, each node contains a character, and multiple nodes of a node, is not repeated characters, but for the whole multi-fork tree, the same node may have repeated characters. Because of this feature, prefix trees save a lot of space for storing the same prefix.
A prefix tree from the root node to the bottom of the leaf node, all the character concatenation, is a complete string.
The child nodes B, C, and D of the same node A are the prefix of B, C, and D for the resulting string.
Prefix tree example 1
For example, IF I construct a prefix tree for [Cang Teacher, the appearance of Cang teacher, the synonym of Cang Teacher, the antonym of Cang Teacher], I can get a structure as follows:
Prefix tree example 2
Suppose you have five strings: code, cook, five, file, fat build the prefix tree
Prefix tree characteristics
- The root node contains no characters, and each node except the root node contains only one character
- From the root node to a node, the characters passing through the path are concatenated to be the string corresponding to the node
- All children of each node contain different characters
Prefix tree build logic
To build the prefix tree, just insert the participating strings into the tree one by one. When all strings are inserted, the prefix tree is built
For a single string, it is predictable that each character is at the first level of the prefix tree, assuming root is at level 0.
So for the string “talent”, t must be at the first layer, A must be at the second layer, L must be at the third layer, and so on
The insertion distance logic for a string is as follows:
(1) Determine whether there is a node with the first character of the string in the first layer. If there is, select directly and create the node as the child node of the node in the upper layer. Then enter the node
(2) All child nodes of nodes in the upper layer act as the second layer. In the second layer, judge whether the node of the second character in the string exists, and repeat the process of 1
(3) Until the string is fully inserted.
The efficiency of prefix trees
The query time of the dictionary tree is O(logL), where L is the length of the string. So the efficiency is relatively high. Dictionary trees are more efficient than hash tables.
Hash table:
The hash function is used to hash all the words into keys, and the query can be performed directly through the hash function. It is known that the efficiency of the hash table is O(1). Of course, this is because if the hash function is well selected, the calculation is less, and the conflict is less, the word query speed will be very fast. What if the hash function has a relatively large amount of computation and a high collision law? These are all factors to consider.
In addition, the hash table does not support dynamic query. When we want to query the word apple, the hash table must wait for the user to complete the input of the word apple. You can’t hash when you type into appL.
Dictionary tree (tries tree) :
Prefix trees are typically spatial-temporal data structures.
For word queries, it is better to use a dictionary tree, but there are also conditions, the space allows, dictionary tree space is more wasteful compared to hash, after all, hash can use a bit array.
Application scenarios
Search for pop-up prompt words
Fast string checking (more suitable for prefix matching)
The existence of a string can be quickly determined by traversing the prefix tree.
For example, I might check to see if “aaabc” exists by going through all the strings and comparing them one by one, or comparing whether they are map keys. But for prefixes, you just need to check whether there is a corresponding character in each layer, that is, whether there is a in the first layer, whether there is a in the second layer, and so on.
In particular, if the string prefixed by aaabc exists, the hash method will not be able to determine.
String retrieval, word frequency statistics, search engine hot queries
The known strings (dictionaries) are stored in the trie tree to see if other unknown strings occur or how often.
When counting frequencies, we make it possible for each node to record the number of strings that end with the current node, because the string is unique for any node in the prefix tree that ends. Therefore, the number of occurrences of the string can be recorded directly in the node.
For example:
- 1) Have a file of 1 GB size, each line is a word, the size of the word is not more than 16 bytes, the memory limit is 1 MB. Returns the 100 words with the highest frequency.
If the repetition of words is very high, this can be achieved by using a prefix tree, traversing the entire prefix tree to get the statistics for each node, and then sorting to get the words that fit
- 2) Please write down all the new words that are not in the vocabulary list in the order that they appear first.
- 3) Give a dictionary of bad words. The words are all lowercase letters. Give a paragraph of text, and each line of text is also lowercase. Determine if the text contains any bad words. For example, if rob is a bad word, then the text problem contains bad words.
- 4) 10 million strings, some of which are repeated, need to remove all the repeated strings, keep the non-repeated strings
- 5) Search for hot queries: The search engine will record all search strings used by the user each time through a log file. Each search string is 1-255 bytes long. Let’s say there are 10 million records. The number of repeat reads for these query strings is high. Although the total number is 10 million, if you subtract the repeat sum, it is not more than 3 million. The more repetitions a query string has, the more users it has and the more popular it is. Please count the top 10 most popular query string, the use of memory should not exceed 1G.
The longest public prefix of a string
Trie trees use common prefixes for multiple strings to save storage space, whereas when we store a large number of strings on a Trie tree, we can quickly get common prefixes for certain strings. For example:
For example, given N strings, find the common prefix of any two strings.
Insert N strings into the prefix tree and look for the common prefix, which is essentially looking for the nearest common node of the two nodes.
The sorting
A Trie tree is a multi-fork tree, and if the entire tree is traversed first, the corresponding string output is the result of lexicographical sorting.
Each node of the prefix tree is in order, that is, in lexicographic order. Therefore, the lexicographic order of all strings can be obtained by traversing the prefix tree once.
As an auxiliary structure for other data structures and algorithms
Such as suffix trees, AC automata and so on.
Implement prefix tree
The title
Version 1 is correct
public class Trie { private Trie[] children; private boolean isEnd; Public Trie() {// Children = new Trie[26]; isEnd = false; } public void insert(String word) { Trie node = this; for (int i = 0; i < word.length(); i++) { char ch = word.charAt(i); int index = ch - 'a'; if (node.children[index] == null) { node.children[index] = new Trie(); } node = node.children[index]; } node.isEnd = true; } public boolean search(String word) { Trie node = searchPrefix(word); return node ! = null && node.isEnd; } public boolean startsWith(String prefix) { return searchPrefix(prefix) ! = null; } private Trie searchPrefix(String prefix) { Trie node = this; for (int i = 0; i < prefix.length(); i++) { char ch = prefix.charAt(i); int index = ch - 'a'; if (node.children[index] == null) { return null; } node = node.children[index]; } return node; }}Copy the code
The right reason
(1) For each layer of the prefix tree, the Trie pointer is redirected to the next layer of the object, i.e. node = node.children[index];
The maximum xOR value of two numbers in an array
The title
Version 1 is correct
TrieNode root; public int findMaximumXOR(int[] nums) { int len = nums.length; root = new TrieNode(new TrieNode[2]); // Create root node. For (int I = 0; i < len; i++) { build(nums[i]); } // match each number in the binary tree with int res = 0; for (int i = 0; i < len; i++) { int t = query(nums[i]); res = Math.max(res,t); } return res; } public void build(int x){ TrieNode now = root; For (int I = 30; int I = 30; i >= 0; Int u = x >> I &1; int u = x >> I &1; If (now.son[u] == null) {// If this node does not exist, create now.son[u] = new TrieNode(new TrieNode[2]); } // Enter the next node now = now.son[u]; } } public int query(int x){ TrieNode now = root; int res = 0; for (int i = 30; i >= 0 ; Int u = x >> I & 1; If (now.son[1-u]! = null) {// If (I = 1) {// if (I = 1) {// if (I = 1) { now = now.son[1 - u]; } else {now = now. Son [u]; } } return res; } // Simple prefix tree structure class TrieNode{TrieNode son[]; public TrieNode(TrieNode[] son) { this.son = son; }}Copy the code
The right reason
(1) Treat each number as a 32-bit string, i.e., an int number corresponds to a 32-bit binary number, with each bit being a 0 or 1. This turns each number into a string, and then builds a prefix tree. As shown in the figure below.
(2) Then invert each digit and match it in the binary tree, that is, the original digit is 7, and the corresponding binary is 000111. Then invert each digit and match it in the prefix tree, and the resulting digit is the maximum xOR of this digit and any digit in the array. So if I take the maximum xor of all the numbers, that’s my final answer.