Today’s article is based on a practical problem I solved at work last week. It is a relatively common problem, no matter what kind of development you do, you probably encounter it. Specifically, we have a list of colleges and universities (2657), and we need to find the titles containing these colleges and universities from a large number of article titles, which is actually a fuzzy query. The corresponding pseudocode is as follows

selected_titles = []
forThe titleinMassive headlines:forColleges and universitiesinList of Universities:ifContains (college): selected_title.add (title)breakCopy the code

For big data development, the corresponding PSEUDO-code for SQL looks like this

select title
from tb
where 
  title rlike '| | Beijing university, tsinghua university... 2657 universities'Copy the code

Both approaches meet our requirements, but their common problem is that queries are inefficient. If we were to match not 2,657 colleges but hundreds of thousands or even millions, this would take an inconceivable amount of time.

Optimization of this kind of problem often need to make an issue on the data structure, in this problem, we can optimize the structure of data and only the “list of colleges and universities”, the pseudo code above we hold “list of colleges and universities” is an array of data structure, when we find a title contains a certain university, need to traverse it again from beginning to end “list” of colleges and universities, And the longer the list, the longer it takes to traverse.

Now that we are aware of the shortcomings of the array data structure, our next focus is to find a data structure that can do fuzzy queries without going through the entire “college list”. This data structure is the Trie tree that we are going to introduce today. The word is a little strange to the eye, and it is a tree structure, so it will feel very complicated. In fact, the design idea of this data structure is very simple, and it is easy to learn.

Let’s take a look at Trie trees. For the sake of illustration, assume that the “college list” contains only the following five elements

ABC, ABD, BCD, BCE, C, CAB, CDECopy the code

The corresponding two data structures are as follows:

Regardless of the time complexity of these two data structures, let’s first take a look at why Trie trees are more efficient than arrays. Let’s say we’re looking for a string called CDE, and in an array structure, we’re going to go through the array, compare it seven times to find the result, and we’re going to do a lot of work. Tire tree, however, can be found only three times, which has obvious advantages. Due to the tree structure, we do not need to consider the two branches starting with A and B on the left, which greatly reduces the number of comparisons and thus reduces “useless work”. Here is an animation to show how to create a Trie tree and find a string in the Trie tree (see GIF in the source directory if the video doesn’t play).

The process of building a tree is essentially iterating through each element of the string and creating nodes in the tree. The string lookup process is essentially a string traversal of the tree. Trie tree creation and string lookup are relatively simple.

I don’t know if you noticed that the nodes in the Trie tree above have two colors — white and green. The string from the root node to the current node in green is the string from the list of colleges and universities, which is the string we used to build the Trie tree. Take, for example, the leftmost leaf node “C,” which represents the “ABC” string in the “list of colleges and universities.” Similarly, the string “AB” is not an element in the “college list” because the “B” node is not green, so when we look up the string “AB” in this tree, we cannot find it. This is something that you should be aware of, and we’ll see it in the code below.

In addition, some friends may have a question, our initial requirement is not fuzzy query, in the Trie tree explanation section how to say string whole word (exact) matching. This is because full word matching is the most basic search method supported by Tire tree. On this basis, fuzzy matching can be easily realized by some flexibility.

Next, let’s look at the code implementation (Python), which first creates two arrays

colleges = utils.read_file_to_list('key_words.txt')
titles = utils.read_file_to_list('titles.txt')Copy the code

Colleges and titles are one-dimensional arrays with each element being a string.

If you understand the design of Trie trees, it’s easy to write the following code. The first step is to define a class that represents a Trie tree node

class TrieNode:
    def __init__(self):
        self.nodes = dict()
        # is_end=True represents the string that constructs the Trie tree from the root node to the current node.
        self.is_end = FalseCopy the code

Is_end =True is the green node we talked about above.

The code that creates the Trie tree is in the TrieNode class

def insert_many(self, items: [str]):
    Param Items: String array :return: None ""
    for word in items:
        self.insert(word)
​
def insert(self, item: str):
    Insert a phrase into the Trie tree :param item: string to be inserted :return: None """
    curr = self
    for word in item:
        if word not in curr.nodes:
            curr.nodes[word] = TrieNode()
        curr = curr.nodes[word]
    curr.is_end = TrueCopy the code

Now write the code to find the Trie tree, in the TrieNode class

def suffix(self, item: str) -> bool:
    Param item: string to be matched :return: True or False """
    curr = self
    for word in item:
        if word not in curr.nodes:
            return False
        curr = curr.nodes[word]  Get the child node
        if curr.is_end:  # if is_end=True, the current string contains a string from the list of colleges
            return True
    return False  # failed to matchCopy the code

This is not a whole word match, but a prefix match, that is, to determine whether the string item to be searched starts with a string in the list of colleges and universities.

So let’s write fuzzy matching, the code is in the TrieNode

def infix(self, item: str) -> bool:
    for i in range(len(item)):
        sub_item = item[i:]  # Divide the string to be found into different substrings
        If the prefix of the substring matches in the Trie tree
        Select * from list of colleges and universities; select * from list of colleges and universities;
        # which has realized the tile rlike '| | Beijing university, tsinghua university... Other university 'function
        if self.suffix(sub_item):
            return True
    return FalseCopy the code

In this case, the item is divided into different substrings for prefix matching. If the substring matches, it is said that the whole string item contains a certain string in the “list of colleges and universities”.

Finally, let’s run the above code and record the search time to compare it with the original array structure version. The following code

# Array version
cnt = 0
start_time = int(time.time() * 1000)
for title in titles:
    for x in colleges:
        if x in title:
            cnt += 1
            break
​
end_time = int(time.time() * 1000)
print(cnt)
print('spend: %.2fs' % ((end_time - start_time) / 60.0))
​
# Trie tree version
root = TrieNode()
root.insert_many(colleges)
cnt = 0
start_time = int(time.time() * 1000)
for title in titles:
    if root.infix(title):
        cnt += 1
​
end_time = int(time.time() * 1000)
print(cnt)
print('spend: %.2fs' % ((end_time - start_time) / 60.0))Copy the code

The following output is displayed:

5314
spend: 9.13s
5314
spend: 0.23sCopy the code

As you can see, 9s was used for array matching, while 0.23s was used for Trie tree matching!

This method to improve the performance of fuzzy query of massive data introduced today is realized by writing code. For big data developers who often write SQL, it is only necessary to build a UDF to use it. It is necessary to establish a Trie tree with “university list” in the initialization code of UDF.

Today’s content is shared here, I hope to help you. Public number reply keyword trie obtain complete source code

Welcome the public account “du Code” to export the dry goods you can’t see elsewhere.