This article first appeared on my personal public account, TechFlow. Original writing is not easy, so please consider following.
Today, in the 20th article of our machine learning series, we’ll look at the FP-growth algorithm.
It’s somewhat less well known than Apriori: Apriori appears in many data mining textbooks, while FP-growth is rarely mentioned. The reason is simple. Functionally, FP-growth and Apriori do essentially the same thing; FP-growth amounts to a performance-optimized version of Apriori.
But I have to say that being an optimization puts it in an awkward spot. Applications with high performance requirements, whether in industry or in academic research, have long had better choices and can use far more powerful algorithms and models rather than an optimized version of such an old algorithm. And for scenarios with low performance requirements, plain Apriori is already sufficient, so the optimization is unnecessary.
But whatever the algorithm’s fate, its principles and ideas still have real merit. Let’s see how it works.
FP-growth and FP-tree
The core value of FP-growth lies in acceleration. In the Apriori algorithm introduced earlier, every time we screen frequent itemsets from the candidate sets, we need to scan the full database to compute supports. Obviously this cost is very high, especially when the amount of data is large.
The essence of FP-growth is that it builds an FP-tree while scanning the entire data set only twice, so overall it is much faster than Apriori. The reason FP-growth can be so fast is that it mines the FP-tree rather than the whole data set, which lets it skip a large portion of the data and save a great deal of computation.
From this we can see that the FP-tree is the essence of the whole algorithm. Before we get into how the tree is built and the details involved, let’s look at what the letters FP mean. FP is short for frequent pattern, which can also be understood as frequent items. Frankly speaking, the FP-tree stores only the data of frequent items: every time we mine frequent itemsets and association rules, we work from the FP-tree, which has already filtered out the infrequent data.
Below is a fully built FP-tree. Let’s first take a look at what it looks like, and then explain the details and principles.
The head pointer table
The structure above may seem confusing at first glance, and it is not at all clear what to make of it. That is quite normal. But if we take the figure apart, we can divide it into two parts: a linked list on the left and a tree on the right. The list on the left points into the tree on the right, which is what makes the whole thing look a little complicated.
Let’s ignore the tree on the right and the pointers connecting the pointer table to the tree, and just look at the head pointer table on the left. On its own this part is much easier: it is simply the set of single frequent items.
As mentioned earlier, the FP-growth algorithm only needs to traverse the data set twice. The first pass happens right here: we walk through the data set and count how many times each element appears. Elements that appear too rarely are then filtered out according to a threshold, and what remains is the collection of single frequent items.
The logic here is very simple, just two things. The first thing is to count the number of occurrences of each individual item, and the second thing is to filter out infrequent items based on a threshold.
from collections import defaultdict

def filter_unfreq_items(dataset, min_freq):
    data_dict = defaultdict(int)
    # Count the number of occurrences of each item
    for trans in dataset:
        for item in trans:
            data_dict[item] += 1
    # Filter by threshold
    ret = {}
    for k, v in data_dict.items():
        if v >= min_freq:
            ret[k] = v
    return ret
Using this method, we generate the data for the head pointer table. During the construction of the FP-tree, we will turn this data into linked lists, which form the head pointer table on the left. We don’t yet know how to build the tree, but we can at least turn the dict into a pointer table. The logic is simple: for each value in the dict, we just add room for a reference.
def transform_to_header_table(data_dict):
    return {k: [v, None] for k, v in data_dict.items()}
The None is the reference to the next node in the linked list. Right now we only have the list heads, so they are all set to None.
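To see what these two steps produce, here is a quick sanity check. The transactions below are my own assumption, chosen so that the frequent items match those in the figures (z, r, x, y, s, t); the article never shows its own sample data.

dataset = [
    ['r', 'z', 'h', 'j', 'p'],
    ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
    ['z'],
    ['r', 'x', 'n', 'o', 's'],
    ['y', 'r', 'x', 'z', 'q', 't', 'p'],
    ['y', 'z', 'x', 'e', 'q', 's', 't', 'm'],
]

data_dict = filter_unfreq_items(dataset, 3)
# {'r': 3, 'z': 5, 'y': 3, 'x': 4, 't': 3, 's': 3}
head_table = transform_to_header_table(data_dict)
# {'r': [3, None], 'z': [5, None], 'y': [3, None], ...}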
Building the FP-tree
Once we have the head pointer table, that is, the high-frequency single elements, we naturally want to use it. The obvious use is to filter the whole data set and throw away the low-frequency elements.
In fact, the construction process of FP-tree is essentially a process of inserting the filtered results into the tree. We’ll talk about that later, but let’s start with filtering.
Pure filtering is, of course, very simple: we just check whether each element of a record appears in the head pointer table. But that is not all, because we want to insert the result into the tree afterwards. Borrowing the idea of Huffman trees, we want the more frequent elements to sit closer to the root node and the less frequent ones farther away, which improves the efficiency of later queries. To achieve this, we need to sort the data.
Let’s implement this part. This part is divided into two parts. The first part is filtering according to the head pointer table, and the second part is sorting according to the frequency of occurrence in the head pointer table.
def rank_by_header(data, header_dict):
    rank_dict = {}
    for item in data:
        # If the element is high frequency, keep it; otherwise discard it
        if item in header_dict:
            rank_dict[item] = header_dict[item]
    # Sort the elements by their overall frequency of occurrence
    item_list = sorted(rank_dict.items(), key=lambda x: x[1], reverse=True)
    return [i[0] for i in item_list]
With this data, we are only one step away from building the tree. As mentioned earlier, the construction of the FP-tree is very simple: sort the elements of each record by overall frequency and insert them starting from the root, updating the count of every node along the path. This is confusing to describe in words, but it makes sense once you look at the figure:
We start with an empty tree and insert the first record {r, z}. Since z occurs more often than r overall, z is inserted first and then r. Next we insert {z, x, y, s, t}, again sorted by overall frequency. Since z has already been inserted, we just update its count to 2; the remaining elements share no existing prefix, so they form a new branch.
Essentially, each record is inserted in order of frequency from shallow to deep, and the count of an element is updated if it already exists on the path.
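To make the walkthrough concrete, this is my sketch of the tree after those first two insertions, with counts in parentheses:

Null (root)
└── z (2)
    ├── r (1)
    └── x (1)
        └── y (1)
            └── s (1)
                └── t (1)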
This logic should be easy to understand. Before we implement the logic, let’s create a class for the tree node.
class Node:
    def __init__(self, item, freq, father):
        self.item = item
        self.freq = freq
        # Parent pointer
        self.father = father
        # Pointer to the next node holding the same item (for the header table's linked list)
        self.ptr = None
        # A dict is used to store the children, keyed by item for easy lookup
        self.children = {}

    # Update the frequency
    def update_freq(self):
        self.freq += 1

    # Add a child
    def add_child(self, node):
        self.children[node.item] = node
For this class, the code speaks for itself and should be entirely straightforward. You could also add a display method to help with debugging, but to be honest it’s hard to see a tree structure well just by printing, so I won’t add one here.
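That said, if you do want a rough way to eyeball the tree while debugging, a minimal sketch (my own addition, not part of the original code) could look like this:

def display(node, depth=0):
    # Recursively print the tree, indenting each level by its depth
    print('  ' * depth + '%s (%d)' % (node.item, node.freq))
    for child in node.children.values():
        display(child, depth + 1)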
After the tree node is defined, the next step is to implement the construction of the FP-tree. If you understood the logic above, this part of the code is very simple: we just insert each record into the tree in sorted order.
Inserting a sequence into a tree is a very common operation that appears in many data structures, the most classic being the Trie. If you have learned the Trie, you will find that this insertion is almost identical; if you haven’t, it doesn’t matter, since the logic is intuitive on its own.
def create_FP_tree(dataset, min_freq=3):
    header_dict = filter_unfreq_items(dataset, min_freq)
    root = Node('Null', 0, None)
    for data in dataset:
        # Sort by the total number of occurrences
        item_list = rank_by_header(data, header_dict)
        head = root
        # Insert into the tree in sorted order
        for item in item_list:
            if item in head.children:
                head = head.children[item]
            else:
                new_node = Node(item, 0, head)
                head.add_child(new_node)
                head = new_node
            head.update_freq()
    return root
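As a quick check of this first version, assuming the sample transactions from earlier:

root = create_FP_tree(dataset, min_freq=3)
# z heads every sorted transaction that contains it, so it sits right under the root
print(root.children['z'].freq)   # 5
# the fourth transaction has no z, so its most frequent item x starts a second branch
print(root.children['x'].freq)   # 1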
Update the header pointer table
Now that the FP-tree is complete, we need to add the logic that updates the head pointer table, so that for each item we can find every position of that element on the FP-tree through the table.
Taking a closer look at the image above, we highlight one of the links:
As you can see from the figure above, the head pointer table strings all occurrences of the same element together into a linked list.
So when do we need to add values to this list?
A little analysis reveals that this happens exactly when we create a new node in the tree. With that in mind, all we need to do is add a few lines to the code above: whenever a new node is created, we append it to the linked list for that item.
def create_FP_tree(dataset, head_table, min_freq=3):
    header_dict = filter_unfreq_items(dataset, min_freq)
    root = Node('Null', 0, None)
    for data in dataset:
        # Sort by the total number of occurrences
        item_list = rank_by_header(data, header_dict)
        head = root
        # Insert into the tree in sorted order
        for item in item_list:
            if item in head.children:
                head = head.children[item]
            else:
                new_node = Node(item, 0, head)
                head.add_child(new_node)
                # If the head pointer for this item is still empty,
                # we simply point it at the new node
                if head_table[item][1] is None:
                    head_table[item][1] = new_node
                else:
                    # Otherwise, we append the new node to the end of the list
                    insert_table(head_table[item][1], new_node)
                head = new_node
            head.update_freq()
    return root
We could maintain a dict that tracks the tail of each linked list, but I’m being lazy here and will do it the simplest way: walk to the end of the list and then append:
def insert_table(head_node, node):
    # Walk to the end of the linked list, then append the new node
    while head_node.ptr is not None:
        head_node = head_node.ptr
    head_node.ptr = node
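With the sample data from earlier, we can now build the full tree together with the header table and walk one of the linked lists, say the one for s:

data_dict = filter_unfreq_items(dataset, 3)
head_table = transform_to_header_table(data_dict)
root = create_FP_tree(dataset, head_table, 3)

# Visit every occurrence of 's' in the tree through the linked list;
# the freq values of these nodes sum to head_table['s'][0]
node = head_table['s'][1]
while node is not None:
    print(node.item, node.freq)
    node = node.ptr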
Fast data search using the FP-tree
Now that we have completed the construction of the FP-tree, it is time to focus on mining frequent itemsets. So what do we actually do with the FP-tree once we have it?
If we take a closer look at the FP-tree, we can see that it is really a condensed version of the data: it can be thought of as what remains of the complete data set after the infrequent elements are removed. Following the principle of the Apriori algorithm, the next step would be to use the frequent itemsets of length 1 to build frequent itemsets of length 2, and so on until all frequent itemsets are found.
But in the FP-growth algorithm, we modify this logic slightly: we fix one element at a time and look for all frequent itemsets that contain it. To find those frequent itemsets, we first need the relevant data. The original data contains too much irrelevant information that we no longer need, and the FP-tree lets us extract exactly the data we need, efficiently.
Since we inserted items into the FP-tree in strict order of occurrence count, elements with higher counts sit higher in the tree. The number of times a path in the tree appears in the data set equals the count stored at the lowest node on that path.
Let’s look at an example:
The s in the red box is the lowest node on its path, so the whole path {z, x, y, s} appears twice in the data set. Once s is fixed, we can walk up the FP-tree to restore all the records of frequent items that contain s.
We first implement a helper method that walks up the tree and collects all the elements on the path:
def up_forward(node):
    path = []
    # Climb toward the root, collecting the item at each node;
    # the root itself is filtered out since its father is None
    while node.father is not None:
        path.append(node.item)
        node = node.father
    return path
The second helper restores all the records for an element once it is fixed. Retrieving a single node is not enough: we need every position of the element in the FP-tree. This is where the head pointer table comes in. Through it we can find all the positions of the element in the FP-tree, and by calling the method above at each one, we can restore the data set.
Again, this logic is not difficult: we walk through the linked list and, for each node, call the method above to climb the tree and collect the data.
def regenerate_dataset(head_table, item):
    dataset = {}
    if item not in head_table:
        return []
    # Find the start of the linked list through head_table
    node = head_table[item][1]
    # Walk through the linked list
    while node is not None:
        # For each position in the list, call up_forward to get the path on the FP-tree
        path = up_forward(node)
        if len(path) > 1:
            # Use the path as the key; item itself is removed, so the newly built data
            # contains no item, and we mine new frequent items on the premise of item
            dataset[frozenset(path[1:])] = node.freq
        node = node.ptr
    # Restore the key-value data to array form
    ret = []
    for k, v in dataset.items():
        for i in range(v):
            ret.append(list(k))
    return ret
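For example, conditioning on r with the sample data and the head_table built above (the order inside each record may vary, since frozenset is unordered):

cond_data = regenerate_dataset(head_table, 'r')
# The sample tree holds three r nodes, each with count 1, so we get
# three prefix-path records with r itself stripped out, e.g.:
# [['z'], ['x'], ['z', 'x', 'y']]
print(cond_data)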
Recursive tree building, mining frequent itemsets
At this point, we should have a relatively clear understanding of the FP-tree. Its function is to let us quickly find the frequency of certain sets of elements, because records sharing the same elements are stored along the same chain of the tree.
So how do we mine frequent itemsets based on FP-tree?
This is the essence of the real algorithm.
Let’s look at the example above:
Let’s assume our fixed element is r. Through the FP-tree, we can quickly find the items z, x, y and s that co-occur with r. Using the method above, we obtain a new data set that consists only of frequent items appearing together with r, with r itself removed. We then build a new FP-tree for this data. If the new FP-tree contains any element, that element must form a frequent itemset of size two with r. Suppose that element is x; then we repeat the operation, remove x from the data, and build the FP-tree again to mine frequent itemsets of size three containing x and r, and so on until the FP-tree is empty.
This becomes a recursive procedure. The FP-tree itself does not mine frequent itemsets, only frequent items; but we can add a precondition by hand. When we mine frequent items under the premise that the data must contain x, what we find are really frequent itemsets of size two containing x. As we stack up more and more preconditions, the frequent itemsets we mine contain more and more elements.
If you’re a little confused, let’s take a look at the code:
def mine_freq_lists(root, head_table, min_freq, base, freq_lists):
    # Sort the items in head_table by frequency
    frequents = [i[0] for i in sorted(head_table.items(), key=lambda x: x[1][0])]
    for freq in frequents:
        # base is the frequent itemset used as the premise
        new_base = base.copy()
        # Add the current element to the frequent itemset
        new_base.add(freq)
        # Record the answer
        freq_lists.append(new_base)
        # Fetch the data conditioned on the current frequent itemset via the FP-tree
        new_dataset = regenerate_dataset(head_table, freq)
        # Create a new head_table
        new_head_table = transform_to_header_table(filter_unfreq_items(new_dataset, min_freq))
        # If it is empty, there are no longer frequent itemsets
        if len(new_head_table) > 0:
            # Otherwise, build a new FP-tree
            new_root = create_FP_tree(new_dataset, new_head_table, min_freq)
            # Mine recursively
            mine_freq_lists(new_root, new_head_table, min_freq, new_base, freq_lists)


if __name__ == "__main__":
    dataset = create_dataset()
    data_dict = filter_unfreq_items(dataset, 3)
    head_table = transform_to_header_table(data_dict)
    root = create_FP_tree(dataset, head_table, 3)
    # Example: the conditional data set for item 'r'
    data = regenerate_dataset(head_table, 'r')
    freq_lists = []
    mine_freq_lists(root, head_table, 3, set([]), freq_lists)
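One last gap: the create_dataset referenced in the main block never appears in this article. A minimal stand-in, assuming the same sample transactions used throughout:

def create_dataset():
    # Hypothetical stand-in matching the sample transactions above
    return [
        ['r', 'z', 'h', 'j', 'p'],
        ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
        ['z'],
        ['r', 'x', 'n', 'o', 's'],
        ['y', 'r', 'x', 'z', 'q', 't', 'p'],
        ['y', 'z', 'x', 'e', 'q', 's', 't', 'm'],
    ]

After running, freq_lists holds every frequent itemset as a set, from single items like {'z'} up to larger combinations such as {'t', 'x', 'y', 'z'}.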
In closing
This concludes the FP-growth algorithm for mining frequent itemsets. Compared with Apriori, it involves many more technical details, so if beginners find it hard to digest at first, that is perfectly normal.
The core idea of Apriori is to use two frequent itemsets of length l to build frequent itemsets of length l+1. FP-growth works slightly differently: it takes a frequent itemset of length l as the premise, screens out the data set that contains this frequent itemset, builds a new FP-tree from that data, and looks for new frequent items in it. If one is found, it can be combined with the length-l itemset into a frequent itemset of length l+1. Then the process repeats.
By comparison, how to build the FP-tree and how to maintain the head pointer table are the simple parts.
If you found this article helpful, please follow me.