This article originally appeared on my personal public account: TechFlow. Original writing is not easy; please consider following.
Today, in the 18th article of our series on algorithms and data structures, we take a look at a classic data structure: the union-find set, also known as a disjoint set.
First, let's unpack the name of this data structure. "Union-find" is really an abbreviation: "union" means merging, "find" means lookup, and the objects being operated on are sets. So the full name is literally "merge-and-find set", and as the name implies, this is a data structure used to merge sets and to look them up.
Definition of the union-find set
Although the set is an abstract concept, there are many operations on sets in everyday life; we are just so used to them that we rarely notice.
For example, suppose A and B are both members of the social elite, each with large assets in their own name. Suddenly the two announce they are married; normally, their assets would be consolidated into community property. If we view each person's assets as a set, then the marriage is effectively a merge of the two sets. For another example: in ancient times travel was difficult, and villages on opposite banks of a river often had little contact. If one day a bridge was built across the river, connecting the two banks, and we regard the closely connected villages on each bank as a set, then the bridge obviously merges the two sets.
The examples above are all about merging sets, and there are just as many examples of looking them up. The most classic is DNA testing. People often claim to be descendants of a certain family, and the best way to verify such a claim is to look at the DNA lineage: half of our genes come from our father and half from our mother, and these genes can also be viewed as sets. If someone claims royal descent and we match the genes of the royal family against those of his own family, and it turns out they are not from the same set, the claim naturally falls apart. Scientists are said to have used this method to show that Liu Sheng, Prince Jing of Zhongshan, was not descended from Liu Bang, but came from an unknown southern population.
The union-find set is precisely the data structure that solves these two kinds of problems: merging sets and looking them up.
Set lookup
Let's start with set lookup. Using the example above, how can we tell whether two different people are from the same family?
That's easy: we just need to find out whether they share a common ancestor. If analysis shows that their genes come from a common ancestor, they are probably from the same family. If we lay out the kinship relations, we get a tree: the ancestor is the root, and the youngest descendants are the leaves. In such a tree, each node has only one parent (treating a pair of parents as a single node) but can have multiple children.
So determining whether two people descend from the same ancestor reduces to determining whether two nodes are in the same tree.
Because we are dealing with a limited amount of data, not something as long as real gene sequences, we can handle this very simply. We just need to find the root of the tree each node belongs to and compare the two roots. If they are equal, the two nodes come from the same tree and naturally belong to the same set; otherwise, they belong to different sets.
At this point, we not only know how to determine whether two nodes belong to the same set, we have also discovered that a set should be represented as a tree.
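The idea can be sketched with a tiny, made-up family tree stored as a parent mapping (all names below are hypothetical, chosen only for illustration):

```python
# A hypothetical family tree: each person maps to their parent.
# The root (the common ancestor) points to itself.
family = {
    "ancestor": "ancestor",
    "parent_a": "ancestor",
    "parent_b": "ancestor",
    "child_1": "parent_a",
    "child_2": "parent_b",
}

def find_root(person):
    # Walk up the parent links until we reach a node that is its own parent.
    while family[person] != person:
        person = family[person]
    return person

# Two people belong to the same family iff their roots are equal.
same_family = find_root("child_1") == find_root("child_2")
print(same_family)  # True: both trace back to "ancestor"
```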
Merging sets
Now that we know sets are represented as trees, and that we can tell whether two nodes belong to the same set by comparing the roots of their trees, merging sets becomes easy. All we need to do is merge the two trees. This would be tricky if we had to preserve tree depth or some logical structure, but we don't need that much: we only need the membership of the resulting set to be correct. So we can simply connect one tree to the other.
Let’s say we have two trees A and B that need to be merged:
We just need to connect the root node of B to the tree of A:
Then all the nodes in trees A and B belong to the same set, and the lookup logic above still produces correct results. However, merging this way makes the merged tree deeper, which hurts the efficiency of later lookups, because leaf nodes end up farther from the root. So instead, we can point the root of tree B directly at the root of tree A:
Code implementation
For each node in the tree, a lookup needs to reach the root, not the leaves. So for each node we only need to record its parent. We can use a dict to store the parent of every node, and for a root node we conventionally set its parent to itself. Then, when we find a node whose parent equals itself, we have found the root.
Once we understand this logic, it is very simple to implement it in code:
```python
father = {}

def query(x):
    # A node whose parent is itself is a root
    if father[x] == x:
        return x
    return query(father[x])

def union(a, b):
    fa, fb = query(a), query(b)
    # Already in the same set, nothing to do
    if fa == fb:
        return
    # Point one root at the other
    father[fb] = fa
```
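To make the behavior concrete, here is a self-contained run of the two functions above (the element names "a" through "d" are arbitrary):

```python
father = {}

def query(x):
    # Recursively walk up to the root (the node whose parent is itself).
    if father[x] == x:
        return x
    return query(father[x])

def union(a, b):
    fa, fb = query(a), query(b)
    if fa == fb:
        return
    father[fb] = fa

# Each element starts as its own one-node set.
for node in ("a", "b", "c", "d"):
    father[node] = node

union("a", "b")                  # {a, b} merged
union("c", "d")                  # {c, d} merged
print(query("b") == query("a"))  # True: same set
print(query("b") == query("c"))  # False: different sets
union("b", "d")                  # now all four are one set
print(query("a") == query("d"))  # True
```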
As you can see, this code is less than ten lines long, and it works correctly, but it still has a lot of room for improvement.
Optimizing efficiency
We have implemented the algorithm, but it is not yet optimal, and we can make some optimizations. First, notice that every time we check whether elements are in the same set, we only consult the roots; we never care about the internal structure of the tree. In that case, we can ignore the structure entirely and flatten the tree, so that all nodes except the root are direct children of the root, for example:
These two trees give identical query results and have the same number of nodes and edges. The difference is that reaching the root in the left tree takes up to three steps, while every node in the right tree needs at most one step. This optimization becomes even more pronounced as the tree structure grows more complex.
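The difference can be measured directly. Assuming illustrative node names, a chain-shaped tree needs several steps to reach the root, while the flattened version of the same set needs only one:

```python
def hops_to_root(parent, x):
    # Count how many parent links we follow before reaching the root.
    steps = 0
    while parent[x] != x:
        x = parent[x]
        steps += 1
    return steps

# A chain: D -> C -> B -> A (root points to itself)
chain = {"A": "A", "B": "A", "C": "B", "D": "C"}
# The same set, flattened: every node points straight at the root.
flat = {"A": "A", "B": "A", "C": "A", "D": "A"}

print(hops_to_root(chain, "D"))  # 3
print(hops_to_root(flat, "D"))   # 1
```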
But the problem is that changing the tree structure is expensive, and in theory every merge changes the tree structure. Do we have to pay the cost of a full tree reconstruction on every merge? Moreover, we don't know the ratio of lookups to merges in advance. If lookups dominate, reconstructing the tree pays off handsomely; but if lookups are rare and most operations are merges, we may spend a great deal on reconstruction for little benefit.
Another problem is that reconstruction touches every node in the tree: we lift all of them up to be children of the root. But perhaps not all nodes ever participate in a lookup. If we only ever query a few of them, the work spent flattening the rest is wasted.
There is a very simple and very clever optimization for this problem: lazy computation.
This is not the first time we have seen lazy computation; we met it before when we introduced the scapegoat tree. In the scapegoat tree, physically removing a node is expensive because the tree structure changes with each deletion. So instead of deleting the node, we mark it as deleted: when a query reaches it, its content simply does not participate in the result, which simulates deletion. When the number of nodes marked as deleted reaches a certain threshold, we delete them all in one batch.
The same lazy idea applies to the union-find set. Tree reconstruction requires finding the root node, and finding the root is the expensive part. So when else do we need to find the root, other than during reconstruction? Obviously, when we check whether two nodes belong to the same set. In that case, we can reconstruct as we search, pointing every node on the search path directly at the root. This optimization is commonly known as path compression.
Let’s take an example. Suppose one search was made for the F element in the picture below, then the whole tree would look like the one on the right:
This logic is easy to implement with recursion, as simple as:
```python
def query(x):
    if father[x] == x:
        return x
    # Path compression: hang x directly under the root found recursively
    father[x] = query(father[x])
    return father[x]
```
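Restating the compressed `query` as a self-contained snippet, we can watch the flattening happen (the node names are illustrative):

```python
# A chain: D -> C -> B -> A (root points to itself)
father = {"A": "A", "B": "A", "C": "B", "D": "C"}

def query(x):
    if father[x] == x:
        return x
    # Path compression: point this node straight at the root on the way back.
    father[x] = query(father[x])
    return father[x]

print(query("D"))  # 'A'
# The chain has been flattened: every node now points directly at the root.
print(father)      # {'A': 'A', 'B': 'A', 'C': 'A', 'D': 'A'}
```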
In addition, we have another optimization. When we merge two trees, the order in which we merge will affect the result. Here’s an example:
As you can see, merging A into B and merging B into A give different results. Both results are correct, but they differ in depth. If we attach B, the shallower tree, under A, the deeper one, we get a tree of depth 3; if we do the reverse, we get a tree of depth 4. So we can maintain the depth of each subtree and, when merging, always attach the shallower tree under the deeper one, never the other way around.
Finally, putting all these ideas together, we can write the complete code. It is very simple, with less than 40 lines of core logic.
```python
class DisjointSet:
    def __init__(self, element_num=None):
        self._father = {}
        self._rank = {}
        # Each element forms its own set at initialization
        if element_num is not None:
            for i in range(element_num):
                self.add(i)

    def add(self, x):
        # Add a new singleton set
        # Skip if it already exists
        if x in self._father:
            return
        self._father[x] = x
        self._rank[x] = 0

    def _query(self, x):
        # A node whose parent is itself is a root
        if self._father[x] == x:
            return x
        # Path compression: point x directly at the root
        self._father[x] = self._query(self._father[x])
        return self._father[x]

    def merge(self, x, y):
        if x not in self._father:
            self.add(x)
        if y not in self._father:
            self.add(y)
        # Find the roots of the two elements
        x = self._query(x)
        y = self._query(y)
        # If they are equal, the elements already belong to the same set
        if x == y:
            return
        # Otherwise attach the shallower tree under the deeper root
        if self._rank[x] < self._rank[y]:
            self._father[x] = y
        else:
            self._father[y] = x
        # If the two depths are equal, the merged tree's depth grows by 1
        if self._rank[x] == self._rank[y]:
            self._rank[x] += 1

    # Check whether two elements belong to the same set
    def same(self, x, y):
        return self._query(x) == self._query(y)
```
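Here is a condensed, runnable restatement of the class together with a short usage session (the element values are arbitrary):

```python
class DisjointSet:
    # Condensed restatement of the class above so this example runs standalone.
    def __init__(self, element_num=None):
        self._father, self._rank = {}, {}
        if element_num is not None:
            for i in range(element_num):
                self.add(i)

    def add(self, x):
        if x in self._father:
            return
        self._father[x] = x
        self._rank[x] = 0

    def _query(self, x):
        if self._father[x] == x:
            return x
        self._father[x] = self._query(self._father[x])
        return self._father[x]

    def merge(self, x, y):
        self.add(x)
        self.add(y)
        x, y = self._query(x), self._query(y)
        if x == y:
            return
        # Make x the deeper root, then attach the shallower tree under it.
        if self._rank[x] < self._rank[y]:
            x, y = y, x
        self._father[y] = x
        if self._rank[x] == self._rank[y]:
            self._rank[x] += 1

    def same(self, x, y):
        return self._query(x) == self._query(y)

ds = DisjointSet(5)   # five singleton sets: {0} {1} {2} {3} {4}
ds.merge(0, 1)
ds.merge(3, 4)
print(ds.same(0, 1))  # True
print(ds.same(1, 3))  # False
ds.merge(1, 4)
print(ds.same(0, 3))  # True
```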
Conclusion
If you think about it carefully, the tree depth optimization is slightly imprecise, because we reconstruct the tree while querying for roots, so the recorded depth may change along the way. For example, if we happen to query the deepest leaf in the tree, the tree becomes shallower after the search. But although this small inaccuracy exists, it has little impact in practice. First, the situation is not very likely to arise; second, even when it does, the consequences are mild: a single merge in the wrong order causes at most a constant deviation in depth for a small number of nodes.
In addition, there are other optimization strategies, such as merging based on the number of nodes in each tree. Personally, I find merging the tree with fewer nodes into the tree with more nodes less reliable, because a tree with more nodes does not necessarily have greater depth; it may even be shallower. So in general we prefer tree depth as the merging criterion.
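For comparison, here is a sketch of the merge-by-node-count alternative mentioned above, usually called union by size. The class name and structure are my own illustration, not code from the article; combined with path compression, it performs comparably to union by rank in practice.

```python
class DisjointSetBySize:
    # Union by size: always merge the tree with fewer nodes into the larger one.
    def __init__(self):
        self._father = {}
        self._size = {}

    def add(self, x):
        if x not in self._father:
            self._father[x] = x
            self._size[x] = 1

    def _query(self, x):
        if self._father[x] == x:
            return x
        self._father[x] = self._query(self._father[x])
        return self._father[x]

    def merge(self, x, y):
        self.add(x)
        self.add(y)
        x, y = self._query(x), self._query(y)
        if x == y:
            return
        # Attach the smaller tree under the larger tree's root.
        if self._size[x] < self._size[y]:
            x, y = y, x
        self._father[y] = x
        self._size[x] += self._size[y]

    def same(self, x, y):
        self.add(x)
        self.add(y)
        return self._query(x) == self._query(y)

dsu = DisjointSetBySize()
dsu.merge("a", "b")
dsu.merge("b", "c")
print(dsu.same("a", "c"))  # True
```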
This algorithm is a true classic: it is not hard to understand, the code is short, it is highly efficient, and the learning curve is gentle. Its only real drawback is that its application scenarios are somewhat narrow. After all, no algorithm is perfect, and that is part of the beauty of algorithms.
I hope everyone got something out of this article. Original writing is not easy; please consider following ~