This is the sixth day of my participation in the November Gwen Challenge. Check out the event details: The last Gwen Challenge 2021
Introduction to Min-Hash Algorithm
Introduction to Min-Hash Algorithm in Detail and Easy to understand [2]
Introduction to Min-Hash Algorithm with Super detail and Easy to understand
Introduction to Min-Hash Algorithm with Super detail and Easy to understand
The actual min-hash algorithm
As we mentioned in the previous section, shuffling a matrix by row is a fairly computationally complex operation. Therefore, if we want to put the min-hash algorithm into practical application, we must find another hash function to replace the shuffling operation, as a new minimum hash operation to get hash signature. So, what are the methods adopted in practice? That’s what we’re going to cover in this section.
In practical applications, a group of random mapping hash functions with good uniformity is usually used to hash each element of the original set. After each hash function is used to hash the whole set once, the minimum value in the result set is selected and added into the new set. After each set of hash functions has hashed the entire set separately, the resulting new set is our hash signature.
For example, we have a set of hash functions hi(xj), where hi represents a set of hash functions (let’s say k), and xj represents the elements in the set where we need to compute the hash signature. All elements in the set are hashed using h1, and the minimum value is selected and added to a new set. Then use h2 to repeat the above operation and add the minimum value in the result to the new set until hk ends. Our new set has K elements. This new set is our hash signature, and the above operation is our new minimum hash operation.
The above is just a brief illustration of the idea in practical application. However, we still need to solve the following problems:
- Are the hash signatures of the original set as similar to each other after this new minimum hash operation?
- How do you choose such a set of hash functions?
Now, let’s tackle the first problem first.
Suppose we have two sets C1 and C2 that need to calculate similarity, then we will write them as a minimum hash operation respectively, and the resulting values will be written as min-hash (C1) and min-hash (C2). Because of random improbability, then the probability that min-hash (C1) is equal to min-hash (C2) is equal
You can view the above formula from this point of view, using the same hash function on C1, C2 minimum hash (note that this is an operation, not an operation), all the results of the set C1, C2In the collectiontheThe minimum valueThe probability that the corresponding element belongs to the intersection of the two sets is.
Now that the first problem has been proved, let’s tackle the second problem: how to select such a set of hash functions.
In fact, as a rule of thumb, we usually use the following formula to get the desired number of hash functions by directly varying the coefficients:
In practice, we can get an infinite number of hash functions by adjusting the size of the two coe coefficients and the mod for three parameters.
Well, this is really the end of the introduction to the Min-hash algorithm, and in the next section, we’ll give you an actual code example to deepen your understanding of it.