What is approximate nearest neighbor search and what are the common methods
Briefly explain approximate nearest neighbor lookup (5 points)
Common methods (5 points)
Approximate nearest neighbor lookup:
NN vs. ANN
NN, Nearest Neighbor Search
KNN, K-Nearest Neighbors: search for the K data items nearest to the target data
ANN, Approximate Nearest Neighbor: improves retrieval efficiency at the expense of accuracy, within an acceptable range
Exact nearest neighbor retrieval has linear complexity; ANN methods can be used when processing large-scale data
LSH, Locality-Sensitive Hashing, is one of the ANN methods
Common methods:
LSH algorithm:
How to search for massive and high-dimensional data:
If the data is low-dimensional and small => search in a linear way
If the data is not only massive but also high-dimensional => need to reduce the dimension and use index-based search
LSH, Locality-Sensitive Hashing
The goal is to find one or more data items similar to a given item
An Approximate Nearest Neighbor (ANN) lookup method
Traditional hash tables are designed for exact retrieval and do not put similar data into the same bucket, such as h(x) = x mod w
LSH preserves the neighbor relationships between data items after mapping, i.e., it is locality-sensitive
Through the hash function, raw data items fall into buckets; data belonging to the same bucket is most likely adjacent (although non-adjacent data may also hash into the same bucket)
The original data set is divided into multiple subsets; the data within each subset is most likely adjacent, and each subset contains few elements
This facilitates neighbor lookup => find adjacent elements within a small set, as in the sketch below
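A minimal sketch of this idea, assuming 1-D numeric data and a quantizing hash h(x) = x // w (the bucket width w is an illustrative parameter): nearby values share a bucket, whereas the traditional x mod w scatters them.

from collections import defaultdict

def lsh_bucket(x, w=10):
    # Locality-sensitive: values within the same width-w interval collide.
    return x // w

data = [3, 7, 12, 15, 97, 99]
buckets = defaultdict(list)
for x in data:
    buckets[lsh_bucket(x)].append(x)

print(dict(buckets))                      # {0: [3, 7], 1: [12, 15], 9: [97, 99]}
# To find neighbors of a query, only its (small) bucket is scanned:
query = 5
candidates = buckets[lsh_bucket(query)]   # [3, 7]

By contrast, x mod 10 would separate the neighbors 12 and 15 (buckets 2 and 5) while colliding the distant values 7 and 97 (both bucket 7).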
MinHash algorithm:
Document similarity calculation:
k-shingle, also known as k-gram: any substring of length k in a document. Each document can be represented as the set of k-shingles that occur one or more times in it
For example, for document = "abcdabd", the set of 2-shingles is {ab, bc, cd, da, bd}
If two documents are similar, many of their shingles will also be the same
The longer the text, the larger k should be. Empirical reference values: short text k = 5, long text k = 10 (a shingling sketch follows below)
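A minimal sketch of k-shingle extraction (function name illustrative):

def shingles(doc, k=2):
    # Slide a window of length k over the document and collect unique substrings.
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

print(shingles("abcdabd"))   # {'ab', 'bc', 'cd', 'da', 'bd'}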
Jaccard similarity: Jaccard(A, B) = |A ∩ B| / |A ∪ B|
Massive data, high dimensionality => when the input matrix is very large, the goal is to find a hash function such that the original Jaccard similarity computation can be replaced by a similarity computation after dimension reduction (Input Matrix => Signature Matrix)
If the Jaccard similarity of the original documents is high, there is a high probability that their hash values will be the same
If the Jaccard similarity of the original documents is low, there is a high probability that their hash values will be different. MinHashing is such a hash for Jaccard similarity
What is a Hash function
The original Jaccard similarity computation is replaced by an equivalent similarity computation after dimension reduction (Input Matrix => Signature Matrix)
If the Jaccard similarity of the original documents is high, there is a high probability that their hash values will be the same
If the Jaccard similarity of the original documents is low, there is a high probability that their hash values will be different. MinHashing is such a hash for Jaccard similarity
What is a k-shingle
Also called k-gram: any substring of length k in a document. Each document can be represented as the set of k-shingles that occur one or more times in it
For example, for document = "abcdabd", the set of 2-shingles is {ab, bc, cd, da, bd}
If two documents are similar, many of their shingles will also be the same
The longer the text, the larger k should be. Empirical reference values: short text k = 5, long text k = 10
MinHash
How MinHash is computed:
How to permute the rows of massive data => explicit permutation costs too much storage space and computation time
=> Use multiple hash functions hi on the row index to simulate new row numbers: if the column element is 1 and the new row number is smaller than the current value M(i, c), update M(i, c)
for each row r do
    for each hash function hi do
        compute hi(r);
    for each column c
        if c has 1 in row r
            for each hash function hi do
                if hi(r) is smaller than M(i, c) then
                    M(i, c) := hi(r);
Example hash functions:
h(x)= x mod 5
g(x)=(2x+1) mod 5
Each time the row index advances by 1, evaluate the h and g functions on it
If the column value in that row is 1, check whether the hash gives a smaller value than the current signature entry, and if so, update it
=> Using m hash functions on the row index simulates m random row permutations (solves the cost of explicit row permutation); a runnable sketch follows below
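A minimal runnable sketch of the update rule above, assuming a small 0/1 input matrix and the two example hash functions h and g (all names illustrative):

# Input matrix: rows = elements (row index r), columns = sets/documents.
matrix = [
    [1, 0],
    [0, 1],
    [1, 1],
    [1, 0],
    [0, 1],
]
hashes = [lambda x: x % 5, lambda x: (2 * x + 1) % 5]   # h(x), g(x)
INF = float("inf")

# Signature matrix M: one row per hash function, one column per set.
M = [[INF] * len(matrix[0]) for _ in hashes]

for r, row in enumerate(matrix):          # for each row r
    hr = [h(r) for h in hashes]           # compute hi(r)
    for c, bit in enumerate(row):         # for each column c
        if bit == 1:                      # if c has 1 in row r
            for i, v in enumerate(hr):
                if v < M[i][c]:           # keep the minimum hi(r)
                    M[i][c] = v

print(M)   # [[0, 1], [0, 0]]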
MinHash LSH:
MinHash solves computing the similarity between one pair Ci and Cj; but across the whole data set, the number of pairwise similarity computations is C(N, 2) = N(N-1)/2
When the amount of data N is large (> 1 million), this computation is prohibitively expensive
=> Assign similar users to the same bucket with a high probability. In this way, the “candidate set of similar users” of each user will be much smaller and the computational complexity will be greatly reduced
LSH is an approximate solution for similarity
Divide the Signature vector into bands based on MinHash
The signature matrix is divided into b bands (groups), each composed of r rows
Hash each band, with a separate bucket space per band. As long as two columns have the same MinHash values throughout some band, they hash to the same bucket and become a candidate similar pair (a sketch follows below)
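A minimal sketch of the banding step, using each band's tuple of values directly as the bucket key (a simplification of hashing; names illustrative):

from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, b, r):
    # signatures: {doc_id: list of b * r MinHash values}
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)                    # separate bucket space per band
        for doc, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])  # the band's r values
            buckets[key].append(doc)
        for docs in buckets.values():                  # docs colliding in any band
            candidates.update(combinations(sorted(docs), 2))
    return candidates

sigs = {"A": [0, 1, 3, 2], "B": [0, 1, 5, 4], "C": [7, 8, 5, 4]}
print(candidate_pairs(sigs, b=2, r=2))   # {('A', 'B'), ('B', 'C')}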
Suppose that for a single row, the probability that two columns have the same signature value is s (the similarity of the two columns)
The probability that all r values in a band are equal is s^r
The probability that a band differs in at least one value is 1 - s^r
With b bands, the probability that all b bands differ is (1 - s^r)^b
The probability that at least one of the b bands is identical is 1 - (1 - s^r)^b
=> The probability that two columns become a candidate similar pair is 1 - (1 - s^r)^b
This is an AND-then-OR construction: within a band, all r rows must be identical (AND), and across bands, at least one band must be identical (OR). A hash collision (candidate pair) occurs only when both conditions are met.
Assume s = 0.8 and 20 bands of 5 rows each, i.e. b = 20, r = 5
The probability that all values in a band are equal is 0.8^5 = 0.328
The probability that a band differs is 1 - 0.328 = 0.672
The probability that all b bands differ is 0.672^20 ≈ 0.00035
The probability that at least one of the b bands is identical is 1 - 0.00035 ≈ 0.9996
After s exceeds a certain threshold, the probability of two users becoming candidates increases rapidly and approaches 1. This threshold, where the probability curve is steepest, is approximately (1/b)^(1/r); a sketch of the computation follows below.
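A minimal sketch computing the candidate probability and the threshold for the example above:

def candidate_prob(s, b, r):
    # Probability that two columns with per-row agreement probability s
    # collide in at least one of b bands of r rows: 1 - (1 - s^r)^b
    return 1 - (1 - s ** r) ** b

b, r = 20, 5
print(candidate_prob(0.8, b, r))   # ~0.9996
print((1 / b) ** (1 / r))          # threshold ~0.55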
Why the MinHash estimate equals the Jaccard similarity:
Classify the rows of the two columns Ci and Cj into three types: class A rows have 1 in both columns, class B rows have 1 in exactly one column, and class C rows have 0 in both
Class C rows have no effect on the result and can be deleted
Under a random row permutation, the first remaining row is equally likely to be any class A or class B row, so
P(h(Ci) = h(Cj)) = number of class A rows / number of class A and B rows = A / (A + B)
P(h(Ci) = h(Cj)) = Jaccard(Ci, Cj)
The Jaccard similarity of Ci and Cj is estimated with the probability that their MinHash values are equal
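A minimal simulation of this fact using random permutations of the row indices (all names illustrative):

import random

# Two sets as 0/1 columns over 6 rows.
# Jaccard = |{0, 2}| / |{0, 1, 2, 4}| = 0.5
Ci = [1, 0, 1, 0, 1, 0]
Cj = [1, 1, 1, 0, 0, 0]

def minhash(col, perm):
    # Permuted position of the first row in which the column has a 1.
    return min(perm[r] for r, bit in enumerate(col) if bit)

trials, hits = 10000, 0
rows = list(range(len(Ci)))
for _ in range(trials):
    perm = rows[:]
    random.shuffle(perm)
    hits += minhash(Ci, perm) == minhash(Cj, perm)

print(hits / trials)   # ≈ 0.5 = Jaccard(Ci, Cj)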
SimHash algorithm flow
SimHash algorithm:
Step1, set the number of SimHash bits, such as 32 bits, considering the storage cost and the size of the data set
Step2, initialize the SimHash accumulator, setting the counter for each bit position to 0
Step3, extract features from the text, such as 2-shingles
“the cat sat on the mat”=>{“th”, “he”, “e “, ” c”, “ca”, “at”, “t “, ” s”, “sa”, ” o”, “on”, “n “, ” t”, ” m”, “ma”}
Step4, use the traditional hash function to calculate the hashcode of each word
For example, “th”.hash = -502157718, “he”.hash = -369049682,…
Step5, for each bit of each word's hashcode: if the bit is 1, add the word's weight (usually its frequency of occurrence) to the corresponding SimHash counter; otherwise, subtract the weight
Step6, convert the accumulated counters into the final 32-bit SimHash: if a counter is greater than 0, set the bit to 1; otherwise, set it to 0 (a sketch of Steps 1-6 follows below)
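A minimal sketch of Steps 1 to 6, assuming 32-bit fingerprints, 2-shingle features weighted by frequency, and Python's built-in hash truncated to 32 bits (the hash choice is illustrative; the example above uses a different hash function):

from collections import Counter

def simhash(text, bits=32, k=2):
    # Step3: extract k-shingle features; frequency serves as the weight.
    features = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    counters = [0] * bits                     # Step1/2: one zeroed counter per bit
    for feat, weight in features.items():
        h = hash(feat) & ((1 << bits) - 1)    # Step4: hashcode, truncated to `bits` bits
        for i in range(bits):                 # Step5: +/- weight per bit
            counters[i] += weight if (h >> i) & 1 else -weight
    # Step6: counter > 0 => bit is 1, else 0
    return sum(1 << i for i, c in enumerate(counters) if c > 0)

print(f"{simhash('the cat sat on the mat'):032b}")

Note that Python salts str hashes per process, so these fingerprints vary across runs; in practice a stable hash (e.g. MD5) would be used so fingerprints are reproducible.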
How to get the fingerprint of each document through SimHash algorithm
Obtain the fingerprint of each document through SimHash algorithm
Calculate the hamming distance between the fingerprints of the two documents
Usually, if the Hamming distance between two fingerprints is less than 3, the similarity is considered high => the two documents are basically the same (a sketch follows below)
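A minimal sketch of the comparison, reusing the simhash sketch above:

def hamming(a, b):
    # Number of bit positions in which the two fingerprints differ.
    return bin(a ^ b).count("1")

fp1 = simhash("the cat sat on the mat")
fp2 = simhash("the cat sat on a mat")
print(hamming(fp1, fp2) < 3)   # True when the fingerprints differ in fewer than 3 bits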