What is approximate nearest neighbor search, and what are the common methods?

Briefly explain approximate nearest neighbor lookup (5 points)

Common methods (5 points)

Approximate nearest neighbor lookup:

NN vs. ANN

NN, Nearest Neighbor Search: find the data items nearest to the target data

KNN, K-Nearest Neighbors: find the K data items nearest to the target data

ANN, Approximate Nearest Neighbor: trade a tolerable loss of accuracy for much better retrieval efficiency

Exact nearest neighbor retrieval has linear complexity, so ANN methods are used when processing large-scale data

LSH, Locality-Sensitive Hashing, is one of the ANN methods

Common methods:

LSH algorithm:

How to search massive, high-dimensional data:

If the data is low-dimensional and small => search linearly

If the data is both massive and high-dimensional => reduce the dimensionality and use an index for the search

LSH, Locality-Sensitive Hashing

The goal is to find one or more data items similar to a given item

LSH is an Approximate Nearest Neighbor (ANN) method

A traditional hash table is built to retrieve exact data and does not put similar data into the same bucket, e.g. h(x) = x mod w

LSH preserves the relationship between adjacent data items after mapping, i.e. it is locality-sensitive

Through the hash function, some raw data falls into each bucket, and data in the same bucket is most likely adjacent (though non-adjacent data may also hash into the same bucket)

The original data set is thus divided into multiple subsets; the data in each subset is likely adjacent, and each subset contains few elements

This makes neighbor lookup easy => find adjacent elements within a small set

MinHash algorithm:

Document similarity calculation:

K-shingle, also known as k-gram: any substring of length k in a document. Each document can be represented as the set of k-shingles that occur one or more times in it

For example, for document = "abcdabd", the set of 2-shingles is {ab, bc, cd, da, bd}

If two documents are similar, many of their shingles will also be the same

The longer the text, the larger k should be. As empirical values, k = 5 for short text and k = 10 for long text

Jaccard similarity: Jaccard(A, B) = |A ∩ B| / |A ∪ B|
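The shingling and Jaccard steps above can be sketched in a few lines of Python (the helper names `shingles` and `jaccard` are illustrative, not from the notes):

```python
def shingles(doc, k=2):
    """Return the set of k-shingles (k-grams) of a string."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

s1 = shingles("abcdabd")   # {'ab', 'bc', 'cd', 'da', 'bd'}
s2 = shingles("abcdab")    # {'ab', 'bc', 'cd', 'da'}
print(jaccard(s1, s2))     # 4 shared shingles / 5 total = 0.8
```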

Massive data, high dimensionality => when the characteristic matrix is very large, the goal is to find a hash function so that computing the original Jaccard similarity becomes an equivalent similarity computation on a reduced-dimension matrix (Input Matrix => Signature Matrix)

If the Jaccard similarity of the original documents is high, their hash values are equal with high probability

If the Jaccard similarity of the original documents is low, their hash values differ with high probability. MinHashing is such a hash function for Jaccard similarity

What is the hash function here?

The original Jaccard similarity computation is equivalent to a similarity computation on the reduced-dimension matrix (Input Matrix => Signature Matrix).

If the Jaccard similarity of the original documents is high, their hash values are equal with high probability

If the Jaccard similarity of the original documents is low, their hash values differ with high probability. MinHashing is such a hash function for Jaccard similarity

What is a k-shingle?

Also called k-gram: any substring of length k in a document. Each document can be represented as the set of k-shingles that occur one or more times in it

For example, for document = "abcdabd", the set of 2-shingles is {ab, bc, cd, da, bd}

If two documents are similar, many of their shingles will also be the same

The longer the text, the larger k should be. As empirical values, k = 5 for short text and k = 10 for long text

Minimum hash (MinHash)

How MinHash is computed

How to avoid explicitly permuting the rows of massive data => saves storage space and computation time

=> Use multiple hash functions; hash each row index to obtain a new row number. If the column element is 1 and the new row number is smaller than the current value in the signature matrix M, update M

for each row r do
    for each hash function hi do
        compute hi(r);
    for each column c do
        if c has 1 in row r then
            for each hash function hi do
                if hi(r) is smaller than M(i, c) then
                    M(i, c) := hi(r);

For example, with the hash functions:

h(x) = x mod 5

g(x) = (2x + 1) mod 5

Each time the row number increases by 1, run the h and g functions on it

If the value in a column is 1, check whether the hash gives a smaller value than the current entry, and if so, update it

=> Applying m hash functions to the row indices simulates m row permutations (this solves the cost of explicit row permutation)
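The row-hashing procedure above can be sketched as follows, using the h and g from the example; the toy input matrix is an assumption for illustration:

```python
# Toy input matrix: rows = elements of the universe, columns = sets
# (1 = element present). The matrix itself is an illustrative assumption.
INF = float("inf")

def minhash_signature(matrix, hash_funcs):
    """Build the MinHash signature matrix M (one row per hash function)."""
    n_cols = len(matrix[0])
    # M(i, c) starts at infinity for every hash function i and column c.
    M = [[INF] * n_cols for _ in hash_funcs]
    for r, row in enumerate(matrix):
        hashed = [h(r) for h in hash_funcs]   # compute hi(r) once per row
        for c in range(n_cols):
            if row[c] == 1:                   # column c has a 1 in row r
                for i, hr in enumerate(hashed):
                    if hr < M[i][c]:          # keep the smallest hash seen
                        M[i][c] = hr
    return M

h = lambda x: x % 5
g = lambda x: (2 * x + 1) % 5
matrix = [
    [1, 0],
    [0, 1],
    [1, 1],
    [0, 0],
    [1, 0],
]
print(minhash_signature(matrix, [h, g]))   # → [[0, 1], [0, 0]]
```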

MinHash LSH

MinHash solves the problem of computing the similarity between Ci and Cj, but when the amount of data is large, the number of pairwise similarity computations is N(N-1)/2

When the amount of data N is large (> 1 million), that computation cost is enormous

=> Assign similar users to the same bucket with high probability. The "candidate set of similar users" for each user is then much smaller, and the computational complexity is greatly reduced

LSH is an approximate solution for similarity search

Based on MinHash, divide the signature vectors into bands

The Signature Matrix is divided into b bands, each consisting of r rows

Hash each band, giving each band its own bucket space. As long as two columns agree on the MinHash values of any one band, the two columns hash to the same bucket and become a candidate similar pair.

Suppose that in any single row, the probability that two columns have the same signature value is s (the similarity of the two columns).

The probability that all r values in a band are equal is s^r

The probability that a band contains at least one differing value is 1 - s^r

With b bands per document, the probability that all b bands differ is (1 - s^r)^b

The probability that at least one of the b bands matches is 1 - (1 - s^r)^b

=> The probability that two columns become a candidate similar pair is 1 - (1 - s^r)^b

This AND-then-OR construction requires that all r rows within a band be identical (AND), and then that at least one of the b bands be identical (OR). Only when both conditions are met does a hash collision (candidate pair) occur.

Assume s = 0.8, with 20 bands of 5 rows each, i.e. b = 20, r = 5

The probability that all values in a band are equal is 0.8^5 ≈ 0.328

The probability that a band contains a differing value is 1 - 0.328 = 0.672

The probability that all b bands differ is 0.672^20 ≈ 0.00035

The probability that at least one of the b bands matches is 1 - 0.00035 ≈ 0.9996

Once s exceeds a certain threshold, the probability of two users becoming candidates rises rapidly toward 1. This threshold, where the probability curve is steepest, is approximately (1/b)^(1/r), i.e. about 0.55 for b = 20, r = 5
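The banding probability and the threshold can be checked numerically (the function name `candidate_prob` is illustrative):

```python
def candidate_prob(s, b, r):
    """Probability that two columns with per-row agreement probability s
    become a candidate pair under b bands of r rows: 1 - (1 - s**r)**b."""
    return 1 - (1 - s ** r) ** b

b, r = 20, 5
print(round(candidate_prob(0.8, b, r), 4))   # ≈ 0.9996, as in the example
print(round(candidate_prob(0.2, b, r), 4))   # dissimilar pairs rarely collide
print(round((1 / b) ** (1 / r), 2))          # steepest-rise threshold ≈ 0.55
```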

Why MinHash estimates Jaccard similarity

For two columns Ci and Cj, classify the rows: class a rows have a 1 in both columns, class b rows have a 1 in exactly one column, and class c rows have 0 in both. Class c rows have no effect on the result and can be deleted.

P(h(Ci) = h(Cj)) = (number of class a rows) / (number of class a and class b rows) = a / (a + b)

Since Jaccard(Ci, Cj) = a / (a + b) as well:

P(h(Ci) = h(Cj)) = Jaccard(Ci, Cj)

The Jaccard similarity of Ci and Cj can therefore be estimated by the probability that their MinHash values are equal
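This equality can be verified empirically with random permutations over a toy universe (the sets Ci and Cj below are assumptions for illustration; their Jaccard similarity is 2/6 = 1/3):

```python
import random

def minhash(perm, s):
    """MinHash of set s under one permutation of the universe."""
    return min(perm[x] for x in s)

universe = list(range(10))
Ci = {0, 1, 2, 3}
Cj = {2, 3, 4, 5}          # |Ci ∩ Cj| = 2, |Ci ∪ Cj| = 6, Jaccard = 1/3

random.seed(0)
trials = 20000
hits = 0
for _ in range(trials):
    perm = universe[:]
    random.shuffle(perm)   # one random row permutation
    if minhash(perm, Ci) == minhash(perm, Cj):
        hits += 1
print(hits / trials)       # close to 1/3
```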

SimHash algorithm flow

SimHash algorithm:

Step 1: choose the number of SimHash bits, e.g. 32 bits, considering the storage cost and the size of the data set

Step 2: initialize the SimHash vector, setting every position to 0

Step 3: extract features from the text, e.g. 2-shingles

"the cat sat on the mat" => {"th", "he", "e ", " c", "ca", "at", "t ", " s", "sa", " o", "on", "n ", " t", " m", "ma"}

Step 4: compute the hashcode of each feature with a traditional hash function

For example, "th".hash = -502157718, "he".hash = -369049682, ...

Step 5: for each bit of each feature's hashcode, if the bit is 1, add the feature's weight (usually its frequency) to the corresponding position of the SimHash vector; otherwise, subtract its weight

Step 6: for the final 32-dimensional vector, set each bit of the SimHash to 1 if the accumulated value is greater than 0; otherwise set it to 0
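The six steps above can be sketched as follows; md5 stands in for the "traditional hash function" (the notes' example values suggest Java's String.hashCode, so the concrete bits differ, but the flow is the same):

```python
import hashlib

def simhash(features, bits=32):
    """SimHash fingerprint: accumulate +1/-1 per hash bit, then threshold."""
    v = [0] * bits                      # Step 2: vector initialized to 0
    for feat in features:               # each occurrence adds its weight,
        # Step 4: hash each feature (md5 here is a stand-in assumption).
        h = int(hashlib.md5(feat.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):           # Step 5: +1 if bit is 1, else -1
            v[i] += 1 if (h >> i) & 1 else -1
    # Step 6: bit is 1 where the accumulated value is positive.
    return sum(1 << i for i in range(bits) if v[i] > 0)

text = "the cat sat on the mat"
feats = [text[i:i + 2] for i in range(len(text) - 1)]   # Step 3: 2-shingles
print(format(simhash(feats), "032b"))
```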

How to judge document similarity with SimHash fingerprints

Obtain the fingerprint of each document with the SimHash algorithm

Compute the Hamming distance between the fingerprints of the two documents

Usually, if the Hamming distance between two documents is less than 3, their similarity is considered high => the two documents are essentially the same
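The Hamming-distance check is just a popcount of the XOR of the two fingerprints:

```python
def hamming(f1, f2):
    """Number of differing bits between two fingerprints."""
    return bin(f1 ^ f2).count("1")

a = 0b10110100   # toy 8-bit fingerprints for illustration
b = 0b10010101
print(hamming(a, b))   # → 2, below the usual threshold of 3,
                       # so the documents would be judged near-duplicates
```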