
Background

Faiss is Facebook’s open-source vector retrieval (recall) engine: given a query vector, it finds the vectors in a collection that are most similar to it.

Faiss’s first official release came on February 23, 2018, but one of its authors, Matthijs Douze, had already co-published a paper on nearest neighbor search in 2011, before joining Facebook, and Faiss is based on that paper. Once you have read the paper, Faiss’s indexing method becomes clear.

Problem description

Given a query vector $x \in \mathbb{R}^D$ and a collection $Y = \{y_1, \dots, y_N\}$ of $N$ vectors of the same dimension, we need to find the $k$ vectors in $Y$ nearest to $x$, i.e. its $k$ nearest neighbors.

Taking Euclidean distance as an example, it can be expressed as:

$$d(x, y) = \|x - y\|_2 = \sqrt{\sum_{j=1}^{D} (x_j - y_j)^2}$$

and the goal is to find the $k$ vectors $y \in Y$ minimizing $d(x, y)$.
In our application, $N$ is on the order of 2000W ($2 \times 10^7$) vectors.

Problem size

Let’s try to do an exhaustive search in the roughest way possible to see how complex this solution is.

  1. Construct the distance matrix: the distance between every pair of vectors $y_i$ and $y_j$ is calculated by the formula above, taking $O(D)$ per pair; the distance matrix contains $N^2$ elements, so the total time is $O(N^2 D)$
  2. Find the $k$ nearest neighbors from the distance matrix. Using a min-heap, each row costs $O(N \log k)$, so all $N$ rows cost $O(N^2 \log k)$

Taking $N = 2000\mathrm{W} = 2 \times 10^7$, we get:

  • The distance matrix contains $N^2 = 400\mathrm{T}$ elements; assuming each distance is a float occupying 32 bits, it needs at least 1600TB of space
  • The time complexity of constructing the distance matrix is on the order of $N^2 D = 4 \times 10^{14} \cdot D$
  • The time complexity of finding the $k$ nearest neighbors from the distance matrix is on the order of $N^2 \log k = 4 \times 10^{14} \cdot \log k$
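To make the cost concrete, here is a minimal brute-force sketch (numpy only; the names and array shapes are illustrative, and the real $N \approx 2 \times 10^7$ makes this approach infeasible):

```python
import numpy as np

def brute_force_knn(query, vectors, k):
    """Exhaustive k-NN: O(N * D) distances plus an O(N) partial sort."""
    dists = np.sum((vectors - query) ** 2, axis=1)  # squared L2 to all vectors
    idx = np.argpartition(dists, k)[:k]             # k smallest, unordered
    return idx[np.argsort(dists[idx])]              # order them by distance

Y = np.random.rand(10_000, 128).astype('float32')   # toy collection
x = np.random.rand(128).astype('float32')           # query
print(brute_force_knn(x, Y, k=5))
```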

Nearest neighbor offline table

In general, vector sets are updated only daily, so we can try to precompute and store the k nearest neighbors of each vector directly:

  1. Constructing the distance matrix plus running the $N$ $k$-nearest-neighbor searches takes $O(N^2 D + N^2 \log k)$ time
  2. The nearest neighbor offline table occupies $O(N \cdot k)$ space

In our scenario:

  1. Building the table, i.e. the $N$ $k$-nearest-neighbor lookups, takes on the order of $N^2 (D + \log k) \approx 4 \times 10^{14} \cdot (D + \log k)$ operations
  2. Storage space is 1600TB for the distance matrix (deleted after construction) + 320G for the table (assuming each index is represented by an int and each score by a float)

Time to find the k nearest neighbors of a vector afterwards: O(1)
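A sketch of the offline-table idea (names are illustrative; a real implementation would process the distance computation in blocks rather than row by row):

```python
import numpy as np

def build_knn_table(vectors, k):
    """Precompute, once per update cycle, each vector's k nearest neighbors."""
    table = {}
    for i, v in enumerate(vectors):
        dists = np.sum((vectors - v) ** 2, axis=1)
        dists[i] = np.inf                       # exclude the vector itself
        idx = np.argpartition(dists, k)[:k]
        table[i] = idx[np.argsort(dists[idx])]  # neighbor ids by distance
    return table

table = build_knn_table(np.random.rand(1_000, 64), k=10)
print(table[0])  # every later lookup is O(1)
```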

Vector Quantization

In our scenario, the most precise distance is not required, allowing a certain degree of error. In this case, we can introduce vector quantization to reduce the number of vectors significantly.

So-called vector quantization maps the original infinite space $\mathbb{R}^D$ onto a finite set of vectors $\mathcal{C} = \{c_1, \dots, c_k\}$, where $k$ is a natural number. The function $q$ that maps $\mathbb{R}^D$ to the set $\mathcal{C}$ is called the quantizer, and $\mathcal{C}$ is called the codebook in information theory.

Of course, the mapping function cannot be specified arbitrarily: it must satisfy a minimum-error principle. One approach is to take the mean squared error as the objective to minimize:

$$\mathrm{MSE}(q) = \mathbb{E}_x\!\left[\, d(q(x), x)^2 \,\right]$$

Well, that’s the target function of the K-means method! Therefore, we can use K-means as the method to find the best Codebook.
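To make this concrete, a minimal codebook trainer using plain Lloyd k-means (numpy only; the function names are illustrative, and later sketches in this article reuse them):

```python
import numpy as np

def train_codebook(vectors, k, n_iter=25):
    """Plain Lloyd k-means: the k centroids become the codebook."""
    rng = np.random.default_rng(0)
    codebook = vectors[rng.choice(len(vectors), k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: quantize every vector to its nearest centroid.
        d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster.
        for j in range(k):
            members = vectors[assign == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook

def quantize(x, codebook):
    """q(x): index of the nearest codebook centroid."""
    return int(((codebook - x) ** 2).sum(-1).argmin())
```

Note that the assignment step alone is $O(N \cdot k \cdot D)$ per iteration, which is exactly the complexity analyzed below.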

Now let’s analyze the space and time complexity of vector quantization

Suppose we map the original 2000W vectors to a set of size 20W (on average each center point represents 100 vectors, already introducing a large error)

  1. Distance matrix: we only need to store the distances between the $k$ center vectors, so the space occupied is $k^2$ elements; in this case $(2 \times 10^5)^2 = 4 \times 10^{10}$ distances, about 160GB as floats
  2. If each vector is identified by an int index into the codebook, the mapping takes approximately $N \times 4\mathrm{B} = 80\mathrm{MB}$ of memory
  3. Time complexity: the time complexity of the k-means algorithm is $O(I \cdot N \cdot k \cdot D)$, where $I$ is the number of iterations, $N$ is the number of original space vectors, $k$ is the number of center points, and $D$ is the vector dimension. In this case that is $I \cdot 4 \times 10^{12} \cdot D$: when the number of iterations reaches 25, it is already at the order of magnitude of brute-force search, and as soon as the number of iterations increases, the time complexity grows even higher

Product Quantization PQ

A lot of the time, the different parts of a vector follow different distributions.

So we can split each vector into $m$ parts and carry out vector quantization on each part separately. Assuming an even split, each part has dimension $D^* = D / m$.

A vector $y$ can thus be divided into $m$ subvectors $u_1(y), \dots, u_m(y)$; the codebook of group $j$ is $\mathcal{C}_j$, and the corresponding subquantizer is denoted $q_j$. The final global codebook is the Cartesian product $\mathcal{C} = \mathcal{C}_1 \times \cdots \times \mathcal{C}_m$, and the name product quantization comes from exactly this.
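A sketch of training and encoding under this scheme, reusing `train_codebook` from above (assuming each group's codebook has $k^*$ centroids; names are illustrative):

```python
import numpy as np

def train_pq(vectors, m, k_star, n_iter=25):
    """Train one k-means codebook per subvector group."""
    N, D = vectors.shape
    d_sub = D // m                          # D* = D / m
    parts = vectors.reshape(N, m, d_sub)
    return [train_codebook(parts[:, j, :], k_star, n_iter) for j in range(m)]

def pq_encode(y, codebooks):
    """Encode y as m small centroid indices (m * log2(k*) bits total)."""
    m = len(codebooks)
    d_sub = len(y) // m
    return [
        int(((codebooks[j] - y[j * d_sub:(j + 1) * d_sub]) ** 2).sum(-1).argmin())
        for j in range(m)
    ]
```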

Let’s take $m = 4$. To reach the 20W codebook size of the previous section, $k^*$ only needs to reach 22 ($22^4 \approx 23\mathrm{W}$); and for the codebook to cover all 2000W vectors, $k^*$ only needs to reach 67 ($67^4 \approx 2 \times 10^7$).

Taking $m = 4$, with enough center points per group for the equivalent codebook to reach the 2000W scale, let’s now analyze the corresponding space and time complexity:

  1. Distance matrices between centers: within each group, only the distances between that group’s $k^*$ center vectors are needed, so the total size is $m \cdot (k^*)^2$ entries. With float distances, in this example that is about 70M
  2. Each vector is encoded as $m$ small indices; stored as ints, all $N$ vectors take about $N \cdot m \cdot 4\mathrm{B}$ of memory
  3. Clustering time complexity: the time complexity of the k-means algorithm is still $O(I \cdot N \cdot k \cdot D)$. Here k-means is performed separately on each of the $m$ groups, each with $k^*$ centers and dimension $D/m$, so the total is $m \cdot O(I \cdot N \cdot k^* \cdot \frac{D}{m}) = O(I \cdot N \cdot k^* \cdot D)$, about 3 orders of magnitude lower than the example above
  4. Time complexity of building the distance matrices: the number of matrix elements times the subvector dimension, i.e. $O(m \cdot (k^*)^2 \cdot \frac{D}{m}) = O((k^*)^2 \cdot D)$

The essential reason product quantization cuts space so dramatically: the number of vectors it can express is $(k^*)^m$, but the disk space occupied grows only with $m \cdot k^*$ (the $m$ codebooks together store $m \cdot k^* \cdot \frac{D}{m} = k^* \cdot D$ floats).

The empirical values given in the paper are $m = 8$ and $k^* = 256$; the corresponding expressible vector space size is about $256^8 = 2^{64} \approx 1.8 \times 10^{19}$.
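The bookkeeping behind those numbers, as a quick sketch (the dimension $D = 128$ is an assumption, e.g. SIFT vectors as used in the paper):

```python
import math

N, D, m, k_star = 20_000_000, 128, 8, 256

bits_per_code = m * math.log2(k_star)  # 8 * 8 = 64 bits per encoded vector
codebook_floats = k_star * D           # m * k* * (D/m) = k* * D stored floats
expressible = k_star ** m              # (k*)^m ~ 1.8e19 representable points

print(bits_per_code, codebook_floats, f"{expressible:.1e}")
```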

Distance calculation in the Quantization scenario

Without quantization, the distance between two points is computed directly.

In the quantization scenario, however, the distance between $x$ and $y$ is not computed directly but via the center points. There are two variants of this approach:

SDC

SDC (Symmetric Distance Computation): the distance between two vectors is approximated by the distance between their center points, $\hat{d}(x, y) = d(q(x), q(y))$. By the triangle inequality, the error is less than or equal to the distance from $x$ to its center plus the distance from $y$ to its center.

In the PQ scenario, the SDC distance is expressed as:

$$\hat{d}(x, y) = \sqrt{\sum_{j=1}^{m} d\big(q_j(u_j(x)),\, q_j(u_j(y))\big)^2}$$

Since the distances between center vectors are already stored (see the previous section), the distance between $x$ and $y$ can be obtained purely by table lookup; the table is about 70M in size, and building it takes on the order of $O((k^*)^2 \cdot D)$ time.

If we refuse to spend the space on the distance matrix, each distance must be computed directly from the center vectors, so a query costs $O(N \cdot D)$, which here is about $2 \times 10^7 \cdot D$.
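A sketch of SDC with lookup tables, building on the PQ sketch above (`dist_tables[j][a, b]` holds the squared distance between centroids $a$ and $b$ of group $j$; names illustrative):

```python
import numpy as np

def build_dist_tables(codebooks):
    """Precompute per-group centroid-to-centroid squared distances."""
    return [((cb[:, None, :] - cb[None, :, :]) ** 2).sum(-1) for cb in codebooks]

def sdc_distance(code_x, code_y, dist_tables):
    """Approximate d(x, y)^2 by d(q(x), q(y))^2: just m table lookups, O(m)."""
    return sum(dist_tables[j][code_x[j], code_y[j]]
               for j in range(len(dist_tables)))
```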

ADC

ADC (Asymmetric Distance Computation): the distance between $x$ and $y$ is approximated by the distance between $x$ itself and $y$’s center point, $\tilde{d}(x, y) = d(x, q(y))$. Using the triangle property that the difference of two sides is less than the third side, the error must be less than or equal to the distance from $y$ to its center.

In the PQ scenario, the ADC distance is expressed as:

$$\tilde{d}(x, y) = \sqrt{\sum_{j=1}^{m} d\big(u_j(x),\, q_j(u_j(y))\big)^2}$$

where $u_j(x)$ denotes the $j$-th subvector of $x$.

In this scenario, if we want to obtain distances purely by table lookup, the table built in the data-preparation phase has about $N \cdot m \cdot k^*$ entries (each vector must compute the distance from each of its subvectors to every center vector of its group), which takes time on the order of $O(N \cdot k^* \cdot D)$.

If the ADC distance is instead computed directly, the complexity is the same as SDC without table lookup, i.e. about $2 \times 10^7 \cdot D$ per query.

With the lookup strategy, ADC pays a much larger memory cost than SDC in exchange for higher accuracy.
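In the common online setting, the table can instead be built per query; a sketch under the same assumptions as the earlier PQ code:

```python
import numpy as np

def adc_tables(x, codebooks):
    """Per-query tables: squared distance from each subvector of x to every
    centroid of its group. Building them costs O(k* * D) per query."""
    m = len(codebooks)
    d_sub = len(x) // m
    return [((codebooks[j] - x[j * d_sub:(j + 1) * d_sub]) ** 2).sum(-1)
            for j in range(m)]

def adc_distance(tables, code_y):
    """Approximate d(x, y)^2 by d(x, q(y))^2: m lookups per database vector."""
    return sum(tables[j][code_y[j]] for j in range(len(tables)))
```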

IVFADC

In the approach above, query time is spent comparing the query against every center vector and, via lookups, against every one of the $N$ codes. A natural improvement is to first locate a rough candidate region, avoiding computation against the large number of vectors that cannot possibly be nearest neighbors.

Coarse quantization + residual quantization

Therefore, Matthijs proposed a coarse quantization + residual quantization process in the paper. Specifically, a small codebook of size $k'$ (say 1000) is first built from the entire data set, with its quantizer denoted $q_c$ (the coarse quantizer). Each vector $y$ then has a residual $r(y) = y - q_c(y)$. The original vectors may have particularly large distribution differences/imbalances, but working with residuals greatly mitigates this problem.

PQ is then applied to the residual $r(y)$. Because the “energy” of $r(y)$ is lower than that of the original vector, the PQ step can approximate it more accurately. Denote the product quantizer of this step $q_p$; then $y$ is represented by $q_c(y) + q_p(r(y))$, and the distance between two vectors $x$ and $y$, $d(x, y)$, can be approximated by:

$$\tilde{d}(x, y) = d\big(x,\; q_c(y) + q_p(y - q_c(y))\big)$$
The codebook of the residual quantizer has $k^*$ center points per group (take $k^* = 64$ as an example, so that $(k^*)^m = 64^4 \approx 2000\mathrm{W}$). If we wanted to keep the pure lookup strategy here, distances would have to be precomputed against every combined center $q_c(\cdot) + q_p(\cdot)$; there are $k' \cdot (k^*)^m$ such combinations, so the space complexity is $O(k' \cdot (k^*)^m)$, which in the paper’s example is 20G × 4B = 80GB.
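A sketch of this coarse + residual training step, reusing the earlier helpers (vectorized for clarity; a real implementation would assign vectors to centroids in chunks):

```python
import numpy as np

def train_ivf_pq(vectors, k_coarse, m, k_star):
    """Coarse codebook on raw vectors, PQ codebooks on the residuals."""
    coarse = train_codebook(vectors, k_coarse)
    # Nearest coarse centroid of every vector, then its residual.
    assign = ((vectors[:, None, :] - coarse[None, :, :]) ** 2).sum(-1).argmin(1)
    residuals = vectors - coarse[assign]
    pq_books = train_pq(residuals, m, k_star)   # PQ trained on residuals
    return coarse, pq_books
```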

Index structure

Search efficiency can be greatly improved with an inverted index.

The paper proposes an inverted index in which each rough center point stores the list of vectors assigned to it. Each vector is represented by an entry (id, code), where id is the vector’s identifier and code is the list of its PQ center-point indices. Since the number of center points in each PQ group is $k^*$, $\lceil \log_2 k^* \rceil$ bits are needed to indicate a center point, $m \lceil \log_2 k^* \rceil$ bits in total.

When searching, computing $q_c(x)$ immediately retrieves all vectors in the corresponding cluster.

Indexing process

  1. Map the vector $y$ to $q_c(y)$ through the coarse quantizer
  2. Compute the residual $r(y) = y - q_c(y)$
  3. Quantize the residual $r(y)$ to $q_p(r(y))$, which contains $m$ group indices
  4. Construct an (id, code) entry and append it to the inverted list corresponding to $q_c(y)$ (see the sketch after this list)
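The four steps above, sketched with the helpers defined earlier (illustrative and unoptimized):

```python
from collections import defaultdict
import numpy as np

def build_index(vectors, coarse, pq_books):
    """IVFADC indexing: one inverted list of (id, code) per coarse cell."""
    inverted = defaultdict(list)
    for vid, y in enumerate(vectors):
        cid = int(((coarse - y) ** 2).sum(-1).argmin())  # step 1: q_c(y)
        r = y - coarse[cid]                              # step 2: residual
        code = pq_encode(r, pq_books)                    # step 3: q_p(r(y))
        inverted[cid].append((vid, code))                # step 4: add entry
    return inverted
```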
The search process

Because the nearest neighbor is not necessarily in the same cluster as the query, you need to search not only the query’s own cluster but also neighboring clusters.

  1. Compute the $w$ center points of $q_c$ closest to the query $x$
  2. If an unprocessed center point remains, take one center point $c$ and compute the corresponding residual $r(x) = x - c$. Otherwise, skip to step 6
  3. Compute the distances from $r(x)$ to the center points within each of the $m$ groups, $O(k^* \cdot D)$
  4. For all vectors $y$ in the same inverted list, $q_c(y)$ is identical, so the distance between $x$ and $y$ reduces to the distance between residuals. Since the distances from $r(x)$ to every center point were computed in step 3, each vector needs only $m$ table lookups: $O(m)$
  5. Return to step 2
  6. Use a min-heap to obtain the $K$ vectors with the smallest distances; since the expected number of elements per inverted list is $N / k'$, this takes $O(w \cdot \frac{N}{k'} \cdot \log K)$ time (the whole loop is sketched below)
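A compact sketch of this search loop, reusing `adc_tables` and the index from the previous sections (names illustrative):

```python
import heapq
import numpy as np

def search(x, coarse, pq_books, inverted, k, w):
    """IVFADC search: probe the w nearest coarse cells, score codes by ADC."""
    coarse_d = ((coarse - x) ** 2).sum(-1)
    cells = np.argsort(coarse_d)[:w]          # step 1: w nearest cells
    candidates = []
    for cid in cells:                         # steps 2-5
        r = x - coarse[cid]                   # residual of the query
        tables = adc_tables(r, pq_books)      # step 3: O(k* * D)
        for vid, code in inverted[int(cid)]:  # step 4: O(m) per vector
            d = sum(tables[j][code[j]] for j in range(len(tables)))
            candidates.append((d, vid))
    return heapq.nsmallest(k, candidates)     # step 6: K smallest distances
```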

Putting the steps together, the time of the whole search process is:

$$O(k' \cdot D) \;+\; O(w \cdot k^* \cdot D) \;+\; O\Big(w \cdot \frac{N}{k'} \cdot m\Big) \;+\; O\Big(w \cdot \frac{N}{k'} \cdot \log K\Big)$$

Experimental results

(The experimental comparisons for this section are given as figures in the original paper.)

Optimizations

  1. In residual quantization, the residuals of different vectors to their respective centers are quantized together with one shared product quantizer, which implicitly assumes that the residual distribution is the same in every cluster. This assumption introduces some error, but without it, i.e. with a separate quantizer per cluster, the footprint would grow by a factor of $k'$
  2. Different ways of grouping vector components can lead to significant differences in performance. Experiments in the paper show that placing related fields in the same group, rather than grouping them randomly, can improve accuracy by 2-3 times in some scenarios. Before indexing, vector components can be arranged in a suitable order using some similarity analysis
  3. If $w = 1$ is selected, only the vectors in the query’s own cluster are searched, and the result may be much worse than SDC. The author’s suggestion in the paper is to take $w = 8$; still, different scenarios should be tested first to strike a balance between memory space and time

Conclusion

Faiss essentially encodes a vector as a combination of a finite number of vectors, converting a distance calculation between vectors into a distance calculation between a finite number of vectors that can be calculated in advance.

Key to simplifying calculations:

  1. Candidate vectors are narrowed down to neighborhood vectors
  2. Product quantization leads to a jump in expressiveness: $m$ codebooks of $k^*$ centers each express $(k^*)^m$ combined centers
  3. The distance matrix of product quantization result is calculated in advance, so that the distance calculation becomes table lookup operation. (Product quantization keeps the space requirements of the distance matrix within tolerable limits)

For example, take two vectors numbered 12 and 13.

First, both go through the coarse clustering step.

It turns out that their coarse cluster IDs are both 1.

The residuals are then computed and split into three groups.

Product quantization is applied to the residuals.

The distance between vectors 12 and 13 can then be obtained directly by summing the three per-group distances between their residual centroids, all of which were calculated in advance.
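This end-to-end pipeline is exactly what Faiss exposes; a minimal usage sketch with its IndexIVFPQ (the parameter values here are illustrative, not the library’s defaults):

```python
import faiss
import numpy as np

d, nlist, m, nbits = 128, 1000, 8, 8            # D, k', m, log2(k*)
xb = np.random.rand(200_000, d).astype('float32')

coarse = faiss.IndexFlatL2(d)                   # coarse quantizer q_c
index = faiss.IndexIVFPQ(coarse, d, nlist, m, nbits)
index.train(xb)                                 # k-means for q_c and the PQ
index.add(xb)                                   # fill the inverted lists
index.nprobe = 8                                # w: number of probed cells
D, I = index.search(xb[:5], 10)                 # ADC search, top-10 ids
print(I)
```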

Reference: “Product Quantization for Nearest Neighbor Search”: lear.inrialpes.fr/pubs/2011/J…