preface

What is a selection algorithm?

In computer science, a selection algorithm is one that finds the KTH smallest number in a list or array;

Computes the KTH largest (smallest) element in the set. It’s topK related, but the algorithm only needs to find the KTH one.

Lucene’s source code defines an interface for the selection algorithm:

/** An implementation of a selection algorithm, ie. computing the k-th greatest * value from a collection. */
// Select algorithm, topK problem
public abstract class Selector {

  /** Reorder elements so that the element at position {@code k} is the same
   *  as if all elements were sorted and all other elements are partitioned
   *  around it: {@code [from, k)} only contains elements that are less than
   *  or equal to {@code k} and {@code (k, to)} only contains elements that
   *  are greater than or equal to {@code k}. */
  // Reorder the elements so that the element at position K is the split point
  public abstract void select(int from, int to, int k);

  void checkArgs(int from, int to, int k) {
    if (k < from) {
      throw new IllegalArgumentException("k must be >= from");
    }
    if (k >= to) {
      throw new IllegalArgumentException("k must be < to"); }}/** Swap values at slots <code>i</code> and <code>j</code>. */
  // Swap the contents of the two slots
  protected abstract void swap(int i, int j);
}
Copy the code

The interface defined has exchange in addition to selection.

Lucene has two implementation for selection algorithm, fast selection algorithm and cardinality selection algorithm. This article will analyze the source code of the fast selection algorithm in detail. The path is: org.. Apache lucene. Util. IntroSelector.

The full annotated version of the source is available on Github. IntroSelector source

The principle is introduced

In computer science, Quickselect (English: Quickselect) is a selection algorithm that finds the KTH smallest element from an unordered list. It has to do with quicksort in principle. Like quicksort, it is also called Hall selection algorithm because it was proposed by Tony Hall. [1] Similarly, it is an efficient algorithm in practical application, with good average time complexity, but not ideal worst time complexity. Fast selection and its variants are the most commonly used efficient selection algorithms in practical applications.

The general idea of quick selection is the same as quicksort: select an element as the base to partition the elements, and divide the elements less than and greater than the base into two areas to the left and right of the base. The difference is that quick selection does not recursively visit both sides, but only recursively enter the elements on one side to continue the search. This reduces the average time complexity from O(n log n) to O(n), but the worst case is still O(n2).

For quicksort, I think everyone is very clear about its principle, so I won’t repeat it here.

It is well known that the worst time complexity of quicksort is O(n2). So is quick selection.

The worst that can happen is that every time you pick a split point, you pick the wrong one. If you take the first one in a sorted array, for example, you won’t split at all.

Therefore, the optimization of fast selection mainly focuses on the selection of segmentation points.

Most left/most right as the split point

This is the kind of implementation that we usually implement, and the performance is almost linear, that is, O(n). But he can’t solve the sorted array problem, he’ll degenerate to O(n2).

Randomly select the segmentation points

Since our array is unsorted, the entire array is essentially random. So this scenario is essentially the same as the one above, depending on luck.

The median method is used to select the segmentation points

Take the first, the last, the middle, and the median of the three elements as the split point. In this way, the partially sorted data can still achieve linear complexity. But on a special array constructed artificially, it still degrades to O(n2).

I guess the algorithm idea: random selection, the worst case, because every time you pick the worst, that is, the biggest number. Adding the median of three numbers can ensure that the selected segmentation point is neither the largest nor the smallest, deliberately avoiding the worst case.

Median method (also called BFPRT method, named after the initials of 5 authors’ names)

Selection method of primary segmentation point:

  1. Divide all elements into groups of 5. We get (n/5) 5-tuples.
  2. For each 5-tuple, find the median by insertion sort.
  3. For the (n/5) median, recursively call this method to find the median.

Time complexity analysis

Why 5?

In BFPRT algorithm, why choose 5 as grouping? First, even numbers are excluded, because the median is easier to calculate for odd numbers. If you take 3, plug it into the formula above, there's no difference. If you pick 7,9 or more, it will take longer to insert sort, and the constant will be too large to justify the loss.Copy the code

The practical application

According to the above principle, the conclusion can be roughly drawn:

  • The median of three gives good linear complexity, but there is a very small probability that the extreme case will lead to O(n2).
  • Median method of median provides absolute linear time complexity guarantee. But his constants are big and sometimes wasteful.

So in practical application of course is to learn from each other.

Therefore, the best implementation of fast selection in practical application should be to use the three-digit median method to select the segmentation point, set the threshold value, and switch to the median of the median (BFPTR) to ensure the time complexity in the worst case

Coincidentally, that’s how Lucene does it.

The Lucene sourceorg.apache.lucene.util.IntroSelector.

Version 8.7.0

define

This class is an abstract class, which is only responsible for providing fast selection of partition point selection, left and right partition, not responsible for specific storage media, switching algorithm, etc. So it has three abstract methods waiting to be implemented by subclasses.

  • Void swap(int I, int j): swap algorithm, swap the values of the subscripts I and j
  • Void setPivot(int I): Sets the subscript I to the split point
  • Int comparePivot(int j): Compares the value on j subscript with the split point, returning the size.

These three methods have nothing to do with the essence of quick selection, but a simple implementation is presented for ease of understanding.

  /** * this is a simple, fast selection implementation based on int arrays */
public static class TestSelector extends IntroSelector{
    Integer[] actual;
    Integer pivot;

    public TestSelector(Integer[] actual) {
      this.actual = actual;
    }

    @Override
    protected void swap(int i, int j) {
      ArrayUtil.swap(actual, i, j);
    }

    @Override
    protected void setPivot(int i) {
      pivot = actual[i];
    }

    @Override
    protected int comparePivot(int j) {
      returnpivot.compareTo(actual[j]); }}Copy the code

Core Select method

public final void select(int from, int to, int k) {
    checkArgs(from, to, k);
// The maximum depth of recursion
    final int maxDepth = 2 * MathUtil.log(to - from, 2);
    quickSelect(from, to, k, maxDepth);
    }
Copy the code

The core method is relatively simple, input parameters are: left subscript, right subscript, K to be sought.

  1. Check the parameters
  2. Define the maximum depth of recursion
  3. Call quick selection

What is the maximum depth of recursion

In the principle part, the three medians are used for quick selection in practical application. However, if the recursion is too many times, it will be considered as an extreme case and will switch to the median of the median to select the segmentation point. The threshold defined here is: ‘Recursion depth > 2*lg(n).

quickSelect

Quick is an adjective that describes the quick selection of a class, so the median selection of the three methods is implemented quickly.

His flow charts are as follows:

Combined with comments in the code, it should be easier to understand.

The core logic can be summarized as follows:

  1. The segmentation point is calculated by the median of the three
  2. Move data left and right partitions according to the split point
  3. I’m going to recurse from the left to the right and pick the side where K is

Insert logic: slowSelect if you find that the recursion limit is reached at the start of each slowSelect.

SlowSelect method

It is clear that the author considers this method to be slower while the previous one is faster. This is a little bit different from what we learned, where we learned about the time complexity of mathematical proofs, where the speed tends to be a little bit more industry average, a little bit more sensitive to constants.

Flow chart:

Code:

Core logic:

  1. If it’s equal, it’s found. Return
  2. Median method of median is used to find the segmentation point that should be selected at present
  3. Divide left and right according to the split point, small side, large side
  4. According to the size of the partition point and K, the left and right sides are selected for recursive search

Partition method is used, nothing special, is the common fast partition method, but the code is another style, there is no need to post.

The pivot method

This method implements the median of the solution median for [left,right].

This so-called median of medians, which is theoretically easy to solve, is just another recursive method. Why does it get complicated?

Think about it:

  • The purpose of quick selection is to find the KTH largest element in an unsorted array.
  • To figure out the median, we’re doing it mathematicallyThe medianIn an unsorted array, find the firstlength/2Big element.

They are essentially homogeneous, so the Lucene code, in order to reuse code, in the process of solving the median of the median, using the partial slowSelect code, is very exquisite, but for the people just look at the code, will feel more confused. (yes, say is my own, only when I am writing articles is suddenly wake up).

The code is as follows:

It involves a method of finding the median and partitioning within 5 elements, which is essentially a direct insertion sort, and then taking the median. Because the total number is controlled, the performance of insert sort is completely satisfied and the implementation is simple.

conclusion

  1. Both quicksort and quick selection are particularly useful, with quicksort applied to a large number of industrial sorts and quick selection applied to topK problems
  2. At the heart of quicksort and selection lies the selection of so-called pivot elements (cut points)
  3. The choice of cutting point, there are many kinds of optimization methods, the performance requirements are not high write casually, the performance requirements are high according to this article. Try to use the median of the three to solve the cut point, and pay attention to avoid extreme situations. Set the threshold and use the median of the median to solve the cut point.

Lucene’s code is clever and difficult to understand. But high efficiency.

Refer to the article

Zh.wikipedia.org/wiki/%E5%BF…

zhuanlan.zhihu.com/p/64627590


To the end.

To contact me

Finally, welcome to follow my personal public account [Huyanshi], I will update many backend engineers’ study notes from time to time. Also welcome direct public number private letter or mailbox to contact me, must know all without reserve, endless.





All the above are personal thoughts, if there is any mistake welcome to comment.

Welcome to reprint, please sign and keep the original link.

Email address:[email protected]

For more study notes, see personal blog or follow wechat official account < huyan10 >——>HuYan ten