This is a real question I was asked in a Tencent interview, and it shows up in plenty of interview write-ups as well. Today we will work through it thoroughly!

Problem description

How do you find the largest 100 numbers among 100,000 (10W) pieces of data?

Look at the problem first: with 10W numbers, there is no issue simply allocating an array for all of them on the heap, and we need the largest 100 of them. So let's start with the simplest, most brute-force method.

1. The sorting method

As we all know, both quicksort and heap sort have time complexity O(N·log₂N). We can simply sort the 10W numbers and take the first 100. This brute-force approach is fine when the total amount of data is not large, for example taking the top 20 out of 100. In an interview we only need to mention this solution briefly and then move on to the optimizations; sorting itself is not the focus of this article.
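A trivial sketch of this approach (my own illustration, using the JDK's built-in sort; the method name is made up):

// Sort ascending in O(N * log2 N), then take the last K elements.
int[] topKBySorting(int[] data, int K) {
  int[] copy = data.clone();
  java.util.Arrays.sort(copy);
  return java.util.Arrays.copyOfRange(copy, copy.length - K, copy.length);
}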

Now let's optimize: we only need the largest 100, so why sort all the data?

2. Local sorting

Recall bubble sort and selection sort: after the first k passes, we already have the k largest/smallest elements.

Take bubble sort (in descending order) as an example:

for (int i = 0; i < n; i++) {
  for (int j = 0; j < n - i - 1; j++) {
    if (arr[j] < arr[j+1])
      swap(arr, j, j+1); // Swap arr[j] and arr[j+1]
  }
}

Taking advantage of this property of the two sorting algorithms, we only need a small change to the code:

// We just need to replace the outermost n with k
for (int i = 0; i < k; i++) {
  for (int j = 0; j < n - i - 1; j++) {
    if (arr[j] < arr[j+1])
      swap(arr, j, j+1);
  }
}

This gives us the K largest numbers, located in the first K positions of arr, and the time complexity becomes O(N·K).

Comparing the time complexity of the first two methods, O(N·log₂N) versus O(N·K): when K is smaller than log₂N, local sorting is the better choice.

3. Partition

Recall that in each step of quicksort, you divide the data to be sorted into two groups, where every number in one group is larger than every number in the other group, then do the same to each group, and so on…

As shown in the figure below, the data in arr is divided into two parts, one less than k and the other greater than k:

Now, let's see how we can use this idea to find the largest K numbers.

Let’s assume that we have an array S, from which we pick a random number X, and then divide the array S into two parts:

  • A: Greater than or equal to X
  • B: Less than X

As shown in the figure below, we perform the Partition operation on array S, and two cases arise:

  1. If the number of elements in A is greater than or equal to K, then the K largest numbers in S are simply the K largest numbers in A;

    Analogy: if group A already contains at least ten students, the top ten of the grade (S) must be the top ten within group A.

  2. If the number of elements in A is less than K, then we take all of A and still need to find the largest K − |A| numbers in B;

    Similarly, if group A contains only the top three students, the top ten (K) of the grade (S) must be those three (A) together with the top seven (K − |A|) of the remaining students (B);

If you don’t understand the above part, you can refer to this small example below. If you do, you can skip it:

We simply repeat this operation until the K largest numbers are found; the average time complexity is O(N·log₂K).

Attached here is a pseudo-code:

Based on this pseudo-code, I wrote the following code (a Java implementation, but written in a fairly generic style, so it should also be a useful reference for C++ or Go):

Make sure you write it yourself; just reading the code is not enough, in case the interviewer asks you to write it by hand. Also, while this code is easy to follow, it has quite a few stylistic irregularities (for example, variable names starting with uppercase letters) that you can clean up when writing your own version.

public int[] KBig(int[] S, int K) {
  if (K <= 0) {
    return new int[0];
  }
  if (S.length <= K) {
    return S;
  }
  Sclass sclass = Partition(S);
  return contact(KBig(sclass.Sa, K), KBig(sclass.Sb, K - sclass.Sa.length));
}

public Sclass Partition(int[] S) {
  Sclass sclass = new Sclass();
  int p = S[0]; // Omit the random selection of elements
  for (int i = 1; i < S.length; i++) {
    if (S[i] > p) {
      sclass.Sa = append(sclass.Sa, S[i]);
    } else {
      sclass.Sb = append(sclass.Sb, S[i]);
    }
  }
  // Put the pivot itself into whichever group is currently smaller
  if (sclass.Sa.length < sclass.Sb.length) {
    sclass.Sa = append(sclass.Sa, p);
  } else {
    sclass.Sb = append(sclass.Sb, p);
  }
  return sclass;
}

Notice that the pseudo-code returns two arrays; here we use a small class to hold them:

class Sclass { // Just store two arrays
  int[] Sa = new int[0];
  int[] Sb = new int[0];
}

Auxiliary functions:

/**
 * Inserts value at the end of array arr.
 * @param arr the array
 * @param value the value to append
 * @return the new array with value appended
 */
int[] append(int[] arr, int value) {
  int[] res = new int[arr.length + 1];
  System.arraycopy(arr, 0, res, 0, arr.length);
  res[arr.length] = value;
  return res;
}

/**
 * Joins two arrays together.
 * @param a array a
 * @param b array b
 * @return the joined array
 */
public int[] contact(int[] a, int[] b) {
  int[] res = new int[a.length + b.length];
  for (int i = 0; i < a.length; i++) { // Generic element-by-element copy
    res[i] = a[i];
  }
  // In Java, you can also copy using System.arraycopy
  System.arraycopy(b, 0, res, a.length, b.length);
  return res;
}

If you test the code, you will find that the K largest numbers it returns are not sorted. (The problem never asked for them to be sorted, and if you understand the Partition process clearly, you can easily see why the result is unordered.) Think about the application scenario: some scenarios don't require sorted output, some do; learn to choose accordingly.

4. Binary search

We are looking for the K largest numbers in array S. Would knowing the Kth largest number make things easier? As the sharp-eyed reader may have noticed, if we know the Kth largest number p in S, then one pass over the array gives us the K largest numbers (namely, all numbers greater than or equal to p), and that pass costs O(N).

A reader might ask: what if multiple values equal p, so that this pass gives me more than K numbers?

There are many ways to handle this; here is one simple way: during the pass, take only the numbers strictly greater than p, then pad the result with copies of p according to the difference between that count and K.

Example: S = [1, 2, 3, 3, 5], p = 3, K = 2. The numbers strictly greater than p are [5], which is only one number; we still need K − 1 = 1 more, so we pad with one copy of p and get [5, 3], the K largest numbers.
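A small sketch of this padding trick (my own illustration; the method name is made up), once p, the Kth largest value, is known:

// Take everything strictly greater than p, then pad with copies of p up to K numbers.
int[] collectTopK(int[] S, int K, int p) {
  int[] res = new int[K];
  int idx = 0;
  for (int v : S) {
    if (v > p && idx < K) res[idx++] = v;   // numbers strictly greater than p
  }
  while (idx < K) res[idx++] = p;           // fill the remaining slots with p
  return res;
}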

Back to the binary search method: we need to find the Kth largest number in S. The pseudo-code uses the following quantities (a rough code sketch follows this list):

  • Vmax: the maximum value in array S
  • Vmin: the minimum value in array S
  • delta: any value smaller than the smallest difference between two unequal elements among the N numbers; if all elements are integers, delta can be taken as 0.5
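Here is a rough Java sketch of the idea (my own reconstruction, not the book's pseudo-code; it assumes the data fits in memory, uses double values, and delta is chosen as described above):

// Binary search on the value range [Vmin, Vmax] for the K-th largest value.
double kthLargestByValue(double[] S, int K, double delta) {
  double vmin = S[0], vmax = S[0];
  for (double v : S) { vmin = Math.min(vmin, v); vmax = Math.max(vmax, v); }
  while (vmax - vmin > delta) {
    double mid = vmin + (vmax - vmin) / 2;
    int count = 0;
    for (double v : S) if (v >= mid) count++;   // one O(N) counting pass
    if (count >= K) vmin = mid;                 // at least K values >= mid, so the answer is >= mid
    else vmax = mid;                            // fewer than K values >= mid, so the answer is < mid
  }
  return vmin;                                  // within delta of the K-th largest value
}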

The algorithm's time complexity is O(N·log₂(|Vmax − Vmin| / delta)).

In the case of evenly distributed data, the time complexity is O(N·log₂N).

In the case of integers, there is another way to look at this algorithm. Assume all integers lie in [0, 2^m − 1], which means every integer can be represented with m binary bits (numbered 0, 1, …, m−1). We first look at bit (m−1), the highest bit, and split the N integers into two parts according to whether that bit is 0 or 1. This divides the integers into the ranges [0, 2^(m−1) − 1] and [2^(m−1), 2^m − 1]: integers in the former range have bit (m−1) equal to 0, and integers in the latter range have it equal to 1. If the number of integers with that bit equal to 1, call it A, is greater than or equal to K, we continue to look for the largest K among the integers whose bit is 1; otherwise, we look for the largest K − A among the integers whose bit is 0. Then we consider bit (m−2), and so on. The idea is essentially the same as the floating-point case above.
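A rough sketch of this bit-by-bit variant (my own illustration; the method name is made up, and it assumes all values are non-negative integers smaller than 2^m with m < 32):

// Decide the bits of the K-th largest value from the highest bit down.
int kthLargestByBits(int[] S, int K, int m) {
  int prefix = 0;                                       // high bits of the answer decided so far
  for (int bit = m - 1; bit >= 0; bit--) {
    int countA = 0;                                     // candidates whose current bit is 1
    for (int v : S) {
      boolean highBitsMatch = (v >> (bit + 1)) == (prefix >> (bit + 1));
      if (highBitsMatch && ((v >> bit) & 1) == 1) countA++;
    }
    if (countA >= K) prefix |= (1 << bit);              // the K-th largest has this bit set
    else K -= countA;                                   // otherwise look for the (K - A)-th among the bit-0 candidates
  }
  return prefix;                                        // the K-th largest value
}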

5. BFPRT algorithm

This algorithm is more complicated and we won't go into full detail here. It is similar in spirit to quicksort, but it can pick the Kth largest/smallest element from a sequence of n elements with a guaranteed worst-case time complexity of O(n).

Why did we bother with algorithms that look slower instead of jumping straight to the O(N) one? Note that the complexity we usually discuss is the average/worst case, it ignores constant factors, and in real application scenarios we also have to consider how easy the algorithm is to implement (an overly complex one may introduce more bugs than it is worth), among many other issues. Picking the lowest time complexity is not a no-brainer.

This connects to what we said earlier: once we know the Kth largest number p in array S, one more pass over the array gives us the K largest numbers. That pass is also O(N).

So the total time is O(N).

Algorithm steps (a rough code sketch follows the list):

  1. Divide the n elements into groups of five, giving ⌈n/5⌉ groups (the last group may have fewer than five elements).

  2. Sort each group by any simple method, such as insertion sort, and take its median.

  3. Recursively call the selection algorithm to find the median of all the medians from the previous step; call it x (with an even number of medians, take the smaller of the two middle ones).

  4. Partition the array around x; let k be the number of elements less than or equal to x, so n − k elements are greater than x.

  5. If i == k, return x; if i < k, recursively look for the i-th smallest among the elements less than or equal to x; if i > k, recursively look for the (i − k)-th smallest among the elements greater than x. Termination condition: n == 1, return that single element.
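For completeness, here is a rough Java sketch of these steps (my own illustration, not the book's code; method names are made up, and swap(arr, i, j) is the helper from the local-sorting section). select(arr, 0, arr.length - 1, i) returns the i-th smallest element (1-based) in worst-case O(n):

int select(int[] arr, int lo, int hi, int i) {
  if (lo == hi) return arr[lo];                        // n == 1: return the only element
  int x = medianOfMedians(arr, lo, hi);                // steps 1-3: choose a good pivot
  int p = partitionAround(arr, lo, hi, x);             // step 4: split around x
  int k = p - lo + 1;                                  // rank of x within [lo, hi]
  if (i == k) return arr[p];                           // step 5
  if (i < k) return select(arr, lo, p - 1, i);
  return select(arr, p + 1, hi, i - k);
}

int medianOfMedians(int[] arr, int lo, int hi) {
  int n = hi - lo + 1;
  int numGroups = (n + 4) / 5;                         // ceil(n / 5) groups of five
  int[] medians = new int[numGroups];
  for (int g = 0; g < numGroups; g++) {
    int gLo = lo + g * 5;
    int gHi = Math.min(gLo + 4, hi);
    int[] group = java.util.Arrays.copyOfRange(arr, gLo, gHi + 1);
    java.util.Arrays.sort(group);                      // step 2: sort the small group
    medians[g] = group[(group.length - 1) / 2];        // and take its median
  }
  return select(medians, 0, numGroups - 1, (numGroups + 1) / 2);  // step 3: median of medians
}

// Partition arr[lo..hi] around the value x; returns the final index of x.
int partitionAround(int[] arr, int lo, int hi, int x) {
  for (int j = lo; j <= hi; j++) {
    if (arr[j] == x) { swap(arr, j, hi); break; }      // move one occurrence of x to the end
  }
  int store = lo;
  for (int j = lo; j < hi; j++) {
    if (arr[j] < x) { swap(arr, j, store); store++; }
  }
  swap(arr, store, hi);                                // put x into its final position
  return store;
}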

6. Max/min heap

The solutions we talked about earlier have one thing in common: if you have a large amount of data, you have to access the data multiple times.

So what if, instead of asking for the largest 100 out of 10W numbers, the interviewer asks for them out of 10 billion? Now the data cannot be read into memory all at once, so we want to go through the data as few times as possible.

Recall heap sort: it maintains a max-heap/min-heap, and that is the key idea here. First take K numbers out of the 10 billion and build a min-heap with them. Then go through the remaining data one number at a time: if the current number is larger than the minimum of the heap, replace that minimum with it and restore the heap order. We only need to traverse the data once, and at the end the heap holds the K largest numbers. Maintaining the heap costs O(log₂K) per update, so the overall time complexity is O(N·log₂K).

By the way, we are using a min-heap to store the K largest numbers, so why not a max-heap? Because on every update what we need is the smallest of the K numbers kept so far, in order to decide whether the new number should replace it, and a min-heap keeps exactly that value at the root.

Now we’re going to talk a little bit more about how this algorithm works, and if you’re familiar with heap sort you can probably write it yourself, so you can skip this part.

We use an array H[] to build a heap with K=8:

We know that for every element H[i] in the heap (using 1-based indexing), its parent is H[i/2], its left child is H[2*i], and its right child is H[2*i+1]. Each time a new number X is considered, the pseudo-code for the update operation is as follows:

Let's read through the pseudo-code. First check whether X is greater than the minimum of the heap. If it is not, we can skip it, since it is certainly not one of the K largest numbers; if it is greater than the minimum, it replaces the minimum, as shown below:

If X is larger than H[1], then X will swap with H[1]:

After the swap, we continue by comparing X with the smaller of its children, here H[3]; assuming X is less than H[3], the heap property already holds, so X stays where it is and the loop ends.
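Putting it together, here is a minimal Java sketch of the whole procedure (my own illustration; the method names are made up, it assumes data has at least K elements, and it uses the 1-based array layout described above):

// Build a min-heap H[1..K] from the first K numbers, then stream the rest through it.
int[] topK(int[] data, int K) {
  int[] H = new int[K + 1];                      // 1-based: H[1] is the smallest of the kept K
  for (int i = 0; i < K; i++) H[i + 1] = data[i];
  for (int i = K / 2; i >= 1; i--) siftDown(H, K, i);   // heapify the first K numbers
  for (int i = K; i < data.length; i++) {
    if (data[i] > H[1]) {                        // X is larger than the current minimum
      H[1] = data[i];                            // replace the minimum with X
      siftDown(H, K, 1);                         // restore the heap order
    }
  }
  int[] res = new int[K];
  System.arraycopy(H, 1, res, 0, K);
  return res;                                    // the K largest numbers, unordered
}

// Push H[i] down until the min-heap property holds again.
void siftDown(int[] H, int K, int i) {
  while (2 * i <= K) {
    int c = 2 * i;                               // left child; right child is 2*i + 1
    if (c + 1 <= K && H[c + 1] < H[c]) c++;      // pick the smaller child
    if (H[i] <= H[c]) break;
    int t = H[i]; H[i] = H[c]; H[c] = t;
    i = c;
  }
}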

7. Summary

Method | Time complexity | Characteristics
Sorting | O(N·log₂N) | Simple to implement; fine when the amount of data is small and speed is not critical
Local sorting | O(N·K) | Simple to implement; worth considering when the amount of data is small and K < log₂N
Partition | O(N·log₂K) | Fast; the K numbers are returned unordered
Binary search | O(N·log₂N) | Fast; in the integer case it can work bit by bit
BFPRT | O(N) | Guaranteed linear worst case, but in practice not as fast as it looks
Min-heap | O(N·log₂K) | Handles very large amounts of data, supports incremental updates, and keeps the K numbers in heap order

Reference Book: The Beauty of Programming