Reservoir sampling

The official label of the reservoir sampling problem in the likou is 2 lines. According to my work, there may be 3 or 4 lines. The proportion is relatively low, we can choose to master according to their own actual situation.

The reservoir sampling algorithm is clever in its thinking, simple in its code and easy to understand, and is good to know, if not to know.

Problem description

Given a data stream, we need to randomly pick k numbers in this data stream. Because of the large length of this data stream, it needs to be traversed and processed rather than loaded into memory all at once.

Write a random selection algorithm such that all data in the data stream are equally likely to be selected.

This question can be expressed in many ways. Let’s say you pick k points randomly from a rectangle, k words randomly from a list of words, etc., with equal probability. No matter how you change the description, it’s essentially the same. Today we’re going to look at how to do that.

Algorithm description

The algorithm is called reservoir ID sampling.

The basic idea is as follows:

  • Construct an array of size K and put the first K elements of the data stream into the array.
  • The first k numbers of the data stream are not processed.
  • Select rand from [1, I], where I indicates the current number.
  • If rand is greater than or equal to k do nothing
  • If rand is less than k, rand and I are swapped, that is, the current number is selected instead of the number already selected (spare).
  • Finally return to the surviving spare tire

The core of this algorithm is to select a number with a certain probability and replace the number that has been selected before with another probability in the subsequent process. So essentially the probability of each number being selected is the probability of being selected * the probability of not being replaced.

Pseudo code:

The pseudocode refers to an algorithm book and is slightly modified.

Init : a reservoir withThe size: kfor i= k+1 to N
    if(random(1, i) < k) {
        SWAP the Mth value and ith value
    }
Copy the code

Does that guarantee that the numbers chosen are equally likely? The answer is yes.

  • When I <= k, the probability of I being selected is 1.
  • At the k+1 number, the probability of the k+1 number being selected (the probability of walking into the upper if branch) is kk+1\frac{k}{k+1}k+1k. At the k+ 2 number, The probability of the k+2 number being selected (the probability of walking into the if branch above) is kk+2\frac{k}{k+2}k+2k, and so on. So the probability of the NTH number being chosen is kn\frac{k}{n}nk
  • The probability of being selected is analyzed above, and the probability of not being replaced is analyzed next. At the k + 1 number, the probability of the first k number being replaced is 1k\frac{1}{k}k1. By the first k + 2 number, the probability of the k + 2 number being replaced is 1k\frac{1}{k}k1, and so on. That is, all the probabilities of substitution are 1k\frac{1}{k}k1. So if you know the probability of being replaced, then the probability of not being replaced is actually 1 minus the probability of being replaced.

So for the first k numbers, the probability of being selected is 1 * the probability of not being replaced by k + 1 * the probability of not being replaced by k + 2 *… The probability of not being replaced by n is 1 * (probability of 1 – being replaced by k + 1) * (probability of 1 – being replaced by k + 2) *… (1 – Probability of being replaced by N), that is, 1×(1− Kk +1× 1K)×(1− Kk +2× 1K)×… * (1 – kn * 1 k) = kn1 \ times (1 – \ frac {k} {k + 1} \ times \ frac {1} {k}) \ times (1 – \ frac {k} {k + 2} \ times \ frac {1} {k}) \ times… \ times (1 – \ frac {k} {n} \ times \ frac {1} {k}) = \ frac {k} {n} 1 x (1 – k + 1 k x k1) x (1 – k + 2 k x k1) x… X (1 – nk (k1) = nk.

For the ith (I > k) number, the probability of finally being selected is the probability of the ith step being selected * the probability of not being replaced by the I + 1 step *… * The probability of not being replaced by the NTH step, i.e. kk+1×(1−kk+2×1k)×… * (1 – kn * 1 k) = kn \ frac {k} {k + 1} \ times (1 – \ frac {k} {k + 2} \ times \ frac {1} {k}) \ times… \ times (1 – \ frac {k} {n} \ times \ frac {1} {k}) = \ frac {k} {n} k + 1 k x (1 – k + 2 k x k1) x… X (1 – nk (k1) = nk.

In short, the probability of being selected for any number is kn\frac{k}{n}nk, satisfying the requirement of equal probability.

Related topics

  • 382. Linked list random nodes
  • 398. Random number index
  • 497. A random point in a nonoverlapping rectangle

conclusion

Reservoir sampling algorithm core code is very simple. But it’s not easy to imagine, especially if you haven’t seen it before. The core point is that the probability of each number being selected is the probability of being selected * the probability of not being replaced. So we can do some kind of dynamic thing where each round has a probability of picking and changing numbers. Above we have given the probability of equal proof process, you might as well try to prove it yourself. Then combine the related topics at the end of the article to practice, the effect will be better.