Now we have a set of numbers, and we don’t know what the total number of numbers is. Describe an algorithm that can randomly pick k numbers from this set of numbers with equal probability that each number will be picked.

If there are n numbers in this set, then the probability of picking each number is k/n, but the difficulty of this problem is that we don’t know the total number of numbers in this set, that is, we don’t know n, so how do we calculate the probability of picking each number?

Reservoir algorithm

Swimming pool (reservoir) we are familiar with, some swimming pool water is alive, there are inlet pipes and outlet pipes, then with the volume of water flowing through the pool, will all the water in the pool be replaced? Of course not. Some water may stay in the pool for a long time, and some may run out as soon as they enter. Imitating this phenomenon, the reservoir sampling algorithm was born. The key of the reservoir algorithm is to ensure that the water flowing into the reservoir and the water already in the reservoir are retained in the reservoir with the same probability. And the reservoir algorithm can solve this kind of sampling problem under the time complexity O(N) without knowing the total amount in advance.

Core principles

This part involves the formula, in order to ensure the effect directly pasted over the diagram.

Python implementation

Next, try implementing the reservoir algorithm in Python. Since the reservoir algorithm samples without knowing the total amount in advance, define a method to receive a single element and place this method in the class to hold the sampled data.

import random


class ReservoirSample(object):

    def __init__(self, size):
        self._size = size
        self._counter = 0
        self._sample = []

    def feed(self, item):
        self._counter += 1
        # the ith element (I <= k) goes directly into the pool
        if len(self._sample) < self._size:
            self._sample.append(item)
            return self._sample
        # the ith element (I > k) enters the pool with probability k/I
        rand_int = random.randint(1, self._counter)
        if rand_int <= self._size:
            self._sample[rand_int - 1] = item
        return self._sample
Copy the code

The test code

Next implement a test case to verify implementation of the algorithm is correct, since it is random sampling, unable to pass the word test to verify whether it is right, so through executed multiple times to verify, such as a number from 1 to 10 random sampling in 3, then execute sampling 10000 times, if the algorithm is correct, the final results 1-10 of the sampling frequency should be the same, They’re all around 3,000.

import unittest
from collections import Counter

from reservoir_sample import ReservoirSample


class TestMain(unittest.TestCase):

    def test_reservoir_sample(self):
        samples = []
        for i in range(10000):
            sample = []
            rs = ReservoirSample(3)
            for item in range(1.11):
                sample = rs.feed(item)
            samples.extend(sample)
        r = Counter(samples)
        print(r)

if __name__ == '__main__':
    unittest.main()
Copy the code

The output is as follows

Counter({7: 3084, 6: 3042, 10: 3033, 3: 3020, 8: 3016, 5: 2997, 4: 2986, 2: 2972, 9: 2932, 1: 2918})
Copy the code

The graph above outputs the number of times each number is sampled, so you can see the distribution clearly

You can see that the reservoir algorithm is very good for random sampling, with the same sampling probability for each element.

code

The algorithm and test code mentioned above have been put on Github and can be downloaded and used directly.

【Python private house dishes 】