This article is a translation of Swift Algorithm Club. Swift Algorithm Club is an open source project produced by Raywenderlich.com to realize algorithms and data structures by Swift. Currently, there are 18000+⭐️ on GitHub. I have made a brief statistics, and there are about 100 algorithms and data structures. It’s basically a good resource for iOSer learning algorithms and data structures. 🐙andyRon/ swift-algorithm-club-CN is my project of learning and translating for Swift Algorithm Club. Due to the limited capacity, if you find any errors or improper translation, please correct them. Welcome to Pull Request. Also welcome interested, have time partners to participate in translation and learning 🤓. Of course, we welcome ⭐️, 🤩🤩 ️ ⭐ ⭐. The translated text and code of this article can be found at 🐙 swift-algorithm-club-CN /Selection Sampling


Selection Sampling

Objective: To randomly select k items from a set of N items.

Suppose you have a deck of 52 cards, and you need to draw 10 cards at random. This algorithm allows you to do that.

Here’s a very fast version:

func select<T>(from a: [T], count k: Int)- > [T] {
  var a = a
  for i in 0..<k {
    let r = random(min: i, max: a.count - 1)
    ifi ! = r {swap(&a[i], &a[r])
    }
  }
  return Array(a[0..<k])
}
Copy the code

As often happens with shuffle algorithms, it divides the array into two regions. The first field contains the options; The second area is all the remaining terms.

An example. Suppose an array is:

[ "a", "b", "c", "d", "e", "f", "g" ]
Copy the code

We want to pick 3 items, so k is equal to 3. In the loop, I is initially 0, so it points to “a”.

[ "a", "b", "c", "d", "e", "f", "g" ]
   i
Copy the code

We calculate a random number between I and the size of the array a.mount. Let’s say this random number is 4. Now we swap “a” with “e” with index 4 and move forward by I:

[ "e" | "b", "c", "d", "a", "f", "g" ]
         i
Copy the code

| bar said split between the two regions. “E” is the first element we choose. We continue to need to pay attention to | bar on the right side of everything.

Again, we want a random number between I and a.ount, but because I has been shifted, the random number is never less than 1. That’s why we’re never swapping “e’s “again.

Assuming a random number of 6, we swap “b” with “g” :

[ "e" , "g" | "c", "d", "a", "f", "b" ]
               i
Copy the code

I have a random number, let’s say it’s 4. We swap “c” with “a”, and the final item on the left already selected is:

[ "e", "g", "a" | "d", "c", "f", "b" ]
Copy the code

That’s it. Very simple. The performance of this function is O(k), because as soon as we select k, we’re done.

Here’s an alternative algorithm, called Reservoir Sampling:

func reservoirSample<T>(from a: [T], count k: Int)- > [T] {
  precondition(a.count >= k)

  var result = [T] ()/ / 1
  for i in 0..<k {
    result.append(a[i])
  }

  for i ink.. <a.count {  / / 2
    let j = random(min: 0.max: i)
    if j < k {
      result[j] = a[i]
    }
  }
  return result
}
Copy the code

There are two steps:

  1. Use the first one in the original arraykElement fillingresultThe array. This is called a reservoir.
  2. Randomly replace the elements in the reservoir with those from the remaining pool.

This algorithm has O(n) performance, so it is a bit slower than the first algorithm. However, its biggest advantage is that it can be used with arrays that are too big to fit in memory, even if you don’t know what the size of the array is (in Swift this might be similar to a lazy generator that reads file elements).

The first two algorithms have one disadvantage: they do not preserve elements of the original order. In the input array, “a” comes before “e”, but now it’s in a different order. If the order is to remain the same, the above method cannot be used.

The following alternative method preserves the integrity of the original order, but requires more space to participate:

func select<T>(from a: [T], count requested: Int)- > [T] {
  var examined = 0
  var selected = 0
  var b = [T] ()while selected < requested {                          / / 1
    let r = Double(arc4random()) / 0x100000000          / / 2
    
    let leftToExamine = a.count - examined              / / 3
    let leftToAdd = requested - selected

    if Double(leftToExamine) * r < Double(leftToAdd) {  / / 4
      selected += 1
      b.append(a[examined])
    }

    examined += 1
  }
  return b
}
Copy the code

The algorithm uses probability to decide whether to include a number in the selection.

  1. The loop steps through the array from beginning to end. It goes on until we pick k terms from the set of n. Here, k is requested and n is A.ount.

  2. Calculates a random number between 0 and 1. We want 0.0 <= r < 1.0. The upper limit is exclusive; We never want it to be 1. This is why we divide the result from arc4Random () by 0x100000000 instead of the more common 0xFFffFFFF.

  3. LeftToExamine is the number of items that we haven’t looked at yet. LeftToAdd is the number of items we need to select before we finish.

  4. This is where the magic happens. Basically, we’re flipping a coin. If heads, we add the current array element to the selection; If it’s tails, we’ll skip it.

Interestingly, even if we use probability, this method always guarantees that we end up with k entries in the output array.

Let’s discuss the same example again. The input array is:

[ "a", "b", "c", "d", "e", "f", "g" ]
Copy the code

The loop looks at each element in turn, so we start with “A”. We get a random number between 0 and 1, let’s say it’s 0.841. // The formula at 4 times the number of items to be checked by this random number. There are seven more elements to check, and the result is:

7 * 0.841 = 5.887
Copy the code

We compare this to 3 because we want to choose 3 items. Since 5.887 is greater than 3, we skip “a” and continue moving “b”.

Again, we get a random number, let’s say 0.212. There are now only six elements left to check, so the formula results in:

6 * 0.212 = 1.272
Copy the code

Less than 3, let’s add “b” to the selection. That’s the first term we picked, so we have two left.

To the next element, “C”. Random number is 0.264, and the results are as follows:

5 * 0.264 = 1.32
Copy the code

I just have to pick 2 more terms, so this number has to be less than 2. It is, and we’re also adding a “C” to the selection. The total choice is [” B “, “C “].

Just select one more item, but there are still four candidates to view. Suppose the next random number is 0.718. The formula now gives:

4 * 0.718 = 2.872
Copy the code

To select this element, the number must be less than 1, because there is only one item left to select. 2.872 No, so we skip the “D”. There are only three possibilities left – will we pick it before we run out of elements?

The random number is 0.346. The formula gives:

3 * 0.346 = 1.038
Copy the code

It’s a little too high. Let’s skip the “e”. There are only two candidates left……

Note that now we are literally dealing with a coin toss: if the random number is less than 0.5, we select “f” and we are done. If it is greater than 0.5, we proceed to the last element. Suppose we get 0.583:

2 * 0.583 = 1.166
Copy the code

Let’s skip the “f” and look at the last element. Whatever random number we get here, it should always pick “G” or we won’t pick enough elements and the algorithm won’t work!

Let’s say our final random number is 0.999 (remember, it will never be 1.0 or higher). In fact, no matter what we choose here, the formula will always give us a value less than 1:

1 * 0.999 = 0.999
Copy the code

Therefore, if we don’t have enough choices, we always choose the last element. The final choice is [” B “, “C “,” G “]. Notice that the elements are still in their original order, because we are querying the array from left to right.

Maybe you still don’t believe…… If we always use 0.999 as a random value (the maximum possible value), can we choose 3 items? Ok, let’s do the math:

Is 7 * 0.999 = 6.993 less than 3? No 6 * 0.999 = 5.994 less than 3? No 5 * 0.999 = 4.995 less than 3? No 4 * 0.999 = 3.996 less than 3? No 3 * 0.999 = 2.997 less than 3? YES 2 * 0.999 = 1.998 < 2? YES 1 * 0.999 = 0.999 < 1? YESCopy the code

It always works! But does that mean that the element near the end of the array is more likely to be selected than the element at the beginning? No, all elements are equally likely to be selected. (If you don’t believe me: a quick test on playground proves this in practice.)

Here is an example of how to test this algorithm:

let input = [
  "there"."once"."was"."a"."man"."from"."nantucket"."who"."kept"."all"."of"."his"."cash"."in"."a"."bucket"."his"."daughter"."named"."nan"."ran"."off"."with"."a"."man"."and"."as"."for"."the"."bucket"."nan"."took"."it",]let output = select(from: input, count: 10)
print(output)
print(output.count)
Copy the code

The second algorithm has O(n) performance because it may need to traverse the entire input array.

Note: If k > n / 2, it is more efficient to execute it in reverse and select the a.ount -k item to remove.

The code is based on Algorithm Alley, published in Dr. Dobb’s Journal, October 1993.


Translated by Andy Ron proofread by Andy Ron