randomSplit(weights, seed=None):

Parameters:

1. weights: a list of weights used to divide an RDD into several RDDs. The larger a weight, the larger the share of elements that split is expected to receive. The length of the list is the number of RDDs produced, as in

rdd1 = rdd.randomSplit([0.25, 0.25, 0.25, 0.25])

This divides the original RDD into four RDDs of roughly equal size. The weights are normally chosen to sum to 1; if they do not, Spark normalizes them so that they do.
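To make the role of the weights concrete, here is a minimal sketch in plain Python (no Spark needed) of how a weight list can be scaled to sum to 1 and turned into cumulative boundaries. The helper name normalize_weights is hypothetical and only for illustration; it is not part of the PySpark API.

```python
def normalize_weights(weights):
    """Scale the weights so they sum to 1 and return the cumulative boundaries."""
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    boundaries = [0.0]
    for w in weights:
        # Each split owns the half-open range [boundaries[i], boundaries[i + 1])
        boundaries.append(boundaries[-1] + w / total)
    return boundaries


print(normalize_weights([1, 1, 1, 1]))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(normalize_weights([3, 1]))        # [0.0, 0.75, 1.0]
```

Under this sketch, [1, 1, 1, 1] and [0.25, 0.25, 0.25, 0.25] describe the same four-way split.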

2. seed: an optional parameter used to seed the random number generator. A computer cannot generate truly random numbers; instead it starts from a seed (often derived from the current time or from the results of earlier operations) and applies a fixed formula to produce each next number. Given the same seed, the generated sequence of random numbers is always the same.
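The effect of a fixed seed can be seen with Python's standard random module. This is only an illustration of seeding in general, not of the generator Spark uses internally; the helper name draw is made up for this example.

```python
import random


def draw(seed, n=5):
    """Return the first n numbers produced by a generator started from seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


print(draw(1) == draw(1))  # True: same seed, same sequence
print(draw(1) == draw(2))  # False: different seed, different sequence
```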

Return value:

Returns a list of RDDs, one for each weight

Test code

First, open the PySpark shell and create an RDD containing the numbers 0-19

>>> rdd = sc.parallelize(range(20))
>>> rdd.collect()

Next, test the randomSplit method

>>> rdd1 = rdd.randomSplit([0.25, 0.25, 0.25, 0.25])
>>> rdd1[0].collect()
[3, 4, 9, 10, 15]
>>> rdd1[1].collect()
[2, 8]
>>> rdd1[2].collect()
[0, 6, 7, 12, 13, 14, 16, 17, 19]
>>> rdd1[3].collect()
[1, 5, 11, 18]

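You can see that the original RDD is divided into four RDDs according to the weights. The sizes (5, 2, 9 and 4 elements) are only approximately equal, because each element is assigned to a split at random.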
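The four splits above hold 5, 2, 9 and 4 elements rather than 5 each. The plain-Python model below suggests why: if every element draws its own random number and goes to the split whose range contains it, the weights only set the expected sizes. This is a simplified sketch, not Spark's actual implementation (Spark samples partition by partition), and the function name random_split is hypothetical.

```python
import random
from bisect import bisect_right


def random_split(items, weights, seed=None):
    """Assign each item to one split, with probability proportional to its weight."""
    total = float(sum(weights))
    upper_bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        upper_bounds.append(running)  # upper end of this split's range

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for item in items:
        x = rng.random()  # uniform in [0, 1)
        # Guard against rounding leaving the last bound slightly below 1.0
        index = min(bisect_right(upper_bounds, x), len(weights) - 1)
        splits[index].append(item)
    return splits


parts = random_split(range(20), [0.25, 0.25, 0.25, 0.25], seed=1)
print([len(p) for p in parts])  # sizes vary around 5 and always add up to 20
```

With only 20 elements the sizes can differ noticeably; on large datasets the proportions come much closer to the weights.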

When the seed is set to 1 both times, the result is exactly the same

>>> rdd1 = rdd.randomSplit([0.5, 0.5], 1)
>>> rdd1[0].collect()
[6, 7, 8, 9, 10, 11, 14, 15, 17, 18]
>>> rdd1[1].collect()
[0, 1, 2, 3, 4, 5, 12, 13, 16, 19]

>>> rdd2 = rdd.randomSplit([0.5, 0.5], 1)
>>> rdd2[0].collect()
[6, 7, 8, 9, 10, 11, 14, 15, 17, 18]
>>> rdd2[1].collect()
[0, 1, 2, 3, 4, 5, 12, 13, 16, 19]

Set the seed to 2, and the result is different

>>> rdd3 = rdd.randomSplit([0.5, 0.5], 2)
>>> rdd3[0].collect()
[4, 5, 8, 9, 10, 11, 12, 13, 17, 18]
>>> rdd3[1].collect()
[0, 1, 2, 3, 6, 7, 14, 15, 16, 19]
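As a quick sanity check, the outputs shown above for seed 1 and seed 2 can be pasted into plain Python to confirm that, in each run, every element of the original RDD lands in exactly one split. The lists are copied from the session above; the helper name is_partition is made up for this check.

```python
def is_partition(splits, original):
    """True if the splits do not overlap and together contain every original element."""
    combined = [x for split in splits for x in split]
    return len(combined) == len(set(combined)) and sorted(combined) == sorted(original)


seed_1 = [[6, 7, 8, 9, 10, 11, 14, 15, 17, 18], [0, 1, 2, 3, 4, 5, 12, 13, 16, 19]]
seed_2 = [[4, 5, 8, 9, 10, 11, 12, 13, 17, 18], [0, 1, 2, 3, 6, 7, 14, 15, 16, 19]]

print(is_partition(seed_1, range(20)))  # True
print(is_partition(seed_2, range(20)))  # True
print(seed_1 == seed_2)                 # False: a different seed gives a different split
```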

over : )