This is my 59th original article

Statistics uses mathematics to study the real world by collecting, organizing, analyzing, and describing data. Sounds a lot like being a data analyst, doesn't it?

Being a statistician is an interesting profession. Early statisticians spent a great deal of their time collecting and organizing data.

For example, Ronald Fisher, the British mathematician and statistician who founded modern mathematical statistics, spent years running biological experiments at an agricultural experimental station — which, in practice, meant working farmland.

Their research was aimed at practical problems. That's where terms like hypothesis testing, survivorship bias, genetic algorithms, expectation, life tables, reservoir sampling, and double-blind testing come from.

Big data has been hot for years now. Many people take it to mean that since we now have all the data, we no longer need to estimate anything or do sampling statistics. Without passing judgment on that claim yet, read a few stories about statisticians catching fish, and you'll see the problem more clearly.

How do you know how many fish there are in the pond?

Suppose you have contracted a big pond in your hometown, but you don't know how many fish are in it, so you can't manage it well. What can you do?

Counting fish is not like counting sheep, which can be herded through a gate one by one, the counted on one side and the uncounted on the other. Fish swim around in every direction, so counting them exactly is impossible. What to do?

Statisticians count in two steps:

1. First, catch a batch of fish at random (say 10 netfuls, 1,000 fish in total), mark them, and release them back into the pond;

2. After a while, catch another batch (say 800 fish) and count how many of them carry a mark (say 16).

And that gives us an estimate:

  • The 1,000 marked fish relate to the pond's total X as 1000 : X;
  • The marked fish in the second catch relate to the sample size as 16 : 800.

Because both catches are random, the marked fish mix evenly back into the pond, so the proportion of marked fish in the sample approximates the proportion of marked fish in the whole pond. That lets us use the sample proportion to calculate the total:

1000 : X = 16 : 800, so X = 1000 × 800 / 16 = 50,000. There are about 50,000 fish in the pond.

This is a practical application of sampling — estimating a total from a proportion (statisticians call it the mark-recapture, or Lincoln–Petersen, method) — and it works anywhere you need to estimate a total.
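As a quick sanity check, here is a minimal Python sketch of the estimate. The function name and docstring are mine; the numbers are the ones from the story:

```python
def estimate_population(marked: int, sample_size: int, marked_in_sample: int) -> float:
    """Lincoln-Petersen mark-recapture estimate of the total population."""
    if marked_in_sample == 0:
        raise ValueError("no marked fish recaptured; catch a bigger sample")
    # marked / total == marked_in_sample / sample_size  =>  solve for total
    return marked * sample_size / marked_in_sample

print(estimate_population(marked=1000, sample_size=800, marked_in_sample=16))  # 50000.0
```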

How to ensure that every fish is equally lucky?

One day, God gives you a task: choose 100 fish from the pond to go to heaven, and the choice must be fair. But when God handed out the task, he forgot to hand out the ability to do it. How can this be done?

That’s okay. Statisticians will help you.

You still don't know how many fish are in the pond, so it is hard to guarantee that every fish has an equal chance of being chosen. But statisticians have already worked this out:

1. First, catch 100 fish and put them in the candidate basket;

2. Then keep fishing. When you pull out the 101st fish, hold a lottery: the box contains the numbers 1–101. If the number drawn is between 1 and 100, the new fish is selected — take the fish in that slot of the candidate basket, mark it, and throw it back, putting the new fish in its place. Otherwise, mark the fish you just caught and throw it back. From then on, counting only unmarked fish keeps the selection fair.

3. Keep repeating step 2, always catching unmarked fish. For each additional fish, add one more number to the lottery box: if the drawn number is 100 or less, the new fish goes into the candidate basket (and the fish it replaces is marked and thrown back); otherwise the new fish itself is marked and thrown back. For example, after a whole morning, you pull out the 10,000th fish. By then the box holds the numbers 1–10,000, and this fish draws its lottery from them — still exactly a 100-in-10,000 chance of entering the basket (see the sketch below).
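In code, these three steps are the classic Algorithm R. A minimal Python sketch — the stream and the names are illustrative:

```python
import random

def reservoir_sample(stream, k):
    """Algorithm R: keep k items chosen uniformly at random from a stream
    of unknown length, using only O(k) memory."""
    basket = []                              # the "candidate basket"
    for i, fish in enumerate(stream, start=1):
        if i <= k:
            basket.append(fish)              # the first k fish go straight in
        else:
            slot = random.randint(1, i)      # the lottery box now holds 1..i
            if slot <= k:
                basket[slot - 1] = fish      # winner replaces the fish in that slot
    return basket

# 100 fish chosen fairly from a pond whose size we never had to know:
print(reservoir_sample(range(1_000_000), 100)[:5])
```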

This is the legendary "reservoir sampling" (fittingly also translated as "pond sampling"). Some people ask: what does this have to do with big data? Everything!

Big data applications of reservoir sampling

Reservoir sampling solves the problem of equal-probability sampling when the total count is unknown in advance, and each of the following scenarios needs exactly that:

1. Randomly sampling the full data when memory is too small to hold it;

This scenario addresses data skew. We sample the full data to learn its distribution, then partition according to that distribution, so that each node receives a roughly even share of the data — which is what cures the skew.
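Here is a hedged sketch of that idea in Python — sample, estimate partition boundaries from the sample's quantiles, then route records by binary search. All names are illustrative, not any particular engine's API:

```python
import bisect
import random

def boundaries_from_sample(sample, num_partitions):
    """Estimate range-partition boundaries from a random sample so that
    each partition receives roughly the same number of records."""
    s = sorted(sample)
    return [s[len(s) * i // num_partitions] for i in range(1, num_partitions)]

def partition_of(key, boundaries):
    """Route a record to its partition by binary search over the boundaries."""
    return bisect.bisect_right(boundaries, key)

# Heavily skewed keys: fixed equal-width ranges would dump most records into
# one partition, but boundaries estimated from a sample split the load evenly.
keys = [random.paretovariate(1.5) for _ in range(100_000)]
bounds = boundaries_from_sample(random.sample(keys, 1_000), 4)
counts = [0, 0, 0, 0]
for key in keys:
    counts[partition_of(key, bounds)] += 1
print(counts)  # four roughly equal counts
```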

2. Randomly sampling the full data in a data stream;

This scenario is real-time data processing, where the total amount of data is unknown while the stream is still arriving. Reservoir sampling solves random sampling over stream data here, and online machine learning uses the same approach.
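The reservoir_sample sketch from the previous section handles this case unchanged — it reads one element at a time and never needs the total count. A toy usage with a made-up event stream (all names illustrative):

```python
import itertools
import random

def click_stream():
    """A pretend unbounded event stream (illustrative fields only)."""
    while True:
        yield {"user": random.randrange(10_000), "ts": random.random()}

# The sampler holds only 100 events at a time, however long the stream runs.
events = itertools.islice(click_stream(), 500_000)   # cut the stream off for the demo
sample = reservoir_sample(events, 100)               # the sketch from above
print(len(sample))                                   # 100
```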

3. Randomly sampling across distributed data sources.

This scenario addresses random sampling from distributed data sources. In machine learning, for example, you may need a random sample of the full data, but the data lives on many nodes; reading everything once and then sampling would be far too slow. Flink uses reservoir sampling here, which is both fast and accurate.
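One way this can work (a simplified sketch of the idea, not Flink's actual implementation): each node keeps its own reservoir plus a count of how many records it saw, and a coordinator merges the reservoirs, favoring each node in proportion to its count. Reusing reservoir_sample from above:

```python
import random

def merge_reservoirs(parts, k):
    """Merge per-node reservoir samples into one k-item sample.

    parts: list of (reservoir, records_seen) pairs, one per node.
    Each output slot picks a node with probability proportional to how many
    records that node actually saw, then takes an unused item from its
    reservoir. Assumes k <= total items across all reservoirs.
    """
    pools = [list(reservoir) for reservoir, _ in parts]
    weights = [seen for _, seen in parts]
    merged = []
    while len(merged) < k:
        i = random.choices(range(len(pools)), weights=weights)[0]
        if not pools[i]:
            weights[i] = 0                   # this node's reservoir is used up
            continue
        merged.append(pools[i].pop(random.randrange(len(pools[i]))))
    return merged

# Each node samples locally; only the tiny reservoirs travel to the coordinator.
node_a = (reservoir_sample(range(0, 700_000), 100), 700_000)
node_b = (reservoir_sample(range(700_000, 1_000_000), 100), 300_000)
print(len(merge_reservoirs([node_a, node_b], 100)))  # 100
```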

Conclusion

Beyond all that, A/B testing and quality inspection also rely on various kinds of sampling! Ever heard the joke about buying peaches?

Xiao Ming went to buy peaches and took a bite out of every single one, because he needed a bite to make sure each was sweet.

Every product we buy has to pass quality inspection, but we can't run destructive tests on every unit — that would be no different from Xiao Ming above.

Take the little test chip sometimes found in a bag of puffed snacks: it exists to verify the metal detection on the food line, and it is one of the tools of quality control. Manufacturers deliberately slip these chips into products at random and expect the detectors to catch and recycle them before shipping; if a chip is found still inside a product before it leaves the factory, the metal detection upstream has failed.

So the next time someone tells you that big data's "full volume" makes "sampling" statistics unnecessary, you can throw these application scenarios in their face.

To sum up, there are three reasons:

1. Big data's "full volume" rests on a premise: that your "full volume" really is the full volume. Most of the time, it isn't.

2. Even when we do have the full data, there is often too much of it, so we sample anyway for efficiency;

3. Some things simply cannot be done on the full population, such as A/B testing and quality inspection.

If you haven't studied statistics, you'd never come up with these techniques — you wouldn't even think to frame problems this way — which is why machine learning demands solid mathematical and statistical skills. So come on, study hard and earn that raise!


If you found this interesting, please help share it!