Original article by Christian Pascual of Dataquest. Compiled and translated by Shuting Wang and Si Yuan.
Probability theory and statistics are fundamental to machine learning, yet many beginners are not familiar with them. This article introduces the basic concepts of probability and statistics, how they relate to each other, and how they are used, taking the normal distribution as an example to show what a probability distribution, a distribution function, and the empirical rule are. It also explains the central limit theorem conceptually and why the normal distribution is so important to statistics as a whole. Many of the experiments in this article are implemented in Python; readers who are not familiar with Python can skip the code.
To learn statistics, we first need to understand probability. Probability involves so many formulas and theories that it is easy to get lost in them, but it plays an important role in both work and daily life. We have previously discussed some basic concepts of descriptive statistics; now we will explore the relationship between statistics and probability.
Prerequisites:
Like the previous post, this article requires no knowledge of statistics, but it does assume at least a basic understanding of Python. If you are not familiar with for loops and lists, I will give a brief introduction to them along the way.
What is probability?
At its most basic level, probability answers the question, “What are the odds that an event will happen?” To calculate the probability of one event happening, we also consider all the other possible events.
The classic example of a probability problem is flipping a coin. In the process of flipping a coin, only two things can happen:
1. Heads up
2. Tails up
These two outcomes form the sample space, the set of all possible outcomes. To calculate the probability of an event, we count the number of times that event occurs (for example, flipping heads) and divide by the total number of trials. Probability therefore tells us that the chance of flipping heads (or tails) is 1/2. By describing what is likely to happen, probability gives us a framework for predicting how often events occur.
However, even if the results seem obvious, if we actually tried to flip some coins, we’d probably get either too high or too low a probability of heads. If the coin toss is unfair, what can we do? Collect data! We can use statistical methods to calculate probabilities based on real-world observed samples and compare them to ideal probabilities.
From statistics to probability
We can get data by flipping a coin 10 times and counting the number of heads. We’re going to use these 10 flips as an experiment, and the number of heads is going to be the data point. Maybe the number of heads is not the “ideal” five, but don’t worry, because a trial is just one data point.
If we run multiple trials, we would expect the average proportion of heads across all trials to be close to 50%. The following code simulates 10, 100, 1,000, and 1,000,000 trials and then calculates the average number of heads observed per trial.
import random

def coin_trial():
    heads = 0
    for i in range(10):
        if random.random() <= 0.5:
            heads += 1
    return heads

def simulate(n):
    trials = []
    for i in range(n):
        trials.append(coin_trial())
    return sum(trials) / n

simulate(10)
>>> 5.4
simulate(100)
>>> 4.83
simulate(1000)
>>> 5.055
simulate(1000000)
>>> 4.999781
The coin_trial function simulates 10 coin tosses. It uses random.random() to generate a random floating-point number between 0 and 1 and counts a head whenever that number is at most 0.5. simulate then repeats this trial as many times as requested and returns the average number of heads across all trials.
The results of the coin-flipping simulation are interesting. First, the simulated data showed that the average number of heads was close to the probability estimate. Second, as the number of trials increased, the average became closer to the expected result. With 10 simulations, there was a slight error, but at 1,000,000 trials, the error almost disappeared. As we increase the number of tests, the deviation from the expected average decreases. Sound familiar?
Of course, we could flip the coins ourselves, but simulating the process in Python saves a lot of time. As we gather more and more data, the real world (the results) begins to line up with the ideal world (the expectations). So, given enough data, statistics lets us estimate probabilities from real-world observations. Probability provides the theory, while statistics provides the tools to test that theory with data. In this way, the numerical characteristics of a sample, especially its mean and standard deviation, become stand-ins for the theoretical values.
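As a concrete illustration of this substitution, here is a minimal sketch of my own (it assumes nothing beyond Python's standard library): if we record each flip as 1 for heads and 0 for tails, the sample mean of those 0/1 outcomes is exactly our estimate of the probability of heads.

import random

# Record each flip as 1 (heads) or 0 (tails); the sample mean of these
# 0/1 outcomes is then our estimate of the probability of heads.
flips = [1 if random.random() <= 0.5 else 0 for _ in range(10000)]
estimated_p = sum(flips) / len(flips)
print(estimated_p)   # prints a value close to 0.5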
You may ask, “Why would I use a substitute if I could calculate theoretical probability?” Flipping a coin is a very simple example, but some of the more interesting probability problems are not so easy to calculate. How likely is a person to get sick over time? What is the probability that a critical car part will fail while you are driving?
There is no easy way to calculate such probabilities theoretically, so we must rely on data and statistics. The more data we have, the more confident we can be that the calculated results represent the true probability of these important events occurring.
Let’s say I’m a working sommelier and I need to figure out which wines are better before I buy. I already have a lot of data, so we’ll use statistics to guide our decisions.
Data and Distribution
Before tackling the question of "which wine is better?", we need to look at the nature of the data. Intuitively, better wines should receive higher scores, but there is a problem: scores are usually spread over a range. So how do we compare the scores of different types of wine, and how confident can we be that one wine is better than another?
This is where the normal distribution (also known as the Gaussian distribution) comes in; it plays a particularly important role in probability and statistics. It is the familiar symmetric, bell-shaped curve.
The most important properties of the normal distribution are its symmetry and its shape, along with how widely it shows up. We have been calling it a distribution, but what exactly is a distribution? Intuitively, a probability distribution is the set of all possible outcomes of a task together with their corresponding probabilities. For example, in the coin-toss task, the outcomes "heads" and "tails" with their corresponding probabilities of 1/2 form a distribution.
In probability, the normal distribution is one particular assignment of probabilities to outcomes. The x-axis represents the events whose probability we want to know, and the y-axis is the probability associated with each event, from 0 to 1. We will not go deeply into probability distributions here; just know that the normal distribution is a particularly important one.
In statistics, a normal distribution describes the spread of data values. Here, the x-axis shows the values of the data and the y-axis shows the count of those values. Below are two identical normal distributions, one labeled in terms of probability and one in terms of statistics:
In the probability version of the normal distribution, the highest point marks the event with the highest probability of occurring; the probability drops off as you move away from that event, producing the bell shape. In the statistical version, the highest point marks the mean, and just as in probability, the farther a value is from the mean, the lower its frequency. In other words, values in the two tails deviate greatly from the mean and occur only rarely.
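To make the two views concrete, here is a minimal sketch of my own (it assumes numpy, scipy, and matplotlib are installed): the statistical view is a histogram of sampled values, and the probability view is the bell-shaped density curve laid on top.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mean, std = 0, 1
samples = np.random.normal(mean, std, 10000)   # the "statistics" view: observed values

# Histogram of the sampled values (the statistical normal distribution)...
plt.hist(samples, bins=50, density=True, alpha=0.5, label="sampled data")

# ...overlaid with the theoretical density curve (the probability view)
x = np.linspace(-4, 4, 200)
plt.plot(x, norm.pdf(x, mean, std), label="normal density")
plt.legend()
plt.show()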
If you suspect that there is another relationship between probability and statistics through a normal distribution, you are right! We’ll explore this important relationship later in this article, but don’t worry.
Now that we are going to use the distribution of quality scores to compare different wines, we need to set up some criteria to search for wines of interest. We will collect wine data and then isolate some wine quality scores of interest.
To get the data, we need the following code:
import csv
with open("wine-data.csv", "r", encoding="latin-1") as f:
    wines = list(csv.reader(f))
The data is presented in tabular form below. We need the points column, so we will extract it into its own list. A wine expert told us that Hungary's Tokaji wines are excellent, while a friend suggested we start with an Italian red, Lambrusco. We can use the data to compare these wines!
If you can't remember what the data looks like, here is a brief look at the table as a refresher.
# Extract the Tokaji scores
tokaji = []
non_tokaji = []
for wine in wines:
    points = wine[4]
    if points != '':
        if wine[9] == "Tokaji":
            tokaji.append(float(points))
        else:
            non_tokaji.append(float(points))

# Extract the Lambrusco scores
lambrusco = []
non_lambrusco = []
for wine in wines:
    points = wine[4]
    if points != '':
        if wine[9] == "Lambrusco":
            lambrusco.append(float(points))
        else:
            non_lambrusco.append(float(points))
If we visualize each set of quality scores as a normal distribution, we can immediately tell whether two distributions are the same based on where they sit. But this approach quickly runs into problems, as shown below. Because we have a lot of data, we will assume the scores are normally distributed. That assumption is fine here, but it can actually be dangerous, as I will discuss later.
When two score distributions overlap almost completely, it is safest to assume the scores come from the same distribution rather than different ones. At the other extreme, when the two distributions do not overlap at all, it is safe to assume they come from different distributions. The trouble lies in the cases of partial overlap. For example, the extreme high end of one distribution may intersect the extreme low end of another; how can we decide whether those scores come from different distributions?
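To see how much two score distributions actually overlap, a quick sketch of my own (assuming matplotlib is installed and the tokaji and lambrusco lists built by the extraction code above) plots both sets of scores as overlaid histograms:

import matplotlib.pyplot as plt

# Overlay the two score distributions to eyeball how much they overlap
plt.hist(tokaji, bins=10, alpha=0.5, label="Tokaji")
plt.hist(lambrusco, bins=10, alpha=0.5, label="Lambrusco")
plt.xlabel("Points")
plt.ylabel("Count")
plt.legend()
plt.show()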
So, once again, we expect the normal distribution to give us an answer and to bridge the gap between statistics and probability.
Let’s revisit the normal distribution
Normal distributions are important to probability and statistics for two reasons: the central limit theorem and the three sigma rule.
Central limit theorem
In the previous section, we showed that if the experiment of flipping a coin 10 times is repeated, the average number of heads approaches the ideal 50%. As the number of trials increases, the average gets closer to the true probability, even though the individual trials themselves are imperfect. This idea, that the estimates converge as we gather more of them, is a key principle behind the central limit theorem.
In the coin-toss case, with 10 flips per trial, we estimate the number of heads per trial to be 5. It is an estimate because we know it will not be perfect (we will not get exactly 5 heads every time). If we collect many such estimates, the central limit theorem says their distribution will look like a normal distribution, and the peak of that distribution, the expected value of the estimates, will line up with the true value. In statistics, the peak of a normal distribution coincides with the mean. Thus, given many "trials" as data, the central limit theorem lets us work out the likely shape of the distribution even when we do not know the true probability.
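The following sketch is my own illustration of this idea (it reuses the coin_trial idea from earlier in a compact form and assumes matplotlib is installed): collecting many per-trial head counts and plotting them shows the estimates piling up into a roughly normal shape centered on 5.

import random
import matplotlib.pyplot as plt

def coin_trial():
    # One estimate: the number of heads in 10 fair coin flips
    return sum(1 for _ in range(10) if random.random() <= 0.5)

# Collect many estimates; their distribution piles up into a roughly
# normal shape centered on the true expected value of 5.
estimates = [coin_trial() for _ in range(100000)]
plt.hist(estimates, bins=range(12), align="left", rwidth=0.8)
plt.xlabel("Heads in 10 flips")
plt.ylabel("Number of trials")
plt.show()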
The central limit theorem tells us that the mean of many trials will approach the true mean, and the three sigma rule tells us how much data will be distributed around that mean.
The three sigma rule
The three sigma rule, also known as the empirical rule or the 68-95-99.7 rule, describes how much of the observed data falls within a given distance of the mean. Recall that the standard deviation (the "sigma") measures the average distance of observations from the mean.
The three sigma rule states that, for a normal distribution, 68% of observations fall within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. Deriving these values involves a fair amount of mathematics, so it is beyond the scope of this article. The key point is that the three sigma rule tells us how much data lies in different intervals of a normal distribution. The following figure summarizes what the rule says.
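As a quick empirical check of these percentages, here is a small sketch of my own (assuming numpy is available): draw a large sample from a standard normal distribution and count how many values land within one, two, and three standard deviations of the mean.

import numpy as np

# Draw a large sample from a standard normal distribution and count how
# many values fall within 1, 2, and 3 standard deviations of the mean.
samples = np.random.normal(loc=0, scale=1, size=1000000)
for k in (1, 2, 3):
    within = np.mean(np.abs(samples) <= k)
    print(f"within {k} sigma: {within:.4f}")
# Should print values close to 0.6827, 0.9545, and 0.9973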
We can now connect these concepts to our wine data. Suppose that, as wine tasters, we want to know how Chardonnay and Pinot Noir compare in popularity to ordinary wines. We have collected thousands of wine reviews, and according to the central limit theorem, the average score across those reviews should line up with the "true" quality of the wine as judged by the reviewers.
While the three sigma rule tells us how much of the data lies within known ranges, it also tells us how rare extreme values are. Any value more than three standard deviations from the mean should be treated with care. Using the three sigma rule and the z-score, we can finally quantify the difference between Chardonnay and Pinot Noir.
Z-score
The z-score is a simple calculation that answers the question, "Given a data point, how many standard deviations is it from the mean?" Here is the z-score equation:
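The equation appeared as an image in the original post. The standard form is z = (x − μ) / σ, where x is the data point, μ is the mean, and σ is the standard deviation. Expressed as a small illustrative Python helper (the function name is my own):

def z_score(x, mean, std):
    # How many standard deviations x lies above (or below) the mean
    return (x - mean) / std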
The z-score by itself does not give you much information. It becomes valuable when compared against a z-table, which lists the cumulative probability of a standard normal distribution up to a given z-score. The standard normal distribution is a normal distribution with mean 0 and standard deviation 1. Even when our normal distribution is not standard, the z-score lets us look values up in the z-table.
The cumulative probability (given by the cumulative distribution function) is the sum of the probabilities of all values up to a given point. A simple example is the mean itself. The mean sits exactly in the middle of a normal distribution, so the probabilities accumulated from the far left up to the mean add to 50%. If you compute the cumulative probabilities between standard deviations, you recover exactly the values of the three sigma rule. The following is a visualization of the cumulative probability.
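The original post includes a figure here; as a complementary sketch of my own (assuming scipy is installed), we can compute these cumulative probabilities directly from the standard normal distribution:

import scipy.stats as st

# Cumulative probability of a standard normal up to its mean: exactly 50%
print(st.norm.cdf(0))                     # 0.5

# Probability of landing within one standard deviation of the mean,
# matching the 68% figure from the three sigma rule
print(st.norm.cdf(1) - st.norm.cdf(-1))   # about 0.6827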
The probabilities must sum to 100%, so we can use the z-table to compute the probability on either side of a given z-score under the normal distribution.
This probability of landing above a given z-score is exactly what we need. It takes us from "How far is a value from the mean?" to "How likely is it that a value this far from the mean comes from the same set of observations?" The probability obtained from the z-score and the z-table will therefore answer our question about the wines.
import numpy as np

tokaji_avg = np.average(tokaji)
lambrusco_avg = np.average(lambrusco)
tokaji_std = np.std(tokaji)
lambrusco_std = np.std(lambrusco)

# Let's see what the results are
print("Tokaji: ", tokaji_avg, tokaji_std)
print("Lambrusco: ", lambrusco_avg, lambrusco_std)

>>> Tokaji:  90.9 2.65015722804
>>> Lambrusco:  84.4047619048 1.61922267961
It looks like our friend's recommendation isn't so good after all! For the purposes of this article, we treat the scores of both Tokaji and Lambrusco as normally distributed, so the average score of each wine represents its "true" quality score. We will calculate the z-score to see how far the Tokaji average lies from the Lambrusco average.
z = (tokaji_avg - lambrusco_avg) / lambrusco_std
>>> 4.0113309781438229

# We'll bring in scipy to do the calculation of probability from the z-table
import scipy.stats as st
st.norm.cdf(z)
>>> 0.99996981130231266

# We need the probability from the right side, so we'll flip it!
1 - st.norm.cdf(z)
>>> 3.0188697687338895e-05
The answer is extremely small. But what does it actually mean? Such a tiny probability calls for careful interpretation.
Suppose there were no difference in quality between Tokaji and Lambrusco; in other words, suppose they are of similar quality. The scores of individual wines would still scatter somewhat because of differences between bottles. By the central limit theorem, if we made a histogram of the scores of the two wines, the quality scores would follow a normal distribution.
With some data in hand, we can calculate the mean and standard deviation for the two wines, and these values let us test whether the wines really are of similar quality. We use the Lambrusco scores as the baseline and compare the Tokaji average against them; we could just as easily do it the other way around, and the only difference would be a negative z-score.
The z-score comes out to 4.01! If Tokaji and Lambrusco were of similar quality, then by the three sigma rule 99.7% of the data should lie within three standard deviations. Under the assumption that the two wines are of the same quality, the probability of an average score landing this far from the mean is very, very small. The odds are so small that we have to consider the opposite conclusion: if Tokaji were different from Lambrusco, the two would produce different score distributions.
Note the careful choice of words here: I did not say "Tokaji is better than Lambrusco." The probability we calculated is microscopic, but it is not zero. What we can say is that Tokaji and Lambrusco almost certainly do not come from the same distribution of scores, but we cannot claim that one is better or worse than the other.
This kind of reasoning falls under inferential statistics, and this article aims only to give a brief introduction to it. We have covered a lot of concepts, so if your head is spinning, go back and reread the earlier sections.
Conclusion
We started with descriptive statistics and then related them to probability. From probability, we developed a method to judge quantitatively whether two sets of scores come from the same distribution. Using this approach, we compared two recommended wines and found that they most likely do not come from the same distribution of quality scores; in other words, one wine is likely to be better than the other.
Statistics is not the exclusive domain of statisticians. As a data scientist, an intuitive understanding of commonly used statistical methods will help you form theories and then test them. We have barely touched inferential statistics here, but the same ideas will help guide your understanding of statistical principles. This article discussed the advantages of the normal distribution, but statisticians have also developed techniques for non-normal distributions.
The original link: www.dataquest.io/blog/basic-…