It can be said that linear algebra provides machine learning with its data representations and computational tools, while probability and statistics provide the theoretical basis for designing machine learning algorithms. In general, the estimation of a machine learning model's parameters follows either the frequentist school or the Bayesian school of statistics

The frequentist school holds that although parameters are unknown, they are objectively fixed values, so maximum likelihood estimation is usually used to determine them. The Bayesian school, by contrast, holds that parameters are unobserved random variables with their own distribution; one can therefore assume the parameters follow a prior distribution and then compute their posterior distribution from the observed data

This article organizes the probability and statistics knowledge commonly used in machine learning

 

I. Basic concepts

1.1 Random Experiment (Trial)

A random experiment is an experiment whose outcome is uncertain. It satisfies three characteristics:

  • It can be repeated under the same conditions
  • It has more than one possible outcome, and all possible outcomes can be identified in advance
  • Which outcome will actually occur is not known until the experiment is carried out

 

1.2 Sample Space

The sample space is the set of all possible outcomes of a random experiment. It is usually denoted S (for space).

For example

  • If you flip a coin, the sample space is {heads, tails};
  • If you roll a die, the sample space is $\{1, 2, 3, 4, 5, 6\}$

 

1.3 Random Events

Any subset of the sample space is called a random event, usually denoted E (for event). Here an event is a set of possible outcomes of a single experiment, not a collection of multiple experiments

That sounds like a mouthful, so here is an example:

  • Event 1: rolling a die and getting a 2
  • Event 2: rolling a die and not getting a 2

But obviously, event 2 is much more likely than event 1.

 

1.4 Probability

For an event E, we denote by P(E) the probability of event E

There are three axioms:

  • $0 \leq P(E) \leq 1$: the probability of any event lies between 0 and 1
  • $P(S) = 1$: the probability of the sample space is 1
  • For mutually exclusive events $A_1, A_2, \dots, A_n$: $P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i)$
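
To make these definitions concrete, here is a minimal Python sketch (standard library only) that enumerates the sample space of a fair die, builds the two events from section 1.3, and checks the three axioms numerically; the `prob` helper is an illustrative function, not part of any library.

```python
from fractions import Fraction

# Sample space of a fair six-sided die; every outcome is equally likely
S = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a subset of S) under the uniform model."""
    return Fraction(len(event & S), len(S))

event1 = {2}          # rolling a 2
event2 = S - event1   # not rolling a 2

assert 0 <= prob(event1) <= 1                                 # axiom 1
assert prob(S) == 1                                           # axiom 2
assert prob(event1 | event2) == prob(event1) + prob(event2)   # axiom 3 (disjoint events)

print(prob(event1), prob(event2))   # 1/6 5/6
```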

 

II. Random variables

2.1 Random Variable

A random variable, usually denoted X, is a function that maps every element of the sample space to a real value. In essence, a random variable simply assigns numbers to random outcomes, as in the example below:

When we flip a coin, the outcome is heads or tails; if we represent heads and tails as 1 and 0, then {0, 1} is the set of all possible values of the random variable. In fact, we do not have to encode heads and tails as {0, 1}; we could just as well use {-100, 20}

Random variables can be divided into:

  • Discrete random variable: can only take certain isolated values, for example X ∈ {1, 2, 3}; such an X is a discrete random variable
  • Continuous random variable: can take any value in a range, such as the temperature in Shanghai or the height of a person
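
As a tiny illustration of "numbers assigned to outcomes" (a sketch; the {0, 1} coding is arbitrary, as noted above):

```python
import random

# Map each outcome of the coin-flip sample space to a real value.
# The particular numbers are arbitrary; {0, 1} is just the usual choice.
X = {"tails": 0, "heads": 1}

# Sampling the random variable = sampling the experiment, then applying the map
outcomes = [random.choice(["heads", "tails"]) for _ in range(10)]
print([X[o] for o in outcomes])   # e.g. [1, 0, 0, 1, 1, 0, ...]
```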

 

2.2 Cumulative Distribution Function (CDF)

The cumulative distribution function is defined as follows:


  • $F_X(x) = P(X \leq x)$

Properties of cumulative distribution function:


  • $\lim_{x \to -\infty} F_X(x) = 0$

  • $\lim_{x \to +\infty} F_X(x) = 1$

 

2.3 Probability Density Function (PDF)

The probability density function describes the relative likelihood that a continuous random variable takes values near a given point. A simple way to think about it: it tells you how likely X is to fall near a particular value

Relationship between probability density function and cumulative distribution function:

  • Taking the derivative of the cumulative distribution function gives the probability density function: $f_X(x) = \frac{\mathrm{d}}{\mathrm{d}x} F_X(x)$
  • Integrating the probability density function gives the cumulative distribution function: $F_X(x) = \int_{-\infty}^{x} f_X(t)\,\mathrm{d}t$
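
A quick numerical check of this relationship, sketched with the standard normal distribution (assuming numpy and scipy are available; any continuous distribution would do):

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid
from scipy.stats import norm

x = np.linspace(-4, 4, 2001)
pdf = norm.pdf(x)
cdf = norm.cdf(x)

# Differentiating the CDF numerically recovers the PDF
pdf_from_cdf = np.gradient(cdf, x)
print(np.max(np.abs(pdf_from_cdf - pdf)))   # close to 0

# Integrating the PDF numerically recovers the CDF
# (plus the small tail mass below x = -4)
cdf_from_pdf = cumulative_trapezoid(pdf, x, initial=0) + norm.cdf(-4)
print(np.max(np.abs(cdf_from_pdf - cdf)))   # close to 0
```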

 

2.4 Expectation

The expectation of a discrete random variable is the sum of each possible outcome multiplied by its probability. The expected value need not equal any single outcome; in other words, it is the weighted average of the variable's possible values

For example:

The expected value of a roll of a fair six-sided die is 3.5, calculated as follows:


  • $\operatorname{E}(X) = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = \frac{1+2+3+4+5+6}{6} = 3.5$

If $X$ is a continuous random variable with probability density function $f(x)$, and the integral $\int_{-\infty}^{\infty} x f(x)\,\mathrm{d}x$ converges absolutely, then the expected value of $X$ is $\operatorname{E}(X) = \int_{-\infty}^{\infty} x f(x)\,\mathrm{d}x$. This is essentially the same as the discrete case: because the values are continuous, the sum is replaced by an integral
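
A minimal sketch of both calculations (assuming numpy and scipy): the probability-weighted sum for the die, and the integral $\int x f(x)\,\mathrm{d}x$ for a continuous variable, here a standard normal whose expectation is 0.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# Discrete case: expectation of a fair die as a probability-weighted sum
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)
print(np.sum(values * probs))   # 3.5

# Continuous case: E(X) = integral of x * f(x) dx, here for a standard normal
ev, _ = integrate.quad(lambda x: x * norm.pdf(x), -np.inf, np.inf)
print(ev)                       # approximately 0.0
```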

 

2.5 Variance and Standard Deviation

The variance of a random variable describes how spread out it is, i.e., how far its values tend to be from its expectation

The formula of variance:

  • $\operatorname{Var}(X) = \operatorname{E}\left[(X - \mu)^2\right]$, where $\mu$ is the expectation of $X$

Standard deviation: The positive square root of the variance is called the standard deviation of this random variable
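
Continuing the die example, a short sketch (assuming numpy) that computes the variance directly from the definition $\operatorname{E}[(X-\mu)^2]$ and the standard deviation as its positive square root:

```python
import numpy as np

values = np.arange(1, 7)      # faces of a fair die
probs = np.full(6, 1 / 6)

mu = np.sum(values * probs)               # expectation: 3.5
var = np.sum((values - mu)**2 * probs)    # variance: ~2.917
std = np.sqrt(var)                        # standard deviation: ~1.708
print(mu, var, std)
```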

 

2.6 Covariance

Covariance measures the joint variability of two random variables. Variance is a special case of covariance: the covariance of a variable with itself

Covariance formula:

  • $\operatorname{Cov}(X, Y) = \operatorname{E}\left[(X - \mu)(Y - \nu)\right] = \operatorname{E}(X \cdot Y) - \mu\nu$

    where $\mu$ is the expectation of $X$ and $\nu$ is the expectation of $Y$
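
A quick numerical check of the identity $\operatorname{Cov}(X, Y) = \operatorname{E}(XY) - \mu\nu$ on simulated data (a sketch assuming numpy; the data-generating process below is made up for illustration, and `bias=True` makes `np.cov` divide by N so it matches the definition):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)   # y correlated with x; true Cov(X, Y) = 0.5

# Covariance from the definition E(XY) - E(X)E(Y)
cov_manual = np.mean(x * y) - np.mean(x) * np.mean(y)

# Covariance from numpy for comparison
cov_np = np.cov(x, y, bias=True)[0, 1]

print(cov_manual, cov_np)                # both approximately 0.5
```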

 

III. Conditional probability and Bayes’ theorem

3.1 Conditional Probability

Conditional probability is the probability that event A occurs given that event B has occurred, denoted $P(A \mid B)$ and read as "the conditional probability of A given B"

Joint probability: the probability of two events occurring together. The joint probability of A and B is written $P(A \cap B)$ or $P(AB)$

Conditional probability formula: the conditional probability of event A given that event B has occurred is $P(A \mid B) = \frac{P(AB)}{P(B)}$
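
A small Monte Carlo sketch of this formula (standard library only), using a die roll with the hypothetical events A = "the roll is 2" and B = "the roll is even":

```python
import random

random.seed(0)
N = 100_000
rolls = [random.randint(1, 6) for _ in range(N)]

p_b = sum(r % 2 == 0 for r in rolls) / N    # P(B): roll is even
p_ab = sum(r == 2 for r in rolls) / N       # P(AB): roll is 2 (which is also even)

# P(A|B) = P(AB) / P(B); the exact value is (1/6) / (1/2) = 1/3
print(p_ab / p_b)                           # approximately 0.333
```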

 

3.2 Bayes Rule

Bayes’ theorem describes the probability of an event occurring under given conditions.

Bayes’ formula: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$

Derivation of Bayes’ theorem:

  1. According to the definition of conditional probability, the probability of event A occurring under the condition that event B occurs is:


    $P(A \mid B) = \frac{P(AB)}{P(B)}$

  2. Similarly, the probability of event B occurring under the condition that event A occurs:


    $P(B \mid A) = \frac{P(AB)}{P(A)}$

  3. Combining these two equations, we get:

    $P(A \mid B)\,P(B) = P(AB) = P(B \mid A)\,P(A)$

    i.e., $P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$

  4. When $P(B)$ is not equal to 0, divide both sides by $P(B)$:

    $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$, which is Bayes’ formula

In Bayes’ theorem, each term has a conventional name:

  • $P(A)$ is the prior probability of A, because it does not take B into account;
  • $P(A \mid B)$ is the probability of A given that B has occurred, called the posterior probability;
  • $P(B \mid A)$ is the probability of B given that A has occurred, called the likelihood of B

The most important application of Bayes’ theorem is Bayesian inference, which is a very important part of machine learning
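
As a sketch of Bayes’ formula in action (all numbers below are hypothetical, chosen only to illustrate the calculation): suppose a condition has a prior probability of 1%, a test detects it with probability 95% (the likelihood), and the test also comes back positive for 5% of people without the condition. The posterior probability given a positive test then follows directly from the formula.

```python
# Hypothetical numbers, purely to illustrate Bayes' formula
p_a = 0.01              # prior P(A): having the condition
p_b_given_a = 0.95      # likelihood P(B|A): positive test given the condition
p_b_given_not_a = 0.05  # P(B|not A): positive test without the condition

# Total probability of a positive test, P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A|B) = P(B|A) * P(A) / P(B)
print(p_b_given_a * p_a / p_b)   # approximately 0.161
```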

 

IV. Probability distributions

4.1 Bernoulli Distribution

The Bernoulli distribution is also known as the 0-1 distribution

It is a discrete distribution; its probability mass function is:

  • The probability of 1 (success) is: $P(X = 1) = p$, where $0 \leq p \leq 1$

  • The probability of 0 (failure) is: $P(X = 0) = 1 - p$

Its expectation is:


  • $\operatorname{E}[X] = \sum_{i=0}^{1} x_i f_X(x_i) = 0 + p = p$

Variance is:


  • $\operatorname{Var}[X] = \sum_{i=0}^{1} (x_i - \operatorname{E}[X])^2 f_X(x_i) = (0-p)^2(1-p) + (1-p)^2 p = p(1-p)$
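
A quick check with scipy (a sketch; p = 0.3 is an arbitrary example value) that the pmf, mean, and variance of a Bernoulli distribution match the formulas above:

```python
from scipy.stats import bernoulli

p = 0.3                                  # arbitrary example value
print(bernoulli.pmf([0, 1], p))          # [0.7 0.3], i.e. P(X=0) = 1-p, P(X=1) = p
print(bernoulli.mean(p), p)              # both 0.3
print(bernoulli.var(p), p * (1 - p))     # both 0.21
```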

 

4.2 Binomial Distribution

The binomial distribution is the discrete probability distribution of the number of successes in n independent Bernoulli trials. When n = 1, the binomial distribution reduces to the Bernoulli distribution

The probability of getting exactly k successes in n trials is given by the probability mass function:


  • $f(k, n, p) = \Pr(X = k) = C_n^k \, p^k (1-p)^{n-k}$

    for $k = 0, 1, 2, \dots, n$, where $C_n^k = \frac{n!}{k!\,(n-k)!}$

Its expectation is:


  • $\operatorname{E}[X] = np$

Variance:


  • $\operatorname{Var}[X] = np(1-p)$
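
A small sketch with scipy (n = 10, p = 0.3, k = 4 chosen arbitrarily) comparing the closed-form pmf, mean, and variance with the library values:

```python
from math import comb
from scipy.stats import binom

n, p, k = 10, 0.3, 4                         # arbitrary example values

# PMF from the formula C(n, k) * p^k * (1-p)^(n-k) vs scipy
pmf_manual = comb(n, k) * p**k * (1 - p)**(n - k)
print(pmf_manual, binom.pmf(k, n, p))        # both approximately 0.200

print(binom.mean(n, p), n * p)               # both 3.0
print(binom.var(n, p), n * p * (1 - p))      # both 2.1
```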

 

4.3 Geometric Distribution

The geometric distribution is the distribution of the number of Bernoulli trials X needed to obtain the first success

The probability mass function is:


  • $P(X = k) = (1-p)^{k-1}\,p$

Its expectation is:


  • $\operatorname{E}(X) = \frac{1}{p}$

The variance is:


  • $\operatorname{Var}(X) = \frac{1-p}{p^2}$
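
A sketch with scipy, whose `geom` distribution uses the same "number of trials until the first success" convention with support k = 1, 2, ... (p = 0.2 and k = 3 are arbitrary example values):

```python
from scipy.stats import geom

p, k = 0.2, 3                                # arbitrary example values

# PMF from the formula (1-p)^(k-1) * p vs scipy
print((1 - p)**(k - 1) * p, geom.pmf(k, p))  # both 0.128

print(geom.mean(p), 1 / p)                   # both 5.0
print(geom.var(p), (1 - p) / p**2)           # both 20.0 (up to floating point)
```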

 

4.4 Poisson Distribution

The Poisson distribution describes the probability distribution of the number of times a random event occurs per unit of time

For example:

  • The number of service requests received by a service facility in a given period of time
  • The number of calls received by a call center in a certain period of time
  • The number of times a machine breaks down in a given period of time

The probability mass function of the Poisson distribution is:

  • $P(X = k) = \frac{e^{-\lambda}\,\lambda^k}{k!}$

    where $\lambda$ is the average number of occurrences per unit time and $k$ is the number of occurrences

Expectation and variance:

  • For a random variable following the Poisson distribution, the expectation and variance are equal, and both are $\lambda$:

    $E(X) = V(X) = \lambda$
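
A short sketch with scipy (λ = 3 is an arbitrary example, e.g. an average of 3 calls per unit time) confirming the pmf formula and that the mean and variance both equal λ:

```python
from math import exp, factorial
from scipy.stats import poisson

lam, k = 3.0, 2                              # arbitrary example values

# PMF from the formula e^(-lambda) * lambda^k / k! vs scipy
pmf_manual = exp(-lam) * lam**k / factorial(k)
print(pmf_manual, poisson.pmf(k, lam))       # both approximately 0.224

# Expectation and variance are both lambda
print(poisson.mean(lam), poisson.var(lam))   # 3.0 3.0
```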

 

4.5 Normal Distribution

The normal distribution is also known as the Gaussian distribution

If a random variable X follows a normal distribution with location parameter $\mu$ and scale parameter $\sigma$, we write:


  • $X \sim N(\mu, \sigma^2)$

Its probability density function (PDF) is:


  • $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

The expectation $\mu$ of a normal distribution determines where the distribution is centered; its variance $\sigma^2$ determines how spread out it is
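
A short sketch (assuming scipy; μ = 1 and σ = 2 are arbitrary example values) evaluating the density formula directly and comparing it with `scipy.stats.norm`, where `loc` is μ and `scale` is σ:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0    # arbitrary example parameters
x = 0.5

# Density from the formula vs scipy's norm.pdf
pdf_manual = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
print(pdf_manual, norm.pdf(x, loc=mu, scale=sigma))                    # identical values

# Expectation mu and variance sigma^2
print(norm.mean(loc=mu, scale=sigma), norm.var(loc=mu, scale=sigma))   # 1.0 4.0
```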

 

4.6 Exponential Distribution

An exponential distribution can be used to represent the time interval between random events

For example:

  • The time interval between passengers entering the airport
  • The interval between calls to the call center

The probability density function is:

  • $f(x; \lambda) = \lambda e^{-\lambda x}$, for $x \geq 0$

    where $\lambda$ is the number of times the event occurs per unit time

The cumulative distribution function is:


  • $F(x; \lambda) = 1 - e^{-\lambda x}$

The expected value is:


  • $\operatorname{E}[X] = \frac{1}{\lambda}$

    For example, if you average two calls per hour, you can expect to wait half an hour for each call

Variance is:


  • $\operatorname{V}[X] = \frac{1}{\lambda^2}$
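
Finally, a sketch with scipy for the call-center example above (λ = 2 calls per hour; note that `scipy.stats.expon` is parameterized by `scale = 1/λ`):

```python
from scipy.stats import expon

lam = 2.0                      # on average 2 calls per hour
dist = expon(scale=1 / lam)    # scipy's exponential uses scale = 1/lambda

print(dist.mean())             # 0.5  -> expect to wait half an hour between calls
print(dist.var())              # 0.25 = 1 / lambda^2
print(dist.cdf(0.5))           # P(wait <= 0.5 h) = 1 - e^(-1), about 0.632
```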