2. The theory of probability
2.1 Probability distribution and random variables
2.1.1 Why do machine learning use probability
The probability of an event is a measure of the likelihood that the event will occur. Although the occurrence of an event in a randomised trial is contingent, randomised trials that can be repeated in large numbers under the same conditions tend to show clear quantitative regularities.
Machine learning usually has to deal with uncertain quantities, and sometimes with random quantities. Almost all activities require some ability to reason in the presence of uncertainty.
Uncertainty and randomness can come from many sources, with 3 possible sources of uncertainty:
- The randomness inherent in the system being modeled. A card game, for example, assumes that cards are actually shuffled in random order.
- Incomplete observation. For a deterministic system, randomness can occur if all the variables that drive the system’s behavior cannot be observed. For example, if a contestant chooses one of three doors and gets a prize behind each door, the prize behind each door is determined, but the contestant cannot observe it, so for the contestant, the result is uncertain.
- Incomplete modeling. When some models must discard some information, the discarded information may lead to uncertainty in the prediction of the model.
In many cases, it is more practical to adopt simple and uncertain rules than complex and certain rules.
You can use probability theory to quantify uncertainty. A measure of trust is probability, which is directly related to the frequency of an event, which is called frequency-ism probability. For example, the probability that something will happen is P, which means that if you try it out an infinite number of times, there is a P probability that it will happen; The level of certainty involved is called Bayesian probability, where, for example, the probability of a doctor diagnosing a patient as having a disease is P.
Probability theory plays a central role in machine learning because the design of machine learning algorithms often relies on probabilistic assumptions about data.
For example, in machine learning (Andrew Ng), there will be a naive Bayes hypothesis that is an example of conditional independence. The learning algorithm makes assumptions about the content and is used to tell whether an E-mail message is spam. Assume that the probability condition for the word X to appear in the message is independent of the word Y, whether the message is spam or not. Obviously this assumption is not without loss of generality, because certain words almost always appear together. The end result, however, is that this simple assumption doesn’t make much difference, and in any case allows us to quickly identify spam.
2.1.2 What are the differences between variables and random variables
A random variable is a variable that can take different values randomly.
It represents a real-valued function (all possible sample points) of the various outcomes of random phenomena (under certain conditions, phenomena that do not always produce the same results are called random phenomena). For example, the number of passengers waiting at a bus stop at a certain time, the number of calls received by the telephone exchange in a certain period of time, etc., are examples of random variables. The essential difference between the uncertainty of random variable and fuzzy variable is that the measurement result of the latter still has uncertainty, that is, fuzziness.
The difference between variable and random variable: when the probability of the value of a variable is not 1, the variable becomes a random variable; A random variable becomes a variable when the probability is 1.
For example, if the probability of the variable XXX being 100 is 1, then x=100x=100x=100 is determined and will not change unless further operations are performed. When the probability of variable XXX is 100 is not 1, for example, the probability of 50 is 0.5, and the probability of 100 is 0.5, then the variable will change with different conditions. It is a random variable, and the probability of 50 or 100 is 0.5, that is, 50%.
2.1.3 Relationship between random variables and probability distribution
A random variable represents only one possible state, and the probability distribution associated with it must be given to specify the probability of each state. A method used to describe the probability of each possible state of a random variable, or a cluster of random variables, is probability distribution **.
Random variables can be divided into discrete random variables and continuous random variables.
The corresponding function describing its probability distribution is:
-
Probability Mass Function (PMF): Describes the Probability distribution of discrete random variables, usually represented by capital PPP.
-
Probability Density Function (PDF) : Describes the Probability distribution of continuous random variables, usually represented by lowercase PPP.
2.1.4 Discrete random variables and probabilistic quality functions
PMF maps each state that a random variable can achieve to the probability that the random variable achieves that state.
- In general, P(x)P(x)P(x) represents the probability that x =x, x =x, x =x, x =x. Probability 1 means x =x, x =xX=x is certain, and probability 0 means x =x, x =xX=x is impossible.
- Sometimes to avoid confusion, we’ll explicitly name the random variable P(P(P(x =x)=x)=x).
- Sometimes you have to define a random variable, and then you have to define the probability distribution that it follows that x follows P(P(P(x))).
PMF can act on multiple random variables simultaneously, That is, joint probability distribution P(X= X,Y= Y)P(X= X,Y= Y)P(X= X,Y= Y) represents the probability of simultaneous occurrence of X=xX=xX= X and Y= Y,Y= Y,Y= Y, Or you could write it as P of x,y, P of x,y, P of x,y.
If a function PPP is the PMF of the random variable XXX, then it must satisfy the following three conditions:
- The domain of PPP must be the set of all possible states
- ∀ X ∈∀x∈∀ X ∈x, 0≤P(x)≤10 \leq P(x) \leq 10 ≤P(x)≤1.
- (x) = 1 ∑ ∑ x ∈ XP _ {x ∈ x} P (x) = 1 ∑ x ∈ XP (x) = 1. This property is called normalized. If it isn’t, the probability of something happening is greater than 1.
2.1.. Continuous random variable and probability density function
If a function PPP is a PDF of X, it must satisfy the following conditions
- The domain of PPP must be the set of all possible states of X.
- ∀ x ∈ x, p (x) p 0 ∀ x ∈ x, p (x) or 0 ∀ x ∈ x, p (x) or 0. Note that we do not require p(x)≤1p(x)≤ 1p(x)≤1, because here P (x)p(x)p(x) is not a specific probability corresponding to this state, but rather a relative magnitude (density) of the probability. The exact probability, you have to integrate it.
- The integral of p(x)dx is equal to 1, the sum is still 1, the probability is still 1.
Note: PDF P (x)p(x)p(x) p(x) does not directly give a probability for a particular state. Instead, it gives a probability of falling within a wirelessly small region of area δxδxδx, p(x)δxp(x)δ XP (x)δx.
[a,b][a,b][a,b] is ∫abp(x)dx \int_{a}^{b}p(x)dx.
2.1.6 Examples for Understanding conditional probability
The conditional probability formula is as follows:
Note: Events or subsets OF AAA and BBB in the same sample space ω \Omega ω, if one element randomly selected from ω \Omega ω belongs to BBB, then the probability that the next randomly selected element belongs to AAA is defined as the conditional probability of AAA given BBB.
The conditional probability Venn diagram is shown in FIG. 1.1.
FIG. 1.1 Conditional probability Venn diagram
According to the Venn diagram, it is clear that the probability of event A occurring in the case of event B is P(A⋂B)P(A\bigcap B)P(A⋂B) divided by P(B)P(B)P(B) P(B).
For example: what is the probability that a couple has two children, given that one of them is a girl, and the other is a girl? (Interview and written test)
Exhaustive method: given that one of them is a girl, then the sample space is male and female, female and male, then the probability that the other one is still a girl is 1/3.
Conditional probability method: Female ∣ P (female) = P (female female)/P (female) P (female |) = P (female female)/P (female) P (female ∣) = P (female female)/P (female), the couple has two children, so the sample space for female female, male and female, female, male, MSM, then P (female female) P (female female) P (female female) for a quarter, P (female) = 1 – P (MSM) = 3/4 P (female) = 1 – P (MSM) = 3/4 P (female) = 1 – P (MSM) = 3/4, so finally 1/31/31/3.
There’s a misconception here that men and women are the same case as women and men, but it’s actually a different case for siblings and siblings.
2.1.7 Association difference between joint probability and edge probability
Differences: Joint probability: Joint probability is similar to P(X=a,Y=b)P(X=a,Y=b)P(X=a,Y=b), containing multiple conditions, all of which are true at the same time. Joint probability refers to the probability that multiple random variables satisfy their respective conditions in a multivariate probability distribution.
Edge probability: Edge probability is the probability of an event happening, independent of other events. Marginal probability refers to the probability of a single random variable, similar to P(X=a)P(X=a)P(X=a), P(Y=b)P(Y=b)P(Y=b).
Contact:
The joint distribution can be obtained by the marginal distribution, but if only the marginal distribution is known, the joint distribution cannot be obtained.
2.1.8 Chain rule of conditional probability
According to the definition of conditional probability, the following multiplication formula can be directly obtained: multiplication formula If A,BA, BA,B are two events and P(A)>0P(A) >0P(A) >0, then there is
To promote
In general, it can be proved by induction: if P(A1A2… An)>0P(A_1A_2… A_n)>0P(A1A2… An) > 0, there are
Any joint probability distribution of multidimensional random variables can be decomposed into a conditional probability multiplication with only one variable.
2.1.9 Independence and Conditional independence
Independent of the two random variables XXX and YYY, the probability distribution can be expressed as the product of two factors, one factor contains only XXX, the other factor contains only YYy, it can be said that the two random variables are independent of each other **. Conditions sometimes bring independence between events that are not independent, and sometimes they make otherwise independent events lose their independence because of the existence of the condition.
Example: P (X, Y) = P (X) P (Y), P (X, Y) = P (X) P (Y), P (X, Y) = P (X) P (Y), event XXX and YYY independence. Given ZZZ,
When events are independent, the joint probability is equal to the product of the probabilities. This is a nice mathematical property, but unfortunately, unconditional independence is rare, because for the most part, events affect each other.
Conditional independence Given ZZZ, XXX and YYY are conditionally independent, if and only if
The relationship between XXX and YYY depends on ZZZ rather than being directly generated.
Example define the following events: XXX: it rains tomorrow; YYY: The ground is wet today. ZZZ: Is it raining today? The establishment of ZZZ event has an impact on BOTH XXX and YYY. However, under the premise of the establishment of ZZZ event, the ground situation today has no impact on whether it rains tomorrow.
2.1.10 Common formulas
The formula for the basis of probability
Full probability
The bayesian
2.1.11 application
Take the ball
N balls, with or without return
-
There are put back extraction, extraction of M in a row, find the number of different arrangements: NMN ^ MNM
-
If there is no extraction put back, extract M in a row and calculate the number of different permutations: N! N – (m)! \frac{n! }{(n-m)! } (n – m)! n!
2.2 Common probability distributions
2.2.1 Uniform distribution
Uniform distribution of discrete random variables: Assuming that X has k values, the probability quality function of uniform distribution is:
Uniform distribution of continuous random variables: Assuming that X is uniformly distributed on [a, B], its probability density function is:
2.2.1 Bernoulli distribution
The Bernoulli distribution (0-1 distribution) is a single-valued random variable distribution with a single parameter ϕ\phiϕ∈[0,1] controlling ϕ\phiϕ gives the probability that a random variable is equal to 1. The main properties are:
Its expectation and variance are:
Bernoulli distribution is suitable for modeling discrete random variables.
Multinoulli distribution, also known as a categorical distribution, is a random distribution of a single K value and is often used to represent the distribution of object classes. Where KKK is finite. Multinoulli distribution is parameterized by the vector p⃗∈[0,1]k−1\vec{p}\in[0,1]^{k-1}p ∈[0,1]k−1. Each component pip_ipi represents the probability of the third state and pk=1−1Tpp_k=1-1^Tppk=1−1Tp. Here, 1T1 to the T1T is the transpose of the column vector with all 1’s, which is just the sum of the probabilities of p minus k. Can be rewritten as pk = 1 – ∑ 0 k – 1 pip_k = 1 – \ sum_ {0} ^ {1} k p_ipk = 1 – ∑ 0 k – 1 PI.
Supplementary binomial distribution, multinomial distribution:
Binomial distribution, popular coin flips. Binomial distribution is the discrete probability distribution of n – fold Bernoulli test success times.
Define success probability of x times as: f (x) = Cnxpx n – (1 – p) x, x ∈ 0, 1,…, nf (x) = C_n (1 – p) ^ ^ ^ xp x} {n – x, x \ {0, 1, \ \ cdots, n} in f (x) = Cnxpx n – (1 – p) x, x ∈ 0, 1,…, n.
The expectation is NP, and the variance is NP times 1 minus P.
Multinomial Distribution is a generalization of the binomial Distribution. If you do n Bernoulli experiments on a binomial, which specifies that there are only two results per experiment, if you still do n experiments, but you can have m results per experiment, and the probabilities of m results are mutually exclusive and sum is 1, then the probability of one of them happening X times is polynomial.
2.2.3 Gaussian Distribution
Gaussian is also called Normal Distribution. The probability function is as follows:
, mu, mu mu and sigma, sigma sigma is mean and standard deviation, respectively, the peak center x coordinates are given by mu, mu mu, the width of the peak is governed by sigma \ sigma sigma, maximum points in x = u x = \ mux = u get, inflection point for x = u + sigma x = / mu/PM/sigmax = u + sigma
In a normal distribution, the probabilities of ±1 σ\sigma, ±2 σ\sigma, and ±3 σ\sigma are 68.3%, 95.5%, and 99.73%, respectively. These three numbers are best remembered.
In addition, if μ=0,σ=1\mu=0,\sigma=1μ=0,σ=1, the Gaussian distribution can be simplified to the standard normal distribution:
Efficient evaluation of probability density function:
β=1σ2\beta=\frac{1}{\sigma^2}β=σ21 The distribution accuracy is controlled by the parameter β∈ (0, ∞) \beta∈ (0, \infty) β∈ (0, ∞).
2.2.4 When to adopt normal distribution
Q: When is the normal distribution adopted?
Answer: Without prior knowledge of distribution on real numbers, it is always right to choose normal distribution by default when I do not know which form to choose. The reasons are as follows:
- The central limit theorem tells us that many independent random variables approximately obey normal distribution. In reality, many complex systems can be modeled as normal distribution noise, even if the system can be structurally decomposed.
- Normal distribution is the distribution with the greatest uncertainty among all probability distributions with the same variance. In other words, normal distribution is the distribution with the least prior knowledge added to the model.
Extension of normal distribution:
The normal distribution can be generalized to the RnR^nRn space, then called the multi-digit normal distribution, whose parameters are a positive definite symmetric matrix Sigma \Sigma Sigma:
Efficient evaluation of probability density with mostly normal distribution:
Here, β⃗\vec\betaβ is an accuracy matrix.
2.2.5 Exponential distribution
In deep learning, exponential distribution is used to describe the distribution of boundary points obtained at x=0x=0x=0, and the exponential distribution is defined as follows:
The exponential distribution uses the indicator function Ix≥0I_{x\geq 0}Ix≥0 to make the probability of XXX taking a negative value zero.
2.2.6 Laplace Distribution
A closely related probability distribution is the Laplace distribution, which allows us to set the peak of the probability mass at any point μ\muμ
The expectation is μ\muμ and the variance is 2γ22\gamma^22γ2
The Laplacian distribution is sharper and narrower than the Gaussian, a property that is often exploited in regularization.
2.2.7 Poisson distribution
Given that the average number of events per unit time (or unit area) is λ, the Poisson distribution describes the probability that the specific number of events per unit time (or unit area) is k. Probability density function:
The expectation is lambda lambda lambda, and the variance is lambda lambda lambda lambda.
2.2.8 Dirac distribution and empirical distribution
The Dirac distribution ensures that all the masses in the probability distribution are concentrated at one point. The Dirac delta \delta delta function (also known as the unit impulse function) of the Diract distribution is defined as follows:
The Dirac distribution is often presented as a component of empirical distribution
Where, m points x1… ,xmx^{1},… ,x^{m}x1,… , xM is the given data set, and the empirical distribution assigns the probability density 1m\ FRAc {1}{m}m1 to these points.
When we train the model on the training set, we can assume that the empirical distribution obtained from the training set indicates the source of the sample.
Dirac delta function is suitable for the empirical distribution of continuous random variables.
Another important point about the empirical distribution is that it is the probability density function with the greatest likelihood of the training data.
2.2.9 Mixed distribution
It is also common to define new probability distributions by combining some simple probability distributions.
A common combinatorial approach is to construct a mixed distribution. The mixed distribution consists of some component distribution.
An example of a mixed distribution is that the empirical distribution of real-valued variables for each training instance is the mixed distribution with the Dirac distribution as a component.
A hybrid model is a simple strategy that combines simple probability distributions to produce richer ones. A very powerful and common mixture model is the Gaussian mixture model.
Its components are gaussian distributed, and each component has its own parameters, mean and covariance matrices.
2.3 Expectation, variance, covariance and correlation coefficient
2.3.1 expectations
The expectation of f of x, or the expected value of f of x, with respect to some distribution P of x, is the average value of f of x, when x is produced by P, when f acts on x.
In probability theory and statistics, the mathematical expectation (or mean, also known as expectation) is the sum of the probabilities of each possible outcome of an experiment multiplied by its outcomes. It reflects the average value of the random variable.
- Linear operation: E (ax + by + c) = aE (x) + bE (y) + cE (ax + by + c) = aE (x) + bE (y) + cE (ax + by + c) = aE (x) + bE + c (y)
- Promotion form: E (∑ naixi k = 1 + c) = ∑ k = 1 naie (xi) + cE (\ sum_ {k = 1} ^ {n}} {a_ix_i + c) = \ sum_ {k = 1} ^ {n} {a_iE (x_i) + c} E (∑ naixi k = 1 + c) = ∑ naie k = 1 + c (xi)
- Function expectation: set
for
Function of, then
Expectations for- Discrete function: E (f (x)) = ∑ k = 1 nf (xk) P (xk) (f (x) = E \ sum_ {k = 1} ^ {n} {f (x_k) P (x_k)} E (f (x)) = ∑ k = 1 nf (xk) P (xk)
- Continuous function: E (f (x)) = ∫ – up + up f (x) p (x) dxE (f (x) = \ int_ ^ {- \ infty} {+ \ infty} {f (x) p (x) dx} E (f) (x) = ∫ – up + up f (x) p (x) dx
Note:
- The expectation of function is greater than or equal to the desired function (Jensen inequality (Jason), namely, E (f (x)) ⩾ f (x) (E) (f (x)) E \ geqslant f (x) (E) E (f (x)) ⩾ f (x) (E)
- In general, the expectation of a product is not equal to the product of expectations.
- If the XXX and YYY are independent of each other, the E (x, y) = E (x) E (y) E (x, y) = E (x) E (y) E (x, y) = E (x) E (y).
2.3.2 variance
The variance in probability theory measures the deviation between a random variable and its mathematical expectation, the mean. Variance is a special kind of expectation. Defined as:
Variance properties:
1) Var (x) = E (x2) – E (x) 2 Var (x) = E – E (x) (x ^ 2) ^ 2 Var (x) = E (x2) – E (x) 2 2) constant variance to zero; 3) Variance does not meet linear properties; Var(ax+by)=a^2Var(x)+b^2Var(y)Var(ax+by)=a ^2Var(x)+b^2Var(y)Var(ax+by)=a ^2Var(x)+b^2Var(y)Var(ax+by)=a ^2Var(x)+ b2Var(y)
2.3.3 covariance
Covariance is a measure of the strength and scale of the linear correlation between two variables. The covariance of the two random variables is defined as:
Variance is a special kind of covariance. When X = YX = YX = Y, Cov (X, Y) = Var (X) = Var (Y) Cov (X, Y) = Var (X) = Var (Y) Cov (X, Y) = Var (X) = Var (Y).
Covariance properties:
1) The covariance of the independent variable is 0. 2) Covariance calculation formula:
3) Special circumstances:
2.3.4 Correlation coefficient
Correlation coefficient is a measure of the degree of linear correlation between variables. The correlation coefficient of the two random variables is defined as:
Properties of correlation coefficient: 1) boundedness. The value range of the correlation coefficient is [-1,1], which can be regarded as the dimensionless covariance. 2) The closer the value is to 1, the stronger the positive correlation (linear) between the two variables is. The closer it is to -1, the stronger the negative correlation is. When it is 0, the two variables have no correlation.
2.4 information theory
Information theory is concerned with quantifying how much new a signal contains.
A basic idea of information theory is that the occurrence of an unlikely event provides more information than the occurrence of a very likely event.
If you want to quantify information using this basic idea, you need to satisfy these three properties:
- There should be less information about highly likely events, and in extreme cases, there should be no information about events that can occur;
- Less likely events have a higher amount of information;
- Individual events should have incremental information. For example, flipping a coin twice heads should convey twice as much information as flipping a coin once proving heads.
The self-information of an event x= XXX is defined as:
Self-information can only handle a single output. Shannon entropy can be used to quantify the total amount of uncertainty in the whole probability distribution:
I’ll call that H of P. Here E stands for expectation, which means that shannon entropy of a distribution is the amount of expected information generated by events that follow that distribution.
If there are two separate probability distributions P(x) and Q(x) for a random variable, the KL divergence can be used to measure the difference between the two distributions:
Example: for a binary random distribution, shannon entropy H (x) = – (1 – p) log (1 – p) – plogpH (x) = – (1 – p) log (1 – p) – plogpH (x) = – (1 – p) log (1 – p) – plogp
The properties of KL divergence are as follows:
- The negative;
- KL divergence is 0 if and only if P and Q are identically distributed in the case of discrete variables, or “almost everywhere” in the case of continuous variables;
- It’s often used to measure some distance between distributions, but not really distance, because it’s not symmetric.
A similar to KL divergence is cross entropy, or H (P, Q) = H (P) + DKL (P ∣ ∣ Q) H (P, Q) = H (P) + D_ {KL} (P | | Q) H (P, Q) = H (P) + DKL (P ∣ ∣ Q) :
Minimizing cross entropy for Q is equivalent to minimizing KL divergence because Q does not participate in the omitted term.
When calculating these quantities, we often encounter the expression 0log0, which is generally treated as limx−>0xlogx=0lim_{x->0}xlogx = 0LIMx −>0xlogx=0
Welcome to follow my official account –AI Algorithm Notes, get more AI algorithm notes, paper reading notes.