C Probability Review
In this appendix, we briefly review some basic concepts of probability theory and define the notation used throughout the material.
C.1 Probability
A probability space consists of three components: a sample space, an event set, and a probability distribution:
- Sample space $\Omega$: $\Omega$ is the set of all elementary events or outcomes possible in an experiment, such as the set of all possible outcomes $\{1, \dots, 6\}$ when a die is rolled.
- Event set $\mathcal{F}$: $\mathcal{F}$ is a $\sigma$-algebra, that is, a set of subsets of $\Omega$ containing $\Omega$ that is closed under complementation and countable unions (and therefore also countable intersections). An example of an event is "the die lands on an odd number".
- Probability distribution: $\mathbb{P}$ is a mapping from the event set $\mathcal{F}$ to $[0, 1]$ such that $\mathbb{P}[\Omega] = 1$, $\mathbb{P}[\emptyset] = 0$, and, for mutually exclusive events $A_1, \dots, A_n$,
$$\mathbb{P}\Big[\bigcup_{i=1}^{n} A_i\Big] = \sum_{i=1}^{n} \mathbb{P}[A_i].$$
The discrete probability distribution of a uniform die can then be defined by $\mathbb{P}[A_i] = 1/6$ for $i \in \{1, \dots, 6\}$, where $A_i$ is the event that the die lands on the value $i$.
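As a concrete illustration (added here, not part of the original text), the die example can be written out as a small Python sketch, with events represented as subsets of the sample space and probabilities as exact fractions:

```python
from fractions import Fraction

# Uniform distribution on the sample space of a fair die.
omega = {1, 2, 3, 4, 5, 6}
p = {outcome: Fraction(1, 6) for outcome in omega}

def prob(event):
    """Probability of an event, i.e. a subset of the sample space."""
    return sum(p[w] for w in event)

odd = {w for w in omega if w % 2 == 1}     # the event "the die lands on an odd number"
assert prob(omega) == 1                    # P[Omega] = 1
assert prob(set()) == 0                    # P[empty set] = 0
assert prob(odd) == Fraction(1, 2)
assert prob(odd) + prob(omega - odd) == 1  # additivity over disjoint events
```

The additivity check in the last line is exactly the third property above, applied to the disjoint events "odd" and "even".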
C.2 Random variables
Definition C.1 (Random variable) A random variable $X$ is a measurable function $X: \Omega \rightarrow \mathbb{R}$, that is, for any interval $I$, the subset of the sample space $\{\omega \in \Omega : X(\omega) \in I\}$ is an event. The probability mass function of a discrete random variable $X$ is the function $x \mapsto \mathbb{P}[X = x]$. The joint probability mass function of discrete random variables $X$ and $Y$ is the function $(x, y) \mapsto \mathbb{P}[X = x \land Y = y]$.

A probability distribution is said to be absolutely continuous when it admits a probability density function, that is, a function $f$ associated with a real-valued random variable $X$ satisfying, for all $a, b \in \mathbb{R}$,
$$\mathbb{P}[a \le X \le b] = \int_a^b f(x)\,dx.$$
Figure C.1 The binomial distribution (red) approximated by a normal distribution (blue).

Definition C.2 (Binomial distribution) A random variable $X$ is said to follow a binomial distribution $B(n, p)$, with $n \in \mathbb{N}$ and $p \in [0, 1]$, if for any $k \in \{0, \dots, n\}$,
$$\mathbb{P}[X = k] = \binom{n}{k} p^k (1 - p)^{n - k}.$$
Definition C.3 (Normal distribution) A random variable $X$ is said to follow a normal (or Gaussian) distribution $N(\mu, \sigma^2)$, with $\mu \in \mathbb{R}$ and $\sigma > 0$, if its probability density function is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\Big(-\frac{(x - \mu)^2}{2\sigma^2}\Big).$$
The standard normal distribution $N(0, 1)$ is the normal distribution with zero mean and unit variance. The normal distribution is often used to approximate the binomial distribution; figure C.1 illustrates this approximation.

Definition C.4 (Laplace distribution) A random variable $X$ is said to follow a Laplace distribution with location parameter $\mu \in \mathbb{R}$ and scale parameter $b > 0$ if its probability density function is
$$f(x) = \frac{1}{2b} \exp\Big(-\frac{|x - \mu|}{b}\Big).$$
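The quality of the normal approximation illustrated in figure C.1 can be checked numerically. The following sketch (added here, not from the source; the parameters $n = 100$, $p = 1/2$ are an arbitrary choice) compares the binomial probability mass function with the density of $N(np, np(1-p))$:

```python
import math

def binom_pmf(n, p, k):
    """Binomial probability mass function P[X = k] for X ~ B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

n, p = 100, 0.5
mu, sigma = n * p, math.sqrt(n * p * (1 - p))
for k in (40, 50, 60):
    # The pmf at k is close to the N(np, np(1-p)) density evaluated at k.
    assert abs(binom_pmf(n, p, k) - normal_pdf(k, mu, sigma)) < 1e-3
```

The agreement improves as $n$ grows, which is a special case of the central limit theorem stated later in this appendix.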
Definition C.5 (Gibbs distribution) Given a finite set $X$ and a feature function $\Phi: X \rightarrow \mathbb{R}^N$, a random variable taking values in $X$ is said to follow a Gibbs distribution with parameter $\omega \in \mathbb{R}^N$ if for any $x \in X$,
$$\mathbb{P}[X = x] = \frac{\exp(\omega \cdot \Phi(x))}{Z}.$$
The normalization $Z = \sum_{x \in X} \exp(\omega \cdot \Phi(x))$ appearing in the denominator is known as the partition function.

Definition C.6 (Poisson distribution) A random variable $X$ is said to follow a Poisson distribution with $\lambda > 0$ if for any $k \in \mathbb{N}$,
$$\mathbb{P}[X = k] = \frac{e^{-\lambda} \lambda^k}{k!}.$$
The following family of distributions is defined using the notion of independence of random variables introduced in the next section.

Definition C.7 ($\chi^2$-distribution) The $\chi^2$-distribution (or chi-square distribution) with $k$ degrees of freedom is the distribution of the sum of the squares of $k$ independent random variables, each following a standard normal distribution.
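As a quick sanity check (an illustration added here, not from the source), one can simulate the $\chi^2$-distribution directly from its definition as a sum of squared standard normals and verify its known mean $k$ and variance $2k$:

```python
import random
import statistics

random.seed(0)
k, trials = 5, 20000
# A chi-square variable with k degrees of freedom: sum of squares of k standard normals.
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(k)) for _ in range(trials)]
mean = statistics.fmean(samples)
var = statistics.variance(samples)
assert abs(mean - k) < 0.2     # E[X] = k
assert abs(var - 2 * k) < 1.0  # Var[X] = 2k
```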
C.3 Conditional probability and independence
Definition C.8 (Conditional probability) The conditional probability of event $A$ given event $B$ is defined, when $\mathbb{P}[B] \neq 0$, as
$$\mathbb{P}[A \mid B] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]}.$$
Definition C.9 (Independence) Two events $A$ and $B$ are said to be independent if
$$\mathbb{P}[A \cap B] = \mathbb{P}[A]\,\mathbb{P}[B].$$
Equivalently, when $\mathbb{P}[B] \neq 0$, $A$ and $B$ are independent if $\mathbb{P}[A \mid B] = \mathbb{P}[A]$. A sequence of random variables is said to be independent and identically distributed (i.i.d.) when the random variables are independent and follow the same distribution.

The following are basic formulas related to the notion of conditional probability. They hold for any events $A$, $B$, and $A_1, \dots, A_n$, with the Bayes formula requiring the additional constraint $\mathbb{P}[B] \neq 0$:

- Sum rule: $\mathbb{P}[A \cup B] = \mathbb{P}[A] + \mathbb{P}[B] - \mathbb{P}[A \cap B]$;
- Union bound: $\mathbb{P}[A \cup B] \le \mathbb{P}[A] + \mathbb{P}[B]$;
- Bayes formula: $\mathbb{P}[A \mid B] = \dfrac{\mathbb{P}[B \mid A]\,\mathbb{P}[A]}{\mathbb{P}[B]}$;
- Chain rule: $\mathbb{P}\big[\bigcap_{i=1}^{n} A_i\big] = \mathbb{P}[A_1]\,\mathbb{P}[A_2 \mid A_1] \cdots \mathbb{P}\big[A_n \mid \bigcap_{i=1}^{n-1} A_i\big]$.
The sum rule follows from the decomposition of the union $A \cup B$ into the disjoint union of $A$ and $B - (A \cap B)$. The union bound is a direct consequence of the sum rule. The Bayes formula follows from the definition of conditional probability and the observation that
$$\mathbb{P}[A \mid B]\,\mathbb{P}[B] = \mathbb{P}[A \cap B] = \mathbb{P}[B \mid A]\,\mathbb{P}[A].$$
Similarly, the chain rule follows from the observation that $\mathbb{P}[A_1]\,\mathbb{P}[A_2 \mid A_1] = \mathbb{P}[A_1 \cap A_2]$; using the same argument recursively shows that the product of the first $k$ terms on the right-hand side equals $\mathbb{P}\big[\bigcap_{i=1}^{k} A_i\big]$.

Finally, assume that $\Omega = A_1 \cup A_2 \cup \dots \cup A_n$ with $A_i \cap A_j = \emptyset$ for $i \neq j$, that is, the $A_i$s are mutually disjoint. Then the following formula, known as the theorem of total probability, is valid for any event $B$:
$$\mathbb{P}[B] = \sum_{i=1}^{n} \mathbb{P}[B \mid A_i]\,\mathbb{P}[A_i],$$
since by the definition of conditional probability $\mathbb{P}[B \mid A_i]\,\mathbb{P}[A_i] = \mathbb{P}[B \cap A_i]$, and the events $B \cap A_i$ are mutually disjoint.
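The theorem of total probability and the Bayes formula can be illustrated with a small exact computation. The two-coin setup below is a hypothetical example added for illustration, not from the source:

```python
from fractions import Fraction

F = Fraction
# Two-stage experiment: pick coin A (P[heads] = 1/2) or coin B (P[heads] = 3/4),
# each with probability 1/2, then flip the chosen coin once.
p_A, p_B = F(1, 2), F(1, 2)
p_heads_given_A, p_heads_given_B = F(1, 2), F(3, 4)

# Theorem of total probability: P[H] = P[H|A]P[A] + P[H|B]P[B].
p_heads = p_heads_given_A * p_A + p_heads_given_B * p_B
assert p_heads == F(5, 8)

# Bayes formula: P[B|H] = P[H|B]P[B] / P[H].
p_B_given_heads = p_heads_given_B * p_B / p_heads
assert p_B_given_heads == F(3, 5)
```

Here the events "coin A was picked" and "coin B was picked" form the disjoint partition of $\Omega$ required by the theorem of total probability.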
C.4 Expectation and Markov's inequality
Definition C.10 (Expectation) The expectation or mean of a random variable $X$ is denoted by $\mathbb{E}[X]$ and defined, in the discrete case, by
$$\mathbb{E}[X] = \sum_{x} x\,\mathbb{P}[X = x].$$
When $X$ follows a probability distribution $\mathcal{D}$, we also write $\mathbb{E}_{X \sim \mathcal{D}}[X]$ instead of $\mathbb{E}[X]$ to make the distribution explicit. A fundamental property of expectation, which can be proven directly from its definition, is linearity: for any two random variables $X$ and $Y$ and any $a, b \in \mathbb{R}$,
$$\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y].$$
Furthermore, when $X$ and $Y$ are independent random variables, the following identity holds:
$$\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y].$$
Indeed, by the definition of expectation and independence, we can write
$$\mathbb{E}[XY] = \sum_{x, y} x y\,\mathbb{P}[X = x \land Y = y] = \sum_{x, y} x y\,\mathbb{P}[X = x]\,\mathbb{P}[Y = y] = \Big(\sum_{x} x\,\mathbb{P}[X = x]\Big)\Big(\sum_{y} y\,\mathbb{P}[Y = y]\Big),$$
where the last step uses Fubini's theorem. A simple bound for the expectation of a non-negative random variable, known as Markov's inequality, is given below.

Theorem C.11 (Markov's inequality) Let $X$ be a non-negative random variable with $\mathbb{E}[X] < \infty$. Then for all $t > 0$,
$$\mathbb{P}[X \ge t\,\mathbb{E}[X]] \le \frac{1}{t}.$$
Proof: The result follows from the chain of inequalities
$$\mathbb{P}[X \ge t\,\mathbb{E}[X]] = \sum_{x \ge t\,\mathbb{E}[X]} \mathbb{P}[X = x] \le \sum_{x \ge t\,\mathbb{E}[X]} \frac{x}{t\,\mathbb{E}[X]}\,\mathbb{P}[X = x] \le \sum_{x} \frac{x}{t\,\mathbb{E}[X]}\,\mathbb{P}[X = x] = \frac{1}{t},$$
which completes the proof.
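Markov's inequality can be verified exactly on a small discrete distribution; the particular pmf below is an arbitrary choice for illustration, not from the source:

```python
from fractions import Fraction

# A non-negative random variable given by a pmf {value: probability}.
pmf = {0: Fraction(1, 2), 1: Fraction(1, 4), 4: Fraction(1, 4)}
mean = sum(x * p for x, p in pmf.items())  # E[X] = 5/4

tails = {}
for t in (2, 3, 5):
    # Exact tail probability P[X >= t E[X]].
    tails[t] = sum(p for x, p in pmf.items() if x >= t * mean)
    assert tails[t] <= Fraction(1, t)  # Markov's inequality
```

Note that the bound can be loose: for $t = 5$ the tail is 0 while the bound is $1/5$.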
C.5 Variance and Chebyshev's inequality
Definition C.12 (Variance, standard deviation) The variance of a random variable $X$ is denoted by $\text{Var}[X]$ and defined by
$$\text{Var}[X] = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big].$$
The standard deviation of $X$ is denoted by $\sigma_X$ and defined by $\sigma_X = \sqrt{\text{Var}[X]}$.

For any random variable $X$ and any $a \in \mathbb{R}$, the following basic properties of the variance can be proven directly:
$$\text{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2,$$
$$\text{Var}[aX] = a^2\,\text{Var}[X],$$
and, when $X$ and $Y$ are independent,
$$\text{Var}[X + Y] = \text{Var}[X] + \text{Var}[Y].$$
The last property holds because, for independent $X$ and $Y$, we can write
$$\mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big] = \mathbb{E}[X - \mathbb{E}[X]]\,\mathbb{E}[Y - \mathbb{E}[Y]] = 0.$$
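The identities $\text{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$ and $\text{Var}[aX] = a^2\,\text{Var}[X]$ can be checked exactly on a small discrete distribution (an added illustration with an arbitrary pmf, not from the source):

```python
from fractions import Fraction

F = Fraction
pmf = {-1: F(1, 4), 0: F(1, 2), 2: F(1, 4)}

def expect(g):
    """Expectation of g(X) for the discrete variable X with the pmf above."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expect(lambda x: x)
var = expect(lambda x: (x - mean) ** 2)

# Var[X] = E[X^2] - E[X]^2
assert var == expect(lambda x: x * x) - mean ** 2
# Var[aX] = a^2 Var[X], checked for a = 3
a = 3
assert expect(lambda x: (a * x - a * mean) ** 2) == a ** 2 * var
```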
The following inequality, known as Chebyshev's inequality, bounds the deviation of a random variable from its expectation in terms of its standard deviation.

Theorem C.13 (Chebyshev's inequality) Let $X$ be a random variable with $\text{Var}[X] < +\infty$. Then for all $t > 0$, the following inequality holds:
$$\mathbb{P}\big[|X - \mathbb{E}[X]| \ge t\,\sigma_X\big] \le \frac{1}{t^2}.$$
Proof: Observe that
$$\mathbb{P}\big[|X - \mathbb{E}[X]| \ge t\,\sigma_X\big] = \mathbb{P}\big[(X - \mathbb{E}[X])^2 \ge t^2\sigma_X^2\big],$$
and the result follows by applying Markov's inequality to the non-negative random variable $(X - \mathbb{E}[X])^2$.

We will use Chebyshev's inequality to prove the following theorem.

Theorem C.14 (Weak law of large numbers) Let $(X_n)_{n \in \mathbb{N}}$ be a sequence of independent random variables with the same mean $\mu$ and variance $\sigma^2 < \infty$, and let $\overline{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then for any $\varepsilon > 0$,
$$\lim_{n \to \infty} \mathbb{P}\big[|\overline{X}_n - \mu| \ge \varepsilon\big] = 0. \tag{C.19}$$
Proof: Since the variables are independent, we can write
$$\text{Var}[\overline{X}_n] = \sum_{i=1}^{n} \text{Var}\Big[\frac{X_i}{n}\Big] = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.$$
Therefore, by Chebyshev's inequality with $t = \varepsilon / (\text{Var}[\overline{X}_n])^{1/2}$, we obtain
$$\mathbb{P}\big[|\overline{X}_n - \mu| \ge \varepsilon\big] \le \frac{\sigma^2}{n\varepsilon^2},$$
which implies (C.19).
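The weak law of large numbers and the bound $\sigma^2/(n\varepsilon^2)$ can be observed empirically. The following simulation (an added sketch, using Bernoulli draws with mean $1/2$ and variance $1/4$; the parameters are arbitrary choices) estimates the deviation probability for growing $n$:

```python
import random
import statistics

random.seed(1)
mu, sigma = 0.5, 0.5  # Bernoulli(1/2): mean 1/2, standard deviation 1/2
eps = 0.05

def deviation_freq(n, trials=2000):
    """Fraction of trials in which |mean of n draws - mu| >= eps."""
    count = 0
    for _ in range(trials):
        m = statistics.fmean(random.random() < 0.5 for _ in range(n))
        count += abs(m - mu) >= eps
    return count / trials

freq_400 = deviation_freq(400)
freq_100 = deviation_freq(100)
freq_1600 = deviation_freq(1600)

# Chebyshev's bound holds, and the deviation frequency shrinks as n grows.
assert freq_400 <= sigma**2 / (400 * eps**2)
assert freq_1600 < freq_100
```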
Example C.15 (Applying Chebyshev's inequality) Suppose we roll a pair of fair dice $n$ times. Can we give a good estimate of the total value of the $n$ rolls? If we compute the mean and the variance, we find $\mu = 7n$ and $\sigma^2 = \frac{35}{6}n$ (we leave it to the reader to verify these expressions). Thus, applying Chebyshev's inequality, we see that the final sum lies within $7n \pm 10\sqrt{\frac{35}{6}n}$ in at least 99% of all experiments. So, after one million rolls, the total lies between 6.975 million and 7.025 million with odds better than 99:1.

Definition C.16 (Covariance) The covariance of two random variables $X$ and $Y$ is denoted by $\text{Cov}(X, Y)$ and defined by
$$\text{Cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big].$$
Two random variables $X$ and $Y$ are said to be uncorrelated when $\text{Cov}(X, Y) = 0$. It is straightforward to see that if $X$ and $Y$ are independent, then they are uncorrelated; the converse, however, does not hold in general. The covariance defines a positive semidefinite and symmetric bilinear form:
- Symmetry: $\text{Cov}(X, Y) = \text{Cov}(Y, X)$ for any two random variables $X$ and $Y$;
- Bilinearity: $\text{Cov}(X + X', Y) = \text{Cov}(X, Y) + \text{Cov}(X', Y)$ and $\text{Cov}(aX, Y) = a\,\text{Cov}(X, Y)$ for any random variables $X, X', Y$ and any $a \in \mathbb{R}$;
- Positive semidefiniteness: $\text{Cov}(X, X) = \text{Var}[X] \ge 0$ for any random variable $X$.

The Cauchy-Schwarz inequality holds for random variables $X$ and $Y$ with $\text{Var}[X] < +\infty$ and $\text{Var}[Y] < +\infty$:
$$|\text{Cov}(X, Y)| \le \sqrt{\text{Var}[X]\,\text{Var}[Y]}.$$

Definition C.17 (Covariance matrix) The covariance matrix of a vector of random variables $X = (X_1, \dots, X_N)$ is the matrix in $\mathbb{R}^{N \times N}$ denoted by $C(X)$ and defined by
$$C(X) = \mathbb{E}\big[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^T\big].$$
Thus, $C(X) = (\text{Cov}(X_i, X_j))_{ij}$. It is straightforward to show that
$$C(X) = \mathbb{E}[XX^T] - \mathbb{E}[X]\,\mathbb{E}[X]^T.$$

We conclude this section with the following well-known theorem of probability.

Theorem C.18 (Central limit theorem) Let $X_1, \dots, X_n$ be an i.i.d. sequence of random variables with mean $\mu$ and standard deviation $\sigma$. Let $\overline{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\overline{\sigma}_n^2 = \sigma^2 / n$. Then $(\overline{X}_n - \mu)/\overline{\sigma}_n$ converges in distribution to $N(0, 1)$, that is, for any $t \in \mathbb{R}$,
$$\lim_{n \to \infty} \mathbb{P}\Big[\frac{\overline{X}_n - \mu}{\overline{\sigma}_n} \le t\Big] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\,dx.$$
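The central limit theorem can also be observed empirically. The following simulation (an added sketch, using uniform draws on $[0, 1]$, which have mean $1/2$ and variance $1/12$; the sample sizes are arbitrary choices) standardizes many independent sample means and checks that roughly 95% fall within $\pm 1.96$, as they would for $N(0, 1)$:

```python
import math
import random
import statistics

random.seed(2)
n, trials = 200, 4000
mu = 0.5
sigma = math.sqrt(1 / 12)  # uniform(0, 1): mean 1/2, variance 1/12
sigma_bar = sigma / math.sqrt(n)

# Standardized sample means (X_bar_n - mu) / sigma_bar over many repetitions.
z = [(statistics.fmean(random.random() for _ in range(n)) - mu) / sigma_bar
     for _ in range(trials)]

# They behave like draws from N(0, 1): about 95% fall in [-1.96, 1.96],
# and their overall mean is close to 0.
inside = sum(abs(v) <= 1.96 for v in z) / trials
assert abs(inside - 0.95) < 0.02
assert abs(statistics.fmean(z)) < 0.1
```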
C.6 Moment-generating functions
The quantity $\mathbb{E}[X^p]$, $p \ge 1$, is called the $p$-th moment of the random variable $X$. The moment-generating function of a random variable $X$ is a key function from which its different moments can be computed directly by differentiation at zero. It is therefore important for specifying the distribution of $X$ or for analyzing its properties.

Definition C.19 (Moment-generating function) The moment-generating function of a random variable $X$ is the function $M_X: t \mapsto \mathbb{E}[e^{tX}]$ defined over the set of $t \in \mathbb{R}$ for which the expectation is finite.

If $M_X$ is differentiable at zero, then the $p$-th moment of $X$ is given by $\mathbb{E}[X^p] = M_X^{(p)}(0)$. In the next appendix, we give a general bound (lemma D.1) on the moment-generating function of a zero-mean bounded random variable. Here, we illustrate its computation in two special cases.

Example C.20 (Standard normal distribution) Let $X$ be a random variable following the standard normal distribution with mean 0 and variance 1. Then $M_X$ is defined for all $t \in \mathbb{R}$ by
$$M_X(t) = \mathbb{E}[e^{tX}] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{tx} e^{-x^2/2}\,dx = e^{t^2/2}\,\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-(x - t)^2/2}\,dx = e^{t^2/2},$$
by recognizing that the last integrand is the probability density function of a normal distribution with mean $t$ and variance 1, which integrates to one.
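The closed form $M_X(t) = e^{t^2/2}$ can be compared against a Monte Carlo estimate of $\mathbb{E}[e^{tX}]$. This sketch is an added illustration, not from the source; the choice $t = 0.7$ is arbitrary:

```python
import math
import random
import statistics

random.seed(4)
t = 0.7
# Monte Carlo estimate of E[e^{tX}] for X ~ N(0, 1); the closed form is e^{t^2/2}.
trials = 200000
est = statistics.fmean(math.exp(t * random.gauss(0, 1)) for _ in range(trials))
assert abs(est - math.exp(t * t / 2)) < 0.02
```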
Example C.21 ($\chi^2$-distribution) Let $X$ be a random variable following the $\chi^2$-distribution with $k$ degrees of freedom. We can write $X = \sum_{i=1}^{k} X_i^2$, where the $X_i$ are independent and follow the standard normal distribution. Let $t < 1/2$. By the i.i.d. assumption on the variables $X_i$, we can write
$$M_X(t) = \mathbb{E}\Big[e^{t\sum_{i=1}^{k} X_i^2}\Big] = \prod_{i=1}^{k} \mathbb{E}\big[e^{tX_i^2}\big] = \Big(\mathbb{E}\big[e^{tX_1^2}\big]\Big)^k.$$
By the definition of the standard normal density,
$$\mathbb{E}\big[e^{tX_1^2}\big] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{tx^2} e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-(1 - 2t)x^2/2}\,dx = \frac{1}{\sqrt{1 - 2t}},$$
using the change of variable $u = \sqrt{1 - 2t}\,x$. It follows that the moment-generating function of the $\chi^2$-distribution with $k$ degrees of freedom is given by
$$\forall t < 1/2, \quad M_X(t) = \mathbb{E}[e^{tX}] = (1 - 2t)^{-\frac{k}{2}}.$$
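As with the standard normal case, the closed form $(1 - 2t)^{-k/2}$ can be compared against a Monte Carlo estimate built directly from the definition $X = \sum_{i=1}^{k} X_i^2$. This is an added sketch; the choices $k = 3$ and $t = 0.1$ are arbitrary:

```python
import math
import random
import statistics

random.seed(3)
k, t = 3, 0.1  # degrees of freedom, and t < 1/2
mgf_closed = (1 - 2 * t) ** (-k / 2)

# Monte Carlo estimate of E[e^{tX}] for X chi-square with k degrees of freedom.
trials = 200000
est = statistics.fmean(
    math.exp(t * sum(random.gauss(0, 1) ** 2 for _ in range(k)))
    for _ in range(trials))
assert abs(est - mgf_closed) < 0.02
```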
C.7 Exercises
C.1 Let $f: (0, +\infty) \rightarrow \mathbb{R}_+$ be a function admitting an inverse $f^{-1}$, and let $X$ be a random variable. Show that if $\mathbb{P}[X > t] \le f(t)$ for all $t > 0$, then for any $\delta > 0$, with probability at least $1 - \delta$, $X \le f^{-1}(\delta)$.

C.2 Let $X$ be a discrete random variable taking non-negative integer values. Show that $\mathbb{E}[X] = \sum_{n \ge 1} \mathbb{P}[X \ge n]$ (hint: use $\mathbb{P}[X = n] = \mathbb{P}[X \ge n] - \mathbb{P}[X \ge n + 1]$).