Lecture Notes 1
Brief Review of Basic Probability

1 Probability Review

I assume you know basic probability; Chapters 1-3 are a review, and I will assume you have read and understood them. If not, you should be in 36-700.

1.1 Random Variables

A random variable is a map $X$ from a set $\Omega$ (equipped with a probability $P$) to $\mathbb{R}$. We write
\[
P(X \in A) = P(\{\omega \in \Omega : X(\omega) \in A\})
\]
and we write $X \sim P$ to mean that $X$ has distribution $P$. The cumulative distribution function (cdf) of $X$ is
\[
F_X(x) = F(x) = P(X \le x).
\]
If $X$ is discrete, its probability mass function (pmf) is $p_X(x) = p(x) = P(X = x)$. If $X$ is continuous, then its probability density function (pdf) satisfies
\[
P(X \in A) = \int_A p_X(x)\,dx = \int_A p(x)\,dx
\]
and $p_X(x) = p(x) = F'(x)$. The following are all equivalent: $X \sim P$, $X \sim F$, $X \sim p$.

Suppose that $X \sim P$ and $Y \sim Q$. We say that $X$ and $Y$ have the same distribution if $P(X \in A) = Q(Y \in A)$ for all $A$. In that case we say that $X$ and $Y$ are equal in distribution and we write $X \stackrel{d}{=} Y$.

Lemma 1. $X \stackrel{d}{=} Y$ if and only if $F_X(t) = F_Y(t)$ for all $t$.

1.2 Expected Values

The mean or expected value of $g(X)$ is
\[
E(g(X)) = \int g(x)\,dF(x) = \int g(x)\,dP(x) =
\begin{cases}
\int_{-\infty}^{\infty} g(x)\,p(x)\,dx & \text{if } X \text{ is continuous} \\
\sum_j g(x_j)\,p(x_j) & \text{if } X \text{ is discrete.}
\end{cases}
\]
Recall that:

1. $E\bigl(\sum_{j=1}^k c_j g_j(X)\bigr) = \sum_{j=1}^k c_j E(g_j(X))$.

2. If $X_1, \ldots, X_n$ are independent then
\[
E\Bigl(\prod_{i=1}^n X_i\Bigr) = \prod_i E(X_i).
\]

3. We often write $\mu = E(X)$.

4. $\sigma^2 = \mathrm{Var}(X) = E((X - \mu)^2)$ is the variance.

5. $\mathrm{Var}(X) = E(X^2) - \mu^2$.

6. If $X_1, \ldots, X_n$ are independent then
\[
\mathrm{Var}\Bigl(\sum_{i=1}^n a_i X_i\Bigr) = \sum_i a_i^2\, \mathrm{Var}(X_i).
\]

7. The covariance is
\[
\mathrm{Cov}(X, Y) = E((X - \mu_X)(Y - \mu_Y)) = E(XY) - \mu_X \mu_Y
\]
and the correlation is $\rho(X, Y) = \mathrm{Cov}(X, Y)/(\sigma_X \sigma_Y)$. Recall that $-1 \le \rho(X, Y) \le 1$.

The conditional expectation of $Y$ given $X$ is the random variable $E(Y|X)$ whose value, when $X = x$, is
\[
E(Y|X = x) = \int y\, p(y|x)\,dy
\]
where $p(y|x) = p(x, y)/p(x)$.

The Law of Total Expectation, or Law of Iterated Expectation, is
\[
E(Y) = E\bigl(E(Y|X)\bigr) = \int E(Y|X = x)\, p_X(x)\,dx.
\]
The Law of Total Variance is
\[
\mathrm{Var}(Y) = \mathrm{Var}\bigl(E(Y|X)\bigr) + E\bigl(\mathrm{Var}(Y|X)\bigr).
\]
The moment generating function (mgf) is $M_X(t) = E(e^{tX})$. If $M_X(t) = M_Y(t)$ for all $t$ in an interval around $0$, then $X \stackrel{d}{=} Y$.

Exercise: show that $M_X^{(n)}(t)\big|_{t=0} = E(X^n)$.

1.3 Transformations

Let $Y = g(X)$. Then
\[
F_Y(y) = P(Y \le y) = P(g(X) \le y) = \int_{A_y} p_X(x)\,dx
\]
where $A_y = \{x : g(x) \le y\}$. Then $p_Y(y) = F_Y'(y)$. If $g$ is monotonic, then
\[
p_Y(y) = p_X(h(y)) \left| \frac{dh(y)}{dy} \right|
\]
where $h = g^{-1}$.

Example 2. Let $p_X(x) = e^{-x}$ for $x > 0$. Hence $F_X(x) = 1 - e^{-x}$. Let $Y = g(X) = \log X$. Then
\[
F_Y(y) = P(Y \le y) = P(\log X \le y) = P(X \le e^y) = F_X(e^y) = 1 - e^{-e^y}
\]
and $p_Y(y) = e^y e^{-e^y}$ for $y \in \mathbb{R}$. (A simulation check of this result appears at the end of Section 1.4.)

Example 3. Practice problem. Let $X$ be uniform on $(-1, 2)$ and let $Y = X^2$. Find the density of $Y$.

Let $Z = g(X, Y)$. For example, $Z = X + Y$ or $Z = X/Y$. Then we find the pdf of $Z$ as follows:

1. For each $z$, find the set $A_z = \{(x, y) : g(x, y) \le z\}$.

2. Find the cdf
\[
F_Z(z) = P(Z \le z) = P(g(X, Y) \le z) = P(\{(x, y) : g(x, y) \le z\}) = \int\!\!\int_{A_z} p_{X,Y}(x, y)\,dx\,dy.
\]

3. The pdf is $p_Z(z) = F_Z'(z)$.

Example 4. Practice problem. Let $(X, Y)$ be uniform on the unit square. Let $Z = X/Y$. Find the density of $Z$.

1.4 Independence

$X$ and $Y$ are independent if and only if
\[
P(X \in A, Y \in B) = P(X \in A)\,P(Y \in B)
\]
for all $A$ and $B$.

Theorem 5. Let $(X, Y)$ be a bivariate random vector with pdf $p_{X,Y}(x, y)$. Then $X$ and $Y$ are independent iff $p_{X,Y}(x, y) = p_X(x)\,p_Y(y)$.

$X_1, \ldots, X_n$ are independent if and only if
\[
P(X_1 \in A_1, \ldots, X_n \in A_n) = \prod_{i=1}^n P(X_i \in A_i).
\]
Thus, $p_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \prod_{i=1}^n p_{X_i}(x_i)$.

If $X_1, \ldots, X_n$ are independent and identically distributed we say they are iid (or that they are a random sample) and we write
\[
X_1, \ldots, X_n \sim P \quad \text{or} \quad X_1, \ldots, X_n \sim F \quad \text{or} \quad X_1, \ldots, X_n \sim p.
\]
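Since iid samples are easy to simulate, a quick numerical check is often useful. The following is a minimal sketch, assuming NumPy is available; the sample size and the evaluation points $t$ are arbitrary illustrative choices. It draws an iid sample from the Exp(1) density of Example 2, applies $Y = \log X$, and compares the empirical cdf of $Y$ with the derived cdf $F_Y(y) = 1 - e^{-e^y}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw an iid sample X_1, ..., X_n ~ Exp(1), so p_X(x) = e^{-x} for x > 0.
n = 100_000
x = rng.exponential(scale=1.0, size=n)
y = np.log(x)  # the transformation Y = g(X) = log X from Example 2

# Compare the empirical cdf of Y with the derived cdf F_Y(t) = 1 - exp(-e^t)
# at a few arbitrary points t.
for t in (-2.0, -1.0, 0.0, 1.0):
    empirical = np.mean(y <= t)
    exact = 1.0 - np.exp(-np.exp(t))
    print(f"t = {t:+.1f}:  empirical {empirical:.4f}   exact {exact:.4f}")
```

The empirical and exact values should agree to a couple of decimal places for a sample of this size.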
1.5 Important Distributions

Normal (Gaussian). $X \sim N(\mu, \sigma^2)$ if
\[
p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}.
\]
If $X \in \mathbb{R}^d$ then $X \sim N(\mu, \Sigma)$ if
\[
p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right).
\]

Chi-squared. $X \sim \chi^2_p$ if $X = \sum_{j=1}^p Z_j^2$ where $Z_1, \ldots, Z_p \sim N(0, 1)$.

Bernoulli. $X \sim \mathrm{Bernoulli}(\theta)$ if $P(X = 1) = \theta$ and $P(X = 0) = 1 - \theta$, and hence
\[
p(x) = \theta^x (1 - \theta)^{1-x}, \quad x = 0, 1.
\]

Binomial. $X \sim \mathrm{Binomial}(n, \theta)$ if
\[
p(x) = P(X = x) = \binom{n}{x} \theta^x (1 - \theta)^{n-x}, \quad x \in \{0, \ldots, n\}.
\]

Uniform. $X \sim \mathrm{Uniform}(0, \theta)$ if $p(x) = I(0 \le x \le \theta)/\theta$.

Poisson. $X \sim \mathrm{Poisson}(\lambda)$ if
\[
P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \ldots.
\]
Then $E(X) = \mathrm{Var}(X) = \lambda$ and $M_X(t) = e^{\lambda(e^t - 1)}$. We can use the mgf to show: if $X_1 \sim \mathrm{Poisson}(\lambda_1)$ and $X_2 \sim \mathrm{Poisson}(\lambda_2)$ are independent, then $Y = X_1 + X_2 \sim \mathrm{Poisson}(\lambda_1 + \lambda_2)$.

Multinomial. The multivariate version of a Binomial is called a Multinomial. Consider drawing a ball from an urn which has balls of $k$ different colors labeled "color 1, color 2, \ldots, color $k$." Let $p = (p_1, p_2, \ldots, p_k)$ where $\sum_j p_j = 1$ and $p_j$ is the probability of drawing color $j$. Draw $n$ balls from the urn (independently and with replacement) and let $X = (X_1, X_2, \ldots, X_k)$ be the count of the number of balls of each color drawn. We say that $X$ has a $\mathrm{Multinomial}(n, p)$ distribution. The pdf is
\[
p(x) = \binom{n}{x_1, \ldots, x_k} p_1^{x_1} \cdots p_k^{x_k}.
\]

Exponential. $X \sim \exp(\beta)$ if $p_X(x) = \frac{1}{\beta} e^{-x/\beta}$, $x > 0$. Note that $\exp(\beta) = \Gamma(1, \beta)$.

Gamma. $X \sim \Gamma(\alpha, \beta)$ if
\[
p_X(x) = \frac{1}{\Gamma(\alpha)\beta^\alpha}\, x^{\alpha-1} e^{-x/\beta}
\]
for $x > 0$, where $\Gamma(\alpha) = \int_0^\infty \frac{1}{\beta^\alpha} x^{\alpha-1} e^{-x/\beta}\,dx$.

More on the Multivariate Normal. Let $Y \in \mathbb{R}^d$. Then $Y \sim N(\mu, \Sigma)$ if
\[
p(y) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2}(y - \mu)^T \Sigma^{-1} (y - \mu) \right).
\]
Then $E(Y) = \mu$ and $\mathrm{cov}(Y) = \Sigma$. The moment generating function is
\[
M(t) = \exp\left( \mu^T t + \frac{t^T \Sigma t}{2} \right).
\]

Theorem 6.
(a) If $Y \sim N(\mu, \Sigma)$, then $E(Y) = \mu$ and $\mathrm{cov}(Y) = \Sigma$.
(b) If $Y \sim N(\mu, \Sigma)$ and $c$ is a scalar, then $cY \sim N(c\mu, c^2\Sigma)$.
(c) Let $Y \sim N(\mu, \Sigma)$. If $A$ is $p \times n$ and $b$ is $p \times 1$, then $AY + b \sim N(A\mu + b, A\Sigma A^T)$.

Theorem 7. Suppose that $Y \sim N(\mu, \Sigma)$. Let
\[
Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}, \quad
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},
\]
where $Y_1$ and $\mu_1$ are $p \times 1$, and $\Sigma_{11}$ is $p \times p$.
(a) $Y_1 \sim N_p(\mu_1, \Sigma_{11})$ and $Y_2 \sim N_{n-p}(\mu_2, \Sigma_{22})$.
(b) $Y_1$ and $Y_2$ are independent if and only if $\Sigma_{12} = 0$.
(c) If $\Sigma_{22} > 0$, then the conditional distribution of $Y_1$ given $Y_2$ is
\[
Y_1 \mid Y_2 \sim N_p\bigl(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(Y_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\bigr). \tag{1}
\]

In what follows, $\chi^2_n(\delta)$ denotes the noncentral chi-squared distribution with $n$ degrees of freedom and noncentrality parameter $\delta$.

Lemma 8. Let $Y \sim N(\mu, \sigma^2 I)$, where $Y^T = (Y_1, \ldots, Y_n)$, $\mu^T = (\mu_1, \ldots, \mu_n)$ and $\sigma^2 > 0$ is a scalar. Then the $Y_i$ are independent, $Y_i \sim N_1(\mu_i, \sigma^2)$ and
\[
\frac{\|Y\|^2}{\sigma^2} = \frac{Y^T Y}{\sigma^2} \sim \chi^2_n\!\left( \frac{\mu^T \mu}{\sigma^2} \right).
\]

Theorem 9. Let $Y \sim N(\mu, \Sigma)$. Then:
(a) $Y^T \Sigma^{-1} Y \sim \chi^2_n(\mu^T \Sigma^{-1} \mu)$.
(b) $(Y - \mu)^T \Sigma^{-1} (Y - \mu) \sim \chi^2_n(0)$.

1.6 Sample Mean and Variance

The sample mean is
\[
\overline{X}_n = \frac{1}{n} \sum_i X_i
\]
and the sample variance is
\[
S_n^2 = \frac{1}{n-1} \sum_i (X_i - \overline{X}_n)^2.
\]
Let $X_1, \ldots, X_n$ be iid with $\mu = E(X_i)$ and $\sigma^2 = \mathrm{Var}(X_i)$. Then
\[
E(\overline{X}_n) = \mu, \quad \mathrm{Var}(\overline{X}_n) = \frac{\sigma^2}{n}, \quad E(S_n^2) = \sigma^2.
\]

Theorem 10. If $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ then
(a) $\overline{X}_n \sim N(\mu, \frac{\sigma^2}{n})$;
(b) $\frac{(n-1)S_n^2}{\sigma^2} \sim \chi^2_{n-1}$;
(c) $\overline{X}_n$ and $S_n^2$ are independent.

1.7 Delta Method

If $X \sim N(\mu, \sigma^2)$, $Y = g(X)$ and $\sigma^2$ is small, then
\[
Y \approx N\bigl(g(\mu), \sigma^2 (g'(\mu))^2\bigr).
\]
To see this, note that
\[
Y = g(X) = g(\mu) + (X - \mu)\, g'(\mu) + \frac{(X - \mu)^2}{2}\, g''(\xi)
\]
for some $\xi$. Now $E((X - \mu)^2) = \sigma^2$, which we are assuming is small, and so
\[
Y = g(X) \approx g(\mu) + (X - \mu)\, g'(\mu).
\]
Thus
\[
E(Y) \approx g(\mu), \quad \mathrm{Var}(Y) \approx (g'(\mu))^2 \sigma^2.
\]
Hence,
\[
g(X) \approx N\bigl(g(\mu), (g'(\mu))^2 \sigma^2\bigr).
\]
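A small simulation illustrates how well this approximation works when $\sigma$ is small. The following is a minimal sketch assuming NumPy is available; the choice $g(x) = e^x$ and the values of $\mu$ and $\sigma$ are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(1)

# X ~ N(mu, sigma^2) with a small sigma, as the delta-method argument assumes.
mu, sigma = 2.0, 0.05
x = rng.normal(loc=mu, scale=sigma, size=200_000)
y = np.exp(x)  # g(x) = e^x, so g'(mu) = e^mu

# Delta-method approximation: Y is approximately N(g(mu), (g'(mu))^2 sigma^2).
print("E(Y):   simulated", y.mean(), " delta-method", np.exp(mu))
print("Var(Y): simulated", y.var(),  " delta-method", (np.exp(mu) * sigma) ** 2)
```

With $\sigma$ this small the simulated mean and variance of $g(X)$ should agree closely with $g(\mu)$ and $(g'(\mu))^2 \sigma^2$; increasing $\sigma$ degrades the approximation, since the dropped second-order term is no longer negligible.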