Lecture Notes 4: Convergence (Chapter 5)

1 Random Samples

Let $X_1, \ldots, X_n \sim F$. A statistic is any function $T_n = g(X_1, \ldots, X_n)$. Recall that the sample mean is
$$
\overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i
$$
and the sample variance is
$$
S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X}_n)^2 .
$$
Let $\mu = E(X_i)$ and $\sigma^2 = \mathrm{Var}(X_i)$. Recall that
$$
E(\overline{X}_n) = \mu, \qquad \mathrm{Var}(\overline{X}_n) = \frac{\sigma^2}{n}, \qquad E(S_n^2) = \sigma^2 .
$$

Theorem 1 If $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ then $\overline{X}_n \sim N(\mu, \sigma^2/n)$.

Proof. We know that $M_{X_i}(s) = e^{\mu s + \sigma^2 s^2/2}$. So,
$$
M_{\overline{X}_n}(t) = E\big(e^{t\overline{X}_n}\big) = E\big(e^{\frac{t}{n}\sum_{i=1}^n X_i}\big) = \big(E e^{tX_i/n}\big)^n = \big(M_{X_i}(t/n)\big)^n
= \big(e^{\mu t/n + \sigma^2 t^2/(2n^2)}\big)^n = \exp\left\{\mu t + \frac{\sigma^2}{2}\,\frac{t^2}{n}\right\},
$$
which is the mgf of a $N(\mu, \sigma^2/n)$. $\Box$

Example 2 Let $X_{(1)}, \ldots, X_{(n)}$ denote the ordered values: $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$. Then $X_{(1)}, \ldots, X_{(n)}$ are called the order statistics, and $T_n = (X_{(1)}, \ldots, X_{(n)})$ is a statistic.

2 Convergence

Let $X_1, X_2, \ldots$ be a sequence of random variables and let $X$ be another random variable. Let $F_n$ denote the cdf of $X_n$ and let $F$ denote the cdf of $X$. We are going to study different types of convergence.

Example: A good example to keep in mind is the following. Let $Y_1, Y_2, \ldots$ be a sequence of iid random variables and let
$$
X_n = \frac{1}{n}\sum_{i=1}^n Y_i
$$
be the average of the first $n$ of the $Y_i$'s. This defines a new sequence $X_1, X_2, \ldots$. In other words, the sequence of interest $X_1, X_2, \ldots$ might be a sequence of statistics based on some other sequence of iid random variables. Note that the original sequence $Y_1, Y_2, \ldots$ is iid but the sequence $X_1, X_2, \ldots$ is not iid.

1. $X_n$ converges almost surely to $X$, written $X_n \xrightarrow{a.s.} X$, if, for every $\epsilon > 0$,
$$
P\Big(\lim_{n\to\infty} |X_n - X| < \epsilon\Big) = 1. \qquad (1)
$$
$X_n$ converges almost surely to a constant $c$, written $X_n \xrightarrow{a.s.} c$, if
$$
P\Big(\lim_{n\to\infty} X_n = c\Big) = 1. \qquad (2)
$$

2. $X_n$ converges to $X$ in probability, written $X_n \xrightarrow{P} X$, if, for every $\epsilon > 0$,
$$
P(|X_n - X| > \epsilon) \to 0 \qquad (3)
$$
as $n \to \infty$. In other words, $X_n - X = o_P(1)$. $X_n$ converges to $c$ in probability, written $X_n \xrightarrow{P} c$, if, for every $\epsilon > 0$,
$$
P(|X_n - c| > \epsilon) \to 0 \qquad (4)
$$
as $n \to \infty$. In other words, $X_n - c = o_P(1)$.

3. $X_n$ converges to $X$ in quadratic mean (also called convergence in $L_2$), written $X_n \xrightarrow{qm} X$, if
$$
E(X_n - X)^2 \to 0 \qquad (5)
$$
as $n \to \infty$. $X_n$ converges to $c$ in quadratic mean, written $X_n \xrightarrow{qm} c$, if
$$
E(X_n - c)^2 \to 0 \qquad (6)
$$
as $n \to \infty$.

4. $X_n$ converges to $X$ in distribution, written $X_n \rightsquigarrow X$, if
$$
\lim_{n\to\infty} F_n(t) = F(t) \qquad (7)
$$
at all $t$ for which $F$ is continuous. $X_n$ converges to $c$ in distribution, written $X_n \rightsquigarrow c$, if
$$
\lim_{n\to\infty} F_n(t) = \delta_c(t) \qquad (8)
$$
at all $t \neq c$, where $\delta_c(t) = 0$ if $t < c$ and $\delta_c(t) = 1$ if $t \ge c$.

Theorem 3 Convergence in probability does not imply almost sure convergence.

Proof. Let $\Omega = [0,1]$ and let $P$ be uniform on $[0,1]$. We draw $S \sim P$. Let $X(s) = s$ and let
$$
X_1(s) = s + I_{[0,1]}(s), \quad X_2(s) = s + I_{[0,1/2]}(s), \quad X_3(s) = s + I_{[1/2,1]}(s),
$$
$$
X_4(s) = s + I_{[0,1/3]}(s), \quad X_5(s) = s + I_{[1/3,2/3]}(s), \quad X_6(s) = s + I_{[2/3,1]}(s),
$$
etc. Then $X_n \xrightarrow{P} X$. But, for each $s$, $X_n(s)$ does not converge to $X(s)$. Hence, $X_n$ does not converge almost surely to $X$. In fact, $P(\{s \in \Omega : \lim_n X_n(s) = X(s)\}) = 0$. $\Box$

Example 4 Let $X_n \sim N(0, 1/n)$. Intuitively, $X_n$ is concentrating at 0 so we would like to say that $X_n$ converges to 0. Let's see if this is true. Let $F$ be the distribution function for a point mass at 0. Note that $\sqrt{n}\,X_n \sim N(0,1)$. Let $Z$ denote a standard normal random variable. For $t < 0$,
$$
F_n(t) = P(X_n \le t) = P(\sqrt{n}\,X_n \le \sqrt{n}\,t) = P(Z \le \sqrt{n}\,t) \to 0
$$
since $\sqrt{n}\,t \to -\infty$. For $t > 0$,
$$
F_n(t) = P(X_n \le t) = P(\sqrt{n}\,X_n \le \sqrt{n}\,t) = P(Z \le \sqrt{n}\,t) \to 1
$$
since $\sqrt{n}\,t \to \infty$. Hence, $F_n(t) \to F(t)$ for all $t \neq 0$ and so $X_n \rightsquigarrow 0$. Notice that $F_n(0) = 1/2 \neq F(0) = 1$, so convergence fails at $t = 0$. That doesn't matter because $t = 0$ is not a continuity point of $F$ and the definition of convergence in distribution only requires convergence at continuity points.

Now consider convergence in probability. For any $\epsilon > 0$, using Markov's inequality,
$$
P(|X_n| > \epsilon) = P(|X_n|^2 > \epsilon^2) \le \frac{E(X_n^2)}{\epsilon^2} = \frac{1/n}{\epsilon^2} \to 0
$$
as $n \to \infty$. Hence, $X_n \xrightarrow{P} 0$.
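As a quick numerical illustration of Example 4 (not part of the original notes), here is a minimal Monte Carlo sketch that estimates $P(|X_n| > \epsilon)$ for $X_n \sim N(0, 1/n)$ at a few values of $n$. It assumes numpy; the choices of $\epsilon$, the values of $n$, and the number of replications are arbitrary.

```python
import numpy as np

# Monte Carlo sketch of Example 4: X_n ~ N(0, 1/n), so P(|X_n| > eps)
# should shrink toward 0 as n grows (convergence in probability to 0).
rng = np.random.default_rng(0)
eps = 0.1          # arbitrary tolerance
reps = 100_000     # Monte Carlo replications

for n in [1, 10, 100, 1000]:
    xn = rng.normal(loc=0.0, scale=1.0 / np.sqrt(n), size=reps)
    prop = np.mean(np.abs(xn) > eps)
    print(f"n = {n:5d}   P(|X_n| > {eps}) approx {prop:.4f}")
```

Increasing the replication count tightens the Monte Carlo error but does not change the trend: the estimated probabilities decrease toward 0, as the Markov bound $1/(n\epsilon^2)$ from the example predicts.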
The next theorem gives the relationship between the types of convergence.

Theorem 5 The following relationships hold:
(a) $X_n \xrightarrow{qm} X$ implies that $X_n \xrightarrow{P} X$.
(b) $X_n \xrightarrow{P} X$ implies that $X_n \rightsquigarrow X$.
(c) If $X_n \rightsquigarrow X$ and if $P(X = c) = 1$ for some real number $c$, then $X_n \xrightarrow{P} X$.
(d) $X_n \xrightarrow{a.s.} X$ implies $X_n \xrightarrow{P} X$.
In general, none of the reverse implications hold except the special case in (c).

Proof. We start by proving (a). Suppose that $X_n \xrightarrow{qm} X$. Fix $\epsilon > 0$. Then, using Markov's inequality,
$$
P(|X_n - X| > \epsilon) = P(|X_n - X|^2 > \epsilon^2) \le \frac{E|X_n - X|^2}{\epsilon^2} \to 0.
$$

Proof of (b). Fix $\epsilon > 0$ and let $x$ be a continuity point of $F$. Then
$$
F_n(x) = P(X_n \le x) = P(X_n \le x, X \le x + \epsilon) + P(X_n \le x, X > x + \epsilon)
\le P(X \le x + \epsilon) + P(|X_n - X| > \epsilon) = F(x + \epsilon) + P(|X_n - X| > \epsilon).
$$
Also,
$$
F(x - \epsilon) = P(X \le x - \epsilon) = P(X \le x - \epsilon, X_n \le x) + P(X \le x - \epsilon, X_n > x)
\le F_n(x) + P(|X_n - X| > \epsilon).
$$
Hence,
$$
F(x - \epsilon) - P(|X_n - X| > \epsilon) \le F_n(x) \le F(x + \epsilon) + P(|X_n - X| > \epsilon).
$$
Take the limit as $n \to \infty$ to conclude that
$$
F(x - \epsilon) \le \liminf_{n\to\infty} F_n(x) \le \limsup_{n\to\infty} F_n(x) \le F(x + \epsilon).
$$
This holds for all $\epsilon > 0$. Take the limit as $\epsilon \to 0$, use the fact that $F$ is continuous at $x$, and conclude that $\lim_n F_n(x) = F(x)$.

Proof of (c). Fix $\epsilon > 0$. Then,
$$
P(|X_n - c| > \epsilon) = P(X_n < c - \epsilon) + P(X_n > c + \epsilon)
\le P(X_n \le c - \epsilon) + P(X_n > c + \epsilon)
= F_n(c - \epsilon) + 1 - F_n(c + \epsilon)
\to F(c - \epsilon) + 1 - F(c + \epsilon) = 0 + 1 - 1 = 0.
$$

Proof of (d). Omitted.

Let us now show that the reverse implications do not hold.

Convergence in probability does not imply convergence in quadratic mean. Let $U \sim \mathrm{Unif}(0,1)$ and let $X_n = \sqrt{n}\, I_{(0,1/n)}(U)$. Then $P(|X_n| > \epsilon) = P(\sqrt{n}\, I_{(0,1/n)}(U) > \epsilon) = P(0 \le U < 1/n) = 1/n \to 0$. Hence, $X_n \xrightarrow{P} 0$. But $E(X_n^2) = n \int_0^{1/n} du = 1$ for all $n$, so $X_n$ does not converge in quadratic mean.

Convergence in distribution does not imply convergence in probability. Let $X \sim N(0,1)$ and let $X_n = -X$ for $n = 1, 2, 3, \ldots$; hence $X_n \sim N(0,1)$. $X_n$ has the same distribution function as $X$ for all $n$ so, trivially, $\lim_n F_n(x) = F(x)$ for all $x$. Therefore, $X_n \rightsquigarrow X$. But $P(|X_n - X| > \epsilon) = P(|2X| > \epsilon) = P(|X| > \epsilon/2) \neq 0$. So $X_n$ does not converge to $X$ in probability.

The relationships between the types of convergence can be summarized as follows:

               q.m.
                 ↓
    a.s.  →   prob   →   distribution

Example 6 One might conjecture that if $X_n \xrightarrow{P} b$, then $E(X_n) \to b$. This is not true. Let $X_n$ be a random variable defined by $P(X_n = n^2) = 1/n$ and $P(X_n = 0) = 1 - (1/n)$. Now, $P(|X_n| < \epsilon) = P(X_n = 0) = 1 - (1/n) \to 1$. Hence, $X_n \xrightarrow{P} 0$. However, $E(X_n) = [n^2 \times (1/n)] + [0 \times (1 - (1/n))] = n$. Thus, $E(X_n) \to \infty$.

Example 7 Let $X_1, \ldots, X_n \sim \mathrm{Uniform}(0,1)$ and let $X_{(n)} = \max_i X_i$. First we claim that $X_{(n)} \xrightarrow{P} 1$. This follows since
$$
P(|X_{(n)} - 1| > \epsilon) = P(X_{(n)} \le 1 - \epsilon) = \prod_i P(X_i \le 1 - \epsilon) = (1 - \epsilon)^n \to 0.
$$
Also,
$$
P(n(1 - X_{(n)}) \le t) = P(X_{(n)} \ge 1 - (t/n)) = 1 - (1 - t/n)^n \to 1 - e^{-t}.
$$
So $n(1 - X_{(n)}) \rightsquigarrow \mathrm{Exp}(1)$.
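The limit in Example 7 can be checked by simulation. The following is a minimal sketch (not part of the original notes), assuming numpy; the sample size, number of replications, and evaluation points are arbitrary choices. It compares the empirical cdf of $n(1 - X_{(n)})$ with the Exp(1) cdf $1 - e^{-t}$.

```python
import numpy as np

# Monte Carlo sketch of Example 7: for X_1, ..., X_n ~ Uniform(0,1),
# the rescaled maximum n(1 - X_(n)) is approximately Exp(1) for large n.
rng = np.random.default_rng(1)
n, reps = 500, 20_000
u = rng.uniform(size=(reps, n))
t = n * (1.0 - u.max(axis=1))          # one draw of n(1 - X_(n)) per replication

for q in [0.5, 1.0, 2.0]:
    empirical = np.mean(t <= q)        # empirical cdf at q
    exact = 1.0 - np.exp(-q)           # Exp(1) cdf at q
    print(f"t = {q}: empirical {empirical:.3f}   vs   1 - e^(-t) = {exact:.3f}")
```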
Some convergence properties are preserved under transformations.

Theorem 8 Let $X_n, X, Y_n, Y$ be random variables.
(a) If $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$, then $X_n + Y_n \xrightarrow{P} X + Y$.
(b) If $X_n \xrightarrow{qm} X$ and $Y_n \xrightarrow{qm} Y$, then $X_n + Y_n \xrightarrow{qm} X + Y$.
(c) If $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$, then $X_n Y_n \xrightarrow{P} XY$.

In general, $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow Y$ does not imply that $X_n + Y_n \rightsquigarrow X + Y$. But there are cases when it does:

Theorem 9 (Slutzky's Theorem) If $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow c$, then $X_n + Y_n \rightsquigarrow X + c$. Also, if $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow c$, then $X_n Y_n \rightsquigarrow cX$.

Theorem 10 (The Continuous Mapping Theorem) Let $X_n, X$ be random variables and let $g$ be a continuous function.
(a) If $X_n \xrightarrow{P} X$, then $g(X_n) \xrightarrow{P} g(X)$.
(b) If $X_n \rightsquigarrow X$, then $g(X_n) \rightsquigarrow g(X)$.

Exercise: Prove the continuous mapping theorem.

3 The Law of Large Numbers

The law of large numbers (LLN) says that the mean of a large sample is close to the mean of the distribution. For example, the proportion of heads in a large number of tosses of a fair coin is expected to be close to 1/2. We now make this more precise. Let $X_1, X_2, \ldots$ be an iid sample, let $\mu = E(X_1)$ and $\sigma^2 = \mathrm{Var}(X_1)$. Recall that the sample mean is defined as $\overline{X}_n = n^{-1}\sum_{i=1}^n X_i$ and that $E(\overline{X}_n) = \mu$ and $\mathrm{Var}(\overline{X}_n) = \sigma^2/n$.

Theorem 11 (The Weak Law of Large Numbers (WLLN)) If $X_1, \ldots, X_n$ are iid, then $\overline{X}_n \xrightarrow{P} \mu$. Thus, $\overline{X}_n - \mu = o_P(1)$.

Interpretation of the WLLN: The distribution of $\overline{X}_n$ becomes more concentrated around $\mu$ as $n$ gets large.

Proof. Assume that $\sigma < \infty$. This is not necessary but it simplifies the proof. Using Chebyshev's inequality,
$$
P\big(|\overline{X}_n - \mu| > \epsilon\big) \le \frac{\mathrm{Var}(\overline{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2},
$$
which tends to 0 as $n \to \infty$. $\Box$

Theorem 12 (The Strong Law of Large Numbers) Let $X_1, \ldots, X_n$ be iid with mean $\mu$. Then $\overline{X}_n \xrightarrow{a.s.} \mu$.

The proof is beyond the scope of this course.

4 The Central Limit Theorem

The law of large numbers says that the distribution of $\overline{X}_n$ piles up near $\mu$. This isn't enough to help us approximate probability statements about $\overline{X}_n$. For this we need the central limit theorem.

Suppose that $X_1, \ldots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$. The central limit theorem (CLT) says that $\overline{X}_n = n^{-1}\sum_i X_i$ has a distribution which is approximately Normal with mean $\mu$ and variance $\sigma^2/n$. This is remarkable since nothing is assumed about the distribution of $X_i$, except the existence of the mean and variance.

Theorem 13 (The Central Limit Theorem (CLT)) Let $X_1, \ldots, X_n$ be iid with mean $\mu$ and variance $\sigma^2$. Let $\overline{X}_n = n^{-1}\sum_{i=1}^n X_i$. Then
$$
Z_n \equiv \frac{\overline{X}_n - \mu}{\sqrt{\mathrm{Var}(\overline{X}_n)}} = \frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \rightsquigarrow Z
$$
where $Z \sim N(0,1)$. In other words,
$$
\lim_{n\to\infty} P(Z_n \le z) = \Phi(z) = \int_{-\infty}^z \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx.
$$

Interpretation: Probability statements about $\overline{X}_n$ can be approximated using a Normal distribution. It's the probability statements that we are approximating, not the random variable itself.

Remark: We often write
$$
\overline{X}_n \approx N\left(\mu, \frac{\sigma^2}{n}\right)
$$
as short form for
$$
\frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \rightsquigarrow N(0,1).
$$

Recall that if $X$ is a random variable, its moment generating function (mgf) is $\psi_X(t) = E e^{tX}$. Assume in what follows that the mgf is finite in a neighborhood around $t = 0$.

Lemma 14 Let $Z_1, Z_2, \ldots$ be a sequence of random variables. Let $\psi_n$ be the mgf of $Z_n$. Let $Z$ be another random variable and denote its mgf by $\psi$. If $\psi_n(t) \to \psi(t)$ for all $t$ in some open interval around 0, then $Z_n \rightsquigarrow Z$.

Proof of the central limit theorem. Let $Y_i = (X_i - \mu)/\sigma$. Then $Z_n = n^{-1/2}\sum_i Y_i$. Let $\psi(t)$ be the mgf of $Y_i$. The mgf of $\sum_i Y_i$ is $(\psi(t))^n$ and the mgf of $Z_n$ is $[\psi(t/\sqrt{n})]^n \equiv \xi_n(t)$. Now $\psi'(0) = E(Y_1) = 0$ and $\psi''(0) = E(Y_1^2) = \mathrm{Var}(Y_1) = 1$. So,
$$
\psi(t) = \psi(0) + t\psi'(0) + \frac{t^2}{2!}\psi''(0) + \frac{t^3}{3!}\psi'''(0) + \cdots
= 1 + 0 + \frac{t^2}{2} + \frac{t^3}{3!}\psi'''(0) + \cdots
= 1 + \frac{t^2}{2} + \frac{t^3}{3!}\psi'''(0) + \cdots
$$
Now,
$$
\xi_n(t) = \left[\psi\left(\frac{t}{\sqrt{n}}\right)\right]^n
= \left[1 + \frac{t^2}{2n} + \frac{t^3}{3!\,n^{3/2}}\psi'''(0) + \cdots\right]^n
= \left[1 + \frac{\frac{t^2}{2} + \frac{t^3}{3!\,n^{1/2}}\psi'''(0) + \cdots}{n}\right]^n
\to e^{t^2/2},
$$
which is the mgf of a $N(0,1)$. The result follows from Lemma 14. In the last step we used the fact that if $a_n \to a$ then
$$
\left(1 + \frac{a_n}{n}\right)^n \to e^a.
$$
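Before turning to the studentized version below, here is a minimal simulation sketch of the CLT itself (not part of the original notes). It assumes numpy and uses Exponential(1) draws, for which $\mu = \sigma = 1$; the sample size and replication count are arbitrary. The empirical cdf of $Z_n$ is compared with $\Phi$.

```python
import numpy as np
from math import erf, sqrt

def phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Monte Carlo sketch of the CLT with Exponential(1) data (mu = sigma = 1):
# Z_n = sqrt(n) * (Xbar_n - mu) / sigma should be approximately N(0, 1).
rng = np.random.default_rng(2)
n, reps = 200, 50_000
mu, sigma = 1.0, 1.0

x = rng.exponential(scale=1.0, size=(reps, n))
zn = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma    # one Z_n per replication

for z in [-1.0, 0.0, 1.0]:
    print(f"z = {z:+.1f}: P(Z_n <= z) approx {np.mean(zn <= z):.3f}   Phi(z) = {phi(z):.3f}")
```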
The central limit theorem tells us that $Z_n = \sqrt{n}(\overline{X}_n - \mu)/\sigma$ is approximately $N(0,1)$. However, we rarely know $\sigma$. We can estimate $\sigma^2$ from $X_1, \ldots, X_n$ by
$$
S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X}_n)^2.
$$
This raises the following question: if we replace $\sigma$ with $S_n$, is the central limit theorem still true? The answer is yes.

Theorem 15 Assume the same conditions as the CLT. Then,
$$
T_n = \frac{\sqrt{n}(\overline{X}_n - \mu)}{S_n} \rightsquigarrow N(0,1).
$$

Proof. Here is a brief proof. We have that $T_n = Z_n W_n$ where
$$
Z_n = \frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \quad \text{and} \quad W_n = \frac{\sigma}{S_n}.
$$
Now $Z_n \rightsquigarrow N(0,1)$ and $W_n \xrightarrow{P} 1$. The result follows from Slutzky's theorem.

Here is an extended proof.

Step 1. We first show that $R_n^2 \xrightarrow{P} \sigma^2$ where
$$
R_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X}_n)^2.
$$
Note that
$$
R_n^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \left(\frac{1}{n}\sum_{i=1}^n X_i\right)^2.
$$
Define $Y_i = X_i^2$. Then, using the LLN (law of large numbers),
$$
\frac{1}{n}\sum_{i=1}^n X_i^2 = \frac{1}{n}\sum_{i=1}^n Y_i \xrightarrow{P} E(Y_i) = E(X_i^2) = \mu^2 + \sigma^2.
$$
Next, by the LLN,
$$
\frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{P} \mu.
$$
Since $g(t) = t^2$ is continuous, the continuous mapping theorem implies that
$$
\left(\frac{1}{n}\sum_{i=1}^n X_i\right)^2 \xrightarrow{P} \mu^2.
$$
Thus
$$
R_n^2 \xrightarrow{P} (\mu^2 + \sigma^2) - \mu^2 = \sigma^2.
$$

Step 2. Note that
$$
S_n^2 = \frac{n}{n-1}\, R_n^2.
$$
Since $R_n^2 \xrightarrow{P} \sigma^2$ and $n/(n-1) \to 1$, we have that $S_n^2 \xrightarrow{P} \sigma^2$.

Step 3. Since $g(t) = \sqrt{t}$ is continuous (for $t \ge 0$), the continuous mapping theorem implies that $S_n \xrightarrow{P} \sigma$.

Step 4. Since $g(t) = t/\sigma$ is continuous, the continuous mapping theorem implies that $S_n/\sigma \xrightarrow{P} 1$.

Step 5. Since $g(t) = 1/t$ is continuous (for $t > 0$), the continuous mapping theorem implies that $\sigma/S_n \xrightarrow{P} 1$. Since convergence in probability implies convergence in distribution, $\sigma/S_n \rightsquigarrow 1$.

Step 6. Note that
$$
T_n = \frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma}\,\frac{\sigma}{S_n} \equiv V_n W_n.
$$
Now $V_n \rightsquigarrow Z$ where $Z \sim N(0,1)$ by the CLT, and we showed that $W_n \rightsquigarrow 1$. By Slutzky's theorem, $T_n = V_n W_n \rightsquigarrow Z \times 1 = Z$. $\Box$

The next result is very important. It tells us how close the distribution of $\overline{X}_n$ is to the Normal distribution.

Theorem 16 (Berry-Esseen Theorem) Let $X_1, \ldots, X_n \sim P$. Let $\mu = E[X_i]$ and $\sigma^2 = \mathrm{Var}[X_i]$. Assume that $\mu_3 = E[|X_i - \mu|^3] < \infty$. Let
$$
F_n(z) = P\left(\frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \le z\right).
$$
Then
$$
\sup_z |F_n(z) - \Phi(z)| \le \frac{33}{4}\,\frac{\mu_3}{\sigma^3 \sqrt{n}}.
$$

There is also a multivariate version of the central limit theorem. Recall that $X = (X_1, \ldots, X_k)^T$ has a multivariate Normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$ if
$$
f(x) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right\}.
$$
In this case we write $X \sim N(\mu, \Sigma)$.

Theorem 17 (Multivariate central limit theorem) Let $X_1, \ldots, X_n$ be iid random vectors where $X_i = (X_{1i}, \ldots, X_{ki})^T$ with mean $\mu = (\mu_1, \ldots, \mu_k)^T$ and covariance matrix $\Sigma$. Let $\overline{X} = (\overline{X}_1, \ldots, \overline{X}_k)^T$ where $\overline{X}_j = n^{-1}\sum_{i=1}^n X_{ji}$. Then,
$$
\sqrt{n}(\overline{X} - \mu) \rightsquigarrow N(0, \Sigma).
$$

Remark: There is also a multivariate version of the Berry-Esseen theorem but it is more complicated than the one-dimensional version.

5 The Delta Method

If $Y_n$ has a limiting Normal distribution then the delta method allows us to find the limiting distribution of $g(Y_n)$ where $g$ is any smooth function.

Theorem 18 (The Delta Method) Suppose that
$$
\frac{\sqrt{n}(Y_n - \mu)}{\sigma} \rightsquigarrow N(0,1)
$$
and that $g$ is a differentiable function such that $g'(\mu) \neq 0$. Then
$$
\frac{\sqrt{n}(g(Y_n) - g(\mu))}{|g'(\mu)|\,\sigma} \rightsquigarrow N(0,1).
$$
In other words,
$$
Y_n \approx N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{implies that} \quad g(Y_n) \approx N\left(g(\mu), (g'(\mu))^2\,\frac{\sigma^2}{n}\right).
$$

Example 19 Let $X_1, \ldots, X_n$ be iid with finite mean $\mu$ and finite variance $\sigma^2$. By the central limit theorem, $\sqrt{n}(\overline{X}_n - \mu)/\sigma \rightsquigarrow N(0,1)$. Let $W_n = e^{\overline{X}_n}$. Thus, $W_n = g(\overline{X}_n)$ where $g(s) = e^s$. Since $g'(s) = e^s$, the delta method implies that $W_n \approx N(e^{\mu}, e^{2\mu}\sigma^2/n)$.
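Here is a minimal Monte Carlo sketch of Example 19 (not part of the original notes). It assumes numpy and, purely for illustration, draws the $X_i$ from $N(\mu, \sigma^2)$; the values of $\mu$, $\sigma$, $n$, and the replication count are arbitrary. It compares the simulated mean and variance of $W_n = e^{\overline{X}_n}$ with the delta-method approximation $N(e^{\mu}, e^{2\mu}\sigma^2/n)$.

```python
import numpy as np

# Monte Carlo sketch of Example 19: W_n = exp(Xbar_n) should be roughly
# N(e^mu, e^{2*mu} * sigma^2 / n) by the delta method.
rng = np.random.default_rng(3)
mu, sigma, n, reps = 0.5, 2.0, 500, 20_000

x = rng.normal(loc=mu, scale=sigma, size=(reps, n))
wn = np.exp(x.mean(axis=1))                        # one W_n per replication

print("simulated mean of W_n:", wn.mean(), "   delta-method mean:", np.exp(mu))
print("simulated var of W_n: ", wn.var(), "   delta-method var: ",
      np.exp(2 * mu) * sigma**2 / n)
```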
There is also a multivariate version of the delta method.

Theorem 20 (The Multivariate Delta Method) Suppose that $Y_n = (Y_{n1}, \ldots, Y_{nk})$ is a sequence of random vectors such that
$$
\sqrt{n}(Y_n - \mu) \rightsquigarrow N(0, \Sigma).
$$
Let $g : \mathbb{R}^k \to \mathbb{R}$ and let
$$
\nabla g(y) = \begin{pmatrix} \frac{\partial g}{\partial y_1} \\ \vdots \\ \frac{\partial g}{\partial y_k} \end{pmatrix}.
$$
Let $\nabla_\mu$ denote $\nabla g(y)$ evaluated at $y = \mu$ and assume that the elements of $\nabla_\mu$ are nonzero. Then
$$
\sqrt{n}\big(g(Y_n) - g(\mu)\big) \rightsquigarrow N\big(0, \nabla_\mu^T \Sigma \nabla_\mu\big).
$$

Example 21 Let
$$
\begin{pmatrix} X_{11} \\ X_{21} \end{pmatrix},
\begin{pmatrix} X_{12} \\ X_{22} \end{pmatrix},
\ldots,
\begin{pmatrix} X_{1n} \\ X_{2n} \end{pmatrix}
$$
be iid random vectors with mean $\mu = (\mu_1, \mu_2)^T$ and variance $\Sigma$. Let
$$
\overline{X}_1 = \frac{1}{n}\sum_{i=1}^n X_{1i}, \qquad \overline{X}_2 = \frac{1}{n}\sum_{i=1}^n X_{2i}
$$
and define $Y_n = \overline{X}_1 \overline{X}_2$. Thus, $Y_n = g(\overline{X}_1, \overline{X}_2)$ where $g(s_1, s_2) = s_1 s_2$. By the central limit theorem,
$$
\sqrt{n}\begin{pmatrix} \overline{X}_1 - \mu_1 \\ \overline{X}_2 - \mu_2 \end{pmatrix} \rightsquigarrow N(0, \Sigma).
$$
Now
$$
\nabla g(s) = \begin{pmatrix} \frac{\partial g}{\partial s_1} \\[2pt] \frac{\partial g}{\partial s_2} \end{pmatrix} = \begin{pmatrix} s_2 \\ s_1 \end{pmatrix}
$$
and so
$$
\nabla_\mu^T \Sigma \nabla_\mu = \begin{pmatrix} \mu_2 & \mu_1 \end{pmatrix}
\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}
\begin{pmatrix} \mu_2 \\ \mu_1 \end{pmatrix}
= \mu_2^2 \sigma_{11} + 2\mu_1\mu_2\sigma_{12} + \mu_1^2\sigma_{22}.
$$
Therefore,
$$
\sqrt{n}\big(\overline{X}_1\overline{X}_2 - \mu_1\mu_2\big) \rightsquigarrow N\big(0, \mu_2^2\sigma_{11} + 2\mu_1\mu_2\sigma_{12} + \mu_1^2\sigma_{22}\big).
$$
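To close, here is a minimal Monte Carlo sketch of Example 21 (not part of the original notes). It assumes numpy and, for illustration only, simulates bivariate Normal data with a particular $\mu$ and $\Sigma$; those values, $n$, and the replication count are arbitrary. It compares the simulated variance of $\sqrt{n}(\overline{X}_1\overline{X}_2 - \mu_1\mu_2)$ with $\mu_2^2\sigma_{11} + 2\mu_1\mu_2\sigma_{12} + \mu_1^2\sigma_{22}$.

```python
import numpy as np

# Monte Carlo sketch of Example 21: the variance of
# sqrt(n) * (Xbar_1 * Xbar_2 - mu_1 * mu_2) should be close to
# grad^T Sigma grad with grad = (mu_2, mu_1)^T.
rng = np.random.default_rng(4)
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
n, reps = 500, 10_000

x = rng.multivariate_normal(mu, Sigma, size=(reps, n))   # shape (reps, n, 2)
xbar = x.mean(axis=1)                                    # shape (reps, 2)
yn = np.sqrt(n) * (xbar[:, 0] * xbar[:, 1] - mu[0] * mu[1])

grad = np.array([mu[1], mu[0]])                          # gradient of s1*s2 at mu
print("simulated var:   ", yn.var())
print("delta-method var:", grad @ Sigma @ grad)
```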