Statistics 510: Notes 24

Reading: Sections 8.2, 8.4

I. Markov's Inequality and Chebyshev's Inequality (Section 8.2)

Sometimes we know the mean and/or variance of a random variable but not its entire distribution. Markov's inequality provides a bound on the probability that a nonnegative random variable is greater than or equal to some value when only the mean of the distribution is known.

Proposition 8.2.1: If X is a random variable that takes only nonnegative values, then for any value a > 0,

    P{X ≥ a} ≤ E(X)/a.

Proof: For a > 0, let

    I = 1 if X ≥ a, and I = 0 otherwise,

and note that, since X ≥ 0,

    I ≤ X/a.

Taking expectations of the preceding yields

    E(I) ≤ E(X/a) = E(X)/a,

which, since E(I) = P{X ≥ a}, proves the result.

As a corollary, we obtain Chebyshev's inequality, which provides a bound on the probability that a random variable departs from its mean by a certain amount when we know the variance of the random variable.

Proposition 8.2.2: If X is a random variable with finite mean μ and variance σ², then for any value k > 0,

    P{|X − μ| ≥ k} ≤ σ²/k².

Proof: Since (X − μ)² is a nonnegative random variable, we can apply Markov's inequality with a = k² to obtain

    P{(X − μ)² ≥ k²} ≤ E[(X − μ)²]/k².    (1.1)

But since (X − μ)² ≥ k² if and only if |X − μ| ≥ k, Equation (1.1) is equivalent to

    P{|X − μ| ≥ k} ≤ E[(X − μ)²]/k² = σ²/k²,

and the proof is complete.

Example 2: A fair die is tossed 100 times. Let X_k denote the outcome on the kth roll. Use Chebyshev's inequality to get a lower bound for the probability that X = X_1 + ··· + X_100 is between 300 and 400.

Example 3: The mean of a list of a million numbers is 10 and the mean of the squares of the numbers is 101. Find an upper bound on how many of the entries in the list are 14 or more. (R sketches working both examples appear at the end of this part.)

As Chebyshev's inequality is valid for all distributions of the random variable X, we cannot expect the bound on the probability to be very close to the actual probability in most cases. Consider a normal random variable with mean μ and variance σ²:

    Probability            Chebyshev bound        Probability for X ~ N(μ, σ²)
    P(|X − μ| ≥ σ)         at most 1              0.3173
    P(|X − μ| ≥ 2σ)        at most 1/2² = 0.25    0.0455
    P(|X − μ| ≥ 3σ)        at most 1/3² ≈ 0.11    0.0027
    P(|X − μ| ≥ 4σ)        at most 1/4² ≈ 0.06    0.000063

As the table shows, Chebyshev's bound is very crude for a distribution that is approximately normal. Its importance is that it holds no matter what the shape of the distribution, so it gives some information about two-sided tail probabilities whenever the mean and standard deviation of a distribution can be calculated.
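The table can be checked directly in R. The following lines are not part of the original notes; they are a minimal sketch that recomputes each row, using pnorm for the exact normal probability.

# Chebyshev bound vs. exact two-sided tail probability for X ~ N(mu, sigma^2)
k <- 1:4
chebyshev <- 1 / k^2            # Chebyshev: P(|X - mu| >= k*sigma) <= 1/k^2
exact <- 2 * (1 - pnorm(k))     # exact normal probability P(|X - mu| >= k*sigma)
round(cbind(k, chebyshev, exact), 6)

Running this reproduces the table: the exact probabilities 0.3173, 0.0455, 0.0027, and 0.000063 are far below the corresponding Chebyshev bounds.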
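Returning to Examples 2 and 3, here is one possible worked solution for each, written as an R sketch (these calculations are not in the original notes; the arithmetic follows directly from Chebyshev's inequality and is easy to verify by hand).

# Example 2: X = X_1 + ... + X_100, the sum of 100 fair die rolls
mu1 <- mean(1:6)                # E(X_k) = 3.5
var1 <- mean((1:6)^2) - mu1^2   # Var(X_k) = 35/12
EX <- 100 * mu1                 # E(X) = 350
VX <- 100 * var1                # Var(X) = 875/3, about 291.67
1 - VX / 50^2                   # P(300 < X < 400) = P(|X - 350| < 50) >= 0.8833

# Example 3: list of 10^6 numbers with mean 10 and mean of squares 101
sigma2 <- 101 - 10^2            # variance of a randomly drawn entry = 1
frac <- sigma2 / 4^2            # entry >= 14 implies |entry - 10| >= 4, so fraction <= 1/16
1e6 * frac                      # at most 62,500 entries can be 14 or more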
II. Convergence in Probability and the Weak Law of Large Numbers (Section 8.2)

Limit theorems: Limit theorems concern what happens in the limit to the probability distribution of a random variable Y_n in an infinite sequence of random variables Y_1, Y_2, .... In particular, we often consider Y_n to be a random variable associated with a random sample of n units, e.g., the sample mean Ȳ_n = (1/n) Σ_{i=1}^n X_i for a sample X_1, ..., X_n. Although the notion of an infinite sample size is a theoretical artifact, it can often provide us with useful approximations for the finite-sample case.

We will consider three types of convergence for the probability distribution of a sequence of random variables Y_1, Y_2, ...: (i) convergence in probability, (ii) almost sure convergence, and (iii) convergence in distribution.

Convergence in probability: A sequence of random variables Y_1, Y_2, ... converges in probability to a number c if, for every ε > 0,

    lim_{n→∞} P(|Y_n − c| ≥ ε) = 0, or equivalently, lim_{n→∞} P(|Y_n − c| < ε) = 1.

Note: The Y_1, Y_2, ... are typically not independent and identically distributed random variables. The distribution of Y_n changes as the subscript changes, and the convergence concepts we will discuss describe different ways in which the distribution of Y_n "converges" as the subscript becomes large.

The weak law of large numbers:

Consider a sample of independent and identically distributed random variables X_1, ..., X_n. The relationship between the sample mean X̄_n = (X_1 + ··· + X_n)/n and the true mean of the X_i's, μ = E(X_i), is a problem of pivotal importance in statistics. Typically, μ is unknown and we would like to estimate it based on X̄_n. The weak law of large numbers says that the sample mean converges in probability to μ. This means that for a large enough sample size n, X̄_n will be close to μ with high probability.

Theorem 8.2.1 (the weak law of large numbers): Let X_1, X_2, ... be a sequence of independent and identically distributed random variables, each having finite mean E(X_i) = μ. Then, for any ε > 0,

    P{|(X_1 + ··· + X_n)/n − μ| ≥ ε} → 0 as n → ∞.

Proof: We prove the result only under the additional assumption that the random variables have a finite variance σ² (it can be proved without this assumption using more advanced techniques). Because

    E[(X_1 + ··· + X_n)/n] = μ and Var[(X_1 + ··· + X_n)/n] = σ²/n,

it follows from Chebyshev's inequality that

    P{|(X_1 + ··· + X_n)/n − μ| ≥ ε} ≤ σ²/(nε²).

Since σ²/(nε²) → 0 as n → ∞, the result is proved. (A simulation illustrating this convergence is sketched at the end of this part.)

Application of the weak law of large numbers: Monte Carlo integration. Suppose that we wish to calculate

    I(f) = ∫_0^1 f(x) dx,

where the integration cannot be done by elementary means or evaluated using tables of integrals. The most common approach is to use a numerical method in which the integral is approximated by a sum; various schemes and computer packages exist for doing this. Another method, called the Monte Carlo method, works in the following way. Generate independent uniform random variables on (0,1) — that is, X_1, ..., X_n — and compute

    Î(f) = (1/n) Σ_{i=1}^n f(X_i).

By the law of large numbers, for large n this should be close to E[f(X)], which is simply

    E[f(X)] = ∫_0^1 f(x) dx = I(f).

This simple scheme can easily be modified in order to change the range of integration (a sketch follows the example below) and in other ways. Compared to standard numerical methods, it is not especially efficient in one dimension, but it becomes increasingly efficient as the dimensionality of the integral grows.

As a concrete example, let's consider the evaluation of

    I(f) = ∫_0^1 (1/√(2π)) e^{−x²/2} dx.

The integrand is the standard normal density, which cannot be integrated in closed form. From Table 5.1, an accurate numerical approximation is I(f) ≈ 0.3413. The following code for the statistical computing package R generates 1000 pseudorandom independent points from the uniform (0,1) distribution and computes Î(f).

> # Generate a vector of 1000 independent uniform (0,1) random variables
> xvector = runif(1000)
> # Approximate I(f) by (1/1000) * sum from i=1 to 1000 of f(X_i)
> fxvector = (1/(2*pi)^0.5) * exp(-xvector^2/2)
> Ihatf = (1/1000) * sum(fxvector)
> Ihatf
[1] 0.3430698
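As noted above, the scheme extends beyond the interval (0,1). The following sketch is not in the original notes; it handles a general range (a, b) by writing ∫_a^b f(x) dx = (b − a)·E[f(U)] with U uniform on (a, b). The endpoints a = 0 and b = 2 are chosen purely for illustration.

# Monte Carlo integration of f over (a, b) via (b - a) * E[f(U)], U ~ uniform(a, b)
f <- function(x) (1 / sqrt(2 * pi)) * exp(-x^2 / 2)   # standard normal density
a <- 0; b <- 2; n <- 10000
u <- runif(n, min = a, max = b)   # n independent uniform (a, b) points
Ihat <- (b - a) * mean(f(u))
Ihat                              # should be close to pnorm(2) - pnorm(0) = 0.4772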
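Here is the simulation promised after the proof of Theorem 8.2.1: a sketch, assuming fair die rolls as in Example 2 (the tolerance ε = 0.1 and the sample sizes are illustrative choices, not from the notes), that estimates P(|X̄_n − μ| ≥ ε) for increasing n by repeated sampling. The estimated probabilities should shrink toward 0, as the weak law asserts.

# Estimate P(|sample mean - mu| >= eps) for samples of fair die rolls
mu <- 3.5; eps <- 0.1; nreps <- 2000
for (n in c(10, 100, 1000, 10000)) {
  # nreps independent sample means, each based on n rolls
  xbar <- replicate(nreps, mean(sample(1:6, n, replace = TRUE)))
  cat("n =", n, ": estimated P(|Xbar - mu| >= eps) =",
      mean(abs(xbar - mu) >= eps), "\n")
}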
III. Almost Sure Convergence and the Strong Law of Large Numbers (Section 8.4)

A type of convergence that is stronger than convergence in probability is almost sure convergence (sometimes confusingly known as convergence with probability 1). This type of convergence is similar to pointwise convergence of a sequence of functions, except that the convergence may fail on a set of probability 0 (hence the "almost" sure).

Almost sure convergence: A sequence of random variables Y_1, Y_2, ... defined on a common sample space converges almost surely to a number c if, for every ε > 0,

    P(lim_{n→∞} |Y_n − c| < ε) = 1, or equivalently, P(lim_{n→∞} Y_n = c) = 1.

Almost sure convergence is a much stronger convergence concept than convergence in probability and indeed implies convergence in probability.

Example of a sequence of random variables that converges in probability but not almost surely: Consider an experiment with sample space the interval from 0 to 1 and a uniform probability distribution over the sample space. We construct a sequence of intervals W_n as follows: W_1 is the interval from 0 to 1/2, W_2 is the interval from 1/2 to 1, W_3 is the interval from 0 to 1/3, W_4 is the interval from 1/3 to 2/3, W_5 is the interval from 2/3 to 1, W_6 is the interval from 0 to 1/4, and so forth. Now for every point s between 0 and 1, define the value of the random variable Y_n(s) to be −1 if s is in the first half of the interval W_n, 1 if s is in the second half of the interval W_n, and 0 if s is not in the interval W_n.

We have E(Y_n) = 0 for all n. Moreover, for any 0 < ε < 1, P(|Y_n − 0| ≥ ε) is the probability that a uniform random variable falls in the interval W_n, which converges to 0 since the length of the intervals W_n converges to 0. Thus, the sequence Y_1, Y_2, ... converges in probability to 0. However, no matter how short the intervals W_n become, every s between 0 and 1 is in some W_n during each left-to-right "progression" of these intervals as n increases. Consequently, for each s between 0 and 1, the sequence Y_1(s), Y_2(s), ... does not converge, and hence Y_1, Y_2, ... does not converge almost surely. (A short computational sketch of this construction is given at the end of this section.)

The strong law of large numbers: The strong law of large numbers states that for a sequence of independent and identically distributed random variables X_1, X_2, ..., the sample mean converges almost surely to the mean of the random variables, μ = E(X_i).

Theorem 8.4.1: Let X_1, X_2, ... be a sequence of independent and identically distributed random variables, each having a finite mean μ = E(X_i). Then, with probability 1,

    (X_1 + ··· + X_n)/n → μ as n → ∞.

What does the strong law of large numbers add to the weak law of large numbers? The weak law states that for any specified large value n*, (X_1 + ··· + X_{n*})/n* is likely to be near μ. However, it does not say that (X_1 + ··· + X_n)/n is bound to stay near μ for all values of n larger than n*. Thus, it leaves open the possibility that large values of |(X_1 + ··· + X_n)/n − μ| can occur infinitely often (though at infrequent intervals). The strong law shows that this cannot occur. In particular, it implies that, with probability 1, for any positive value ε,

    |(Σ_{i=1}^n X_i)/n − μ|

will be greater than ε only a finite number of times.

Application of the strong law of large numbers: Consider a sequence of independent trials of some experiment. Let E be a fixed event of the experiment and denote by P(E) the probability that the event E occurs on any particular trial. Letting

    X_i = 1 if E occurs on the ith trial, and X_i = 0 if E does not occur on the ith trial,

we have, by the strong law of large numbers, that with probability 1,

    (X_1 + ··· + X_n)/n → E[X] = P(E).    (1.2)

Since X_1 + ··· + X_n represents the number of times that the event E occurs in the first n trials, Equation (1.2) states that, with probability 1, the limiting proportion of times that the event E occurs in repeated, independent trials of the experiment is just P(E).
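Here is the promised sketch of the interval construction from the counterexample above. It is not in the original notes, and the index bookkeeping is my own formalization of the pattern W_1, W_2, ...: block k (k = 2, 3, ...) consists of k intervals of length 1/k, so blocks 2 through k − 1 contain 2 + 3 + ··· + (k − 1) = k(k − 1)/2 − 1 intervals in total. For a fixed s, every block produces an index n with s in W_n, so Y_n(s) is nonzero infinitely often even though P(Y_n ≠ 0) = 1/k → 0.

# For fixed s in (0,1), find the index n with s in W_n within each block k
s <- 0.3
for (k in 2:10) {
  j <- floor(s * k) + 1          # which of the k intervals of length 1/k contains s
  n <- k * (k - 1) / 2 - 1 + j   # global index of that interval in the sequence W_1, W_2, ...
  cat("block k =", k, ": s lies in W_n for n =", n,
      " with P(Y_n != 0) =", 1 / k, "\n")
}

For s = 0.3 this prints n = 1, 3, 7, ... : the hits never stop arriving, which is exactly why Y_n(0.3) fails to converge.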
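Equation (1.2) is easy to watch in action. Below is a minimal sketch, assuming an event E with P(E) = 0.3 and Bernoulli indicators for its occurrence (the value 0.3, the seed, and the checkpoints are illustrative choices, not from the notes).

# Running relative frequency of an event E with P(E) = 0.3
set.seed(1)                              # fix the seed for reproducibility
x <- rbinom(10000, size = 1, prob = 0.3) # X_i = 1 exactly when E occurs on trial i
running <- cumsum(x) / seq_along(x)      # proportion of occurrences in the first n trials
running[c(10, 100, 1000, 10000)]         # drifts toward 0.3 and stays there

A single simulated path settling near P(E) is what the strong law promises for almost every realization of the experiment.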
The strong law of large numbers is of enormous importance, because it provides a direct link between the axioms of probability (Section 2.3) and the frequency interpretation of probability. If we accept the interpretation that "with probability 1" means "with certainty," then we can say that P(E) is the limit of the long-run relative frequency of times E would occur in repeated, independent trials of the experiment.