C22.0015 / B90.3302 NOTES for Wednesday 2011.MAR.09

Please see the handout with caption "A Summary of the Bayesian Method and Bayesian Point of View." We will consider the topic of decision theory. There is a separate handout for this subject; it was distributed last week.

Some interesting developments of the 20th century:

Cantor's notion of infinity (around 1900)
Gödel's incompleteness theorem (1931)
Ronald Fisher's foundations of statistical inference (1920s)
Neyman-Pearson notion of statistical hypothesis testing (1931)
Operations research (World War II)
Shannon's information theory (about 1948)
Kuhn-Tucker theorem for constrained optimization (about 1948)
John von Neumann's game theory (about 1948)
Statistical and economic utility theory (1950s)
Kahneman and Tversky's prospect theory (1971)

This list was Gary Simon's. People in the class also added Box-Jenkins methods for time series and Bayesian methods.

Within statistics alone, we've done many things that will "stick" through time, though they are probably not as important as the things in the list above. We could include the Kaplan-Meier non-parametric survival function (about 1950), Cox survival analysis (1970), Efron's bootstrap methods (1970s), and the Laird-Rubin E-M method for missing data (1970s). These have been around for a while and have proved very useful.

There are other things that we've worked on. We found them stimulating, we had fun with them, we wrote lots of papers... and then they faded away. I'm going to call these "toys." The statistical robustness movement of the 1970s was such a toy: very few of the ideas remain, and no one really talks about it anymore. I am going to claim here that statistical decision theory is such a toy. Others might disagree. I suppose that talking about it at all provides some evidence that maybe it is not a toy. Of course, "toy" is an ex-post judgment. It's hard to make that appraisal while something is still at high mania.
Perhaps MCMC is such a toy. Later we will mention regression's LASSO estimate; it could be a toy. What about networks?

There are several incredibly useful properties of maximum likelihood estimates. With one notable category of exceptions, these estimates are asymptotically normal with the optimum variance. What we mean here is that if $\hat{\theta}_{ML}$ is a maximum likelihood estimate, then $\hat{\theta}_{ML}$ is approximately normally distributed and, moreover, its standard deviation $SD(\hat{\theta}_{ML})$ is as small as possible.

OK, what's the category of exceptions? These occur when estimating parameters which limit the support of the random variable. For example, if we have a sample from $U(0, \theta)$, then $\theta$ marks off the range of possible values. The maximum likelihood estimate here will not have a limiting normal distribution.

So we have a problem involving a parameter $\theta$; at the moment, there are no other parameters. We've written the likelihood $L$ and we have found $\hat{\theta}_{ML}$. We are asserting that $\hat{\theta}_{ML}$ is approximately normally distributed, that we have the approximate value of $SD(\hat{\theta}_{ML})$, and that this value of $SD(\hat{\theta}_{ML})$ is optimally small. We can actually obtain the optimal limiting variance as $\frac{1}{I(\theta)}$, where $I(\theta)$ is Fisher's information in the data. Then, of course, $SD(\hat{\theta}_{ML})$ will be the square root of this limiting variance. We need to see more about Fisher's information.

How do we find $I(\theta)$? There are a number of ways. Let $f(\mathbf{x} \mid \theta)$ be the likelihood for the whole problem. Note here that $\mathbf{x}$ is used as a vector to represent the entire set of data. We'll use $\mathbf{X}$ (upper case) to denote the corresponding random variable. Let $S$ be the score random variable, defining this as
$$ S \ = \ \frac{\partial}{\partial \theta} \log f(\mathbf{X} \mid \theta) $$
This $S$ is a random variable, but it is not a statistic; its form involves $\theta$, which is unknown. It can be shown that $E\,S = 0$.
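The claim that $E\,S = 0$ is easy to check numerically. Here is a minimal sketch (assuming NumPy is available; the choice of a $N(\mu, \sigma^2)$ sample with known $\sigma$, and the values $\mu = 5$, $\sigma = 2$, $n = 40$, are purely illustrative). It differentiates the log-likelihood numerically at the true parameter value and averages the resulting score over many replicated samples:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, h = 5.0, 2.0, 40, 1e-5

def loglik(x, m, s):
    # log-likelihood of an i.i.d. N(m, s^2) sample x
    return np.sum(-np.log(s) - 0.5 * np.log(2 * np.pi)
                  - (x - m) ** 2 / (2 * s ** 2))

def score(x):
    # numerical derivative of log L with respect to mu, at the true mu
    return (loglik(x, mu + h, sigma) - loglik(x, mu - h, sigma)) / (2 * h)

# S varies from sample to sample, but its expected value is 0.
scores = [score(rng.normal(mu, sigma, n)) for _ in range(20000)]
print(abs(float(np.mean(scores))) < 0.1)  # True: the average score is near 0
```

Any other smooth one-parameter density could be substituted for the normal here; the average score at the true parameter should come out near zero in every case.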
There are three ways to get $I(\theta)$:

(1) $I(\theta) = E\,S^2$
(2) $I(\theta) = \text{Var}\,S$
(3) $I(\theta) = -E\!\left[\dfrac{\partial^2}{\partial\theta^2}\log f(\mathbf{X}\mid\theta)\right] = -E\!\left[\dfrac{\partial S}{\partial\theta}\right]$

Generally one way will be somewhat easier than the others.

Here's a neat example. Suppose that $X_1, X_2, \ldots, X_n$ are independent random variables, each $N(\mu, \sigma^2)$. Let's suppose for this example that $\sigma$ is a known value. It's pretty easy to show that $\hat{\mu}_{MM} = \hat{\mu}_{ML} = \bar{X}$.

Now let's find $I(\mu)$. First, get the likelihood for the whole sample:
$$ L \ = \ \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \ = \ \frac{1}{\sigma^n\,(2\pi)^{n/2}}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2} $$
Now we need the score random variable:
$$ \log L \ = \ -n\log\sigma \ - \ \frac{n}{2}\log(2\pi) \ - \ \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2 $$
$$ \frac{\partial}{\partial\mu}\log L \ = \ \frac{1}{2\sigma^2}\sum_{i=1}^n 2\,(x_i-\mu) \ = \ \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu) $$
In random variable form, this is $S = \frac{1}{\sigma^2}\sum_{i=1}^n (X_i-\mu)$. This is not a statistic, as it involves the unknown parameter $\mu$. It's easy to see that $E\,S = 0$ here.

There are several ways to get $I(\mu)$, all pretty easy.

(1) $I(\mu) = E\,S^2 = E\!\left[\frac{1}{\sigma^2}\sum_{i=1}^n (X_i-\mu)\cdot\frac{1}{\sigma^2}\sum_{j=1}^n (X_j-\mu)\right] = \frac{1}{\sigma^4}\sum_{i=1}^n\sum_{j=1}^n E\,(X_i-\mu)(X_j-\mu) = \frac{1}{\sigma^4}\,n\sigma^2 = \frac{n}{\sigma^2}$.
Consider the double sum. Any term with $i\neq j$ has expected value zero. The $n$ terms with $i=j$ each have expected value $\sigma^2$.

(2) $I(\mu) = \text{Var}\,S = \frac{1}{\sigma^4}\,n\sigma^2 = \frac{n}{\sigma^2}$. This is easy, but the third way is even easier.

(3) $I(\mu) = -E\!\left[\frac{\partial^2}{\partial\mu^2}\log L\right] = -E\!\left[\frac{\partial}{\partial\mu}\,\frac{1}{\sigma^2}\sum_{i=1}^n (X_i-\mu)\right] = -E\!\left[-\frac{n}{\sigma^2}\right] = \frac{n}{\sigma^2}$.

Thus, the asymptotic variance of the maximum likelihood estimate is $\frac{1}{I(\mu)} = \frac{\sigma^2}{n}$.

For cases in which we have a sample, meaning $n$ independent values sampled from the same distribution, we have $I(\theta) = n\,I_1(\theta)$, where $I_1(\theta)$ is the information in one observation. We can get this from the score random variable based on one observation, generally identified as $S_1$.

For the example above, the likelihood for just the first observation is
$$ L_1 \ = \ \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_1-\mu)^2}{2\sigma^2}} $$
This leads to
$$ \log L_1 \ = \ -\log\sigma \ - \ \tfrac{1}{2}\log(2\pi) \ - \ \frac{(x_1-\mu)^2}{2\sigma^2} $$
Then
$$ S_1 \ = \ \frac{d}{d\mu}\log L_1 \ = \ \frac{x_1-\mu}{\sigma^2} $$
In random variable form, this is $S_1 = \frac{X_1-\mu}{\sigma^2}$. It follows that
$$ I_1(\mu) \ = \ \text{Var}(S_1) \ = \ \text{Var}\!\left(\frac{X_1-\mu}{\sigma^2}\right) \ = \ \frac{1}{\sigma^4}\,\text{Var}\,X_1 \ = \ \frac{1}{\sigma^4}\,\sigma^2 \ = \ \frac{1}{\sigma^2} $$
This verifies, for this example, that $I(\mu) = n\,I_1(\mu)$.
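The routes to $I(\mu)$ can be cross-checked by simulation. The sketch below (assuming NumPy; the values $\mu = 5$, $\sigma = 2$, $n = 40$ are illustrative) estimates $\text{Var}\,S$ from replicated samples, which method (2) says should match the exact answer $n/\sigma^2 = 10$:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 5.0, 2.0, 40

def score(x):
    # closed-form score for a N(mu, sigma^2) sample with sigma known:
    # S = (1/sigma^2) * sum(X_i - mu)
    return np.sum(x - mu) / sigma ** 2

# Method (2): I(mu) = Var S, estimated over many replicated samples.
scores = np.array([score(rng.normal(mu, sigma, n)) for _ in range(40000)])
var_s = float(scores.var())

print(var_s)  # close to n / sigma^2 = 40 / 4 = 10
```

The same loop with `scores ** 2` averaged instead of `scores.var()` would check method (1), since $E\,S = 0$ makes $E\,S^2$ and $\text{Var}\,S$ agree.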
OK, let's work through the details for this particular situation. Suppose that $x_1, x_2, \ldots, x_n$ are known values, all positive. Suppose that $Y_1, Y_2, \ldots, Y_n$ are independent, with $Y_i \sim N(\beta x_i,\ x_i^2\sigma^2)$.

Note that this is not a sample of values all with the same distribution; each $Y_i$ has a possibly-different distribution. Thus we cannot use the $n\,I_1$ logic.

We can certainly get a method of moments estimate for $\beta$. Observe that $E\,Y_i = \beta x_i$, so that
$$ E\sum_{i=1}^n Y_i \ = \ \beta\sum_{i=1}^n x_i $$
By dividing by $n$, we get $E\,\bar{Y} = \beta\bar{x}$, so the method of moments estimate is $\hat{\beta}_{MM} = \dfrac{\bar{Y}}{\bar{x}}$.

As an interesting challenge, you might show that
$$ \text{Var}\sum_{i=1}^n Y_i \ = \ \sigma^2\sum_{i=1}^n x_i^2 \qquad\text{and}\qquad \text{Var}\,\hat{\beta}_{MM} \ = \ \frac{1}{\bar{x}^2}\,\text{Var}\,\bar{Y} \ = \ \frac{\sigma^2}{n^2\,\bar{x}^2}\sum_{i=1}^n x_i^2 $$

In what follows we have to worry about two parameters. For the sake of this example, let's think of $\sigma$ as known. (It actually won't matter here.) The likelihood for $Y_i$ is
$$ f(y_i \mid x_i) \ = \ \frac{1}{x_i\,\sigma\sqrt{2\pi}}\, e^{-\frac{(y_i-\beta x_i)^2}{2 x_i^2\sigma^2}} $$
Based on this, we can write the likelihood for the whole problem:
$$ L \ = \ \prod_{i=1}^n \frac{1}{x_i\,\sigma\sqrt{2\pi}}\, e^{-\frac{(y_i-\beta x_i)^2}{2 x_i^2\sigma^2}} \ = \ \frac{1}{(2\pi)^{n/2}\,\sigma^n \prod_{i=1}^n x_i}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n \frac{(y_i-\beta x_i)^2}{x_i^2}} $$
We'll need to take $\log L$:
$$ \log L \ = \ -n\log\sigma \ - \ \frac{n}{2}\log(2\pi) \ - \ \sum_{i=1}^n \log x_i \ - \ \frac{1}{2\sigma^2}\sum_{i=1}^n \frac{(y_i-\beta x_i)^2}{x_i^2} $$

We could get the maximum likelihood estimates for both $\beta$ and $\sigma$. For now, we'll just worry about $\beta$, as noted above. Clearly we get that estimate by minimizing the sum in the exponent. Thus we solve
$$ \frac{\partial}{\partial\beta}\sum_{i=1}^n \frac{(y_i-\beta x_i)^2}{x_i^2} \ = \ \sum_{i=1}^n \frac{2\,(y_i-\beta x_i)(-x_i)}{x_i^2} \ = \ -2\left[\,\sum_{i=1}^n \frac{y_i}{x_i} \ - \ n\beta\,\right] \ \stackrel{\text{set}}{=} \ 0 $$
Clearly the solution is $\hat{\beta}_{ML} = \dfrac{1}{n}\sum_{i=1}^n \dfrac{Y_i}{x_i}$. This is a very unusual ratio estimate.

Suppose we wanted to know its asymptotic variance. We need the score random variable. (Sometimes this score random variable is found as part of the routine of getting the maximum likelihood estimate, but not here.)
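To make this "unusual ratio estimate" concrete, here is a sketch (assuming NumPy; the $x$ values, $\beta = 3$, and $\sigma = 0.5$ are made up for illustration) that puts the method-of-moments and maximum likelihood estimates side by side on one simulated data set:

```python
import numpy as np

rng = np.random.default_rng(2)
beta, sigma = 3.0, 0.5
x = np.array([1.0, 2.0, 4.0, 5.0, 8.0])   # known positive x_i
y = rng.normal(beta * x, sigma * x)        # Y_i ~ N(beta*x_i, x_i^2 * sigma^2)

beta_mm = float(y.mean() / x.mean())   # method of moments: Ybar / xbar
beta_ml = float(np.mean(y / x))        # MLE: (1/n) * sum(Y_i / x_i)

print(beta_mm, beta_ml)  # both near beta = 3, but generally not equal
```

The two estimates weight the observations differently: the MM estimate effectively gives the large-$x_i$ observations more influence, while the MLE down-weights them because their variances $x_i^2\sigma^2$ are larger.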
The score is
$$ S \ = \ \frac{\partial}{\partial\beta}\log L \ = \ \frac{1}{\sigma^2}\sum_{i=1}^n \left(\frac{y_i}{x_i} - \beta\right) $$
In random variable form, this is $S = \dfrac{1}{\sigma^2}\sum_{i=1}^n \left(\dfrac{Y_i}{x_i} - \beta\right)$.

The easiest way to get $I(\beta)$ is as $\text{Var}\,S$:
$$ I(\beta) \ = \ \text{Var}\,S \ = \ \text{Var}\!\left(\frac{1}{\sigma^2}\sum_{i=1}^n \frac{Y_i}{x_i}\right) \ = \ \frac{1}{\sigma^4}\sum_{i=1}^n \frac{\text{Var}\,Y_i}{x_i^2} \ = \ \frac{1}{\sigma^4}\sum_{i=1}^n \frac{x_i^2\sigma^2}{x_i^2} \ = \ \frac{n}{\sigma^2} $$
It follows that the limiting variance of $\hat{\beta}_{ML}$ is $\dfrac{\sigma^2}{n}$. You can actually show that this is a non-asymptotic result as well.

Another example. Suppose that $X_1, X_2, \ldots, X_n$ is a sample from the exponential density
$$ f(x) \ = \ \lambda\, e^{-\lambda x}\, I(x \ge 0) $$
Let's find the maximum likelihood estimate. Begin with the likelihood:
$$ L \ = \ \prod_{i=1}^n \lambda\, e^{-\lambda x_i} \ = \ \lambda^n\, e^{-\lambda \sum_{i=1}^n x_i} $$
It follows that
$$ \log L \ = \ n\log\lambda \ - \ \lambda\sum_{i=1}^n x_i $$
$$ \frac{\partial}{\partial\lambda}\log L \ = \ \frac{n}{\lambda} \ - \ \sum_{i=1}^n x_i \ \stackrel{\text{set}}{=} \ 0 $$
It follows that $\hat{\lambda}_{ML} = \dfrac{n}{\sum_{i=1}^n X_i} = \dfrac{1}{\bar{X}}$. It's going to be very difficult to get a limiting distribution directly. Let's use the asymptotic results, based on the fact that this is a maximum likelihood estimate.

For observation 1, we have $\log L_1 = \log\lambda - \lambda x_1$. Then
$$ S_1 \ = \ \frac{\partial}{\partial\lambda}\log L_1 \ = \ \frac{1}{\lambda} \ - \ X_1 $$
Certainly $E\,S_1 = 0$. There are several ways to get $I_1(\lambda)$. Here's the easiest:
$$ I_1(\lambda) \ = \ -E\!\left[\frac{\partial^2}{\partial\lambda^2}\log L_1\right] \ = \ -E\!\left[\frac{\partial}{\partial\lambda}\left(\frac{1}{\lambda} - X_1\right)\right] \ = \ -E\!\left[-\frac{1}{\lambda^2}\right] \ = \ \frac{1}{\lambda^2} $$
It follows then that $I(\lambda) = n\,I_1(\lambda) = \dfrac{n}{\lambda^2}$.

We can certainly make an approximate 95% confidence interval based on $\hat{\lambda}_{ML} \pm 2\,SE(\hat{\lambda}_{ML})$. Specifically, this is
$$ \frac{1}{\bar{X}} \ \pm \ \frac{2}{\bar{X}\sqrt{n}} $$

This next example was not done in class. Let's use this technology for a genuinely difficult problem. Recall our censored likelihood problem. Suppose that $X_1, X_2, \ldots, X_n$ are independent random variables, each from the density $\lambda e^{-\lambda x}$. Actually we are able to observe the $X_i$'s only if they have a value $T$ or below. This corresponds to an experimental framework in which we are observing the lifetimes of $n$ independent objects (light bulbs, say), but the experiment ceases at time $T$. Suppose that $K$ of the $X_i$'s are observed; call these values $X_1, X_2, X_3, \ldots, X_K$.
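Before working through the censored case, here is a quick simulation check of the interval just derived for the plain exponential sample (a sketch assuming NumPy; $\lambda = 2$, $n = 100$ are illustrative values). The $\pm 2\,SE$ interval should cover the true $\lambda$ roughly 95% of the time:

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n, reps = 2.0, 100, 5000

covered = 0
for _ in range(reps):
    x = rng.exponential(1 / lam, n)   # NumPy parameterizes by scale = 1/lambda
    lam_hat = 1 / x.mean()            # MLE: 1 / Xbar
    se = lam_hat / np.sqrt(n)         # sqrt(1 / I(lam)) with lam_hat plugged in
    if lam_hat - 2 * se <= lam <= lam_hat + 2 * se:
        covered += 1

coverage = covered / reps
print(coverage)  # close to 0.95
```

Note the plug-in step: the true standard error $\lambda/\sqrt{n}$ involves the unknown $\lambda$, so the interval uses $\hat{\lambda}/\sqrt{n}$ instead; the simulation shows this does little harm at this sample size.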
The remaining $n - K$ values are censored at $T$; operationally, this means that there were $n - K$ light bulbs still burning when the experiment stopped at time $T$. The overall likelihood is
$$ L \ = \ \left(e^{-\lambda T}\right)^{n-K}\, \lambda^K\, e^{-\lambda \sum_{i=1}^K x_i} $$
It is not at all obvious what would be the maximum likelihood estimate. Let's take logs:
$$ \log L \ = \ -\lambda T\,(n-K) \ + \ K\log\lambda \ - \ \lambda\sum_{i=1}^K x_i $$
Then we found . . . eventually . . .
$$ \hat{\lambda}_{ML} \ = \ \frac{K}{T\,(n-K) \ + \ \sum_{i=1}^K x_i} $$
Let's get its asymptotic variance.
$$ S(\lambda) \ = \ \frac{d}{d\lambda}\log L \ = \ -T\,(n-K) \ + \ \frac{K}{\lambda} \ - \ \sum_{i=1}^K x_i $$
Finding $E\,S^2(\lambda)$ or $\text{Var}\,S(\lambda)$ will be very tricky, as we need to consider that $K$ is random, and the random variables $x_i$ have to be considered conditional on having a value below $T$. We'll do the third method:
$$ I(\lambda) \ = \ -E\!\left[\frac{\partial^2}{\partial\lambda^2}\log L\right] \ = \ -E\!\left[\frac{\partial}{\partial\lambda}\left(-T\,(n-K) + \frac{K}{\lambda} - \sum_{i=1}^K x_i\right)\right] \ = \ -E\!\left[-\frac{K}{\lambda^2}\right] \ = \ \frac{E\,K}{\lambda^2} $$
Here $K$ is binomial with $n$ trials and event probability $1 - e^{-\lambda T}$. Its expected value is then $n\left(1 - e^{-\lambda T}\right)$. This leads to
$$ I(\lambda) \ = \ \frac{n\left(1 - e^{-\lambda T}\right)}{\lambda^2} $$
The limiting variance of the maximum likelihood estimate is
$$ \frac{1}{I(\lambda)} \ = \ \frac{\lambda^2}{n\left(1 - e^{-\lambda T}\right)} $$
In any actual use, we would use
$$ \frac{1}{I(\hat{\lambda})} \ = \ \frac{\hat{\lambda}^2}{n\left(1 - e^{-\hat{\lambda} T}\right)} $$
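A short simulation pulls the censored-data pieces together: generate lifetimes, censor at $T$, compute the maximum likelihood estimate, and attach the plug-in standard error. This is a sketch assuming NumPy; the values $\lambda = 0.5$, $T = 3$, $n = 500$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
lam, n, T = 0.5, 500, 3.0

x = rng.exponential(1 / lam, n)   # true lifetimes (scale = 1/lambda)
obs = x[x <= T]                   # lifetimes observed before the cutoff
K = len(obs)                      # number of uncensored observations

# MLE: lam_hat = K / ( T*(n - K) + sum of observed lifetimes )
lam_hat = float(K / (T * (n - K) + obs.sum()))

# Plug-in standard error from I(lam) = n * (1 - exp(-lam*T)) / lam^2
se = float(lam_hat / np.sqrt(n * (1 - np.exp(-lam_hat * T))))

print(lam_hat, se)  # lam_hat near 0.5; se reflects the information lost to censoring
```

With $\lambda T = 1.5$ here, about 78% of the lifetimes are observed, so the standard error is only modestly larger than the uncensored value $\hat{\lambda}/\sqrt{n}$; as $T$ shrinks, the factor $1 - e^{-\hat{\lambda}T}$ shrinks and the standard error grows.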