Statistics 511 Homework 2 Fall 2006

1. In a study of a new drug to reduce the number of epilepsy episodes, an inbred line of epileptic mice was used. It is well known from previous experience that the mean weekly number of episodes for these mice is 9, and that the number of episodes is skewed; however, taking square roots Normalizes the data. 12 mice were selected at random to be put on the drug, and the number of epilepsy episodes was recorded for each mouse for 1 week. When the square roots of these counts were compiled, the mean square root was 3.2 and the s.d. of the square roots was 1.0.

a. Is there evidence that the mice on the drug differ from the undrugged mice in their mean weekly number of episodes? If so, has the mean increased or decreased?

We need to test whether the mean square root is sqrt(9) = 3.

t* = (3.2 - 3) / (1/sqrt(12)) = 0.69, d.f. = 12 - 1 = 11, p ≈ 0.5

We fail to reject that the mean is the same. There is no evidence of a change in the mean.

b. Compute a 90% confidence interval for the mean number of weekly episodes for mice on the drug. (Note that this should be an interval for the weekly mean on the natural scale, not the square-root scale.)

We start with a CI for the mean square root, and then transform the ends of the interval to obtain the interval for the mean on the natural scale. A 90% CI for the mean square root of the number of episodes is:

3.2 ± 1.796 * 1/sqrt(12) = 3.2 ± 0.52

The interval is (2.68, 3.72). A 90% CI for the mean number of episodes is:

(2.68^2, 3.72^2) = (7.18, 13.84)

c. Mouse #7 in the sample had 25 epilepsy episodes during the week of the study. Is this mouse an outlier? (Hint: We might consider a mouse to be an outlier if its number of episodes was in an extreme tail of the population distribution. Since we do not know the population distribution, we might instead consider a prediction interval. Mice outside the 99% prediction interval might be considered outliers.)

We can either work on the square-root scale, or transform the PI to the natural scale.

i) Working on the square-root scale, we want to know whether sqrt(25) = 5 is an outlier. The upper end of the 99% PI for the square root is 3.2 + 3.106 * 1 * sqrt(1 + 1/12) = 6.43, so 5 is inside the interval and the mouse is not an outlier.

ii) A 99% PI for sqrt(# episodes) is 3.2 ± 3.106 * 1 * sqrt(1 + 1/12) = (0, 6.44), where the lower end is truncated at 0. A 99% PI for the number of episodes is therefore (0, 6.44^2) = (0, 41.47), or better (0, 42), since we cannot have part of an episode. 25 episodes is not unusual.

d. The investigator discovered that mouse #7 was caged right beside the lab answering machine, and suspected that the flashing light from the machine might be triggering epileptic episodes. As a result, she decided to redo the analysis without the data from mouse #7. What will the effect be on the estimated mean and standard deviation of the number of episodes for the drugged mice?

Since sqrt(25) = 5 is greater than the sample mean, the estimated mean will decrease. Since 5 - 3.2 = 1.8 is greater than the sample SD, the estimated SD will also decrease.

You can work out the exact effect, but this was not required. There were 12 mice. Let y = sqrt(number of episodes). Then Σ y_i = 12 * 3.2 = 38.4, so the new mean is (38.4 - 5)/11 = 3.04. Since s_y = 1.0, the sum of squares about the old mean is (n - 1)s² = Σ(y_i - 3.2)² = 11. After deleting mouse #7,

Σ_{i≠7}(y_i - 3.2)² = 11 - (5 - 3.2)² = 7.76,

and the sum of squares about the new mean is smaller by 11(3.2 - 3.04)²:

Σ_{i≠7}(y_i - 3.04)² = 7.76 - 11(0.16)² = 7.48.

So the new s_y = sqrt(7.48/10) = 0.86.
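For anyone who wants to check the arithmetic above, here is a minimal sketch in Python (numpy and scipy are assumed to be available; neither is required by the assignment, and all variable names here are mine) that reproduces the t statistic, the back-transformed 90% CI, the 99% prediction interval, and the recomputed mean and SD after dropping mouse #7:

```python
import numpy as np
from scipy import stats

n, ybar, s = 12, 3.2, 1.0          # sqrt-scale summary statistics
mu0 = np.sqrt(9)                    # hypothesized sqrt-scale mean

# (a) one-sample t test on the sqrt scale
t_stat = (ybar - mu0) / (s / np.sqrt(n))            # ~0.69
p_val = 2 * stats.t.sf(abs(t_stat), df=n - 1)       # ~0.50

# (b) 90% CI on the sqrt scale, then square the endpoints
t90 = stats.t.ppf(0.95, df=n - 1)                   # 1.796
ci_sqrt = (ybar - t90 * s / np.sqrt(n), ybar + t90 * s / np.sqrt(n))
ci_natural = (ci_sqrt[0] ** 2, ci_sqrt[1] ** 2)     # ~(7.2, 13.8)

# (c) 99% prediction interval on the sqrt scale, then square
t99 = stats.t.ppf(0.995, df=n - 1)                  # 3.106
half = t99 * s * np.sqrt(1 + 1 / n)
pi_sqrt = (max(0.0, ybar - half), ybar + half)      # ~(0, 6.43)
pi_natural = (pi_sqrt[0] ** 2, pi_sqrt[1] ** 2)     # ~(0, 41.4)

# (d) mean and SD of the sqrt counts after dropping mouse #7 (sqrt(25) = 5)
total = n * ybar                                    # 38.4
new_mean = (total - 5) / (n - 1)                    # ~3.04
ss_about_old = (n - 1) * s ** 2 - (5 - ybar) ** 2   # SS of the 11 kept mice about 3.2
ss_new = ss_about_old - (n - 1) * (new_mean - ybar) ** 2  # shift SS to the new mean
new_sd = np.sqrt(ss_new / (n - 2))                  # ~0.86

print(t_stat, p_val, ci_natural, pi_natural, new_mean, new_sd)
```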
2. Prediction Intervals – Is there still research to be done? Read this entire question first. Then, download and read the article "On the prediction of a single future observation from a possibly noisy sample" by Horn, Britton and Lewis (1988).
http://links.jstor.org/sici?sici=00390526%281988%2937%3A2%3C165%3AOTPOAS%3E2.0.CO%3B2-M

You do not need to follow every detail of this article. Read the questions below before reading the article, since I explain there which details are not very important for this exercise. How does the problem in this article differ from the problem of prediction based on a Normally distributed population?

a. Suppose that there are outliers (unusually large or small observations) in the sample. What effect does this have on the sample mean? On the sample variance? (Note: consider 2 cases, i) most of the outliers are either too large or too small, and ii) the outliers are equally prevalent in both tails of the distribution.)

a. (i) The sample mean is pulled in the direction of the outliers, and the sample variance increases because the data are more spread out. (ii) The sample mean is not affected, but the sample variance still increases.

b. Section 2: This section describes robust estimators of location and scale (s.d.). The details of the estimators are not important. The important facts are that these estimators are robust (i.e., they estimate the population mean and s.d. based on the center of the distribution, and are not influenced by the outliers). Look at the equation for the prediction interval on p. 167. The following notation is used: T_bi is the robust estimator of the mean; S_T² is the estimator of the variance of the sampling distribution of T_bi; s_bi² is the robust estimator of the variance of the population. Comparing with the Normal Theory prediction interval on p. 166, why are there two different estimators of variance, and what happened to the 1/n in the formula?

b. Matching terms in the PIs given on pages 166 and 167, S_T² plays the role of s²/n and s_bi² plays the role of s². This follows from the simple fact that var(T_bi) is estimated by S_T² in the article, just as var(X̄) is estimated by s²/n in the classical setting, while s_bi² estimates the population variance just as s² does. The 1/n in the classical formula is therefore taken into account by S_T².

c. Section 3: When the sampling distribution of a proposed estimator cannot be computed mathematically, Monte Carlo simulation may be used to estimate the sampling distribution (or some of its properties). This means that computer-generated pseudo-random numbers are used to create many samples (hundreds to thousands) of fixed size from the population. For each sample, the value of the estimate is computed. These values are used to estimate the sampling distribution. In this case, the "coverage" of the prediction interval is estimated. The coverage is the percentage of times that the 1 − α interval actually contains at least 1 − α of the population. We always get 100% coverage if we use the interval (−∞, ∞), but this interval is too long. A good interval has the correct coverage, but is as short as possible. In this case, we want an interval that adapts to outliers. So, if there are no outliers, we want to obtain something very close to the usual Normal Theory interval, but if there are outliers, we want to adjust the interval so that it is as short as possible but retains the correct coverage. (A small sketch of this kind of coverage simulation is given below.)
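To make the coverage idea concrete (this is only an illustration, not the authors' actual simulation design), here is a minimal Python sketch that estimates the coverage of the ordinary Normal Theory prediction interval when sampling from a contaminated-Normal population. The contamination fraction, outlier scale, sample size, number of simulations, and the helper names contaminated_sample and population_cdf are all choices made here for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, alpha, n_sim = 12, 0.05, 2000
eps, scale = 0.10, 5.0          # 10% of observations come from a wide (outlier) component

def contaminated_sample(size):
    """Mixture: N(0,1) with probability 1-eps, N(0, scale^2) with probability eps."""
    wide = rng.random(size) < eps
    return np.where(wide, rng.normal(0, scale, size), rng.normal(0, 1, size))

def population_cdf(x):
    """CDF of the same contaminated-Normal population."""
    return (1 - eps) * stats.norm.cdf(x) + eps * stats.norm.cdf(x / scale)

tcrit = stats.t.ppf(1 - alpha / 2, df=n - 1)
hits = 0
for _ in range(n_sim):
    y = contaminated_sample(n)
    half = tcrit * y.std(ddof=1) * np.sqrt(1 + 1 / n)
    lo, hi = y.mean() - half, y.mean() + half
    # "coverage": does this interval contain at least 1 - alpha of the population?
    if population_cdf(hi) - population_cdf(lo) >= 1 - alpha:
        hits += 1

print(f"estimated coverage of the Normal Theory PI: {hits / n_sim:.3f}")
```

Repeating this for each competing interval, and also recording the interval lengths, is the kind of comparison the article reports.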
4 intervals are compared:

i) the ordinary Normal Theory interval;
ii) the new robust estimator;
iii) an estimator which uses the Normal Theory interval after deleting outliers that are not very far out ("inner-fence" – details not important);
iv) an estimator which uses the Normal Theory interval after deleting outliers that are quite far out ("outer-fence" – details not important; a rough sketch of this kind of fence-based interval is given at the end of this solution).

The last 2 estimators are not discussed in the paper, but are supposed to represent what a careful statistician might actually do if (s)he planned to use the Normal Theory methods (discussed in Stat 511). The authors conclude (of course) that their method is best. Why? (Ignore the table, and concentrate on the text.)

c. The authors claim that their method is more robust (stable) than the Normal Theory interval over a wide range of outlier situations, because it produces shorter intervals while its probability coverage still exceeded 95%.

d. The authors apply their method to the evaluation of laboratory precision. Why don't they use Normal Theory prediction intervals?

d. The authors believe that the laboratory precision data may contain some very extreme outliers. In that case, the Normal Theory prediction intervals will not have the correct coverage, but their robust method should still give a reasonable interval.
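For reference, the inner-fence and outer-fence competitors in (iii) and (iv) are variants of "delete the flagged outliers, then apply the usual Normal Theory interval." The article's exact deletion rules are not important for this exercise, but a rough Python sketch, assuming Tukey-style boxplot fences (which may differ in detail from the authors' rules), looks like this:

```python
import numpy as np
from scipy import stats

def fence_pi(y, alpha=0.01, k=1.5):
    """Normal Theory prediction interval computed after deleting points outside
    Tukey-style fences Q1 - k*IQR, Q3 + k*IQR (k=1.5 ~ 'inner', k=3 ~ 'outer')."""
    y = np.asarray(y, dtype=float)
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    kept = y[(y >= q1 - k * iqr) & (y <= q3 + k * iqr)]   # drop flagged outliers
    n = len(kept)
    tcrit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    half = tcrit * kept.std(ddof=1) * np.sqrt(1 + 1 / n)
    return kept.mean() - half, kept.mean() + half

# example with one moderately wild value
data = [3.1, 2.9, 3.4, 2.7, 3.0, 3.3, 4.3]
print(fence_pi(data, k=1.5))   # inner fences: the 4.3 is deleted
print(fence_pi(data, k=3.0))   # outer fences: the 4.3 is kept
```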