Solution to Homework 2

Statistics 511
Homework 2
Fall 2006
1. In a study of a new drug to reduce the number of epilepsy episodes, an inbred line of
epileptic mice was used. It is well known from previous experience that the mean
weekly number of episodes for these mice is 9, and that the number of episodes is skewed.
However, taking the square root normalizes the data.
12 mice were selected at random to be on the drug, and the number of epilepsy episodes was
recorded for each mouse for 1 week. When the square roots of these numbers were
computed, the mean square root was 3.2 and the s.d. of the square root was 1.0.
a. Is there evidence that the mice on the drug differ from the undrugged mice in their
mean weekly number of episodes? If so, has the mean increased or decreased?
We need to test if the mean sqrt is sqrt(9)=3.
t* = (3.2 − 3) / (1/sqrt(12)) = 0.69
d.f. = 12 − 1 = 11
p ≈ 0.5
We fail to reject the null hypothesis that the mean is the same. There is no evidence of a change in the mean.
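As a check, the test statistic and p-value can be reproduced from the summary statistics. A minimal Python sketch (scipy is assumed to be available; the numbers are taken from the problem statement):

    from math import sqrt
    from scipy.stats import t

    n, mean_sqrt, sd_sqrt = 12, 3.2, 1.0      # summary statistics on the square-root scale
    mu0 = sqrt(9)                             # hypothesized mean on the square-root scale

    t_star = (mean_sqrt - mu0) / (sd_sqrt / sqrt(n))   # t* = 0.69
    p_value = 2 * t.sf(abs(t_star), df=n - 1)          # two-sided p, about 0.5
    print(t_star, n - 1, p_value)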
b. Compute a 90% confidence interval for the mean number of weekly episodes for
mice on the drug. (Note that this should be an interval for the weekly mean and on
the natural scale, not the square root scale.)
We start with a CI for the sqrt mean, and then transform the ends of the interval to
obtain the interval for the mean on the natural scale.
A 90% CI for the mean sqrt of the number of episodes is:
3.2 ± 1.796 * 1/sqrt(12) = 3.2 ± 0.52
The interval is (2.68, 3.72)
A 90% CI for the mean number of episodes is:
(2.68^2, 3.72^2) = (7.18, 13.84)
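The same arithmetic in a short Python sketch (1.796 is the t critical value for 90% confidence with 11 d.f.; the endpoints are squared to return to the natural scale):

    from math import sqrt
    from scipy.stats import t

    n, mean_sqrt, sd_sqrt = 12, 3.2, 1.0
    t_crit = t.ppf(0.95, df=n - 1)                            # about 1.796

    half_width = t_crit * sd_sqrt / sqrt(n)                   # about 0.52
    lo, hi = mean_sqrt - half_width, mean_sqrt + half_width   # (2.68, 3.72) on the sqrt scale
    print(lo**2, hi**2)                                       # about (7.18, 13.84) on the natural scale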
c. Mouse #7 in the sample had 25 epilepsy episodes during the week of the study. Is
this mouse an outlier? (Hint: We might consider a mouse to be an outlier if the
number of episodes was in an extreme tail of the population distribution. Since
we do not know the population distribution, we might instead consider a
prediction interval. Mice outside the 99% prediction interval might be considered
outliers.)
We can either work on the sqrt scale, or transform the PI to the natural scale.
i) Using the sqrt scale, we want to know if sqrt(25) = 5 is an outlier.
The upper end of the 99% PI for sqrt(#episodes) is 3.2 + 3.1 * 1 * sqrt(1 + 1/12) = 6.43.
So 5 is in the interval and the mouse is not an outlier.
ii) A 99% PI for sqrt(#episodes) is 3.2 ± 3.1 * 1 * sqrt(1 + 1/12) = (0, 6.43). (The lower
endpoint is slightly negative, so it is set to 0 because a square root cannot be negative.)
A 99% PI for the number of episodes is (0, 6.43^2) = (0, 41.3) (or better, (0, 42) since we
cannot have part of an episode).
25 episodes is not unusual.
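A sketch of the same calculation in Python (the exact t critical value is used, so the endpoints differ slightly from the hand-rounded values above):

    from math import sqrt
    from scipy.stats import t

    n, mean_sqrt, sd_sqrt = 12, 3.2, 1.0
    t_crit = t.ppf(0.995, df=n - 1)                    # about 3.11 for a 99% PI

    half_width = t_crit * sd_sqrt * sqrt(1 + 1/n)
    lo = max(0.0, mean_sqrt - half_width)              # truncate: a square root cannot be negative
    hi = mean_sqrt + half_width                        # about 6.43
    print((lo, hi), (lo**2, hi**2))                    # natural-scale interval, roughly (0, 41)
    print(sqrt(25) < hi)                               # True: mouse #7 is inside the interval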
d. The investigator discovered that mouse #7 was caged right beside the lab
answering machine, and suspected that the flashing light from the machine might
be triggering epileptic episodes. As a result, she decided to redo the analysis
without the data from mouse #7. What will the effect be on the estimated mean
and standard deviation of the number of episodes for the drugged mice?
Since sqrt(25) = 5 is greater than the sample mean, the estimated mean will decrease.
Since 5 − 3.2 = 1.8 is greater than the sample SD of 1.0, the estimated SD will also decrease.
You can work out the exact effect, but this was not required:
There were 12 mice. Let y = sqrt(number of episodes):
Σ yi = 12 * 3.2 = 38.4
The new mean is (38.4 − 5)/11 = 3.04.
Since sy = 1.0, (n − 1)sy^2 = Σ (yi − 3.2)^2 = 11.
After deleting this observation we have
Σ_{i≠7} (yi − 3.04)^2 = Σ_{i≠7} (yi − 3.2)^2 − 11 (3.2 − 3.04)^2
= [11 − (5 − 3.2)^2] − 11 (3.2 − 3.04)^2
= 7.76 − 0.29
= 7.47
sy = sqrt(7.47/10) = 0.86
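The same bookkeeping in a short Python sketch, recovering the new mean and s.d. from the original summary statistics alone:

    from math import sqrt

    n, mean_sqrt, sd_sqrt = 12, 3.2, 1.0
    y7 = sqrt(25)                               # the observation to be deleted, on the sqrt scale

    total = n * mean_sqrt                       # sum of the yi (38.4)
    ss = (n - 1) * sd_sqrt**2                   # sum of squared deviations about 3.2 (11.0)

    new_mean = (total - y7) / (n - 1)           # about 3.04
    new_ss = (ss - (y7 - mean_sqrt)**2) - (n - 1) * (mean_sqrt - new_mean)**2
    new_sd = sqrt(new_ss / (n - 2))             # about 0.86
    print(new_mean, new_sd)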
2. Prediction Intervals – Is there still research to be done?
Read this entire question first.
Then, download and read the article “On the prediction of a single future observation
from a possibly noisy sample” by Horn, Britton and Lewis (1988).
http://links.jstor.org/sici?sici=00390526%281988%2937%3A2%3C165%3AOTPOAS%3E2.0.CO%3B2-M
You do not need to follow every detail of this article. Read the questions below
before reading the article, as I explain below which details are not very important for
this exercise.
How does the problem in this article differ from the problem of prediction based on a
Normally distributed population?
a. Suppose that there are outliers (unusually large or small observations) in the
sample. What effect does this have on the sample mean? Sample variance?
(Note: consider 2 cases: i) most of the outliers are either too large or too small,
and ii) the outliers are equally prevalent in both tails of the distribution.)
a. (i) The sample mean is pulled in the direction of the outliers, and the sample
variance increases because the data are more spread out.
(ii) The sample mean is roughly unaffected, but the sample variance still increases.
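A quick numeric illustration (the data are made up, just to show the direction of each effect):

    import statistics as st

    clean = [8, 9, 10, 11, 12, 9, 10, 11, 10, 10]
    one_sided = clean + [30, 35]       # outliers only in the upper tail
    symmetric = clean + [-10, 30]      # one outlier in each tail, same distance from the center

    for name, data in [("clean", clean), ("one-sided", one_sided), ("symmetric", symmetric)]:
        print(name, round(st.mean(data), 2), round(st.variance(data), 2))
    # the one-sided outliers pull the mean up; both kinds of outliers inflate the variance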
b. Section 2: This section describes robust estimators of location and scale (s.d.).
The details of the estimators are not important. The important facts are that these
estimators are robust (i.e. they estimate the population mean and s.d. based on the
center of the distribution, and are not influenced by the outliers).
Look at the equation for the prediction interval on p. 167:
The following notation is used:
T_bi is the robust estimator of the mean.
S_T^2 is the estimator of the variance of the sampling distribution of T_bi.
s_bi^2 is the robust estimator of the variance of the population.
Comparing with the Normal Theory prediction interval on p. 166, why are there
two different estimators of variance, and what happened to 1/n in the formula?
b. Matching terms between the PIs given on pages 166 and 167 gives the correspondences
S_T^2 <-> s^2/n and s_bi^2 <-> s^2. There are two estimators of variance because a future
observation is uncertain for two reasons: the spread of the population (estimated by s_bi^2)
and the uncertainty in the estimated center (estimated by S_T^2). In the classical setting
var(Xbar) = σ^2/n is estimated by s^2/n; in the article, var(T_bi) is estimated by S_T^2.
The 1/n has not disappeared from the formula: since s*sqrt(1 + 1/n) = sqrt(s^2 + s^2/n),
the 1/n term is the s^2/n piece, and it is absorbed into S_T^2.
c. Section 3: When the sampling distribution of a proposed estimator cannot be
computed mathematically, Monte Carlo simulation may be used to estimate the
sampling distribution (or some of its properties). This means that computergenerated pseudo-random numbers are used to create many samples (hundreds to
thousands) of fixed size from the population. For each sample, the value of the
estimate is computed. These values are used to estimate the sampling
distribution.
In this case, the "coverage" of the prediction interval is estimated. The coverage
is the percentage of times that the 1−α interval actually contains at least 1−α of the
population. We always get 100% coverage if we use the interval (-∞,∞), but this
interval is too long. A good interval has the correct coverage, but is as short as
possible.
In this case, we want an interval that adapts to outliers. So, if there are no
outliers, we want to obtain something very close to the usual Normal Theory
interval, but if there are outliers, we want to adjust the interval so that it is as short
as possible, but retains the correct coverage.
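A minimal Monte Carlo sketch of this idea in Python, estimating the coverage (as defined above) and the average length of the ordinary Normal Theory 95% prediction interval with and without contamination by outliers. The sample size, contamination scheme, and number of simulations here are invented for illustration and are not taken from the article:

    import numpy as np
    from scipy.stats import t, norm

    rng = np.random.default_rng(0)
    n, alpha, n_sims = 20, 0.05, 2000
    t_crit = t.ppf(1 - alpha / 2, df=n - 1)

    def simulate(contaminate):
        coverage, length = [], []
        for _ in range(n_sims):
            sample = rng.normal(0, 1, size=n)           # sample from a standard Normal population
            if contaminate:
                sample[:2] += 10                        # two gross outliers in the upper tail
            m, s = sample.mean(), sample.std(ddof=1)
            half = t_crit * s * np.sqrt(1 + 1 / n)      # Normal Theory 95% PI half-width
            # fraction of the (uncontaminated) population inside this interval
            coverage.append(norm.cdf(m + half) - norm.cdf(m - half))
            length.append(2 * half)
        prop = np.mean(np.array(coverage) >= 1 - alpha) # intervals covering at least 95%
        return prop, np.mean(length)

    print("clean samples:       ", simulate(False))
    print("contaminated samples:", simulate(True))

With contamination of this kind the Normal Theory interval still covers the population, but it becomes much longer, which is exactly the behaviour an adaptive, robust interval is meant to avoid.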
4 intervals are compared: i) the ordinary Normal Theory interval ii) the new
robust estimator iii) an estimator which uses the Normal Theory interval after
deleting outliers that are not very far out (“inner-fence” – details not important)
iv) an estimator which uses the Normal Theory interval after deleting
outliers that are quite far out (“outer-fence” – details not important). The last 2
estimators are not discussed in the paper, but are supposed to represent what a
careful statistician might actually do if (s)he planned to use the Normal Theory
methods (discussed in Stat 511).
The authors conclude (of course) that their method is best. Why? (Ignore the
table, and concentrate on the text.)
c. The authors claim that their method is more robust (stable) than the Normal
Theory interval over a wide range of outlier situations, because it produces shorter
intervals whose probability coverage still exceeds 95%.
d. The authors apply their method to evaluation of laboratory precision. Why don’t
they use Normal Theory prediction intervals?
d. The authors believe that the laboratory precision data may contain some very
extreme outliers. In that case, the Normal Theory prediction intervals will not
have the correct coverage, but their robust method should still give a reasonable
interval.