MidTerm205Sol

Stat 511
Fall 2005
Midterm 2
Statistics 511
Midterm 2
Nov. 15, 2005
The following rules apply.
1. You may use 3 sheets of paper for any information you need - double-sided,
any font.
2. You may use a calculator.
3. You may not collaborate or copy.
4. You may not use outside resources, such as the internet. As well, you may
not store notes or formulas on your calculator.
5. Failure to comply with item 3 could lead to reduction in your grade, or
disciplinary action.
I have read the rules above and agree to comply with them.
Signature ________________________________________________
Name (printed) ___________________________________________
1. (21) In 1929, Edwin Hubble investigated the relationship between distance of a galaxy from
the earth and the velocity with which it appears to be receding. The data collected include
distances (megaparsecs) from earth to 24 galaxies and their recession velocities (km/sec). Note:
1 megaparsec = 3,260,000 light years. Hubble's theory was that at some time all the galaxies
were compressed into a single point in space and that since then they have moved apart at an
increasing velocity leading to Hubble's Law:
Recession Velocity = G*Distance
where G is Hubble's constant. Note that some of the galaxies appear to be moving towards us,
and hence have negative velocity.
Some plots and computer output are in the "Computer Output" handout.
a) (3) The investigator regressed Velocity (Y) on Distance (X). Does the ordinary linear
regression model fit the data? Justify your conclusion.
The ordinary linear regression model fits the data.
1) The plot of studentized residuals shows an even spread of the residuals around 0. There is no
curvature. There are no outliers. (any 2)
2. The normal probability plot of the residuals is very linear, indicating that the errors are close
to normal. (required)
b) (3) Is regression through the origin a suitable model for these data? Give 3 justifications for
your answer.
Yes, this is a suitable model.
1. Hubble’s Law indicates that the intercept should be 0.
2. There are data near X=0.
3. In the ordinary linear regression model, we do not reject the hypothesis that the intercept is
zero.
So, regression through the origin is a reasonable model for these data.
c) (1) What is the estimated value of Hubble's constant using the regression through the origin
method?
423.93732 km/sec/Mps
(You do not need the units for full points.)
d) (2) A quasar is receding from us at a speed of 48,000 km/sec. Using the regression through
the origin model, how far away do you estimate it to be in megaparsecs?
You can estimate distance either by inverting regression of velocity on distance or by computing
the regression of distance on velocity.
If distance is considered to be a known quantity, then it is most correct to invert the regression
equation. This is called the calibration problem. For calibration estimates, confidence intervals
for X are computed by inverting confidence intervals for $\hat{Y}$. (There is more on this in the text.)
In this case the estimated distance is: Distance = Speed/423.93732 = 48,000/423.93732 = 113.224 Mps
If distance is considered to be a random variable, then it is most correct to regress distance on
velocity. (See part g)
In this case the estimated distance is: 0.0019218*48,000 = 92.25 Mps
Note: According to my understanding, we can assign a velocity to every galaxy due to the redshift in its spectrum. The distance is known only for galaxies in which certain types of stars are
observed. Hence, the distance to most galaxies is derived from this type of computation.
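The two estimates in part d can be reproduced with a short Python sketch. The function and variable names are illustrative and not part of the exam; the two slopes are the values quoted in this solution.

```python
# Slopes quoted in the solution (both from regression through the origin).
B1_VELOCITY_ON_DISTANCE = 423.93732  # km/sec per megaparsec
B1_DISTANCE_ON_VELOCITY = 0.0019218  # megaparsecs per km/sec (see part g)


def distance_by_calibration(velocity):
    """Invert the fitted line Velocity = b1 * Distance."""
    return velocity / B1_VELOCITY_ON_DISTANCE


def distance_by_reverse_regression(velocity):
    """Use the fitted line Distance = b1' * Velocity."""
    return B1_DISTANCE_ON_VELOCITY * velocity


quasar_velocity = 48_000.0  # km/sec
d_calibration = distance_by_calibration(quasar_velocity)
d_reverse = distance_by_reverse_regression(quasar_velocity)
print(f"calibration: {d_calibration:.3f} Mps, reverse regression: {d_reverse:.2f} Mps")
```

The two answers differ because the slope of Distance on Velocity is not the reciprocal of the slope of Velocity on Distance.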
e) (1) Is the estimate in part d an extrapolation? Justify your answer.
Yes, it is an extrapolation. No matter which equation you used in part d, the quasar is very far
outside the range of the galaxy data used to form the prediction.
(The Dept. of Statistics at Penn State has an institute of Astrostatistics. If you are interested in
these types of data, talk with Dr. Jogesh Babu.)
f) (2) From the output given, we can compute $\sum_i y_i^2 = 6511425$ and $\sum_i x_i^2 = 29.518$.
How can $\sum_i x_i^2$ be computed from the output given?
For regression through the origin, the regression sum of squares is
$SSR = \sum_i \hat{y}_i^2 = b_1^2 \sum_i x_i^2$
$b_1$ and SSR are given. So,
$\sum_i x_i^2 = SSR/b_1^2 = 5305022/(423.93732)^2 = 29.518$
g) (3) Suppose that for part d) you regressed Distance on Velocity (i.e. switched the roles of Y
and X). What is the slope of the regression of Distance on Velocity?
$\sum_i x_i^2 = 29.518$
In the regression of Velocity on Distance, $b_1 = 423.93732 = \sum_i x_i y_i / \sum_i x_i^2$
So,
$\sum_i x_i y_i = (29.518)(423.93732) = 12513.782$
The required slope is
$\sum_i x_i y_i / \sum_i y_i^2 = 12513.782/6511425 = 0.0019218$
The units would be Mps/(km/sec).
(1 point if you just invert the slope.)
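The arithmetic in parts f and g can be checked with the sketch below. It uses only the three quantities quoted in this solution (SSR, the slope b1, and the uncorrected sum of squares of velocity); the variable names are illustrative.

```python
# Quantities quoted in the solution for the regression through the origin.
ssr = 5305022.0          # regression sum of squares
b1 = 423.93732           # slope of Velocity (y) on Distance (x)
sum_y2 = 6511425.0       # sum of squared velocities

# Part f: SSR = b1^2 * sum(x_i^2), so sum(x_i^2) = SSR / b1^2.
sum_x2 = ssr / b1 ** 2

# Part g: b1 = sum(x_i y_i) / sum(x_i^2), so sum(x_i y_i) = b1 * sum(x_i^2).
sum_xy = b1 * sum_x2

# Slope of Distance (x) on Velocity (y), again through the origin.
reverse_slope = sum_xy / sum_y2

print(f"sum x^2 = {sum_x2:.3f}")
print(f"reverse slope = {reverse_slope:.7f}, 1/b1 = {1 / b1:.7f}")
```

The printout shows that the reverse slope is smaller than 1/b1, which is why simply inverting the slope earns only partial credit.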
h) (3) Hubble's constant, G, is estimated to be about 75. Is this supported by the data? Do a
formal test using the regression through the origin model. State the null and alternative
hypotheses.
$H_0$: $\beta_1 = 75$
$H_A$: $\beta_1 \neq 75$
$t^* = (423.93732 - 75)/42.15414 = 8.278$
p<.0001
We reject the null hypothesis. G is not 75.
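A minimal sketch of the test statistic in part h follows. The slope and its standard error are the values used in the solution; the critical value 2.069 is an assumption taken from a standard t table for a two-sided 5% test with 23 degrees of freedom, and the variable names are illustrative.

```python
b1 = 423.93732        # estimated slope (regression through the origin)
se_b1 = 42.15414      # standard error of the slope
hypothesized = 75.0   # value of G under the null hypothesis
df_error = 24 - 1     # 24 galaxies, one estimated parameter

t_star = (b1 - hypothesized) / se_b1

# Assumption: tabled t(.975, 23), used here instead of an exact p-value.
T_CRIT_975_23 = 2.069
reject_null = abs(t_star) > T_CRIT_975_23

print(f"t* = {t_star:.3f} on {df_error} df, reject H0: {reject_null}")
```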
i) (3) A galaxy is thought to be 2.1 megaparsecs from earth. Use the regression through the
origin model to compute a 90% interval for its recession velocity.
This should be a prediction interval.
The interval is: $\hat{Y} \pm t_{.95,23}\sqrt{MSE(1+h)}$
$h = x^2/\sum_i x_i^2 = (2.1)^2/29.518 = 0.1494$
$(423.93732)(2.1) \pm 1.714\sqrt{(52452)(1.1494)}$
$= 890.268 \pm 420.850$
$= (469.42, 1311.12)$ km/sec
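The prediction interval in part i can be reproduced as follows. This is a sketch using the numbers quoted in the solution (slope, MSE, sum of squared distances, and the tabled t value 1.714); the function name is illustrative.

```python
import math


def prediction_interval_rto(x_new, b1, mse, sum_x2, t_crit):
    """Prediction interval for a new Y at x_new, regression through the origin."""
    y_hat = b1 * x_new
    h = x_new ** 2 / sum_x2                 # leverage of the new point
    half_width = t_crit * math.sqrt(mse * (1.0 + h))
    return y_hat - half_width, y_hat + half_width


lower, upper = prediction_interval_rto(
    x_new=2.1,        # megaparsecs
    b1=423.93732,
    mse=52452.0,
    sum_x2=29.518,
    t_crit=1.714,     # t(.95, 23) as used in the solution
)
print(f"90% prediction interval: ({lower:.2f}, {upper:.2f}) km/sec")
```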
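2. Suppose $V_1 \sim N(3,5)$ independent of $V_2 \sim N(2,4)$
$W_1 = V_1 + 2V_2$
$W_2 = V_1 - 2V_2$
$W = \begin{pmatrix} W_1 \\ W_2 \end{pmatrix}$
a) (2) What is E(W)?
$E(W) = \begin{pmatrix} E(W_1) \\ E(W_2) \end{pmatrix} = \begin{pmatrix} E(V_1) + 2E(V_2) \\ E(V_1) - 2E(V_2) \end{pmatrix} = \begin{pmatrix} 3 + 2 \cdot 2 \\ 3 - 2 \cdot 2 \end{pmatrix} = \begin{pmatrix} 7 \\ -1 \end{pmatrix}$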
b) (3) What is Var(W)?
$Var(W_1) = Var(V_1) + 4Var(V_2)$
$= 5 + 4 \cdot 4$
$= 21$
$Var(W_2) = Var(W_1) = 21$
$Cov(W_1, W_2) = Var(V_1) - 4Var(V_2)$
$= 5 - 4 \cdot 4$
$= -11$
$Var(W) = \begin{pmatrix} 21 & -11 \\ -11 & 21 \end{pmatrix}$
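The answers to problem 2 can also be obtained in matrix form: writing W = AV, the mean is A E(V) and the variance is A Var(V) A'. The sketch below checks this with plain Python lists; the helper names are illustrative, and the second parameter of N(., .) is read as a variance, as in the solution.

```python
def matmul(p, q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def transpose(p):
    return [list(row) for row in zip(*p)]


A = [[1, 2],      # W1 = V1 + 2 V2
     [1, -2]]     # W2 = V1 - 2 V2
mean_v = [[3], [2]]
var_v = [[5, 0],  # V1 and V2 are independent, so the covariance is 0
         [0, 4]]

mean_w = matmul(A, mean_v)
var_w = matmul(matmul(A, var_v), transpose(A))
print(mean_w, var_w)
```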
3. (4) Suppose that $Y = X\beta + \varepsilon$ where Y and $\varepsilon$ are n x 1, $\beta$ is p x 1 and X is n x p and $E(\varepsilon) = 0$.
Suppose A is n x n and (X'AX) is invertible.
Show that $(X'AX)^{-1}X'AY$ is an unbiased estimator of $\beta$.
$E(Y) = X\beta$
So, $E[(X'AX)^{-1}X'AY] = (X'AX)^{-1}X'A\,E(Y)$
$= (X'AX)^{-1}X'AX\beta$
$= (X'AX)^{-1}(X'AX)\beta$
$= \beta$
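The key step of the proof is that (X'AX)^{-1}X'A applied to E(Y) = X beta returns beta. The sketch below checks that identity exactly on a small made-up example; the matrices X and A and the vector beta are hypothetical values chosen only for illustration (n = 3, p = 2, with A taken to be n x n so that X'AX is defined).

```python
from fractions import Fraction


def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def transpose(p):
    return [list(row) for row in zip(*p)]


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical design matrix, weight matrix and coefficients.
X = [[Fraction(1), Fraction(0)],
     [Fraction(1), Fraction(1)],
     [Fraction(1), Fraction(2)]]
A = [[2, 0, 0],
     [0, 1, 0],
     [0, 0, 3]]
beta = [[Fraction(4)], [Fraction(-1)]]

xt_a = matmul(transpose(X), A)                       # X'A, p x n
estimator = matmul(inverse_2x2(matmul(xt_a, X)), xt_a)  # (X'AX)^{-1} X'A
expected_estimate = matmul(estimator, matmul(X, beta))  # applied to E(Y) = X beta
print(expected_estimate == beta)
```

Exact fractions are used so that the equality holds without rounding error.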
4. The SENIC data consists of a simple random sample of hospitals which was collected in a
study of the risk of patients incurring a new infection when in the hospital for treatment of an
unrelated condition.
We will focus on predicting RISK, the percentage of patients in the hospital who became
infected, from several other variables:
LENGTH: the average length of stay in the hospital (days)
AGE: the average age of patients
NURSES: the number of nurses
BEDS: the number of beds
CENSUS: the average number of occupied beds
Computer output for this problem is in the "Computer Output" handout.
a) (3) What is the theoretical regression model for predicting RISK from the other 5 variables?
Define all of your notation and make sure you include the distribution assumptions.
$Risk_i = \beta_0 + \beta_1 Length_i + \beta_2 Age_i + \beta_3 Nurses_i + \beta_4 Beds_i + \beta_5 Census_i + \varepsilon_i$
where
$Risk_i$ is the risk of infection for hospital i
The predictor variables are as described above
$\beta_0, \ldots, \beta_5$ are unknown regression coefficients
and $\varepsilon_i$ are errors, with $\varepsilon_i$ i.i.d. $N(0, \sigma^2)$
b) (4) (You knew this one was coming!) Fill in the 8 blanks in the ANOVA table.
Analysis of Variance
Source       DF       Sum of Squares   Mean Square   F Value    Pr > F
Model        _5_      _72.27455_       14.45491      _11.98_    <.0001
Error        _107_    _129.10513_      1.20659
Cor. Total   _112_    _201.37968_
R-Square     _0.3589_
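The blanks in the ANOVA table follow from the two mean squares, the number of predictors, and the sample size. The sketch below assumes n = 113 hospitals (the value used for the leverage cut-off later in this solution) and 5 predictors; the variable names are illustrative.

```python
n_hospitals = 113     # assumption: sample size used in the 2p/n cut-off
n_predictors = 5
ms_model = 14.45491
ms_error = 1.20659

df_model = n_predictors
df_error = n_hospitals - n_predictors - 1
df_total = n_hospitals - 1

ss_model = ms_model * df_model
ss_error = ms_error * df_error
ss_total = ss_model + ss_error

r_square = ss_model / ss_total
f_value = ms_model / ms_error

print(df_model, df_error, df_total)
print(f"{ss_model:.5f} {ss_error:.5f} {ss_total:.5f}")
print(f"R-Square = {r_square:.4f}, F = {f_value:.2f}")
```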
c) (5) Below are the plot of RISK versus LENGTH and the partial regression leverage plot of
RISK versus length. Explain the difference between these 2 plots, and 3 things that we might
look for on these plots. Also, identify a high leverage point on at least one of the 2 plots.
[Figure: Partial Regression Residual Plot (The REG Procedure, Model: MODEL1); vertical axis: risk, horizontal axis: length]
[Figure: plot of risk versus length]
The plot of risk versus length uses the actual measured values of the variables. The partial
regression leverage plot has on the X-axis, the residuals of length after regressing length on age,
beds, nurses and census, and on the Y-axis, the residuals of risk after regressing risk on age,
beds, nurses and census.
Some of the things we might look for on these plots:
Plot of risk versus length: The shape of the regression function (curved or linear)
Constancy of spread.
Outliers in risk.
High leverage or outlying values of Length.
Partial regression leverage plot of risk versus length: The shape of the regression function
(curved or linear)
Constancy of spread.
High leverage values.
On either plot we can see that risk has a strong relationship with length. The partial regression
leverage plot is a better plot for detecting linearity and leverage. On that plot, we see that the
relationship is essentially linear, but that there is a high leverage point (in blue) that might induce
curvature or reduce the slope of the line. On both plots we can see that the data are evenly
spread around the line, so the equal variance assumption likely holds.
One curious feature of these plots is that there appear to be 2 high leverage points (in blue) on the
plot of risk versus length, but only one on the partial regression leverage plot. This would
indicate that the other point might be accounted for by one of the other variables in the model.
c) (2) What is the likely effect of the high leverage point on the regression coefficients? Briefly
justify your answer.
The high leverage point(s) lie below the line that goes through the remaining points. So, the
effect of the point(s) is to reduce the regression coefficient of Length.
Or (since I was not specific that I meant the point identified earlier) in general a high leverage
point can change the regression slope so that the line comes closer to the y value for that point.
d) (2) Below is the plot of studentized residuals versus LENGTH. Summarize the important
features of this plot for checking the regression assumptions.
[Figure: plot of studentized residuals versus length]
There are 2 very outlying values of Length, and both of these have negative residuals. The other
points appear to have mean 0 and constant variance. There are no outliers.
(Any 2 for full points)
I also accepted variance is not constant (or even curvature) if it was clear that the effect was due
to the 2 high leverage points – e.g. by a sketch on the plot, or a verbal description.
e) (1) What is the predicted risk for a hospital with
LENGTH=8.82
AGE=58.2
BEDS=80
NURSES=52
CENSUS=51
predicted risk =
1.63109 + 0.37974*8.82 - 0.02308*58.2 + 0.00506*52 + 0.00041805*80 - 0.00362*51
= 3.7492
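The prediction can be checked by pairing each coefficient with its own variable. In the sketch below, 0.00041805 is paired with BEDS (it matches the BEDS coefficient 0.00042 quoted in part h) and 0.00506 with NURSES; the dictionary layout and names are illustrative. With the rounded coefficients shown here the result agrees with 3.7492 to about three decimal places.

```python
intercept = 1.63109
coefficients = {
    "LENGTH": 0.37974,
    "AGE": -0.02308,
    "NURSES": 0.00506,
    "BEDS": 0.00041805,
    "CENSUS": -0.00362,
}
hospital = {
    "LENGTH": 8.82,
    "AGE": 58.2,
    "BEDS": 80,
    "NURSES": 52,
    "CENSUS": 51,
}

# Sum of coefficient * value over the predictors, plus the intercept.
predicted_risk = intercept + sum(
    coefficients[name] * hospital[name] for name in coefficients
)
print(f"predicted risk = {predicted_risk:.4f}")
```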
f) (1) The leverage for the hospital in part e is 0.027.
Is this hospital a high leverage point?
We compare with 2p/n = 2*6/113 = 0.106. This leverage is less than our cut-off, so the hospital
is not a high leverage point.
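The rule of thumb used here, flagging a point when its leverage exceeds 2p/n, is a one-line check. The sketch below takes p = 6 (five predictors plus the intercept) and n = 113 from the solution; the names are illustrative.

```python
n_parameters = 6      # intercept plus 5 predictors
n_hospitals = 113
leverage = 0.027

cutoff = 2 * n_parameters / n_hospitals
is_high_leverage = leverage > cutoff
print(f"cut-off = {cutoff:.3f}, high leverage: {is_high_leverage}")
```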
g) (3) Compute a 95% interval for the mean risk for hospitals with the same values of the
predictor variables as the hospital in part e.
This is a confidence interval, since we are interested in mean risk.
I ask this type of question every year and I put this on the review list:
$Var(\hat{Y}) = \sigma^2 h$ and h was given in part f.
$\hat{Y} \pm t_{.975,107}\sqrt{MSE \cdot h} = 3.7492 \pm 1.98\sqrt{(1.2066)(0.027)}$
$= 3.7492 \pm 0.3574$
$= (3.3918, 4.1066)$
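A sketch of the confidence interval for the mean response, using the estimated variance MSE * h and the numbers quoted in the solution (the t value 1.98 is the one used above; the function name is illustrative):

```python
import math


def mean_response_interval(y_hat, mse, leverage, t_crit):
    """Confidence interval for the mean response at a point with the given leverage."""
    half_width = t_crit * math.sqrt(mse * leverage)
    return y_hat - half_width, y_hat + half_width


ci_lower, ci_upper = mean_response_interval(
    y_hat=3.7492, mse=1.2066, leverage=0.027, t_crit=1.98
)
print(f"95% CI for mean risk: ({ci_lower:.4f}, {ci_upper:.4f})")
```

For a prediction interval for a single new hospital, the term under the square root would be MSE * (1 + h) instead.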
h) (3) Compute a 95% confidence interval for the regression coefficient of BEDS when the other
variables are in the model.
The interval has the form
b4 ± t.975,107*s(b4) = 0.00042 ± 1.98(0.00303)
= 0.00042 ± 0.006
(-0.0056,0.0064)
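The same pattern gives the interval for the BEDS coefficient. The sketch uses the rounded estimate and standard error shown above; the names are illustrative.

```python
b_beds = 0.00042      # estimated coefficient of BEDS
se_beds = 0.00303     # its standard error
t_crit = 1.98         # t(.975, 107) as used in the solution

half_width = t_crit * se_beds
beds_lower = b_beds - half_width
beds_upper = b_beds + half_width
contains_zero = beds_lower < 0.0 < beds_upper
print(f"({beds_lower:.4f}, {beds_upper:.4f}), contains 0: {contains_zero}")
```

Because the interval contains 0, the BEDS coefficient is not significantly different from 0 at the 5% level when the other variables are in the model.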
i) (2) Is there evidence of high multicollinearity among the predictor variables? Briefly justify
your answer.
There are many variance inflation factors higher than 10. So there is high multicollinearity
among the predictor variables.
j) (2) Compute the TOLERANCE for BEDS. What does this tell us?
TOLERANCE = 1/VIF = 1/31.77778 = 0.0315
This is much smaller than 0.1, so BEDS is highly collinear with the other variables.
Alternatively, TOLERANCE = $1 - R^2$ for the regression of BEDS on the other predictors. So, the
percentage of the variance in BEDS explained by the other variables in the model is 96.85%.
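The relationship between the variance inflation factor, the tolerance, and the R-square of BEDS on the other predictors can be written out as a short sketch (the VIF is the value quoted above; the names are illustrative):

```python
vif_beds = 31.77778

tolerance = 1.0 / vif_beds              # TOLERANCE = 1 / VIF
r2_beds_on_others = 1.0 - tolerance     # since TOLERANCE = 1 - R^2

print(f"tolerance = {tolerance:.4f}")
print(f"R^2 of BEDS on the other predictors = {100 * r2_beds_on_others:.2f}%")
```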
k) (3) Test whether the effect of LENGTH on risk is statistically significant when the other
variables are in the model. Be sure to include your null and alternative hypothesis, p-value and
conclusion. You may use whatever statistics you can obtain from the computer output, and
compute the rest.
$H_0$: $\beta_1 = 0$
$H_A$: $\beta_1 \neq 0$
This test statistic can usually be found on the computer output, but since it was deleted:
$t^* = (b_1 - 0)/s(b_1) = 0.37974/0.06808 = 5.5778$
Alternatively, you can use $F^* = SSII(length)/MSE = 31.1139 = (t^*)^2$
Naturally, the p-value is the same both ways: p<.001
The effect of length is statistically significant when the other variables are in the model.
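The equivalence of the t-test and the Type II F-test for a single coefficient can be checked numerically. The sketch uses the rounded coefficient and standard error shown above, so the squared t statistic matches the quoted F* = 31.1139 only up to rounding; the names are illustrative.

```python
b_length = 0.37974
se_length = 0.06808

t_star = b_length / se_length     # tests H0: beta_1 = 0
f_star = t_star ** 2              # equals SSII(length) / MSE up to rounding

print(f"t* = {t_star:.4f}, t*^2 = {f_star:.4f}")
```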
l) (2) The study administrator notes that the p-values for the coefficients of average patient age,
number of beds and the census are not statistically significant and concludes none of these
variables contributes significantly to the risk. Is this conclusion statistically justified? Briefly
explain your answer.
The t-tests can only be used to determine the statistical significance of 1 of the variables when
ALL of the other variables are in the model. So, the t-tests cannot be used to conclude that none
of the 3 variables contributes significantly to the risk.
m) (3) The biostatistical staff suggest doing a simultaneous F-test to determine whether AGE,
BEDS and CENSUS are jointly significant when the other variables are in the model.
They compute the F-statistic:
$F^* = \dfrac{[SSI(AGE)+SSI(BEDS)+SSI(CENSUS)]/3}{MSE(full)} = \dfrac{(2.07506+2.76081+1.04570)/3}{1.20659}$
and compare the result with F(3, 107)
Is this test statistically justified?
The staff are on the right track – they do want to use a simultaneous F-test – but the computation
is incorrect. The SSI depend on having the variables in the model in the right order. For the test
required here, age, beds and census must be the last 3 variables in the model. So this is the
wrong test statistic.
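For comparison, the general linear test is built from the error sums of squares of the reduced and full models. The sketch below shows that form and the value the staff computed. SSE for the reduced model (LENGTH and NURSES only) is not given in the output, so the last lines only illustrate that the staff's statistic would be correct if the three sequential sums of squares happened to add up to the extra sum of squares; the function name is illustrative.

```python
def general_linear_f(sse_reduced, df_reduced, sse_full, df_full):
    """F* = [(SSE(R) - SSE(F)) / (df_R - df_F)] / [SSE(F) / df_F]."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    return numerator / (sse_full / df_full)


# The statistic as the staff computed it, from sequential (Type I) SS.
ssi_sum = 2.07506 + 2.76081 + 1.04570
staff_f = (ssi_sum / 3) / 1.20659

# Illustration only: IF the extra SS for AGE, BEDS and CENSUS equalled ssi_sum,
# the general linear test would reproduce the staff's value.
sse_full, df_full = 129.10513, 107
illustrative_f = general_linear_f(sse_full + ssi_sum, df_full + 3, sse_full, df_full)
print(f"staff F* = {staff_f:.4f}, illustrative F* = {illustrative_f:.4f}")
```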
n) (2) What is the percent variance explained by BEDS and CENSUS when the other variables
are in the model?
The percent variance explained by BEDS and CENSUS when the other variables are in the
model is
[SSI(Beds) + SSI(Census)]/SSTo = (2.76081+1.04570)/201.3797 = 0.0189 (1.89%)
o) (2) What is the percent variance explained by AGE when the other variables are in the model?
The percent variance explained by AGE when the other variables are in the model is
SSII(Age)/SSTo = 1.10841/201.3797 = 0.0055 (0.55%)
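Both percentages in parts n and o are ratios of an extra sum of squares to the corrected total sum of squares. A sketch with the values quoted above (names are illustrative):

```python
ss_total = 201.3797

# Part n: BEDS and CENSUS are the last two variables in the model,
# so their sequential (Type I) SS add up to the extra SS for the pair.
ssi_beds = 2.76081
ssi_census = 1.04570
share_beds_census = (ssi_beds + ssi_census) / ss_total

# Part o: AGE given all the other variables uses its Type II SS.
ssii_age = 1.10841
share_age = ssii_age / ss_total

print(f"BEDS and CENSUS: {100 * share_beds_census:.2f}%")
print(f"AGE: {100 * share_age:.2f}%")
```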