FinalExamF04

Statistics 511
Final Exam
Dec. 16, 2004
4:40-6:30 p.m.
The following rules apply.
1. You may use up to 3 pages of notes, double-sided, any font.
2. You may use a calculator.
3. You may not collaborate or copy.
4. Failure to comply with item 3 could lead to a reduction in your grade or to disciplinary action.
I have read the rules above and agree to comply with them.
Signature ________________________________________________
Name (printed) ___________________________________________
Statistics 511
Final Exam
Fall 2004
1) Several variables were collected on 97 men with prostate cancer. The doctors would like to be able to
determine which cancers will become invasive based on measurements of PSA (a blood chemical) and
the cancer volume (CancerVol) which can be estimated noninvasively.
a) Below are loess fits to the regression of invasion probability on PSA (top plot) and invasion probability
on Cancer Volume (lower plot) (2 separate regression fits). Does it look like logistic regression will
provide an adequate fit to the data? Briefly justify your response.
[Figure: two loess fits, smoothing parameter = 0.7. Top plot: Invasive against PSA (horizontal axis, 0 to 300). Lower plot: Invasive against CancerVol (horizontal axis, 0 to 50). Vertical axis in both plots: -0.1 to 1.1.]
Below is the SAS output for the logistic regression of Invasive (1=invasive cancer, 0=noninvasive
cancer) on PSA and CancerVol.
The LOGISTIC Procedure

Model Information
Data Set                     WORK.PROSTATE
Response Variable            Invasive
Number of Response Levels    2
Number of Observations       97
Model                        binary logit
Optimization Technique       Fisher's scoring

Response Profile
Ordered                  Total
Value       Invasive     Frequency
1           1            21
2           0            76

Probability modeled is Invasive=1.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
               Intercept    Intercept and
Criterion      Only         Covariates
AIC            103.353      65.503
SC             105.927      73.227
-2 Log L       101.353      59.503

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    41.8498       2     <.0001
Score               37.1807       2     <.0001
Wald                18.1894       2     0.0001

Analysis of Maximum Likelihood Estimates
                              Standard    Wald
Parameter    DF    Estimate   Error       Chi-Square    Pr > ChiSq
Intercept    1     -3.8156    0.6962      30.0399       <.0001
PSA          1     0.0675     0.0254      7.0624        0.0079
CancerVol    1     0.1141     0.0547      4.3440        0.0371

Odds Ratio Estimates
             Point       95% Wald
Effect       Estimate    Confidence Limits
PSA          1.070       1.018    1.124
CancerVol    1.121       1.007    1.248
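The odds ratio estimates in the output can be traced back to the coefficient table. The sketch below is illustrative only: it assumes the usual Wald construction, exp(estimate ± z × standard error), with z taken as the 97.5th percentile of the standard normal. The function name `wald_odds_ratio` is a hypothetical helper, not part of the SAS output.

```python
import math

# Estimates and standard errors copied from the coefficient table in the output.
estimates = {"PSA": (0.0675, 0.0254), "CancerVol": (0.1141, 0.0547)}
Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution


def wald_odds_ratio(beta, se, z=Z_975):
    """Return (odds ratio, lower limit, upper limit) on the odds scale."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


odds_ratios = {name: wald_odds_ratio(b, se) for name, (b, se) in estimates.items()}
for name, (or_hat, lower, upper) in odds_ratios.items():
    print(f"{name}: OR = {or_hat:.3f}, 95% Wald limits = ({lower:.3f}, {upper:.3f})")
```

The limits on the coefficient scale are the same quantities before exponentiation, which is why each interval is asymmetric around the point estimate on the odds scale.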
[Figure: estimated probability (vertical axis, 0.0 to 1.0) against PSA (horizontal axis, 0 to 300).]
Predicted probability versus PSA (from regression on PSA and CancerVol)
[Figure: estimated probability (vertical axis, 0.0 to 1.0) against CancerVol (horizontal axis, 0 to 50).]
Predicted probability versus CancerVol (from regression on PSA and CancerVol)
b) Test H0: β1 = β2 = 0
versus HA: at least one of the two coefficients is not zero.
Test Statistic:
Distribution of the Test Statistic Under the Null Hypothesis
P-value
Conclusion (stated in words)
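As a rough illustration of where a likelihood ratio statistic for this kind of global test comes from, the sketch below differences the two -2 Log L values printed in the Model Fit Statistics. It assumes a chi-square reference distribution with 2 degrees of freedom (one per slope set to zero), for which the upper tail probability has the closed form exp(-x/2). The variable names are illustrative.

```python
import math

# -2 Log L values copied from the Model Fit Statistics in the output.
neg2_log_l_null = 101.353   # intercept only
neg2_log_l_full = 59.503    # intercept and covariates
df = 2                      # two slopes are zero under H0

lr_stat = neg2_log_l_null - neg2_log_l_full
# For exactly 2 degrees of freedom, P(chi-square > x) = exp(-x / 2).
p_value = math.exp(-lr_stat / 2)
print(f"LR chi-square = {lr_stat:.4f} on {df} df, p = {p_value:.2e}")
```

The difference agrees with the Likelihood Ratio line of the output up to the rounding of the printed -2 Log L values.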
c) Compute a 95% confidence interval for the regression coefficient for PSA.
d) A patient comes to the clinic with prostate cancer. His PSA is 100 and his Cancer Volume is 10. What is
his predicted probability of having invasive cancer?
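A minimal sketch of how a fitted logistic model turns covariate values into a probability, assuming the coefficient estimates printed in the output and the standard inverse-logit transformation. The function name `predicted_probability` is a hypothetical helper.

```python
import math

# Coefficient estimates copied from the Analysis of Maximum Likelihood Estimates.
INTERCEPT, B_PSA, B_CANCERVOL = -3.8156, 0.0675, 0.1141


def predicted_probability(psa, cancer_vol):
    """Inverse logit of the fitted linear predictor."""
    eta = INTERCEPT + B_PSA * psa + B_CANCERVOL * cancer_vol
    return 1.0 / (1.0 + math.exp(-eta))


p_hat = predicted_probability(psa=100, cancer_vol=10)
print(f"predicted probability of invasive cancer: {p_hat:.3f}")
```

The same function can be evaluated at other covariate values, for example to compare with the predicted-probability plots.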
e) A new drug has been developed that lowers PSA in men with prostate cancer. The drug company argues
that the data presented here provide evidence that taking this drug will lower the risk of invasive cancer in
men that have been diagnosed with prostate cancer and have elevated PSA. Do you agree with the drug
company? Briefly support your answer.
Sulfur dioxide (SO2) is an important atmospheric pollutant. The level of SO2 varies considerably across the
country. An investigator wants to predict the level of SO2 using the 22 variables in the table below. V1-V6
are composite variables taken from the Current Population Index.
Variables for predicting SO2
Variable    Description
YEARTEMP    Mean annual temperature
MANUFACT    Manufacturing output
POP70       Population - 1970
SPEEDWIN    Mean annual wind speed
PRECIP      Mean total annual precipitation
DAYNUM      Mean days of precipitation
CARS70      Number of registered cars 1970
GAS         Mean annual gasoline sales
HUMIDITY    Mean daily humidity
ROADS       Miles of roads - 1970
MinTemp     Mean Minimum Temperature
MaxTemp     Mean Maximum Temperature
ALTITUDE    Altitude
FOREST      Percent forested
TRUCKS      Number of registered trucks - 1970
COAL        Percent of electrical power generated by coal
V1-V6       Composite variables
Questions 2, 3 and 4 all refer to these data.
2. The investigator felt that 22 variables were too many for practical use. Hence he decided to use all
subsets regression to select a smaller set of variables.
Below is some output from all subsets regression.
Number in
Model      R-Square   SBC         Variables in Model
1          0.4157     243.15913   MANUFACT
2          0.5863     232.71636   MANUFACT POP70
3          0.6198     232.96993   MANUFACT POP70 V5
4          0.6680     231.13065   MANUFACT POP70 V3 V5
5          0.7195     227.92702   MANUFACT POP70 FOREST TRUCKS V3
6          0.7860     220.54826   MANUFACT POP70 PRECIP GAS ALTITUDE V3
7          0.8070     220.02561   POP70 PRECIP CARS HUMIDITY ROADS ALTITUDE V3
8          0.8268     219.30165   MANUFACT POP70 PRECIP FOREST COAL V2 V3 V6
9          0.8346     221.11842   MANUFACT POP70 PRECIP GAS FOREST COAL V2 V3 V6
10         0.8453     222.10022   MANUFACT POP70 PRECIP HUMIDITY ROADS FOREST COAL V2 V3 V6
11         0.8561     222.84264   MANUFACT POP70 PRECIP DAYNUM GAS ALTITUDE V1 V2 V3 V5 V6
12         0.8639     224.27753   MANUFACT POP70 PRECIP DAYNUM GAS MINTEMP MAXTEMP ALTITUDE V1 V2 V3
                                  V5
13         0.8706     225.91078   MANUFACT POP70 PRECIP DAYNUM GAS MINTEMP MAXTEMP ALTITUDE V1 V2 V3
                                  V5 V6
14         0.8775     227.39496   YEARTEMP MANUFACT POP70 DAYNUM CARS MINTEMP MAXTEMP ALTITUDE TRUCKS COAL
                                  V1 V2 V3 V5
15         0.8802     230.17261   MANUFACT POP70 DAYNUM CARS GAS MINTEMP MAXTEMP ALTITUDE TRUCKS COAL
                                  V1 V2 V3 V5 V6
16         0.8829     232.96952   MANUFACT POP70 DAYNUM CARS GAS ROADS MINTEMP MAXTEMP ALTITUDE TRUCKS
                                  COAL V1 V2 V3 V5 V6
17         0.8878     234.93011   MANUFACT POP70 DAYNUM CARS GAS HUMIDITY ROADS MINTEMP MAXTEMP FOREST
                                  TRUCKS COAL V1 V2 V3 V5 V6
18         0.8884     238.41900   MANUFACT POP70 SPEEDWIN DAYNUM CARS GAS HUMIDITY ROADS MINTEMP MAXTEMP
                                  ALTITUDE TRUCKS COAL V1 V2 V3 V5 V6
19         0.8888     241.96613   MANUFACT POP70 PRECIP DAYNUM CARS GAS HUMIDITY ROADS MINTEMP MAXTEMP
                                  ALTITUDE TRUCKS COAL V1 V2 V3 V4 V5 V6
20         0.8894     245.46001   MANUFACT POP70 SPEEDWIN PRECIP DAYNUM CARS GAS HUMIDITY ROADS
                                  MINTEMP MAXTEMP ALTITUDE FOREST TRUCKS COAL V1 V2 V3 V5 V6
21         0.8899     248.99479   MANUFACT POP70 SPEEDWIN PRECIP DAYNUM CARS GAS HUMIDITY ROADS
                                  MINTEMP MAXTEMP ALTITUDE FOREST TRUCKS COAL V1 V2 V3 V4 V5 V6
22         0.8901     252.65771   YEARTEMP MANUFACT POP70 SPEEDWIN PRECIP DAYNUM CARS GAS HUMIDITY
                                  ROADS MINTEMP MAXTEMP ALTITUDE FOREST TRUCKS COAL V1 V2 V3
                                  V4 V5 V6
Plot of SBC versus number of parameters
[Figure: SBC (vertical axis, 215 to 255) against P (horizontal axis, 0 to 25).]
Plot of R2 versus number of parameters
[Figure: R2 (vertical axis, 0.4 to 0.9) against P (horizontal axis, 0 to 25).]
a) Based on this output, about how many variables should be in the final model? Justify your answer
briefly.
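One way to read the SBC column is to look for the model size at which SBC is smallest and at the sizes that come close to it. The sketch below does that with the values from the all subsets output. The 2-unit tolerance is a hypothetical choice made for illustration, not a rule stated in the exam.

```python
# SBC of the best model of each size, copied from the all subsets output.
sbc_by_size = {
    1: 243.15913, 2: 232.71636, 3: 232.96993, 4: 231.13065, 5: 227.92702,
    6: 220.54826, 7: 220.02561, 8: 219.30165, 9: 221.11842, 10: 222.10022,
    11: 222.84264, 12: 224.27753, 13: 225.91078, 14: 227.39496, 15: 230.17261,
    16: 232.96952, 17: 234.93011, 18: 238.41900, 19: 241.96613, 20: 245.46001,
    21: 248.99479, 22: 252.65771,
}

# Model size with the smallest SBC.
best_size = min(sbc_by_size, key=sbc_by_size.get)

# Hypothetical tolerance: sizes whose SBC is within 2 units of the minimum.
TOLERANCE = 2.0
near_best = sorted(
    size for size, sbc in sbc_by_size.items()
    if sbc - sbc_by_size[best_size] <= TOLERANCE
)
print(best_size, near_best)
```

R-Square is not used here because it can only increase as variables are added, so it has no minimum or maximum short of the full model.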
b) The investigator selected a candidate model, and looked at some of the resulting residual plots.
Two typical plots are below. Based on these 2 plots, the investigator decided to make some
adjustments to the data and model. What advice would you give about “adjustments” such as
transforming variables or removing unusual data values? (2 specific pieces of advice with supporting
evidence relying on the plots.)
[Figure: residual plot (vertical axis, -2 to 4) against MANUFACT (horizontal axis, 0 to 1800).]
[Figure: residual plot (vertical axis, -2 to 4) against Predicted Value (horizontal axis, 10 to 70).]
c. Another investigator suggested using a stepwise method to select variables for this study. Give 2
reasons why all subsets regression is better for selecting variables in this study than a stepwise
method.
d. Other investigators using SO2 in pollution studies transformed it to log(SO2). If this investigator
decides to predict log(SO2) instead of SO2, does he need to redo the variable selection, or can he use
one of the selected models? Explain your answer.
3. A student looking at the pollution data decided that log(SO2)=LSO2 could probably be predicted
from mean windspeed alone, using polynomial regression. Some of the relevant output is below.
Plot of LSO2 versus windspeed. The loess curve is plotted using squares.
[Figure: LSO2 (vertical axis, 2 to 5) against SPEEDWIN (horizontal axis, 6 to 13), with the loess curve.]
Dependent Variable: LSO2

Analysis of Variance
                         Sum of       Mean
Source             DF    Squares      Square     F Value    Pr > F
Model              4     2.88858      0.72214    1.76       0.1597
Error              35    14.38580     0.41102
Corrected Total    39    17.27438

Root MSE          0.64111     R-Square    0.1672
Dependent Mean    3.11432     Adj R-Sq    0.0720
Coeff Var         20.58592
Parameter Estimates
                   Parameter      Standard                                         Variance
Variable     DF    Estimate       Error       t Value    Pr > |t|    Type I SS     Inflation
Intercept    1     10.85596       81.05424    0.13       0.8942      387.95874     0
SPEEDWIN     1     -3.01474       36.34083    -0.08      0.9344      0.05255       259310
SPEEDWIN2    1     0.32819        5.98802     0.05       0.9566      2.16155       2547486
SPEEDWIN3    1     -0.00547       0.43051     -0.01      0.9899      0.67377       2869922
SPEEDWIN4    1     -0.00047328    0.01141     -0.04      0.9671      0.00070716    368171
a) Do an overall F-test of whether any of the regression coefficients are non-zero. State your
conclusion clearly.
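The F value and R-Square in the Analysis of Variance table follow from the sums of squares and degrees of freedom. The sketch below recomputes them as a consistency check on the output. It uses only the printed values, and the variable names are illustrative.

```python
# Sums of squares and degrees of freedom copied from the Analysis of Variance table.
ss_model, df_model = 2.88858, 4
ss_error, df_error = 14.38580, 35

ms_model = ss_model / df_model      # mean square for the model
ms_error = ss_error / df_error      # mean square error
f_value = ms_model / ms_error       # compared with an F(4, 35) distribution
r_square = ss_model / (ss_model + ss_error)

print(f"F = {f_value:.2f} on ({df_model}, {df_error}) df, R-Square = {r_square:.4f}")
```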
b) Do sequential unpooled testing to determine the appropriate degree for a polynomial fit to the
data. What is the appropriate degree? How does your answer correspond to your response in part a?
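A sketch of how sequential F ratios can be formed from the Type I sums of squares. It assumes that "unpooled" means every term is tested against the error mean square of the full degree-4 fit, with 1 and 35 degrees of freedom. That reading is an assumption and should be checked against the course's definition. Critical values are left to a table lookup.

```python
# Type I (sequential) sums of squares, in the order the terms entered the model.
type1_ss = [
    ("SPEEDWIN", 0.05255),
    ("SPEEDWIN2", 2.16155),
    ("SPEEDWIN3", 0.67377),
    ("SPEEDWIN4", 0.00070716),
]
MSE_FULL, DF_ERROR = 0.41102, 35   # error mean square and df of the degree-4 fit

# Start with the highest-order term and work downward.
sequential_f = [(name, ss / MSE_FULL) for name, ss in reversed(type1_ss)]
for name, f_ratio in sequential_f:
    print(f"{name}: F = {f_ratio:.3f} on (1, {DF_ERROR}) df")
```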
c) The investigator is concerned about the very high variance inflation factors. What effect does
the variance inflation factor have on your tests in parts a) and b) above?
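To give a sense of scale for the printed variance inflation factors, the sketch below uses the standard identity VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others, and reports the square root of each VIF as the factor by which that coefficient's standard error is inflated. The dictionary name `implied` is illustrative.

```python
import math

# Variance inflation factors copied from the Parameter Estimates table.
vif = {
    "SPEEDWIN": 259310,
    "SPEEDWIN2": 2547486,
    "SPEEDWIN3": 2869922,
    "SPEEDWIN4": 368171,
}

# For each term: (R^2 of that term on the other terms, standard error inflation factor).
implied = {name: (1 - 1 / v, math.sqrt(v)) for name, v in vif.items()}
for name, (r2, se_factor) in implied.items():
    print(f"{name}: R^2 with other terms = {r2:.7f}, SE inflated by about {se_factor:.0f}x")
```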
4. After looking at the results obtained by the student and the variable selection results, the investigator
decided to fit a polynomial of degree 2 in YEARTEMP, SPEEDWIN and PRECIP including all first
order interactions. Some of the output is below:
The SAS System
The REG Procedure
Model: MODEL1
Dependent Variable: LSO2

Analysis of Variance
                         Sum of       Mean
Source             DF    Squares      Square     F Value    Pr > F
Model              9     11.58682     1.28742    6.79       <.0001
Error              30    5.68756      0.18959
Corrected Total    39    17.27438

Root MSE          0.43541     R-Square    0.6708
Dependent Mean    3.11432     Adj R-Sq    0.5720
Coeff Var         13.98104

Parameter Estimates
                    Parameter      Standard
Variable      DF    Estimate       Error         t Value    Pr > |t|    Type I SS    Type II SS
Intercept     1     12.50779       10.37681      1.21       0.2375      387.95874    0.27545
YEARTEMP      1     -0.17487       0.24891       -0.70      0.4878      5.12074      0.09357
SPEEDWIN      1     -0.10851       1.12216       -0.10      0.9236      1.13771      0.00177
PRECIP        1     -0.08986       0.17204       -0.52      0.6053      1.94376      0.05172
YEARTEMP2     1     0.00033096     0.00199       0.17       0.8693      0.04366      0.00522
SPEEDWIN2     1     -0.05717       0.03350       -1.71      0.0983      0.86568      0.55198
PRECIP2       1     -0.00016982    0.00088005    -0.19      0.8483      0.31964      0.00706
SPEEDxPREC    1     0.01792        0.01035       1.73       0.0938      2.05591      0.56773
TEMPxSPEED    1     0.00782        0.01079       0.73       0.4740      0.08259      0.09967
TEMPxPREC     1     -0.00057411    0.00191       -0.30      0.7659      0.01712      0.01712
Assume that the regression assumptions are satisfied.
a) Do pooled sequential testing (one term at a time) to determine if any interaction effects are needed in
the model.
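A sketch of one reading of pooled sequential testing. It assumes that terms are tested one at a time in reverse order of entry, using their Type I sums of squares, and that a term judged unnecessary has its sum of squares and degree of freedom pooled into the error before the next test. The cutoff `F_CRIT = 4.17` is an assumed approximate 5% critical value for F(1, 30). It is slightly conservative for 31 and 32 error degrees of freedom and should be replaced by a proper table lookup.

```python
# Type I SS of the interaction terms, last-entered term first.
interactions = [("TEMPxPREC", 0.01712), ("TEMPxSPEED", 0.08259), ("SPEEDxPREC", 2.05591)]
sse, df_error = 5.68756, 30   # error line of the full model
F_CRIT = 4.17                 # assumed: approximate 5% point of F(1, 30)

steps = []
for name, ss in interactions:
    f_value = ss / (sse / df_error)   # test against the current pooled MSE
    keep = f_value > F_CRIT
    steps.append((name, round(f_value, 3), df_error, keep))
    if keep:
        break                         # stop at the first term that is retained
    # Pool the dropped term's sum of squares and df into the error.
    sse, df_error = sse + ss, df_error + 1

for name, f_value, df_used, keep in steps:
    print(f"{name}: F = {f_value} on (1, {df_used}) df, retained = {keep}")
```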
b) Below is the plot of studentized residual versus MANUFACT. Do you think that MANUFACT
should be included in the model? Support your answer briefly referring to the plot.
[Figure: studentized residual (vertical axis, -2.0 to 2.0) against MANUFACT (horizontal axis, 0 to 1800).]