Assignment

Stat 501 Nov. 17 Lab and homework assignment. Due Monday Nov. 22. Write your answers on
a separate sheet of paper.
1. Use the dataset prostatecancer.mtw at www.stat.psu.edu/~rho/501data/. The dataset consists of n = 97
prostate cancer patients. The variables are y = CancerVol = cancer volume; x1 = PSA_level = prostate specific
antigen, a blood chemistry measurement believed to predict the presence of prostate cancer; and x2 = Capsular,
a measure of the invasiveness of the cancer. The dataset also includes a column giving an ID number for each
patient.
A. Plot CancerVol versus PSA. (1) Identify any unusual points. Give the ID number and values of the
variables for points that are unusual. Hint: Hold the mouse cursor on a point in the graph to see the data
value(s) for the point. (2) Discuss possible interpretations of the plot.
B. Plot CancerVol versus Capsular. (1) Identify any unusual points. Give the ID number and values of the
variables for points that are unusual. (2) Discuss possible interpretations of the plot.
C. Plot PSA versus Capsular. Identify any unusual points. Give the ID number and values of the variables
for points that are unusual. NOTE: This is a plot of the two x-variables.
D. Fit the model E(y) = β0 + β1 x1 + β2 x2 to the whole dataset. Use the Storage button to store the Deleted t
residuals, DFITS, and Cook’s D values. (1) What is the estimated regression equation? (2) What is the
value of R²?
E. In the output for the previous part (not the storage columns): (1) Which observations are listed as having
unusual x-values? (2) Which observations are listed as having unusually large residuals?
F. Plot the DFITS versus ID number. Identify any “extreme” points. (The book’s criterion for extreme is
that |DFITS| > 1.) For each point, give the patient number and the value of DFITS.
G. Explain (in general) what is measured by a DFITS value.
H. Plot Cook’s D values versus ID number. Identify the most extreme values. Which observation(s) have
extreme Cook’s D values, and what are the values of Cook’s D for these observations?
I. Plot the deleted t residuals versus ID. Identify any “extreme” data points. Give the observation number
and value of the deleted t residuals.
J. Based on the results so far, which observations do you think should be deleted? Why?
K. Delete the points you list in the previous part. With these points deleted, redo the regression you did in
part D. (1) What is the estimated regression equation? (2) What is the value of R²?
L. For the regression done in the previous part, plot residuals versus fits. Interpret the plot.
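The diagnostics stored in part D (deleted t residuals, DFITS, Cook’s D) can also be computed by hand from the ordinary residuals and the leverages. The sketch below is only an illustration under simplifying assumptions: it uses one predictor rather than the two in this problem, the eight data points are made up and are not from prostatecancer.mtw, and the function name influence_diagnostics is hypothetical. The formulas are the standard textbook ones; Minitab's own computations are not reproduced here.

```python
import math

def influence_diagnostics(x, y):
    """Leverage, deleted t residual, DFFITS and Cook's D for a one-predictor fit."""
    n, p = len(x), 2  # p = number of coefficients (intercept + slope)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in resid)
    mse = sse / (n - p)
    rows = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - x_bar) ** 2 / sxx                # leverage of this case
        mse_del = (sse - e ** 2 / (1 - h)) / (n - p - 1)   # MSE with this case removed
        t_del = e / math.sqrt(mse_del * (1 - h))           # deleted t residual
        dffits = t_del * math.sqrt(h / (1 - h))            # standardized change in own fit
        cooks_d = e ** 2 * h / (p * mse * (1 - h) ** 2)    # overall change in all fits
        rows.append({"h": h, "t": t_del, "dffits": dffits, "cooks_d": cooks_d})
    return rows

# Made-up illustration data: the last case is far from the others in x
# and well off the trend of the first seven points.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 20.0]

diagnostics = influence_diagnostics(x, y)
for case_id, d in enumerate(diagnostics, start=1):
    flag = "  <-- |DFFITS| > 1" if abs(d["dffits"]) > 1 else ""
    print(f"{case_id}: h={d['h']:.3f} t={d['t']:.2f} "
          f"DFFITS={d['dffits']:.2f} CookD={d['cooks_d']:.2f}{flag}")
```

Each diagnostic answers a slightly different question, which is why parts F, H, and I ask for three separate plots: the deleted t residual measures how far a case lies from the fit made without it, while DFFITS and Cook’s D also weight that distance by the case's leverage.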
2. Use the dataset party200.mtw. We’ll use logistic regression to predict the probability that a student
says they have ever driven under the influence of alcohol, based on X = the number of days per month they say
they drink at least two beers.
The underlying model of logistic regression for one X-variable is

    p = e^(β0 + β1 X) / (1 + e^(β0 + β1 X))

where p = probability of falling into a category of interest (saying yes to having driven
under the influence).
Use Stat>Regression>Binary Logistic Regression. Enter “DrvDrnk” as “Response”, enter “DaysBeer” in
the Model box, then click Storage and choose to store “Event Probabilities” (the item is on the right side of the dialog box).
A. What are the estimated values of β0 and β1? (See output in session window.)
B. Plot the stored event probabilities versus DaysBeer. Based on this plot, estimate the probability of ever
having driven under the influence for (1) DaysBeer = 0, (2) DaysBeer=10, and (3) DaysBeer = 20.
Remember that you can hold the mouse over a point to identify the coordinate values.
C. Use the equation given for the model to calculate predicted probabilities for (1) DaysBeer = 0 and (2)
DaysBeer=10.
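As a rough illustration of the hand calculation in part C, the logistic model equation can be wrapped in a small function. The function name event_probability and the coefficient values below are hypothetical placeholders, not the estimates from party200.mtw; substitute the values of β0 and β1 reported in the session window.

```python
import math

def event_probability(x, b0, b1):
    """p = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)) for a one-predictor logistic model."""
    eta = b0 + b1 * x  # linear predictor
    return math.exp(eta) / (1 + math.exp(eta))

# Hypothetical coefficients for illustration only, NOT the fitted estimates.
b0_example, b1_example = -1.0, 0.1

for days in (0, 10, 20):
    p = event_probability(days, b0_example, b1_example)
    print(f"DaysBeer = {days:2d}: estimated probability = {p:.3f}")
```

With a positive slope the probability rises with DaysBeer but always stays between 0 and 1, which is the S-shaped pattern to look for in the plot from part B.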
3. Data from n = 19 female bears are used to develop an equation for estimating Y = female bear’s weight
from X = bear’s neck circumference. One of the observations has X = 10, Y = 140. Regression results shown
below give equations when this observation is included in the calculations and also when it is not.
All observations used: Regr. equation is Weight = - 108.3 + 16.95 Neck
X= 10 Y = 140 not used in calculations: Regr. equation is Weight = - 194.1 + 20.54 Neck
A. Calculate an unstandardized value of DFFIT for the observation X = 10, Y = 140. Show any work.
B. Calculate an unstandardized deleted residual for the observation X = 10, Y = 140. Show any work.
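One way to organize the arithmetic for parts A and B is sketched below. It assumes the usual definitions: the unstandardized DFFIT is the fitted value from the full-data equation minus the fitted value from the equation with the observation left out, and the deleted residual is the observed Y minus that left-out fitted value. The helper name fitted_weight is hypothetical; the coefficients are the two equations given above.

```python
def fitted_weight(intercept, slope, neck):
    """Predicted weight from a fitted line: intercept + slope * neck."""
    return intercept + slope * neck

neck, weight = 10, 140  # the observation in question

# Fitted value at Neck = 10 from each of the two equations given in the problem.
fit_all = fitted_weight(-108.3, 16.95, neck)      # all 19 observations used
fit_deleted = fitted_weight(-194.1, 20.54, neck)  # this observation not used

dffit = fit_all - fit_deleted            # assumed definition: change in the fitted value
deleted_residual = weight - fit_deleted  # assumed definition: observed minus left-out fit

print("Fit using all observations:", round(fit_all, 2))
print("Fit with observation deleted:", round(fit_deleted, 2))
print("Unstandardized DFFIT:", round(dffit, 2))
print("Unstandardized deleted residual:", round(deleted_residual, 2))
```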
4. For the hospital infection risk dataset, the following output is for y = infection risk, x1 = average length
of Stay, x2 = Beds in hospital, and x3 = average daily Census of patients. VIFs are given as part of
the output, and pairwise correlations are given below the regression output.
Predictor       Coef    SE Coef      T      P    VIF
Constant      0.8230     0.6138   1.34  0.183
Stay         0.33538    0.06706   5.00  0.000    1.4
Beds        0.002253   0.003017   0.75  0.457   29.7
Census     -0.001421   0.003921  -0.36  0.718   31.9

Correlations: InfctRsk, Stay, Beds, Census

          InfctRsk    Stay    Beds
Stay         0.533
Beds         0.360   0.409
Census       0.381   0.474   0.981

A. Explain why the VIF values for the variables Beds and Census are so much larger than the VIF value for
Stay.
B. The initials VIF stand for “variance inflation factor.” What variance is being referred to in this phrase?
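The link between the pairwise correlations and the VIF values can be checked numerically. The sketch below relies on a standard identity: the VIF for predictor j equals 1/(1 − Rj²), where Rj² comes from regressing that predictor on the others, and this is the j-th diagonal element of the inverse of the predictors' correlation matrix. The correlations are the rounded values from the output above, so the results should match the printed VIFs only approximately. The function name invert is a hypothetical helper written here to avoid needing any external library.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]  # right half is now the inverse

# Correlations among the three predictors, taken from the output above.
names = ["Stay", "Beds", "Census"]
corr = [
    [1.000, 0.409, 0.474],
    [0.409, 1.000, 0.981],
    [0.474, 0.981, 1.000],
]

inverse = invert(corr)
vif = {name: inverse[i][i] for i, name in enumerate(names)}
for name in names:
    print(f"VIF for {name}: {vif[name]:.1f}")
```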