Stat 501 Nov. 17 Lab and homework assignment. Due Monday Nov. 22. Write answers on different paper from this.

1. Use the dataset prostatecancer.mtw at www.stat.psu.edu/~rho/501data/. The dataset consists of n = 97 prostate cancer patients: y = CancerVol = cancer volume; x1 = PSA_level = prostate specific antigen, a blood chemistry measurement believed to predict the presence of prostate cancer; and x2 = Capsular, a measure of the invasiveness of the cancer. The dataset also includes a column giving an ID number for each patient.
A. Plot CancerVol versus PSA. (1) Identify any unusual points. Give the ID number and values of the variables for points that are unusual. Hint: Hold the mouse cursor on a point in the graph to see the data value(s) for the point. (2) Discuss possible interpretations of the plot.
B. Plot CancerVol versus Capsular. (1) Identify any unusual points. Give the ID number and values of the variables for points that are unusual. (2) Discuss possible interpretations of the plot.
C. Plot PSA versus Capsular. Identify any unusual points. Give the ID number and values of the variables for points that are unusual. NOTE: This is a plot of the two x-variables.
D. Fit the model $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ to the whole dataset. Use the Storage button to store the Deleted t residuals, DFITS, and Cook’s D values. (1) What is the estimated regression equation? (2) What is the value of $R^2$?
E. In the output for the previous part (not the storage columns): (1) Which observations are listed as having unusual x-values? (2) Which observations are listed as having unusually large residuals?
F. Plot the DFITS versus ID number. Identify any “extreme” points. (The book’s criterion for extreme is |DFITS| > 1.) For each point, give the patient number and the value of the DFITS.
G. Explain (in general) what is measured by a DFITS value.
H. Plot Cook’s D values versus ID number. Identify the most extreme values.
What observation(s) have extreme Cook’s D values, and what are the values of Cook’s D for these observations?
I. Plot the deleted t residuals versus ID. Identify any “extreme” data points. Give the observation number and value of the deleted t residuals.
J. Based on the results so far, which observations do you think should be deleted? Why?
K. Delete the points you list in the previous part. With these points deleted, redo the regression you did in part D. (1) What is the estimated regression equation? (2) What is the value of $R^2$?
L. For the regression done in the previous part, plot residuals versus fits. Interpret the plot.

2. Use the dataset party200.mtw. We’ll use logistic regression to predict the probability that a student says they have ever driven under the influence of alcohol based on X = how many days per month they say they drink at least two beers. The underlying model of logistic regression for one X-variable is

$p = \dfrac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$

where p = probability of falling into a category of interest (saying yes to having driven under the influence). Use Stat>Regression>Binary Logistic Regression. Enter “DrvDrnk” as “Response”, enter “DaysBeer” in the Model box, then click Storage and check “Event Probabilities” to store them (the item is on the right side of the dialog box).
A. What are the estimated values of $\beta_0$ and $\beta_1$? (See output in session window.)
B. Plot the stored event probabilities versus DaysBeer. Based on this plot, estimate the probability of ever having driven under the influence for (1) DaysBeer = 0, (2) DaysBeer = 10, and (3) DaysBeer = 20. Remember that you can hold the mouse over a point to identify the coordinate values.
C. Use the equation given for the model to calculate predicted probabilities for (1) DaysBeer = 0 and (2) DaysBeer = 10.

3. Data from n = 19 female bears are used to develop an equation for estimating Y = female bear’s weight from X = bear’s neck circumference. One of the observations has X = 10, Y = 140.
Regression results shown below give equations when this observation is included in the calculations and also when it is not.

All observations used:
Regr. equation is Weight = - 108.3 + 16.95 Neck

X = 10, Y = 140 not used in calculations:
Regr. equation is Weight = - 194.1 + 20.54 Neck

A. Calculate an unstandardized value of DFFIT for the observation X = 10, Y = 140. Show any work.
B. Calculate an unstandardized deleted residual for the observation X = 10, Y = 140. Show any work.

4. For the hospital infection risk dataset, the following output is for y = infection risk, x1 = average length of Stay, x2 = Beds in hospital, and x3 = average daily Census of patients. VIF factors are given as part of the output, and pairwise correlations are given below the regression output.

Predictor       Coef    SE Coef      T      P    VIF
Constant      0.8230     0.6138   1.34  0.183
Stay         0.33538    0.06706   5.00  0.000    1.4
Beds        0.002253   0.003017   0.75  0.457   29.7
Census     -0.001421   0.003921  -0.36  0.718   31.9

Correlations: InfctRsk, Stay, Beds, Census

          InfctRsk    Stay    Beds
Stay         0.533
Beds         0.360   0.409
Census       0.381   0.474   0.981

A. Explain why the VIF values for the variables Beds and Census are so much larger than the VIF value for Stay.
B. The initials VIF stand for “variance inflation factor.” What variance is being referred to in this phrase?
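
The formulas used in problems 2C and 3 can be sketched in Python as a check on hand calculations. This is only a minimal sketch, not Minitab output. The function names are made up for illustration, and the logistic coefficients `b0_example` and `b1_example` are hypothetical placeholders, not estimates from party200.mtw (substitute your own values from part 2A). The sketch also assumes the usual definitions: the unstandardized DFFIT is the predicted value using all observations minus the predicted value with the observation deleted, and the unstandardized deleted residual is the observed Y minus the predicted value with the observation deleted.

```python
import math


def logistic_probability(b0, b1, x):
    """Return p = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    eta = b0 + b1 * x
    return math.exp(eta) / (1.0 + math.exp(eta))


def fitted_value(intercept, slope, x):
    """Predicted value from a simple linear regression equation."""
    return intercept + slope * x


# Problem 2C pattern, using HYPOTHETICAL coefficients (not from the data).
b0_example, b1_example = -1.5, 0.15
for days in (0, 10):
    p = logistic_probability(b0_example, b1_example, days)
    print(f"DaysBeer = {days}: p = {p:.3f}")

# Problem 3 pattern, using the two equations given in the handout.
x_obs, y_obs = 10, 140
fit_all = fitted_value(-108.3, 16.95, x_obs)      # all observations used
fit_deleted = fitted_value(-194.1, 20.54, x_obs)  # observation not used

# Assumed definitions (see the paragraph above).
dffit = fit_all - fit_deleted
deleted_residual = y_obs - fit_deleted

print(f"Unstandardized DFFIT = {dffit:.1f}")
print(f"Unstandardized deleted residual = {deleted_residual:.1f}")
```

Whatever coefficients are used, the logistic formula always returns a value strictly between 0 and 1, which is a quick sanity check on answers to 2C.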