Biostat 510 Homework 3
Due Thursday, February 2, 2006

1. Create a permanent SAS data set from the raw data file, afifi.dat, on my web page. Model your SAS commands on the description of the Afifi data set included with this homework.
   a) Read in all variables, even though the example shows only a subset of the variables being read.
   b) Assign labels to all variables, using a LABEL statement, as shown in the handout.
   c) Create new variables in the permanent data set, as shown in the handout, plus the additional new variables listed here.
      i. SHOCK: Dummy variable. 0=no shock, 1=shock (shown in handout)
      ii. DIED: Dummy variable. 0=lived, 1=died (shown in handout)
      iii. SHOCK_DUM2 through SHOCK_DUM7: Dummy variables for each shock type (shock types go from 2 through 7).
      iv. SBPDIFF: The difference between SBP at time 2 and SBP at time 1. Calculate by subtracting SBP1 from SBP2.

2. Create a scatterplot with SBP2 on the Y-axis and SBP1 on the X-axis. Include a linear regression line in your scatterplot. You may create the scatterplot using Proc Gplot or SAS/INSIGHT. Include your scatterplot in your homework.

3. Carry out a simple linear regression, with SBP2 as the dependent variable and SBP1 as the only predictor.
   a) Get a plot of residuals vs. predicted values to check homogeneity of variance.
   b) Get a histogram and normal Q-Q plot to check the normality of the residuals. Use the studentized residuals for both of these plots.
   c) Include your regression output and the diagnostic plots in your homework.

4. Get a box plot with SBP2 on the Y-axis and the levels of SHOCKTYPE on the X-axis. Don’t forget to sort the data by SHOCKTYPE before running the box plot. Include the box plot in your homework.

5. Carry out a regression with the dummy variables for SHOCKTYPE as the predictors and SBP2 as the dependent variable.
   a) Use Non-shock as the reference category.
   b) Get a plot of residuals vs. predicted values to check homogeneity of variance.
   c) Get a histogram and normal Q-Q plot to check the normality of the residuals. Use the studentized residuals for both of these plots.
   d) Include your regression output and the diagnostic plots in your homework.

6. Create a box plot with SBP2 on the Y-axis and the SHOCK dummy variable on the X-axis. Include this box plot in your homework.

7. Carry out a regression with SBP2 as the dependent variable and SHOCK (the dummy variable) as the predictor. Include the output for this regression in your homework.
   a) Create diagnostic plots for this regression, but you do not need to include them in your homework.

8. DON’T DO THIS ONE. Create a Pearson correlation matrix with the variables SBP2, SBP1, BSA1, CARDIAC1, HGB1, and MAP1.
   a) Use listwise deletion for the variables.
   b) Create a scatterplot matrix using SAS/INSIGHT.
   c) Include the correlation matrix and the scatterplot matrix in your homework output.

9. DON’T DO THIS ONE. Carry out a multiple regression with SBP2 as the dependent variable and SBP1, BSA1, CARDIAC1, HGB1, and MAP1 as predictors.
   a) Check the collinearity diagnostics for this model.
   b) Include the model output, along with the collinearity diagnostics, in your homework.
   c) Rerun the model, but remove MAP1 as a predictor.
   d) Check collinearity for this new model.

10. Answer the following questions about your analysis.
   a) (Scatterplot of SBP2 vs. SBP1) Does there appear to be a linear relationship between these two variables? Describe the direction of the relationship, and how much variability there is around the linear regression line.
   b) (Simple linear regression) Please interpret the estimated intercept and the estimated coefficient for SBP1 in this model.
      i. Is there a significant linear relationship between SBP2 and SBP1? Write out your response in words, and include the t-statistic, df, and p-value for the test.
      ii. What is the model R²? What is the sample size for this model?
      iii. Describe the scatterplot of studentized residuals vs. predicted values. Do the residuals appear to have constant variance for all predicted values?
      iv. Describe the distribution of the residuals. Does the assumption of normally distributed errors appear to be true for this model?
   c) (Box plot of SBP2 for levels of SHOCKTYPE)
      i. Describe the pattern of SBP2 for the levels of SHOCKTYPE.
   d) (Regression with dummy variables for SHOCKTYPE)
      i. Interpret the intercept and the coefficients for each level of SHOCKTYPE. Which of the dummy variables for the levels of SHOCKTYPE are significant?
      ii. What is the model R²? What is the sample size for this model?
      iii. Does the assumption of equal variances appear to hold true for the different levels of SHOCKTYPE, based on your scatterplot of studentized residuals vs. predicted values for this model?
      iv. Does the assumption of normality of residuals appear to hold true for this model?
   e) (Box plot of SBP2 vs. SHOCK dummy)
      i. What is the relationship between SHOCK and SBP2 as shown in this box plot?
   f) (Regression with one dummy variable for SHOCK as the predictor)
      i. Interpret the intercept and the coefficient for SHOCK in this regression model.
      ii. What is the model R²? What is the sample size for this model?
      iii. Compare the model R² for this model to the one in which the dummy variables for SHOCKTYPE were included as predictors. Which model would you prefer to use? (Think of parsimony.)
   g) DON’T ANSWER THIS QUESTION. (Pearson Correlation Matrix)
      i. What variables are highly correlated with each other in this correlation matrix?
      ii. How many observations can be included in this correlation?
   h) DON’T ANSWER THIS QUESTION. (Multiple Regression with Collinearity Diagnostics)
      i. What variables appear to be collinear in the initial regression model?
      ii. What are the parameter estimate, standard error, and significance for each predictor in the initial model?
      iii. How do the collinearity diagnostics appear after deleting MAP1 as a predictor?
      iv. What are the parameter estimate, standard error, and significance for each predictor in the final model?
      v. Comment on the comparison between these two models.

The SAS commands will be worth 50 points, and answers to questions a) through f) will be worth 50 points. Save your homework commands as Homework3.sas. Run all the commands at once and be sure they all work without error.

Note: you should work on only questions 1 through 7, and answer questions a) through f) for this homework.

Data Description For Afifi Data

Afifi and Azen (1972) describe data collected at the Los Angeles Shock Unit. For each patient, data were taken on admission and either shortly before death or before discharge. The variables and their formats are described in the table below. Variables 1-21 refer to data at the initial examination and variables 22-42 refer to the same variables at the final examination.

Variables  Columns  Format  Name         Description
1,22       1-4      4.0     IDNUM        Id number
2,23       5-8      4.0     AGE          Age (years)
3,24       9-12     4.0     HEIGHT       Height (cm)
4,25       13-15    3.0     SEX          Sex (1=male, 2=female)
5,26       16       1.0     SURVIVE      Survival (1=lived, 3=died)
6,27       17-20    4.0     SHOCKTYPE    Shock type (2=non-shock, 3,4,5,6,7=shock)
7,28       21-24    4.0     SBP1/SBP2    Systolic blood pressure (mm Hg)
8,29       25-28    4.0     MAP1/MAP2    Mean arterial pressure (mm Hg)
9,30       29-32    4.0     HRT1/HRT2    Heart rate (beats per minute)
10,31      33-36    4.0     DBP1/DBP2    Diastolic blood pressure (mm Hg)
11,32      37-40    4.1     CVP1/CVP2    Mean central venous BP (mm Hg)
12,33      41-44    4.2     BSA1/BSA2    Body surface area (m sq)
13,34      45-48    4.2     CI1/CI2      Cardiac index (l/min/m sq)
14,35      49-52    4.1     APP1/APP2    Appearance time (sec)
15,36      53-56    4.1     CIRC1/CIRC2  Mean circulation time (sec)
16,37      57-60    4.0     UR1/UR2      Urinary output (ml/hr)
17,38      61-64    4.1     PLAS1/PLAS2  Plasma volume index (ml/kg)
18,39      65-68    4.1     RC1/RC2      Red cell index (ml/kg)
19,40      69-72    4.1     HGB1/HGB2    Hemoglobin (gm)
20,41      73-76    4.1     HCT1/HCT2    Hematocrit (%)
21,42      80       1.0     TIME1/TIME2  Time (1=initial, 2=final)

A listing of the first 6 lines of the raw data file, Afifi.dat, is shown below:

 340  70 160  23   4  62  38  53  29 100 187  90 190 390   0 394 241 131 400   1
 340  70 160  23   4 129  74  72  53 190 187 120 130 300  15 394 241 112 365   2
 412  56 173  11   4  83  66 110  60  10 182 126 221 407 110 362 240 166 500   1
 412  56 173  11   4 102  75 108  63  90 182 281 100 206  50 564 266 154 330   2
 426  47 176  11   4  80  64  84  55  10 180 110 120 280  80 373 272 146 490   1
 426  47 176  11   4  87  68  77  52  40 180 410 100 170  75 508 217  99 320   2

libname b510 "h:\b510\2006";

DATA b510.AFIFI;
  INFILE 'C:\TEMP\LABDATA\AFIFI.DAT';
  /* Each patient occupies two lines: #1 = initial exam, #2 = final exam. */
  /* The .1 after a column range reads the field with one implied decimal. */
  INPUT #1 IDNUM 1-4 AGE 5-8 SEX 13-15 SURVIVE 16 SHOKTYPE 17-20
           SBP1 21-24 HGB1 69-72 .1
        #2 SBP2 21-24 HGB2 69-72 .1;
  LABEL SHOCK='Shock type'
        SBP1='Systolic BP at time 1'
        SBP2='Systolic BP at time 2'
        HGB1='Hemoglobin at time 1'
        HGB2='Hemoglobin at time 2'
        ;
  /* Dummy variables: SHOCK (0=no shock, 1=shock), DIED (0=lived, 1=died) */
  IF SHOKTYPE=2 THEN SHOCK=0;
  IF SHOKTYPE IN (3,4,5,6,7) THEN SHOCK=1;
  IF SURVIVE=1 THEN DIED=0;
  IF SURVIVE=3 THEN DIED=1;
RUN;
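
The handout step builds SHOCK and DIED with paired IF/THEN statements. As a rough sketch of another way to code a single dummy variable and then check it, the step below assumes the permanent data set b510.AFIFI already exists with SHOKTYPE as read in the handout. The data set name WORK.AFIFI_CHECK is hypothetical and used only for illustration; this is one possible approach, not the required one.

```sas
/* Sketch only. Assumes b510.AFIFI exists as created by the handout step. */
/* WORK.AFIFI_CHECK is an illustrative name, not part of the assignment.  */
data work.afifi_check;
    set b510.afifi;
    /* In SAS a comparison returns 1 when true and 0 when false.          */
    /* The guard leaves the dummy missing when SHOKTYPE is missing.       */
    if not missing(shoktype) then shock_dum3 = (shoktype = 3);
run;

/* Cross-tabulate the source variable against the dummy to verify coding. */
proc freq data=work.afifi_check;
    tables shoktype*shock_dum3 / missing;
run;
```

In the PROC FREQ output, every row with SHOKTYPE=3 should fall in the SHOCK_DUM3=1 column and every other non-missing shock type in the 0 column. The same check can be repeated for any other dummy variable you create.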