Biostat 510: Statistical Computing Packages
SAS Homework 6
Due Tuesday, March 23, 2004

This homework again uses the Afifi data. Include all of your SAS commands with your homework. Please hand in as much of your annotated SAS output as necessary to show the points that you wish to make in your write-up. You do NOT need to include all output from all procedures.

Answer all questions and interpret your results. Use alpha = .05 for any statistical tests. For all tests, give the test statistic, the degrees of freedom, and the p-value. Note: for F-tests, give both the numerator and denominator df; for t-tests, give the df (remember: the df for a t-test in a regression are the error df); for a chi-square test, give the df. Please interpret test results. Don't just say whether a result is significant. Explain what the result means.

Read in the Afifi data as you did for homework 5 to answer the questions for this assignment. You will need to create new variables for some questions. If you need to create new variables, do so in your original data step, and rerun it so that the new variables will be included in your data set. Make sure that your SAS commands will run from start to finish without any errors before you hand in your homework.

1. Multiple regression using dummy variables as the predictors.
   a. What type of variable is SHOKTYPE? What is the level of measurement for this variable (interval, nominal, or ordinal)?
   b. Describe the relationship between SHOKTYPE and SBP2. To do this, get descriptive statistics for SBP2 for each level of the variable SHOKTYPE, using Proc Means with a class statement. Include this in your homework. What is the mean of SBP2 for those patients not in shock? For each shock category?
   c. Create 6 indicator dummy variables, one for each level of SHOKTYPE. These dummy variables should be coded with a value of 1 if the case is in a particular category, and a value of 0 if not.
      Name your dummy variables Shock2, Shock3, Shock4, Shock5, Shock6 and Shock7.
   d. Get frequency tabulations of SHOKTYPE and of your new dummy variables to check your coding. They should agree. How many patients are in each shock category? Include these tables in your homework.
   e. Run a regression model in which the predictor variables are 5 of your SHOKTYPE dummy variables. Use SHOKTYPE=2 (non-shock) as the reference category. Include annotated output from this regression model in your homework.
      i. How many observations are included in this regression model?
      ii. What is the model R-square?
      iii. What is the adjusted R-square?
      iv. What is the overall significance of this model? Give the F-statistic, the numerator and denominator degrees of freedom, and the p-value. Interpret this test.
      v. What is the value of the intercept? What is the meaning of the intercept in this model?
      vi. What are the estimated regression coefficients (Parameter Estimates) for each of the shock type dummy variables? Explain the meaning of these coefficients in words. Why are they negative? (If they aren't negative, then you need to check your commands to be sure that you have SHOKTYPE=2 as the reference.)
      vii. What is the standard error of each of these coefficients?
      viii. Which of the shock types are significant? Interpret the significance tests for each of the dummy variables. What are these significance tests testing?
      ix. Save the residuals and predicted values from this regression model to a new data set, using an output statement in SAS.
   f. Calculate by hand the predicted value of SBP2 for each of the shock types from the regression model results. Show your calculations. Compare these predicted values from the model to the means of SBP2 for each level of SHOKTYPE that you got from Proc Means in question 1b.
   g. Check the normality of the residuals that you saved from this model, using Proc Univariate. What do you conclude about the normality of the residuals?
      Note: you can interpret the tests for normality, but I’m more interested in the graphs.
   h. Plot the residuals vs. the predicted values. What does this graph show you? Remember, you have only categorical predictors, so there will be one predicted value for each level of SHOKTYPE.
   i. Include annotated output from Proc Reg for this question.

2. Run a multiple regression model with both categorical and continuous predictor variables.
   a. Run a regression model with SBP2 as the dependent variable and all of the following predictors:
         SHOCKDUM (coded as 1=Shock, 0=NoShock) (Note: you will need to create this as a new variable in your data step. Be careful when creating this variable!)
         Age
         Sex
         Mean arterial pressure at time 1
         Systolic blood pressure at time 1
         Urinary output at time 1
         Body surface area at time 1
         Mean arterial pressure at time 1
         Heart rate at time 1
         Central venous pressure at time 1
         Cardiac output at time 1
         Appearance time at time 1
         Circulation time at time 1
         Plasma level at time 1
         Red cell level at time 1
         Hemoglobin level at time 1
         Hematocrit at time 1
      Include the TOL, VIF, and COLLIN options in your model statement after a slash to get the tolerance, variance inflation factor, and collinearity diagnostics.
   b. What is the overall significance of this model? Write out the F-test, the numerator and denominator degrees of freedom, and the p-value. Please interpret this test.
   c. What is the R-square? The adjusted R-square?
   d. Are any of the predictor variables significant in this model? Write out the t-test statistic, the degrees of freedom, and the p-value for the significant variables.
   e. What is the meaning of the estimated coefficient for each significant predictor?
   f. Check for collinearity in this model, using tolerance (TOL), variance inflation factor (VIF) and COLLIN. Interpret the TOL output and the VIF output. Are there any variables that have a high value for the VIF that would make you concerned about collinearity? Which ones?
      Why do you think these variables are problematic for collinearity?
   g. Check the Condition Indices for this model. Note that the last two condition indices are high. Which variables have a high proportion (>.50) of their variance loading on the last Eigenvalue (number 18)? Which variables have a high proportion of their variance loading on the next-to-last Eigenvalue (number 17)? Based on these results, which variables do you think are collinear?
   h. Based on your examination of this model, do you feel it is adequate? What variables would you include in a final model?

3. Run several selection methods, with SBP2 as the dependent variable and the same predictor variables as in question 2.
   a. Stepwise selection method.
      a. Use the default sle and sls criteria.
      b. Which variables are selected for this model using stepwise selection?
      c. Compare the coefficients for the variables selected using this stepwise method with the coefficients for these variables in the original model in question 2.
   b. Backward selection method.
      a. Use the default sle and sls criteria.
      b. Which variables are selected using the backward selection method?
      c. Is this model the same as that chosen using stepwise selection?
   c. Rsquare selection method.
      a. Get SAS to print out the best 5 models for each number of predictors.
      b. Which model would you choose?
      c. Does the model that you selected have the highest R-square of all possible models?
   d. Adjusted R-square selection method.
      a. Which model would you select using this method?
   e. CP selection method.
      a. Which model would you choose using this method?
      b. Print out the best 20 models for this method.
   f. Do these selection methods select the same model that you would have chosen from question 2?

4. Comparison of a Crosstab (2 X 2 table) vs. Logistic Regression:
   a. Create a new variable called DIED that has a value of 1 if the patient died, and a value of 0 if they survived.
   b. Create formats for the variables Shockdum and Died, using proc format.
      Set up these formats as shown below, so that the value 1 will be alphabetically first, based on the formatted value, and the value 0 will be alphabetically second.

         proc format;
           value shockfmt 1 = "A: Shock"
                          0 = "B: No Shock";
           value diedfmt  1 = "A: Died"
                          0 = "B: Lived";
         run;

   c. Get a crosstab of Shockdum as the Risk factor (row variable) and Died as the Outcome (column variable).
      i. Use the formats that you created for these variables. In order to make the values come out in the desired order, use syntax something like that shown below. (Note: if you use the "order=formatted" option, as shown below, SAS will arrange the variables in the crosstab based on the alphabetic order of the formatted values, rather than their numeric order.)

            proc freq order = formatted;
              tables shockdum * died / relrisk chisq;
              format shockdum shockfmt. died diedfmt.;
            run;

      ii. What percent of patients in Shock died? What percent of patients not in Shock died?
      iii. Test whether there is an association between being in shock and dying. What do you conclude from this test?
      iv. Get the odds ratio (OR) and relative risk for this table. What are the OR and its 95% confidence interval? How does this result support the statistical test that you carried out?

5. Run a logistic regression using Shockdum to predict Died. Remember, even though Shockdum is categorical, you do not need to include it in a class statement, because it is a single dummy variable (0, 1 coding). Be sure you are modeling the probability that Died=1. Do not use any formats for the Logistic Procedure, since they will change the order of the variables and make it difficult to interpret results!

      proc logistic descending;
        model died = shockdum / rsquare;
      run;

   i. What is the value of the parameter estimate for Shockdum? Please interpret this parameter estimate.
   ii. Is Shockdum a significant predictor of dying? Give a statistical test, its degrees of freedom, and its p-value.
   iii. What are the Odds Ratio from this analysis and its 95% confidence interval? Compare this odds ratio to the one from the crosstab above.
   iv. Compare the Likelihood Ratio chi-square test for this model to the Likelihood Ratio chi-square test from the crosstab. What are their values, degrees of freedom, and p-values?
   v. Compare the Score test from this model with the Pearson chi-square test from the crosstab. What are their values, degrees of freedom, and p-values?
   vi. What is the value of the (maximum rescaled) R-square for this model?
   vii. What do you conclude about the relationship between Shock and Dying from this analysis?

6. Comparison of a 6 X 2 crosstab with a Logistic Regression using a class variable.
   a. Get a crosstab using Shoktype as the Risk factor and Died as the Outcome. No formats are necessary for this cross-tabulation.
      i. What percent of patients in each level of Shoktype died?
      ii. Test whether there is an association between Shoktype and Died. What do you conclude based on this test?
   b. Carry out a logistic regression with Shock Type as the predictor and Died as the outcome, using Proc Logistic.
      i. Be sure to use a class statement for Shoktype. Use the first level of Shoktype (non-shock) as the reference category (class shoktype / param=ref ref=first;).
      ii. What is the overall p-value for this model (use the likelihood ratio chi-square test)? What is the maximum rescaled R-square for this model?
      iii. Include a table (or tables) with the parameter estimates, their standard errors, the Odds Ratios and their 95% confidence intervals, the chi-square tests, and the p-values for each parameter. Interpret these results in terms of the signs of the parameter estimates, the Odds Ratios, and the p-values.
      iv. Compare the Likelihood Ratio test from this model to that from the crosstab in part a above. Compare the Score test from this model to the Pearson chi-square from the crosstab in part a above.
      v. What do you conclude about the relationship between being in the different types of shock and dying from this model?
      vi. Compare the logistic regression model from question 5 to the one for this question. Which model would you prefer to use? Why?

7. Logistic regression using several variables, both continuous and categorical.
   a. Run a logistic regression with DIED as the dependent variable and the predictors:
         Age
         Sex
         Shockdum
         Mean arterial pressure at time 1
      i. What is the overall model significance? Give the likelihood ratio chi-square, the degrees of freedom, and the p-value.
      ii. What is the significance of each predictor in the model? Please include the test statistic, degrees of freedom, and p-value.
      iii. What are the parameter estimates for each significant predictor? Please interpret them.

8. Run a stepwise selection procedure for a logistic regression.
   a. Carry out a stepwise logistic regression, in which all the physiological measurements made at time 1 are potential predictors in the model.
         SHOCKDUM
         Age
         Sex
         Mean arterial pressure at time 1
         Systolic blood pressure at time 1
         Urinary output at time 1
         Body surface area at time 1
         Mean arterial pressure at time 1
         Heart rate at time 1
         Central venous pressure at time 1
         Cardiac output at time 1
         Appearance time at time 1
         Circulation time at time 1
         Plasma level at time 1
         Red cell level at time 1
         Hemoglobin level at time 1
         Hematocrit at time 1
   b. Note: if you use the details option as part of your model statement (after a slash), it will give you information on each step of the stepwise logistic regression, but you only need to include output from the last step in your homework write-up.
   c. What variables are included in the final model?
   d. What is the overall significance of this final model? What is its R-square?

9. Check all your commands and rerun them to be sure they all run without any errors. Save your command file as hw6.sas. Be sure to hand in your commands along with your write-up.
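
Syntax hint for question 1 (a rough sketch under stated assumptions, not a required solution). The dummy coding, the checks, and the regression with saved residuals might look something like the commands below. The data set names afifi, afifi2 and regout, and the output variable names predict and resid, are made up for illustration; substitute the names from your own homework 5 data step. The plot step assumes a SAS release that includes Proc Sgplot; Proc Gplot or Proc Plot would serve the same purpose.

```sas
/* Sketch only. Assumes a temporary data set named afifi (hypothetical name)
   that already contains SHOKTYPE (coded 2-7, 2 = non-shock) and SBP2. */
data afifi2;
  set afifi;
  /* Only code the dummies when SHOKTYPE is present: in SAS the comparison
     (shoktype = 3) evaluates to 0, not missing, when shoktype is missing. */
  if shoktype ne . then do;
    shock2 = (shoktype = 2);
    shock3 = (shoktype = 3);
    shock4 = (shoktype = 4);
    shock5 = (shoktype = 5);
    shock6 = (shoktype = 6);
    shock7 = (shoktype = 7);
    shockdum = (shoktype ne 2);   /* 1 = shock, 0 = no shock (question 2) */
  end;
run;

/* 1b: descriptive statistics for SBP2 within each level of SHOKTYPE */
proc means data=afifi2;
  class shoktype;
  var sbp2;
run;

/* 1d: frequency tables to check the dummy coding against SHOKTYPE */
proc freq data=afifi2;
  tables shoktype shock2-shock7 shockdum;
run;

/* 1e: Shock2 is left out of the model, so SHOKTYPE=2 is the reference.
   The output statement saves predicted values and residuals (1e.ix). */
proc reg data=afifi2;
  model sbp2 = shock3 shock4 shock5 shock6 shock7;
  output out=regout p=predict r=resid;
run;
quit;

/* 1g: normality tests and plots for the saved residuals */
proc univariate data=regout normal plot;
  var resid;
  histogram resid / normal;
  qqplot resid / normal(mu=est sigma=est);
run;

/* 1h: residuals vs. predicted values, with a reference line at zero */
proc sgplot data=regout;
  scatter x=predict y=resid;
  refline 0 / axis=y;
run;
```

Because Shock2 is omitted from the model statement, the intercept plays the role of the non-shock group and each coefficient is a difference from that group, which is what question 1f asks you to verify by hand against the Proc Means output.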