Biostat 510: Statistical Computing Packages
SAS Homework 6
Due Tuesday, March 23, 2004
This homework again uses the Afifi data. Include all of your SAS commands with your
homework. Please hand in as much of your annotated SAS output as necessary to show
points that you wish to make in your write-up. You do NOT need to include all output
from all procedures. Answer all questions and interpret your results. Use alpha=.05 for any
statistical tests. For all tests, give the test statistic, the degrees of freedom, and the p-value.
Note: for F-tests, give both the numerator and denominator df; for t-tests, give df
(remember: the df for t-test in a regression are the error df); for a chi-square test, give the
df. Please interpret test results. Don't just say if a result is significant or not. Explain what
the result means.
Read in the Afifi data as you did for homework 5 to answer the questions for this
assignment. You will need to create new variables for some questions. If you need to
create new variables, do so in your original data step, and rerun it so that the new variables
will be included in your data set. Make sure that your SAS commands will run from start
to finish without any errors before you hand in your homework.
1. Multiple regression using dummy variables as the predictors.
a. What type of variable is SHOKTYPE? What is the level of measurement for
this variable? (interval? nominal? ordinal?)
b. Describe the relationship between SHOKTYPE and SBP2. To do this, get
descriptive statistics for SBP2 for each level of the variable SHOKTYPE, using
Proc Means, with a class statement. Include this in your homework. What is the
mean of SBP2 for those patients not in shock? For each shock category?
c. Create 6 indicator dummy variables, one for each level of SHOKTYPE. These
dummy variables should be coded with a value of 1 if the case is in a particular
category, and a value of zero if not. Name your dummy variables Shock2,
Shock3, Shock4, Shock5, Shock6 and Shock7.
d. Get frequency tabulations of SHOKTYPE and of your new dummy variables to
check your coding. They should agree. How many patients are in each shock
category? Include these tables in your homework.
e. Run a regression model in which the predictor variables are 5 of your
SHOKTYPE dummy variables. Use Shock type=2 (non-shock) as the reference
category. Include annotated output from this regression model in your
homework.
i. How many observations are included in this regression model?
ii. What is the model R-square?
iii. What is the adjusted R-square?
iv. What is the overall significance of this model? Give the F-statistic,
the numerator and denominator degrees of freedom and
the p-value. Interpret this test.
v. What is the value of the intercept? What is the meaning of the
intercept in this model?
vi. What are the estimated regression coefficients (Parameter
Estimates) for each of the shock type dummy variables? Explain
the meaning of these coefficients in words. Why are they
negative? (If they aren't negative, then you need to check your
commands to be sure that you have shoktype=2 as the
reference).
vii. What is the standard error of each of these coefficients?
viii. Which of the shock types are significant? Interpret the
significance tests for each of the dummy variables. What are
these significance tests testing?
ix. Save the residuals and predicted values from this regression
model to a new data set, using an output statement in SAS.
f. Calculate the predicted value of SBP2 for each one of the shock types from the
regression model results by hand. Show your calculations. Compare these
predicted values from the model to the means of SBP2 for each level of
SHOKTYPE that you got from Proc Means in question 1 b.
g. Check the normality of the residuals that you saved from this model, using proc
univariate. What do you conclude about the normality of the residuals? Note:
you can interpret the tests for normality, but I’m more interested in the graphs.
h. Plot the residuals vs. the predicted values. What does this graph show you?
Remember, you have only categorical predictors, so there will be one predicted
value for each level of shoktype.
i. Include annotated output from Proc Reg for this question.
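The steps in question 1 could be sketched roughly as follows. This is only an outline under stated assumptions, not a model answer: the data set names afifi, afifi2 and regout and the variable names pred and resid are hypothetical, and the dummy variables are created in a separate data step here only for brevity (the assignment asks you to create them in your original data step).

```sas
/* Assumption: the Afifi data were read into a data set named afifi
   (hypothetical name) that contains SHOKTYPE (coded 2-7) and SBP2. */
data afifi2;
  set afifi;
  /* Indicator variables: 1 if in the category, 0 if not.
     They are left missing when SHOKTYPE is missing. */
  if not missing(shoktype) then do;
    shock2 = (shoktype = 2);
    shock3 = (shoktype = 3);
    shock4 = (shoktype = 4);
    shock5 = (shoktype = 5);
    shock6 = (shoktype = 6);
    shock7 = (shoktype = 7);
  end;
run;

/* 1b: descriptive statistics for SBP2 within each level of SHOKTYPE */
proc means data=afifi2 n mean std min max;
  class shoktype;
  var sbp2;
run;

/* 1d: frequency tables to check the dummy coding against SHOKTYPE */
proc freq data=afifi2;
  tables shoktype shock2 shock3 shock4 shock5 shock6 shock7;
run;

/* 1e: Shock2 is omitted, so SHOKTYPE=2 is the reference category.
   The OUTPUT statement saves predicted values and residuals. */
proc reg data=afifi2;
  model sbp2 = shock3 shock4 shock5 shock6 shock7;
  output out=regout p=pred r=resid;
run;
quit;

/* 1f: with only dummy predictors, the predicted SBP2 for the reference
   group is the intercept; for any other shock type it is the intercept
   plus the coefficient of that type's dummy variable. */

/* 1g: normality tests and plots for the saved residuals */
proc univariate data=regout normal plot;
  var resid;
  histogram resid / normal;
  qqplot resid / normal(mu=est sigma=est);
run;

/* 1h: residuals vs. predicted values
   (assumes a SAS release that includes PROC SGPLOT) */
proc sgplot data=regout;
  scatter x=pred y=resid;
  refline 0 / axis=y;
run;
```

If PROC SGPLOT is not available in your installation, PROC PLOT or PROC GPLOT with a plot resid*pred request would serve the same purpose.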
2. Run a multiple regression model with both categorical and continuous predictor
variables.
a. Run a regression model with SBP2 as the dependent variable and all of the
predictors:
• SHOCKDUM (coded as 1=Shock 0=NoShock) (Note: you will need to
create this as a new variable in your data step. Be careful when creating
this variable!).
• Age
• Sex
• Mean arterial pressure at time 1
• Systolic blood pressure at time 1
• Urinary output at time 1
• Body surface area at time 1
• Mean arterial pressure at time 1
• Heart rate at time 1
• Central venous pressure at time 1
• Cardiac output at time 1
• Appearance time at time 1
• Circulation time at time 1
• Plasma level at time 1
• Red cell level at time 1
• Hemoglobin level at time 1
• Hematocrit at time 1
Include the TOL, VIF, and COLLIN options in your model statement after a slash
to get the tolerance, variance inflation factor, and collinearity diagnostics.
b. What is the overall significance of this model? Write out the F-test, the
numerator and denominator degrees of freedom and the p-value. Please
interpret this test.
c. What is the R-square? The adjusted R-square?
d. Are any of the predictor variables significant in this model? Write out the t-test
statistic, the degrees of freedom and the p-value for the significant variables.
e. What is the meaning of the estimated coefficient for each significant predictor?
f. Check for collinearity in this model, using tolerance (TOL), variance inflation
factor (VIF) and COLLIN. Interpret the TOL output and the VIF output. Are
there any variables that have a high value for the VIF that would make you
concerned about collinearity? Which ones? Why do you think these variables
are problematic for collinearity?
g. Check the Condition Indices for this model. Note that the last two condition
indices are high. Which variables have a high proportion (>.50) of their
variance loading on the last Eigenvalue (number 18)? Which variables have a
high proportion of their variance loading on the next to last Eigenvalue (number
17)? Based on these results, which variables do you think are collinear?
h. Based on your examination of this model, do you feel it is adequate? What
variables would you include in a final model?
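A minimal sketch of the setup for question 2, assuming the data set names afifi and afifi2 and the predictor names age, sex, map1 and sbp1 (all hypothetical; substitute the names used in your own data step and add the remaining predictors from the list). The warning about creating SHOCKDUM carefully is illustrated by the way missing values are handled: SAS treats a missing numeric value as smaller than any number, so a plain "else shockdum = 1" would also code patients with missing SHOKTYPE as being in shock.

```sas
/* Assumption: afifi is a hypothetical data set containing SHOKTYPE
   (2 = non-shock, 3-7 = shock categories) and SBP2. */
data afifi2;
  set afifi;
  if shoktype = 2 then shockdum = 0;
  else if shoktype > 2 then shockdum = 1;
  /* A missing SHOKTYPE satisfies neither condition,
     so SHOCKDUM stays missing for those cases. */
run;

/* Check the new variable against the original coding */
proc freq data=afifi2;
  tables shoktype*shockdum / list missing;
run;

/* TOL, VIF and COLLIN go after the slash in the MODEL statement.
   age, sex, map1, sbp1 are hypothetical variable names. */
proc reg data=afifi2;
  model sbp2 = shockdum age sex map1 sbp1
               /* ...remaining time-1 predictors go here... */
               / tol vif collin;
run;
quit;
```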
3. Run several selection methods, with SBP2 as the dependent variable and the same
predictor variables as in question 2.
a. Stepwise selection method.
i. Use the default sle and sls criteria.
ii. Which variables are selected for this model using stepwise selection?
iii. Compare the coefficients for the variables selected using this stepwise
method with the coefficients for these variables in the original model in
question 2.
b. Backward selection method.
i. Use the default sle and sls criteria.
ii. Which variables are selected using the backward selection method?
iii. Is this model the same as that chosen using Stepwise Selection?
c. Rsquare selection method.
i. Get SAS to print out the best 5 models for each number of predictors.
ii. Which model would you choose?
iii. Does the model that you selected have the highest r-square of all models
possible?
d. Adjusted R-square selection method.
i. Which model would you select using this method?
e. CP selection method.
i. Which model would you choose using this method?
ii. Print out the best 20 models for this method.
f. Do these selection methods select the same model that you would have chosen
from question 2?
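The selection methods in question 3 can all be requested with the SELECTION= option of the MODEL statement in Proc Reg. The sketch below assumes a data set named afifi2 that already contains SHOCKDUM, and uses a macro variable preds holding a shortened, hypothetical predictor list; replace it with the full list of variable names from question 2.

```sas
/* Assumption: hypothetical variable names; list all predictors
   from question 2 here. */
%let preds = shockdum age sex map1 sbp1;

proc reg data=afifi2;
  /* Stepwise: SLENTRY and SLSTAY are left at their defaults */
  model sbp2 = &preds / selection=stepwise;

  /* Backward elimination: SLSTAY is left at its default */
  model sbp2 = &preds / selection=backward;

  /* R-square: BEST=5 prints the 5 best models of each size */
  model sbp2 = &preds / selection=rsquare best=5;

  /* Adjusted R-square */
  model sbp2 = &preds / selection=adjrsq;

  /* Mallows' Cp: BEST=20 prints the 20 best models overall */
  model sbp2 = &preds / selection=cp best=20;
run;
quit;
```

Several MODEL statements are used within one Proc Reg step here only to keep the sketch short; separate Proc Reg steps, one per method, work equally well and may make the output easier to annotate.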
4. Comparison of a Crosstab (2 X 2 table) vs. Logistic Regression:
a. Create a new variable called DIED that has a value of 1 if the patient died, and
a value of 0 if they survived.
b. Create formats for the variables Shockdum and Died, using proc format. Set up
these formats as shown below, so that the value 1 will be alphabetically first,
based on the formatted value, and the value 0 will be alphabetically second.
proc format;
  value shockfmt 1 = "A: Shock"
                 0 = "B: No Shock";
  value diedfmt  1 = "A: Died"
                 0 = "B: Lived";
run;
c. Get a crosstab of Shockdum as the Risk factor (row variable) and Died as the
Outcome (column variable).
i. Use the formats that you created for these variables. In order to make
the values come out in the desired order, use syntax something like that
shown below. (Note: if you use the "order=formatted" option, as shown
below, SAS will put the variables in the crosstab based on the alphabetic
order of the formatted values, rather than their numeric order)
proc freq order = formatted ;
tables shockdum * died /relrisk chisq;
format shockdum shockfmt. died diedfmt.;
run;
ii. What percent of patients in Shock died? What percent of patients not in
Shock died?
iii. Test whether there is an association between being in shock and dying.
What do you conclude from this test?
iv. Get the odds ratio (OR) and relative risk for this table. What is the OR
and its 95% Confidence interval? How does this result support the
statistical test that you carried out?
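For part a of question 4, one possible way to create DIED is sketched below. The name and coding of the survival variable are assumptions made only for illustration (a hypothetical variable SURVIVE coded 1 = survived, 3 = died); check the codebook for the actual name and codes before using this. The data set names afifi and afifi2 are also hypothetical.

```sas
/* Assumption: SURVIVE is a hypothetical variable coded
   1 = survived, 3 = died. */
data afifi2;
  set afifi;
  if survive = 3 then died = 1;
  else if survive = 1 then died = 0;
  /* Any other value, including missing, leaves DIED missing. */
run;

/* Check the recode against the original variable */
proc freq data=afifi2;
  tables survive*died / list missing;
run;
```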
5. Run a logistic regression using Shockdum to predict Died. Remember, even though
Shockdum is categorical, you do not need to include it in a class statement, because it
is a single dummy variable (0, 1 coding). Be sure you are modeling the probability that
Died=1. Do not use any formats for the Logistic Procedure, since it will change the
order of the variables and make it difficult to interpret results!
proc logistic descending;
model died = shockdum / rsquare;
run;
i. What is the value of the parameter estimate for Shockdum? Please
interpret this parameter estimate.
ii. Is Shockdum a significant predictor of Dying? Give a statistical test, its
degrees of freedom and its p-value.
iii. What is the Odds Ratio from this analysis, and its 95% confidence
interval? Compare this odds ratio to the one from the crosstab above.
iv. Compare the Likelihood Ratio chi-square test for this model to the
Likelihood Ratio Chi-square test from the crosstab. What are their
values, degrees of freedom and p-values?
v. Compare the Score test from this model with the Pearson chi-square test
from the crosstab. What are their values, degrees of freedom and p-values?
vi. What is the value of the (maximum rescaled) r-square for this model?
vii. What do you conclude about the relationship between Shock and Dying
from this analysis?
6. Comparison of a 6 X 2 crosstab with a Logistic Regression using a class variable.
a. Get a crosstab using Shoktype as the Risk factor and Died as the Outcome. No
formats are necessary for this cross-tabulation.
i. What percent of patients in each level of Shoktype died?
ii. Test whether there is an association between the Shoktype and Died.
What do you conclude based on this test?
b. Carry out a logistic regression with Shock Type as the predictor and Died as the
outcome, using Proc Logistic.
i. Be sure to use a class statement for Shoktype. Use the first level of
Shoktype (non-shock) as the reference category. (class shoktype /
param=ref ref=first;).
ii. What is the overall p-value for this model (use the likelihood ratio chi-square
test)? What is the maximum rescaled r-square for this model?
iii. Include a table (or tables) with the parameter estimates, their standard
errors, the Odds Ratios and their 95% Confidence intervals, the chi-square tests
and the p-values for each parameter. Interpret these results
in terms of the signs of the parameter estimates, the Odds Ratios and the
p-values.
iv. Compare the Likelihood ratio test from this model to that from the
crosstab in part a above. Compare the Score test from this model to the
Pearson chi-square from the crosstab in part a above.
v. What do you conclude about the relationship between being in the
different types of shock and dying from this model?
vi. Compare the logistic regression model from question 5 to the one for
this question. Which model would you prefer to use? Why?
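A rough sketch of the two analyses in question 6, assuming a data set named afifi2 (hypothetical) that already contains SHOKTYPE and DIED. The class statement is the one given in part b; the rsquare option is carried over from the Proc Logistic example in question 5.

```sas
/* 6a: 6 x 2 crosstab with chi-square tests; no formats needed */
proc freq data=afifi2;
  tables shoktype*died / chisq;
run;

/* 6b: SHOKTYPE as a class variable with reference-cell coding;
   the first level (SHOKTYPE=2, non-shock) is the reference.
   DESCENDING makes the model predict the probability that DIED=1. */
proc logistic data=afifi2 descending;
  class shoktype / param=ref ref=first;
  model died = shoktype / rsquare;
run;
```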
7. Logistic regression using several variables, both continuous and categorical.
a. Run a logistic regression with DIED as the dependent variable and the
predictors:
• Age
• Sex
• Shockdum
• Mean arterial pressure at time 1
i. What is the overall model significance? Give the likelihood ratio chi-square,
the degrees of freedom and p-value.
ii. What is the significance of each predictor in the model? Please include
the test statistic, degrees of freedom and p-value.
iii. What are the parameter estimates for each significant predictor? Please
interpret them.
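The model in question 7 might be set up as in the sketch below. The data set name afifi2 and the variable names age, sex and map1 are hypothetical; use the names from your own data step. Sex is entered here as a single numeric variable, which assumes it has only two coded values.

```sas
/* DESCENDING makes the model predict the probability that DIED=1.
   age, sex, map1 are hypothetical variable names. */
proc logistic data=afifi2 descending;
  model died = age sex shockdum map1;
run;
```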
8. Run a stepwise selection procedure for a logistic regression.
a. Carry out a stepwise logistic regression, in which all the physiological
measurements made at time 1 are potential predictors in the model.
• SHOCKDUM
• Age
• Sex
• Mean arterial pressure at time 1
• Systolic blood pressure at time 1
• Urinary output at time 1
• Body surface area at time 1
• Mean arterial pressure at time 1
• Heart rate at time 1
• Central venous pressure at time 1
• Cardiac output at time 1
• Appearance time at time 1
• Circulation time at time 1
• Plasma level at time 1
• Red cell level at time 1
• Hemoglobin level at time 1
• Hematocrit at time 1
b. Note: if you use the details option as part of your model statement (after a
slash), it will give you information on each step of the stepwise logistic
regression, but you only need to include output from the last step in your
homework write-up.
c. What variables are included in the final model?
d. What is the overall significance of this final model? What is its r-square?
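A stepwise logistic regression along the lines of question 8 could be requested as sketched here. The data set name afifi2 and the variable names age, sex, map1 and sbp1 are hypothetical, and the predictor list is shortened; add the remaining time-1 measurements from the list in part a.

```sas
/* SELECTION=STEPWISE with DETAILS prints information on each step;
   RSQUARE adds the r-square and maximum rescaled r-square.
   Variable names other than DIED and SHOCKDUM are hypothetical. */
proc logistic data=afifi2 descending;
  model died = shockdum age sex map1 sbp1
               /* ...remaining time-1 measurements go here... */
               / selection=stepwise details rsquare;
run;
```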
9. Check all your commands and rerun them to be sure they all run without any errors.
Save your command file as hw6.sas. Be sure to hand in your commands along with
your write-up.