Logistic Regression for Heart Disease - due Tues. March 22

Homework - Logistic regression.
(1) Using the Framingham data (in regular SAS or Enterprise Miner with no validation data), use a logistic
regression to estimate the probability of heart problems (first stage coronary heart disease) as a
function of age alone. Show the formula for the estimated logit L = ____________ and the estimated
probability p = _______________. Give the odds ratio ______ for someone of age 50 and for someone of
age 51 _____. For a person of age 50, what is the probability p50 = ____ of heart problems? What is the
probability p51 = _____ for a person of age 51? Is the ratio of these two probabilities, p51/p50, the same as
the odds ratio? Why or why not? Why do researchers report odds ratios instead of probability ratios?
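As a reminder of the quantities problem (1) asks for, the standard single-predictor relationships can be written as follows. The symbols \hat\beta_0 and \hat\beta_1 are notation assumed here for the intercept and age coefficient reported by PROC LOGISTIC; they do not appear in the assignment, and no fitted values are implied.

```latex
\begin{align*}
L(\text{age}) &= \hat\beta_0 + \hat\beta_1\,\text{age} \\
p(\text{age}) &= \frac{e^{L(\text{age})}}{1 + e^{L(\text{age})}} \\
\text{odds}(\text{age}) &= \frac{p(\text{age})}{1 - p(\text{age})} = e^{L(\text{age})} \\
\frac{\text{odds}(51)}{\text{odds}(50)} &= e^{L(51) - L(50)} = e^{\hat\beta_1}
\end{align*}
```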
(2) The S--A notation in SAS is shorthand for all of the variables in your data set that are
stored between variables S and A (so you have to know the order in which they are stored, which would be
revealed by a PROC PRINT statement or by PROC CONTENTS). Also, in PROC LOGISTIC, to be sure you are modeling
the probability that firstchd=1, you can use the event= option as shown below. Run this program,
adjusting the libname statement to your data storage location:
libname aaem "C:\Users\dickey\Desktop\aaem";
/*
Definition of variables:
OBS = observation number (1-1615)
AGE = age at exam 2
SBP21 = first systolic blood pressure at exam 2
SBP22 = second systolic blood pressure at exam 2
SBP31 = first systolic blood pressure at exam 3
SBP32 = second systolic blood pressure at exam 3
SMOKE = present smoking at exam 1
CHOLEST2 = serum cholesterol at exam 2
CHOLEST3 = serum cholesterol at exam 3
FIRSTCHD = indicator of CHD occurring at exam 3-6
i.e., within an eight-year follow-up period to exam 2
*/
Proc print data=aaem.framingham(obs=3);run;
proc logistic data=aaem.framingham;
model firstchd(event="1") = age--cholest3 /selection=stepwise;
run;
(A) What variables were chosen to be in the model?
(B) Show the computation of the logit L and the probability p of first stage coronary heart
disease for a nonsmoker (smoke=0) of age 30 with systolic blood pressure sbp32=120 and
cholesterol cholest2=200. I am looking for something like L=__+__X1+__X2+…=___ where
the X’s are your selected predictor variables.
(C) Now do the same thing for a smoker (smoke=1) with everything else the same. Compare
the ratio of your probabilities to the odds ratio for the variable smoke.
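The hand computation in parts (B) and (C) can be checked with a short DATA step. This is a minimal sketch, not the assignment's solution. It assumes the selected predictors are the four named in part (B), and every coefficient value below is a made-up placeholder that must be replaced with the estimates from your own PROC LOGISTIC output. The data set name hand_calc and the variable names b0, b_age, b_sbp32, b_chol2, b_smoke are hypothetical.

```sas
data hand_calc;
  /* HYPOTHETICAL placeholder coefficients: replace with your fitted estimates */
  b0 = -5; b_age = 0.05; b_sbp32 = 0.01; b_chol2 = 0.005; b_smoke = 0.5;

  /* Covariate values from part (B) */
  age = 30; sbp32 = 120; cholest2 = 200;

  /* Logits for the nonsmoker (smoke=0) and the smoker (smoke=1) */
  L0 = b0 + b_age*age + b_sbp32*sbp32 + b_chol2*cholest2;
  L1 = L0 + b_smoke;

  /* Convert each logit to a probability */
  p0 = exp(L0) / (1 + exp(L0));
  p1 = exp(L1) / (1 + exp(L1));

  /* Quantities to compare in part (C) */
  prob_ratio = p1 / p0;
  odds_ratio = exp(b_smoke);
run;

proc print data=hand_calc noobs;
  var L0 L1 p0 p1 prob_ratio odds_ratio;
run;
```

Printing prob_ratio next to odds_ratio puts the two numbers that part (C) asks you to compare side by side.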
(3) An open-ended problem: You are given the Framingham data and asked to analyze it for your
employer and write up a short report, 1 page not counting graphs. Your employer is an intelligent
person with a basic understanding of statistics (hypothesis testing, linear regression estimation, etc.) but
is not well versed in logistic regression and obviously does not want to be burdened with a long report.
Show what you would deliver to your employer. Here again you can use regular SAS or Enterprise Miner
for the computations.
(4) Optional (not graded - for your interest and understanding only):
Run this program and look at the resulting graph window:
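/* Grid of hypothetical people: cholest2 and sbp32 held fixed, age from 20 to 80, */
/* for nonsmokers (smoke=0) and smokers (smoke=1). firstchd is never set, so it   */
/* is missing on these rows: they get predictions but are not used in the fit.    */
data grid;
  age=30; cholest2=200; sbp32=120; part=2;   /* age=30 is overwritten by the loop below */
  do smoke = 0 to 1;
    do age = 20 to 80 by .5;
      output;
    end;
  end;
run;
/* Stack the grid rows under the real data */
data both; set aaem.framingham grid; run;
proc logistic data=both;
  model firstchd(event="1") = age--cholest3 /selection=stepwise;
  output out=out1 predicted=p_hat xbeta=logit;   /* predicted probability and logit */
run;
proc gplot data=out1;
  plot p_hat*age=smoke;   /* predicted probability versus age, by smoking status */
  plot logit*age=smoke;   /* predicted logit versus age, by smoking status */
run; quit;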
Notice how you get the curves for the probabilities as a function of age and how they differ for smokers
and nonsmokers. Note also the spread of the predicted values that are not from the grid data (not on
the smooth curves). This shows how much the other variables affect the predicted probabilities. Notice
that, when you look at the logits of the predictions, the model is linear in age with parallel lines, whereas
the probability difference between smokers and nonsmokers increases with age.
Notice that within this age range the predicted probability has not yet reached 50%, so we do not see the
inflection point of the logistic function. Letting age go to 200 illustrates this.
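The remarks about parallel logit lines and the inflection point follow from the shape of the logistic function, sketched below. The notation is assumed for illustration and is not from the assignment: c collects the intercept and the terms for the covariates held fixed on the grid, \beta_{\text{age}} and \beta_{\text{smoke}} are the fitted coefficients, and \text{age}^* is the age at which a nonsmoker's curve reaches p = 1/2.

```latex
\begin{align*}
L &= c + \beta_{\text{age}}\,\text{age} + \beta_{\text{smoke}}\,\text{smoke},
  \qquad p = \frac{1}{1 + e^{-L}} \\
L_{\text{smoke}=1} - L_{\text{smoke}=0} &= \beta_{\text{smoke}}
  \quad \text{(constant in age, hence parallel lines)} \\
\frac{dp}{d\,\text{age}} &= \beta_{\text{age}}\, p\,(1 - p) \\
\frac{d^2 p}{d\,\text{age}^2} &= \beta_{\text{age}}^2\, p\,(1 - p)\,(1 - 2p) = 0
  \iff p = \tfrac{1}{2} \iff L = 0 \\
\text{age}^* &= -\frac{c}{\beta_{\text{age}}} \quad \text{(for smoke} = 0\text{)}
\end{align*}
```

Since the slope \beta_{\text{age}}\,p(1-p) grows as p rises toward 1/2, the two probability curves separate with age over the plotted range even though the logit lines stay a fixed distance apart.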