Homework - Logistic regression.

(1) Using the Framingham data (in regular SAS or Enterprise Miner with no validation data), use a logistic regression to estimate the probability of heart problems (first stage coronary heart disease) as a function of age alone. Show the formula for the estimated logit L = ____________ and the estimated probability p = _______________. Give the odds ratio ______ for someone of age 50 and for someone of age 51 _____. For a person of age 50, what is the probability p50 = ____ of heart problems? What is the probability p51 = _____ for a person of age 51? Is the ratio of these two probabilities, p51/p50, the same as the odds ratio? Why or why not? Why do researchers report odds ratios instead of probability ratios?

(2) The S--A notation in SAS is shorthand for all of the variables in your data set that are stored between variable S and variable A (so you have to know the order in which they are stored, which PROC PRINT or PROC CONTENTS will reveal). Also, in PROC LOGISTIC, to be sure you are modeling the probability that firstchd=1, you can use the event= option as shown below. Run this program, adjusting the libname statement to your data storage location:

    libname aaem "C:\Users\dickey\Desktop\aaem";

    /* Definition of variables:
       OBS      = observation number (1--1615)
       AGE      = age at exam 2
       SBP21    = first systolic blood pressure at exam 2
       SBP22    = second systolic blood pressure at exam 2
       SBP31    = first systolic blood pressure at exam 3
       SBP32    = second systolic blood pressure at exam 3
       SMOKE    = present smoking at exam 1
       CHOLEST2 = serum cholesterol at exam 2
       CHOLEST3 = serum cholesterol at exam 3
       FIRSTCHD = indicator of CHD occurring at exam 3-6, i.e., within
                  an eight-year follow-up period to exam 2
    */

    proc print data=aaem.framingham(obs=3); run;

    proc logistic data=aaem.framingham;
       model firstchd(event="1") = age--cholest3 / selection=stepwise;
    run;

(A) What variables were chosen to be in the model?
(B) Show the computation of the logit L and the probability p of first stage coronary heart disease for a nonsmoker (smoke=0) of age 30 with systolic blood pressure sbp32=120 and cholesterol cholest2=200. I am looking for something like L = __ + __X1 + __X2 + … = ___ where the X's are your selected predictor variables.

(C) Now do the same thing for a smoker (smoke=1) with everything else the same. Compare the ratio of your probabilities to the odds ratio for the variable smoke.

(3) An open-ended problem: You are given the Framingham data and asked to analyze it for your employer and write up a short report, 1 page not counting graphs. Your employer is an intelligent person with a basic understanding of statistics (hypothesis testing, linear regression estimation, etc.) but is not well versed in logistic regression and obviously does not want to be burdened with a long report. Show what you would deliver to your employer. Here again you can use regular SAS or Enterprise Miner for the computations.

(4) Optional (not graded - for your interest and understanding only): Run this program and look at the resulting graph window:

    data grid;
       age=30; cholest2=200; sbp32=120; part=2;
       do smoke = 0 to 1;
          do age = 20 to 80 by .5;
             output;
          end;
       end;

    data both;
       set aaem.framingham grid;

    proc logistic data=both;
       model firstchd(event="1") = age--cholest3 / selection=stepwise;
       output out=out1 predicted=p_hat xbeta=logit;
    run;

    proc gplot data=out1;
       plot p_hat*age=smoke;
       plot logit*age=smoke;
    run;

Notice how you get the curves for the probabilities as a function of age and how they differ for smokers and nonsmokers. Note also the spread of the predicted values that are not from the grid data (not on the smooth curves). This shows how much the other variables affect the predicted probabilities. Notice that, when you look at the logits of the predictions, the model is linear in age with parallel lines for smokers and nonsmokers, whereas the probability difference between smokers and nonsmokers increases with age.
Notice that within this age range we have not yet reached the age at which the predicted probability is 50%, and thus we do not see the inflection point of the logistic function. Letting age go to 200 illustrates this.
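
One way to try the "age to 200" suggestion is sketched below. It is only a variation on the grid program in problem (4), not a required part of the assignment. The data set names grid200, both200, and out200 are made up for this illustration, and the vref=0.5 reference line is an added convenience for spotting where a curve crosses 50%. Ages this large are far outside the data, so the plot shows the shape of the fitted logistic curve, not a realistic prediction.

```sas
/* Sketch only: same grid as in problem (4), with age extended to 200. */
/* grid200, both200, and out200 are hypothetical names for this example. */
data grid200;
   cholest2=200; sbp32=120; part=2;
   do smoke = 0 to 1;
      do age = 20 to 200 by .5;
         output;
      end;
   end;
run;

/* Grid rows have no firstchd value, so they are scored but not used in the fit. */
data both200;
   set aaem.framingham grid200;
run;

proc logistic data=both200;
   model firstchd(event="1") = age--cholest3 / selection=stepwise;
   output out=out200 predicted=p_hat xbeta=logit;
run;

/* vref=0.5 draws a horizontal line at p_hat = 0.5 on the probability plot. */
proc gplot data=out200;
   plot p_hat*age=smoke / vref=0.5;
run;
```

The point where each curve crosses the reference line is where its logit equals 0, and that is also where the curve changes from bending upward to bending downward.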