Logistic Regression Using SAS

advertisement
Logistic Regression Using SAS
For this handout we will examine a dataset that is part of the data collected from “A study of preventive
lifestyles and women’s health” conducted by a group of students in School of Public Health, at the University
of Michigan during the1997 winter term. There are 370 women in this study aged 40 to 91 years.
Description of variables:
Variable Name
Location
IDNUM
STOPMENS
AGESTOP1
NUMPREG1
AGEBIRTH
MAMFREQ4
DOB
EDUC
TOTINCOM
SMOKER
WEIGHT1
Description
Column
Identification number
1-4
1= Yes, 2= NO, 9= Missing
5
88=NA (haven't stopped) 99= Missing
6-7
88=NA (no births) 99= Missing
8-9
88=NA (no births) 99= Missing
10-11
1= Every 6 months
12
2= Every year
3= Every 2 years
4= Every 5 years
5= Never
6= Other
9= Missing
01/01/00 to 12/31/57
13-20
99/99/99= Missing
1= No formal school
21-22
2= Grade school
3= Some high school
4= High school graduate/ Diploma equivalent
5= Some college education/ Associate’s degree
6= College graduate
7= Some graduate school
8= Graduate school or professional degree
9= Other
99= Missing
1= Less than $10,000
23
2= $10.000 to 24,999
3= $25,000 to 39,999
4= $40.000 to 54,999
5= More than $55,000
8= Don’t know
9= Missing
1= Yes, 2= No, 9= Missing
999= Missing
24
25-27
1
The yearcutoff option is used, which defines the 100-year window SAS will use for a two-digit year. We set
yearcutoff=1900 so that a date of birth of 12/21/05 will be read as Dec 21, 1905, rather than as Dec 21, 2005
(the default yearcutoff for SAS 9.2 is 1920).
options yearcutoff=1900;
The data step commands read in the raw data and set up the missing value codes. We set up the missing value
code for DOB to be 09/09/99, using a SAS date constant ("09SEP99"D). We also create some new variables:
MENOPAUSE (a 0,1 dummy variable), YEARBIRTH, AGE (age in years), EDCAT (a 3-level categorical
variable), AGECAT (a 4-level categorical variable), OVER50 (a 0, 1 dummy variable), and HIGHAGE (a
categorical variable with values 1 and 2).
data bcancer;
infile "brca.dat" lrecl=300;
input idnum 1-4 stopmens 5 agestop1 6-7 numpreg1 8-9 agebirth 10-11
mamfreq4 12 @13 dob mmddyy8. educ 21-22
totincom 23 smoker 24 weight1 25-27;
format dob mmddyy10.;
if
if
if
if
if
if
if
if
if
if
dob = "09SEP99"D then dob=.;
stopmens=9 then stopmens=.;
agestop1 = 88 or agestop1=99 then agestop1=.;
agebirth =99 then agebirth=.;
numpreg1=99 then numpreg1=.;
mamfreq4=9 then mamfreq4=.;
educ=99 then educ=.;
totincom=8 or totincom=9 then totincom=.;
smoker=9 then smoker=.;
weight1=999 then weight1=.;
if stopmens = 1 then menopause=1;
if stopmens = 2 then menopause=0;
yearbirth = year(dob);
age = int(("01JAN1997"d - dob)/365.25);
if educ not=. then do;
if educ in (1,2,3,4) then edcat = 1;
if educ in (5,6)
then edcat = 2;
if educ in (7,8)
then edcat = 3;
highed = (educ in (6,7,8));
end;
if age not=. then do;
if age <50 then agecat=1;
if age >=50 and age < 60 then agecat=2;
if age >=60 and age < 70 then agecat=3;
if age >=70 then agecat=4;
if age < 50 then over50 = 0;
if age >=50 then over50 = 1;
if age >= 50 then highage = 1;
if age < 50 then highage = 2;
end;
run;
2
Descriptives and Frequencies
We first get descriptive statistics for all the numerical variables in the dataset. We request specific statistics,
including nmiss, to stress the number of missing values for each variable.
title "Descriptive Statistics";
proc means data=bcancer n nmiss min max mean std;
run;
Descriptive Statistics
The MEANS Procedure
N
Variable
N
Miss
Minimum
Maximum
Mean
Std Dev
---------------------------------------------------------------------------------------idnum
370
0
1008.00
2448.00
1761.69
412.7290352
stopmens
369
1
1.0000000
2.0000000
1.1598916
0.3670031
agestop1
297
73
27.0000000
61.0000000
47.1818182
6.3101650
numpreg1
366
4
0
12.0000000
2.9480874
1.8726683
agebirth
359
11
9.0000000
88.0000000
30.2228412
19.5615468
mamfreq4
328
42
1.0000000
6.0000000
2.9420732
1.3812853
dob
361
9
-19734.00
-1248.00
-7899.50
4007.12
educ
365
5
1.0000000
9.0000000
5.6410959
1.6374595
totincom
325
45
1.0000000
5.0000000
3.8276923
1.3080364
smoker
364
6
1.0000000
2.0000000
1.4862637
0.5004993
weight1
360
10
86.0000000
295.0000000
148.3527778
31.1093049
menopause
369
1
0
1.0000000
0.8401084
0.3670031
yearbirth
361
9
1905.00
1956.00
1937.86
10.9836177
age
361
9
40.0000000
91.0000000
58.1440443
10.9899588
edcat
364
6
1.0000000
3.0000000
2.0137363
0.7694786
highed
365
5
0
1.0000000
0.4383562
0.4968666
agecat
361
9
1.0000000
4.0000000
2.3296399
1.0798313
over50
361
9
0
1.0000000
0.7257618
0.4467488
highage
361
9
1.0000000
2.0000000
1.2742382
0.4467488
----------------------------------------------------------------------------------------
Next, we examine oneway frequencies for selected variables. Note that these could all have been requested in a
single tables statement. We will carefully check the Frequency Missing for each variable.
title "Oneway Frequencies";
proc freq data=bcancer;
tables dob stopmens menopause educ edcat age agecat over50 highage;
run;
The FREQ Procedure
Cumulative
Cumulative
dob
Frequency
Percent
Frequency
Percent
--------------------------------------------------------------12/21/1905
1
0.28
1
0.28
09/11/1909
1
0.28
2
0.55
12/04/1909
1
0.28
3
0.83
07/15/1911
1
0.28
4
1.11
04/01/1913
1
0.28
5
1.39
07/28/1913
1
0.28
6
1.66
....
11/18/1955
1
0.28
358
99.17
11/22/1955
1
0.28
359
99.45
02/24/1956
1
0.28
360
99.72
08/01/1956
1
0.28
361
100.00
Frequency Missing = 9
Cumulative
Cumulative
stopmens
Frequency
Percent
Frequency
Percent
------------------------------------------------------------1
310
84.01
310
84.01
2
59
15.99
369
100.00
Frequency Missing = 1
3
Cumulative
Cumulative
menopause
Frequency
Percent
Frequency
Percent
-------------------------------------------------------------0
59
15.99
59
15.99
1
310
84.01
369
100.00
Frequency Missing = 1
Cumulative
Cumulative
educ
Frequency
Percent
Frequency
Percent
--------------------------------------------------------1
1
0.27
1
0.27
2
4
1.10
5
1.37
3
11
3.01
16
4.38
4
89
24.38
105
28.77
5
99
27.12
204
55.89
6
50
13.70
254
69.59
7
23
6.30
277
75.89
8
87
23.84
364
99.73
9
1
0.27
365
100.00
Frequency Missing = 5
Cumulative
Cumulative
edcat
Frequency
Percent
Frequency
Percent
---------------------------------------------------------1
105
28.85
105
28.85
2
149
40.93
254
69.78
3
110
30.22
364
100.00
Frequency Missing = 6
Cumulative
Cumulative
age
Frequency
Percent
Frequency
Percent
-------------------------------------------------------40
2
0.55
2
0.55
41
5
1.39
7
1.94
42
7
1.94
14
3.88
43
11
3.05
25
6.93
44
7
1.94
32
8.86
45
11
3.05
43
11.91
46
10
2.77
53
14.68
47
16
4.43
69
19.11
48
13
3.60
82
22.71
49
17
4.71
99
27.42
50
12
3.32
111
30.75
51
9
2.49
120
33.24
52
14
3.88
134
37.12
53
13
3.60
147
40.72
54
13
3.60
160
44.32
55
10
2.77
170
47.09
56
9
2.49
179
49.58
57
10
2.77
189
52.35
58
11
3.05
200
55.40
59
14
3.88
214
59.28
60
10
2.77
224
62.05
61
8
2.22
232
64.27
62
11
3.05
243
67.31
63
5
1.39
248
68.70
64
4
1.11
252
69.81
65
8
2.22
260
72.02
66
8
2.22
268
74.24
67
8
2.22
276
76.45
68
7
1.94
283
78.39
69
7
1.94
290
80.33
70
9
2.49
299
82.83
71
10
2.77
309
85.60
72
13
3.60
322
89.20
73
5
1.39
327
90.58
74
4
1.11
331
91.69
4
75
76
77
78
79
80
81
82
83
85
87
91
5
4
5
2
2
2
3
1
2
1
2
1
1.39
1.11
1.39
0.55
0.55
0.55
0.83
0.28
0.55
0.28
0.55
0.28
336
340
345
347
349
351
354
355
357
358
360
361
93.07
94.18
95.57
96.12
96.68
97.23
98.06
98.34
98.89
99.17
99.72
100.00
Frequency Missing = 9
Cumulative
Cumulative
agecat
Frequency
Percent
Frequency
Percent
----------------------------------------------------------1
99
27.42
99
27.42
2
115
31.86
214
59.28
3
76
21.05
290
80.33
4
71
19.67
361
100.00
Frequency Missing = 9
Cumulative
Cumulative
over50
Frequency
Percent
Frequency
Percent
----------------------------------------------------------0
99
27.42
99
27.42
1
262
72.58
361
100.00
Frequency Missing = 9
Cumulative
Cumulative
highage
Frequency
Percent
Frequency
Percent
-----------------------------------------------------------1
262
72.58
262
72.58
2
99
27.42
361
100.00
Frequency Missing = 9
Crosstabulation
Prior to fitting a logistic regression model, we check a crosstabulation to understand the relationship between
menopause and high age. In this 2 by 2 table both the predictor variable, HIGHAGE, and the outcome variable,
STOPMENS, are coded as 1 and 2. For HIGHAGE, the value 1 represents the high risk group (those whose age
is greater than or equal to 50 years), and for STOPMENS, the value 1 represents the outcome of interest (those
who are in menopause). Notice also that HIGHAGE is considered to be the risk factor so it is listed first (the
row variable) in the tables statement and STOPMENS is the outcome of interest so it is listed second (the
column variable).
We request the relative risk and the odds ratio.
/*Crosstabs of HIGHAGE by STOPMENS*/
title "2 x 2 Table";
title2 "HIGHAGE Coded as 1, 2";
proc freq data=bcancer;
tables highage*stopmens / relrisk chisq;
run;
2 x 2 Table
HIGHAGE Coded as 1, 2
The FREQ Procedure
Table of highage by stopmens
highage
stopmens
5
Frequency|
Percent |
Row Pct |
Col Pct |
1|
2| Total
---------+--------+--------+
1 |
251 |
10 |
261
| 69.72 |
2.78 | 72.50
| 96.17 |
3.83 |
| 83.39 | 16.95 |
---------+--------+--------+
2 |
50 |
49 |
99
| 13.89 | 13.61 | 27.50
| 50.51 | 49.49 |
| 16.61 | 83.05 |
---------+--------+--------+
Total
301
59
360
83.61
16.39
100.00
Frequency Missing = 10
Statistics for Table of highage by stopmens
Statistic
DF
Value
Prob
-----------------------------------------------------Chi-Square
1
109.2191
<.0001
Likelihood Ratio Chi-Square
1
99.0815
<.0001
Continuity Adj. Chi-Square
1
105.9122
<.0001
Mantel-Haenszel Chi-Square
1
108.9157
<.0001
Phi Coefficient
0.5508
Contingency Coefficient
0.4825
Cramer's V
0.5508
6
Fisher's Exact Test
---------------------------------Cell (1,1) Frequency (F)
251
Left-sided Pr <= F
1.0000
Right-sided Pr >= F
5.719E-23
Table Probability (P)
Two-sided Pr <= P
1.204E-21
5.719E-23
The output below says "Estimates of the Relative Risk (Row1/Row2)". This is what we want: the risk of
menopause for those who are high age (ROW1) divided by the risk of menopause for those who are not high
age (ROW2). To get the relative risk, we read the Cohort (Col 1 Risk) because we are interested in the relative
risk for being in menopause (Column 1 of STOPMENS). Notice that the odds ratio (24.6) is not a good estimate
of the risk ratio (1.90), because the outcome is not rare in this group of older women.
Estimates of the Relative Risk (Row1/Row2)
Type of Study
Value
95% Confidence Limits
----------------------------------------------------------------Case-Control (Odds Ratio)
24.5980
11.6802
51.8021
Cohort (Col1 Risk)
1.9041
1.5644
2.3176
Cohort (Col2 Risk)
0.0774
0.0408
0.1467
Effective Sample Size = 360
Frequency Missing = 10
Logistic Regression Model with a dummy variable predictor
We now fit a logistic regression model, but using two different variables: OVER50 (coded as 0, 1) is used as the
predictor, and MENOPAUSE (also coded as 0,1) is used as the outcome. We use the descending option so SAS
will fit the probability of being a 1, rather than of being a zero.
proc logistic data=bcancer descending;
model menopause = over50/ rsquare;
run;
Alternatively, the same model can be fitted by specifying the level of menopause that is to be considered the
"event", as shown below:
proc logistic data=bcancer;
model menopause(event="1") =
run;
over50/ rsquare;
You can check that the correct probability is being modeled by looking at the portion of the output that lists it.
In this case, SAS reports "Probability modeled is menopause=1." so we know the model is set up correctly. We use
the predictor variable in this model as OVER50, which is a binary dummy variable coded as 0, 1. Just as in a
linear regression model, we can use a dummy variable as a predictor in a logistic regression model.
The value of the parameter estimate for OVER50 (3.2) tells us that the log-odds of being in menopause
increase (because the estimate is positive) by 3.2 units for those in menopause compared to those women who
are not. This result is significant, Wald chi-square (1 df) = 71.04, p< 0.0001. The odds ratio (24.6) is easier to
interpret. It tells us that the odds of being in menopause are 24.6 times higher for a woman who is over 50 than
for someone who is not. We can see that the 95% CI for the odds ratio does not include 1, so we can be pretty
confident that there is a strong relationship between being over 50 and being in menopause.
Logistic Regression with Dummy Variable Predictor
Over50 Coded as 0, 1
The LOGISTIC Procedure
7
Model Information
Data Set
WORK.BCANCER
Response Variable
menopause
Number of Response Levels
2
Model
binary logit
Optimization Technique
Fisher's scoring
Number of Observations Read
Number of Observations Used
370
360
Response Profile
Ordered
Value
1
2
menopause
1
0
Total
Frequency
301
59
Probability modeled is menopause=1.
NOTE: 10 observations were deleted due to missing values for the response or explanatory
variables.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion
AIC
SC
-2 Log L
R-Square
0.2406
Intercept
Intercept
and
Only
Covariates
323.165
226.084
327.051
233.856
321.165
222.084
Max-rescaled R-Square
0.4076
Testing Global Null Hypothesis: BETA=0
Test
Chi-Square
DF
Pr > ChiSq
Likelihood Ratio
99.0815
1
<.0001
Score
109.2191
1
<.0001
Wald
71.0363
1
<.0001
Parameter
Intercept
over50
Analysis of Maximum Likelihood Estimates
Standard
Wald
DF
Estimate
Error
Chi-Square
1
0.0202
0.2010
0.0101
1
3.2026
0.3800
71.0363
Effect
over50
Pr > ChiSq
0.9199
<.0001
Odds Ratio Estimates
Point
95% Wald
Estimate
Confidence Limits
24.596
11.680
51.798
Association of Predicted Probabilities and Observed Responses
Percent Concordant
69.3
Somers' D
0.664
Percent Discordant
2.8
Gamma
0.922
Percent Tied
27.9
Tau-a
0.183
Pairs
17759
c
0.832
Logistic Regression Model with a class variable as predictor
We now fit the same model, but using a class variable, HIGHAGE (coded as 1=Highage and 2=Not Highage),
as the predictor. In this case, we want to use a class statement, because this predictor is not a dummy (0,1)
variable as in the previous example. Note that in the class statement, we specify the reference level for
HIGHAGE (ref="2"), and we also use the option param=ref after a forward slash. So, we will be fitting a
model in which we are comparing the odds of being in menopause for those women who are over 50
(HIGHAGE=1) to those who are not over 50 (HIGHAGE=2, the reference category).
Note that the results of this model fit are the same as in the previous model, but with some minor modifications.
8
proc logistic data=bcancer;
class highage(ref="2") / param=ref;
model menopause(event="1") = highage/ rsquare;
run;
Logistic Regression with a Class Statement
Highage used as Predictor
Reference Category is Not-Highage (HighAge=2)
The LOGISTIC Procedure
Model Information
Data Set
WORK.BCANCER
Response Variable
menopause
Number of Response Levels
2
Model
binary logit
Optimization Technique
Fisher's scoring
Number of Observations Read
Number of Observations Used
370
360
Response Profile
Ordered
Value
1
2
menopause
Total
Frequency
1
0
301
59
Probability modeled is menopause=1.
NOTE: 10 observations were deleted due to missing values for the response or explanatory
variables.
Check the Class Level Information to be sure SAS has set up the coding for HIGHAGE correctly. We see that
the Design Variables are set up so that HIGHAGE=1 is coded as 1, and HIGHAGE=2 is coded as 0, as we
intended.
Class Level Information
Design
Class
Value
Variables
highage
1
1
2
0
Model Fit Statistics
Criterion
AIC
SC
-2 Log L
R-Square
0.2406
Intercept
Only
323.165
327.051
321.165
Intercept
and
Covariates
226.084
233.856
222.084
Max-rescaled R-Square
0.4076
Testing Global Null Hypothesis: BETA=0
Test
Chi-Square
DF
Pr > ChiSq
Likelihood Ratio
99.0815
1
<.0001
Score
109.2191
1
<.0001
Wald
71.0363
1
<.0001
Type 3 Analysis of Effects
Wald
Effect
DF
Chi-Square
Pr > ChiSq
highage
1
71.0363
<.0001
9
The output for the parameter estimate is slightly different than for the previous model. In this case, we see
highage 1, to emphasize that this is for HIGHAGE=1 compared to the reference category (HIGHAGE=2).
Parameter
Intercept
highage
1
Analysis of Maximum Likelihood Estimates
Standard
Wald
DF
Estimate
Error
Chi-Square
1
0.0202
0.2010
0.0101
1
3.2026
0.3800
71.0363
Pr > ChiSq
0.9199
<.0001
The output for the odds ratio also emphasizes that we are looking at the odds ratio for HIGHAGE=1 vs. 2.
Odds Ratio Estimates
Point
95% Wald
Effect
Estimate
Confidence Limits
highage 1 vs 2
24.596
11.680
51.798
Logistic Regression Model with a class predictor with more than two categories
We now look at the relationship of education categories to menopause. Again, we begin by checking the crosstabulation between education and menopause, using the variable EDCAT as the "exposure" and STOPMENS as
the "outcome" or event. Because we are interested in the probability of STOPMENS = 1, for each level of
EDCAT, we really need only the row percents, so we suppress the display of column and total percent by using
the nocol nopercent options. We see in the output that the proportion of women in menopause decreases with
increasing education level.
title "Relationship of Education Categories to Menopause";
proc freq data=bcancer;
tables edcat*stopmens / chisq nocol nopercent;
run;
10
Table of edcat by stopmens
edcat
stopmens
Frequency|
Row Pct |
1|
2|
---------+--------+--------+
1 |
96 |
9 |
| 91.43 |
8.57 |
---------+--------+--------+
2 |
125 |
23 |
| 84.46 | 15.54 |
---------+--------+--------+
3 |
84 |
26 |
| 76.36 | 23.64 |
---------+--------+--------+
Total
305
58
Total
105
148
110
363
Frequency Missing = 7
Statistics for Table of edcat by stopmens
Statistic
DF
Value
Prob
-----------------------------------------------------Chi-Square
2
9.1172
0.0105
Likelihood Ratio Chi-Square
2
9.3370
0.0094
Mantel-Haenszel Chi-Square
1
9.0715
0.0026
Phi Coefficient
0.1585
Contingency Coefficient
0.1565
Cramer's V
0.1585
We now fit a logistic regression model, using EDCAT as a predictor, by including it in the class statement. The
reference category is EDCAT=1.
proc logistic data=bcancer;
class edcat(ref="1") / param = ref;
model menopause(event="1") = edcat/ rsquare;
run;
Logistic Regression to Predict Menopause From Education
The LOGISTIC Procedure
Model Information
Data Set
WORK.BCANCER
Response Variable
menopause
Number of Response Levels
2
Model
binary logit
Optimization Technique
Fisher's scoring
Number of Observations Read
Number of Observations Used
370
363
Response Profile
Ordered
Value
1
2
menopause
1
0
Total
Frequency
305
58
Probability modeled is menopause=1.
NOTE: 7 observations were deleted due to missing values for the response or explanatory
variables.
We can look at the class level information below to see that EDCAT=1 is the reference category, because it has
a value of 0 for all of the design variables.
Class Level Information
Design
Class
Value
Variables
11
edcat
1
2
3
0
1
0
0
0
1
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion
AIC
SC
-2 Log L
R-Square
0.0254
Intercept
Only
320.935
324.829
318.935
Intercept
and
Covariates
315.598
327.281
309.598
Max-rescaled R-Square
0.0434
The portion of the output on Testing the Global Null Hypothesis provides an overall test for all parameters in
the model. Thus, we can see that there is a likelihood ratio chi-square test of whether there is any effect of
EDCAT, Χ2 (2df) = 9.337, p=.0094. The Wald chi-square test is slightly smaller Χ2 (2df) = 8.63, p=.0134, but
gives similar results.
Testing Global Null Hypothesis: BETA=0
Test
Chi-Square
DF
Pr > ChiSq
Likelihood Ratio
9.3370
2
0.0094
Score
9.1172
2
0.0105
Wald
8.6314
2
0.0134
Effect
edcat
Type 3 Analysis of Effects
Wald
DF
Chi-Square
Pr > ChiSq
2
8.6314
0.0134
The parmeter estimate for EDCAT 2 shows that the log-odds of menopause for someone with EDCAT=2 are
smaller than for someone with EDCAT=1, but this difference is not significant (p=0.1050). The parameter
estimate for EDCAT 3 are negative, indicating that someone with EDCAT=3 has a lower log-odds of
menopause than a person with EDCAT=1, and this difference is significant (p=0.004) .
Parameter
Intercept
edcat
2
edcat
3
Analysis of Maximum Likelihood Estimates
Standard
Wald
DF
Estimate
Error
Chi-Square
1
2.3671
0.3486
46.1069
1
-0.6743
0.4159
2.6279
1
-1.1944
0.4146
8.2990
Pr > ChiSq
<.0001
0.1050
0.0040
The odds ratio estimate for EDCAT 3 vs 1 is .303, indicating that the odds of being in menopause for a person
with EDCAT=3 are only 30% of the odds of being in menopause for a person with EDCAT=1.
Effect
edcat 2 vs 1
edcat 3 vs 1
Odds Ratio Estimates
Point
95% Wald
Estimate
Confidence Limits
0.510
0.225
1.151
0.303
0.134
0.683
Association of Predicted Probabilities and Observed Responses
Percent Concordant
45.0
Somers' D
0.234
Percent Discordant
21.6
Gamma
0.352
Percent Tied
33.5
Tau-a
0.063
Pairs
17690
c
0.617
Logistic Regression Model with a continuous predictor
12
We now look at a logistic regression model, but this time with a single continuous predictor (AGE). We also
request ods graphics to obtain a plot showing the relationship between AGE and the estimated probability of
being in menopause. The units statement allows us to see the effect of a 1-year, 5-year, and 10-year increase in
age on the odds of being in menopause.
The parameter estimate for AGE is positive (0.2829) telling us that the log-odds of being in menopause increase
by .28 units for a woman who is one year older compared to her counterpart who is one year younger. The odds
ratio (1.33) tells us that the odds of being in menopause for a woman who is one year older are 1.33 times
greater. The last part of the output shows us the odds ratio for a one-year, five-year and 10-year increase in age.
We estimate that the odds of being in menopause for a woman who is five years older than her counterpart are
4.1 times greater, and for a 10-year increase in age, the odds of being in menopause are almost 17 times greater.
ods graphics on;
proc logistic data=bcancer plots=(effect);
model menopause(event="1") = age / rsquare;
units age = 1 5 10;
run;
ods graphics off;
Logistic Regression with a Continuous Predictor
The LOGISTIC Procedure
Model Information
Data Set
WORK.BCANCER
Response Variable
menopause
Number of Response Levels
2
Model
binary logit
Optimization Technique
Fisher's scoring
Number of Observations Read
Number of Observations Used
370
360
Response Profile
Ordered
Total
Value
menopause
Frequency
1
1
301
2
0
59
Probability modeled is menopause=1.
NOTE: 10 observations were deleted due to missing values for the response or explanatory
variables.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Intercept
Intercept
and
Criterion
Only
Covariates
AIC
323.165
201.019
SC
327.051
208.792
-2 Log L
321.165
197.019
R-Square
0.2917
Max-rescaled R-Square
0.4942
Testing Global Null Hypothesis: BETA=0
Test
Chi-Square
DF
Pr > ChiSq
Likelihood Ratio
124.1456
1
<.0001
Score
81.0669
1
<.0001
Wald
49.7646
1
<.0001
Parameter
Analysis of Maximum Likelihood Estimates
Standard
Wald
DF
Estimate
Error
Chi-Square
13
Pr > ChiSq
Intercept
age
1
1
-12.8675
0.2829
1.9360
0.0401
44.1735
49.7646
<.0001
<.0001
Odds Ratio Estimates
Effect
age
Point
Estimate
1.327
95% Wald
Confidence Limits
1.227
1.436
Association of Predicted Probabilities and Observed Responses
Percent Concordant
89.3
Somers' D
0.806
Percent Discordant
8.7
Gamma
0.822
Percent Tied
2.0
Tau-a
0.222
Pairs
17759
c
0.903
Adjusted Odds Ratios
Effect
Unit
Estimate
age
1.0000
1.327
age
5.0000
4.115
age
10.0000
16.935
Quasi-Complete Separation in a Logistic Regression Model
One fairly common occurrence in a logistic regression model is that the model fails to converge. This often
happens when you have a categorical predictor that is too perfect, that is, there may be a category with no
variability in the response (all subjects in one category of the predictor have the same response). This is called
14
quasi-complete separation. When this happens, SAS will give a warning message in the output. These warnings
should be taken seriously, and the model should be refitted, perhaps by combining some categories of the
predictor.
Even if there is not quasi-complete separation, separation may be nearly complete, so the standard error for a
parameter estimate can become very large. It is good practice to examine the parameter estimates and their
standard errors carefully for any logistic regression output.
We now examine a situation where quasi-complete separation occurs, using the variable AGECAT as a
predictor in a logistic regression. However, first we check the crosstabulation between AGECAT and
STOPMENS, using Proc Freq. Notice that in the highest age category, all 71 women are in menopause (not
surprisingly).
title "Relationship of AGECAT to MENOPAUSE";
proc freq data=bcancer;
tables agecat*stopmens/ chisq nocol nopercent;
run;
Relationship of AGECAT to MENOPAUSE
The FREQ Procedure
Table of agecat by stopmens
agecat
stopmens
Frequency|
Row Pct |
1|
2| Total
---------+--------+--------+
1 |
50 |
49 |
99
| 50.51 | 49.49 |
---------+--------+--------+
2 |
106 |
9 |
115
| 92.17 |
7.83 |
---------+--------+--------+
3 |
74 |
1 |
75
| 98.67 |
1.33 |
---------+--------+--------+
4 |
71 |
0 |
71
| 100.00 |
0.00 |
---------+--------+--------+
Total
301
59
360
Frequency Missing = 10
Statistics for Table of agecat by stopmens
Statistic
DF
Value
Prob
-----------------------------------------------------Chi-Square
3
111.6605
<.0001
Likelihood Ratio Chi-Square
3
110.1752
<.0001
Mantel-Haenszel Chi-Square
1
78.6978
<.0001
Phi Coefficient
0.5569
Contingency Coefficient
0.4866
Cramer's V
0.5569
Effective Sample Size = 360
Frequency Missing = 10
We now fit the corresponding logistic regression model, using AGECAT in a class statement, with AGECAT=1
as the reference category.
proc logistic data=bcancer;
class agecat(ref="1") / param = ref;
model menopause(event="1") = agecat/ rsquare;
run;
15
Logistic Regression with AGECAT as Predictor
This Analysis Does not Work
Check out the Parameter Estimates and Standard Errors
The LOGISTIC Procedure
Model Information
Data Set
WORK.BCANCER
Response Variable
menopause
Number of Response Levels
2
Model
binary logit
Optimization Technique
Fisher's scoring
Number of Observations Read
370
Number of Observations Used
360
Response Profile
Ordered
Total
Value
menopause
Frequency
1
1
301
2
0
59
Probability modeled is menopause=1.
NOTE: 10 observations were deleted due to missing values for the response or explanatory
variables.
We check out the Class Level Information to see that the design variables are set up correctly to have
AGECAT=1 as the reference category.
Class
agecat
Class Level Information
Value
Design Variables
1
0
0
0
2
1
0
0
3
0
1
0
4
0
0
1
The notes in the output below alert us to the problem of quasi-complete separation of data points.
Model Convergence Status
Quasi-complete separation of data points detected.
WARNING: The maximum likelihood estimate may not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are
based on the last maximum likelihood iteration. Validity of the model fit is
questionable.
WARNING: The validity of the model fit is questionable.
Model Fit Statistics
Criterion
AIC
SC
-2 Log L
R-Square
0.2636
Intercept
Only
323.165
327.051
321.165
Intercept
and
Covariates
218.990
234.534
210.990
Max-rescaled R-Square
0.4467
Testing Global Null Hypothesis: BETA=0
Test
Chi-Square
DF
Pr > ChiSq
Likelihood Ratio
110.1752
3
<.0001
Score
111.6605
3
<.0001
Wald
50.0793
3
<.0001
Effect
agecat
Type 3 Analysis of Effects
Wald
DF
Chi-Square
Pr > ChiSq
3
50.0793
<.0001
16
Analysis of Maximum Likelihood Estimates
Parameter
Intercept
agecat
2
agecat
3
agecat
4
DF
1
1
1
1
Estimate
0.0202
2.4460
4.2839
14.8969
Standard
Error
0.2010
0.4012
1.0266
205.9
Wald
Chi-Square
0.0101
37.1721
17.4126
0.0052
Pr > ChiSq
0.9199
<.0001
<.0001
0.9423
WARNING: The validity of the model fit is questionable.
Note that the point estimate of the odds ratio for AGECAT=4 is >999.999 and the 95% confidence interval is
huge. This alerts us that there is a problem with AGECAT=4.
Odds Ratio Estimates
Point
95% Wald
Effect
Estimate
Confidence Limits
agecat 2 vs 1
11.542
5.258
25.339
agecat 3 vs 1
72.520
9.696
542.384
agecat 4 vs 1
>999.999
<0.001
>999.999
Association of Predicted Probabilities and Observed Responses
Percent Concordant
77.0
Somers' D
0.736
Percent Discordant
3.4
Gamma
0.915
Percent Tied
19.6
Tau-a
0.202
Pairs
17759
c
0.868
Based on the information that we saw in the crosstabulation, we will create a new variable AGECAT3, with 3
age categories, collapsing category 3 and category 4.
/*Recode Agecat into AGECAT3
data bcancer2;
set bcancer;
if age not=. then do;
if age < 50 then agecat3
if age >=50 and age < 60
if age >=60 then agecat3
end;
run;
with 3 categories*/
= 1;
then agecat3 = 2;
= 3;
We now fit a new logistic regression, with AGECAT3 as a categorical predictor. Note in the output for this
model, we do not have a problem with quasi-complete separation, but we do have a very large confidence
interval for the odds ratio for AGECAT=3, owing to the fact that there was only one person who was not in
menopause in this category.
proc logistic data=bcancer2;
class agecat3(ref="1") / param = ref;
model menopause(event="1") = agecat3/ rsquare;
run;
17
Logistic Regression with Recoded Age Categories
This Analysis Works
The LOGISTIC Procedure
Model Information
Data Set
Response Variable
Number of Response Levels
Model
Optimization Technique
WORK.BCANCER2
menopause
2
binary logit
Fisher's scoring
Number of Observations Read
Number of Observations Used
370
360
Response Profile
Ordered
Value
1
2
menopause
Total
Frequency
1
0
301
59
Probability modeled is menopause=1.
NOTE: 10 observations were deleted due to missing values for the response or explanatory
variables.
Class Level Information
Design
Class
Value
Variables
agecat3
1
0
0
2
1
0
3
0
1
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion
AIC
SC
-2 Log L
R-Square
0.2609
Intercept
Only
323.165
327.051
321.165
Intercept
and
Covariates
218.329
229.987
212.329
Max-rescaled R-Square
0.4420
Testing Global Null Hypothesis: BETA=0
Test
Chi-Square
DF
Pr > ChiSq
Likelihood Ratio
108.8365
2
<.0001
Score
111.6132
2
<.0001
Wald
55.3535
2
<.0001
Type 3 Analysis of Effects
Wald
Effect
DF
Chi-Square
Pr > ChiSq
agecat3
2
55.3535
<.0001
Parameter
Intercept
agecat3
2
agecat3
3
Analysis of Maximum Likelihood Estimates
Standard
Wald
DF
Estimate
Error
Chi-Square
1
0.0202
0.2010
0.0101
1
2.4460
0.4012
37.1721
1
4.9565
1.0234
23.4578
Odds Ratio Estimates
Point
95% Wald
18
Pr > ChiSq
0.9199
<.0001
<.0001
Effect
agecat3 2 vs 1
agecat3 3 vs 1
Estimate
11.542
142.097
Confidence Limits
5.258
25.339
19.120
>999.999
Association of Predicted Probabilities and Observed Responses
Percent Concordant
76.6
Somers' D
0.732
Percent Discordant
3.4
Gamma
0.915
Percent Tied
20.0
Tau-a
0.201
Pairs
17759
c
0.866
Logistic Regression Model with Several Predictors
We now fit a logistic regression model with several predictors, both continuous and categorical. Note especially
the global test for the model, which has 6 degrees of freedom, due to the 6 parameters that are estimated for the
predictors in the model. There are two parameters for EDCAT and one each for AGE, SMOKER, TOTINCOM,
and NUMPREG1. The only predictor that is significant in this model is AGE (p<0.0001)..
proc logistic data=bcancer;
class edcat(ref="1") / param = ref;
model menopause(event="1") = age edcat smoker totincom numpreg1
/ rsquare;
run;
Logistic Regression with Several Predictors
The LOGISTIC Procedure
Model Information
Data Set
WORK.BCANCER
Response Variable
menopause
Number of Response Levels
2
Model
binary logit
Optimization Technique
Fisher's scoring
Number of Observations Read
Number of Observations Used
370
313
Response Profile
Ordered
Total
Value
menopause
Frequency
1
1
259
2
0
54
Probability modeled is menopause=1.
NOTE: 57 observations were deleted due to missing values for the response or explanatory
variables.
Class Level Information
Design
Class
Value
Variables
edcat
1
0
0
2
1
0
3
0
1
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion
AIC
SC
-2 Log L
R-Square
0.2971
Intercept
Intercept
and
Only
Covariates
289.876
191.510
293.622
217.734
287.876
177.510
Max-rescaled R-Square
19
0.4941
Testing Global Null Hypothesis: BETA=0
Test
Chi-Square
DF
Pr > ChiSq
Likelihood Ratio
110.3657
6
<.0001
Score
73.1512
6
<.0001
Wald
44.6630
6
<.0001
Type 3 Analysis of Effects
Wald
Effect
DF
Chi-Square
Pr > ChiSq
age
1
40.6094
<.0001
edcat
2
2.3776
0.3046
smoker
1
2.9092
0.0881
totincom
1
0.3032
0.5819
numpreg1
1
0.0025
0.9605
Parameter
Intercept
age
edcat
2
edcat
3
smoker
totincom
numpreg1
Analysis of Maximum Likelihood Estimates
Standard
Wald
DF
Estimate
Error
Chi-Square
1
-10.8151
2.2132
23.8788
1
0.2797
0.0439
40.6094
1
-0.4356
0.5524
0.6219
1
-0.8401
0.5636
2.2214
1
-0.6543
0.3836
2.9092
1
-0.0927
0.1683
0.3032
1
0.00646
0.1305
0.0025
Pr > ChiSq
<.0001
<.0001
0.4304
0.1361
0.0881
0.5819
0.9605
Odds Ratio Estimates
Point
95% Wald
Effect
Estimate
Confidence Limits
age
1.323
1.214
1.442
edcat
2 vs 1
0.647
0.219
1.910
edcat
3 vs 1
0.432
0.143
1.303
smoker
0.520
0.245
1.102
totincom
0.911
0.655
1.268
numpreg1
1.006
0.779
1.300
Association of Predicted Probabilities and Observed Responses
Percent Concordant
90.0
Somers' D
0.802
Percent Discordant
9.8
Gamma
0.804
Percent Tied
0.2
Tau-a
0.230
Pairs
13986
c
0.901
Logistic Regression Model Using Proc Genmod
Logistic regression models, along with several other types of models, can be fitted using Proc Genmod. You
need to supply the distribution that the dependent variable has (in this case we use dist=bin), and you can also
specify a link function. In this case we use the default link (for a binary distribution, the default link is logit), so
we will be fitting the same model as in Proc Logistic. The type3 option in the model statement gives us the
type3 test for each predictor in the model. However, unlike Proc Logistic, which gives Wald tests in the type3
output, we get Likelihood Ratio tests, which are preferable.
proc genmod data=bcancer descending;
class edcat(ref="1") / param = ref;
model menopause = age edcat smoker totincom numpreg1
/ dist=bin type3;
run;
Logistic Regression Using Proc Genmod
The GENMOD Procedure
Model Information
Data Set
WORK.BCANCER
Distribution
Binomial
Link Function
Logit
20
Dependent Variable
menopause
Number of Observations Read
Number of Observations Used
Number of Events
Number of Trials
Missing Values
370
313
259
313
57
Class Level Information
Design
Class
Value
Variables
edcat
1
0
0
2
1
0
3
0
1
Response Profile
Ordered
Value
1
2
menopause
1
0
Total
Frequency
259
54
PROC GENMOD is modeling the probability that menopause='1'.
Criteria For Assessing Goodness Of Fit
Criterion
DF
Value
Deviance
306
177.5102
Scaled Deviance
306
177.5102
Pearson Chi-Square
306
297.4367
Scaled Pearson X2
306
297.4367
Log Likelihood
-88.7551
Value/DF
0.5801
0.5801
0.9720
0.9720
Algorithm converged.
Parameter
Intercept
age
edcat
edcat
smoker
totincom
numpreg1
Scale
2
3
DF
1
1
1
1
1
1
1
0
Analysis Of Parameter Estimates
Standard
Wald 95% Confidence
Estimate
Error
Limits
-10.8151
2.2132
-15.1530
-6.4773
0.2797
0.0439
0.1937
0.3658
-0.4356
0.5524
-1.5182
0.6470
-0.8401
0.5636
-1.9448
0.2647
-0.6543
0.3836
-1.4062
0.0976
-0.0927
0.1683
-0.4226
0.2372
0.0065
0.1305
-0.2494
0.2623
1.0000
0.0000
1.0000
1.0000
ChiSquare
23.88
40.61
0.62
2.22
2.91
0.30
0.00
Pr > ChiSq
<.0001
<.0001
0.4304
0.1361
0.0881
0.5819
0.9605
NOTE: The scale parameter was held fixed.
LR Statistics For Type 3 Analysis
ChiSource
DF
Square
Pr > ChiSq
age
1
89.12
<.0001
edcat
2
2.45
0.2932
smoker
1
2.96
0.0852
totincom
1
0.31
0.5794
numpreg1
1
0.00
0.9605
Compare these LR (Likelihood Ratio) statistics in the Type 3 Analysis with the Type 3 tests from Proc Logistic
shown below. Those given below show more significant digits, and are in some cases quite different.
Type 3 Analysis of Effects
Wald
Effect
DF
Chi-Square
Pr > ChiSq
age
1
40.6094
<.0001
edcat
2
2.3776
0.3046
smoker
1
2.9092
0.0881
totincom
1
0.3032
0.5819
numpreg1
1
0.0025
0.9605
Estimating the Risk Ratio Using Proc Genmod
21
Because we can specify a large range of link functions using Proc Genmod, we have the opportunity to fit a
model using the Log link with the binary distribution. When we exponentiate this estimate, we get an estimate
of the Risk Ratio, rather than the Odds Ratio. Here is the code for this example, using HIGHAGE as the
predictor:
proc genmod data=bcancer descending;
class highage(ref="2") / param = ref;
model menopause = highage / dist=bin type3 link=log;
estimate "Effect of High Age" highage 1 / exp;
run;
Partial output from this model is shown below. Note that the Risk Ratio (1.90) matches the one we calculated
for HIGHAGE in the Proc Freq output at the beginning of this handout:
Criteria For Assessing Goodness Of Fit
Criterion
Log Likelihood
Full Log Likelihood
AIC (smaller is better)
AICC (smaller is better)
BIC (smaller is better)
DF
Value
-111.0418
-111.0418
226.0836
226.1172
233.8558
Value/DF
Algorithm converged.
Analysis Of Maximum Likelihood Parameter Estimates
Parameter
Intercept
highage
Scale
1
DF
Estimate
Standard
Error
1
1
0
-0.6831
0.6440
1.0000
0.0995
0.1003
0.0000
Wald 95% Confidence
Limits
-0.8781
0.4475
1.0000
-0.4881
0.8405
1.0000
Wald
Chi-Square
Pr > ChiSq
47.14
41.26
<.0001
<.0001
NOTE: The scale parameter was held fixed.
LR Statistics For Type 3 Analysis
Source
highage
Label
Effect of High Age
Exp(Effect of High Age)
Mean
Estimate
1.9041
DF
1
ChiSquare
99.08
Pr > ChiSq
<.0001
Contrast Estimate Results
Mean
L'Beta
Confidence Limits Estimate
1.5644
2.3176
0.6440
1.9041
Standard
Error
0.1003
0.1909
Contrast Estimate Results
ChiLabel
Square
Pr > ChiSq
Effect of High Age
41.26
<.0001
Exp(Effect of High Age)
22
Alpha
0.05
0.05
L'Beta
Confidence Limits
0.4475
0.8405
1.5644
2.3176
Download