Logistic Regression Example with Grouped Data

Logistic Regression, Testing for Interaction
Example, Influenza Shots
A local health clinic sent fliers to its clients to encourage everyone, but especially older persons
at high risk of complications, to get a flu shot for protection against an expected flu epidemic. In
a pilot follow-up study, 159 clients were randomly selected and asked whether they actually
received a flu shot. A client who received a flu shot was coded Y = 1; and a client who did not
receive a flu shot was coded Y = 0. In addition, data were collected on their age (X1) and their
health awareness. The latter data were combined into a health awareness index (X2), for which
higher values indicate greater awareness. Also included in the data were client gender (X3), with
males coded X3 = 1 and females coded X3 = 0.
It is suspected that there may be some interactions between predictor variables; e.g., perhaps the
relationship between health awareness and the response variable is moderated by gender. Hence,
we want to test for interaction effects. To do this, we will estimate two logistic regression
models – one with interactions included and another without.
Assuming that we have already looked at the univariate relationships between Y and each
explanatory variable, we will proceed to look for possible interaction effects.
The reduced model (with only the three explanatory variables) is
\[
\ln\left(\frac{\pi(x_i)}{1-\pi(x_i)}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i ,
\]
where the i subscript denotes the ith observation in the data set, and $\varepsilon_i$ is a random error term
associated with the ith observation.
The full model (with interaction terms included) is
\[
\ln\left(\frac{\pi(x_i)}{1-\pi(x_i)}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_{12} X_{1i} X_{2i} + \beta_{13} X_{1i} X_{3i} + \beta_{23} X_{2i} X_{3i} + \varepsilon_i .
\]
We want to test whether there are interaction effects present.
Step 1: H0: $\beta_{12} = \beta_{13} = \beta_{23} = 0$
v.
HA: Not all 0.
Step 2: We have n = 159, $\alpha$ = 0.05.
Step 3: The test statistic is the Likelihood Ratio chi-square statistic $G^2 = -2(l_0 - l_1)$, where $l_1$ is
the maximum of the log-likelihood function for the model with the interaction terms as well as
the predictors, and $l_0$ is the maximum of the log-likelihood function for the model with just the
predictor variables. Under the null hypothesis, the statistic $G^2$ has a chi-square distribution with
d.f. = 3.
Step 4: We will reject the null hypothesis if $G^2 > \chi^2_{3,\,0.05} = 7.81$.
Step 5: From the output, we find G2 = 105.093 – 104.994 = 0.099.
Step 6: We fail to reject the null hypothesis at the 0.05 level of significance. We do not have
sufficient evidence to conclude that the interaction terms need to be included in the model.
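The arithmetic in Steps 3 through 6 can be checked with a short script. The sketch below is written in Python rather than SAS and uses only the standard library; the function names are illustrative, and the closed-form tail probability is valid only for 3 degrees of freedom.

```python
import math

def chi2_sf_df3(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 3 d.f."""
    # Closed form for 3 d.f. only: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

def lr_statistic(neg2logl_reduced, neg2logl_full):
    """G^2 = -2(l0 - l1), computed from the '-2 Log L' values SAS reports."""
    return neg2logl_reduced - neg2logl_full

# -2 Log L for the main-effects model and for the model with interactions
g2 = lr_statistic(105.093, 104.994)
p_value = chi2_sf_df3(g2)
reject = g2 > 7.81  # critical value from Step 4

print(f"G^2 = {g2:.3f}, p-value = {p_value:.3f}, reject H0: {reject}")
```

A p-value this close to 1 agrees with the decision in Step 6: the interaction terms add almost nothing to the fit.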
If we had rejected the null hypothesis, then we could have used a follow-up procedure, such as
stepwise multiple regression, to find which explanatory variables and which interaction terms
would need to be included in the model. If we find that a particular interaction term is
significant, then we would also want to include the two explanatory variables involved in that
interaction in our final model.
(Note: In this case, if we were to perform stepwise regression, we would find that only two of
the predictors, Age and Health Awareness, would need to be included in our final model.)
The estimated model is therefore (from the output of the third PROC LOGISTIC, which includes only Age and Health Awareness):
\[
\ln\left(\frac{\hat\pi(x_i)}{1-\hat\pi(x_i)}\right) = \hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 X_{2i} ,
\]
or
\[
\ln\left(\frac{\hat\pi(x_i)}{1-\hat\pi(x_i)}\right) = -1.4578 + 0.0779 \,(\text{Age}) - 0.0955 \,(\text{Health Awareness}) . \qquad (1)
\]
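To see what the fitted equation implies on the probability scale, the logit can be inverted. The following is a minimal sketch in Python (standard library only); the function name and the three example clients are hypothetical and serve only to illustrate the direction of each effect.

```python
import math

# Coefficients of the fitted model: intercept, Age, Health Awareness
B0, B_AGE, B_AWARE = -1.4578, 0.0779, -0.0955

def predicted_probability(age, awareness):
    """Invert the logit: pi = 1 / (1 + exp(-(b0 + b1*age + b2*awareness)))."""
    logit = B0 + B_AGE * age + B_AWARE * awareness
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical clients (age, health awareness index), not rows of the data set
for age, awareness in [(55, 60), (70, 60), (70, 45)]:
    p = predicted_probability(age, awareness)
    print(f"age={age}, awareness={awareness}: estimated P(flu shot) = {p:.3f}")
```

Because the Age coefficient is positive and the Health Awareness coefficient is negative, the estimated probability rises with age and falls as the awareness index increases.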
We also want to check the assumption that the logit is linear in each of the (continuous) predictor
variables. There are several ways to do this. One way is a rather tedious, graphical approach,
involving grouped data. A simpler approach is to test whether we need to include nonlinear
terms in the model. To do this, we will use the final model above as our reduced model, and add
quadratic terms in each of the two explanatory variables, so that the full model is
\[
\ln\left(\frac{\pi(x_i)}{1-\pi(x_i)}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \gamma_1 X_{1i}^2 + \gamma_2 X_{2i}^2 + \varepsilon_i .
\]
We want to test whether these quadratic terms are needed. The last PROC LOGISTIC in the
SAS program below estimates the model with the quadratic terms included.
Step 1: H0: $\gamma_1 = \gamma_2 = 0$
v.
HA: Not both 0.
Step 2: We have n = 159, $\alpha$ = 0.05.
Step 3: The test statistic is the Likelihood Ratio chi-square statistic $G^2 = -2(l_0 - l_1)$, where $l_1$ is
the maximum of the log-likelihood function for the model with the quadratic terms as well as the
predictors, and $l_0$ is the maximum of the log-likelihood function for the model with just the
predictor variables. Under the null hypothesis, the statistic $G^2$ has a chi-square distribution with
d.f. = 2.
Step 4: We will reject the null hypothesis if $G^2 > \chi^2_{2,\,0.05} = 5.99$.
Step 5: From the output, we find G2 = 105.795 – 104.706 = 1.089.
Step 6: We fail to reject the null hypothesis at the 0.05 level of significance. We do not have
sufficient evidence to conclude that the quadratic terms need to be included in the model.
Our final model therefore is given by Equation (1) above.
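With 2 degrees of freedom the chi-square tail probability has a simple closed form, so both the critical value in Step 4 and a p-value for Step 5 can be computed directly. The sketch below is in Python (standard library only); the function names are illustrative.

```python
import math

def chi2_sf_df2(x):
    """Upper-tail probability for chi-square with 2 d.f.: exactly exp(-x/2)."""
    return math.exp(-x / 2.0)

def chi2_critical_df2(alpha):
    """Critical value solving exp(-x/2) = alpha, i.e. x = -2 ln(alpha)."""
    return -2.0 * math.log(alpha)

# -2 Log L for the two-predictor model minus -2 Log L for the quadratic model
g2 = 105.795 - 104.706
critical = chi2_critical_df2(0.05)
p_value = chi2_sf_df2(g2)

print(f"G^2 = {g2:.3f}, critical value = {critical:.2f}, p-value = {p_value:.3f}")
```

The statistic falls well below the critical value, which matches the conclusion in Step 6.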
The estimate of the regression slope for Age is $\hat\beta_1 = 0.0779$, with a standard error of
$\mathrm{S.E.}(\hat\beta_1) = 0.0297$. Thus, a 95% confidence interval estimate for the slope is
$\hat\beta_1 \pm 1.96\,\mathrm{S.E.}(\hat\beta_1) = (0.019688,\ 0.136112)$. Now, since Age is a continuous variable, it is not very
interesting to consider the odds ratio for a unit increase in Age. Instead, we will calculate a 95%
confidence interval estimate of the odds ratio for an increase in Age of 5 years. The point
estimate of the odds ratio is $e^{0.0779(5)} = 1.4762$, and a 95% confidence interval estimate is
$(e^{0.019688(5)},\ e^{0.136112(5)}) = (1.1034,\ 1.9750)$. We are 95% confident that the odds of having had a
flu shot are multiplied by a factor of between 1.1034 and 1.9750 for each 5-year increase in Age,
holding health awareness constant, for this population.
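The interval arithmetic above is easy to reproduce. Here is a minimal Python sketch (standard library only); the helper name `odds_ratio_ci` and its `increment` parameter are illustrative and do not correspond to a SAS option.

```python
import math

def odds_ratio_ci(beta_hat, std_err, increment=1.0, z=1.96):
    """Wald point estimate and confidence limits for the odds ratio
    associated with an `increment`-unit increase in the predictor."""
    lower = beta_hat - z * std_err
    upper = beta_hat + z * std_err
    return tuple(math.exp(increment * b) for b in (beta_hat, lower, upper))

# Slope and standard error for Age from the two-predictor model
point, low, high = odds_ratio_ci(0.0779, 0.0297, increment=5)
print(f"Odds ratio for +5 years: {point:.4f} (95% CI {low:.4f} to {high:.4f})")
```

Calling the same helper with `increment=1` gives the per-year odds ratio and Wald limits that appear in the Odds Ratio Estimates table for x1.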
The SAS program for conducting data analysis is given below, followed by the output.
SAS Program
/* Value labels for the response and for gender */
proc format;
   value difmt  0 = "No "
                1 = "Yes";
   value sexfmt 0 = "Female"
                1 = "Male ";
run;

/* Read the data and create the interaction and quadratic terms */
data flushot;
   input y x1 x2 x3;
   x1x2 = x1*x2;
   x1x3 = x1*x3;
   x2x3 = x2*x3;
   x1sq = x1**2;
   x2sq = x2**2;
   label y    = "Flu Shot?"
         x1   = "Age in Years"
         x2   = "Health Awareness Index"
         x3   = "Gender"
         x1x2 = "Interaction of Age with Health Awareness"
         x1x3 = "Interaction of Age with Gender"
         x2x3 = "Interaction of Health Awareness with Gender"
         x1sq = "Square of Age in Years"
         x2sq = "Square of Health Awareness";
   format y difmt. x3 sexfmt.;
   cards;
The data set is listed in the appendix.
;

/* Model 1: main effects only (reduced model for the interaction test) */
proc logistic;
   model y (order=formatted event='Yes') = x1 x2 x3;
   title "Multiple Logistic Regression of Flu Shot";
   title2 "Against Age, Health Awareness, and Gender";
run;

/* Model 2: main effects plus all two-way interactions */
proc logistic;
   model y (order=formatted event='Yes') = x1 x2 x3 x1x2 x1x3 x2x3;
   title "Multiple Logistic Regression of Flu Shot";
   title2 "Against Age and Health Awareness";
   title3 "With Interaction Terms Included";
run;

/* Model 3: Age and Health Awareness only (final model) */
proc logistic;
   model y (order=formatted event='Yes') = x1 x2;
   title "Multiple Logistic Regression of Flu Shot";
   title2 "Against Age and Health Awareness";
   title3;   /* clears the third title line left over from Model 2 */
run;

/* Model 4: final model plus quadratic terms (linearity check) */
proc logistic;
   model y (order=formatted event='Yes') = x1 x2 x1sq x2sq;
   title "Multiple Logistic Regression of Flu Shot";
   title2 "Against Age and Health Awareness";
   title3 "With Quadratic Terms Included";
run;
Output of SAS Program
Multiple Logistic Regression of Flu Shot
Against Age, Health Awareness, and Gender

The LOGISTIC Procedure

Model Information
Data Set                       WORK.FLUSHOT
Response Variable              y (Flu Shot?)
Number of Response Levels      2
Model                          binary logit
Optimization Technique         Fisher's scoring

Number of Observations Read    159
Number of Observations Used    159

Response Profile
Ordered Value    y      Total Frequency
1                No     135
2                Yes     24

Probability modeled is y='Yes'.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
Criterion    Intercept Only    Intercept and Covariates
AIC          136.941           113.093
SC           140.010           125.369
-2 Log L     134.941           105.093

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    29.8476       3     <.0001
Score               27.0173       3     <.0001
Wald                19.9803       3     0.0002

Analysis of Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept    1     -1.1772     2.9824            0.1558             0.6930
x1           1      0.0728     0.0304            5.7401             0.0166
x2           1     -0.0990     0.0335            8.7419             0.0031
x3           1      0.4339     0.5218            0.6917             0.4056

Odds Ratio Estimates
Effect    Point Estimate    95% Wald Confidence Limits
x1        1.076             1.013    1.141
x2        0.906             0.848    0.967
x3        1.543             0.555    4.291

Association of Predicted Probabilities and Observed Responses
Percent Concordant    82.1     Somers' D    0.644
Percent Discordant    17.7     Gamma        0.645
Percent Tied           0.2     Tau-a        0.166
Pairs                 3240     c            0.822
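The four rank-correlation indexes in the association table can be recovered from the concordant, discordant, and tied percentages together with the response counts (24 "Yes", 135 "No"). The Python sketch below (standard library only) uses the usual definitions of these indexes; the function name is illustrative, and because the printed percentages are rounded, the results agree with the SAS values only to about three decimal places.

```python
def association_stats(pct_concordant, pct_discordant, pct_tied,
                      n_events, n_nonevents):
    """Rank-correlation indexes computed from pair percentages."""
    n = n_events + n_nonevents
    pairs = n_events * n_nonevents          # every (event, non-event) pair
    nc = pct_concordant / 100.0 * pairs     # concordant pairs
    nd = pct_discordant / 100.0 * pairs     # discordant pairs
    nt = pct_tied / 100.0 * pairs           # tied pairs
    return {
        "pairs": pairs,
        "somers_d": (nc - nd) / pairs,
        "gamma": (nc - nd) / (nc + nd),
        "tau_a": (nc - nd) / (n * (n - 1) / 2.0),
        "c": (nc + 0.5 * nt) / pairs,
    }

# Percentages from the main-effects model (Age, Health Awareness, Gender)
stats = association_stats(82.1, 17.7, 0.2, n_events=24, n_nonevents=135)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```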

Multiple Logistic Regression of Flu Shot
Against Age and Health Awareness
With Interaction Terms Included

The LOGISTIC Procedure

Model Information
Data Set                       WORK.FLUSHOT
Response Variable              y (Flu Shot?)
Number of Response Levels      2
Model                          binary logit
Optimization Technique         Fisher's scoring

Number of Observations Read    159
Number of Observations Used    159

Response Profile
Ordered Value    y      Total Frequency
1                No     135
2                Yes     24

Probability modeled is y='Yes'.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
Criterion    Intercept Only    Intercept and Covariates
AIC          136.941           118.994
SC           140.010           140.476
-2 Log L     134.941           104.994

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    29.9472       6     <.0001
Score               32.4819       6     <.0001
Wald                19.4560       6     0.0035

Analysis of Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept    1      2.6164     13.5486           0.0373             0.8469
x1           1      0.0138      0.2007           0.0047             0.9452
x2           1     -0.1650      0.2490           0.4394             0.5074
x3           1      0.3907      6.0739           0.0041             0.9487
x1x2         1      0.00103     0.00371          0.0773             0.7809
x1x3         1      0.00588     0.0615           0.0091             0.9238
x2x3         1     -0.00634     0.0679           0.0087             0.9255

Odds Ratio Estimates
Effect    Point Estimate    95% Wald Confidence Limits
x1        1.014             0.684     1.503
x2        0.848             0.520     1.381
x3        1.478             <0.001    >999.999
x1x2      1.001             0.994     1.008
x1x3      1.006             0.892     1.135
x2x3      0.994             0.870     1.135

Association of Predicted Probabilities and Observed Responses
Percent Concordant    82.5     Somers' D    0.653
Percent Discordant    17.2     Gamma        0.656
Percent Tied           0.4     Tau-a        0.168
Pairs                 3240     c            0.827

Multiple Logistic Regression of Flu Shot
Against Age and Health Awareness

The LOGISTIC Procedure

Model Information
Data Set                       WORK.FLUSHOT
Response Variable              y (Flu Shot?)
Number of Response Levels      2
Model                          binary logit
Optimization Technique         Fisher's scoring

Number of Observations Read    159
Number of Observations Used    159

Response Profile
Ordered Value    y      Total Frequency
1                No     135
2                Yes     24

Probability modeled is y='Yes'.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
Criterion    Intercept Only    Intercept and Covariates
AIC          136.941           111.795
SC           140.010           121.002
-2 Log L     134.941           105.795

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    29.1454       2     <.0001
Score               26.7071       2     <.0001
Wald                19.8291       2     <.0001

Analysis of Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept    1     -1.4578     2.9153            0.2500             0.6170
x1           1      0.0779     0.0297            6.8761             0.0087
x2           1     -0.0955     0.0324            8.6786             0.0032

Odds Ratio Estimates
Effect    Point Estimate    95% Wald Confidence Limits
x1        1.081             1.020    1.146
x2        0.909             0.853    0.969

Association of Predicted Probabilities and Observed Responses
Percent Concordant    80.7     Somers' D    0.618
Percent Discordant    18.9     Gamma        0.620
Percent Tied           0.4     Tau-a        0.159
Pairs                 3240     c            0.809
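The Wald chi-square column is the squared ratio of each estimate to its standard error, referred to a chi-square distribution with 1 degree of freedom. The sketch below (Python, standard library only; the function name is illustrative) recomputes it for the two slopes of the Age and Health Awareness model. Because the printed estimates and standard errors are rounded to four decimals, the recomputed statistics differ slightly from the SAS values in the second decimal place.

```python
import math

def wald_test(estimate, std_err):
    """Wald chi-square (1 d.f.) and its p-value for a single coefficient."""
    chi_square = (estimate / std_err) ** 2
    # For 1 d.f., P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value

# (parameter, estimate, standard error) as printed in the output
for name, est, se in [("x1", 0.0779, 0.0297), ("x2", -0.0955, 0.0324)]:
    chi_square, p_value = wald_test(est, se)
    print(f"{name}: Wald chi-square = {chi_square:.2f}, p-value = {p_value:.4f}")
```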

Multiple Logistic Regression of Flu Shot
Against Age and Health Awareness
With Quadratic Terms Included

The LOGISTIC Procedure

Model Information
Data Set                       WORK.FLUSHOT
Response Variable              y (Flu Shot?)
Number of Response Levels      2
Model                          binary logit
Optimization Technique         Fisher's scoring

Number of Observations Read    159
Number of Observations Used    159

Response Profile
Ordered Value    y      Total Frequency
1                No     135
2                Yes     24

Probability modeled is y='Yes'.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
Criterion    Intercept Only    Intercept and Covariates
AIC          136.941           114.706
SC           140.010           130.050
-2 Log L     134.941           104.706

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    30.2348       4     <.0001
Score               34.2112       4     <.0001
Wald                19.4995       4     0.0006

Analysis of Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept    1      0.2193     14.2594           0.0002             0.9877
x1           1      0.2296      0.4052           0.3210             0.5710
x2           1     -0.3518      0.2638           1.7780             0.1824
x1sq         1     -0.00112     0.00303          0.1363             0.7120
x2sq         1      0.00238     0.00236          1.0171             0.3132

Odds Ratio Estimates
Effect    Point Estimate    95% Wald Confidence Limits
x1        1.258             0.569    2.784
x2        0.703             0.419    1.180
x1sq      0.999             0.993    1.005
x2sq      1.002             0.998    1.007

Association of Predicted Probabilities and Observed Responses
Percent Concordant    81.2     Somers' D    0.629
Percent Discordant    18.3     Gamma        0.632
Percent Tied           0.4     Tau-a        0.162
Pairs                 3240     c            0.815
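The AIC and SC entries in each Model Fit Statistics table follow from the -2 Log L value, the number of estimated parameters k (including the intercept), and n = 159: AIC = -2 Log L + 2k and SC = -2 Log L + k ln(n). The Python sketch below (standard library only; function and variable names are illustrative) recomputes both criteria for the four fitted models.

```python
import math

N_OBS = 159  # number of observations used in every model

def aic(neg2logl, n_params):
    """Akaike information criterion: -2 Log L + 2k."""
    return neg2logl + 2 * n_params

def sc(neg2logl, n_params, n_obs=N_OBS):
    """Schwarz criterion: -2 Log L + k * ln(n)."""
    return neg2logl + n_params * math.log(n_obs)

# (description, -2 Log L, number of parameters including the intercept)
models = [
    ("Age, Awareness, Gender", 105.093, 4),
    ("With interaction terms", 104.994, 7),
    ("Age, Awareness", 105.795, 3),
    ("With quadratic terms", 104.706, 5),
]
for label, neg2logl, k in models:
    print(f"{label}: AIC = {aic(neg2logl, k):.3f}, SC = {sc(neg2logl, k):.3f}")
```

On both criteria the two-predictor model has the smallest value, which agrees with the likelihood ratio tests: the extra interaction and quadratic terms do not improve the fit enough to justify the added parameters.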
Appendix: Flu Shot Data
Columns: y (flu shot: 1 = yes, 0 = no), x1 (age in years), x2 (health awareness index), x3 (gender: 1 = male, 0 = female)

y  x1  x2  x3
0  59  52  0
0  61  55  1
1  82  51  0
0  51  70  0
0  53  70  0
0  62  49  1
0  51  69  1
0  70  54  1
0  71  65  1
0  55  58  1
0  58  48  0
0  53  58  1
0  72  65  0
0  56  68  0
0  56  83  0
0  81  68  0
0  62  44  0
0  49  70  0
0  56  69  1
0  50  74  0
0  53  57  0
0  56  64  1
0  56  67  1
0  50  83  1
0  52  48  1
0  52  81  0
0  67  53  1
0  51  61  0
0  70  51  0
0  64  51  0
0  61  65  1
0  53  51  0
0  77  54  1
0  73  64  1
0  67  69  0
0  50  71  0
1  80  38  0
1  75  51  0
0  65  54  1
0  60  59  1
1  68  57  1
0  61  63  0
1  62  48  0
0  53  58  0
0  72  56  0
0  54  59  0
1  59  75  0
0  61  48  0
0  50  79  1
0  48  66  0
0  52  57  1
0  54  68  0
0  62  48  0
0  71  60  0
1  65  63  0
0  49  61  0
0  58  57  0
0  62  69  0
0  69  38  1
1  56  50  1
1  76  45  1
0  51  72  0
0  64  51  0
0  57  62  1
0  51  81  0
0  81  55  1
0  50  77  0
0  64  65  1
0  64  53  1
1  59  49  1
0  53  65  0
0  63  58  0
0  59  60  1
1  70  57  1
0  72  37  0
0  68  49  0
0  75  55  1
0  57  60  0
0  67  57  1
0  59  56  1
0  55  58  0
1  75  64  1
0  66  51  0
0  67  59  0
0  59  61  0
0  78  49  1
0  59  49  0
0  68  55  0
1  59  61  1
0  68  50  1
1  78  47  1
0  55  73  1
1  71  45  1
0  51  45  0
0  65  59  0
0  54  61  1
0  79  52  0
0  64  50  0
1  82  46  1
0  64  67  0
0  70  56  1
1  59  50  0
0  59  56  1
0  63  61  1
0  48  74  0
0  61  78  0
0  51  68  0
0  48  71  0
1  71  58  1
0  51  57  0
0  57  51  1
0  49  74  0
0  67  56  1
0  73  57  0
0  73  65  0
0  56  47  0
0  48  69  1
0  50  71  0
0  50  76  1
0  66  60  1
0  53  75  1
0  50  65  1
1  51  42  0
0  68  66  1
1  72  49  1
0  51  58  1
0  62  61  1
0  60  55  0
0  67  60  1
0  70  54  1
0  55  63  1
0  66  56  0
0  65  59  1
0  84  52  1
0  58  63  0
1  68  57  1
0  51  59  1
0  67  53  1
0  52  67  0
0  68  62  0
0  76  63  1
0  54  62  1
0  50  52  1
0  63  58  0
0  77  49  1
0  60  65  1
0  51  55  0
0  51  60  1
0  66  51  1
0  52  67  0
0  66  64  1
0  56  55  1
0  49  58  0
0  67  66  0
0  57  64  1
0  56  66  0
1  76  22  1
1  68  32  0
1  73  56  1