Problems with infinite solutions in logistic regression
Ian White
MRC Biostatistics Unit, Cambridge

UK Stata Users’ Group
London, 12th September 2006
Introduction
• I teach logistic regression for the analysis of case-control studies to Epidemiology Master’s students, using Stata
• I stress how to work out degrees of freedom
– e.g. if E has 2 levels and M has 4 levels then you get 3 d.f. for testing the E*M interaction
• Our practical uses data on 244 cases of leprosy and 1027 controls
– previous BCG vaccination is the exposure of interest
– level of schooling is a possible effect modifier
– in what follows I’m ignoring other confounders
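The degrees-of-freedom rule in the bullets above can be written as a small check. This is a minimal sketch in Python rather than Stata; the function names `main_effect_df` and `interaction_df` are made up for illustration.

```python
def main_effect_df(levels: int) -> int:
    """Parameters needed for a categorical factor beyond the reference level."""
    if levels < 1:
        raise ValueError("a factor needs at least one level")
    return levels - 1


def interaction_df(levels_e: int, levels_m: int) -> int:
    """Degrees of freedom for testing the E*M interaction of two factors."""
    return main_effect_df(levels_e) * main_effect_df(levels_m)


# The example in the bullets: E has 2 levels, M has 4 levels.
df_em = interaction_df(2, 4)
# BCG scar (absent/present) by level of schooling (0-3) has the same shape.
df_bcg_school = interaction_df(2, 4)
print(df_em, df_bcg_school)
```

Both calls give 3, the figure quoted in the bullet above.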
Leprosy data

-> tabulation of d (outcome)

 0=control, |
     1=case |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,027       80.80       80.80
          1 |        244       19.20      100.00
------------+-----------------------------------
      Total |      1,271      100.00

-> tabulation of bcg (exposure)

   BCG scar |      Freq.     Percent        Cum.
------------+-----------------------------------
     Absent |        743       58.46       58.46
    Present |        528       41.54      100.00
------------+-----------------------------------
      Total |      1,271      100.00

-> tabulation of school (possible effect modifier)

  Schooling |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        282       22.19       22.19
          1 |        606       47.68       69.87
          2 |        350       27.54       97.40
          3 |         33        2.60      100.00
------------+-----------------------------------
      Total |      1,271      100.00
Main effects model
. xi: logistic d i.bcg i.school
i.bcg             _Ibcg_0-1           (naturally coded; _Ibcg_0 omitted)
i.school          _Ischool_0-3        (naturally coded; _Ischool_0 omitted)

Logistic regression                               Number of obs   =       1271
                                                  LR chi2(4)      =      97.50
                                                  Prob > chi2     =     0.0000
Log likelihood = -572.86093                       Pseudo R2       =     0.0784

------------------------------------------------------------------------------
           d | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Ibcg_1 |   .2908624   .0523636    -6.86   0.000      .204384    .4139314
  _Ischool_1 |   .7035071   .1197049    -2.07   0.039     .5040026    .9819836
  _Ischool_2 |   .4029998   .0888644    -4.12   0.000     .2615825    .6208704
  _Ischool_3 |     .09077   .0933769    -2.33   0.020     .0120863    .6816944
------------------------------------------------------------------------------

. estimates store main
Interaction model
. xi: logistic d i.bcg*i.school
i.bcg             _Ibcg_0-1           (naturally coded; _Ibcg_0 omitted)
i.school          _Ischool_0-3        (naturally coded; _Ischool_0 omitted)
i.bcg*i.school    _IbcgXsch_#_#       (coded as above)

Logistic regression                               Number of obs   =       1271
                                                  LR chi2(7)      =     101.43
                                                  Prob > chi2     =     0.0000
Log likelihood = -570.90012                       Pseudo R2       =     0.0816

------------------------------------------------------------------------------
           d | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Ibcg_1 |   .2248804   .0955358    -3.51   0.000     .0977993    .5170913
  _Ischool_1 |   .6626409   .1234771    -2.21   0.027     .4599012    .9547549
  _Ischool_2 |   .4116581   .1027612    -3.56   0.000     .2523791    .6714598
  _Ischool_3 |   1.28e-08   1.42e-08   -16.41   0.000     1.46e-09    1.12e-07
_IbcgXsch_~1 |   1.448862   .7046411     0.76   0.446     .5585377    3.758385
_IbcgXsch_~2 |   1.086848   .6226504     0.15   0.884     .3536056    3.340553
_IbcgXsch_~3 |   4.25e+07          .        .       .            .           .
------------------------------------------------------------------------------
Note: 17 failures and 0 successes completely determined.

. estimates store inter
The problem
. table bcg school, by(d)

-----------------------------------
0=control, |
1=case and |       Schooling
BCG scar   |    0     1     2     3
-----------+-----------------------
0          |
    Absent |  141   257   129    17
   Present |   57   229   182    15
-----------+-----------------------
1          |
    Absent |   77    93    29
   Present |    7    27    10     1
-----------------------------------
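The blank cell in the table (cases with BCG scar absent and schooling level 3) is an observed zero. Because the interaction model has one parameter per cell, its fitted odds ratios should equal the observed ones, so the table alone shows where the infinite estimate comes from. The following is a minimal sketch in Python rather than Stata; the helper names `odds` and `ratio` are illustrative only.

```python
import math

# Cell counts copied from the table above, indexed by schooling level 0..3.
# The blank cell (cases, BCG absent, schooling 3) is entered as zero.
controls = {"Absent": [141, 257, 129, 17], "Present": [57, 229, 182, 15]}
cases = {"Absent": [77, 93, 29, 0], "Present": [7, 27, 10, 1]}


def odds(bcg: str, school: int) -> float:
    """Observed odds of being a case in one bcg-by-school cell."""
    return cases[bcg][school] / controls[bcg][school]


def ratio(num: float, den: float) -> float:
    """num / den, returning infinity rather than raising when den is zero."""
    return math.inf if den == 0 else num / den


# Odds ratio for BCG (present vs absent) within each schooling level.
or_bcg = [ratio(odds("Present", s), odds("Absent", s)) for s in range(4)]
# Odds ratio for schooling 3 vs schooling 0 among those with no BCG scar.
or_school3 = ratio(odds("Absent", 3), odds("Absent", 0))

for s, value in enumerate(or_bcg):
    print(f"school {s}: BCG odds ratio = {value:.7g}")
print(f"school 3 vs 0 (BCG absent): odds ratio = {or_school3:.7g}")
```

The level-0 ratio reproduces the `_Ibcg_1` estimate of .2248804, the ratio for schooling 3 vs 0 is exactly zero (Stata reports 1.28e-08), and the BCG odds ratio at schooling 3 is infinite, which is what the `_IbcgXsch_~3` estimate of 4.25e+07 is approximating.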
LR test
. xi: logistic d i.bcg i.school
LR chi2(4)     =  97.50
Log likelihood = -572.86093
. estimates store main

. xi: logistic d i.bcg*i.school
LR chi2(7)     = 101.43
Log likelihood = -570.90012
. estimates store inter

. lrtest main inter

Likelihood-ratio test                                  LR chi2(2)  =      3.92
(Assumption: main nested in inter)                     Prob > chi2 =    0.1407
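The test statistic is twice the difference of the two stored log likelihoods; the only open question is which df to refer it to. The sketch below (Python, not Stata) recomputes the statistic and the upper-tail probability on both 2 df, as reported by lrtest, and 3 df, as the 2-level by 4-level interaction would normally give. The helper `chi2_sf` is written here from the standard incomplete-gamma recurrence and is not a Stata function.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper tail P(X > x) for a chi-squared variable with integer df >= 1."""
    z = x / 2.0
    # Closed forms for df = 1 and df = 2 start the recurrence.
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)
    # Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while 2 * a < df:
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1.0
    return q


ll_main = -572.86093   # log likelihood, main effects model
ll_inter = -570.90012  # log likelihood, interaction model

lr = 2 * (ll_inter - ll_main)
p_df2 = chi2_sf(lr, 2)  # df used in the lrtest output above
p_df3 = chi2_sf(lr, 3)  # df expected from the number of interaction terms
print(f"LR = {lr:.2f}, P on 2 df = {p_df2:.4f}, P on 3 df = {p_df3:.4f}")
```

The statistic and the 2 df probability agree with the lrtest output (3.92 and 0.1407); the 3 df probability is larger, so the choice of df changes the strength of evidence reported.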
What is Stata doing? (guess)
• Recognises the information matrix is singular
• Hence reduces model df by 1
• In other situations Stata drops observations
– if a single variable perfectly predicts success/failure
– this happens if the problematic cell doesn’t occur in a reference category
– then Stata refuses to perform lrtest, but we can force it to do so
– Stata still gets df=2; can use df(3) option
. gen bcgrev=1-bcg
. xi: logistic d i.bcgrev*i.school
i.bcgrev          _Ibcgrev_0-1        (naturally coded; _Ibcgrev_0 omitted)
i.school          _Ischool_0-3        (naturally coded; _Ischool_0 omitted)
i.bcg~v*i.sch~l   _IbcgXsch_#_#       (coded as above)

note: _IbcgXsch_1_3 != 0 predicts failure perfectly
      _IbcgXsch_1_3 dropped and 17 obs not used

Logistic regression                               Number of obs   =       1254
                                                  LR chi2(6)      =      94.12
                                                  Prob > chi2     =     0.0000
Log likelihood = -570.90012                       Pseudo R2       =     0.0762

------------------------------------------------------------------------------
           d | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _Ibcgrev_1 |   4.446809   1.889136     3.51   0.000     1.933895    10.22502
  _Ischool_1 |   .9600749   .4312915    -0.09   0.928     .3980361    2.315729
  _Ischool_2 |   .4474097   .2307071    -1.56   0.119     .1628482    1.229215
  _Ischool_3 |   .5428571   .6013396    -0.55   0.581     .0619132     4.75979
_IbcgXsch_~1 |   .6901971   .3356713    -0.76   0.446     .2660717     1.79039
_IbcgXsch_~2 |    .920092   .5271167    -0.15   0.884     .2993516     2.82801
------------------------------------------------------------------------------

. est store interrev
. lrtest interrev main
observations differ: 1254 vs. 1271
r(498);

. lrtest interrev main, force

Likelihood-ratio test                                  LR chi2(2)  =      3.92
(Assumption: main nested in interrev)                  Prob > chi2 =    0.1407
What’s right?
• Zero cell suggests small sample so asymptotic χ² distribution may be inappropriate for LRT
– true in this case: have a bcg*school category with only 1 case
– but I’m going to demonstrate the same problem in a hypothetical example with expected cell counts ≥ 3 but a zero observed cell count
• Could combine or drop cells to get rid of zeroes
– but the cell with zeroes may carry information
• Problems with testing boundary values are well known
– e.g. LRT for testing zero variance component isn’t χ² on 1 d.f.
– but here the point estimate, not the null value, is on the boundary
Example to explain why LRT makes some sense

. tab x y, chi2 exact

           |           y
         x |         0          1 |     Total
-----------+----------------------+----------
         0 |        10         20 |        30
         1 |         0         10 |        10
-----------+----------------------+----------
     Total |        10         30 |        40

          Pearson chi2(1) =   4.4444   Pr = 0.035
           Fisher's exact =                 0.043
   1-sided Fisher's exact =                 0.035
[Figure: log likelihood (vertical axis, about -26 to -18) plotted against beta (horizontal axis, 0 to 10)]
Model: logit P(y=1|x) = a + bx
Difference in log lik = 3.4
LRT = 6.8 on 0 df?
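The curve in the figure can be approximated numerically from the 2x2 table alone. The sketch below (Python, not Stata) assumes the plotted quantity is the log likelihood maximised over a for each fixed b; the slide does not say so explicitly. The helper names and the bisection bounds are illustrative choices.

```python
import math


def softplus(t: float) -> float:
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def expit(t: float) -> float:
    """Logistic function 1 / (1 + exp(-t)) without overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


# 2x2 table from the example: counts[x] = (number with y=0, number with y=1)
counts = {0: (10, 20), 1: (0, 10)}


def loglik(a: float, b: float) -> float:
    """Log likelihood of logit P(y=1|x) = a + b*x for the table above."""
    total = 0.0
    for x, (n0, n1) in counts.items():
        eta = a + b * x
        total += n1 * eta - (n0 + n1) * softplus(eta)
    return total


def profile_loglik(b: float) -> float:
    """Maximise over a for fixed b >= 0 by bisection on the score for a."""
    def score(a: float) -> float:
        return sum(n1 - (n0 + n1) * expit(a + b * x)
                   for x, (n0, n1) in counts.items())

    lo, hi = -30.0, 30.0  # the score is decreasing in a and changes sign here
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return loglik((lo + hi) / 2.0, b)


for b in (0, 2, 4, 6, 8, 10):
    print(f"beta = {b:2d}: profile log lik = {profile_loglik(b):.4f}")
```

The values keep rising as beta grows and level off without reaching a maximum, which is the sense in which the estimate is infinite; the gap between beta = 0 and large beta is the 3.4 quoted above.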
Example to explore correct df using Pearson / Fisher as gold standard

. tab x y, chi2 exact

           |           y
         x |         0          1 |     Total
-----------+----------------------+----------
         1 |         6          0 |         6
         2 |         3          6 |         9
         3 |         3          6 |         9
-----------+----------------------+----------
     Total |        12         12 |        24

          Pearson chi2(2) =   8.0000   Pr = 0.018
           Fisher's exact =                 0.029

• All expected counts ≥3
• Don’t want to drop or merge category 1 - contains the evidence for association!

. xi: logistic y i.x
i.x               _Ix_1-3             (naturally coded; _Ix_1 omitted)

Logistic regression                               Number of obs   =         24
                                                  LR chi2(2)      =      10.36
                                                  Prob > chi2     =     0.0056
Log likelihood = -11.457255                       Pseudo R2       =     0.3113

------------------------------------------------------------------------------
           y | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _Ix_2 |   1.61e+08   1.61e+08    18.90   0.000     2.27e+07    1.14e+09
       _Ix_3 |   1.61e+08          .        .       .            .           .
------------------------------------------------------------------------------
Note: 6 failures and 0 successes completely determined.

. est store x
. xi: logistic y

Logistic regression                               Number of obs   =         24
                                                  LR chi2(0)      =       0.00
                                                  Prob > chi2     =          .
Log likelihood = -16.635532                       Pseudo R2       =     0.0000

------------------------------------------------------------------------------
           y | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
------------------------------------------------------------------------------

. est store null
LRT
. xi: logistic y i.x
Log likelihood = -11.457255
. est store x

. xi: logistic y
Log likelihood = -16.635532
. est store null

. lrtest x null

Likelihood-ratio test                                  LR chi2(1)  =     10.36
(Assumption: null nested in x)                         Prob > chi2 =    0.0013
Comparison of tests
           |           y
         x |         0          1 |     Total
-----------+----------------------+----------
         1 |         6          0 |         6
         2 |         3          6 |         9
         3 |         3          6 |         9
-----------+----------------------+----------
     Total |        12         12 |        24

Pearson chi2(2) =  8.0000   P = 0.018
Fisher's exact  =           P = 0.029
LR chi2(1)      = 10.36     P = 0.0013
                (using 2df: P = 0.0056)

Clearly LRT isn’t great.
But 1df is even worse than 2df
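The statistics being compared can all be recomputed from the six cell counts, which makes the role of the df explicit. This is a minimal sketch in Python rather than Stata; the helper names are illustrative, and Fisher's exact test is left out because it needs an enumeration of tables.

```python
import math

# 3x2 table from the example: rows are x = 1, 2, 3; columns are y = 0, y = 1.
table = [(6, 0), (3, 6), (3, 6)]


def xlogy(count: float, p: float) -> float:
    """count * log(p), with the convention 0 * log(0) = 0."""
    return 0.0 if count == 0 else count * math.log(p)


def binom_loglik(n0: int, n1: int, p1: float) -> float:
    """Log likelihood of n0 zeros and n1 ones when P(y=1) = p1."""
    return xlogy(n0, 1 - p1) + xlogy(n1, p1)


n = sum(n0 + n1 for n0, n1 in table)
p_null = sum(n1 for _, n1 in table) / n

# Null model: one common probability. Model with i.x: one probability per row.
ll_null = sum(binom_loglik(n0, n1, p_null) for n0, n1 in table)
ll_x = sum(binom_loglik(n0, n1, n1 / (n0 + n1)) for n0, n1 in table)
lr = 2 * (ll_x - ll_null)

# Pearson chi-squared from observed and expected counts.
pearson = 0.0
for n0, n1 in table:
    row = n0 + n1
    for observed, expected in ((n0, row * (1 - p_null)), (n1, row * p_null)):
        pearson += (observed - expected) ** 2 / expected

p_pearson_2df = math.exp(-pearson / 2)    # chi-squared upper tail, 2 df
p_lr_2df = math.exp(-lr / 2)              # chi-squared upper tail, 2 df
p_lr_1df = math.erfc(math.sqrt(lr / 2))   # chi-squared upper tail, 1 df
print(f"Pearson = {pearson:.4f} (P = {p_pearson_2df:.3f})")
print(f"LR = {lr:.2f} (P = {p_lr_1df:.4f} on 1 df, {p_lr_2df:.4f} on 2 df)")
```

The row with the zero cell contributes exactly 0 to the log likelihood, so the LR statistic is unaffected by how the zero is handled; only the df used for the reference distribution moves the P-value.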
Note
• In this example, we could use Pearson / Fisher as gold standard.
• Can’t do this in more complex examples (e.g. adjust for several covariates).
My proposal for Stata
• lrtest appears to adjust df for infinite parameter estimates: it should not
• Model df should be incremented to allow for any variables dropped because they perfectly predict success/failure
– Don’t need to increment log lik as it is 0 for the cases dropped
• Can the ad hoc handling of zeroes by xi:logistic be improved?
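The point that the dropped cases add nothing to the log likelihood can be checked from the leprosy cell counts. The sketch below (Python, not Stata) relies on the interaction model having one parameter per bcg-by-school cell, so that its maximised log likelihood is a sum of per-cell binomial terms; the names used are illustrative.

```python
import math

# (controls, cases) for each bcg-by-school cell, from `table bcg school, by(d)`.
cells = {
    ("Absent", 0): (141, 77), ("Present", 0): (57, 7),
    ("Absent", 1): (257, 93), ("Present", 1): (229, 27),
    ("Absent", 2): (129, 29), ("Present", 2): (182, 10),
    ("Absent", 3): (17, 0),   ("Present", 3): (15, 1),
}


def cell_loglik(controls: int, cases: int) -> float:
    """Maximised binomial log likelihood of one cell, taking 0*log(0) as 0."""
    n = controls + cases
    return sum(k * math.log(k / n) for k in (controls, cases) if k > 0)


contrib = {cell: cell_loglik(*counts) for cell, counts in cells.items()}
ll_all = sum(contrib.values())                          # all 1271 observations
ll_without_dropped = ll_all - contrib[("Absent", 3)]    # 1254 observations

print(f"zero cell contributes {contrib[('Absent', 3)]:.1f}")
print(f"log likelihood, all cells:        {ll_all:.5f}")
print(f"log likelihood, zero cell dropped: {ll_without_dropped:.5f}")
```

Both totals match the -570.90012 that Stata reports with and without the 17 dropped observations, which is why only the df, and not the log likelihood, needs adjusting.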
Conclusions for statisticians
• Must remember the χ² approximation is still poor for these LRTs
– typically anti-conservative? (Kuss, 2002)
• Performance of LRT can be improved by using penalised likelihood (Firth, 1993; Bull, 2006) - like a mildly informative prior
– worth using routinely?
• Gold standard: Bayes or exact logistic regression (logXact)?
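To give a feel for why a penalty keeps the estimates finite, here is a rough sketch for the 3-level example only. It rests on an assumption worth stating loudly: for a model with one parameter per category, maximising the likelihood multiplied by the square root of the determinant of the information matrix (the penalty of Firth, 1993) amounts to adding half a success and half a failure to each category. That shortcut does not carry over to models with several covariates, and the code below (Python, not Stata) is not the routine any package uses.

```python
import math

# 3x2 table from the hypothetical example: table[x] = (y=0 count, y=1 count)
table = {1: (6, 0), 2: (3, 6), 3: (3, 6)}


def ml_odds(n0: int, n1: int) -> float:
    """Maximum likelihood odds of y=1; zero when there are no successes."""
    return math.inf if n0 == 0 else n1 / n0


def penalised_odds(n0: int, n1: int) -> float:
    """Odds after adding 0.5 to each count (assumed saturated-model shortcut)."""
    return (n1 + 0.5) / (n0 + 0.5)


for x in (2, 3):
    base_ml = ml_odds(*table[1])
    or_ml = math.inf if base_ml == 0 else ml_odds(*table[x]) / base_ml
    or_pen = penalised_odds(*table[x]) / penalised_odds(*table[1])
    print(f"x = {x} vs 1: ML odds ratio = {or_ml}, penalised = {or_pen:.3f}")
```

The maximum likelihood odds ratios are infinite because category 1 has no successes, while the penalised ones are large but finite, in line with the "mildly informative prior" description.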
The end
Output for example with 2-level x
. logit y x
Log likelihood = -19.095425
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   .6931472   .3872983     1.79   0.074    -.0659436    1.452238
------------------------------------------------------------------------------

. estimates store x
. logit y
Log likelihood = -22.493406
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   1.098612   .3651484     3.01   0.003     .3829346     1.81429
------------------------------------------------------------------------------

. estimates store null
. lrtest x null
df(unrestricted) = df(restricted) = 1
r(498);

. lrtest x null, force df(1)

Likelihood-ratio test                                  LR chi2(1)  =      6.80
(Assumption: null nested in x)                         Prob > chi2 =    0.0091