Exact Logistic Regression

Chapter 5-6. Exact Logistic Regression
Case Study: Schroerlucke Dataset
Schroerlucke et al. (2009) concluded that the eight-plate (Orthofix) device fails more
often when implanted in orthopaedic patients with Blount disease than in patients with
other diagnoses. They reached this conclusion without providing a p value for the Blount variable.
Instead, the authors provided the dataset in a table in their article, giving the opportunity
for the reader to verify their conclusion. The dataset schroerlucke.dta was created from
the table of data in the article.
Reading in the data file, schroerlucke.dta

File
 Open
  Find the directory where you copied the course CD
  Change to the subdirectory datasets & do-files
  Single click on schroerlucke.dta
  Open

use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\schroerlucke.dta", clear

which must be all on one line, or use:

cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use schroerlucke, clear
The dichotomous outcome variable is failed (1=a screw broke, 0=all screws intact). The
predictor variable of interest is blount (1=Blount disease, 0=other diagnosis).
Some of the patients had more than one knee surgery in the study. We will ignore that in
this chapter, and just assume all observations are independent. We will come back to this
dataset again in a later chapter, once we’ve had some experience with the repeated
measurements analysis approaches.
_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University
of Utah School of Medicine, 2010.
Chapter 5-6 (revision 16 May 2010)
p. 1
Simply computing a Fisher’s exact test on these data,
tab failed blount , col exact
+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

           |        blount
    failed |         0          1 |     Total
-----------+----------------------+----------
         0 |        13         10 |        23
           |    100.00      55.56 |     74.19
-----------+----------------------+----------
         1 |         0          8 |         8
           |      0.00      44.44 |     25.81
-----------+----------------------+----------
     Total |        13         18 |        31
           |    100.00     100.00 |    100.00

           Fisher's exact =                 0.010
   1-sided Fisher's exact =                 0.006
Without controlling for BMI or weight, the device fails significantly more often in
patients with Blount disease.
Fitting a univariable logistic regression model, with the intent to control for body size as
a covariate later in a multivariable model,
logistic failed blount
note: blount != 1 predicts failure perfectly
      blount dropped and 13 obs not used

Logistic regression                               Number of obs   =         18
                                                  LR chi2(0)      =       0.00
                                                  Prob > chi2     =          .
Log likelihood = -12.365308                       Pseudo R2       =     0.0000

------------------------------------------------------------------------------
      failed | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
------------------------------------------------------------------------------
We discover that a logistic regression model cannot be fitted to these data. Perhaps this
is what Schroerlucke et al (2009) discovered when they attempted to analyze these data,
although they never mention it or discuss what statistical methods were used.
What is happening here, and how to model such data, is the subject of this chapter.
Maximum Likelihood Estimation
In linear regression, we found the values of regression coefficients that described a line of
best fit through the data using the method of least squares. We did this by finding the
line that minimized the deviations of the observed values from the predicted values.
For the logistic regression model

    logit P(X) = ln[ P(X) / (1 − P(X)) ] = α + Σ βi Xi
the method of least squares does not work. That is, we cannot mathematically derive the
equations for α and the βi's that will lead to the best fit.

We turn to another estimation method, then, called maximum likelihood. In this method,
we find the values of α and the βi's that maximize the likelihood function. The likelihood
function has the form

    L = P(data | parameters)

The likelihood is the probability that we observe the data that we observed in our sample,
given some set of values for the model parameters (α and the βi's). This is done using
iterative methods (keep trying values for α and the βi's until new choices fail to increase
the value of the probability equation, L).
By assuming that our observations are independent, which they are if it is a random
sample, we can use the following probability identity for independent events
P(A and B and C and ....) = P(A)P(B)P(C) ...
Next, using the following result from Chapter 5-5,
    P(X) = 1 / (1 + e^−(α + Σ βi Xi))

    1 − P(X) = e^−(α + Σ βi Xi) / (1 + e^−(α + Σ βi Xi))
where we have n1 observations where the outcome occurred, and n2 observations where
the outcome did not occur, so that we have n1 terms that are of the form P(X) and n2
terms that are of the form 1-P(X), the likelihood function for our observed data is
    L = [ ∏ i=1..n1 P(Xi) ] × [ ∏ j=1..n2 (1 − P(Xj)) ]

      = ∏ i=1..n1 [ 1 / (1 + e^−(α + Σ β Xi)) ] × ∏ j=1..n2 [ e^−(α + Σ β Xj) / (1 + e^−(α + Σ β Xj)) ]
where the actual values of X in the data are substituted in the equation.
We will return to maximum likelihood estimation shortly.
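To make the iterative idea concrete, here is an illustrative Python sketch (not the course's Stata code) that maximizes the log likelihood by Newton-Raphson for one binary predictor, using the small 2 × 2 dataset created in the next section (cell counts 7, 3, 2, 7). At convergence it reproduces the odds ratio 8.1667 and log likelihood −10.876 shown later in the chapter's logistic output.

```python
import math

# 2x2 example data from the chapter: (outcome y, predictor x) pairs
data = [(1, 1)] * 7 + [(1, 0)] * 3 + [(0, 1)] * 2 + [(0, 0)] * 7

def log_likelihood(a, b):
    # ln L = sum of ln P(X) over events plus ln(1 - P(X)) over non-events
    ll = 0.0
    for y, x in data:
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))
        ll += math.log(p) if y == 1 else math.log(1.0 - p)
    return ll

def fit():
    a, b = 0.0, 0.0                  # alpha (intercept), beta (slope)
    for _ in range(50):              # Newton-Raphson iterations
        ga = gb = 0.0                # gradient of ln L
        haa = hab = hbb = 0.0        # Hessian of ln L (negative definite)
        for y, x in data:
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += y - p
            gb += (y - p) * x
            w = p * (1.0 - p)
            haa -= w
            hab -= w * x
            hbb -= w * x * x
        det = haa * hbb - hab * hab
        da = (hbb * ga - hab * gb) / det   # H^{-1} g, solved by hand for 2x2
        db = (haa * gb - hab * ga) / det
        a, b = a - da, b - db              # Newton step
        if abs(da) + abs(db) < 1e-10:      # new choices no longer improve L
            break
    return a, b

a, b = fit()
print(round(math.exp(b), 4))           # odds ratio, ~8.1667
print(round(log_likelihood(a, b), 4))  # ~-10.876
```

For a 2 × 2 table the answer also has a closed form (OR = ad/bc = 49/6), which is a useful check that the iteration converged to the right place.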
Small Sample Sizes in Logistic Regression
It is well-known that when your data are sparse in a crosstabulation table, you should use
a Fisher’s exact test, rather than a chi-square test.
For example, creating a dataset from a 2 × 2 table (you can use Ch 5-6.do),
clear
input disease exposure count
1 1 7
1 0 3
0 1 2
0 0 7
end
drop if count==0
expand count
drop count
tab disease exposure , expect col chi2 exact
+--------------------+
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
| column percentage  |
+--------------------+

           |       exposure
   disease |         0          1 |     Total
-----------+----------------------+----------
         0 |         7          2 |         9
           |       4.7        4.3 |       9.0
           |     70.00      22.22 |     47.37
-----------+----------------------+----------
         1 |         3          7 |        10
           |       5.3        4.7 |      10.0
           |     30.00      77.78 |     52.63
-----------+----------------------+----------
     Total |        10          9 |        19
           |      10.0        9.0 |      19.0
           |    100.00     100.00 |    100.00

          Pearson chi2(1) =   4.3372   Pr = 0.037
           Fisher's exact =                 0.070
   1-sided Fisher's exact =                 0.051
We notice that this result is statistically significant if we use a chi-square test (p = 0.037),
but it is not significant if we use a Fisher’s exact test (p = 0.070).
To justify the use of a chi-square test, we apply the minimum expected frequency rule
(see box).
________________________________________________________________________
Minimum Expected Frequency Rule for the Chi-Square Test
Being an asymptotic test, the chi-square test requires a sufficiently large sample size.
The widely accepted criterion for “how large is large enough” is that (Rosner, 1995,
p.421):
No cell can have an expected frequency < 1 and no more than 20% of the cells
can have an expected frequency < 5. For a 2 × 2 table, that means no cell can
have an expected frequency < 5.
Altman (1991, p.253) relaxes this somewhat, but not many people are aware of Altman’s
perspective:
In practice this rule can be relaxed for a 2 × 2 table to allow one cell to have an
expected value slightly lower than 5.
The expected frequency of a contingency table cell is calculated as
expected cell frequency = (row total × column total) / grand total.
(See Ch 2-4, “Minimum Expected Frequency Rule for Using Chi-Square Test” , p.18 for
a more detailed description of this rule of thumb.)
________________________________________________________________________
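The expected-frequency calculation in the box is easy to script. The following is an illustrative Python sketch (not part of the course materials) that computes the expected cell frequencies for the disease/exposure table above, applies the 2 × 2 rule, and reproduces the Pearson chi-square statistic.

```python
# Observed 2x2 table from the chapter (rows: disease 0/1, cols: exposure 0/1)
obs = [[7, 2], [3, 7]]

row_tot = [sum(r) for r in obs]
col_tot = [sum(c) for c in zip(*obs)]
n = sum(row_tot)

# expected cell frequency = (row total x column total) / grand total
expected = [[row_tot[i] * col_tot[j] / n for j in range(2)] for i in range(2)]
print([[round(e, 1) for e in row] for row in expected])  # [[4.7, 4.3], [5.3, 4.7]]

# Minimum expected frequency rule: in a 2x2 table, no cell may have E < 5
small = sum(e < 5 for row in expected for e in row)
print(small)  # 3 cells fall below 5, so the chi-square test is not appropriate

chi2 = sum((obs[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))
print(round(chi2, 4))  # 4.3372, matching Stata's Pearson chi2(1)
```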
Example The minimum expected frequency rule is well-known and widely used. For
example, Cuchel et al. (2007) state in the Statistical Analysis section of their
article,
“Percentages were analyzed using the chi-square test or Fisher’s exact test
when expected cell counts were less than 5.”
Returning to the above example, we observe that three (75%) of the cells have an
expected frequency < 5, and so the chi-square test is not appropriate.
A univariable logistic regression is basically a chi-square test (it is asymptotically
identical, giving identical results for infinitely large sample sizes).
logistic disease exposure
Logistic regression                               Number of obs   =         19
                                                  LR chi2(1)      =       4.53
                                                  Prob > chi2     =     0.0332
Log likelihood = -10.875999                       Pseudo R2       =     0.1725

------------------------------------------------------------------------------
     disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    exposure |   8.166667   8.639103     1.99   0.047     1.027074    64.93634
------------------------------------------------------------------------------
This compares with the above crosstabulation analysis,
          Pearson chi2(1) =   4.3372   Pr = 0.037
in that the crosstabulation chi-square test and the logistic regression test on the
“exposure” coefficient (called the Wald test) are both significant. With the larger sample
size example in the previous chapter, the p values were essentially identical.
This raises an interesting question. If we do a crosstabulation analysis and fail to get
significance because we were required to use the Fisher’s exact test, is it okay to switch
to logistic regression?
This is actually a question regarding the adequacy of maximum likelihood (ML)
estimation for small sample sizes. (see box)
ML estimators with small samples
Long and Freese (2006, p.77) explain,
“Although ML estimators are not necessarily bad estimators in small samples, the
small-sample behavior of ML estimators for the models we consider is largely
unknown. Except for the logit and Poisson regression, which can be fitted using
exact permutation methods with LogXact (Cytel Corporation 2005), alternative
estimators with known small-sample properties are generally not available. With
this in mind, Long (1997, 54) proposed the following guidelines for the use of ML
in small samples:
It is risky to use ML with samples smaller than 100, while samples over 500
seem adequate. These values should be raised depending on characteristics of
the model and the data. First, if there are many parameters, more observations
are needed…. A rule of at least 10 observations per parameter seems
reasonable…. This does not imply that a minimum of 100 is not needed if
you have only two parameters. Second, if the data are ill-conditioned (e.g.,
independent variables are highly collinear) or if there is little variation in the
dependent variable (e.g., nearly all the outcomes are 1), a larger sample is
required. Third, some models seem to require more observations (such as the
ordinal regression model or the zero-inflated count models).”
_______________
Long, JS. (1997). Regression Models for Categorical and Limited Dependent Variables,
vol. 7 of Advanced Quantitative Techniques in the Social Sciences. Thousand
Oaks, CA: Sage.
The only solution to a small sample size is to resort to exact logistic regression, for many
years available only in the LogXact software. Since LogXact has not been widely used,
researchers and statisticians have historically just ignored the problem. Now,
however, exact logistic regression is available in popular statistical packages such as SAS
and Stata (beginning with Stata version 10), so the use of exact logistic regression is
becoming more common.
Let’s see what happens if we model the above example data using the LogXact-7
software, which fits an exact logistic regression model. Such a model is the exact
counterpart of logistic regression, just as the Fisher’s exact test is the exact counterpart to
the chi-square test.
Parameter Estimates

                  Point Estimate           Confidence Interval and P-Value for Odds Ratio
Model Term  Type   Odds Ratio  SE(Odds)    Type          95% CI Lower   95% CI Upper   2*1-sided P-Value
%Const      MLE    0.4286      NA          Asymptotic    0.1108         1.657          0.2195
exposure    MLE    8.167       NA          Asymptotic    1.027          64.94          0.04712
            CMLE   7.166       NA          Exact         0.752          113.4          0.1025
The exact logistic regression solution (p = 0.1025) is not significant. LogXact also shows
the asymptotic p value (identical to ordinary logistic regression) for comparison
(p = 0.047, the same as Stata’s logistic command).
The exact p value is nearly twice the asymptotic p value, similar to the ordinary 2 × 2
crosstabulation analysis, which gave,
          Pearson chi2(1) =   4.3372   Pr = 0.037
           Fisher's exact =                 0.070
Modeling these data using exact logistic regression in Stata, we get an identical result to
LogXact.
Statistics
Exact statistics
Exact logistic regression
Model tab: Dependent variable: disease
Independent variables: exposure
OK
exlogistic disease exposure
Exact logistic regression                         Number of obs =         19
                                                  Model score   =  4.108889
                                                  Pr >= score   =     0.0698

---------------------------------------------------------------------------
     disease | Odds Ratio       Suff.  2*Pr(Suff.)   [95% Conf. Interval]
-------------+-------------------------------------------------------------
    exposure |   7.166306           7       0.1025    .7520147    113.4444
---------------------------------------------------------------------------
The exact p = 0.1025 for the exposure coefficient is larger than the Fisher’s exact test, but the
model score test p = 0.0698 rounds to p = 0.070, identical to the Fisher’s exact test.
(Likelihood ratio tests are generally more powerful than the Wald test—it is fine to report
either one.)
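These exact p values come from enumerating every 2 × 2 table that has the same margins as the observed table. As an illustrative sketch (Python, not part of the course's Stata materials), the following reproduces the Fisher's exact p values for the disease/exposure table by direct hypergeometric enumeration.

```python
from math import comb

# Observed 2x2 table from the chapter: a, b (disease=0 row); c, d (disease=1 row)
a, b, c, d = 7, 2, 3, 7
r1 = a + b          # disease=0 row total
c1 = a + c          # exposure=0 column total
n = a + b + c + d

def hyper_p(x):
    # probability of a table with x in the upper-left cell, margins held fixed
    return comb(r1, x) * comb(n - r1, c1 - x) / comb(n, c1)

lo = max(0, c1 - (n - r1))   # smallest feasible upper-left cell
hi = min(r1, c1)             # largest feasible upper-left cell
p_obs = hyper_p(a)

# two-sided Fisher's exact p: sum over all tables as probable or less probable
two_sided = sum(hyper_p(x) for x in range(lo, hi + 1) if hyper_p(x) <= p_obs)
# one-sided p: tables at least as extreme in the observed direction
one_sided = sum(hyper_p(x) for x in range(a, hi + 1))

print(round(two_sided, 3))  # 0.07, matching the crosstab's Fisher's exact
print(round(one_sided, 3))  # 0.051, matching the 1-sided Fisher's exact
```

The two-sided sum, before rounding, is 0.0698, which is also the exact model score p value reported by exlogistic above.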
Apache Score Example
In the previous chapter, we fit a logistic regression to the 4.11.Sepsis.dta dataset. Recall,
this dataset has two variables, apache and fate. The variable fate represents the 30-day
mortality status in a sample of patients admitted to an intensive care unit with sepsis
(1=died, 0=survived). The variable apache is the APACHE Score upon admission (a
continuous variable ranging from 0 to 41 in this sample). (Dupont, 2002, p.108)
use "4.11.Sepsis.dta" , clear
sum
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      apache |        38    19.55263    11.30343          0         41
        fate |        38    .4473684    .5038966          0          1
Fitting an ordinary logistic regression model,
logistic fate apache
Logistic regression                               Number of obs   =         38
                                                  LR chi2(1)      =      22.35
                                                  Prob > chi2     =     0.0000
Log likelihood = -14.956085                       Pseudo R2       =     0.4276

------------------------------------------------------------------------------
        fate | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      apache |   1.222914   .0744759     3.30   0.001     1.085319    1.377953
------------------------------------------------------------------------------
and examining the predicted values overlaid on the scatterplot of original values.
predict pred_fate
twoway (scatter fate apache)(scatter pred_fate apache) ///
, ytitle("predicted mortality risk")
Note: the “///” is another way to continue a command across more than one line.
One “/” means division, “//” means start of an inline comment, and
“///” means continue the command on the next line
[Figure: scatterplot of Mortal Status at 30 Days and predicted Pr(fate) against APACHE II Score at Baseline; y-axis titled "predicted mortality risk"]
We observed that the predicted values agreed with the scatterplot pretty well. However,
it is easy to see that the curve begins to rise too soon on the left and does not flatten out
soon enough on the right.
We might try fitting APACHE as quintiles, to be objective in our choice of cut-points.
Stata’s “generate quantiles” command, xtile, is very useful here. It has the syntax:

xtile new_variable_name = original_variable_name , nq(5)

where the “nq” option is the “number of quantiles”. Specifying 5 gives quintiles.
This command will do the best it can to divide the variable equally into the number of
quantiles requested.
xtile apache5 = apache ,nq(5)
tab apache5
5 quantiles |
  of apache |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          8       21.05       21.05
          2 |          8       21.05       42.11
          3 |          7       18.42       60.53
          4 |          8       21.05       81.58
          5 |          7       18.42      100.00
------------+-----------------------------------
      Total |         38      100.00
From the percent column of the tab output, we see that approximately 20% of the
continuous variable’s values were classified into each category of the new ordered
categorical variable.
To get a nice table of the minimum and maximum for each category, so we know just
what the category represents, we can use,
tabstat apache , by(apache5) stat(min max)
 apache5 |       min       max
---------+--------------------
       1 |         0         8
       2 |         9        16
       3 |        17        23
       4 |        24        31
       5 |        32        41
---------+--------------------
   Total |         0        41
------------------------------
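Quantile binning of this kind is easy to mimic outside Stata. The following Python sketch is illustrative only and is not Stata's xtile algorithm; in particular, xtile keeps tied values in the same quantile, while this rank-based sketch splits ties arbitrarily.

```python
import random
from collections import Counter

def xtile_like(values, nq):
    # rank each value, then map ranks into nq near-equal groups numbered 1..nq
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(values)
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * nq // n + 1
    return bins

# hypothetical data: 38 integer scores in the apache variable's 0-41 range
random.seed(1)
scores = [random.randint(0, 41) for _ in range(38)]
bins = xtile_like(scores, 5)
print(sorted(Counter(bins).values()))  # group sizes differ by at most 1
```

With n = 38 and nq = 5 the rank-based group sizes are necessarily 8, 8, 7, 8, 7 in some order, matching the near-equal quintile counts in the tab output above.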
Finally, to create some indicator variables for each category, so we can use them later to
make combinations of categories the referent group,
tab apache5 , gen(Iapache)
describe I* // describe all variables beginning with I
              storage  display     value
variable name   type   format      label      variable label
---------------------------------------------------------------
Iapache1        byte   %8.0g                  apache5== 1.0000
Iapache2        byte   %8.0g                  apache5== 2.0000
Iapache3        byte   %8.0g                  apache5== 3.0000
Iapache4        byte   %8.0g                  apache5== 4.0000
Iapache5        byte   %8.0g                  apache5== 5.0000
Note: The * is called a “wildcard”, which means any text whatsoever in the
variable name beginning with I.
Stata Version 10:
Rather than generate dummy variables with nice variable names, since we are not sure we
want to use quintiles just yet, we can use Stata’s generate indicator (xi) facility, which
creates 4 dummy variables with the lowest category as the referent:
_Iapache5_2
_Iapache5_3
_Iapache5_4
_Iapache5_5
and then runs the model.
xi: logistic fate i.apache5
Note: The “xi:” placed before any regression command, informs the regression
command to generate indicator variables for any categorical variable preceded by
“i.”.
This is very fast, but it always assumes the first category is the referent, which may not
be what you had in mind.
i.apache5         _Iapache5_1-5       (naturally coded; _Iapache5_1 omitted)

note: _Iapache5_2 != 0 predicts failure perfectly
      _Iapache5_2 dropped and 8 obs not used
Logistic regression                               Number of obs   =         30
                                                  LR chi2(3)      =      19.72
                                                  Prob > chi2     =     0.0002
Log likelihood = -10.665332                       Pseudo R2       =     0.4804

------------------------------------------------------------------------------
        fate | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 _Iapache5_3 |   1.11e+08   1.46e+08    14.00   0.000      8283299    1.48e+09
 _Iapache5_4 |   5.81e+08   8.83e+08    13.28   0.000     2.96e+07    1.14e+10
 _Iapache5_5 |   4.98e+08          .        .       .            .           .
------------------------------------------------------------------------------
note: 8 failures and 0 successes completely determined.
The “xi:” in the command “xi: logistic fate i.apache5” dropped the first quintile
to use as the referent group, which is why we see the message _Iapache5_1 omitted
This model is a complete disaster.
Stata Version 11:
Rather than generate dummy variables with nice variable names, since we are not sure we
want to use quintiles just yet, we can let Stata create dummy variables behind the scenes.
Selecting category 1 as the baseline, or referent,
logistic fate ib1.apache5     // Stata version 11
note: 2.apache5 != 0 predicts failure perfectly
2.apache5 dropped and 8 obs not used
convergence not achieved
Logistic regression                               Number of obs   =         30
                                                  LR chi2(2)      =      19.72
                                                  Prob > chi2     =     0.0001
Log likelihood = -10.665332                       Pseudo R2       =     0.4804

------------------------------------------------------------------------------
        fate | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     apache5 |
          2  |    (empty)
          3  |   9.37e+07   1.24e+08    13.88   0.000      7008821    1.25e+09
          4  |   4.92e+08   7.47e+08    13.17   0.000     2.50e+07    9.67e+09
          5  |   4.22e+08          .        .       .            .           .
------------------------------------------------------------------------------
Note: 8 failures and 0 successes completely determined.
convergence not achieved
r(430);
end of do-file
r(430);
This model is a complete disaster.
First, consider the messages shown in blue.
note: 2.apache5 != 0 predicts failure perfectly
2.apache5 dropped and 8 obs not used
convergence not achieved
Logistic regression                               Number of obs   =         30
                                                  LR chi2(2)      =      19.72
                                                  Prob > chi2     =     0.0001
Log likelihood = -10.665332                       Pseudo R2       =     0.4804

------------------------------------------------------------------------------
        fate | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     apache5 |
          2  |    (empty)
          3  |   9.37e+07   1.24e+08    13.88   0.000      7008821    1.25e+09
          4  |   4.92e+08   7.47e+08    13.17   0.000     2.50e+07    9.67e+09
          5  |   4.22e+08          .        .       .            .           .
------------------------------------------------------------------------------
Note: 8 failures and 0 successes completely determined.
convergence not achieved
r(430);
Stata informs us that the second quintile, 2.apache5, for its values not equal to zero
“!=0” (“!=”, as well as “~=” are the “not equal” symbols in Stata), predicted failure
perfectly. Therefore, the 8 observations in the 2nd quintile are dropped.
Verifying that is the case:
tab apache5 fate
         5 | Mortal Status at 30
 quantiles |         Days
 of apache |     Alive       Dead |     Total
-----------+----------------------+----------
         1 |         8          0 |         8
         2 |         8          0 |         8
         3 |         3          4 |         7
         4 |         1          7 |         8
         5 |         1          6 |         7
-----------+----------------------+----------
     Total |        21         17 |        38
We see that all 8 patients in the second quintile were alive (fate = 0). For these
observations the indicator 2.apache5 equals 1, which is not equal to zero, and it predicts
the outcome fate = 0 (“failure”, in Stata’s terminology) perfectly. From this
crosstabulation table, we see that no deaths occurred in the second quintile.
Looking at the indicator variable for the second quintile, created by the “xi” command, if
you ran that,
tab Iapache2 fate
 apache5== | Mortal Status at 30
    2.0000 |         Days
           |     Alive       Dead |     Total
-----------+----------------------+----------
         0 |        13         17 |        30
         1 |         8          0 |         8
-----------+----------------------+----------
     Total |        21         17 |        38
or without labels,
tab Iapache2 fate, nolabel
 apache5== | Mortal Status at 30
    2.0000 |         Days
           |         0          1 |     Total
-----------+----------------------+----------
         0 |        13         17 |        30
         1 |         8          0 |         8
-----------+----------------------+----------
     Total |        21         17 |        38
We see there is no variation in quintile 2, in that there were no deaths.
Why is that a problem?
The model cannot be fitted because the coefficient for the second quintile indicator
variable is effectively negative infinity, or “infinitely protective.” Stata’s solution, then, is
to simply drop the variable, along with the observations identified by the indicator
variable (Iapache2==1) (Long and Freese, 2006, pp.192-193).
Notice that the sample size was reduced from n=38 to n=30, when you compare the two
models above.
Long and Freese’s explanation is consistent with what happens with an odds ratio
calculation in a 2 × 2 table that contains a cell with zero. Notice the odds ratio is ad/bc =
(13 × 0)/(17 × 8) = 0. Mathematically, the log odds ratio is then undefined, since log(0) is
undefined (the graph of the log odds ratio asymptotically approaches negative infinity as
the odds ratio approaches 0).
 apache5== | Mortal Status at 30
    2.0000 |         Days
           |     Alive       Dead |     Total
-----------+----------------------+----------
         0 |        13         17 |        30
         1 |         8          0 |         8
-----------+----------------------+----------
     Total |        21         17 |        38
It would not help to even recode the variable,
recode Iapache2 0=1 1=0 , gen(Iapache2rev)
tab Iapache2rev fate
 RECODE of |
  Iapache2 | Mortal Status at 30
(apache5== |         Days
   2.0000) |     Alive       Dead |     Total
-----------+----------------------+----------
         0 |         8          0 |         8
         1 |        13         17 |        30
-----------+----------------------+----------
     Total |        21         17 |        38
because this time the odds ratio itself is undefined, since we would have to
divide by 0 [OR = (8 × 17)/(0 × 13) = 136/0].

Either way, the model is undefined for that variable, and so Stata has to drop it to proceed
with fitting the model.
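The two zero-cell situations just described can be seen in a short illustrative Python sketch (a hypothetical helper, not part of the course materials):

```python
def odds_ratio(a, b, c, d):
    # OR = ad/bc for a 2x2 table laid out as [[a, b], [c, d]]
    if b * c == 0:
        return float("inf")   # undefined: would require division by zero
    return (a * d) / (b * c)

# Quintile-2 indicator vs fate, as in the chapter's crosstab:
# ad = 13 * 0 = 0, so OR = 0 and log(OR) is negative infinity
print(odds_ratio(13, 17, 8, 0))   # 0.0

# After recoding the indicator, the zero moves into the denominator,
# so the odds ratio itself is undefined
print(odds_ratio(8, 0, 13, 17))   # inf
```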
Next, let’s consider the regression coefficient and standard errors that Stata left in the
model.
note: 2.apache5 != 0 predicts failure perfectly
      2.apache5 dropped and 8 obs not used
convergence not achieved

Logistic regression                               Number of obs   =         30
                                                  LR chi2(2)      =      19.72
                                                  Prob > chi2     =     0.0001
Log likelihood = -10.665332                       Pseudo R2       =     0.4804

------------------------------------------------------------------------------
        fate | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     apache5 |
          2  |    (empty)
          3  |   9.37e+07   1.24e+08    13.88   0.000      7008821    1.25e+09
          4  |   4.92e+08   7.47e+08    13.17   0.000     2.50e+07    9.67e+09
          5  |   4.22e+08          .        .       .            .           .
------------------------------------------------------------------------------
Note: 8 failures and 0 successes completely determined.
convergence not achieved
r(430);
When you see very large standard errors, you have a problem with multicollinearity (high
correlation among the predictor variables). To see this multicollinearity,
tab apache5 fate
         5 | Mortal Status at 30
 quantiles |         Days
 of apache |     Alive       Dead |     Total
-----------+----------------------+----------
         1 |         8          0 |         8
         2 |         8          0 |         8
         3 |         3          4 |         7
         4 |         1          7 |         8
         5 |         1          6 |         7
-----------+----------------------+----------
     Total |        21         17 |        38
Notice that the 3rd through 5th quintiles, taken as a set, predict all of the deaths. When
they are in the model together, having already dropped category 2 from the model,
near-perfect collinearity exists [because the sum of these three indicator variables is
nearly identical to the behind-the-scenes column of 1’s that represents the intercept
term].
To illustrate, we will fit models with various combinations of the indicator variables, with
the indicators left out representing the referent group.
logistic fate Iapache3
logistic fate Iapache3 Iapache4
logistic fate Iapache3 Iapache5
logistic fate Iapache3 Iapache4 Iapache5
. logistic fate Iapache3

Logistic regression                               Number of obs   =         38
                                                  LR chi2(1)      =       0.53
                                                  Prob > chi2     =     0.4660
Log likelihood = -25.862927                       Pseudo R2       =     0.0102

------------------------------------------------------------------------------
        fate | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    Iapache3 |   1.846154   1.561951     0.72   0.469     .3516439     9.69243
------------------------------------------------------------------------------

. logistic fate Iapache3 Iapache4

Logistic regression                               Number of obs   =         38
                                                  LR chi2(2)      =      10.27
                                                  Prob > chi2     =     0.0059
Log likelihood = -20.995701                       Pseudo R2       =     0.1964

------------------------------------------------------------------------------
        fate | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    Iapache3 |   3.777778    3.39753     1.48   0.139     .6482038    22.01716
    Iapache4 |   19.83333   23.20032     2.55   0.011     2.003046    196.3814
------------------------------------------------------------------------------

. logistic fate Iapache3 Iapache5

Logistic regression                               Number of obs   =         38
                                                  LR chi2(2)      =       7.98
                                                  Prob > chi2     =     0.0185
Log likelihood = -22.138465                       Pseudo R2       =     0.1527

------------------------------------------------------------------------------
        fate | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    Iapache3 |   3.238095   2.868985     1.33   0.185     .5703171    18.38497
    Iapache5 |   14.57143   17.04513     2.29   0.022     1.471626    144.2802
------------------------------------------------------------------------------

. logistic fate Iapache3 Iapache4 Iapache5
convergence not achieved

Logistic regression                               Number of obs   =         38
                                                  LR chi2(2)      =      30.93
                                                  Prob > chi2     =     0.0000
Log likelihood = -10.665332                       Pseudo R2       =     0.5918

------------------------------------------------------------------------------
        fate | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    Iapache3 |   1.79e+08   2.37e+08    14.37   0.000     1.34e+07    2.40e+09
    Iapache4 |   9.42e+08   1.43e+09    13.60   0.000     4.79e+07    1.85e+10
    Iapache5 |   8.07e+08          .        .       .            .           .
------------------------------------------------------------------------------
Note: 16 failures and 0 successes completely determined.
convergence not achieved
r(430);
We see that the model converged on a solution until we got to the point where the three
indicators, taken as a set, identified all of the death cases.
This is an issue with maximum likelihood estimation, which cannot fit the logistic model
when perfect, or nearly perfect, discrimination is achieved.
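To see numerically why maximum likelihood fails under perfect discrimination, consider the following illustrative Python sketch. It evaluates the log likelihood for a completely separated dataset (the same ten observations used in the next section's example) at increasing values of the slope coefficient, with the intercept fixed so the decision boundary sits at 24.5. Fixing the intercept is a simplifying assumption, but the full two-parameter likelihood behaves the same way along this direction: the likelihood keeps increasing toward 1 (log likelihood toward 0), so no finite maximum exists.

```python
import math

# Completely separated data: outcome 1 for exposure 20-24, 0 for 25-29
data = [(1, x) for x in range(20, 25)] + [(0, x) for x in range(25, 30)]

def log_lik(beta):
    # slope beta, with the intercept chosen so the boundary sits at 24.5
    ll = 0.0
    for y, x in data:
        p = 1.0 / (1.0 + math.exp(-beta * (24.5 - x)))
        ll += math.log(p) if y == 1 else math.log(1.0 - p)
    return ll

# Each larger coefficient improves the likelihood; the "maximum" is at
# beta = infinity, so the iterative search can never converge.
for beta in (1, 5, 10, 20):
    print(round(log_lik(beta), 6))
```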
When Maximum Likelihood Will Fail Completely
There are some datasets for which maximum likelihood estimates do not even exist. This
occurs if there is complete separation or quasi-complete separation.
For any number of predictor variables, if you were to plot the data in however many
dimensions is required, and you can draw a line (or plane, or hyperplane) that separates
the outcome=1 values from the outcome=0 values, then you have complete separation. If
just a few values overlap, then you have quasi-complete separation.
Example of “Complete Separation”
To see an example, with one predictor variable,
clear
input id disease exposure
1 1 20
2 1 21
3 1 22
4 1 23
5 1 24
6 0 25
7 0 26
8 0 27
9 0 28
10 0 29
end
twoway scatter disease exposure, xline(24.5)
[Figure: scatterplot of disease vs exposure, with a vertical reference line at exposure = 24.5]
This is a case of complete separation, since a vertical line drawn at exposure=24.5
separates all of the disease=1 values from the disease=0 values.
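The vertical line test for one predictor can be written as a simple check: the data are completely separated when every predictor value in one outcome group lies strictly to one side of the other group. An illustrative Python sketch (not part of the course materials):

```python
def completely_separated(pairs):
    # pairs: (outcome, predictor) tuples; strict inequality means a cutpoint
    # exists that puts all outcome=1 values on one side and outcome=0 on the other
    ones = [x for y, x in pairs if y == 1]
    zeros = [x for y, x in pairs if y == 0]
    return max(ones) < min(zeros) or max(zeros) < min(ones)

data = [(1, x) for x in (20, 21, 22, 23, 24)] + \
       [(0, x) for x in (25, 26, 27, 28, 29)]
print(completely_separated(data))   # True: a line at 24.5 separates the groups

# Quasi-complete separation: move id 6's exposure from 25 to 24, as in the
# next section, so the two groups touch at a single value
data2 = [(1, x) for x in (20, 21, 22, 23, 24)] + \
        [(0, x) for x in (24, 26, 27, 28, 29)]
print(completely_separated(data2))  # False: the strict test fails at the overlap
```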
Attempting to model this with logistic regression, fitted by maximum likelihood
estimation,
logistic disease exposure
. logistic disease exposure
outcome = exposure <= 24 predicts data perfectly
r(2000);
the model simply crashes.
This is a very frustrating result, since clearly exposure is associated with disease. In fact,
the cutpoint at exposure ≤ 24 predicts the data perfectly, so we are really on to something
clinically interesting.
Exact Logistic Regression Solution to Complete Separation Example
We can solve this problem by using exact logistic regression.
exlogistic disease exposure
note: CMLE estimate for exposure is -inf; computing MUE
Exact logistic regression                         Number of obs =         10
                                                  Model score   =  6.818182
                                                  Pr >= score   =     0.0079

---------------------------------------------------------------------------
     disease | Odds Ratio       Suff.  2*Pr(Suff.)   [95% Conf. Interval]
-------------+-------------------------------------------------------------
    exposure |   .3732273*        110       0.0079           0    .8397828
---------------------------------------------------------------------------
(*) median unbiased estimates (MUE)
The median unbiased estimate (MUE) is reported whenever the conditional maximum
likelihood estimate (CMLE) cannot be obtained. Either estimate is fine, so there is no
need to inform the reader which one you are reporting.
Exact logistic regression provides the result (OR=0.37, 95% CI, 0-0.84, p=0.008).
Example of “Quasi-Complete Separation”
Let’s make one value overlap to create an example of quasi-complete separation.
Beginning with the same dataset,
     +-------------------------+
     | id   disease   exposure |
     |-------------------------|
  1. |  1         1         20 |
  2. |  2         1         21 |
  3. |  3         1         22 |
  4. |  4         1         23 |
  5. |  5         1         24 |
     |-------------------------|
  6. |  6         0         25 |
  7. |  7         0         26 |
  8. |  8         0         27 |
  9. |  9         0         28 |
 10. | 10         0         29 |
     +-------------------------+
Let’s change the exposure=25 to 24 for id=6.
replace exposure=24 if id==6
list
     +-------------------------+
     | id   disease   exposure |
     |-------------------------|
  1. |  1         1         20 |
  2. |  2         1         21 |
  3. |  3         1         22 |
  4. |  4         1         23 |
  5. |  5         1         24 |
     |-------------------------|
  6. |  6         0         24 |
  7. |  7         0         26 |
  8. |  8         0         27 |
  9. |  9         0         28 |
 10. | 10         0         29 |
     +-------------------------+
Chapter 5-6 (revision 16 May 2010)
p. 20
The dataset no longer passes the vertical line test for complete separation, but it comes
very close.
twoway scatter disease exposure, xline(24.5)

[Scatter plot of disease (0/1) against exposure (20-30), with a vertical reference line at exposure = 24.5]
Attempting to fit ordinary logistic regression to these data,
logistic disease exposure

. logistic disease exposure

note: outcome = exposure < 24 predicts data perfectly except for
      exposure == 24 subsample: exposure dropped and 8 obs not used

Logistic regression                               Number of obs   =        2
                                                  LR chi2(0)      =     0.00
                                                  Prob > chi2     =        .
Log likelihood = -1.3862944                       Pseudo R2       =   0.0000

------------------------------------------------------------------------------
     disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
------------------------------------------------------------------------------
The ordinary logistic regression model still could not be fit.
The exact logistic model can be fit, however.
exlogistic disease exposure

note: CMLE estimate for exposure is -inf; computing MUE

Exact logistic regression                         Number of obs =        10
                                                  Model score   =  6.291262
                                                  Pr >= score   =    0.0159

---------------------------------------------------------------------------
     disease | Odds Ratio     Suff.  2*Pr(Suff.)     [95% Conf. Interval]
-------------+-------------------------------------------------------------
    exposure |    .465919*      110       0.0159            0     .899146
---------------------------------------------------------------------------
(*) median unbiased estimates (MUE)
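The same brute-force enumeration reproduces the 2*Pr(Suff.) of 0.0159 here: after the replace command there are two 24s, so two of the 252 possible case assignments now achieve a sufficient statistic of 110 or less. As before, this Python sketch is an illustration, not exlogistic's actual algorithm:

```python
from itertools import combinations

# Exposure values after: replace exposure=24 if id==6
exposure = [20, 21, 22, 23, 24, 24, 26, 27, 28, 29]
observed_suff = 20 + 21 + 22 + 23 + 24  # cases' exposures sum to 110

sums = [sum(s) for s in combinations(exposure, 5)]
p_one_sided = sum(s <= observed_suff for s in sums) / len(sums)
print(round(2 * p_one_sided, 4))  # 0.0159
```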
Let’s see how exact logistic regression does with the APACHE quintiles data.
use "4.11.Sepsis.dta" , clear
xtile apache5 = apache ,nq(5)
tab apache5 , gen(Iapache)
exlogistic fate ib1.apache5
. exlogistic fate ib1.apache5
factor variables and time-series operators not allowed
r(101);
We discover that exlogistic does not work with Stata 11's factor-variable facility (putting
"i." in front of a categorical variable). Perhaps the command will be updated later to work
with this.
For now, specifying the model with all indicators but category 1 left out as the referent,
exlogistic fate Iapache2 Iapache3 Iapache4 Iapache5
note: distribution for (Iapache2 | Iapache3 I~5) is degenerate
note: CMLE estimate for Iapache3 is +inf; computing MUE
note: CMLE estimate for Iapache4 is +inf; computing MUE
note: CMLE estimate for Iapache5 is +inf; computing MUE

Exact logistic regression                         Number of obs =        38
                                                  Model score   =  23.42667
                                                  Pr >= score   =    0.0000

---------------------------------------------------------------------------
        fate | Odds Ratio     Suff.  2*Pr(Suff.)     [95% Conf. Interval]
-------------+-------------------------------------------------------------
    Iapache2 |          1         0                         0        +Inf
    Iapache3 |   9.827552*        4       0.0513     .9882278        +Inf
    Iapache4 |   34.16842*        7       0.0014     3.503236        +Inf
    Iapache5 |   29.14227*        6       0.0028     2.921463        +Inf
---------------------------------------------------------------------------
(*) median unbiased estimates (MUE)
Chapter 5-6 (revision 16 May 2010)
p. 22
As expected, it provides reasonable estimates for the 3rd, 4th, and 5th quintiles. Nothing
could be done with the 2nd quintile, which had no death events, but there is no reason not
to combine it with the 1st quintile, making it part of the referent group where no deaths
occurred.
Leaving both the 1st and 2nd quintiles out of the model, which combines them as the
reference group,
exlogistic fate Iapache3 Iapache4 Iapache5

note: CMLE estimate for Iapache3 is +inf; computing MUE
note: CMLE estimate for Iapache4 is +inf; computing MUE
note: CMLE estimate for Iapache5 is +inf; computing MUE

Exact logistic regression                         Number of obs =        38
                                                  Model score   =  23.42667
                                                  Pr >= score   =    0.0000

---------------------------------------------------------------------------
        fate | Odds Ratio     Suff.  2*Pr(Suff.)     [95% Conf. Interval]
-------------+-------------------------------------------------------------
    Iapache3 |    19.9062*        4       0.0079     2.093135        +Inf
    Iapache4 |   69.20091*        7       0.0000     7.478186        +Inf
    Iapache5 |   59.00067*        6       0.0001     6.219852        +Inf
---------------------------------------------------------------------------
(*) median unbiased estimates (MUE)
This is close to our previous model, which had only the first quintile as the referent:
Exact logistic regression                         Number of obs =        38
                                                  Model score   =  23.42667
                                                  Pr >= score   =    0.0000

---------------------------------------------------------------------------
        fate | Odds Ratio     Suff.  2*Pr(Suff.)     [95% Conf. Interval]
-------------+-------------------------------------------------------------
 _Iapache5_2 |          1         0                         0        +Inf
 _Iapache5_3 |   9.827552*        4       0.0513     .9882278        +Inf
 _Iapache5_4 |   34.16842*        7       0.0014     3.503236        +Inf
 _Iapache5_5 |   29.14227*        6       0.0028     2.921463        +Inf
---------------------------------------------------------------------------
(*) median unbiased estimates (MUE)
Because it has a larger number of subjects in the reference group (the combined 1st and
2nd quintiles), the second model would be considered more reliable. Notice that the
confidence intervals are tighter, with lower bounds further from zero, in the model with
the combined 1st and 2nd quintiles.
Another Example of Quasi-Separation
The dataset we will use for this example is described and analyzed in detail in King and
Ryan (2002).
The dataset is also described in Cytel Statistical Software’s LogXact5 sales brochure as
follows:
Red Blood Cells Settling Out of Suspension
“The erythrocyte sedimentation rate (ESR) is the rate at which red blood cells
settle out of suspension in blood under standard conditions. It is a commonly
used indicator in tests that screen for infections and certain diseases. A study
report in Collett (Modeling Binary Data, 1999, CRC) develops a logistic
regression model with a dichotomized response variable for ESR, with a value < 20
being coded as zero and a value ≥ 20 coded as one. The predictor variables are
Fibrinogen and Gamma globulin. The study, carried out by the Institute of
Medical Research, Malaysia, sought to determine if there is a relationship
between ESR and the predictor variables. The data (after removing outliers; for
details see Collett, pp. 8 and 168) are shown below:
ID   Fibrinogen   Gamma Globulin   ESR
 1      2.52            38          0
 2      2.56            31          0
 3      2.19            33          0
 4      2.18            31          0
 5      3.41            37          0
 6      2.46            36          0
 7      3.22            38          0
 8      2.21            37          0
 9      3.15            39          0
10      2.6             41          0
11      2.29            36          0
12      2.35            29          0
13      5.06            37          1
14      3.34            32          1
15      3.15            36          0
16      3.53            46          1
17      2.68            34          0
18      2.6             38          0
19      2.23            37          0
20      2.88            30          0
21      2.65            46          0
22      2.28            36          0
23      2.67            39          0
24      2.29            31          0
25      2.15            31          0
26      2.54            28          0
27      3.93            32          1
28      3.34            30          0
29      2.99            36          0
30      3.32            35          0
Results

P-Value for Fibrinogen
  Using large sample approximation:   0.439
  Using exact logistic regression:    0.001
Using the large sample (asymptotic) approximation would mislead an analyst to
erroneously conclude that Fibrinogen is not significantly related to ESR when in
fact there is a very significant relationship indicated by a p-value of 0.001 as
computed from the exact conditional distribution by LogXact. For a detailed
analysis of this data set comparing exact inference and asymptotic inference see
King and Ryan (“A Preliminary Investigation of Maximum Likelihood Logistic
Regression versus Exact Logistic Regression,” The American Statistician, 56,
163-170, 2002).”
Let’s try it in Stata.
use esr, clear
logistic esr fibrinogen gamglob

Logistic regression                               Number of obs   =       30
                                                  LR chi2(2)      =    18.11
                                                  Prob > chi2     =   0.0001
Log likelihood = -2.7244098                       Pseudo R2       =   0.7687

------------------------------------------------------------------------------
         esr | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  fibrinogen |   6.37e+10   2.05e+12     0.77   0.439     2.87e-17    1.41e+38
     gamglob |   .8704882   .3850277    -0.31   0.754     .3658187     2.07138
------------------------------------------------------------------------------
note: 17 failures and 1 success completely determined.
We see the odds ratio for fibrinogen "blowing up."
We can graphically observe quasi-separation with fibrinogen,
twoway scatter esr fibrinogen, xline(3.33)

[Scatter plot of esr (0/1) against fibrinogen (2-5), with a vertical reference line at fibrinogen = 3.33]
Separation is not a problem for gamma globulin,
twoway scatter esr gamglob

[Scatter plot of esr (0/1) against gamglob (30-45); no vertical line can separate the outcomes]
Graphing both variables together,
twoway (scatter fibrinogen gamglob, mlabel(esr)) (pci 3.8 26 3.4 47)

[Scatter plot of fibrinogen against gamglob, each point labeled with its esr value (0 or 1), with a diagonal line drawn from (gamglob=26, fibrinogen=3.8) to (gamglob=47, fibrinogen=3.4)]
We see that we can draw a diagonal line and only one ESR=1 case will cross over it,
suggesting quasi-separation defined by two variables. We can check to see if the “joint”
quasi-separation creates the problem by modeling the two variables separately.
logistic esr fibrinogen
logistic esr gamglob

. logistic esr fibrinogen

Logistic regression                               Number of obs   =       30
                                                  LR chi2(1)      =    17.98
                                                  Prob > chi2     =   0.0000
Log likelihood = -2.7911477                       Pseudo R2       =   0.7631

------------------------------------------------------------------------------
         esr | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  fibrinogen |   3.82e+07   5.16e+08     1.29   0.196     .0001238    1.18e+19
------------------------------------------------------------------------------
note: 9 failures and 1 success completely determined.

. logistic esr gamglob

Logistic regression                               Number of obs   =       30
                                                  LR chi2(1)      =     0.46
                                                  Prob > chi2     =   0.4961
Log likelihood = -11.548594                       Pseudo R2       =   0.0197

------------------------------------------------------------------------------
         esr | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gamglob |    1.08396   .1273327     0.69   0.493     .8610389    1.364596
------------------------------------------------------------------------------
We notice that the estimates do not "blow up" quite as much, but they still blow up too
much to provide a useful model.
Note: This example suggests that quasi-separation could be created by a set of variables.
This is something to watch for in your own datasets.
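The eyeball judgment above can be made concrete. Using the 30 observations from the Collett table shown earlier and the diagonal line drawn by the pci plot (from gamglob=26, fibrinogen=3.8 to gamglob=47, fibrinogen=3.4), a short Python sketch counts how many points fall on the wrong side of the boundary. Both line_height and the boundary itself are illustrative choices, not part of any formal test:

```python
# ESR data transcribed from the Collett table shown earlier.
fib = [2.52, 2.56, 2.19, 2.18, 3.41, 2.46, 3.22, 2.21, 3.15, 2.6,
       2.29, 2.35, 5.06, 3.34, 3.15, 3.53, 2.68, 2.6, 2.23, 2.88,
       2.65, 2.28, 2.67, 2.29, 2.15, 2.54, 3.93, 3.34, 2.99, 3.32]
gam = [38, 31, 33, 31, 37, 36, 38, 37, 39, 41, 36, 29, 37, 32, 36,
       46, 34, 38, 37, 30, 46, 36, 39, 31, 31, 28, 32, 30, 36, 35]
esr = [0]*12 + [1, 1, 0, 1] + [0]*10 + [1, 0, 0, 0]

def line_height(g):
    """Fibrinogen value of the eyeballed boundary at gamma globulin g,
    interpolating the segment from (26, 3.8) to (47, 3.4)."""
    return 3.8 + (3.4 - 3.8) / (47 - 26) * (g - 26)

# A point is on the "wrong side" when ESR=1 falls below the line or
# ESR=0 falls above it.
wrong_side = sum((f > line_height(g)) != (e == 1)
                 for f, g, e in zip(fib, gam, esr))
print(wrong_side)  # 1
```

Only one ESR=1 observation crosses the line, confirming the near-separation visible in the plot.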
Adding an interaction term does not help.
gen fibgam=fibrinogen*gamglob
logistic esr fibrinogen gamglob fibgam
Logistic regression                               Number of obs   =       30
                                                  LR chi2(3)      =    18.85
                                                  Prob > chi2     =   0.0003
Log likelihood = -2.3542984                       Pseudo R2       =   0.8001

------------------------------------------------------------------------------
         esr | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  fibrinogen |   2.44e-27   2.03e-25    -0.74   0.462     3.05e-98    1.96e+44
     gamglob |   .0002056   .0020858    -0.84   0.403     4.75e-13    88962.48
      fibgam |   11.30184   32.69964     0.84   0.402     .0389374     3280.44
------------------------------------------------------------------------------
The solution is to use exact logistic regression and report its result.
exlogistic esr fibrinogen gamglob

Exact logistic regression                         Number of obs =        30
                                                  Model score   =  14.61946
                                                  Pr >= score   =    0.0004

---------------------------------------------------------------------------
         esr | Odds Ratio     Suff.  2*Pr(Suff.)     [95% Conf. Interval]
-------------+-------------------------------------------------------------
  fibrinogen |   12.79579*    15.86       0.0022     2.262284        +Inf
     gamglob |          1*      147       1.0000     .1601282        +Inf
---------------------------------------------------------------------------
(*) median unbiased estimates (MUE)
Protocol Suggestion
If you suspect that you will have near or perfect separation, and particularly if you have a
sample size < 100, you could say something like the following in your protocol:

The outcome will be modeled using logistic regression with potential confounding
variables included as covariates. Interaction terms will be included to assess
effect-measure modification, and then removed if not significant. A graphical
assessment of quasi-separation will be performed, as quasi-separation can lead to
inaccurate maximum likelihood estimates (King and Ryan, 2002). If quasi-separation
is present, the data will be modeled using exact logistic regression
(Mehta and Patel, 1995).
Articles on Exact Logistic Regression
Here are four papers on exact logistic regression:
Ammann, R.A. (2004). Defibrotide for hepatic VOD in children: exact statistics can help!
Bone Marrow Transplantation 34: 277-278.
King EN, Ryan TP. (2002). A preliminary investigation of maximum likelihood logistic
regression versus exact logistic regression. The American Statistician 56(3):163-170.
Mehta CR, et al. (2000). Efficient Monte Carlo methods for conditional logistic
regression. J Am Stat Assoc 95(449):99-108.
Bull SB, Mak C, Greenwood CMT. (2002). A modified score function estimator for
multinomial logistic regression in small samples. Computational Statistics and
Data Analysis. 39:57-74.
Any statistician consulting with a client where exact logistic regression is needed, or any
researcher needing to convey the concept to a co-author, should share the Ammann
paper, which is a one-page article.
Overfitting
Overfitting is a common problem in regression models. It is the problem of obtaining
unreliable associations, which will not show up in future datasets or patients, due to
having too many predictor variables for the number of events or sample size.
This would be a good time to review the topic (see Chapter 2-5, pp.24-31).
Exact Solution to Overfitting
Suppose you want to publish a paper where you have only 10 cases of the disease
outcome and you want to fit a logistic regression with four predictor variables. Clearly
this will produce an overfitting problem, for which you could be criticized (you need at
least 5 cases for every predictor if the aim is to adjust for confounding, and 10 or more
cases per predictor if the aim is to develop a prediction model).
Likewise, suppose you wanted to show univariable logistic regression models for which
there were zero cases with an exposure for some of your predictor variables, so that the
logistic regression model could not be fit at all.
Exact logistic regression would provide a solution to both issues.
Here is a rather lengthy Statistical Methods paragraph, which you could use if a
reviewer came back and asked you to elaborate on the use of exact logistic regression:
“All reported p values, odds ratios, and confidence intervals were obtained using
exact logistic regression (LogXact-5 statistical software, Cambridge, MA: Cytel
Software Corporation). Ordinary maximum likelihood logistic regression fails
when: 1) the data are sparse, such as few outcome events, 2) the number of events
divided by the number of predictor variables in the model is small, such as < 10,
or 3) there exists near perfect or perfect separation, where all events occur in one
predictor category or the other. In these cases, an ordinary logistic regression
model cannot be fit at all, or when it can be fit, the estimates of odds ratios,
confidence intervals, and p values are biased. Exact logistic regression, on the
other hand, can fit a model and the model estimates are unbiased (King and
Ryan, 2002; Mehta, 2000; Ammann, 2004). Using exact logistic regression, we
were able to obtain unbiased estimates in both univariable and multivariable
models, even though we only had 12 HHV-6 Positive cases for our primary
analysis, and 5 cases in our secondary analysis.”
Look at the above quote and notice the three times that exact logistic regression is
indicated.
Mehta, one of the developers of LogXact, specifically stated that exact logistic
regression is not affected by the overfitting problem (Mehta, 2000, Introduction
paragraph):
“Logistic regression is a popular mathematical model for the analysis of binary
data with widespread applicability in the physical, biomedical, and behavioral
sciences. Parameter inference for this model is usually based on maximizing the
unconditional likelihood function. For large well-balanced datasets or for datasets
with only a few parameters, unconditional maximum likelihood inference is a
satisfactory approach. However unconditional maximum likelihood inference can
produce inconsistent point estimates, inaccurate p values, and inaccurate
confidence intervals for small or unbalanced datasets and for datasets with a large
number of parameters relative to the number of observations. Sometimes the
method fails entirely as no estimates can be found that maximize the
unconditional likelihood function. A methodologically sound alternative approach
that has none of the aforementioned drawbacks is the exact conditional approach.
Here one estimates the parameters of interest by computing the exact permutation
distributions of their sufficient statistics, conditional on the observed values of the
sufficient statistics for the remaining nuisance parameters.”
Defining an Odds Ratio When a Cell Has a Zero Count
Sometimes exact logistic regression produces an OR in the opposite direction from what
you would expect, when the result is not statistically significant. As an
example,
clear
input hhv6 plate count
1 1 0
1 0 5
0 1 11
0 0 58
end
drop if count==0
expand count
drop count
exlogistic hhv6 plate
*
clear
input hhv6 death count
1 1 0
1 0 5
0 1 5
0 0 64
end
drop if count==0
expand count
drop count
exlogistic hhv6 death
Exact logistic regression                         Number of obs =        74
                                                  Model score   =  .9236255
                                                  Pr >= score   =    0.5932

---------------------------------------------------------------------------
        hhv6 | Odds Ratio     Suff.  2*Pr(Suff.)     [95% Conf. Interval]
-------------+-------------------------------------------------------------
       plate |   .8200102*        0       0.8727            0    6.579727
---------------------------------------------------------------------------
(*) median unbiased estimates (MUE)

Exact logistic regression                         Number of obs =        74
                                                  Model score   =  .3833228
                                                  Pr >= score   =    1.0000

---------------------------------------------------------------------------
        hhv6 | Odds Ratio     Suff.  2*Pr(Suff.)     [95% Conf. Interval]
-------------+-------------------------------------------------------------
       death |   2.045171*        0       1.0000            0    18.28607
---------------------------------------------------------------------------
(*) median unbiased estimates (MUE)
Exact logistic regression produced the following non-significant ORs and CIs:

Clinical Variable        HHV-6 Positive   HHV-6 Negative   OR (95% CI)
                             (n=5)            (n=69)
Platelets < 100,000         0 (0%)          11 (16%)       0.8 (0-6.6)
Death before Discharge      0 (0%)           5 (7%)        2.0 (0-18.3)
Given the zero in the cell of the 2 × 2 table that would make the association infinitely
protective, it would seem that exact logistic regression would provide an odds ratio < 1.0.
It did for one of the associations shown, but not for the other.
Although it can at least provide an odds ratio, which ordinary logistic regression cannot
do, exact logistic regression can give these unexpected OR estimates in the
non-statistically significant situations of zero cells.
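For a single binary predictor, the 2*Pr(Suff.) column can be reproduced from the hypergeometric distribution that results from conditioning on the table margins. A sketch in Python (two_sided_suff_p is a hypothetical helper; the cap at 1 mirrors what exlogistic printed for the death model):

```python
from math import comb

def two_sided_suff_p(a, n1, n0, m1):
    """2*Pr(Suff <= a) for a 2x2 table with n1 exposed, n0 unexposed,
    and m1 total events, conditioning on the margins (hypergeometric),
    capped at 1."""
    p = sum(comb(n1, k) * comb(n0, m1 - k)
            for k in range(0, a + 1)) / comb(n1 + n0, m1)
    return min(1.0, 2 * p)

# Platelets table: 5 HHV-6 positive, 69 negative, 11 low-platelet
# events, 0 of them among the positives.
print(round(two_sided_suff_p(0, 5, 69, 11), 4))  # 0.8727
# Death table: 5 deaths, again 0 among the HHV-6 positives.
print(two_sided_suff_p(0, 5, 69, 5))             # 1.0
```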
The same thing happens with the long-accepted practice of adding ½ to each cell of the 2
× 2 table to avoid zero cell counts. Selvin (2004, p.450) provides Haldane’s (1956)
formulas, which are:
OR = [(a + 1/2)(d + 1/2)] / [(b + 1/2)(c + 1/2)]

with estimated variance

variance[log(OR)] = 1/(a + 1/2) + 1/(b + 1/2) + 1/(c + 1/2) + 1/(d + 1/2)
The two odds ratios computed using Haldane's method are

                                 HHV-6 Positive
                                 Yes      No
Platelets < 100,000   Yes          0      11
                      No           5      58

OR = (0.5 × 58.5)/(11.5 × 5.5) = 0.46

and

                                 HHV-6 Positive
                                 Yes      No
Death before          Yes          0       5
Discharge             No           5      64

OR = (0.5 × 64.5)/(5.5 × 5.5) = 1.07
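Haldane's correction is easy to verify in a few lines of Python (haldane_or and var_log_or are hypothetical helpers implementing the formulas quoted from Selvin above):

```python
def haldane_or(a, b, c, d):
    """Haldane's half-correction odds ratio for the 2x2 table
    [[a, b], [c, d]], defined even when a cell count is zero."""
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))

def var_log_or(a, b, c, d):
    """Estimated variance of log(OR) under the same correction."""
    return sum(1 / (x + 0.5) for x in (a, b, c, d))

# Platelets < 100,000 table: a=0, b=11, c=5, d=58
print(round(haldane_or(0, 11, 5, 58), 2))  # 0.46
# Death before Discharge table: a=0, b=5, c=5, d=64
print(round(haldane_or(0, 5, 5, 64), 2))   # 1.07
```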
Case Study: Schroerlucke Dataset
Returning to the case study, where ordinary logistic regression could not be fitted,
we can fit the univariable exact logistic model.
After reading in the dataset, schroerlucke.dta, we fit the model without covariates, using
exlogistic failed blount

note: CMLE estimate for blount is +inf; computing MUE

Exact logistic regression                         Number of obs =        31
                                                  Model score   =  7.536232
                                                  Pr >= score   =    0.0096

---------------------------------------------------------------------------
      failed | Odds Ratio     Suff.  2*Pr(Suff.)     [95% Conf. Interval]
-------------+-------------------------------------------------------------
      blount |   12.48967*        8       0.0111     1.655921        +Inf
---------------------------------------------------------------------------
(*) median unbiased estimates (MUE)
Next, adjusting for weight,
exlogistic failed blount weight

note: CMLE estimate for blount is +inf; computing MUE

Exact logistic regression                         Number of obs =        31
                                                  Model score   =   7.65015
                                                  Pr >= score   =    0.0178

---------------------------------------------------------------------------
      failed | Odds Ratio     Suff.  2*Pr(Suff.)     [95% Conf. Interval]
-------------+-------------------------------------------------------------
      blount |   8.495888*        8       0.0600     .9205344        +Inf
      weight |    1.01012      762.3      0.7005     .9613036    1.064392
---------------------------------------------------------------------------
(*) median unbiased estimates (MUE)
It appears that Schroerlucke et al concluded the right thing: Blount disease increases
the risk of screw breakage, after controlling for the patient's weight.
Two things would need to be argued:
1) That having up to two implants on the same patient does not require anything special
to account for a potential lack of independence in the data. We will return to this later in
the course after we have covered the topic of clustered sampling.
2) It is okay to conclude an effect with a p value slightly larger than 0.05. An argument
for this is found in Chapter 2-13, p.5.
Appendix. Exact Logistic Regression Using SAS (included here in case your
co-investigators are SAS users)

Exact logistic regression can also be computed in SAS, which is a widely used statistical
software package. How to do this will be demonstrated using the complete separation
example from page 13.

[Scatter plot of disease (0/1) against exposure (20-30), as shown earlier]
These data are in the Excel file, “complete separation.xls”.
id   disease   exposure
 1      1         20
 2      1         21
 3      1         22
 4      1         23
 5      1         24
 6      0         25
 7      0         26
 8      0         27
 9      0         28
10      0         29
To read this Excel file into SAS, copy the following into the SAS Editor window and hit
the run button (the toolbar icon that looks like a little man running).
PROC IMPORT OUT= WORK.DATA1
DATAFILE= "C:\Documents and Settings\u0032770.SRVR\D
esktop\regressionclass\datasets & do-files\complete separation.xls"
DBMS=EXCEL REPLACE;
SHEET="Sheet1$";
GETNAMES=YES;
RUN;
Note: The DATAFILE line is very sensitive to embedded spaces. Notice that the
continuation of the line must begin at the left margin. If you add spaces or tab over, it
will think those spaces or that tab are part of the directory path and then give you an error
message because it cannot find the file.
To run an ordinary logistic regression, copy the following into the Editor window,
highlight it, and hit the run button.
proc logistic descending data=work.data1;
model disease=exposure;
run;
Note: By default, the logistic procedure in SAS thinks the outcome event is scored as 0
(0=disease, 1=not disease). Always be sure to include the word “descending” after
“logistic”, as shown here, for the outcome event to be scored as 1 (1=disease, 0=not
disease). SAS confirms your choice by displaying the following in the Log window
when you run this block of commands:
NOTE: PROC LOGISTIC is modeling the probability that disease=1.
When you run this block of commands, you get the following warning in the Log
window:
WARNING: There is a complete separation of data points. The maximum
likelihood estimate does not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning.
Results shown are based on the last maximum likelihood
iteration. Validity of the model fit is questionable.
In the SAS Output window, you see the result:

The LOGISTIC Procedure

WARNING: The validity of the model fit is questionable.

              Analysis of Maximum Likelihood Estimates

                                Standard        Wald
Parameter    DF    Estimate       Error      Chi-Square    Pr > ChiSq
Intercept     1       247.0       433.7         0.3244         0.5690
exposure      1    -10.0824     17.6974         0.3246         0.5689

                   Odds Ratio Estimates

             Point          95% Wald
Effect       Estimate    Confidence Limits
exposure     <0.001      <0.001    >999.999
Whereas Stata simply refuses to fit the model and gives an error message, SAS outputs
a model that looks valid at first glance but actually blows up (OR < 0.001; 95% CI,
<0.001 to >999.999).
To fit an exact logistic regression, use the following:
proc logistic descending data=work.data1;
model disease=exposure;
exact exposure/estimate=both;
run;
Since we specified “/estimate=both”, we get the ordinary logistic regression followed by
the exact logistic regression. The exact logistic regression from the Output window is:
Exact Conditional Analysis

Conditional Exact Tests

                   Exact Odds Ratios

                              95% Confidence
Parameter     Estimate            Limits           p-Value
exposure        0.373*        0        0.840        0.0079

NOTE: * indicates a median unbiased estimate.
This result agrees exactly with LogXact-7, from page 14:
Parameter Estimates

                  Point Estimate        Confidence Interval and P-Value for Odds Ratio
                                                       95% CI            2*1-sided
Model Term  Type  Odds Ratio  SE(Odds)  Type       Lower     Upper       P-Value
%Const      MLE   ?           ?         Asymptotic ?         ?           ?
exposure    MLE   ?           ?         Asymptotic ?         ?           ?
            MUE   0.3732      NA        Exact      0         0.8398      0.007937
As an enhancement, to get nicely formatted output in SAS, add the following lines:
ods pdf;
ods graphics on;
proc logistic descending data=work.data1;
model disease=exposure;
exact exposure/estimate=both;
run;
ods graphics off;
ods pdf close;
Not only is the output in nice looking tables, but it is in pdf format, which can be saved as
a pdf file.
References
Ammann, R.A. (2004). Defibrotide for hepatic VOD in children: exact statistics can help!
Bone Marrow Transplantation 34: 277-278.
Altman DG. (1991). Practical Statistics for Medical Research. New York, Chapman &
Hall/CRC.
Cuchel M, Bloedon LT, Szapary PO, et al. (2007). Inhibition of microsomal triglyceride
transfer protein in familial hypercholesterolemia. N Engl J Med 356:148-56.
Dupont WD. (2002). Statistical Modeling for Biomedical Researchers: a Simple
Introduction to the Analysis of Complex Data. Cambridge, Cambridge University
Press.
Haldane JBS. (1956). The estimation and significance of logarithm of a ratio of
frequencies. Annals of Human Genetics 20:309-11.
King EN, Ryan TP. (2002). A preliminary investigation of maximum likelihood logistic
regression versus exact logistic regression. The American Statistician 56(3):163-170.
Long JS, Freese J. (2006). Regression models for categorical dependent variables using
Stata. 2nd edition. College Station TX, Stata Press.
Mehta CR, Patel NR. (1995). Exact logistic regression: theory and examples. Statistics in
Medicine 14:2143-2160.
Mehta CR, et al. (2000). Efficient Monte Carlo methods for conditional logistic
regression. J Am Stat Assoc 95(449):99-108.
Rosner B. (1995). Fundamentals of Biostatistics, 4th ed. Belmont CA, Duxbury Press.
Schroerlucke S, Bertrand S, Clapp J, et al. (2009). Failure of orthofix eight-plate for the
treatment of blount disease. J Pediatr Orthop 29(1):57-60.
Selvin S. (2004). Statistical Analysis of Epidemiologic Data. 3rd ed. New York, Oxford
University Press.