Logistic Regression
© 2009, KAISER PERMANENTE CENTER FOR HEALTH RESEARCH
Linear Regression Review
FEV1 = β0 + β1Age + β2Height
E(FEV1) = μ = β0 + β1Age + β2Height
where the data are assumed to be normally
distributed with mean equal to μ
Models such as these are appropriate for continuous
outcome measures such as FEV1, weight, blood pressure
What if our outcome is Binary?
Common binary outcome measures
• Healthy vs unhealthy
  - E.g., heart disease (y/n), cancer (y/n), COPD (y/n)
• Progressive disease vs stable disease
  - Based on, e.g., cancer stage
• Alive vs dead
Convenient coding of binary outcomes
• COPD  = 0 if FEV1/FVC > 0.70
        = 1 if FEV1/FVC < 0.70
• Large = 0 if tumor size is “small”
        = 1 if tumor size is “large”
• Dead  = 0 if alive
        = 1 if deceased
Note use of 0/1 coding and descriptive names that define “1”
The Logistic Regression Model
• Consider the case of a binary indicator of vital status
  Dead = 0 if alive
       = 1 if deceased
• If Dead is coded 0/1, then its expected value is equal to the probability that Dead = 1, i.e.,
  E(Dead) = P = Probability of death
The Logistic Regression Model
Suppose we want to model the association
between vital status and age…
• If we fit the data using standard linear
  regression, our model would be of the form
  P = β0 + β1Age
• That is, we assume the probability of death
  varies in a linear manner with age.
The Logistic Regression Model
[Figure: dead (y-axis, 0 to 1) plotted against age (x-axis, 20 to 100), with the linear regression fit]
When age = 60, estimated value of dead = .6
Is this a sensible result?
What if predicted value is >1 or <0?
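The question on this slide can be made concrete with a small sketch. The intercept and slope below are hypothetical, chosen only so that the line passes through .6 at age 60 as in the figure; they are not estimates from the course data.

```python
# Hypothetical linear probability model: P = B0 + B1 * age.
# B0 and B1 are illustrative values, not fitted estimates.
B0 = 0.0
B1 = 0.01

def linear_prob(age):
    """Predicted 'probability' of death from a straight-line model."""
    return B0 + B1 * age

for age in (20, 60, 100, 110):
    p = linear_prob(age)
    flag = "" if 0.0 <= p <= 1.0 else "  <-- not a valid probability"
    print(f"age={age:3d}  predicted P={p:.2f}{flag}")
```

Nothing in a straight-line model stops the prediction from leaving the 0 to 1 range once age is large (or small) enough.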
The Logistic Regression Model
• Logistic regression analysis is a tool for modeling
  binary data that overcomes some of the
  limitations of linear regression.
• Rather than assuming the data are normally
  distributed, which we know isn’t true, we first
  assume the data follow a binomial distribution,
  which implicitly assumes we have a series of
  0/1 observations, each with probability P of being
  dead, i.e., Dead = 1.
The Logistic Regression Model
Rather than assuming P is a linear combination of
variables of interest, e.g.,
P = β0 + β1Age + β2Male
we instead assume
P = e^(β0 + β1Age + β2Male) / [1 + e^(β0 + β1Age + β2Male)]
or equivalently,
ln[P/(1-P)] = β0 + β1Age + β2Male
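The two forms of the model are inverses of one another, which a short numerical check can illustrate. This is a minimal sketch; the coefficient values are placeholders chosen for illustration, not estimates from any data set.

```python
import math

def logistic(x):
    """Map a linear predictor x (any real number) to a probability in (0, 1)."""
    return math.exp(x) / (1.0 + math.exp(x))

def logit(p):
    """Map a probability p in (0, 1) to ln[p / (1 - p)]."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients (illustration only).
b0, b1, b2 = -5.0, 0.05, 0.4
age, male = 60, 1

x = b0 + b1 * age + b2 * male   # linear predictor, i.e. the logit of P
p = logistic(x)                 # probability of death implied by the model

print(f"linear predictor = {x:.3f}, P = {p:.3f}, logit(P) = {logit(p):.3f}")
```

Applying `logit` to the probability recovers the linear predictor, so either form can be used to state the model.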
The Logistic Regression Model
ln[P/(1-P)] = β0 + β1Age + β2Male
• The function ln[P/(1-P)] is referred to as the “logit”
  of P, hence the term “logistic” regression!
• Unlike the linear regression model, the logistic model
  has the desirable property that the predicted P is always
  between 0 and 1, whatever value the logit takes.
• The logit also turns out to have some statistical properties
  that make it a particularly desirable function of P to
  estimate.
Sample logistic function
[Figure: deadpr (y-axis, 0 to .4) plotted against age (x-axis, 20 to 100)]
The Logistic Regression Model
Interpretation of coefficients
ln[P/(1-P)] = β0 + β1Age + β2Male
• Recall that P/(1-P) is the odds of our outcome of
  interest, in this case death.
• Hence the logit of P is the same as the ln(odds) of
  death, and so the odds of death can be written
  odds = e^(β0 + β1Age + β2Male)
The Logistic Regression Model
Interpretation of coefficients
odds = e^(β0 + β1Age + β2Male)
OR (male vs female) = e^(β0 + β1Age + β2) / e^(β0 + β1Age) = e^(β2)
=> β2 = ln(OR males vs females)
The Logistic Regression Model
Interpretation of coefficients
odds = e^(β0 + β1Age + β2Male)
Similarly we can calculate the OR associated with an increase in
age of 10 years as
OR (10 yr incr in age) = e^(β0 + β1(Age + 10) + β2Male) / e^(β0 + β1Age + β2Male) = e^(10β1)
=> 10β1 = ln(OR 10 year increase in age)
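Both odds ratios can be checked numerically. The sketch below uses hypothetical coefficients (not estimates from any data set) and shows that the OR for male sex does not depend on age, and that the OR for a 10-year age increase does not depend on sex.

```python
import math

# Hypothetical coefficients for ln(odds) = b0 + b1*Age + b2*Male.
b0, b1, b2 = -5.0, 0.05, 0.4

def odds(age, male):
    """Odds of death implied by the model for a given age and sex."""
    return math.exp(b0 + b1 * age + b2 * male)

# OR for male vs female at a fixed age equals exp(b2), whatever the age.
or_male_at_50 = odds(50, 1) / odds(50, 0)
or_male_at_70 = odds(70, 1) / odds(70, 0)

# OR for a 10-year increase in age equals exp(10 * b1), whatever the sex.
or_10yr_men = odds(60, 1) / odds(50, 1)
or_10yr_women = odds(60, 0) / odds(50, 0)

print(f"OR male vs female: {or_male_at_50:.4f} (age 50), {or_male_at_70:.4f} (age 70)")
print(f"OR 10-year increase: {or_10yr_men:.4f} (men), {or_10yr_women:.4f} (women)")
```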
The Logistic Regression Model
Interpretation of coefficients
Vital status     Ever smoker (X=1)    Never smoker (X=0)
Dead (Y=1)       62 (14% = p1)        18 (9% = p0)
Alive (Y=0)      384 (86%)            186 (91%)

Odds ratio for smokers to never-smokers = [p1/(1-p1)] / [p0/(1-p0)] = 1.67

STATA logistic regression output for: logit dead smk
          Coef.       Std. Err.   z       P>z     [95% Conf. Interval]
smk       .5118667    .28225      1.81    0.070   -.0413331   1.065067
_cons     -2.335375   .2468438    -9.46   0.000   -2.81918    -1.85157

OR = e^.512 = 1.67
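The link between the 2x2 table and the regression output can be reproduced by hand. The sketch below uses the four counts from the table; the standard error uses Woolf's large-sample formula for a 2x2 table, which is an addition here and is not stated on the slide.

```python
import math

# Counts from the 2x2 table.
dead_smk, alive_smk = 62, 384        # ever smokers
dead_never, alive_never = 18, 186    # never smokers

odds_smk = dead_smk / alive_smk          # same as p1 / (1 - p1)
odds_never = dead_never / alive_never    # same as p0 / (1 - p0)

odds_ratio = odds_smk / odds_never
ln_or = math.log(odds_ratio)             # compare with Coef. for smk
ln_odds_never = math.log(odds_never)     # compare with Coef. for _cons

# Woolf's large-sample standard error of ln(OR) for a 2x2 table.
se_ln_or = math.sqrt(1 / dead_smk + 1 / alive_smk
                     + 1 / dead_never + 1 / alive_never)

print(round(odds_ratio, 4), round(ln_or, 4), round(se_ln_or, 5))
print(round(ln_odds_never, 4))
```

With a single binary predictor, the intercept is the ln(odds) of death among never smokers and the smk coefficient is the ln(OR) from the table.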
The Logistic Regression Model
Hypothesis testing and confidence intervals
• Testing H0: ln(OR) = β1 = 0 vs. Ha: β1 ≠ 0 is equivalent to
  testing H0: OR = e^β1 = 1 vs. Ha: e^β1 ≠ 1
• Use large sample normality of the estimate of β1 to compute p-values and
  to construct confidence limits
  - β1 / SE(β1) should look like a z-score under H0 … use to compute p-value
  - β1 ± 1.96 * SE(β1) is an approximate 95% confidence interval
STATA logistic regression output for: logit dead smk
          Coef.       Std. Err.   z       P>z     [95% Conf. Interval]
smk       .5118667    .28225      1.81    0.070   -.0413331   1.065067
_cons     -2.335375   .2468438    -9.46   0.000   -2.81918    -1.85157
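The z statistic, p-value, and confidence limits in the smk row can be recomputed from the coefficient and its standard error. This is a minimal sketch using only the Python standard library; the variable names are illustrative.

```python
import math

coef, se = 0.5118667, 0.28225   # smk row of the output

z = coef / se
# Two-sided p-value from the standard normal:
# P(|Z| > |z|) = erfc(|z| / sqrt(2)).
p_value = math.erfc(abs(z) / math.sqrt(2))

# Approximate 95% confidence interval on the ln(OR) scale.
ci_low = coef - 1.96 * se
ci_high = coef + 1.96 * se

print(f"z = {z:.2f}, p = {p_value:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```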
The Logistic Regression Model
Computing CIs for the odds ratio
Because the estimate of β1 is closer to normally distributed than e^β1, we
construct CIs for the ln(OR) and then exponentiate these
to get corresponding CIs for the OR.
95% CI for ln(OR) = (-0.04, 1.07)
=> 95% CI for OR = (e^-0.04, e^1.07) = (0.96, 2.90)
Variations in software output
STATA logistic regression output: logit dead smk
          Coef.       Std. Err.   z       P>z     [95% Conf. Interval]
smk       .5118667    .28225      1.81    0.070   -.0413331   1.065067
_cons     -2.335375   .2468438    -9.46   0.000   -2.81918    -1.85157
Default output in the log scale

STATA logistic regression output: logit dead smk, or
          Odds Ratio  Std. Err.   z       P>|z|   [95% Conf. Interval]
smk       1.668403    .4709067    1.81    0.070   .9595094    2.901032
Output requested in the transformed scale
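The two outputs carry the same information, and one can be converted into the other. The sketch below exponentiates the log-scale estimate and confidence limits; the standard error on the OR scale is reproduced with the delta-method approximation OR × SE(ln OR), which is an assumption about how that column is obtained and is not stated on the slide.

```python
import math

# Log-scale output for smk.
coef, se = 0.5118667, 0.28225
ci_low, ci_high = -0.0413331, 1.065067

odds_ratio = math.exp(coef)
or_ci = (math.exp(ci_low), math.exp(ci_high))

# Delta-method approximation: SE(OR) is roughly OR * SE(ln OR).
se_or = odds_ratio * se

print(f"OR = {odds_ratio:.4f}, SE = {se_or:.4f}, "
      f"95% CI = ({or_ci[0]:.4f}, {or_ci[1]:.4f})")
```

The z statistic and p-value are identical in the two outputs, and the interval for the OR is not symmetric about the point estimate because it is built on the log scale.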
The Logistic Regression Model
Adjusting for potential confounder variables
Suppose we conduct a cross-sectional study to investigate
the association between gender and COPD. If P is the
probability of having COPD and Male is a 0/1 indicator of
male sex, then we might fit the logistic model
ln[P/(1-P)] = β0 + β1Male
to assess the OR for COPD associated with male sex.
Might this association be confounded by smoking status,
and if so how might we adjust for the potentially
confounding effects of smoking?
The Logistic Regression Model
Adjusting for potential confounder variables
If ES is a 0/1 indicator of ever having smoked, we might fit the
model
ln[P/(1-P)] = β0 + β1Male + β2ES
Under this model, we say the effect of male sex is now adjusted for
the potentially confounding effect of having ever smoked. The
resulting odds ratio is analogous to the pooled OR that you would
get from a stratified 2x2 table analysis that crosses Male by COPD
for each level of ES.
We could adjust for additional potential confounders, including
continuous variables, by adding them to the model as main effects.
The Logistic Regression Model
Adjusting for potential effect modification
Now suppose we want to know whether smoking
modifies the effect of male sex on COPD
prevalence. In classical epidemiology this means
we want to know if the OR associated with male
sex varies by smoking status.
How would we test for the presence of effect
modification in our logistic model?
As we learned previously, we use interaction terms!
The Logistic Regression Model
Adjusting for potential effect modification
ln[P/(1-P)] = β0 + β1Male + β2ES + β3Male*ES

EvSmk   Male   model
0       0      β0
0       1      β0 + β1
1       0      β0 + β2
1       1      β0 + β1 + β2 + β3

β1 = ln(OR) for male sex in never smokers
β2 = ln(OR) for ever smoking in women
β3 = difference in ln(OR) for ever smoking between men & women
   = difference in ln(OR) for male sex between ever & never smokers
Testing H0: β3 = 0 is a test of whether there is effect modification.
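The interpretation of β1, β2, and β3 can be verified by evaluating the model in each of the four groups. The coefficients below are hypothetical values chosen for illustration only.

```python
import math

# Hypothetical coefficients for
# ln(odds) = b0 + b1*Male + b2*ES + b3*Male*ES
b0, b1, b2, b3 = -2.0, 0.3, 0.9, 0.5

def ln_odds(male, es):
    """ln(odds) of COPD for a given sex and smoking status."""
    return b0 + b1 * male + b2 * es + b3 * male * es

# OR for male sex within each smoking stratum.
or_male_never = math.exp(ln_odds(1, 0) - ln_odds(0, 0))   # exp(b1)
or_male_ever = math.exp(ln_odds(1, 1) - ln_odds(0, 1))    # exp(b1 + b3)

# The ratio of the two stratum-specific ORs is exp(b3).
ratio = or_male_ever / or_male_never

print(f"OR male (never smokers) = {or_male_never:.3f}")
print(f"OR male (ever smokers)  = {or_male_ever:.3f}")
print(f"ratio of ORs            = {ratio:.3f}")
```

When β3 = 0 the ratio is 1 and the OR for male sex is the same in both smoking strata, which is the null hypothesis of no effect modification.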
Some Examples
STATA output from: logit dead smk, or
        Odds Ratio   Std. Err.   Z      P>z     [95% Conf. Interval]
Smk     1.668403     .4709067    1.81   0.070   .9595094   2.901032

STATA output from: logit dead age smk, or
        Odds Ratio   Std. Err.   Z      P>z     [95% Conf. Interval]
Smk     4.219142     1.397128    4.35   0.000   2.204738   8.074048
Age1    1.108467     .0147424    7.74   0.000   1.079946   1.137742
• The second model gives the OR for death associated with smoking
after adjusting for age
• Note the change in the size of the smoking OR between the two models –
what might explain this change?
Some Examples
In the PAD Trial, (non-medical) lay volunteers were trained to
respond to cardiac arrests in public and to perform CPR.
Volunteers received retraining at various intervals to see how long
it took before their CPR skills degraded to the point that they were
unlikely to perform adequate CPR.
We want to know
1) whether the amount of time between trainings is related to
CPR quality, and
2) whether the relationship of CPR quality and time between
trainings differs across age groups.
Some Examples
We define the following variables:
• Response variable: cprok
  = 0 if CPR performed during testing was inadequate
  = 1 if CPR performed during testing was adequate
• Predictor variable: agegt50
  = 0 if age is < 50
  = 1 if age is > 50
• Predictor variable: late
  = 0 if volunteer was tested/retrained ≤ 7 months after initial training
  = 1 if volunteer was tested/retrained > 7 months after initial training
Some Examples
STATA output from: logit cprok late, or robust
           Odds Ratio   Robust Std. Err.   Z       P>z     [95% Conf. Interval]
late       0.932        0.094              -0.69   0.490   0.7646   1.1372
agegt50    0.4129       0.0486             -7.51   0.000   0.3278   0.5201

STATA output from: logit cprok late if agegt50==1, or robust
           Odds Ratio   Robust Std. Err.   Z       P>z     [95% Conf. Interval]
late       1.3303       0.2252             1.69    0.092   0.9546   1.8538

STATA output from: logit cprok late if agegt50==0, or robust
           Odds Ratio   Robust Std. Err.   Z       P>z     [95% Conf. Interval]
late       0.74         0.0976             -2.25   0.025   0.5764   0.9631
Some Examples
STATA output from: logit cprok late agegt50 agelate, or robust
(agelate = late*agegt50)
           Odds Ratio   Robust Std. Err.   Z       P>z     [95% Conf. Interval]
Late       0.7709       1.090              -1.     0.066   0.5843   1.0170
agegt50    0.4129       0.0486             -7.51   0.000   0.3278   0.5201
agelate    1.6661       0.4134             2.06    0.040   1.0243   2.7099
We reject the null hypothesis of no interaction and conclude that the
impact of time between retraining on CPR performance varies
significantly for those over and under the age of 50.
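Under the interaction model, the OR for late retraining in each age group follows from the reported odds ratios: the late OR applies when agegt50 = 0, and it is multiplied by the agelate OR when agegt50 = 1. A minimal sketch of that arithmetic, using the point estimates from the interaction model output (variable names are illustrative):

```python
# Odds ratios from the interaction model output.
or_late = 0.7709      # late vs. not late when agegt50 == 0
or_agelate = 1.6661   # multiplicative interaction term

# Implied OR for late vs. not late in each age group.
or_late_under50 = or_late
or_late_over50 = or_late * or_agelate

print(f"age < 50: OR for late = {or_late_under50:.2f}")
print(f"age > 50: OR for late = {or_late_over50:.2f}")
```

The implied ORs fall on opposite sides of 1 in the two age groups, which is the pattern the interaction test is detecting.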