Chapter 2-11. Logistic Regression & Dummy Variables

When the outcome is dichotomous (scored as 0 or 1), logistic regression is almost universally
used.
Let’s see what happens if we just use linear regression.
We will use the fev dataset, and attempt to predict being a current smoker, rather than what is
predictive of FEV.
Reading in the data,
File
Open
Find the directory where you copied the course CD:
Change to the subdirectory datasets & do-files
Single click on fev.dta
Open
use fev, clear
Obtaining the means and percents,
bysort male: sum smoker
tab smoker male, col
-> male = 0

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      smoker |       318    .1226415    .3285422          0          1

-> male = 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      smoker |       336     .077381    .2675934          0          1

                   |         male
    Smoking Status |         0          1 |     Total
-------------------+----------------------+----------
not current smoker |       279        310 |       589
                   |     87.74      92.26 |     90.06
-------------------+----------------------+----------
    current smoker |        39         26 |        65
                   |     12.26       7.74 |      9.94
-------------------+----------------------+----------
             Total |       318        336 |       654
                   |    100.00     100.00 |    100.00
We see that the means of the dichotomous variable smoker are identical to the proportions
computed using the crosstabulation approach.
_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript]. University of Utah School of Medicine, 2010.
This is because the mean and proportion are computed identically for a 0-1 scored variable:
        sum of X     1 + 1 + 0 + 1 + 0 + ... + 1     # of 1's
   X̄ = ---------- = ----------------------------- = ---------- = proportion
            n                      n                     n
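As a quick check, using the counts from the crosstabulation above (39 female smokers out of 318 females),

display 39/318
.12264151

which reproduces the mean of smoker for male = 0.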
Since linear regression fits a straight line through the group means, it seems reasonable that it
will fit a straight line through the proportions if we have a dichotomous outcome variable.
Fitting the regression line to predict smoker from male,
Statistics
Linear models and related
Linear regression
Model tab: Dependent variable: smoker
Independent variables: male
OK
regress smoker male
      Source |       SS       df       MS              Number of obs =     654
-------------+------------------------------           F(  1,   652) =    3.75
       Model |  .334678982     1  .334678982           Prob > F      =  0.0533
    Residual |  58.2050764   652   .08927159           R-squared     =  0.0057
-------------+------------------------------           Adj R-squared =  0.0042
       Total |  58.5397554   653  .089647405           Root MSE      =  .29878

------------------------------------------------------------------------------
      smoker |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |  -.0452606   .0233756    -1.94   0.053     -.091161    .0006399
       _cons |   .1226415   .0167549     7.32   0.000     .0897413    .1555417
------------------------------------------------------------------------------
Asking for predicted values,
Statistics
Postestimation
Predictions, residuals, etc.
Main tab: New variable name: pred_smoker
Produce: fitted values
OK
predict pred_smoker
Creating a frequency table of the fitted values,
tab pred_smoker
     Fitted |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
    .077381 |        336       51.38       51.38
   .1226415 |        318       48.62      100.00
------------+-----------------------------------
      Total |        654      100.00
We see that for the n=336 males the predicted value is 0.077, and for the n=318 females the
predicted value is 0.123. These are identical to the means computed using the sum command on
the previous page, and to the proportions from the crosstabulation table computed using the tab
command on the previous page.
-> male = 0

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      smoker |       318    .1226415    .3285422          0          1

-> male = 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      smoker |       336     .077381    .2675934          0          1

                   |         male
    Smoking Status |         0          1 |     Total
-------------------+----------------------+----------
not current smoker |       279        310 |       589
                   |     87.74      92.26 |     90.06
-------------------+----------------------+----------
    current smoker |        39         26 |        65
                   |     12.26       7.74 |      9.94
-------------------+----------------------+----------
             Total |       318        336 |       654
                   |    100.00     100.00 |    100.00
From the above linear regression model output,
      Source |       SS       df       MS              Number of obs =     654
-------------+------------------------------           F(  1,   652) =    3.75
       Model |  .334678982     1  .334678982           Prob > F      =  0.0533
    Residual |  58.2050764   652   .08927159           R-squared     =  0.0057
-------------+------------------------------           Adj R-squared =  0.0042
       Total |  58.5397554   653  .089647405           Root MSE      =  .29878

------------------------------------------------------------------------------
      smoker |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |  -.0452606   .0233756    -1.94   0.053     -.091161    .0006399
       _cons |   .1226415   .0167549     7.32   0.000     .0897413    .1555417
------------------------------------------------------------------------------
The prediction equation is

   smoker = a + bX = 0.1226415 - 0.0452606(male)

which is the straight line fitted to the data. It provides the predicted proportions of smokers:

   smoker = 0.1226415 - 0.0452606(1) = 0.0774 for males
   smoker = 0.1226415 - 0.0452606(0) = 0.1226 for females
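These fitted proportions can also be reproduced from the stored regression coefficients (a quick sketch, assuming regress was the most recent estimation command):

display _b[_cons] + _b[male]*1
display _b[_cons] + _b[male]*0

The first returns 0.0774 (males) and the second 0.1226 (females), matching the tabulated fitted values.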
Finally, notice that the p value from the chi-square test, p=0.053, is identical to the p value from
the regression model, p=0.053. (The two agree this closely only when the sample size is large.)
From this example, we see that using linear regression to compare two groups on a dichotomous
variable is just as good as using the chi-square test from a crosstabulation approach.
Linear regression is not used for modeling a dichotomous outcome, however. The major
criticism is that it sometimes produces predicted values outside of the 0-1 range, which are
impossible values for proportions. Statisticians are driven crazy by such inconsistencies.
An example dataset for which such inconsistencies arise is vaso.dta.
These data, originally published by Finney (1947), were obtained in a carefully controlled study
of the effect of the RATE and VOLume of air inspired by human subjects on the occurrence
(coded 1) or non-occurrence (coded 0) of a transient vasoconstriction RESPonse in the skin of
the fingers.
Opening this data file,
File
Open
Find the directory where you copied the course CD:
Change to the subdirectory datasets & do-files
Single click on vaso.dta
Open
use vaso, clear
Fitting a multivariable linear regression,
Statistics
Linear models and related
Linear regression
Model tab: Dependent variable: resp
Independent variables: vol rate
OK
regress resp vol rate
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  2,    36) =   14.70
       Model |  4.37997786     2  2.18998893           Prob > F      =  0.0000
    Residual |  5.36361188    36  .148989219           R-squared     =  0.4495
-------------+------------------------------           Adj R-squared =  0.4189
       Total |  9.74358974    38  .256410256           Root MSE      =  .38599

------------------------------------------------------------------------------
        resp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         vol |   .4011113    .084634     4.74   0.000     .2294657    .5727569
        rate |    .343427   .0783695     4.38   0.000     .1844862    .5023678
       _cons |   -.612613   .2176896    -2.81   0.008    -1.054108   -.1711181
------------------------------------------------------------------------------
The model output seems reasonable enough. However, when we look at the predicted values, we
discover the problem.
predict pred_resp
tab pred_resp
     Fitted |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
  -.1143759 |          1        2.56        2.56
  -.0970707 |          1        2.56        5.13
  -.0959705 |          1        2.56        7.69
   .0059574 |          1        2.56       10.26
       and so on ...
   .9760718 |          1        2.56       92.31
   1.154826 |          1        2.56       94.87
   1.165612 |          1        2.56       97.44
   1.220426 |          1        2.56      100.00
------------+-----------------------------------
      Total |         39      100.00
We see that the predicted proportion of vasoconstriction was < 0 for three observations and > 1
for three observations. These are undefined values, since a proportion by definition is between 0
and 1. [You can increase or decrease a proportion by more than 1, or >100%, but the
proportion itself must be a number between 0 and 1.]
Statisticians demand that statistical approaches be “consistent” across all datasets. They expect
logical results all the time.
Primarily for this reason, linear regression lost credibility for modeling a dichotomous outcome, even
though it usually predicts between 0 and 1. Logistic regression was developed to fill the need for
a regression model for a dichotomous outcome. Logistic regression is defined in such a way that
it is impossible to predict a proportion outside of the 0-1 interval. (How it does this is explained
in detail in the K30 regression models class.)
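Briefly, looking ahead: logistic regression models the probability through the logistic (inverse-logit) function,

   p = exp(a + bX) / (1 + exp(a + bX))

which lies strictly between 0 and 1 no matter what value a + bX takes. In Stata this function is available as invlogit(); for example,

display invlogit(-10)
display invlogit(10)

return values just above 0 and just below 1 (about .00005 and .99995), never outside the interval.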
Fitting a logistic regression to this same dataset,
Statistics
Binary outcomes
Logistic regression (reporting odds ratios)
Model tab: Dependent variable: resp
Independent variables: vol rate
OK
logistic resp vol rate
Logistic regression                               Number of obs   =         39
                                                  LR chi2(2)      =      24.27
                                                  Prob > chi2     =     0.0000
Log likelihood = -14.886152                       Pseudo R2       =     0.4491

------------------------------------------------------------------------------
        resp | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         vol |   48.52846   69.32595     2.72   0.007     2.951221    797.9788
        rate |   14.14156   12.92809     2.90   0.004     2.356874     84.8513
------------------------------------------------------------------------------
and requesting the predicted values of resp
predict pred_resp_logistic
tab pred_resp_logistic
   Pr(resp) |      Freq.     Percent        Cum.
------------+-----------------------------------
   .0054134 |          1        2.56        2.56
   .0072905 |          1        2.56        5.13
   .0078175 |          1        2.56        7.69
       and so on ...
   .9990379 |          1        2.56       94.87
   .9991069 |          1        2.56       97.44
   .9992014 |          1        2.56      100.00
------------+-----------------------------------
      Total |         39      100.00
We see that logistic regression predicted proportions consistently inside the 0-1 range.
Returning to the fev dataset,
use fev, clear
tab smoker male, col chi2
                   |         male
    Smoking Status |         0          1 |     Total
-------------------+----------------------+----------
not current smoker |       279        310 |       589
                   |     87.74      92.26 |     90.06
-------------------+----------------------+----------
    current smoker |        39         26 |        65
                   |     12.26       7.74 |      9.94
-------------------+----------------------+----------
             Total |       318        336 |       654
                   |    100.00     100.00 |    100.00

          Pearson chi2(1) =   3.7390   Pr = 0.053
and computing the odds ratio,
Statistics
Observational/Epi. analysis
Tables for epidemiologists
Case-control odds ratio
Main tab: Case variable: smoker
Exposed variables: male
OK
cc smoker male
                                                         Proportion
                 |   Exposed   Unexposed  |      Total      Exposed
-----------------+------------------------+------------------------
           Cases |        26          39  |         65       0.4000
        Controls |       310         279  |        589       0.5263
-----------------+------------------------+------------------------
           Total |       336         318  |        654       0.5138
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         .6             |    .3414402    1.041289 (exact)
 Prev. frac. ex. |         .4             |   -.0412895    .6585598 (exact)
 Prev. frac. pop |   .2105263             |
                 +-------------------------------------------------
                               chi2(1) =     3.74  Pr>chi2 = 0.0532
We’ll define the odds ratio in a moment.
Now, analyzing the data using logistic regression,
logistic smoker male
Logistic regression                               Number of obs   =        654
                                                  LR chi2(1)      =       3.75
                                                  Prob > chi2     =     0.0527
Log likelihood = -209.84678                       Pseudo R2       =     0.0089

------------------------------------------------------------------------------
      smoker | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |         .6   .1597765    -1.92   0.055     .3560256    1.011163
------------------------------------------------------------------------------
We see that logistic regression models the odds ratio, which agrees exactly with the “cc”
command, where the odds ratio is computed directly from the 2 × 2 table.
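We can verify this directly from the crosstabulation counts,

display (26/310)/(39/279)
.6

that is, the odds of smoking for males divided by the odds of smoking for females.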
The predicted values from logistic regression are proportions,
predict pred_smoker
tab pred_smoker
 Pr(smoker) |      Freq.     Percent        Cum.
------------+-----------------------------------
    .077381 |        336       51.38       51.38
   .1226415 |        318       48.62      100.00
------------+-----------------------------------
      Total |        654      100.00
These predicted proportions agree with the proportions in the crosstabulation table computed on
the previous page.
                   |         male
    Smoking Status |         0          1 |     Total
-------------------+----------------------+----------
not current smoker |       279        310 |       589
                   |     87.74      92.26 |     90.06
-------------------+----------------------+----------
    current smoker |        39         26 |        65
                   |     12.26       7.74 |      9.94
-------------------+----------------------+----------
             Total |       318        336 |       654
                   |    100.00     100.00 |    100.00

          Pearson chi2(1) =   3.7390   Pr = 0.053
Notice that the chi-square p value from the crosstabulation table (p = 0.053) is nearly identical to the
p value from the logistic regression (p = 0.055). Logistic regression can be thought of as
extending the chi-square analysis of the 2 × 2 table to allow for covariates.
Definition of Odds Ratio
We define the “odds” of an event as the ratio of the probability of the event occurring to the
probability of it not occurring (such as the odds of heads vs tails on a coin flip), odds = p/(1 - p).
We then define the odds ratio (also called the exposure odds ratio) as:
        P(E=1|D=1) / P(E=0|D=1)     (a/N1) / (b/N1)     a/b     ad
   OR = ------------------------- = ----------------- = ----- = ----
        P(E=1|D=0) / P(E=0|D=0)     (c/N0) / (d/N0)     c/d     bc
where D = disease, and E = exposure. In our example, D = smoking status and E = male. It’s
hard to apply this definition to the 2 × 2 table from the crosstabulation, because that table is in
ascending sort order.
                   |         male
    Smoking Status |         0          1 |     Total
-------------------+----------------------+----------
not current smoker |       279        310 |       589
                   |     87.74      92.26 |     90.06
-------------------+----------------------+----------
    current smoker |        39         26 |        65
                   |     12.26       7.74 |      9.94
-------------------+----------------------+----------
             Total |       318        336 |       654
                   |    100.00     100.00 |    100.00
The 2  2 table from the cc command, however, is in the correct format for applying the odds
ratio formula.
                 |   Exposed   Unexposed |
                 |     (E=1)       (E=0) |    Total
-----------------+-----------------------+---------
     Cases (D=1) |         a           b |      N1
  Controls (D=0) |         c           d |      N0
-----------------+-----------------------+---------

   odds ratio (OR) = ad/bc
                   |         male
    Smoking Status |         0          1 |     Total
-------------------+----------------------+----------
not current smoker |       279        310 |       589
                   |     87.74      92.26 |     90.06
-------------------+----------------------+----------
    current smoker |        39         26 |        65
                   |     12.26       7.74 |      9.94
-------------------+----------------------+----------
             Total |       318        336 |       654
                   |    100.00     100.00 |    100.00

display (279*26)/(310*39)
.6
Interpreting this, the odds of being male among smokers are 0.6 times the odds of being male
among nonsmokers. Because the odds ratio is symmetric, this can also be flipped around: the
odds of smoking for males are 0.6 times the odds of smoking for females, that is,
(1 - 0.6) = 40% lower.
This 40% is quite close to a calculation done directly on the percentages:
(12.26 - 7.74)/12.26 = 0.37 or 37%
Now let’s add a covariate.
logistic smoker male age
Logistic regression                               Number of obs   =        654
                                                  LR chi2(2)      =     111.77
                                                  Prob > chi2     =     0.0000
Log likelihood = -155.83995                       Pseudo R2       =     0.2639

------------------------------------------------------------------------------
      smoker | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .4525142   .1394614    -2.57   0.010     .2473423    .8278772
         age |   1.650156   .0941799     8.78   0.000     1.475516    1.845465
------------------------------------------------------------------------------
To interpret a logistic regression, read each line of the model output as:
...the fold-increase (or multiplicative increase) in the odds of the outcome variable for each one
unit increase in the predictor variable, after controlling for all other predictor variables in
the model.
If the OR = 1, there is no effect.
If the OR < 1, there is a protective effect (odds decrease).
If the OR > 1, there is a deleterious effect (odds increase).
We might report the male effect as:
The odds of smoking for males is approximately one-half the odds of smoking for
females, after controlling for age [adjusted OR, 0.45, 95% CI(0.25 – 0.83), p=0.010].
We might report the AGE effect as:
Age was associated with increased odds for smoking, controlling for gender [adjusted
OR, per 1 year increase in age, 1.65, 95% CI (1.48 – 1.85), p < 0.001].
or,
The odds of smoking increased 1.65-fold for each one year increase in age, controlling for
gender [adjusted OR, per 1 year increase in age, 1.65, 95% CI (1.48 – 1.85), p < 0.001].
Exercise
Look at the logistic regression models in Table 3 of the Bergstrom et al. (2004) article. Notice how
it follows the linear regression exercise we did in Chapter 6, looking at increasingly complete
models.
In their Table 3, they report
   odds ratio = OR = (7 × 734)/(240 × 4) = 5.35
In their Table 2, they report
   relative risk = RR = (7/247)/(4/738) = 5.23
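Both figures can be checked with display, using the cell counts reported in the article:

display (7*734)/(240*4)
5.3520833

display (7/247)/(4/738)
5.2287449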
When the disease outcome is rare, < 10%, the odds ratio is a good estimate of the risk ratio.
Assessing Linearity of Effect
Linear regression assumes that as the predictor variable increases, such as increasing age, the
effect on the outcome increases by a constant amount for each one unit increase in the
predictor variable.
Logistic regression assumes something similar. It assumes that the log odds increases linearly, by
a constant amount, for each one unit increase in the predictor variable. This is the same as
saying it assumes the odds increase exponentially (multiplicatively) for each one unit increase in
the predictor variable.
Letting y = odds = p / (1 - p),

   log(y) = a + bX

i.e.,

   exp(log(y)) = exp(a + bX)
             y = exp(a) exp(bX)
               = exp(a) [exp(b)]^X
               = exp(a) OR^X      since exp(b) = OR in logistic regression

so each one unit increase in X multiplies the odds by exp(b) = OR.
We can easily verify that exp(b) = OR in logistic regression by requesting the regression
coefficient, b, be displayed instead of the OR.
logistic smoker male age, coef
Logistic regression                               Number of obs   =        654
                                                  LR chi2(2)      =     111.77
                                                  Prob > chi2     =     0.0000
Log likelihood = -155.83995                       Pseudo R2       =     0.2639

------------------------------------------------------------------------------
      smoker |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |  -.7929362   .3081923    -2.57   0.010    -1.396982   -.1888904
         age |   .5008697   .0570733     8.78   0.000      .389008    .6127314
       _cons |  -7.586072   .7205451   -10.53   0.000    -8.998315    -6.17383
------------------------------------------------------------------------------
and then,
display exp(-.7929362)
.45251417
which matches the OR from the
logistic smoker male age
Logistic regression                               Number of obs   =        654
                                                  LR chi2(2)      =     111.77
                                                  Prob > chi2     =     0.0000
Log likelihood = -155.83995                       Pseudo R2       =     0.2639

------------------------------------------------------------------------------
      smoker | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .4525142   .1394614    -2.57   0.010     .2473423    .8278772
         age |   1.650156   .0941799     8.78   0.000     1.475516    1.845465
------------------------------------------------------------------------------
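Similarly for age,

display exp(.5008697)

returns approximately 1.650156, matching the odds ratio for age.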
To assess the linearity assumption in regression models, categorizing the predictor into quartiles,
quintiles, or some other set of percentile groups is useful.
It is always a good idea to examine the linearity assumption, no matter what type of model you
are fitting.
This is easy to do in Stata,
xtile age5 = age, nq(5)
which creates a variable of age categorized into 5 quantiles, or quintiles. The nq() option is
used to specify the number of quantiles (nq).
tab age5
5 quantiles |
     of age |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        215       32.87       32.87
          2 |         94       14.37       47.25
          3 |        171       26.15       73.39
          4 |         57        8.72       82.11
          5 |        117       17.89      100.00
------------+-----------------------------------
      Total |        654      100.00
We see the categories represent roughly 20% each; Stata could not get any closer because of the
many tied ages.
To discover what the categories include, we use
bysort age5: sum age
-> age5 = 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |       215         6.8    1.257501          3          8

-> age5 = 2

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |        94           9           0          9          9

-> age5 = 3

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |       171    10.52632    .5007734         10         11

-> age5 = 4

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |        57          12           0         12         12

-> age5 = 5

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |       117    14.55556    1.663215         13         19
If we now use
logistic smoker male age5
the age5 variable would be treated as if it were a continuous variable, which is no better than
using age.
Categorical variables (nominal or ordinal scales) have to be modeled using indicator, or dummy,
variables, which are a series of 0-1 coded variables.
This can be done very quickly in Stata by the xi (generate indicator variables) command, where
you precede the regression command name by “xi:” and then precede each categorical variable by
“i.”.
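(Aside: in Stata version 11 and later, factor-variable notation accomplishes this without the xi prefix, for example logistic smoker male i.age5, with ib2.age5 to make category 2 the referent. The xi: approach shown below works in all versions.)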
xi: logistic smoker male i.age5
i.age5            _Iage5_1-5          (naturally coded; _Iage5_1 omitted)

Logistic regression                               Number of obs   =        654
                                                  LR chi2(5)      =     125.00
                                                  Prob > chi2     =     0.0000
Log likelihood = -149.22261                       Pseudo R2       =     0.2952

------------------------------------------------------------------------------
      smoker | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .4899082   .1460738    -2.39   0.017     .2730962     .878848
    _Iage5_2 |    1022168          .        .       .            .           .
    _Iage5_3 |    8588918    8969815    15.29   0.000      1109145    6.65e+07
    _Iage5_4 |   1.31e+07   1.42e+07    15.11   0.000      1563884    1.10e+08
    _Iage5_5 |   5.77e+07   5.92e+07    17.43   0.000      7735301    4.30e+08
------------------------------------------------------------------------------
note: 215 failures and 0 successes completely determined.
First note that the first category, _Iage5_1 (where age5 = 1), is omitted from the model. One
category must be left out, becoming part of the intercept, to act as the referent group.
All included categories are interpreted relative to the referent group.
Then, notice the rest of the model looks like some kind of disaster. The reason becomes clear by
looking at
tab age5 smoker
         5 |
 quantiles |    Smoking Status
    of age | not curre  current s |     Total
-----------+----------------------+----------
         1 |       215          0 |       215
         2 |        93          1 |        94
         3 |       157         14 |       171
         4 |        50          7 |        57
         5 |        74         43 |       117
-----------+----------------------+----------
     Total |       589         65 |       654
We discover there are no smokers in the lowest category, which represents ages 3 to 8. As a
result, the estimated odds ratios comparing the other age categories to this referent are
essentially infinite. That is surely not the case in the population.
We need to drop the first category from the analysis, and let the second category be the referent
group.
The “xi” facility always uses the first category as the referent. If we create the indicator variables
ourselves, we can choose whichever category we want as the referent, simply by leaving out that
indicator variable.
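(With xi, an alternative sketch that avoids building the indicators by hand is to declare the omitted category with a characteristic before fitting the model,

char age5[omit] 2
xi: logistic smoker male i.age5

which makes age5 = 2 the referent.)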
Using the tabulate command, with the generate option, we specify the stub name of the indicator
variables we want to create.
tabulate age5 , gen(agecat)
describe
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              long   %12.0g                 ID
age             byte   %8.0g                  Age (years)
fev             float  %9.0g                  Forced Expiratory Volume (liters)
height          float  %9.0g                  Height (inches)
male            byte   %8.0g
smoker          byte   %18.0g      smokerlab  Smoking Status
age5            byte   %8.0g                  5 quantiles of age
agecat1         byte   %8.0g                  age5== 1.0000
agecat2         byte   %8.0g                  age5== 2.0000
agecat3         byte   %8.0g                  age5== 3.0000
agecat4         byte   %8.0g                  age5== 4.0000
agecat5         byte   %8.0g                  age5== 5.0000
-------------------------------------------------------------------------------
Notice it created agecat1 through agecat5, where the suffix denotes what age category the
variable is an indicator for.
Now, leaving out the second category indicator, making it the referent
logistic smoker male agecat1 agecat3 agecat4 agecat5
note: agecat1 != 0 predicts failure perfectly
      agecat1 dropped and 215 obs not used

Logistic regression                               Number of obs   =        439
                                                  LR chi2(4)      =      69.73
                                                  Prob > chi2     =     0.0000
Log likelihood = -149.22261                       Pseudo R2       =     0.1894

------------------------------------------------------------------------------
      smoker | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .4899082   .1460738    -2.39   0.017     .2730962     .878848
     agecat3 |   8.402648   8.775284     2.04   0.042     1.085091    65.06784
     agecat4 |     12.828   13.91722     2.35   0.019     1.529968    107.5563
     agecat5 |   56.45461   57.88341     3.93   0.000     7.567545    421.1569
------------------------------------------------------------------------------
This model looks a lot better.
Notice Stata warns that it dropped agecat1 subjects from the analysis. When there is no
variability for the variable to explain (since all category 1’s were nonsmokers), the variable has
to be dropped or Stata cannot converge on a solution to the regression problem.
We might try some more meaningful age groups, such as school attended. The social pressure to
smoke is probably different depending on where you are in the school system. The following age
categories roughly approximate the school system.
   3-8     before grade 4
   9-11    elementary school (grades 4 through 6)
   12-14   junior high (grades 7 through 9)
   15-19   high school (grades 10 through 12)
gen age3to8 = cond(age>=3 & age<=8,1,0)
gen age9to11 = cond(age>=9 & age<=11,1,0)
gen age12to14 = cond(age>=12 & age<=14,1,0)
gen age15to19 = cond(age>=15 & age<=19,1,0)
replace age3to8=. if age==.
replace age9to11=. if age==.
replace age12to14=. if age==.
replace age15to19=. if age==.
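An equivalent, more compact form, shown here as an alternative sketch, uses Stata's inrange() function:

gen age3to8 = inrange(age,3,8) if age < .

This is 1 when age is between 3 and 8 inclusive, 0 otherwise, and missing when age is missing, so the separate replace step is unnecessary.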
tab age if age3to8==1 , missing
tab age if age9to11==1 , missing
tab age if age12to14==1 , missing
tab age if age15to19==1 , missing
. tab age if age3to8==1 , missing

Age (years) |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |          2        0.93        0.93
          4 |          9        4.19        5.12
          5 |         28       13.02       18.14
          6 |         37       17.21       35.35
          7 |         54       25.12       60.47
          8 |         85       39.53      100.00
------------+-----------------------------------
      Total |        215      100.00

. tab age if age9to11==1 , missing

Age (years) |      Freq.     Percent        Cum.
------------+-----------------------------------
          9 |         94       35.47       35.47
         10 |         81       30.57       66.04
         11 |         90       33.96      100.00
------------+-----------------------------------
      Total |        265      100.00

. tab age if age12to14==1 , missing

Age (years) |      Freq.     Percent        Cum.
------------+-----------------------------------
         12 |         57       45.60       45.60
         13 |         43       34.40       80.00
         14 |         25       20.00      100.00
------------+-----------------------------------
      Total |        125      100.00

. tab age if age15to19==1 , missing

Age (years) |      Freq.     Percent        Cum.
------------+-----------------------------------
         15 |         19       38.78       38.78
         16 |         13       26.53       65.31
         17 |          8       16.33       81.63
         18 |          6       12.24       93.88
         19 |          3        6.12      100.00
------------+-----------------------------------
      Total |         49      100.00
From the frequency tables, we see we created the new variables correctly.
For illustration of what will happen, let’s put all of the indicator variables in the model,
logistic smoker male age3to8 age9to11 age12to14 age15to19
note: age3to8 != 0 predicts failure perfectly
      age3to8 dropped and 215 obs not used
note: age15to19 dropped due to collinearity

Logistic regression                               Number of obs   =        439
                                                  LR chi2(3)      =      60.58
                                                  Prob > chi2     =     0.0000
Log likelihood =  -153.7973                       Pseudo R2       =     0.1645

------------------------------------------------------------------------------
      smoker | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .4899327   .1461839    -2.39   0.017     .2729975    .8792535
    age9to11 |   .0636365   .0252995    -6.93   0.000     .0291945    .1387114
   age12to14 |   .2913906   .1068836    -3.36   0.001     .1419875    .5979993
------------------------------------------------------------------------------
Notice the statement “note: age15to19 dropped due to collinearity”.
This occurred because we did not leave a category out. All forms of regression models have
this requirement.
Collinearity is the term denoting that a predictor variable is highly correlated with some linear
combination of the other predictor variables.
Regression models use a variable of all 1's for the constant, or intercept, term. After the age3to8
subjects were excluded, we had a dataset that looked like
   Constant   age9to11   age12to14   age15to19
      1          1           0           0
      1          1           0           0
      1          0           1           0
      1          0           1           0
      1          0           0           1
      1          0           0           1
Then we had
Constant = age9to11 + age12to14 + age15to19
so the linear combination of the dummy variables predicts the constant perfectly. This makes the
model fitting routines go nuts (the “least squares” algorithm cannot compute the inverse of the
data matrix in linear regression, and the “maximum likelihood” algorithm used in logistic
regression cannot converge on a solution) so one dummy variable must be kicked out of the
regression equation.
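A quick check of this identity in the data (not part of the original output) is to count the observations used in the model for which the three dummies do not sum to 1:

count if age9to11 + age12to14 + age15to19 != 1 & age3to8==0

This should report 0, confirming that the dummies sum to the constant for every observation in the estimation sample.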
Let’s throw out, or use as the referent, the age9to11 category instead, by simply leaving it out of
the list of predictor variables.
logistic smoker male age3to8 age12to14 age15to19
note: age3to8 != 0 predicts failure perfectly
      age3to8 dropped and 215 obs not used

Logistic regression                               Number of obs   =        439
                                                  LR chi2(3)      =      60.58
                                                  Prob > chi2     =     0.0000
Log likelihood =  -153.7973                       Pseudo R2       =     0.1645

------------------------------------------------------------------------------
      smoker | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .4899327   .1461839    -2.39   0.017     .2729975    .8792535
   age12to14 |   4.578983   1.582391     4.40   0.000      2.32602    9.014148
   age15to19 |   15.71425   6.247394     6.93   0.000     7.209212    34.25306
------------------------------------------------------------------------------
This looks like a very believable model.
References

Bergstrom L, Yocum DE, Ampel NM, et al. (2004). Increased risk of coccidioidomycosis in
patients treated with tumor necrosis factor α antagonists. Arthritis & Rheumatism
50(6):1959-1966.

Finney DJ. (1947). The estimation from original records of the relationship between dose
and quantal response. Biometrika 34:320-334.