Chapter 23 Logistic Regression Analysis
In multiple regression, the dependent variable has been continuous, such as weight, systolic blood pressure (sbp), or price. If the dependent variable Y is a binary (dichotomous) response, such as Male/Female, Yes/No, Success/Fail, Present/Absent, or Smoking/Nonsmoking, logistic regression can be used to describe its relationship with several predictor variables $X_1, X_2, \ldots, X_k$, and an (adjusted) odds ratio can be estimated.
Logistic function
$$f(z) = \frac{1}{1 + e^{-z}}$$
and its graph has a sigmoid shape.
[Figure: graph of f(z) against z; the curve rises from 0 toward 1 and equals 0.5 at z = 0.]
This function is well suited for modeling a probability because the values of f(z) range from 0 to 1 as z varies from $-\infty$ to $+\infty$.
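The shape described above can be checked numerically. The following is a minimal Python sketch; the function name `logistic` and the grid of z values are illustrative choices, not part of the original notes.

```python
import math

def logistic(z: float) -> float:
    """Logistic function f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Evaluate f(z) on a few illustrative points: values stay strictly
# between 0 and 1 and pass through 0.5 at z = 0.
for z in (-10, -5, 0, 5, 10):
    print(f"z = {z:>3}: f(z) = {logistic(z):.5f}")
```

The symmetry f(z) + f(-z) = 1 is what makes the curve S-shaped around z = 0.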
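The logistic model
Let Y be a dichotomous variable defined as
$$Y = \begin{cases} 1 & \text{for those who have lung cancer} \\ 0 & \text{for those who do not have lung cancer} \end{cases}$$
and $p = \Pr(Y = 1 \mid X_1, \ldots, X_k)$. Then
$$p = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k)]} \qquad (1)$$
and
$$\hat{p} = \frac{1}{1 + \exp[-(\hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \cdots + \hat\beta_k X_k)]}$$
Note: With no predictors, $\hat{p} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \bar{Y}$.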
The logit form of the logistic model
The relationship of a dichotomous variable with its predictors is quantified with the odds ratio. Since
$$\text{odds}(D) = \frac{\Pr(D)}{1 - \Pr(D)}, \qquad \text{odds}(Y = 1) = \frac{p}{1 - p}.$$
The "logit" is the natural log odds of the event Y = 1, that is,
$$\text{logit}[p] = \ln[\text{odds}(Y = 1)] = \ln\left(\frac{p}{1 - p}\right)$$
$$\text{logit}[p] = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k \qquad (2)$$
Note: The logit can take any value between $-\infty$ and $+\infty$, while Pr(Y=1) can only take values between 0 and 1.
$$\text{Odds}(Y = 1) = e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k} \qquad (3)$$
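The three scales in equations (1) to (3), probability, odds, and logit, are interchangeable. The sketch below round-trips a single value through them; the probability 0.2 is an arbitrary illustrative number, and the helper names are assumptions made for this example.

```python
import math

def logit(p: float) -> float:
    """Natural log odds of an event with probability p."""
    return math.log(p / (1.0 - p))

def odds_from_logit(eta: float) -> float:
    """Equation (3): odds = exp(linear predictor)."""
    return math.exp(eta)

def prob_from_logit(eta: float) -> float:
    """Equation (1): p = 1 / (1 + exp(-linear predictor))."""
    return 1.0 / (1.0 + math.exp(-eta))

p = 0.2                      # illustrative probability, not from the notes
eta = logit(p)               # ln(0.2 / 0.8) = ln(0.25)
odds = odds_from_logit(eta)  # back to odds: 0.25
p_back = prob_from_logit(eta)
print(f"logit = {eta:.4f}, odds = {odds:.4f}, p = {p_back:.4f}")
```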
This formulation clarifies the meaning of the maximum likelihood coefficients: $e^{\beta_i}$ is the factor by which the odds of Y = 1 are multiplied when the predictor $X_i$ increases by one unit and the other predictors are held fixed, i = 1, ..., k.
An adjusted odds ratio is an odds ratio comparing two categories of a variable after controlling for the other variables in the model. For example, an adjusted odds ratio comparing the two categories of smoking status ($X_1$) is
$$\widehat{OR}_{X_1=1 \text{ vs } X_1=0} = \frac{\widehat{\text{Odds}}(Y = 1 \mid X_1 = 1, X_2, \ldots, X_k)}{\widehat{\text{Odds}}(Y = 1 \mid X_1 = 0, X_2, \ldots, X_k)} = \frac{e^{\hat\beta_0 + \hat\beta_1 + \hat\beta_2 X_2 + \cdots + \hat\beta_k X_k}}{e^{\hat\beta_0 + \hat\beta_2 X_2 + \cdots + \hat\beta_k X_k}} = e^{\hat\beta_1}$$
and its $(1 - \alpha) \times 100\%$ confidence interval is
$$e^{\hat\beta_1 \pm Z_{1-\alpha/2}\, S_{\hat\beta_1}}$$
More specifically, its 95% confidence interval is
$$e^{\hat\beta_1 \pm 1.96\, S_{\hat\beta_1}}.$$
Suppose we want to find an adjusted odds ratio for a continuous variable such as age ($X_2$). Most often an increase of 1 will not be interesting. For example, an increase of 1 year in age may be too small to be considered important. A change of 10 years might be more useful. Then
$$\widehat{OR}_{X_2=20 \text{ vs } X_2=10} = \frac{\widehat{\text{Odds}}(Y = 1 \mid X_1, X_2 = 20, \ldots, X_k)}{\widehat{\text{Odds}}(Y = 1 \mid X_1, X_2 = 10, \ldots, X_k)} = \frac{e^{\hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 \cdot 20 + \cdots + \hat\beta_k X_k}}{e^{\hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 \cdot 10 + \cdots + \hat\beta_k X_k}} = e^{(20 - 10)\hat\beta_2} = e^{10\hat\beta_2}$$
and its 95% confidence interval is
$$e^{10\hat\beta_2 \pm 1.96 \cdot 10\, S_{\hat\beta_2}}$$
Inference for logistic regression
Criteria for Assessing Model Fit
The logistic procedure fits linear logistic regression models for dichotomous
variables by the method of maximum likelihood estimation.
Let $-2\log L_A$ be the $-2$ log-likelihood statistic of model A with "p" predictors and $-2\log L_B$ be the $-2$ log-likelihood statistic of model B with "k" predictors, where k > p and model A is nested in model B.
Then the likelihood ratio chi-square, $G^2$, is
$$G^2 = (-2\ln L_A) - (-2\ln L_B) \sim \chi^2_{k-p}$$
If model A is the model with intercept only, then $G^2$ plays the role of the overall F test, testing $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ with "k" degrees of freedom.
If model A has "p" predictors and model B has "k" predictors, then $G^2$ plays the role of the (multiple) partial F test with "k - p" degrees of freedom.
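As a quick numerical illustration, the sketch below computes $G^2$ and its p-value from the two -2 Log L values reported in the Example 1 output later in this chapter (256.226 for the intercept-only model, 248.358 with the covariate). The helper `chi2_sf` is an assumption of this sketch: it uses closed-form upper-tail probabilities that hold only for 1 or 2 degrees of freedom, and is not a general chi-square routine.

```python
import math

def chi2_sf(x: float, df: int) -> float:
    """Upper-tail chi-square probability; closed forms for df = 1 or 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only handles df = 1 or 2")

neg2logL_A = 256.226   # intercept only (from the Example 1 output)
neg2logL_B = 248.358   # intercept and covariate (from the Example 1 output)

G2 = neg2logL_A - neg2logL_B      # likelihood ratio chi-square
p_value = chi2_sf(G2, df=1)       # k - p = 1 - 0 = 1 degree of freedom
print(f"G2 = {G2:.3f}, p-value = {p_value:.4f}")
```

The result should agree with the "Likelihood Ratio" row of the SAS output up to rounding.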
Analysis of Maximum Likelihood Estimates (MLEs)
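The Wald chi-square test statistic is
$$W = \frac{\hat\beta_i^2}{[\widehat{SE}(\hat\beta_i)]^2}, \qquad i = 1, \ldots, k$$
It tests $H_0: \beta_i = 0$ against $H_A: \beta_i \neq 0$, separately for each i = 1, ..., k. Under $H_0$, W is approximately chi-square distributed with 1 degree of freedom.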
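For a concrete case, the sketch below applies the Wald formula to the estimate and standard error of X reported in the Example 1 output later in this chapter (0.9366 and 0.3302). Because those inputs are rounded to four decimals, the result matches the printed chi-square only approximately.

```python
import math

beta_hat = 0.9366   # estimate for X in the Example 1 output
se_beta = 0.3302    # its standard error in the same output

W = (beta_hat / se_beta) ** 2                 # Wald chi-square, 1 df
p_value = math.erfc(math.sqrt(W / 2.0))       # upper tail of chi-square(1)
print(f"W = {W:.4f}, p-value = {p_value:.4f}")
```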
Example 1 (When the independent variable is nominal)
Let Y be a dichotomous variable defined as
$$Y = \begin{cases} 1 & \text{for those who have lung cancer} \\ 0 & \text{for those who do not have lung cancer} \end{cases}$$
and X be a dichotomous variable such as smoking status with
$$X = \begin{cases} 1 & \text{for smokers} \\ 0 & \text{for nonsmokers} \end{cases}$$
The logit form of the logistic model is
$$\text{logit}[p] = \beta_0 + \beta_1 X \qquad (4)$$
logit(lung cancer | smokers (X=1)) = $\beta_0 + \beta_1 \cdot 1 = \beta_0 + \beta_1$
logit(lung cancer | nonsmokers (X=0)) = $\beta_0 + \beta_1 \cdot 0 = \beta_0$
Thus
Odds(lung cancer | smokers) = $e^{\beta_0 + \beta_1}$
Odds(lung cancer | nonsmokers) = $e^{\beta_0}$
and the odds ratio comparing the odds of smokers getting lung cancer to the odds of nonsmokers getting lung cancer is
$$OR_{S \text{ vs } NS} = \frac{\text{odds(lung cancer} \mid \text{smokers)}}{\text{odds(lung cancer} \mid \text{nonsmokers)}} = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1}$$
In other words, the estimate of OR is
$$\widehat{OR} = e^{\hat\beta_1}$$
where $\hat\beta_1$ is the maximum likelihood estimate (MLE) of $\beta_1$ in equation (4).
Testing $H_0: OR_{S \text{ vs } NS} = 1$ is the same as testing $H_0: \beta_1 = 0$, since $OR = e^{\beta_1}$ and $e^0 = 1$.
Suppose we want to analyze the following data using logistic regression.

                              Factor B: Heart Attacks
Factor A: Oral Contraceptive      Yes       No
Used                               23       34
Never Used                         35      132

Model
$$\text{logit}[p] = \beta_0 + \beta_1 X$$
SAS program
/* The single-trial syntax is used exclusively
when independent variables are continuous or
mixed */
Data heart;
input contra $ attack $;
if contra = 'used' then X = 1; else X = 0;
if attack = 'yes' then Y = 1; else Y = 0;
lines;
used yes
:            ("used yes" appears 23 times)
used yes
used no
:            ("used no" appears 34 times)
used no
never yes
:            ("never yes" appears 35 times)
never yes
never no
:            ("never no" appears 132 times)
never no
;
run;
/* DESCENDING models Pr(Y = 1) rather than Pr(Y = 0) */
proc logistic descending;
model Y = X / link = logit;
run;
/* The single-trial syntax with weight */
Data heart;
input contra $ attack $ wt;
if contra = 'used' then X = 1; else X = 0;
if attack = 'yes' then Y = 1; else Y = 0;
lines;
used yes 23
used no 34
never yes 35
never no 132
;
run;
proc logistic descending;
weight wt;
model Y = X / link = logit;
run;
/*The events/trials syntax */
Data heart;
input contra $ yes no;
n = yes+no;
if contra = 'used' then X=1; else X=0;
lines;
used 23 34
never 35 132
run;
proc logistic;
model yes/n = X /link=logit;
run;
/* doing logistic regression using PROC
GENMOD –GENMOD is like GLM in
categorical data analysis*/
Data heart;
input contra $ yes no;
n = yes+no;
/*added 1 before ‘used’ and 2 before ‘never’ in
order to make ‘used’ as an event of interest */
lines;
1used 23 34
2never 35 132
run;
proc genmod;
class contra;
model yes/n = contra
/dist=bin link=logit;
run;
Output
The LOGISTIC Procedure

Model Information
Data Set                      WORK.HEART
Response Variable             Y
Number of Response Levels     2
Number of Observations        4
Weight Variable               wt
Sum of Weights                224
Link Function                 Logit
Optimization Technique        Fisher's scoring

Response Profile
Ordered               Total          Total
  Value     Y     Frequency         Weight
      1     1             2       58.00000
      2     0             2      166.00000

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
                              Intercept
              Intercept          and
Criterion       Only          Covariates
AIC            258.226         252.358
SC             257.612         251.131
-2 Log L       256.226         248.358

Testing Global Null Hypothesis: BETA=0
Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio         7.8676     1        0.0050
Score                    8.3288     1        0.0039
Wald                     8.0449     1        0.0046

(Likelihood ratio chi-square: 256.226 - 248.358 = 7.868)

Analysis of Maximum Likelihood Estimates
                              Standard
Parameter    DF    Estimate      Error    Chi-Square    Pr > ChiSq
Intercept     1     -1.3275     0.1901       48.7488        <.0001
X             1      0.9366     0.3302        8.0449        0.0046

Odds Ratio Estimates
             Point          95% Wald
Effect    Estimate     Confidence Limits
X            2.551       1.336      4.873
Interpretation:
1. The logistic regression equation is $\text{logit}[\hat{p}] = \ln\left(\frac{\hat{p}}{1 - \hat{p}}\right) = -1.3275 + 0.9366\,X$
2. Equivalently, $\hat{p} = \frac{1}{1 + e^{-(-1.3275 + 0.9366\,X)}}$
3. $G^2$, the likelihood ratio chi-square statistic, tests $H_0: \beta_1 = 0$, which is equivalent to testing $H_0: OR_{\text{Used vs Never}} = 1$.
4. Overall, the model is significant because the likelihood ratio chi-square statistic is 7.8676 with 1 degree of freedom and p-value = .0050.
5. The estimate of the odds ratio $OR_{\text{Used vs Never}}$ is $e^{0.9366} = 2.551$ and its 95% Wald confidence limits (1.336, 4.873) do not contain 1. This agrees with the Wald test of $\beta_1$ (estimate .9366, p-value .0046).
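With a single binary predictor, the fitted model reproduces the 2x2 table exactly, so the SAS estimates above can be checked by hand from the four cell counts. The sketch below does this in Python; it relies on the standard facts that the slope equals the log cross-product ratio and that its standard error is the square root of the sum of reciprocal cell counts. Variable names are illustrative.

```python
import math

# Cell counts from the oral contraceptive table
a, b = 23, 34     # used:  heart attack yes, no
c, d = 35, 132    # never: heart attack yes, no

b0 = math.log(c / d)                      # intercept: log odds for never users
b1 = math.log((a * d) / (b * c))          # slope: log odds ratio, used vs never
se_b1 = math.sqrt(1/a + 1/b + 1/c + 1/d)  # standard error of the log odds ratio

odds_ratio = math.exp(b1)
ci_low = math.exp(b1 - 1.96 * se_b1)
ci_high = math.exp(b1 + 1.96 * se_b1)

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, SE(b1) = {se_b1:.4f}")
print(f"OR = {odds_ratio:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

The printed values should agree with the Intercept, X, and Odds Ratio Estimates rows of the output to the displayed precision.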
Suppose we want to find out the relationship between heart attacks and BMI, and the following data have been collected.

                            Heart attacks
BMI                         Yes      No
Above 30 (Obese)             25       5
25 - 30 (Overweight)         10      20
Below 25 (Normal)             5      25

Create dummy variables

BMI           X1    X2
Obese          1     0
Overweight     0     1
Normal         0     0

Dependent variable: Y = 1 if a subject had a heart attack; Y = 0 if a subject did not.
Model
$$\text{logit}[p] = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$
SAS program
Data cancer;
input bmi $ attack $ wt;
/* Normal BMI is the reference category (X1 = X2 = 0) */
if bmi = 'obese' then X1 = 1; else X1 = 0;
if bmi = 'overwt' then X2 = 1; else X2 = 0;
if attack = 'yes' then Y = 1; else Y = 0;
lines;
obese yes 25
obese no 5
overwt yes 10
overwt no 20
normal yes 5
normal no 25
;
run;
proc logistic descending;
weight wt;
model Y = X1 X2 /link=logit;
run;
Output
The LOGISTIC Procedure

Model Information
Data Set                      WORK.CANCER
Response Variable             Y
Number of Response Levels     2
Number of Observations        6
Weight Variable               wt
Sum of Weights                90
Link Function                 Logit
Optimization Technique        Fisher's scoring

Response Profile
Ordered               Total          Total
  Value     Y     Frequency         Weight
      1     1             3      40.000000
      2     0             3      50.000000

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
                              Intercept
              Intercept          and
Criterion       Only          Covariates
AIC            125.653          98.258
SC             125.445          97.633
-2 Log L       123.653          92.258

Testing Global Null Hypothesis: BETA=0
Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        31.3949     2        <.0001
Score                   29.2500     2        <.0001
Wald                    23.3637     2        <.0001

Analysis of Maximum Likelihood Estimates
                              Standard
Parameter    DF    Estimate      Error    Chi-Square    Pr > ChiSq
Intercept     1     -1.6093     0.4899       10.7921        0.0010
X1            1      3.2186     0.6928       21.5842        <.0001
X2            1      0.9162     0.6245        2.1523        0.1424

Odds Ratio Estimates
             Point          95% Wald
Effect    Estimate     Confidence Limits
X1          24.994       6.429     97.170
X2           2.500       0.735      8.500

Association of Predicted Probabilities and Observed Responses
Percent Concordant     33.3    Somers' D    0.000
Percent Discordant     33.3    Gamma        0.000
Percent Tied           33.3    Tau-a        0.000
Pairs                     9    c            0.500
1. The logistic regression equation is $\text{logit}(\hat{p}) = -1.6093 + 3.2186\,X_1 + 0.9162\,X_2$
2. Equivalently, $\hat{p} = \frac{1}{1 + e^{-(-1.6093 + 3.2186\,X_1 + 0.9162\,X_2)}}$
3. $G^2$, the likelihood ratio chi-square statistic, tests $H_0: \beta_1 = \beta_2 = 0$, which is equivalent to testing $H_0: OR_{\text{Obese vs Normal}} = OR_{\text{Overweight vs Normal}} = 1$.
4. Overall, the model is significant because the likelihood ratio chi-square statistic is 31.3949 with df = 2 and p-value < .0001.
5. $\widehat{OR}_{\text{Obese vs Normal}} \approx 25$, and its confidence interval (6.429, 97.170) does not include 1. This shows that the odds of a heart attack differ significantly between the obese group and the normal group. On the other hand, $\widehat{OR}_{\text{Overweight vs Normal}} \approx 2.5$ and its confidence interval (0.735, 8.500) includes 1. This means that the odds of a heart attack do not differ significantly between overweight people and normal-weight people.
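Because the two dummy variables reproduce the three BMI groups exactly, the estimates above can again be recovered from the cell counts. The following Python sketch, with illustrative names, computes each coefficient as a difference of group log odds relative to the normal group; small differences from the SAS output in the last digit reflect the iterative fit and rounding.

```python
import math

# (heart attack yes, heart attack no) for each BMI group
counts = {"obese": (25, 5), "overweight": (10, 20), "normal": (5, 25)}

ref_yes, ref_no = counts["normal"]          # reference category
b0 = math.log(ref_yes / ref_no)             # intercept: log odds in normal group

results = {}
for group in ("obese", "overweight"):
    yes, no = counts[group]
    beta = math.log(yes / no) - b0          # log odds ratio vs normal
    se = math.sqrt(1/yes + 1/no + 1/ref_yes + 1/ref_no)
    results[group] = {
        "beta": beta,
        "se": se,
        "or": math.exp(beta),
        "ci": (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)),
    }
    lo, hi = results[group]["ci"]
    print(f"{group:<10} beta = {beta:.4f}  SE = {se:.4f}  "
          f"OR = {math.exp(beta):.3f}  95% CI = ({lo:.3f}, {hi:.3f})")
```

An interval that excludes 1 (obese) corresponds to a significant Wald test; one that includes 1 (overweight) does not.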
Example 2 (When independent variables are mixed: nominal and continuous)
Let Y be a dichotomous variable defined as
$$Y = \begin{cases} 1 & \text{for those who have lung cancer} \\ 0 & \text{for those who do not have lung cancer} \end{cases}$$
and let $X_1$ be smoking status, $X_2$ be sbp and $X_3$ be age.
The logit form of the logistic model is
$$\text{logit}[p] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 \qquad (5)$$
logit(lung cancer | smokers, sbp = 160, age = 40) = $\beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot 160 + \beta_3 \cdot 40$
logit(lung cancer | smokers, sbp = 120, age = 40) = $\beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot 120 + \beta_3 \cdot 40$
Thus
Odds(lung cancer | smokers, sbp = 160, age = 40) = $e^{\beta_0 + \beta_1 + 160\beta_2 + 40\beta_3}$
Odds(lung cancer | smokers, sbp = 120, age = 40) = $e^{\beta_0 + \beta_1 + 120\beta_2 + 40\beta_3}$
and the odds ratio comparing the odds of lung cancer for those who smoke, are 40 years old, and have sbp 160 to the odds for those who smoke, are 40 years old, and have sbp 120 is
$$OR_{\text{sbp160 vs sbp120}} = \frac{\text{odds(lung cancer} \mid \text{smokers, sbp} = 160, \text{age} = 40)}{\text{odds(lung cancer} \mid \text{smokers, sbp} = 120, \text{age} = 40)} = \frac{e^{\beta_0 + \beta_1 + 160\beta_2 + 40\beta_3}}{e^{\beta_0 + \beta_1 + 120\beta_2 + 40\beta_3}} = e^{(160 - 120)\beta_2} = e^{40\beta_2}$$
In other words, the estimate of OR is
$$\widehat{OR} = e^{40\hat\beta_2}$$
where $\hat\beta_2$ is the MLE of $\beta_2$ in equation (5).
Testing $H_0: OR_{\text{sbp160 vs sbp120}} = 1$ is the same as testing $H_0: \beta_2 = 0$, since $OR = e^{40\beta_2}$ and $e^{40 \cdot 0} = e^0 = 1$.
Example: Logistic regression with a continuous independent variable
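data heart;
input sbp chd wt @@;
lines;
110 0 153 110 1 3
121 0 235 121 1 17
131 0 272 131 1 12
141 0 255 141 1 16
151 0 127 151 1 12
161 0 77 161 1 8
177 0 83 177 1 16
190 0 35 190 1 8
;
run;
proc logistic data=heart descending;
weight wt;
model chd = sbp /link=logit;
output out=heartout p=pred;
run;
/* Observed proportion of chd at each sbp level */
data obs;
input sbp no yes;
total=yes+no;
prob= yes/total;
lines;
110 153 3
121 235 17
131 272 12
141 255 16
151 127 12
161 77 8
177 83 16
190 35 8
;
run;
/* heartout has two rows per sbp (chd = 0 and chd = 1) with the same pred.
   Keep one row per sbp and merge by sbp so that each observed proportion
   is paired with the fitted probability at the same sbp. */
proc sort data=heartout out=predout(keep=sbp pred) nodupkey;
by sbp;
run;
data heart2;
merge obs predout;
by sbp;
run;
proc print data=heart2;
var sbp no yes prob pred;
run;
proc plot data=heart2;
plot prob*sbp
pred*sbp='*' /overlay;
run;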
The LOGISTIC Procedure

Model Information
Data Set                      WORK.HEART
Response Variable             chd
Number of Response Levels     2
Number of Observations        16
Weight Variable               wt
Sum of Weights                1329
Link Function                 Logit
Optimization Technique        Fisher's scoring

Response Profile
Ordered                 Total          Total
  Value     chd     Frequency         Weight
      1       1             8        92.0000
      2       0             8      1237.0000

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
                              Intercept
              Intercept          and
Criterion       Only          Covariates
AIC            670.831         648.520
SC             671.604         650.066
-2 Log L       668.831         644.520

Testing Global Null Hypothesis: BETA=0
Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        24.3110     1        <.0001
Score                   26.6394     1        <.0001
Wald                    25.3529     1        <.0001

Analysis of Maximum Likelihood Estimates
                              Standard
Parameter    DF    Estimate      Error    Chi-Square    Pr > ChiSq
Intercept     1     -6.0631     0.7195       71.0212        <.0001
sbp           1      0.0243    0.00482       25.3529        <.0001

Odds Ratio Estimates
             Point          95% Wald
Effect    Estimate     Confidence Limits
sbp          1.025       1.015      1.034

                                Odds ratio
1 unit difference in sbp          1.024598
10-unit difference in sbp         1.275069
20-unit difference in sbp         1.6258
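The odds ratios for 1-, 10-, and 20-unit differences in sbp follow from $e^{c\hat\beta}$ with the reported estimate 0.0243, and the Wald limits from $e^{c\hat\beta \pm 1.96\,c\,S_{\hat\beta}}$ with standard error 0.00482. A minimal Python sketch (function names are illustrative, inputs are the rounded values printed above):

```python
import math

beta_sbp = 0.0243    # estimate for sbp from the output
se_sbp = 0.00482     # its standard error

def odds_ratio(c: float) -> float:
    """Odds ratio for a c-unit difference in sbp."""
    return math.exp(c * beta_sbp)

def wald_ci(c: float, z: float = 1.96) -> tuple:
    """Wald confidence limits for the c-unit odds ratio (95% when z = 1.96)."""
    half_width = z * c * se_sbp
    return (math.exp(c * beta_sbp - half_width),
            math.exp(c * beta_sbp + half_width))

for c in (1, 10, 20):
    lo, hi = wald_ci(c)
    print(f"{c:>2}-unit difference: OR = {odds_ratio(c):.6f}, "
          f"95% CI = ({lo:.3f}, {hi:.3f})")
```

Note that the 10-unit odds ratio is the 1-unit odds ratio raised to the 10th power, not 10 times it.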
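Obs    sbp     no    yes       prob        pred
  1    110    153      3    0.01923    0.032576
  2    121    235     17    0.06746    0.032576
  3    131    272     12    0.04225    0.042134
  4    141    255     16    0.05904    0.042134
  5    151    127     12    0.08633    0.053104
  6    161     77      8    0.09412    0.053104
  7    177     83     16    0.16162    0.066732
  8    190     35      8    0.18605    0.066732

Note: the pred values in this listing repeat in pairs. They are the fitted probabilities at sbp = 110, 121, 131 and 141, each printed twice, because the listing was produced by pairing the rows of heartout with the data lines by position rather than by sbp. Rows 2 to 8 therefore do not show the fitted probability at their own sbp.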
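The fitted probability at any sbp can be evaluated directly from the estimated equation. The Python sketch below uses the rounded coefficients from the output (-6.0631 and 0.0243), so it agrees with SAS only to about three decimals. Evaluated this way, the printed pred values 0.032576, 0.042134, 0.053104 and 0.066732 correspond to sbp = 110, 121, 131 and 141.

```python
import math

b0, b1 = -6.0631, 0.0243    # rounded estimates from the output

def predicted(sbp: float) -> float:
    """Fitted Pr(chd = 1) at a given sbp."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * sbp)))

sbp_levels = [110, 121, 131, 141, 151, 161, 177, 190]
for sbp in sbp_levels:
    print(f"sbp = {sbp}: fitted probability = {predicted(sbp):.4f}")
```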
[Plot: prob*sbp (legend: A = 1 obs, B = 2 obs, etc.) overlaid with pred*sbp (symbol '*'). Vertical axis: prob, from 0.000 to 0.200; horizontal axis: sbp, from 110 to 190. NOTE: 1 obs hidden.]
Model Selection in Logistic Regression
Example 3: Cancer Remission Data (When independent variables are all continuous)
The data, taken from Lee (1974), consist of patient characteristics and whether or
not cancer remission occurred.
Data remiss;
input remiss cell smear infil li blast temp;
label remiss = 'complete remission';
cards;
1  .8   .83  .66  1.9  1.1    .996
1  .9   .36  .32  1.4   .74   .992
0  .8   .88  .7    .8   .176  .982
0 1     .87  .87   .7  1.053  .986
1  .9   .75  .68  1.3   .519  .98
0 1     .65  .65   .6   .519  .982
1  .95  .97  .92  1    1.23   .992
0  .95  .87  .83  1.9  1.354 1.02
0 1     .45  .45   .8   .322  .999
0  .95  .36  .34   .5  0     1.038
0  .85  .39  .33   .7   .279  .988
0  .7   .76  .53  1.2   .146  .982
0  .8   .46  .37   .4   .38  1.006
0  .2   .39  .08   .8   .114  .99
0 1     .9   .9   1.1  1.037  .99
1 1     .84  .84  1.9  2.064 1.02
0  .65  .42  .27   .5   .114 1.014
0 1     .75  .75  1    1.322 1.004
0  .5   .44  .22   .6   .114  .99
1 1     .63  .63  1.1  1.072  .986
0 1     .33  .33   .4   .176 1.01
0  .9   .93  .84   .6  1.591 1.02
1 1     .58  .58  1     .531 1.002
0  .95  .32  .3   1.6   .886  .988
1 1     .6   .6   1.7   .964  .99
1 1     .69  .69   .9   .398  .986
0 1     .73  .73   .7   .398  .986
;
run;
proc logistic;
Title 'Stepwise Regression on Cancer Remission Data';
model remiss=cell smear infil li blast temp
/ selection = stepwise
slentry = .3 slstay = .3 details;
run;
proc logistic;
title 'Backward Elimination Using the Fast Option';
model remiss = temp cell li smear blast
/ selection = backward
fast slstay = .2;
run;
proc logistic;
title 'Best Subsets Regression';
model remiss = temp cell li smear blast
/ selection = score;
run;
Output (Edited)
Stepwise Regression on Cancer Remission Data
Stepwise Selection Procedure

Step 0. Intercept entered:

Analysis of Maximum Likelihood Estimates
                  Parameter    Standard        Wald         Pr >    Standardized      Odds
Variable    DF     Estimate       Error  Chi-Square   Chi-Square        Estimate     Ratio
INTERCPT     1       0.6931      0.4082      2.8827       0.0895               .         .

Step 1. Variable LI entered:

Analysis of Maximum Likelihood Estimates
                  Parameter    Standard        Wald         Pr >    Standardized      Odds
Variable    DF     Estimate       Error  Chi-Square   Chi-Square        Estimate     Ratio
INTERCPT     1       3.7771      1.3786      7.5064       0.0061               .         .
LI           1      -2.8973      1.1868      5.9594       0.0146       -0.747230     0.055

Step 2. Variable TEMP entered:

Analysis of Maximum Likelihood Estimates
                  Parameter    Standard        Wald         Pr >    Standardized      Odds
Variable    DF     Estimate       Error  Chi-Square   Chi-Square        Estimate     Ratio
INTERCPT     1     -47.8559     46.4416      1.0618       0.3028               .         .
LI           1      -3.3020      1.3594      5.9005       0.0151       -0.851626     0.037
TEMP         1      52.4331     47.4934      1.2188       0.2696        0.429597   999.000

Step 3. Variable CELL entered:

Analysis of Maximum Likelihood Estimates
                  Parameter    Standard        Wald         Pr >    Standardized      Odds
Variable    DF     Estimate       Error  Chi-Square   Chi-Square        Estimate     Ratio
INTERCPT     1     -67.6339     56.8875      1.4135       0.2345               .         .
CELL         1      -9.6522      7.7511      1.5507       0.2130       -0.993231     0.000
LI           1      -3.8671      1.7783      4.7290       0.0297       -0.997359     0.021
TEMP         1      82.0738     61.7124      1.7687       0.1835        0.672450   999.000

Summary of Stepwise Procedure
             Variable          Number       Score        Wald         Pr >
Step    Entered    Removed         In  Chi-Square  Chi-Square   Chi-Square
   1    LI                          1      7.9311           .       0.0049
   2    TEMP                        2      1.2591           .       0.2618
   3    CELL                        3      1.4701           .       0.2253
Backward Elimination Using the Fast Option

Step 0. The following variables were entered:
INTERCPT  TEMP  CELL  LI  SMEAR  BLAST

Model Fitting Information and Testing Global Null Hypothesis BETA=0
                            Intercept
             Intercept         and
Criterion      Only         Covariates    Chi-Square for Covariates
AIC            36.372         33.857      .
SC             37.668         41.632      .
-2 LOG L       34.372         21.857      12.515 with 5 DF (p=0.0284)
Score            .              .          9.330 with 5 DF (p=0.0966)

Step 1. Fast Backward Elimination:

Analysis of Variables Removed by Fast Backward Elimination
                                                               Pr >
Variable                      Pr >       Residual           Residual
Removed     Chi-Square  Chi-Square     Chi-Square    DF   Chi-Square
BLAST           0.0008      0.9768         0.0008     1       0.9768
SMEAR           0.0951      0.7578         0.0959     2       0.9532
CELL            1.5135      0.2186         1.6094     3       0.6573
TEMP            0.6535      0.4189         2.2629     4       0.6875

Summary of Backward Elimination Procedure
        Variable    Number        Wald         Pr >
Step    Removed         In  Chi-Square   Chi-Square
   1    BLAST            4    0.000844       0.9768
   1    SMEAR            3      0.0951       0.7578
   1    CELL             2      1.5135       0.2186
   1    TEMP             1      0.6535       0.4189

Analysis of Maximum Likelihood Estimates
                  Parameter    Standard        Wald         Pr >    Standardized      Odds
Variable    DF     Estimate       Error  Chi-Square   Chi-Square        Estimate     Ratio
INTERCPT     1       3.7771      1.3786      7.5064       0.0061               .         .
LI           1      -2.8973      1.1868      5.9594       0.0146       -0.747230     0.055
Best Subsets Regression
The LOGISTIC Procedure

Data Set: WORK.REMISS
Response Variable: REMISS    complete remission
Response Levels: 2
Number of Observations: 27
Link Function: Logit

Response Profile
Ordered
  Value    REMISS    Count
      1         0       18
      2         1        9

Regression Models Selected by Score Criterion
Number of      Score
Variables      Value    Variables Included in Model
        1     7.9311    LI
        1     3.5258    BLAST
        1     1.8893    CELL
        1     1.0745    SMEAR
        1     0.6591    TEMP
---------------------------------------------------
        2     8.6611    CELL LI
        2     8.3648    TEMP LI
        2     7.9807    LI BLAST
        2     7.9537    LI SMEAR
        2     5.0826    TEMP BLAST
        2     3.9013    CELL BLAST
        2     3.5456    SMEAR BLAST
        2     2.8228    TEMP CELL
        2     2.3308    CELL SMEAR
        2     1.5641    TEMP SMEAR
---------------------------------------------------
        3     9.2502    TEMP CELL LI
        3     8.6817    CELL LI BLAST
        3     8.6652    CELL LI SMEAR
        3     8.5691    TEMP LI BLAST
        3     8.3720    TEMP LI SMEAR
        3     7.9817    LI SMEAR BLAST
        3     5.4816    TEMP CELL BLAST
        3     5.4018    TEMP SMEAR BLAST
        3     3.9272    CELL SMEAR BLAST
        3     3.0976    TEMP CELL SMEAR
---------------------------------------------------
        4     9.2791    TEMP CELL LI SMEAR
        4     9.2572    TEMP CELL LI BLAST
        4     8.6819    CELL LI SMEAR BLAST
        4     8.6315    TEMP LI SMEAR BLAST
        4     5.8305    TEMP CELL SMEAR BLAST
---------------------------------------------------
        5     9.3295    TEMP CELL LI SMEAR BLAST
---------------------------------------------------
Based on stepwise selection, backward elimination, and the best subsets method, the candidates for the best model may be the model with CELL and LI or the model with TEMP, CELL, and LI. Once again, selecting the best model is part statistical method and part experience and common sense.
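The intercept-only quantities in the outputs above can be reproduced by hand from the response counts (18 patients with REMISS = 0 and 9 with REMISS = 1; the procedure models the first ordered value, REMISS = 0). The Python sketch below is a check under the usual definitions AIC = -2 log L + 2q and SC = -2 log L + q ln(n), with q = 1 parameter; variable names are illustrative.

```python
import math

n_event, n_other = 18, 9          # counts for REMISS = 0 and REMISS = 1
n = n_event + n_other

intercept = math.log(n_event / n_other)        # log odds of the modeled level
se = math.sqrt(1 / n_event + 1 / n_other)      # its standard error
wald = (intercept / se) ** 2

# -2 log-likelihood of the intercept-only model
neg2logL = -2 * (n_event * math.log(n_event / n)
                 + n_other * math.log(n_other / n))
aic = neg2logL + 2 * 1            # one parameter (the intercept)
sc = neg2logL + 1 * math.log(n)   # Schwarz criterion

print(f"intercept = {intercept:.4f}, SE = {se:.4f}, Wald = {wald:.4f}")
print(f"-2 log L = {neg2logL:.3f}, AIC = {aic:.3f}, SC = {sc:.3f}")
```

These should match Step 0 of the stepwise output and the "Intercept Only" column of the backward elimination output.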
Example 4: Conditional Logistic Regression for 1-1 Matched Data
The data are a subset of the Los Angeles Study of Endometrial Cancer data described in Breslow and Day (1980). There are 63 matched pairs, each consisting of a case of endometrial cancer (OUTCOME=1) and a control (OUTCOME=0). The case and the corresponding control have the same ID. The explanatory variables include GALL (an indicator for gall bladder disease) and HYPER (an indicator for hypertension).
The goal of the analysis is to determine the relative risk of endometrial cancer for those who have gall bladder disease, controlling for the effect of hypertension.
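/* Each matched pair is read as two records: the case first, then the control.
   When the control is read, GALL and HYPER are replaced by the
   case-minus-control differences and one observation per pair is output.
   The data lines are edited: pairs 4 to 54 are not shown. */
data;
drop id1 gall1 hyper1;
retain id1 gall1 hyper1 0;
input id outcome gall hyper @@ ;
if (id = id1) then do;
gall=gall1-gall; hyper=hyper1-hyper;
output;
end;
else do;
id1=id; gall1=gall; hyper1=hyper;
end;
cards;
 1 1 0 0    1 0 0 0
 2 1 0 0    2 0 0 0
 3 1 0 1    3 0 0 1
 :
55 1 1 0   55 0 0 0
56 1 0 0   56 0 0 0
57 1 1 1   57 0 1 0
58 1 0 0   58 0 0 0
59 1 0 0   59 0 0 0
60 1 1 1   60 0 0 0
61 1 1 0   61 0 1 0
62 1 0 1   62 0 0 0
63 1 1 0   63 0 0 0
;
run;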
proc logistic;
model outcome = gall / noint;
run ;
proc logistic;
model outcome = gall hyper / noint ;
run ;
Output
The LOGISTIC Procedure

Model Information
Data Set                      WORK.DATA1
Response Variable             outcome
Number of Response Levels     1
Number of Observations        63
Link Function                 Logit
Optimization Technique        Fisher's scoring

Response Profile
Ordered                     Total
  Value    outcome      Frequency
      1          0             63

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
               Without          With
Criterion    Covariates    Covariates
AIC             87.337        85.654
SC              87.337        87.797
-2 Log L        87.337        83.654

Testing Global Null Hypothesis: BETA=0
Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio         3.6830     1        0.0550
Score                    3.5556     1        0.0593
Wald                     3.2970     1        0.0694

Analysis of Maximum Likelihood Estimates
                              Standard
Parameter    DF    Estimate      Error    Chi-Square    Pr > ChiSq
gall          1      0.9555     0.5262        3.2970        0.0694

Odds Ratio Estimates
             Point          95% Wald
Effect    Estimate     Confidence Limits
gall         2.600       0.927      7.293

NOTE: Since there is only one response level, measures of association between the observed and predicted values were not calculated.
The LOGISTIC Procedure

Model Information
Data Set                      WORK.DATA1
Response Variable             outcome
Number of Response Levels     1
Number of Observations        63
Link Function                 Logit
Optimization Technique        Fisher's scoring

Response Profile
Ordered                     Total
  Value    outcome      Frequency
      1          0             63

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
               Without          With
Criterion    Covariates    Covariates
AIC             87.337        86.788
SC              87.337        91.074
-2 Log L        87.337        82.788

Testing Global Null Hypothesis: BETA=0
Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio         4.5487     2        0.1029
Score                    4.3620     2        0.1129
Wald                     4.0060     2        0.1349

Analysis of Maximum Likelihood Estimates
                              Standard
Parameter    DF    Estimate      Error    Chi-Square    Pr > ChiSq
gall          1      0.9704     0.5307        3.3432        0.0675
hyper         1      0.3481     0.3770        0.8526        0.3558

Odds Ratio Estimates
             Point          95% Wald
Effect    Estimate     Confidence Limits
gall         2.639       0.933      7.468
hyper        1.416       0.677      2.965

NOTE: Since there is only one response level, measures of association between the
observed and predicted values were not calculated.
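Several numbers in the first (GALL only) output can be checked by hand. With no covariates and no intercept, each of the 63 pairs contributes probability 1/2 to the conditional likelihood, so -2 log L is 2 x 63 x ln 2. The Python sketch below verifies this, together with the odds ratio, its Wald limits, and the likelihood ratio test, using the rounded values printed in the output; variable names are illustrative.

```python
import math

n_pairs = 63
null_neg2logL = 2 * n_pairs * math.log(2)      # -2 log L without covariates

beta_gall, se_gall = 0.9555, 0.5262            # GALL-only model, from the output
odds_ratio = math.exp(beta_gall)
ci_low = math.exp(beta_gall - 1.96 * se_gall)
ci_high = math.exp(beta_gall + 1.96 * se_gall)

G2 = null_neg2logL - 83.654                    # 83.654 = -2 log L with GALL
p_value = math.erfc(math.sqrt(G2 / 2.0))       # chi-square upper tail, 1 df

print(f"-2 log L (no covariates) = {null_neg2logL:.3f}")
print(f"OR = {odds_ratio:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print(f"G2 = {G2:.3f}, p-value = {p_value:.4f}")
```

The interval contains 1, in line with the Wald p-value of 0.0694 being above 0.05.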
Example 5. Conditional Logistic Regression for m:n Matching
Conditional logistic regression is used to investigate the relationship between an
outcome and a set of prognostic factors in matched case-control studies. The outcome is
whether the subject is a case or a control. If there is only one case and one control, the
matching is 1:1. M:n matching refers to the situation where there is a varying number of
cases and controls in the matched sets. You can perform conditional logistic regression
with the PHREG procedure by using the discrete logistic model and forming a stratum for
each matched set. In addition, you need to create dummy survival times so all the cases
in a matched set have the same event time value and the corresponding controls are
censored at later times.
Consider the following set of low birth weight data extracted from Hosmer and Lemeshow (1989). These data represent 189 women, of whom 59 had low birth-weight babies and 130 had normal weight babies. Under investigation are the following risk factors: weight in pounds at the last menstrual period (LWT), presence of hypertension (HT), smoking status during pregnancy (SMOKE), and presence of uterine irritability (UI). For HT, SMOKE, and UI, a value of 1 indicates "yes" and a value of 0 indicates "no". The woman's age (AGE) is used as the matching variable. The SAS data set LBW contains a subset of the data corresponding to women between the ages of 16 and 32.
data lbw;
input id age low lwt smoke ht ui @@;
/* dummy survival time: cases (low=1) get time 1, controls (low=0) get time 2 */
time=2-low;
/* the data lines below are edited (some rows omitted) */
cards;
 25 16 1 130 0 0 0   143 16 0 110 0 0 0
166 16 0 112 0 0 0   167 16 0 135 1 0 0
189 16 0 135 1 0 0   206 16 0 170 0 0 0
216 16 0  95 0 0 0    37 17 1 130 1 0 1
  :
203 30 0 112 0 0 0    56 31 1 102 1 0 0
107 31 0 100 0 0 1   126 31 0 215 1 0 0
163 31 0 150 1 0 0   222 31 0 120 0 0 0
 22 32 1 105 1 0 0   106 32 0 121 0 0 0
134 32 0 132 0 0 0   170 32 0 134 1 0 0
175 32 0 170 0 0 0   207 32 0 186 0 0 0
;
title 'Example 5. Conditional Logistic Regression for m:n Matching';
proc phreg data=lbw;
strata age;
model time*low(0)= lwt smoke ht ui / ties=discrete;
run;
Output
The PHREG Procedure

Data Set: WORK.LBW
Dependent Variable: TIME
Censoring Variable: LOW
Censoring Value(s): 0
Ties Handling: DISCRETE

Summary of the Number of Event and Censored Values
                                                    Percent
Stratum    AGE    Total    Event    Censored       Censored
      1     16        7        1           6          85.71
      2     17       12        5           7          58.33
      3     18       10        2           8          80.00
      4     19       16        3          13          81.25
      5     20       18        8          10          55.56
      6     21       12        5           7          58.33
      7     22       13        2          11          84.62
      8     23       13        5           8          61.54
      9     24       13        5           8          61.54
     10     25       15        6           9          60.00
     11     26        8        4           4          50.00
     12     27        3        2           1          33.33
     13     28        9        2           7          77.78
     14     29        7        1           6          85.71
     15     30        7        1           6          85.71
     16     31        5        1           4          80.00
     17     32        6        1           5          83.33
-------------------------------------------------------------
  Total             174       54         120          68.97

Testing Global Null Hypothesis: BETA=0
               Without          With
Criterion    Covariates    Covariates    Model Chi-Square
-2 LOG L       159.069       141.108     17.961 with 4 DF (p=0.0013)
Score              .             .       17.315 with 4 DF (p=0.0017)
Wald               .             .       15.558 with 4 DF (p=0.0037)

Analysis of Maximum Likelihood Estimates
                  Parameter    Standard        Wald         Pr >      Risk
Variable    DF     Estimate       Error  Chi-Square   Chi-Square     Ratio
LWT          1    -0.014985     0.00706     4.50021       0.0339     0.985
SMOKE        1     0.808047     0.36797     4.82216       0.0281     2.244
HT           1     1.751430     0.73932     5.61199       0.0178     5.763
UI           1     0.883410     0.48032     3.38266       0.0659     2.419
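The Risk Ratio column is the exponential of each parameter estimate. The Python sketch below recomputes it from the printed estimates and, as an added illustration that is not in the original output, also shows the ratio for a 10-pound difference in LWT, since a 1-pound change is small.

```python
import math

# Parameter estimates from the PHREG output
estimates = {"LWT": -0.014985, "SMOKE": 0.808047, "HT": 1.751430, "UI": 0.883410}

risk_ratios = {name: math.exp(beta) for name, beta in estimates.items()}
for name, rr in risk_ratios.items():
    print(f"{name:<6} exp(beta) = {rr:.3f}")

# Illustrative extra: ratio for a 10-pound higher weight at last menstrual period
lwt_10 = math.exp(10 * estimates["LWT"])
print(f"LWT, 10-pound difference: {lwt_10:.3f}")
```

A ratio below 1 for LWT means higher maternal weight is associated with lower odds of a low birth-weight baby within an age stratum.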
HW for ch23 is #1 (except (e))