
Stat 562 Term Project
Conditional logistic regression for binary matched pairs
By Xiufang Ye
Content
I: Motivation and Theoretical Background
II: Basic Theory
III: Data Analysis
IV: Main Reference
V: Acknowledgements
Keywords: marginal model, logit link, conditional logistic regression, sufficient
statistics, ML analysis, matched case-control study
1. Motivation and Theoretical Background
1.1 What are binary matched pairs?
When comparing categorical responses from two samples in which each observation in one
sample is paired with an observation in the other, the data are called matched-pairs data.
The responses in the two samples are therefore statistically dependent. "Binary" here means
that the responses are binary.
1.2 Why "logistic regression"?
Logistic regression is the most important model for categorical response data,
especially for binary data. It is widely used in biomedical studies, social science
research, marketing, business, and even genetics.
1.3 Why "conditional logistic regression"?
In all of the examples so far, the observations have been independent. But what if the
observations were matched? You might think it would be possible to include dummy-coded
variables to indicate the matching. For example, with 56 matched pairs you could include
55 dummy variables to account for the non-independence, along with whatever covariates
you wanted in the model. However, logistic regression runs into trouble when the number
of parameters is close to the total degrees of freedom available. In such a situation,
the conditional logistic model is recommended. In matched case-control studies,
conditional logistic regression can be used to investigate the relationship
between an outcome and a set of prognostic factors.
1.4 Exact Inference for Logistic Regression
Maximum likelihood estimators of model parameters work best when the sample size is
large compared to the number of parameters in the model. When the sample size is small,
or when there are many parameters relative to the sample size, inference can be improved
by the method of conditional maximum likelihood. The conditional maximum
likelihood method bases inference for the primary parameters of interest on a conditional
likelihood function that eliminates other parameters. The technique uses a conditional
probability distribution defined over data sets in which the values of certain “sufficient
statistics" for the other parameters are fixed. This distribution is defined for potential
samples that provide the same information about the other parameters that occurs in the
observed sample. The distribution and the related conditional likelihood function depend
only on the parameters of interest.
For binary data, conditional likelihood methods are especially useful when a logistic
regression model contains a large number of “nuisance” parameters. They are also useful
for small samples. One can perform exact inference for a parameter by using the
conditional likelihood function that eliminates all the other parameters. Since that
conditional likelihood does not involve unknown parameters, one can calculate
probabilities such as p-values exactly rather than use crude approximations.
2. Basic Theory
2.1 Marginal versus conditional models for binary matched pairs
2.1.1 Two marginal models
Let (Y1, Y2) denote the pair of observations for a randomly selected subject, where a "1"
outcome denotes category 1 (success) and a "0" outcome denotes category 2. We can fit
the model

P(Yt = 1) = α + βxt,   (1)

where x1 = 0, x2 = 1.
Since P(Y2 = 1) = α + β and P(Y1 = 1) = α, we have β = P(Y2 = 1) − P(Y1 = 1).
Interpretation of the parameter: β is the difference between the marginal probabilities.
Alternatively, a logit model can be written as

logit[P(Yt = 1)] = α + βxt.   (2)

Then
log[P(Y1 = 1)/P(Y1 = 0)] = α,
log[P(Y2 = 1)/P(Y2 = 0)] = α + β,
so that
β = log{ [P(Y2 = 1)/P(Y2 = 0)] / [P(Y1 = 1)/P(Y1 = 0)] }.
Interpretation of the parameter: β is the log odds ratio for the marginal distributions.
The two models focus on the marginal distributions of the responses for the two observations.
For instance, in terms of the population-averaged table, the ML estimate of β in (2) is
the log odds ratio of the marginal proportions.
2.1.2 One conditional model
By contrast, the subject-specific table, having a stratum for each subject, implicitly
allows the probabilities to vary by subject. Let (Yi1, Yi2) denote the ith pair of
observations, i = 1, 2, …, n. The model has the form

link[P(Yit = 1)] = αi + βxt.

This is called a conditional model, since the effect β is defined conditional on the
subject.
Compared with a marginal model, its estimate describes the conditional association in
the three-way table stratified by subject; the effect is subject-specific. For the
marginal models (1) and (2), by contrast, the effects are population-averaged, since
they refer to averaging over the entire population rather than to specific individuals.
The two kinds of effect are identical for the identity link, but they differ for
nonlinear links.
For example, for the logit link,

logit[P(Yit = 1)] = αi + βxt.   (3)

Averaging this over the population does not yield a model of the same form.
2.2 A Logit Model with Subject-Specific Probabilities
By permitting subjects to have their own probability distributions, the conditional model
(3) for Yit, observation t for subject i, is

logit[P(Yit = 1)] = αi + βxt,

so that

P(Yit = 1) = exp(αi + βxt) / [1 + exp(αi + βxt)],

where x1 = 0, x2 = 1. Here, we assume a common effect β. For subject i,

P(Yi1 = 1) = exp(αi) / [1 + exp(αi)],   P(Yi2 = 1) = exp(αi + β) / [1 + exp(αi + β)].

Then

P(Yi2 = 1)/P(Yi2 = 0) = exp(αi + β)   and   P(Yi1 = 1)/P(Yi1 = 0) = exp(αi).

Interpretation of the parameter: β compares the two response distributions. For each
subject, the odds of success for observation 2 are exp(β) times the odds for
observation 1.
The conditional logistic regression model accounts for the dependence in matched pairs.
Given the parameters, model (3) normally assumes independence of responses both for
different subjects and for the two observations on the same subject. Averaging over all
subjects, however, the responses are nonnegatively associated. Suppose |β| is small
compared to |αi|. Then P(Yit = 1) is an increasing function of αi: a subject with a
large positive αi has a high P(Yit = 1) for each t and is likely to have a success each
time, while a subject with a large negative αi has a low P(Yit = 1) for each t and is
likely to have a failure each time. For any β, the greater the variability in {αi}, the
greater the overall positive association between the responses, success (failure) for
observation 1 tending to occur with success (failure) for observation 2. This positive
association reflects the shared value of αi for both observations in a pair. In
particular, when the αi are identical across subjects, no association occurs.
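This effect of the shared αi can be checked numerically. Below is a minimal simulation sketch (not from the text; the normal distribution for the αi, the parameter values, and the function names are all illustrative assumptions):

```python
import math
import random

def simulate_pairs(n, beta, alpha_sd, seed=1):
    """Simulate n matched pairs from logit[P(Y_it = 1)] = alpha_i + beta*x_t,
    with x_1 = 0, x_2 = 1 and alpha_i drawn from N(0, alpha_sd^2)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        a = rng.gauss(0.0, alpha_sd)
        p1 = math.exp(a) / (1.0 + math.exp(a))               # P(Y_i1 = 1)
        p2 = math.exp(a + beta) / (1.0 + math.exp(a + beta)) # P(Y_i2 = 1)
        y1 = 1 if rng.random() < p1 else 0
        y2 = 1 if rng.random() < p2 else 0
        pairs.append((y1, y2))
    return pairs

def odds_ratio(pairs):
    """Sample odds ratio of the 2x2 population-averaged table of (Y_1, Y_2)."""
    n11 = sum(1 for y1, y2 in pairs if y1 == 1 and y2 == 1)
    n12 = sum(1 for y1, y2 in pairs if y1 == 1 and y2 == 0)
    n21 = sum(1 for y1, y2 in pairs if y1 == 0 and y2 == 1)
    n22 = sum(1 for y1, y2 in pairs if y1 == 0 and y2 == 0)
    return (n11 * n22) / (n12 * n21)

# Heterogeneous subjects (large alpha_sd): marked positive marginal association.
print(odds_ratio(simulate_pairs(100000, 0.5, 2.0)))  # well above 1
# Identical subjects (alpha_sd = 0): odds ratio near 1, i.e. no association.
print(odds_ratio(simulate_pairs(100000, 0.5, 0.0)))  # near 1
```

The contrast between the two printed odds ratios mirrors the claim in the text: association grows with the variability of {αi} and vanishes when they are identical.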
A difficulty: when there are a large number of parameters {αi}, the conditional model (3)
causes problems both for the fitting process and for the properties of ordinary ML
estimators. The unconditional ML estimator of β is inconsistent, a result first shown by
Andersen (1973).
Outline of proof:
Step 1: Assuming independence of responses for different subjects and for different
observations on the same subject, the log-likelihood equations are

y+t = Σi P(Yit = 1)   and   yi+ = Σt P(Yit = 1).

Step 2: Substituting

P(Yi1 = 1) + P(Yi2 = 1) = exp(αi)/[1 + exp(αi)] + exp(αi + β)/[1 + exp(αi + β)]

into the second likelihood equation, one can show that α̂i = −∞ for the n22 subjects with
yi+ = 0, α̂i = +∞ for the n11 subjects with yi+ = 2, and α̂i = −β̂/2 for the n21 + n12
subjects with yi+ = 1.
Step 3: Breaking Σi P(Yit = 1) into components for the sets of subjects having yi+ = 0,
yi+ = 2, and yi+ = 1, the first likelihood equation for t = 1 becomes

y+1 = n22(0) + n11(1) + (n21 + n12) exp(−β̂/2)/[1 + exp(−β̂/2)].

Since y+1 = n11 + n12, solving this equation shows that β̂ = 2 log(n21/n12). Hence
β̂ converges in probability to 2β, not β.
A remedy is conditional ML: it treats {αi} as nuisance parameters and maximizes the
likelihood function for a conditional distribution that eliminates them.
2.3 Conditional ML inference for binary matched pairs
2.3.1 Estimate of parameters
For model (3), assuming independence as mentioned before, the joint mass function for
{(y11, y12), (y21, y22), …, (yn1, yn2)} is

∏i=1..n [exp(αi)/(1 + exp(αi))]^yi1 [1/(1 + exp(αi))]^(1−yi1)
        × [exp(αi + β)/(1 + exp(αi + β))]^yi2 [1/(1 + exp(αi + β))]^(1−yi2)

= ∏i=1..n exp(αi yi1) exp[(αi + β) yi2] / {[1 + exp(αi)][1 + exp(αi + β)]}

= exp( Σi=1..n αi(yi1 + yi2) + β Σi=1..n yi2 ) / ∏i=1..n [1 + exp(αi)][1 + exp(αi + β)].

So, as a function of the data, it is proportional to exp( Σi αi(yi1 + yi2) + β Σi yi2 ).
To eliminate {αi}, we condition on their sufficient statistics, the pairwise success
totals {Si = yi1 + yi2}. Then P(Yi1 = Yi2 = 0) = 1 given Si = 0, and
P(Yi1 = Yi2 = 1) = 1 given Si = 2, so such pairs carry no information about β.
Given Si = 1,

P(Yi1 = yi1, Yi2 = yi2 | Si = 1)
= P(Yi1 = yi1, Yi2 = yi2) / [P(Yi1 = 1, Yi2 = 0) + P(Yi1 = 0, Yi2 = 1)]
= exp(β)/[1 + exp(β)]   if yi1 = 0, yi2 = 1,
= 1/[1 + exp(β)]        if yi1 = 1, yi2 = 0.
Let {nab} denote the counts of the four possible sequences. For the subjects having
Si = 1, Σi yi1 = n12, the number of subjects with a success for observation 1 and a
failure for observation 2. Similarly, for those subjects, Σi yi2 = n21, and
Σi Si = n* = n12 + n21.
Since n21 is the sum of n* independent, identical Bernoulli variates, its conditional
distribution is binomial with parameter exp(β)/[1 + exp(β)].
Hence, to make inferences about β, or to test marginal homogeneity (β = 0), we only need
the information about the pairs in which either yi1 = 1, yi2 = 0 or yi1 = 0, yi2 = 1.
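Because n21 given n* is binomial, and Binomial(n*, 1/2) under β = 0, an exact test of marginal homogeneity needs only the discordant counts. A sketch in Python (the function name is ours; the counts 37 and 16 are the diabetes data analyzed in Section 3):

```python
from math import comb

def exact_mcnemar_p(n21, n12):
    """Exact two-sided p-value for H0: beta = 0.
    Under H0, n21 | n* has a Binomial(n* = n12 + n21, 1/2) distribution."""
    n_star = n12 + n21
    upper = sum(comb(n_star, k) for k in range(n21, n_star + 1)) / 2 ** n_star
    lower = sum(comb(n_star, k) for k in range(0, n21 + 1)) / 2 ** n_star
    return min(1.0, 2 * min(upper, lower))

# Discordant pairs from the Table 10.3 diabetes/MI data: n21 = 37, n12 = 16
print(exact_mcnemar_p(37, 16))  # small p-value: evidence against homogeneity
```

Because the conditional distribution is free of the {αi}, this p-value is exact rather than an approximation, which is the point of Section 1.4.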
Alternatively, we can obtain this result through the maximum likelihood method.
Conditional on the {Si = 1}, the joint distribution of the matched pairs is

∏Si=1 [1/(1 + exp(β))]^yi1 [exp(β)/(1 + exp(β))]^yi2 = [exp(β)]^n21 / [1 + exp(β)]^n*,

where the product is over all pairs having Si = 1. Taking logs,

log ∏Si=1 [1/(1 + exp(β))]^yi1 [exp(β)/(1 + exp(β))]^yi2
= n21 β − (n21 + n12) log[1 + exp(β)].

Differentiating this conditional log-likelihood and equating it to 0,

n21 − (n21 + n12) exp(β)/[1 + exp(β)] = 0,

and solving yields the conditional ML estimator

β̂ = log(n21/n12).

By the delta method, applied as in 2×2 contingency tables, SE(β̂) = √(1/n21 + 1/n12).
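These closed forms are trivial to compute. A sketch, plugging in the counts n21 = 37 and n12 = 16 from the Table 10.3 data analyzed in Section 3 (reproducing the SAS estimate 0.8383 with standard error 0.2992):

```python
import math

def conditional_ml(n21, n12):
    """Conditional ML estimate of beta and its delta-method SE for binary
    matched pairs: beta_hat = log(n21/n12), SE = sqrt(1/n21 + 1/n12)."""
    beta_hat = math.log(n21 / n12)
    se = math.sqrt(1 / n21 + 1 / n12)
    return beta_hat, se

beta_hat, se = conditional_ml(37, 16)
print(round(beta_hat, 4), round(se, 4))  # 0.8383 0.2992
```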
2.3.2 Consistency of the estimator β̂
Referring to Problem 10.23 in the textbook, we can prove that
β̂ = log(n21/n12) converges in probability to β.
Outline of the proof: For a random sample of n pairs, by the definition of n21 and the
independence of responses for different observations on the same subject,

E(n21/n) = (1/n) Σi=1..n exp(αi + β) / {[1 + exp(αi)][1 + exp(αi + β)]}.

Similarly,

E(n12/n) = (1/n) Σi=1..n exp(αi) / {[1 + exp(αi)][1 + exp(αi + β)]}.

Applying the weak law of large numbers (WLLN),

n21/n →p lim (1/n) Σi=1..n exp(αi + β) / {[1 + exp(αi)][1 + exp(αi + β)]}

and

n12/n →p lim (1/n) Σi=1..n exp(αi) / {[1 + exp(αi)][1 + exp(αi + β)]}.

Therefore n21/n12 →p exp(β), and β̂ →p β.
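A quick Monte Carlo check of this consistency result (a sketch; drawing the αi from N(0, 1) and setting β = 1 are purely illustrative assumptions):

```python
import math
import random

def simulate_beta_hat(n_pairs, beta, seed=42):
    """Simulate matched pairs under logit[P(Y_it = 1)] = alpha_i + beta*x_t
    and return the conditional ML estimate log(n21/n12)."""
    rng = random.Random(seed)
    n21 = n12 = 0
    for _ in range(n_pairs):
        a = rng.gauss(0.0, 1.0)                    # alpha_i ~ N(0,1), illustrative
        p1 = 1.0 / (1.0 + math.exp(-a))            # P(Y_i1 = 1)
        p2 = 1.0 / (1.0 + math.exp(-(a + beta)))   # P(Y_i2 = 1)
        y1 = rng.random() < p1
        y2 = rng.random() < p2
        if y1 and not y2:
            n12 += 1                               # discordant pair (1, 0)
        elif y2 and not y1:
            n21 += 1                               # discordant pair (0, 1)
    return math.log(n21 / n12)

# With many pairs, the estimate should be close to the true beta = 1.0,
# unlike the unconditional ML estimator, which would converge to 2.0.
print(simulate_beta_hat(200000, 1.0))
```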
2.4 Random Effects in the Binary Matched-Pairs Model
There is an alternative remedy for handling the huge number of nuisance parameters in
logit model (3). One can treat {αi} as random effects, regarding {αi} as an unobserved
random sample from a probability distribution, usually assumed to be N(μ, σ²) with
unknown μ and σ². This eliminates {αi} by averaging with respect to their distribution,
yielding a marginal distribution. For matched pairs with a nonnegative sample log odds
ratio, this approach also yields the estimate β̂ = log(n21/n12).
2.5 Conditional ML for Matched Pairs with Multiple Predictors
Model (3) extends to a general model with multiple predictors:

logit[P(Yit = 1)] = αi + β1 x1it + β2 x2it + β3 x3it + … + βp xpit,   (4)

where xhit denotes the value of predictor h for observation t in pair i, t = 1, 2.
Typically, one predictor is the explanatory variable of interest, while the others are
covariates being controlled, in addition to those already controlled by virtue of having
been used to form the matched pairs. We can again apply conditional ML to eliminate the
αi and estimate the βj.
Let xit = (x1it, …, xpit)′ and β = (β1, β2, …, βp)′. Then the conditional distributions
are

P(Yi1 = 0, Yi2 = 1 | Si = 1) = exp(x′i2 β) / [exp(x′i1 β) + exp(x′i2 β)],
P(Yi1 = 1, Yi2 = 0 | Si = 1) = exp(x′i1 β) / [exp(x′i1 β) + exp(x′i2 β)].

With a little algebra, the first equation has the form of a logistic regression with no
intercept and with predictor values xi* = xi2 − xi1 (the differences between the two
levels of each predictor variable). In fact, one can obtain the conditional ML estimates
for model (4) by fitting a logistic regression model to those pairs alone, using the
artificial response y* = 1 when (yi1 = 0, yi2 = 1), y* = 0 when (yi1 = 1, yi2 = 0), no
intercept, and predictor values xi*. This yields the same likelihood as the conditional
likelihood (see Breslow et al. 1978; Chamberlain 1980).
Let us illustrate this with Table 10.3 in the textbook.
Let xi* = xi2 − xi1 and yi* = yi2 − yi1. Let t = 1 refer to the control and t = 2 to the
case; then yi* = 1 always. Since xit = 1 represents "yes" for diabetes and xit = 0
represents "no", we have (yi* = 1, xi* = −1) for 16 observations, (yi* = 1, xi* = 0) for
9 + 82 = 91 observations, and (yi* = 1, xi* = 1) for 37 observations. The logit model
that forces α̂ = 0 has β̂ = 0.84. With a single binary predictor, the estimate is
identical to log(n21/n12).
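This no-intercept fit can be replicated without SAS. A sketch in Python (the function name and the Newton-method implementation are ours): maximize the no-intercept logistic likelihood with y* = 1 for every pair and the predictor differences x* taking the value 1 for 37 pairs, 0 for 91 pairs, and −1 for 16 pairs:

```python
import math

# Predictor differences x* = x_i2 - x_i1 for the 144 matched pairs;
# the response y* equals 1 for every pair.
data = [1] * 37 + [0] * 91 + [-1] * 16

def fit_no_intercept_logit(x, beta=0.0, tol=1e-10):
    """Newton's method for logit[P(y* = 1)] = beta * x with no intercept."""
    for _ in range(50):
        p = [math.exp(beta * xi) / (1 + math.exp(beta * xi)) for xi in x]
        score = sum(xi * (1 - pi) for xi, pi in zip(x, p))        # dl/dbeta, y* = 1
        info = sum(xi * xi * pi * (1 - pi) for xi, pi in zip(x, p))  # -d2l/dbeta2
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta

beta_hat = fit_no_intercept_logit(data)
print(round(beta_hat, 4))  # 0.8383, identical to log(37/16)
```

Setting the score to zero gives 37 − 53·exp(β)/(1 + exp(β)) = 0, whose solution is exactly β = log(37/16), confirming the equivalence stated above.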
2.6 Extensions
The discussion of marginal models in Section 10.1 and of conditional models in this
section can be generalized to multinomial responses and to matched sets (clusters). For
matched-set clusters, the conditional ML approach is restricted to estimating βj that
are within-cluster effects, such as occur in case-control and crossover studies. An
advantage of using the random effects approach instead of conditional ML with the
conditional model is that it has no such restriction.
3. Data Analysis
3.1 Conditional Logistic Regression for Matched Pairs Data
In matched case-control studies, conditional logistic regression is used to investigate
the relationship between an outcome of being a case or a control and a set of prognostic
factors. When each matched set consists of a single case and a single control, the
conditional likelihood is given by

∏i [1 + exp(−β′(xi1 − xi0))]^(−1),

where xi1 and xi0 are vectors of the prognostic factors for the case and the control,
respectively, of the ith matched set. This likelihood is identical to the likelihood
from fitting a logistic regression model to a data set with constant response, where the
model contains no intercept term and has explanatory variables given by di = xi1 − xi0
(Breslow 1982).
Table 10.3 in the textbook presents a case-control study of acute myocardial infarction
(MI) among Navajo Indians, which matched 144 victims of MI according to age and gender
with 144 people free of heart disease. Subjects were asked whether they had ever been
diagnosed as having diabetes (x = 0, no; x = 1, yes). For subject t in matched pair i,
we consider model (3).
The case and the corresponding control share the same ID. The prognostic factor is
diabetes (an indicator of whether the subject had diagnosed diabetes). The goal of the
case-control analysis is to determine the relative risk for diabetes.
Before PROC LOGISTIC is used for the logistic regression analysis, each matched pair is
transformed into a single observation, in which the variable diabetes contains the
difference between the corresponding values for the case and the control
(case − control). The variable outcome, which is used as the response variable in the
logistic regression model, is given the constant value 1. Note that there are 144
observations in the data set, one for each matched pair.
In the following SAS statements, PROC LOGISTIC is invoked with the NOINT option to
obtain the conditional logistic model estimates. The model contains diabetes as the only
predictor variable. Because the option CLODDS=PL is specified, PROC LOGISTIC also
computes a 95% profile likelihood confidence interval for the odds ratio of the
predictor variable.
SAS code
data Data;
   input diabetes outcome @@;
   output;
   datalines;
 1 1  1 1  1 1  1 1  1 1  1 1  1 1  1 1  1 1  1 1
 1 1  1 1  1 1  1 1  1 1  1 1  1 1  1 1  1 1  1 1
 1 1  1 1  1 1  1 1  1 1  1 1  1 1  1 1  1 1  1 1
 1 1  1 1  1 1  1 1  1 1  1 1  1 1
 0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1
 0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1
 0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1
 0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1
 0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1
 0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1
 0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1
 0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1
 0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1  0 1
 0 1
-1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1
-1 1 -1 1 -1 1 -1 1 -1 1 -1 1
;
proc logistic data=Data;
   model outcome=diabetes / noint CLODDS=PL;
run;
Results from the conditional logistic analysis are shown below. Note that there is only
one response level listed in the "Response Profile" table and no intercept term in the
"Analysis of Maximum Likelihood Estimates" table.
Output of Analysis:
The SAS System
The LOGISTIC Procedure

Model Information
Data Set                    WORK.DATA
Response Variable           outcome
Number of Response Levels   1
Number of Observations      144
Model                       binary logit
Optimization Technique      Fisher's scoring

Response Profile
Ordered Value   outcome   Total Frequency
1               1         144
Probability modeled is outcome=1.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
Criterion   Without Covariates   With Covariates
AIC         199.626              193.073
SC          199.626              196.043
-2 Log L    199.626              191.073

Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   8.5534       1    0.0034
Score              8.3208       1    0.0039
Wald               7.8501       1    0.0051

Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
diabetes    1    0.8383     0.2992           7.8501            0.0051

Odds Ratio Estimates
Effect     Point Estimate   95% Wald Confidence Limits
diabetes   2.312            1.286   4.157

NOTE: Since there is only one response level, measures of association between the
observed and predicted values were not calculated.

Profile Likelihood Confidence Interval for Adjusted Odds Ratios
Effect     Unit     Estimate   95% Confidence Limits
diabetes   1.0000   2.312      1.310   4.272
In this model, diabetes is the predictor variable. The odds ratio estimate for diabetes
is 2.312, which is an estimate of the relative risk for diabetes. Since the 95% profile
likelihood confidence interval for the odds ratio, (1.310, 4.272), does not contain
unity, the prognostic factor diabetes is statistically significant.
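The Wald interval in the output above can be reproduced directly from the estimate and its standard error. A quick check in Python (β̂ = 0.8383 and SE = 0.2992 are taken from the SAS output; 1.96 is the normal quantile for 95% coverage):

```python
import math

beta_hat, se = 0.8383, 0.2992  # estimate and SE from the SAS output

odds_ratio = math.exp(beta_hat)                 # point estimate
lower = math.exp(beta_hat - 1.96 * se)          # lower Wald limit
upper = math.exp(beta_hat + 1.96 * se)          # upper Wald limit
print(round(odds_ratio, 3), round(lower, 3), round(upper, 3))
# 2.312 1.286 4.157 -- matching the "Odds Ratio Estimates" table
```

The profile likelihood interval (1.310, 4.272) differs slightly because it is based on the shape of the likelihood rather than on a normal approximation.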
3.2 Conditional Logistic Regression for m:n Matching
Conditional logistic regression is used to investigate the relationship between an outcome
and a set of prognostic factors in matched case-control studies. The outcome is whether
the subject is a case or a control. If there is only one case and one control, the matching is
1:1. The m:n matching refers to the situation in which there is a varying number of cases
and controls in the matched sets. You can perform conditional logistic regression with the
PHREG procedure by using the discrete logistic model and forming a stratum for each
matched set. In addition, you need to create dummy survival times so that all the cases in
a matched set have the same event time value, and the corresponding controls are
censored at later times.
Consider the following set of low infant birth-weight data extracted from Appendix 1 of
Hosmer and Lemeshow (1989). These data represent 189 women, of whom 59 had low
birth-weight babies and 130 had normal weight babies. Under investigation are the
following risk factors: weight in pounds at the last menstrual period (LWT), presence of
hypertension (HT), smoking status during pregnancy (Smoke), and presence of uterine
irritability (UI). For HT, Smoke, and UI, a value of 1 indicates a "yes" and a value of 0
indicates a "no." The woman's age (Age) is used as the matching variable. The SAS data
set LBW contains a subset of the data corresponding to women between the ages of 16
and 32.
data LBW;
   input id Age Low LWT Smoke HT UI @@;
   Time=2-Low;
   datalines;
 25 16 1 ……
……
207 32 0 186 0 0 0
;
The variable Low indicates whether the subject is a case (Low=1, low birth-weight baby)
or a control (Low=0, normal weight baby). The dummy time variable Time takes the value 1
for cases and 2 for controls.
The following SAS statements produce a conditional logistic regression analysis of the
data. The variable Time is the response, and Low is the censoring variable. Note that
the data set is created so that all the cases have the same event time and the controls
have later censored times. The matching variable Age is used in the STRATA statement so
that each unique age value defines a stratum. The variables LWT, Smoke, HT, and UI are
specified as explanatory variables. The TIES=DISCRETE option requests the discrete
logistic model.
proc phreg data=LBW;
model Time*Low(0)= LWT Smoke HT UI / ties=discrete;
strata Age;
run;
The procedure displays a summary of the number of event and censored observations for
each stratum. These are the numbers of cases and controls for each matched set, shown in
Output 1. Results of the conditional logistic regression analysis are shown in Output 2.
Based on the Wald tests for the individual variables, LWT, Smoke, and HT are
statistically significant, while UI is marginal.
The hazard ratios, computed by exponentiating the parameter estimates, are useful in
interpreting the results of the analysis. If the hazard ratio of a prognostic factor is
larger than 1, an increment in the factor increases the hazard rate; if it is less than
1, an increment in the factor decreases the hazard rate. The results indicate that women
were more likely to have low birth-weight babies if they were underweight at the last
menstrual period, were hypertensive, smoked during pregnancy, or suffered uterine
irritability. For matched case-control studies with one case per matched set
(1:n matching), the likelihood function for the conditional logistic regression reduces
to that of the Cox model for the continuous time scale; in that situation you can use
the default TIES=BRESLOW.
Output 1: Summary of the Numbers of Cases and Controls

The PHREG Procedure

Model Information
Data Set             WORK.LBW
Dependent Variable   Time
Censoring Variable   Low
Censoring Value(s)   0
Ties Handling        DISCRETE

Summary of the Number of Event and Censored Values
Stratum   Age   Total   Event   Censored   Percent Censored
1         16    7       1       6          85.71
2         17    12      5       7          58.33
3         18    10      2       8          80.00
4         19    16      3       13         81.25
5         20    18      8       10         55.56
6         21    12      5       7          58.33
7         22    13      2       11         84.62
8         23    13      5       8          61.54
9         24    13      5       8          61.54
10        25    15      6       9          60.00
11        26    8       4       4          50.00
12        27    3       2       1          33.33
13        28    9       2       7          77.78
14        29    7       1       6          85.71
15        30    7       1       6          85.71
16        31    5       1       4          80.00
17        32    6       1       5          83.33
Total           174     54      120        68.97
Output 2: Conditional Logistic Regression Analysis for the Low Birth-Weight Study

The PHREG Procedure

Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
Criterion   Without Covariates   With Covariates
-2 LOG L    159.069              141.108
AIC         159.069              149.108
SBC         159.069              157.064

Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   17.9613      4    0.0013
Score              17.3152      4    0.0017
Wald               15.5577      4    0.0037

Analysis of Maximum Likelihood Estimates
Variable   DF   Parameter Estimate   Standard Error   Chi-Square   Pr > ChiSq   Hazard Ratio
LWT        1    -0.01498             0.00706          4.5001       0.0339       0.985
Smoke      1    0.80805              0.36797          4.8221       0.0281       2.244
HT         1    1.75143              0.73932          5.6120       0.0178       5.763
UI         1    0.88341              0.48032          3.3827       0.0659       2.419
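The "Hazard Ratio" column is simply the exponentiated parameter estimate. A quick check in Python, using the estimates from the PHREG output above:

```python
import math

# Parameter estimates from the PHREG "Analysis of Maximum Likelihood Estimates"
estimates = {"LWT": -0.01498, "Smoke": 0.80805, "HT": 1.75143, "UI": 0.88341}

# Hazard ratio = exp(estimate), rounded as in the SAS output
hazard_ratios = {name: round(math.exp(b), 3) for name, b in estimates.items()}
print(hazard_ratios)
# {'LWT': 0.985, 'Smoke': 2.244, 'HT': 5.763, 'UI': 2.419}
```

For LWT, the ratio 0.985 is per one-pound increment in weight, which is why a lower weight corresponds to a higher hazard of a low birth-weight baby.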
4. Main Reference
Agresti, A. (2002). Categorical Data Analysis, 2nd edition. Wiley.
Thank you for listening.