Matched Pair Data
Stat 557
Heike Hofmann
Outline
• Marginal Homogeneity - review
• Subject-specific vs Marginal Model
• Binary Response
  • conditional logistic regression
  • with covariates
• Ordinal response
• Symmetric Models
Matched Pair Data

                    2nd Rating
1st Rating     Approve   Disapprove
Approve            794          150
Disapprove          86          570

Assumptions
• Diagonal heavily loaded
• Association usually strongly positive (most people don’t change their opinion)
• Distinguish between movers & stayers
Marginal Homogeneity
• logit P(Yt = 1 | xt) = α + β xt
• xt is a dummy variable for the time points: x1 = 0, x2 = 1
• Then β is the log odds ratio based on the overall population
RAND - American Life Panel
https://mmicdata.rand.org/alp/?page=election#electionforecast
• Panel of 3500 US citizens above 18, tracked since July
• Data isn’t published on an individual basis, but from the change and the overall margins we can (almost) work out the change pattern

                            before 1st debate
1 week after 1st debate     Obama   Romney
Obama                        1585      121
Romney                        162     1432
                                              n = 3300
> mswitch <- glm(I(candidate=="Obama") ~ time, data=votem, family=binomial(), weights=votes)
> summary(mswitch)

Call:
glm(formula = I(candidate == "Obama") ~ time, family = binomial(),
    data = votem, weights = votes)

Deviance Residuals:
     Min       1Q   Median       3Q      Max
 -46.462  -22.929   -0.435   21.992   45.733

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.11771    0.03488   3.375 0.000738 ***
timevote2   -0.04981    0.04929  -1.010 0.312299
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9135.4  on 7  degrees of freedom
Residual deviance: 9134.3  on 6  degrees of freedom
AIC: 9138.3

Number of Fisher Scoring iterations: 3
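The fit above only needs the table margins. A minimal sketch of how `votem` could be assembled from the 2×2 switch table (the exact construction of `votem` is an assumption; the counts come from the table):

```r
# Marginal counts from the switch table (columns = before, rows = after):
# before the debate Obama had 1585 + 162 = 1747 of 3300, after 1585 + 121 = 1706.
votem <- data.frame(
  time      = rep(c("vote1", "vote2"), each = 2),   # vote1 = before, vote2 = after
  candidate = rep(c("Obama", "Romney"), times = 2),
  votes     = c(1747, 1553, 1706, 1594)
)
mswitch <- glm(I(candidate == "Obama") ~ time, data = votem,
               family = binomial(), weights = votes)
round(coef(mswitch), 5)   # intercept ~ 0.11771, timevote2 ~ -0.04981
```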
Subject-Specific Model
• link P(Yit = 1) = αi + β xt
• xt is a dummy variable for the time points: x1 = 0, x2 = 1
• then
  αi = link P(Yi1 = 1)
  β = link P(Yi2 = 1) - link P(Yi1 = 1)
painful to fit ...
Marginal vs Subject-Specific Model
Estimates for β:
• identical for the marginal and the subject-specific model in case of the identity link
• different for the logit link
  • marginal model:
    β = logit P(Y2 = 1 | x2) - logit P(Y1 = 1 | x1)
  • subject-specific, for all i:
    β = logit P(Yi2 = 1 | x2) - logit P(Yi1 = 1 | x1)
Subject-Specific Model
• logit P(Yit = 1) = αi + β xt
• Assumptions generally:
  • responses from different subjects independent (for all i)
  • responses for different time points independent
Subject-Specific Model
• Violation of independence is taken care of by the model structure:
  • Generally, |αi| >> |β|
  • When |αi| is small, we have the most variability between responses of the same individual, i.e. the least dependence. Those are the records on which the estimation of β is based.
  • For large |αi|, the probability P(Yit = 1) is either close to 0 or close to 1 (largest dependence in the data)
Subject-Specific Model
• link P(Yit = 1) = αi + β xt
• but: estimation of αi becomes problematic for large numbers of subjects
• idea: condition on a sufficient statistic for αi — leads to conditional (logistic) regression
Likelihood for αi
Fitting the Subject-Specific Model
• Let Si = yi1 + yi2, then Si ∈ {0, 1, 2}
• Si are sufficient statistics for αi; only values of Si = 1 contribute to the estimation of β
• logit P(Yit = 1 | Si = 1) = αi + β xt
Estimating β
• MLE for β is log n21/n12
• standard deviation of the estimate is then sqrt(1/n12 + 1/n21)
• Use clogit from the survival package to fit the model
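These two formulas need only the discordant (off-diagonal) counts. A quick sketch, using the discordant counts from the approval table at the start (n12 = 150 approve→disapprove, n21 = 86 disapprove→approve):

```r
# Closed-form conditional MLE from the discordant cells of a 2x2
# matched-pairs table (counts from the approval table):
n12 <- 150   # approve at time 1, disapprove at time 2
n21 <- 86    # disapprove at time 1, approve at time 2
beta.hat <- log(n21 / n12)         # conditional MLE for beta
se.hat   <- sqrt(1/n12 + 1/n21)    # its standard error
round(c(beta = beta.hat, se = se.hat), 3)   # beta ~ -0.556, se ~ 0.135
```

The negative estimate says movement ran toward disapproval: more people switched approve→disapprove than the reverse.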
Navajo Indians
• 144 victims of myocardial infarction (MI cases) are matched with 144 control subjects (disease free) according to gender and age.
• All participants of the study are asked whether they were ever diagnosed with diabetes:

                 Controls
Cases        Diabetes    no
Diabetes            9    37
no                 16    82
> myo.ml <- clogit(MI ~ diabetes + strata(pair), data=t103)
> summary(myo.ml)
Call:
coxph(formula = Surv(rep(1, 288L), MI) ~ diabetes + strata(pair),
    data = t103, method = "exact")

  n= 288

           coef exp(coef) se(coef)     z Pr(>|z|)
diabetes 0.8383    2.3125   0.2992 2.802  0.00508 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

         exp(coef) exp(-coef) lower .95 upper .95
diabetes     2.312     0.4324     1.286     4.157

Rsquare= 0.029   (max possible= 0.5 )
Likelihood ratio test= 8.55  on 1 df,   p=0.003449
Wald test            = 7.85  on 1 df,   p=0.005082
Score (logrank) test = 8.32  on 1 df,   p=0.003919
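The call above assumes `t103` in long format: two rows per matched pair (case and control), a shared pair id, and MI and diabetes indicators. One way such a frame could be assembled from the four cell counts (the construction is a sketch; only the counts come from the table):

```r
library(survival)   # provides clogit

# Cell counts: (case diabetes status, control diabetes status) -> # of pairs
cells <- data.frame(case.diab = c(1, 1, 0, 0),
                    ctrl.diab = c(1, 0, 1, 0),
                    n         = c(9, 37, 16, 82))
t103 <- do.call(rbind, lapply(seq_len(nrow(cells)), function(k) {
  data.frame(MI       = rep(c(1, 0), cells$n[k]),  # case row, then control row
             diabetes = rep(c(cells$case.diab[k], cells$ctrl.diab[k]),
                            cells$n[k]))
}))
t103$pair <- cumsum(t103$MI)   # each case row (MI == 1) opens a new pair
myo.ml <- clogit(MI ~ diabetes + strata(pair), data = t103)
coef(myo.ml)   # ~ 0.8383 = log(37/16)
```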
Conditional Logistic Regression
Let X1, ..., Xp be covariates with xjit = value of predictor j ∈ {1, ..., p} for individual i ∈ {1, ..., n} at time t ∈ {1, 2}.
• Model:
  logit(P(Yit = 1)) = αi + β1 x1it + β2 x2it + ... + βp xpit
• Conditioned on one success:
  P(Yi1 = 1, Yi2 = 0 | Si = 1) = 1 / (1 + exp((xi2 − xi1)ᵀβ))
  P(Yi1 = 0, Yi2 = 1 | Si = 1) = exp((xi2 − xi1)ᵀβ) / (1 + exp((xi2 − xi1)ᵀβ))
Conditional Logistic Regression as GLM
• Rewrite
  Y*i = 1 if Yi1 = 0, Yi2 = 1,  and  Y*i = 0 if Yi1 = 1, Yi2 = 0.
• Then X*i = Xi2 − Xi1 for all i, and
  logit(P(y*i = 1)) = β1 x*1i + β2 x*2i + ... + βp x*pi
Note: the above logistic regression does not have an intercept.
Extensions: longitudinal studies, i.e. more than two observations per individual, or clustered data (we’ll come back to that later).
> table(ystar)
ystar
  1
144
> table(xstar)
xstar
-1  0  1
16 91 37

glm(formula = ystar ~ xstar - 1, family = binomial(logit))

Deviance Residuals:
   Min     1Q Median     3Q    Max
0.8478 0.8478 1.1774 1.1774 1.5477

Coefficients:
      Estimate Std. Error z value Pr(>|z|)
xstar   0.8383     0.2992   2.802  0.00508 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 199.63  on 144  degrees of freedom
Residual deviance: 191.07  on 143  degrees of freedom
AIC: 193.07

Number of Fisher Scoring iterations: 4
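The `ystar` and `xstar` vectors can be built straight from the four cell counts of the Navajo table (the construction is a sketch):

```r
# x* = x_i2 - x_i1 per pair (here: case minus control diabetes status);
# y* = 1 for every pair since the case is the "success" by construction.
# 16 pairs have x* = -1, the 9 + 82 = 91 concordant pairs have x* = 0,
# and 37 pairs have x* = +1:
xstar <- rep(c(-1, 0, 1), times = c(16, 91, 37))
ystar <- rep(1, length(xstar))
fit <- glm(ystar ~ xstar - 1, family = binomial(logit))
coef(fit)   # ~ 0.8383, the conditional MLE log(37/16)
```

The concordant pairs (x* = 0) contribute a constant to the likelihood, so, as expected, only the discordant pairs drive the estimate.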
Matched Pairs: Ordinal
Models for Square Contingency Tables - Model for Ordinal Response
• Y1 and Y2 are ordinal variables with J > 2 categories.
• Let Yt be ordinal with J categories. Then the proportional odds (marginal POLR) model is:
  logit(P(Yt ≤ j)) = αj + β xt
• Cumulative odds ratios are constant for all j:
  log θj = log [ P(Y2 ≤ j)/P(Y2 > j) ] / [ P(Y1 ≤ j)/P(Y1 > j) ] = β(x2 − x1) = β,
  for x2 = 1 and x1 = 0, independent of j.
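One way to fit this marginal model in R is `MASS::polr` on the two ratings stacked into long format. The sketch below uses simulated ratings (not the lecture's data; all names are illustrative). Note that polr parameterizes logit P(Y ≤ j) = ζj − ηx, so its η corresponds to −β in the notation above:

```r
library(MASS)   # polr

set.seed(557)
n  <- 500
# Simulated ordinal ratings at two time points (illustrative only):
y1 <- cut(rlogis(n),       breaks = c(-Inf, -1, 0, 1, Inf), labels = 1:4)
y2 <- cut(rlogis(n) + 0.4, breaks = c(-Inf, -1, 0, 1, Inf), labels = 1:4)
long <- data.frame(rating = ordered(c(y1, y2), levels = 1:4),
                   time   = rep(c(0, 1), each = n))
fit <- polr(rating ~ time, data = long, Hess = TRUE)
coef(fit)     # eta for time; -eta plays the role of beta above
fit$zeta      # the J - 1 = 3 cutpoints (polr's sign convention for alpha_j)
```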
Marginal Homogeneity in Ordinal Model
• Marginal homogeneity is equivalent to a zero log odds ratio:
  β = 0
  ⟺ logit(P(Y1 ≤ j)) = logit(P(Y2 ≤ j)) ∀j
  ⟺ P(Y1 ≤ j) = P(Y2 ≤ j) ∀j
  ⟺ πj+ = π+j ∀j
• Model fit is based on the marginal probabilities πj+, π+j, j = 1, ..., J; overall we have 2(J − 1) degrees of freedom.
• The proportional odds model has (J − 1) + 1 = J parameters (the αj and β), so the model fit is based on df = J − 2.
Matched Pairs: Nominal
• For nominal Y with J ≥ 3 categories, use J as baseline
• Baseline Logistic Regression:
  log P(Yt = j)/P(Yt = J) = αj + βj xt
• Then βj = 0 for all j is a test for marginal homogeneity
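A sketch of this test with `nnet::multinom` on stacked data (simulated response, illustrative names; like the other marginal models here, the stacking ignores the pairing):

```r
library(nnet)   # multinom for baseline-category logits

set.seed(557)
n <- 400
# Simulated nominal response at two time points (illustrative only):
long <- data.frame(
  y    = factor(sample(c("a", "b", "c"), 2 * n, replace = TRUE)),
  time = rep(c(0, 1), each = n)
)
fit0 <- multinom(y ~ 1,    data = long, trace = FALSE)
fit1 <- multinom(y ~ time, data = long, trace = FALSE)
# LR test of beta_j = 0 for all j, i.e. of marginal homogeneity:
anova(fit0, fit1)
```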
Example: Migration Data

                       Residence in 1985
Residence in 1980      NE     MW      S      W   Total
NE                  11607    100    366    124   12197
MW                     87  13677    515    302   14581
S                     172    225  17819    270   18486
W                      63    176    286  10192   10717
Total               11929  14178  18986  10888   55981

• 95% of the data is on the diagonal
• marginal homogeneity seems given; is the data even symmetric?

Stat 557 (Fall 2008), Matched Pair Data, November 4, 2008
Symmetry Model
• H0: πab = πba for all a, b
• as logistic regression:
  log(πab/πba) = 0
• as loglinear model:
  log m_ab = µ + µ_a^X + µ_b^Y + µ_ab^XY
  with µ_a^X = µ_a^Y and µ_ab^XY = µ_ba^XY
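The loglinear form can be fit directly with a Poisson GLM by giving the two cells of each unordered pair {a, b} a common factor level, which forces m_ab = m_ba. A sketch on the migration counts (variable names are illustrative):

```r
# Symmetry model for the 4x4 migration table as a Poisson loglinear model:
region <- c("NE", "MW", "S", "W")
d <- expand.grid(r85 = region, r80 = region)   # r85 varies fastest, so
d$count <- c(11607,   100,   366,   124,       # counts go row by row (r80)
                87, 13677,   515,   302,
               172,   225, 17819,   270,
                63,   176,   286, 10192)
# One parameter per unordered pair {a, b}:
d$sym <- apply(cbind(as.character(d$r80), as.character(d$r85)), 1,
               function(z) paste(sort(z), collapse = ":"))
fit.sym <- glm(count ~ factor(sym), family = poisson(), data = d)
# G^2 with 16 - 10 = 6 df tests symmetry:
c(G2 = deviance(fit.sym), df = fit.sym$df.residual)
```

The residual deviance here is large relative to 6 df, confirming the slide's point that symmetry is violated for this table.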
Migration Data
• Symmetry seems to be violated:
  e.g. more people moved MW → S (515) than S → MW (225)