Matched Pair Data
Stat 557
Heike Hofmann

Outline
• Marginal Homogeneity - review
• Subject-specific vs Marginal Model
• Binary Response
  • conditional logistic regression
  • with covariates
• Ordinal response
• Symmetric Models

Matched Pair Data

                    2nd Rating
1st Rating      Approve   Disapprove
Approve             794          150
Disapprove           86          570

Assumptions
• Diagonal heavily loaded
• Association usually strongly positive (most people don't change their opinion)
• Distinguish between movers & stayers

Marginal Homogeneity
• $\operatorname{logit} P(Y_t = 1 \mid x_t) = \alpha + \beta x_t$
• $x_t$ is a dummy variable for the time points: $x_1 = 0$, $x_2 = 1$
• Then $\beta$ is the log odds ratio based on the overall population

RAND - American Life Panel
https://mmicdata.rand.org/alp/?page=election#electionforecast
• Panel of 3500 US citizens above 18, tracked since July
• Data isn't published on an individual basis, but from the change and the overall margins we can (almost) work out the change pattern

                             before 1st debate
1 week after 1st debate      Obama    Romney
Obama                         1585       121
Romney                         162      1432
(n = 3300)

> mswitch <- glm(I(candidate=="Obama")~time, data=votem, family=binomial(), weight=votes)
> summary(mswitch)

Call:
glm(formula = I(candidate == "Obama") ~ time, family = binomial(),
    data = votem, weights = votes)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-46.462  -22.929   -0.435   21.992   45.733

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.11771    0.03488   3.375 0.000738 ***
timevote2   -0.04981    0.04929  -1.010 0.312299
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9135.4  on 7  degrees of freedom
Residual deviance: 9134.3  on 6  degrees of freedom
AIC: 9138.3

Number of Fisher Scoring iterations: 3

Subject Specific Model
• $\operatorname{link} P(Y_{it} = 1) = \alpha_i + \beta x_t$
• $x_t$ is a dummy variable for the time points: $x_1 = 0$, $x_2 = 1$
• then $\alpha_i = \operatorname{link} P(Y_{i1} = 1)$ and $\beta = \operatorname{link} P(Y_{i2} = 1) - \operatorname{link} P(Y_{i1} = 1)$
• painful to fit ...
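The marginal fit for the RAND panel can be reproduced from the 2x2 table alone. The sketch below rebuilds a data frame with the column names used in the glm() call (candidate, time, votes); the actual layout of votem is not shown on the slides, so the construction is an assumption. It also compares the fitted slope with the closed form, the difference of the two marginal logits.

```r
# Cell counts of the before/after table (rows: after, columns: before).
# The layout is an assumption; only the names candidate, time and votes
# appear in the original glm() call.
cells <- data.frame(
  before = c("Obama", "Obama", "Romney", "Romney"),
  after  = c("Obama", "Romney", "Obama", "Romney"),
  votes  = c(1585, 162, 121, 1432)
)

# Long format: each cell contributes one row per time point (8 rows in total).
votem <- rbind(
  data.frame(candidate = cells$before, time = "vote1", votes = cells$votes),
  data.frame(candidate = cells$after,  time = "vote2", votes = cells$votes)
)
votem$time <- factor(votem$time)

mswitch <- glm(I(candidate == "Obama") ~ time, data = votem,
               family = binomial(), weights = votes)

# With a single time dummy the marginal estimate has a closed form:
# the difference of the two marginal logits for Obama.
n      <- sum(cells$votes)
obama1 <- sum(cells$votes[cells$before == "Obama"])  # Obama before the debate
obama2 <- sum(cells$votes[cells$after  == "Obama"])  # Obama one week after
beta_closed <- qlogis(obama2 / n) - qlogis(obama1 / n)

print(coef(mswitch))
print(beta_closed)
```

Both numbers should agree with the summary output (intercept about 0.1177, slope about -0.0498), which also confirms that the "before" margin is the baseline time point.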
Marginal vs Subject-Specific Model
Estimates for $\beta$
• are identical for the marginal model and the subject-specific model in case of the identity link
• are different for the logit link
  • marginal model: $\beta = \operatorname{logit} P(Y_2 = 1 \mid x_2) - \operatorname{logit} P(Y_1 = 1 \mid x_1)$
  • subject specific, for all $i$: $\beta = \operatorname{logit} P(Y_{i2} = 1 \mid x_2) - \operatorname{logit} P(Y_{i1} = 1 \mid x_1)$

Subject-Specific Model
• $\operatorname{logit} P(Y_{it} = 1) = \alpha_i + \beta x_t$
• Assumptions generally:
  • responses from different subjects are independent (for all $i$)
  • responses for different time points are independent

Subject-Specific Model
• Violation of independence is taken care of by the model structure:
  • Generally, $|\alpha_i| \gg |\beta|$
  • When $|\alpha_i|$ is small, we have the most variability between responses of the same individual, i.e. the least dependence. These are the records on which the estimation of $\beta$ is based.
  • For large $|\alpha_i|$, $P(Y_{it} = 1)$ is either close to 0 or close to 1 (largest dependence in the data)

Subject Specific Model
• $\operatorname{link} P(Y_{it} = 1) = \alpha_i + \beta x_t$
• but: estimation of $\alpha_i$ becomes problematic for large numbers of subjects
• idea: condition on the sufficient statistic for $\alpha_i$; this leads to conditional (logistic) regression

Likelihood for $\alpha_i$

Fitting the Subject Specific Model
• Let $S_i = y_{i1} + y_{i2}$, then $S_i \in \{0, 1, 2\}$
• $S_i$ are sufficient statistics for $\alpha_i$; only values of $S_i = 1$ contribute to the estimation of $\beta$
• Conditional on $S_i = 1$ the subject effect cancels: $\operatorname{logit} P(Y_{i2} = 1 \mid S_i = 1) = \beta$

Estimating $\beta$
• MLE for $\beta$ is $\log(n_{21}/n_{12})$
• the standard error of the estimate is then $\sqrt{1/n_{12} + 1/n_{21}}$
• Use clogit from the survival package to fit the model

Navajo Indians
• 144 victims of myocardial infarction (MI cases) are matched with 144 control subjects (disease free) according to gender and age.
• All participants of the study are asked whether they were ever diagnosed with diabetes:

                    Cases
Controls      Diabetes     no
Diabetes             9     16
no                  37     82

> myo.ml <- clogit(MI ~ diabetes + strata(pair), data=t103)
> summary(myo.ml)
Call:
coxph(formula = Surv(rep(1, 288L), MI) ~ diabetes + strata(pair),
    data = t103, method = "exact")

  n= 288

           coef exp(coef) se(coef)     z Pr(>|z|)
diabetes 0.8383    2.3125   0.2992 2.802  0.00508 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

         exp(coef) exp(-coef) lower .95 upper .95
diabetes     2.312     0.4324     1.286     4.157

Rsquare= 0.029   (max possible= 0.5 )
Likelihood ratio test= 8.55  on 1 df,   p=0.003449
Wald test            = 7.85  on 1 df,   p=0.005082
Score (logrank) test = 8.32  on 1 df,   p=0.003919

Conditional Logistic Regression as GLM
• Let $X_1, \dots, X_p$ be covariates with $x_{jit}$ = value of predictor $j \in \{1, \dots, p\}$ for individual $i \in \{1, \dots, n\}$ at time $t \in \{1, 2\}$
• Conditional logistic regression model:
  $\operatorname{logit}(P(Y_{it} = 1)) = \alpha_i + \beta_1 x_{1it} + \beta_2 x_{2it} + \dots + \beta_p x_{pit}$
• Conditioned on one success:
  $P(Y_{i1} = 1, Y_{i2} = 0 \mid S_i = 1) = \dfrac{1}{1 + \exp((x_{i2} - x_{i1})' \beta)}$
  $P(Y_{i1} = 0, Y_{i2} = 1 \mid S_i = 1) = \dfrac{\exp((x_{i2} - x_{i1})' \beta)}{1 + \exp((x_{i2} - x_{i1})' \beta)}$

Conditional Logistic Regression as GLM
• Rewrite: $Y_i^* = 1$ if $Y_{i1} = 0, Y_{i2} = 1$, and $Y_i^* = 0$ if $Y_{i1} = 1, Y_{i2} = 0$.
• Then, with $X_i^* = X_{i2} - X_{i1}$ for all $i$:
  $\operatorname{logit}(P(y_i^* = 1)) = \beta_1 x_{1i}^* + \beta_2 x_{2i}^* + \dots + \beta_p x_{pi}^*$
• Note: the above logistic regression does not have an intercept
• Extensions: longitudinal studies, i.e. more than two observations per individual, or clustered data (we'll come back to that later)

> table(ystar)
ystar
  1
144
> table(xstar)
xstar
 -1   0   1
 16  91  37

glm(formula = ystar ~ xstar - 1, family = binomial(logit))

Deviance Residuals:
   Min      1Q  Median      3Q     Max
0.8478  0.8478  1.1774  1.1774  1.5477

Coefficients:
      Estimate Std. Error z value Pr(>|z|)
xstar   0.8383     0.2992   2.802  0.00508 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 199.63  on 144  degrees of freedom
Residual deviance: 191.07  on 143  degrees of freedom
AIC: 193.07

Number of Fisher Scoring iterations: 4

Models for Square Contingency Tables

Matched Pairs: Ordinal
• $Y_1$ and $Y_2$ are ordinal variables with $J > 2$ categories
• POLR model (marginal), i.e. the proportional odds model:
  $\operatorname{logit}(P(Y_t \le j)) = \alpha_j + \beta x_t$
• Cumulative odds ratios are constant for all $j$:
  $\log \theta_j = \log \dfrac{P(Y_2 \le j)/P(Y_2 > j)}{P(Y_1 \le j)/P(Y_1 > j)} = \beta (x_2 - x_1) = \beta$
  for $x_2 = 1$ and $x_1 = 0$, independent of $j$.
Marginal Homogeneity in Ordinal Model
• Marginal homogeneity is equivalent to a zero log odds ratio:
  $\beta = 0 \iff \operatorname{logit}(P(Y_1 \le j)) = \operatorname{logit}(P(Y_2 \le j)) \ \forall j \iff P(Y_1 \le j) = P(Y_2 \le j) \ \forall j \iff \pi_{j+} = \pi_{+j} \ \forall j$
• Model fit:
  • based on the marginal probabilities $\pi_{j+}, \pi_{+j}$, $j = 1, \dots, J$: overall $2 (J - 1)$ degrees of freedom
  • the proportional odds model has $(J - 1) + 1 = J$ parameters ($J - 1$ intercepts $\alpha_j$ and one slope $\beta$)
  • model fit is based on $df = 2 (J - 1) - J = J - 2$

Matched Pairs: Nominal
• For nominal $Y$ with $J \ge 3$ categories, use category $J$ as baseline
• Baseline logistic regression:
  $\log \dfrac{P(Y_t = j)}{P(Y_t = J)} = \alpha_j + \beta_j x_t$
• Then $\beta_j = 0$ for all $j$ is the test for marginal homogeneity

Example: Migration Data

                        Residence in 1985
Residence 80       NE      MW       S       W    Total
NE              11607     100     366     124    12197
MW                 87   13677     515     302    14581
S                 172     225   17819     270    18486
W                  63     176     286   10192    10717
Total           11929   14178   18986   10888    55981

• 95% of the data is on the diagonal
• marginal homogeneity seems given; is the data even symmetric?

Stat 557 (Fall 2008), Matched Pair Data, November 4, 2008

Symmetry Model
• $H_0: \pi_{ab} = \pi_{ba}$ for all $a, b$
• as logistic regression: $\log(\pi_{ab}/\pi_{ba}) = 0$
• as loglinear model: $\log m_{ab} = \mu + \mu_a^X + \mu_b^Y + \mu_{ab}^{XY}$ with $\mu_a^X = \mu_a^Y$ and $\mu_{ab}^{XY} = \mu_{ba}^{XY}$

Migration Data
• Symmetry seems to be violated: e.g. more people moved MW -> S (515) than S -> MW (225)
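The symmetry hypothesis can be checked directly on the migration table. Under symmetry the fitted counts are $(n_{ab} + n_{ba})/2$, so the Pearson statistic reduces to the sum over pairs $a < b$ of $(n_{ab} - n_{ba})^2 / (n_{ab} + n_{ba})$ with $J(J-1)/2$ degrees of freedom. The following is a minimal sketch, not taken from the slides; the object names (mig, X2) are illustrative.

```r
# Migration table: rows = residence in 1980, columns = residence in 1985.
regions <- c("NE", "MW", "S", "W")
mig <- matrix(c(11607,   100,   366,   124,
                   87, 13677,   515,   302,
                  172,   225, 17819,   270,
                   63,   176,   286, 10192),
              nrow = 4, byrow = TRUE,
              dimnames = list(res80 = regions, res85 = regions))

# Share of stayers (diagonal cells).
diag_share <- sum(diag(mig)) / sum(mig)

# Pair each off-diagonal cell n_ab (a < b) with its mirror cell n_ba.
n_ab <- mig[upper.tri(mig)]
n_ba <- t(mig)[upper.tri(mig)]

# Pearson statistic for the symmetry model and its degrees of freedom.
X2 <- sum((n_ab - n_ba)^2 / (n_ab + n_ba))
df <- length(n_ab)                      # J (J - 1) / 2
p  <- pchisq(X2, df, lower.tail = FALSE)

print(c(diag_share = diag_share, X2 = X2, df = df, p = p))

# Cross-check: for tables larger than 2x2, mcnemar.test() computes the same
# symmetry statistic.
print(mcnemar.test(mig))
```

A large statistic on 6 degrees of freedom speaks against symmetry, in line with the unequal MW/S flows noted above.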