lect17

Lecture 17: Regression for Case-control Studies
BMTRY 701 Biostatistical Methods II

Old business: Comparing AUCs
- Good reference: Hanley and McNeil, “Comparing AUCs for ROC curves based on the same data”
- See class website for pdf.

Additional Reading in Logistic Regression
- Hosmer and Lemeshow, Applied Logistic Regression
- http://en.wikipedia.org/wiki/Logistic_regression
- http://luna.cas.usf.edu/~mbrannic/files/regression/Logistic.html
- http://www.statgun.com/tutorials/logisticregression.html
- http://www.bus.utk.edu/stat/Stat579/Logistic%20Regression.pdf
- Etc: Google “logistic regression”

Case Control Studies in Logistic Regression
- http://www.oxfordjournals.org/our_journals/tropej/online/ma_chap11.pdf
- How is a case-control study performed?
- What is the outcome and what is the predictor in the regression setting?

Recall the simple 2x2 example
- The odds ratio for a 2x2 table can be used in case-control studies
- Similarly, the logistic regression model can be used, treating ‘case’ status as the outcome.
- It has been shown that the odds ratio estimates (the slope coefficients) do not depend on the sampling scheme (i.e., cohort vs. case-control study); only the intercept does.

Example: Case control study of HPV and Oropharyngeal Cancer
- Gillison et al. (http://content.nejm.org/cgi/content/full/356/19/1944)
- 100 cases with oropharyngeal cancer and 200 controls without
- How was the sampling done?

Data on Case vs. HPV
(Rows are hpv16ser, columns are the variable named control. The column totals of 200 and 100 suggest that control = 1 marks the cases.)

    > table(data$hpv16ser, data$control)

          0   1
      0 186  43
      1  14  57

    > epitab(data$hpv16ser, data$control)
    $tab
             Outcome
    Predictor   0   p0  1   p1 oddsratio   lower    upper      p.value
            0 186 0.93 43 0.43   1.00000      NA       NA           NA
            1  14 0.07 57 0.57  17.61130 8.99258 34.49041 4.461359e-21

Multiple Logistic Regression
- This is not a ‘randomized’ study
- There are lots of other predictors that may be associated with the cancer
- Examples:
  • smoking
  • alcohol
  • age
  • gender

Fit the model:
- Write down the model
  • assume main effects of tobacco, alcohol and their interaction
- What is the likelihood function?
- What are the MLEs?

How do we interpret the results?
- Is there an effect of tobacco?
- Is there an effect of alcohol?
- Is there an interaction?

Interpreting the interaction
- What is the OR for a smoker/non-drinker versus a non-smoker/non-drinker?
- What is the OR for a smoker/drinker versus a non-smoker/drinker?

How can we assess if the effect of smoking differs by HPV status?
- How likely is it that someone who smokes and drinks will get oropharyngeal cancer?
- How can we estimate the chance?

Matched case control studies
- References:
  • Hosmer and Lemeshow, Applied Logistic Regression
  • http://staff.pubhealth.ku.dk/~bxc/SPE.2002/Slides/mcc.pdf
  • http://staff.pubhealth.ku.dk/~bxc/Talks/NestedMatched-CC.pdf
  • http://www.tau.ac.il/cc/pages/docs/sas8/stat/chap49/sect35.htm
  • http://www.ats.ucla.edu/stat/sas/library/logistic.pdf (beginning page 5)

Matched design
- Matching on important factors is common
- OP cancer:
  • age
  • gender
- Why?
  • forces the distribution to be the same on those variables
  • removes any effects of those variables on the outcome
  • eliminates confounding by the matching variables

1-to-M matching
- For each ‘case’, there is a matched ‘control’
- Process usually dictates that the case is enrolled, then a control is identified
- For particularly rare diseases or when large N is required, often use more than one control per case

Logistic regression for matched case control studies
- Recall independence:

  $$
  \begin{aligned}
  y_i &\overset{iid}{\sim} \mathrm{Bern}(p_i) \\
      &\overset{iid}{\sim} \mathrm{Bern}\!\left( \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} \right)
  \end{aligned}
  $$

- But, if cases and controls are matched, are they still independent?

Solution: treat each matched set as a stratum
- one-to-one matching: 1 case and 1 control per stratum
- one-to-M matching: 1 case and M controls per stratum
- Logistic model per stratum, with stratum-specific intercept $\alpha_k$ and common slope $\beta$: within stratum, independence holds.

  $$
  p_k(x_i) = \frac{e^{\alpha_k + \beta x_i}}{1 + e^{\alpha_k + \beta x_i}}
  $$

- We assume that the OR for x and y is constant across strata

How many parameters is that?
- Assume sample size is 2n and we have 1-to-1 matching:
- n strata + p covariates = n+p parameters
- This is problematic:
  • as n gets large, so does the number of parameters
  • too many parameters to estimate and a problem of precision
- But do we really care about the strata-specific intercepts?
- “NUISANCE PARAMETERS”

Conditional logistic regression
- To avoid estimation of the intercepts, we can condition on the study design.
- Huh?
- Think about each stratum:
  • how many cases and controls?
  • what is the probability that the case is the case and the control is the control?
  • what is the probability that the control is the case and the case the control?
- For each stratum, the likelihood contribution is based on this conditional probability

Conditioning
- For 1 to 1 matching: with two individuals in stratum k where y indicates case status (1 = case, 0 = control)

  $$
  \frac{P(y_{1k} = 1,\, y_{2k} = 0)}{P(y_{1k} = 1,\, y_{2k} = 0) + P(y_{1k} = 0,\, y_{2k} = 1)}
  $$

- Write as a likelihood contribution for stratum k:

  $$
  L_k = \frac{P(y_{1k} = 1 \mid x_{1k})\, P(y_{2k} = 0 \mid x_{2k})}
             {P(y_{1k} = 1 \mid x_{1k})\, P(y_{2k} = 0 \mid x_{2k}) + P(y_{1k} = 0 \mid x_{1k})\, P(y_{2k} = 1 \mid x_{2k})}
  $$

Likelihood function for CLR
Substitute in our logistic representation of p (stratum intercept $\alpha_k$, common slope $\beta$) and simplify:

  $$
  \begin{aligned}
  L_k &= \frac{P(y_{1k} = 1 \mid x_{1k})\, P(y_{2k} = 0 \mid x_{2k})}
              {P(y_{1k} = 1 \mid x_{1k})\, P(y_{2k} = 0 \mid x_{2k}) + P(y_{1k} = 0 \mid x_{1k})\, P(y_{2k} = 1 \mid x_{2k})} \\[4pt]
      &= \frac{\dfrac{e^{\alpha_k + \beta x_{1k}}}{1 + e^{\alpha_k + \beta x_{1k}}} \cdot \dfrac{1}{1 + e^{\alpha_k + \beta x_{2k}}}}
              {\dfrac{e^{\alpha_k + \beta x_{1k}}}{1 + e^{\alpha_k + \beta x_{1k}}} \cdot \dfrac{1}{1 + e^{\alpha_k + \beta x_{2k}}}
               + \dfrac{1}{1 + e^{\alpha_k + \beta x_{1k}}} \cdot \dfrac{e^{\alpha_k + \beta x_{2k}}}{1 + e^{\alpha_k + \beta x_{2k}}}} \\[4pt]
      &= \frac{e^{\alpha_k + \beta x_{1k}}}{e^{\alpha_k + \beta x_{1k}} + e^{\alpha_k + \beta x_{2k}}} \\[4pt]
      &= \frac{e^{\beta x_{1k}}}{e^{\beta x_{1k}} + e^{\beta x_{2k}}}
  \end{aligned}
  $$

Likelihood function for CLR (continued)
- Now, take the product over all the strata for the full likelihood

  $$
  L(\beta) = \prod_{k=1}^{n} L_k = \prod_{k=1}^{n} \frac{e^{\beta x_{1k}}}{e^{\beta x_{1k}} + e^{\beta x_{2k}}}
  $$

- This is the likelihood for the matched case-control design
- Notice:
  • there are no strata-specific parameters
  • cases are defined by subscript ‘1’ and controls by subscript ‘2’
- Theory for 1-to-M follows similarly (but not shown here)

Interpretation of β
- Same as in ‘standard’ logistic regression
- β is the log odds ratio of disease for a one-unit difference in x

When to use matched vs. unmatched?
- Some papers use both for a matched design
- Tradeoffs:
  • bias
  • precision
- Sometimes a matched design is used to ensure balance, but an unmatched analysis is then performed
- They WILL give you different answers
- Gillison paper

Another approach to matched data
- use random effects models
- CLR is elegant and simple
- can identify the estimates using a ‘transformation’ of logistic regression results
- But, with the new age of computing, we have other approaches
- Random effects models:
  • allow strata-specific intercepts
  • estimation process is not problematic
  • additional assumption: intercepts follow a normal distribution
  • will NOT give identical results

    . xi: clogit control hpv16ser, group(strata) or

    Iteration 0:   log likelihood = -72.072957
    Iteration 1:   log likelihood = -71.803221
    Iteration 2:   log likelihood = -71.798737
    Iteration 3:   log likelihood = -71.798736

    Conditional (fixed-effects) logistic regression   Number of obs   =        300
                                                      LR chi2(1)      =      76.12
                                                      Prob > chi2     =     0.0000
    Log likelihood = -71.798736                       Pseudo R2       =     0.3465

    ------------------------------------------------------------------------------
         control | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
        hpv16ser |   13.16616   4.988492     6.80   0.000      6.26541    27.66742
    ------------------------------------------------------------------------------

    . xi: logistic control hpv16ser

    Logistic regression                               Number of obs   =        300
                                                      LR chi2(1)      =      90.21
                                                      Prob > chi2     =     0.0000
    Log likelihood = -145.8514                        Pseudo R2       =     0.2362

    ------------------------------------------------------------------------------
         control | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
        hpv16ser |    17.6113   6.039532     8.36   0.000     8.992582     34.4904
    ------------------------------------------------------------------------------

    . xi: gllamm control hpv16ser, i(strata) family(binomial)

    number of level 1 units = 300
    number of level 2 units = 100

    Condition Number = 2.4968508

    gllamm model

    log likelihood = -145.8514

    ------------------------------------------------------------------------------
         control |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
        hpv16ser |   2.868541   .3429353     8.36   0.000       2.1964    3.540681
           _cons |  -1.464547   .1692104    -8.66   0.000    -1.796193     -1.1329
    ------------------------------------------------------------------------------

    Variances and covariances of random effects
    ------------------------------------------------------------------------------
    ***level 2 (strata)
        var(1): 4.210e-21 (2.231e-11)
    ------------------------------------------------------------------------------

OR = exp(2.868541) ≈ 17.61
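
As a check on the epitab output for the HPV data, the unmatched odds ratio and its 95% interval can be recomputed by hand from the four cell counts. The sketch below is in Python rather than the R and Stata used in the lecture. The function name odds_ratio_wald and the choice of a Wald interval on the log scale are assumptions made for illustration; the slides do not say which interval epitab uses.

```python
import math

# Cell counts from the lecture's 2x2 table (rows: hpv16ser, columns: case status).
unexposed_controls, unexposed_cases = 186, 43
exposed_controls, exposed_cases = 14, 57


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Cross-product odds ratio with a Wald interval computed on the log scale.

    a, b = exposed cases, exposed controls
    c, d = unexposed cases, unexposed controls
    (Hypothetical helper, not part of the lecture.)
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of the log odds ratio
    return math.exp(log_or), se, math.exp(log_or - z * se), math.exp(log_or + z * se)


or_hat, se_log_or, lower, upper = odds_ratio_wald(
    exposed_cases, exposed_controls, unexposed_cases, unexposed_controls
)
print(f"OR = {or_hat:.4f}, SE(log OR) = {se_log_or:.4f}, "
      f"95% CI = ({lower:.3f}, {upper:.3f})")
```

The odds ratio comes out at about 17.6, in line with the epitab and unmatched logistic output, and the standard error of the log odds ratio is close to the Std. Err. that gllamm reports for hpv16ser.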

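The conditional likelihood for 1-to-1 matching depends on β alone, so it can be maximized directly. Below is a minimal sketch in Python for a single covariate, using Newton-Raphson on the log of L(β). The function name fit_clr_pairs and the pair counts are hypothetical: the lecture's HPV data appear to have three subjects per stratum (300 observations in 100 strata) and are not reproduced here.

```python
import math


def fit_clr_pairs(pairs, n_iter=25, tol=1e-10):
    """Maximize prod_k exp(b*x1k) / (exp(b*x1k) + exp(b*x2k)) over b.

    pairs: list of (x_case, x_control), one tuple per matched pair.
    Each factor equals 1 / (1 + exp(-b * d_k)) with d_k = x_case - x_control,
    so the stratum intercepts never appear.
    """
    beta = 0.0
    for _ in range(n_iter):
        grad, hess = 0.0, 0.0
        for x_case, x_control in pairs:
            d = x_case - x_control
            p = 1.0 / (1.0 + math.exp(-beta * d))  # L_k at the current beta
            grad += d * (1.0 - p)                  # d/d(beta) of log L_k
            hess -= d * d * p * (1.0 - p)          # second derivative of log L_k
        if hess == 0.0:
            # Every pair has x_case == x_control: beta cannot be estimated.
            raise ValueError("no informative (discordant) pairs")
        step = grad / hess
        beta -= step  # Newton-Raphson update
        if abs(step) < tol:
            break
    return beta


# Hypothetical data, NOT from the lecture: 30 pairs where only the case is
# exposed, 10 where only the control is exposed, and 60 concordant pairs.
pairs = [(1, 0)] * 30 + [(0, 1)] * 10 + [(1, 1)] * 25 + [(0, 0)] * 35
beta_hat = fit_clr_pairs(pairs)
print(f"beta_hat = {beta_hat:.4f}, OR = {math.exp(beta_hat):.4f}")
print(f"log(30/10) = {math.log(30 / 10):.4f}")
```

With a binary covariate only the discordant pairs carry information, and the estimate reduces to the log of the ratio of the two discordant counts, which the last line prints for comparison.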