Case control studies

Example: breast cancer study of Richardson et al. (2006)

“Frozen tissue samples of 43 primary, sporadic, clinically and pathologically annotated breast tumors and four tumors from BRCA1 mutation carriers were obtained as anonymous samples from the Harvard Breast SPORE blood and tissue repository.”

“Gene expression array data from 11 samples of normal breast organoid preparations (collagenase digested and enriched for epithelial elements) were obtained from Dr. Kornelia Polyak.”

Example continued

- The study is observational: subjects were not randomized.
- The sampling is retrospective: samples with and without disease were selected at the beginning of the study.
- Gene expression was measured after the samples were selected.

Prospective sampling

Prospective sampling: the predictors are fixed and the outcomes are then observed. The study usually involves taking a cohort of subjects and watching them over a long period.

(Figure omitted. Source of figure: Wikipedia.)

Retrospective sampling

Retrospective sampling: the outcomes are fixed and the predictors are then recorded. A retrospective study looks backwards and examines exposures to suspected risk factors in relation to an outcome of interest.

(Figure omitted. Source of figure: Wikipedia.)

Prospective cohort study

- Outcome is measured after exposure
- Best for common outcomes
- Expensive
- Requires a large sample size
- Takes a long time to complete
- Yields true incidence rates and relative risks

Case control study

- Outcome is measured before exposure
- Controls are selected on the basis of not having the outcome
- Good for rare outcomes
- Relatively inexpensive
- Smaller sample size required
- Quicker to complete
- Prone to selection bias

Example: respiratory disease in children

Consider the following data from a study on infant respiratory disease (Payne, 1987).
The table gives the proportions (cases/total) of children developing bronchitis or pneumonia in their first year of life, by type of feeding and gender.

         Bottle only    Breast with supplement    Breast only
Boys     77/458         19/147                    47/494
Girls    48/384         16/127                    31/464

Relation between feeding types and disease

Let $X$ be the indicator of bottle feeding:
\[
X = \begin{cases} 0 & \text{if breast feeding,} \\ 1 & \text{if bottle feeding.} \end{cases}
\]
Let $Y$ be the indicator of having a respiratory disease:
\[
Y = \begin{cases} 0 & \text{if no disease,} \\ 1 & \text{if having disease.} \end{cases}
\]

Logistic regression model

\[
Y_i \sim \mathrm{Bernoulli}(p_i), \qquad \log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1 X_i .
\]

- $\beta_1$: a unit increase in $X$ changes the log-odds of disease by $\beta_1$. Namely, $\beta_1$ is the difference between the log-odds of having a respiratory disease under bottle feeding and under breast feeding.
- How do we estimate $\beta_1$ under prospective and retrospective sampling?

Estimation of $\beta_1$ using prospective sampling

Let $B$ = {the boy is bottle fed}, so that $B^c$ means the boy is breast fed; this matches the coding $X = 1$ for bottle feeding. Let $D$ = {the boy has a respiratory disease}. For prospective sampling,
\[
\beta_1 = \log\left\{\frac{P(D \mid B)}{1 - P(D \mid B)}\right\} - \log\left\{\frac{P(D \mid B^c)}{1 - P(D \mid B^c)}\right\}.
\]

Estimation of $\beta_1$ using retrospective sampling

For retrospective sampling, we cannot estimate $P(D \mid B)$ and $P(D \mid B^c)$ directly, but we can estimate $P(B \mid D)$ and $P(B \mid D^c)$. Applying Bayes' formula,
\[
P(D \mid B) = \frac{P(B \mid D)P(D)}{P(B)}
\quad\text{and}\quad
1 - P(D \mid B) = \frac{P(B) - P(B \mid D)P(D)}{P(B)} .
\]
Therefore, using $P(B) = P(B \mid D)P(D) + P(B \mid D^c)P(D^c)$,
\[
\begin{aligned}
\log\left\{\frac{P(D \mid B)}{1 - P(D \mid B)}\right\}
&= \log\left\{\frac{P(B \mid D)P(D)}{P(B) - P(B \mid D)P(D)}\right\} \\
&= \log\left\{\frac{P(B \mid D)P(D)}{P(B \mid D^c)P(D^c)}\right\} \\
&= \log\left\{\frac{P(B \mid D)}{P(B \mid D^c)}\right\} + \log\left\{\frac{P(D)}{P(D^c)}\right\}.
\end{aligned}
\]

Similarly, applying Bayes' formula,
\[
P(D \mid B^c) = \frac{P(B^c \mid D)P(D)}{P(B^c)}
\quad\text{and}\quad
1 - P(D \mid B^c) = \frac{P(B^c) - P(B^c \mid D)P(D)}{P(B^c)} .
\]
Therefore,
\[
\begin{aligned}
\log\left\{\frac{P(D \mid B^c)}{1 - P(D \mid B^c)}\right\}
&= \log\left\{\frac{P(B^c \mid D)P(D)}{P(B^c) - P(B^c \mid D)P(D)}\right\} \\
&= \log\left\{\frac{P(B^c \mid D)P(D)}{P(B^c \mid D^c)P(D^c)}\right\} \\
&= \log\left\{\frac{P(B^c \mid D)}{P(B^c \mid D^c)}\right\} + \log\left\{\frac{P(D)}{P(D^c)}\right\}.
\end{aligned}
\]

In summary, the term $\log\{P(D)/P(D^c)\}$ cancels in the difference, and the estimation of $\beta_1$ using retrospective sampling is based on
\[
\begin{aligned}
\beta_1
&= \log\left\{\frac{P(B \mid D)}{P(B \mid D^c)}\right\} - \log\left\{\frac{P(B^c \mid D)}{P(B^c \mid D^c)}\right\} \\
&= \log\left\{\frac{P(B \mid D)}{P(B^c \mid D)}\right\} - \log\left\{\frac{P(B \mid D^c)}{P(B^c \mid D^c)}\right\}.
\end{aligned}
\]

Example: estimate of $\beta_1$ under prospective sampling

The example uses the boys in the "Bottle only" and "Breast only" columns.

- Given that the boy is breast fed, the log-odds of having a respiratory disease are
\[
\log\frac{47}{494 - 47} = \log\frac{47}{447} = -2.25 .
\]
- Given that the boy is bottle fed, the log-odds of having a respiratory disease are
\[
\log\frac{77}{458 - 77} = \log\frac{77}{381} = -1.60 .
\]
- The difference between the above two log-odds is $\beta_1 = -1.60 - (-2.25) = 0.65$.

Example: estimate of $\beta_1$ under retrospective sampling

- Given that the boy has the disease, the log-odds of bottle feeding versus breast feeding are
\[
\log\frac{77}{47} = 0.49 .
\]
- Given that the boy does not have the disease, the log-odds of bottle feeding versus breast feeding are
\[
\log\frac{458 - 77}{494 - 47} = \log\frac{381}{447} = -0.16 .
\]
- The difference between the above two log-odds is $\beta_1 = 0.49 - (-0.16) = 0.65$.

Logistic regression for prospective samples

- Let $p(x)$ be the unconditional probability that an individual with predictors $x$ has the disease.
- If the data are collected by prospective sampling, we can model them using a logistic regression as follows:
\[
Y \sim \mathrm{Bernoulli}(p(x)) \quad\text{and}\quad \log\left\{\frac{p(x)}{1 - p(x)}\right\} = \beta^T x .
\]
- However, in a retrospective study the sample is not representative of the population, so we cannot use the above model directly.

Logistic regression for retrospective samples

- To use the logistic regression model for retrospective samples, we replace the unconditional probability with a conditional probability.
- Let $p^*(x)$ be the conditional probability that an individual has the disease given that he or she was included in the study.
- We then model the data using a logistic regression as follows:
\[
Y \sim \mathrm{Bernoulli}(p^*(x)) \quad\text{and}\quad \log\left\{\frac{p^*(x)}{1 - p^*(x)}\right\} = \beta^{*T} x .
\]

Effect of sampling

- Let $I$ be the event that the individual is included in the study and let $D$ be the event that the subject has the disease.
- Let $\pi_0$ be the inclusion probability of an individual who does not have the disease and $\pi_1$ be the inclusion probability of an individual who has the disease.
- Applying Bayes' formula,
\[
p^*(x) = P(D \mid I) = \frac{P(I \mid D)P(D)}{P(I \mid D)P(D) + P(I \mid D^c)P(D^c)}
= \frac{\pi_1 p(x)}{\pi_1 p(x) + \pi_0\{1 - p(x)\}} .
\]
- Using the above relationship, we have
\[
\log\left\{\frac{p^*(x)}{1 - p^*(x)}\right\}
= \log\left\{\frac{\pi_1 p(x)}{\pi_0\{1 - p(x)\}}\right\}
= \log\frac{\pi_1}{\pi_0} + \log\left\{\frac{p(x)}{1 - p(x)}\right\}.
\]
- Therefore, we obtain
\[
\beta^{*T} x = \log\frac{\pi_1}{\pi_0} + \beta^T x .
\]
- The difference is only in the intercept term. We are not able to estimate the intercept $\beta_0$ from a retrospective study, but we can still estimate the coefficients associated with the other predictors.
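The two calculations in these slides can be checked numerically. The following is a minimal sketch in Python (standard library only), assuming the boys' "Bottle only" and "Breast only" counts from the feeding table. The function names are illustrative, and the inclusion probabilities `pi1 = 0.9` and `pi0 = 0.01` are hypothetical values chosen only to illustrate the intercept shift; they are not taken from any study.

```python
import math

# Boys' counts from the feeding table (cases / total).
bottle_cases, bottle_total = 77, 458
breast_cases, breast_total = 47, 494
bottle_noncases = bottle_total - bottle_cases  # 381
breast_noncases = breast_total - breast_cases  # 447


def beta1_prospective():
    """Difference in log-odds of disease: bottle fed minus breast fed."""
    return (math.log(bottle_cases / bottle_noncases)
            - math.log(breast_cases / breast_noncases))


def beta1_retrospective():
    """Difference in log-odds of bottle feeding: diseased minus non-diseased."""
    return (math.log(bottle_cases / breast_cases)
            - math.log(bottle_noncases / breast_noncases))


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def sampled_probability(p, pi1, pi0):
    """p*(x): probability of disease given inclusion in the study."""
    return pi1 * p / (pi1 * p + pi0 * (1.0 - p))


def intercept_shift(p, pi1, pi0):
    """logit(p*) - logit(p); by the derivation this equals log(pi1 / pi0)."""
    return logit(sampled_probability(p, pi1, pi0)) - logit(p)


if __name__ == "__main__":
    print(f"prospective beta1:   {beta1_prospective():.2f}")
    print(f"retrospective beta1: {beta1_retrospective():.2f}")

    # Hypothetical inclusion probabilities (assumption, for illustration only).
    pi1, pi0 = 0.9, 0.01
    for p in (0.05, 0.2, 0.5):
        print(f"p={p}: shift={intercept_shift(p, pi1, pi0):.4f}, "
              f"log(pi1/pi0)={math.log(pi1 / pi0):.4f}")
```

Both estimates of $\beta_1$ are the same log odds ratio of the 2x2 table, which is why they agree, and the shift in the logit does not depend on $p(x)$, which is why only the intercept is affected by retrospective sampling.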