Case control studies
Example: breast cancer study of Richardson et al. (2006)
“Frozen tissue samples of 43 primary, sporadic, clinically and
pathologically annotated breast tumors and four tumors from
BRCA1 mutation carriers were obtained as anonymous
samples from the Harvard Breast SPORE blood and tissue
repository.”
“Gene expression array data from 11 samples of normal breast
organoid preparations (collagenase digested and enriched for
epithelial elements) were obtained from Dr. Kornelia Polyak.”
Example continued
- The study is an observational study. Subjects were not randomized.
- It uses retrospective sampling. Samples with disease and without disease were selected at the beginning of the study.
- Gene expression was measured after the samples were selected.
Prospective sampling
Prospective sampling: the predictors are fixed and then the
outcomes are observed. The study usually involves taking
a cohort of subjects and following them over a long period.
Source of figure: Wikipedia.
Retrospective sampling
Retrospective sampling: the outcomes are fixed and then the
predictors are recorded. A retrospective study looks backwards
and examines exposures to suspected risk factors in relation
to an outcome of interest.
Source of figure: Wikipedia.
Prospective cohort study
- Outcome is measured after exposure
- Best for common outcomes
- Expensive
- Requires large sample size
- Takes a long time to complete
- Yields true incidence rates and relative risks
Case control study
- Outcome status is determined before exposure is assessed
- Controls are selected on the basis of not having the outcome
- Good for rare outcomes
- Relatively inexpensive
- Smaller sample size required
- Quicker to complete
- Prone to selection bias
Example: respiratory disease in children
Consider the following data from a study on infant respiratory
disease (Payne, 1987). The table includes the proportions of
children developing bronchitis or pneumonia in their first year of
life by type of feeding and gender.
         Bottle only   Breast with supplement   Breast only
Boys     77/458        19/147                   47/494
Girls    48/384        16/127                   31/464
Relation between feeding types and disease
Let X be the indicator of bottle feeding. Namely,
$$X = \begin{cases} 0 & \text{if breast feeding;} \\ 1 & \text{if bottle feeding.} \end{cases}$$
Let Y be the indicator of having a respiratory disease. Namely,
$$Y = \begin{cases} 0 & \text{if no disease;} \\ 1 & \text{if having disease.} \end{cases}$$
Logistic regression model
$$Y_i \sim \mathrm{Bernoulli}(p_i), \qquad \log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1 X_i.$$
- $\beta_1$: a unit increase in $X$ changes the log-odds of disease by $\beta_1$. Namely, $\beta_1$ is the difference between the log-odds of having a respiratory disease under bottle feeding and under breast feeding.
- How can $\beta_1$ be estimated under prospective and retrospective sampling?
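As a quick numerical check of this interpretation of β1, the sketch below evaluates the model at X = 0 and X = 1 and confirms that the two log-odds differ by exactly β1. The coefficient values are hypothetical, chosen only for illustration; they are not estimates from the feeding data.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse of logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients (assumed for illustration only).
beta0, beta1 = -2.0, 0.5

p_x0 = inv_logit(beta0)          # P(Y = 1 | X = 0)
p_x1 = inv_logit(beta0 + beta1)  # P(Y = 1 | X = 1)

# Difference of log-odds between X = 1 and X = 0.
diff = logit(p_x1) - logit(p_x0)
print(p_x0, p_x1, diff)
```

The printed difference should match `beta1` up to floating-point rounding, whatever values are chosen for the two coefficients.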
Estimation of β1 using prospective sampling
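Let $B$ = {the boy is bottle fed}, so that $B^c$ means that the boy is breast fed; $B$ corresponds to $X = 1$ and $B^c$ to $X = 0$. Let $D$ = {the boy has a respiratory disease}.
For prospective sampling,
$$\beta_1 = \log\left\{\frac{P(D \mid B)}{1 - P(D \mid B)}\right\} - \log\left\{\frac{P(D \mid B^c)}{1 - P(D \mid B^c)}\right\}.$$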
Estimation of β1 using retrospective sampling
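For retrospective sampling, we cannot estimate $P(D \mid B)$ and $P(D \mid B^c)$ directly, but we can estimate $P(B \mid D)$ and $P(B \mid D^c)$. Applying Bayes’ formula,
$$P(D \mid B) = \frac{P(B \mid D)P(D)}{P(B)} \quad \text{and} \quad 1 - P(D \mid B) = \frac{P(B) - P(B \mid D)P(D)}{P(B)}.$$
Therefore, using $P(B) - P(B \mid D)P(D) = P(B \mid D^c)P(D^c)$,
$$\begin{aligned}
\log\left\{\frac{P(D \mid B)}{1 - P(D \mid B)}\right\}
&= \log\left\{\frac{P(B \mid D)P(D)}{P(B) - P(B \mid D)P(D)}\right\} \\
&= \log\left\{\frac{P(B \mid D)P(D)}{P(B \mid D^c)P(D^c)}\right\} \\
&= \log\left\{\frac{P(B \mid D)}{P(B \mid D^c)}\right\} + \log\left\{\frac{P(D)}{P(D^c)}\right\}.
\end{aligned}$$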
Estimation of β1 using retrospective sampling
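Similarly, applying Bayes’ formula,
$$P(D \mid B^c) = \frac{P(B^c \mid D)P(D)}{P(B^c)} \quad \text{and} \quad 1 - P(D \mid B^c) = \frac{P(B^c) - P(B^c \mid D)P(D)}{P(B^c)}.$$
Therefore, using $P(B^c) - P(B^c \mid D)P(D) = P(B^c \mid D^c)P(D^c)$,
$$\begin{aligned}
\log\left\{\frac{P(D \mid B^c)}{1 - P(D \mid B^c)}\right\}
&= \log\left\{\frac{P(B^c \mid D)P(D)}{P(B^c) - P(B^c \mid D)P(D)}\right\} \\
&= \log\left\{\frac{P(B^c \mid D)P(D)}{P(B^c \mid D^c)P(D^c)}\right\} \\
&= \log\left\{\frac{P(B^c \mid D)}{P(B^c \mid D^c)}\right\} + \log\left\{\frac{P(D)}{P(D^c)}\right\}.
\end{aligned}$$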
Estimation of β1 using retrospective sampling
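In summary, the $\log\{P(D)/P(D^c)\}$ terms cancel, and $\beta_1$ under retrospective sampling is
$$\begin{aligned}
\beta_1 &= \log\left\{\frac{P(B \mid D)}{P(B \mid D^c)}\right\} - \log\left\{\frac{P(B^c \mid D)}{P(B^c \mid D^c)}\right\} \\
&= \log\left\{\frac{P(B \mid D)}{P(B^c \mid D)}\right\} - \log\left\{\frac{P(B \mid D^c)}{P(B^c \mid D^c)}\right\}.
\end{aligned}$$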
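The equality of the prospective and retrospective expressions for β1 can be checked numerically. The sketch below assumes hypothetical population values for P(D|B), P(D|B^c) and P(B) (made up for illustration), derives P(B|D) and P(B|D^c) through Bayes' formula, and compares the two log-odds differences.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

# Hypothetical population quantities (assumed for illustration only).
p_d_given_b = 0.17    # P(D | B)
p_d_given_bc = 0.10   # P(D | B^c)
p_b = 0.45            # P(B)

# Law of total probability: P(D).
p_d = p_d_given_b * p_b + p_d_given_bc * (1.0 - p_b)

# Bayes' formula: exposure probabilities given disease status.
p_b_given_d = p_d_given_b * p_b / p_d
p_b_given_dc = (1.0 - p_d_given_b) * p_b / (1.0 - p_d)

# beta1 from disease log-odds given exposure (prospective form).
beta1_prospective = logit(p_d_given_b) - logit(p_d_given_bc)

# beta1 from exposure log-odds given disease status (retrospective form).
beta1_retrospective = logit(p_b_given_d) - logit(p_b_given_dc)

print(beta1_prospective, beta1_retrospective)
```

The two printed values should agree up to rounding error; changing `p_b` alters P(D) but should leave both values unchanged, which reflects the cancellation of the log{P(D)/P(D^c)} terms.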
Example: estimate of β1 under prospective sampling
- Given that the boy is breast fed, the log-odds of having a respiratory disease are
$$\log\frac{47}{494 - 47} = \log\frac{47}{447} = -2.25.$$
- Given that the boy is bottle fed, the log-odds of having a respiratory disease are
$$\log\frac{77}{458 - 77} = \log\frac{77}{381} = -1.60.$$
- The difference between the above two log-odds is
$$\beta_1 = -1.60 - (-2.25) = 0.65.$$
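The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch using the boys' counts from the table (77/458 for bottle only, 47/494 for breast only); the helper name `log_odds` is an illustrative choice.

```python
import math

# Boys' counts from the table: (cases, total) per feeding type.
cases_bottle, n_bottle = 77, 458
cases_breast, n_breast = 47, 494

def log_odds(cases, total):
    """Log-odds of disease: log(cases / non-cases)."""
    return math.log(cases / (total - cases))

lo_breast = log_odds(cases_breast, n_breast)  # log(47 / 447)
lo_bottle = log_odds(cases_bottle, n_bottle)  # log(77 / 381)

# Bottle feeding minus breast feeding.
beta1_prospective = lo_bottle - lo_breast

print(round(lo_breast, 2), round(lo_bottle, 2), round(beta1_prospective, 2))
```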
Example: estimate of β1 under retrospective sampling
- Among boys who have the disease, the log-odds of bottle feeding versus breast feeding are
$$\log\frac{77}{47} = 0.49.$$
- Among boys who do not have the disease, the log-odds of bottle feeding versus breast feeding are
$$\log\frac{458 - 77}{494 - 47} = \log\frac{381}{447} = -0.16.$$
- The difference between the above two log-odds is
$$\beta_1 = 0.49 - (-0.16) = 0.65.$$
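The retrospective calculation can be sketched in the same way, again with the boys' counts from the table. The last step is an added illustration, not part of the original example: it scales the non-case counts by a hypothetical sampling fraction of 10% to mimic keeping only a subset of controls, and shows that the estimate of β1 does not move.

```python
import math

# Boys' counts from the table, split into cases and non-cases.
cases_bottle, noncases_bottle = 77, 458 - 77
cases_breast, noncases_breast = 47, 494 - 47

# Log-odds of bottle vs breast feeding among cases and among non-cases.
lo_feeding_cases = math.log(cases_bottle / cases_breast)
lo_feeding_noncases = math.log(noncases_bottle / noncases_breast)
beta1_retrospective = lo_feeding_cases - lo_feeding_noncases

# Prospective form, for comparison.
beta1_prospective = (math.log(cases_bottle / noncases_bottle)
                     - math.log(cases_breast / noncases_breast))

# Hypothetical: retain only a fraction of the non-cases (controls).
control_fraction = 0.1
beta1_subsampled = lo_feeding_cases - math.log(
    (control_fraction * noncases_bottle) / (control_fraction * noncases_breast)
)

print(round(beta1_retrospective, 2),
      round(beta1_prospective, 2),
      round(beta1_subsampled, 2))
```

All three values coincide because each is the log of the same cross-product ratio, (77 × 447) / (47 × 381); the sampling fraction cancels inside the ratio of non-case counts.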
Logistic regression for prospective samples
- Let $p(x)$ be the unconditional probability that an individual with predictors $x$ has the disease.
- If the data are collected by prospective sampling, we can model them using a logistic regression as follows:
$$Y \sim \mathrm{Bernoulli}(p(x)) \quad \text{and} \quad \log\left\{\frac{p(x)}{1 - p(x)}\right\} = \beta^T x.$$
- However, in a retrospective study, the sample is not representative of the population, so we cannot use the above model directly.
Logistic regression for retrospective samples
- To use the logistic regression model for retrospective samples, we replace the unconditional probability with a conditional probability.
- Let $p^*(x)$ be the conditional probability that an individual has the disease given that he or she was included in the study.
- We then model the data using a logistic regression as follows:
$$Y \sim \mathrm{Bernoulli}(p^*(x)) \quad \text{and} \quad \log\left\{\frac{p^*(x)}{1 - p^*(x)}\right\} = \beta^{*T} x.$$
Effect of sampling
- Let $I$ be the event that the individual is included in the study and let $D$ be the event that the subject has the disease.
- Let $\pi_0$ be the inclusion probability of an individual who does not have the disease and $\pi_1$ be the inclusion probability of an individual who has the disease.
- Applying Bayes’ formula,
$$p^*(x) = P(D \mid I) = \frac{P(I \mid D)P(D)}{P(I \mid D)P(D) + P(I \mid D^c)P(D^c)} = \frac{\pi_1 p(x)}{\pi_1 p(x) + \pi_0\{1 - p(x)\}}.$$
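To get a feel for how inclusion probabilities distort the disease probability, the sketch below evaluates the formula for p*(x) at a few values of p(x). The inclusion probabilities (all diseased individuals included, 10% of non-diseased individuals included) and the function name `p_star` are assumptions made for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def p_star(p, pi1, pi0):
    """P(D | I): disease probability among included individuals."""
    return pi1 * p / (pi1 * p + pi0 * (1.0 - p))

# Hypothetical inclusion probabilities (assumed for illustration only).
pi1, pi0 = 1.0, 0.1

shifts = []
for p in (0.01, 0.05, 0.2, 0.5):
    ps = p_star(p, pi1, pi0)
    # Change in log-odds caused by the sampling scheme.
    shifts.append(logit(ps) - logit(p))
    print(p, round(ps, 4), round(shifts[-1], 4))
```

The probabilities themselves change a great deal, but the change in log-odds is the same for every p(x): it equals log(π1/π0), here log(10).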
Effect of sampling
- Using the above relationship, we have
$$\log\left\{\frac{p^*(x)}{1 - p^*(x)}\right\} = \log\left\{\frac{\pi_1 p(x)}{\pi_0\{1 - p(x)\}}\right\} = \log\frac{\pi_1}{\pi_0} + \log\left\{\frac{p(x)}{1 - p(x)}\right\}.$$
- Therefore, we obtain that
$$\beta^{*T} x = \log\frac{\pi_1}{\pi_0} + \beta^T x.$$
- The difference is in the intercept term. We are not able to estimate the intercept $\beta_0$ using a retrospective study, but we can still estimate the coefficients associated with the other predictors.
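A small simulation can illustrate this conclusion. The sketch below is a rough check under stated assumptions, not a definitive implementation: the population coefficients, the inclusion probabilities, the single standard-normal predictor, and the hand-written Newton-Raphson fitter `fit_logistic` are all hypothetical choices made for illustration. It fits the logistic model once to the whole simulated population and once to a retrospective sample drawn from it.

```python
import math
import random

def fit_logistic(xs, ys, n_iter=20):
    """Fit logit P(Y=1|x) = b0 + b1*x by Newton-Raphson; return (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

random.seed(0)
beta0, beta1 = -2.0, 1.0   # hypothetical population coefficients
pi1, pi0 = 1.0, 0.2        # hypothetical inclusion probabilities

# Simulate a population in which the logistic model holds.
xs, ys = [], []
for _ in range(40000):
    x = random.gauss(0.0, 1.0)
    p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
    xs.append(x)
    ys.append(1 if random.random() < p else 0)

# Retrospective sample: inclusion depends only on disease status.
sample = [(x, y) for x, y in zip(xs, ys)
          if random.random() < (pi1 if y == 1 else pi0)]
xs_cc = [x for x, _ in sample]
ys_cc = [y for _, y in sample]

b0_full, b1_full = fit_logistic(xs, ys)
b0_cc, b1_cc = fit_logistic(xs_cc, ys_cc)

print("slopes:", round(b1_full, 3), round(b1_cc, 3))
print("intercept shift:", round(b0_cc - b0_full, 3),
      "vs log(pi1/pi0) =", round(math.log(pi1 / pi0), 3))
```

Up to sampling noise, the two slope estimates should be close to each other, while the intercepts should differ by roughly log(π1/π0). Without knowing π1 and π0, the retrospective fit alone gives no way to recover the population intercept.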