Bias in Studies of the
Human Genome
Thomas A. Pearson, MD, PhD
University of Rochester
School of Medicine
Visiting Scientist, NHGRI
Lecture 6: Bias in Studies of the
Human Genome
1. Consider the causes of heterogeneity of
results in gene association studies.
2. Review the types and sources of bias
relevant to human genomic research.
3. Provide examples from genome-wide
association studies to illustrate biases or
potential for bias.
4. Identify strategies in study design, data
collection, statistical analysis, and
interpretation which could prevent or
minimize bias in human genome research.
Larson, G. The Complete Far Side. 2003.
PLoS Med. 2005 Aug;2(8):e124.
WSJ. 2004 Sep 14.
Only 6/600 Gene-Disease
Associations Significant in >75%
of Studies (Hirschhorn J et al,
Genet Med 2002; 4:45-61)
Disease/Trait              Gene    Polymorphism     Frequency
DVT                        F5      Arg506Gln        0.015
Graves’ Disease            CTLA4   Thr17Ala         0.62
Type 1 DM                  INS     5’ VNTR          0.67
HIV/AIDS                   CCR5    32 bp Ins/Del    0.05-0.07
Alzheimer’s                APOE    Epsilon 2/3/4    0.16-0.24
Creutzfeldt-Jakob Disease  PRNP    Met129Val        0.37
Hirschhorn J et al, Genet Med 2002; 4:45-61
Possible Explanations of
Heterogeneity of Results in
Genetic Association Studies
• Biologic mechanisms
– Genetic heterogeneity
– Gene-gene interactions
– Gene-environment interactions
• Spurious mechanisms
– Inadequacies of genomic markers
– Type 1 error
– Limited sample sizes and power
– Cohort, age, period (secular) effects
– Bias
Definition of Bias in Human
Research
• Sackett (1975): “Any process at any stage
of inference which tends to produce
results or conclusions that differ
systematically from the truth.”
• Gordis (2004): “Any systematic error in
the design, conduct, or analysis of a study
that results in a mistaken estimate of an
exposure’s effect on risk or disease.”
Effects of Bias on GWAS Results
• False negatives
• False positives
• Inaccurate effect sizes
– Underestimates
– Overestimates
Types of Bias in Genome
Association Studies
• Selection of cases and controls
• Information on genotype or phenotype
• Analysis and presentation of results
• Interpretation of results
20 Types of Biases Potentially
Encountered in GWAS
• Common to all human observational studies
(N=12)
• Unique or common in GWAS (N=8)
– Supercase or supercontrol biases
– Latent case bias
– Population stratification
– Hardy-Weinberg disequilibrium
– Genotyping quality bias
– Transmission disequilibrium bias
– Winner’s Curse
Systematic Review of GWAS:
NHGRI Catalog of GWAS in Print*
• 109 studies from 3/05 to 3/08.
• Genotyping platforms of density>100,000 SNPs
• Each study reviewed for:
– Study design
– Description of case and comparison groups
– Collection of genotype and other risk factors
– Presentation of study results
– Interpretation of study results
*http://www.genome.gov/gwastudies/
Characteristics of 109 GWAS
• Phenotypes
– Discrete outcomes or traits: 91 in 83 studies
– Quantitative traits: 40 in 26 studies
• Design of discovery study     N     %
– Case-control                 77    70.6
– Trio                          4     3.7
– Nested case-control           4     3.7
– Cross-sectional/Cohort       24    22.0
Four Key Requirements for a
Bias-Free Case-Control Study
Selection Bias
– Cases are representative of all those who develop the
disease being studied.
– Controls are representative of all those at risk of
developing the disease and eligible to become cases
and be included in the study.
– Ancestral geographical origins and predominant
environmental exposures of cases do not differ
dramatically from controls.
Information Bias
– Collection of risk factor and exposure
information is the same for cases and
controls.
Selection Biases in GWAS:
Criteria for Classification
• Misclassification bias: Absence of description or
use of adequate means to define cases and/or
controls.
• Nonresponse bias: Absence of description of
rates of recruitment and participation in cases
and/or controls.
• Prevalence-incidence bias: Use of prevalent
cases of disease which have sizable short term
case-fatality or remission rates.
Characteristics of 109 GWAS:
Selection of Study Subjects
• Methods of selection/recruitment frequently
(30%) described in supplement or other
publication.
• Few baseline descriptors of cases/controls
– Tables comparing cases vs. controls: 36.0%
– Statistical comparison of cases/controls: 3.5%
– Participation rates (cases or controls): 9.0%
• Comparison of participants/nonparticipants: 2.0%
• Most cases (67%) prevalent cases derived from
clinical sources, rather than population-based or
incident cases.
GWAS of Type II Diabetes in
Mexican-Americans*
• Case-control study design
– 281 cases with diabetes defined by current Dx/RX or
fasting blood glucose or 2 hour GTT
– 280 persons from a random population sample whose
T2DM status is unknown
• 112,541 SNPs assayed in each person
• 4 genes identified
• ?Misclassification: Substantial prevalence (7-14%) of T2DM likely in controls.
*Hayes MG et al. Diabetes, 9/10/07.
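The effect of unscreened controls on an odds ratio can be sketched with a few lines of arithmetic. This is only an illustration: the carrier frequencies (0.30 in cases, 0.25 in unaffected persons) are hypothetical values chosen for the example, and the 7% and 14% fractions of undiagnosed T2DM among controls are used as illustrative inputs. The sketch assumes undiagnosed cases in the control group carry the allele at the same frequency as recognized cases.

```python
def odds(p):
    """Convert a probability (here, a carrier frequency) to odds."""
    return p / (1.0 - p)


def observed_odds_ratio(p_case, p_noncase, frac_latent):
    """Odds ratio seen when a fraction of 'controls' actually have the disease.

    p_case      -- carrier frequency among true cases (hypothetical)
    p_noncase   -- carrier frequency among truly unaffected persons (hypothetical)
    frac_latent -- fraction of the control group with undiagnosed disease
    """
    # The control group is a mixture of unaffected persons and undiagnosed cases.
    p_controls = frac_latent * p_case + (1.0 - frac_latent) * p_noncase
    return odds(p_case) / odds(p_controls)


true_or = observed_odds_ratio(0.30, 0.25, 0.0)
diluted = {frac: observed_odds_ratio(0.30, 0.25, frac) for frac in (0.07, 0.14)}

print(f"no misclassification: OR = {true_or:.3f}")
for frac, value in diluted.items():
    print(f"{frac:.0%} of controls undiagnosed: observed OR = {value:.3f}")
```

Under these assumptions the observed odds ratio moves toward 1 as the undiagnosed fraction grows, which is the direction of bias (a false negative or underestimate) this kind of misclassification would be expected to produce.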
Selection Biases in GWAS:
Criteria for Classification
• Supercase bias: Use of additional criteria
in case selection that increases the
chance of a genetic etiology.
• Supercontrol bias: Use of additional
criteria in control selection that decreases
the chance of a genetic etiology.
• Latent case bias: Inclusion as controls of
persons who could never develop the
disease even if a gene carrier.
A Case-Control GWAS
of Prostate Cancer*
• Discovery Study
– 1854 cases with symptomatic prostate cancer and
diagnosis <60 years or positive family history.
– 1894 controls with age >50 years and PSA <0.5 ng/ml.
– Genotyping of 541,129 SNPs
– 11 new SNPs associated (P<10^-6)
• Replication Study
– 3268 cases/3354 controls
– Genotyping of 11 SNPs
– 7 SNPs independently associated (P<10^-7)
*Eeles RA et al. NatGen 2/10/08
Prostate Cancer: 7 Novel SNPs in
Discovery and Replication Studies
              Discovery          Replication
SNP           OR     95%CI       OR     95%CI
rs2660753     1.52   1.30-1.77   1.18   1.06-1.31
rs9364554     1.28   1.16-1.41   1.17   1.08-1.26
rs6465657     1.30   1.19-1.43   1.12   1.05-1.20
rs7920517     1.39   1.27-1.53   1.22   1.14-1.31
rs10993994    1.62   1.47-1.78   1.25   1.17-1.34
rs7931342     0.79   0.72-0.86   0.84   0.79-0.90
rs7931342     1.39   1.23-1.57   1.03   0.94-1.14
Eeles RA et al: Nat Gen 2/10/08
Latent Cases in a GWAS of
Prostate Cancer*
                           Controls
                  Cases    Male     Female
Discovery Study
  Iceland         1890     9312     12060
Replications
  Netherlands      998     1004      1017
  Spain            548      742       874
  Sweden          2893     1781         -
  US-Baltimore    1545      576         -
  US-Chicago       665      368       184
  US-Nashville     526      613         -
  US-Rochester    1140      503         -
*Gudmundsson J et al. Nat Gen 2008; 40:281-3
Selection Biases in GWAS:
Criteria for Classification
• Membership bias: Membership in a group may
imply a degree of health which differs
systematically from that of the general
population.
• Population Stratification: Genetic differences
between cases and controls unrelated to
disease but due to sampling from populations of
different ancestries.
• Phenotypic variation bias: The use of different
definitions of cases or controls between
discovery study and subsequent replications.
Wellcome Trust Case Control
Consortium (WTCCC)*
Genotyping: 500,000 SNPs (Affymetrix)
Cases: 2000 persons from each of 7
diseases (bipolar disorder, coronary
artery disease, Crohn disease,
rheumatoid arthritis, T1DM,
T2DM, hypertension)
Controls: 3000 persons without disease
– 1500 in 1958 British Birth Cohort
– 1500 UK blood donors
*WTCCC, Nature 2007; 447:661-678.
Population Stratification*
Each population has unique genetic and social
history; ancestral patterns of migration, mating,
expansions/bottlenecks, stochastic variation all
yield differences in allele frequencies between
populations.
Population stratification: cases and controls have
different allele frequencies due to diversity in
populations of origin and unrelated to outcome,
requiring:
1) differences in disease prevalence
2) differences in allele frequencies
*Cardon LR, Palmer LJ, Lancet 2003
Population Stratification and
Allelic Association
Full heritage Am. Indian population
– Gm3;5,13,14 prevalence: 1%
– NIDDM prevalence: 40%
Caucasian population
– Gm3;5,13,14 prevalence: 66%
– NIDDM prevalence: 15%
Unstratified sample:
Gm3;5,13,14 haplotype    NIDDM +    NIDDM -
+                        7.8%       92.2%
-                        29.0%      71.0%
OR = 0.27 [0.18, 0.40]
Stratified by index of Indian heritage:
Index of Indian heritage    Gm3;5,13,14 +    Gm3;5,13,14 -
0                           17.8%            19.9%
4                           28.3%            28.8%
8                           35.9%            39.3%
Cardon LR and Palmer LJ, Lancet 2003; 361:598-604, after
Knowler et al 1988.
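The contrast between the crude and the heritage-stratified comparison can be checked with simple arithmetic. The sketch below assumes that 7.8% and 29.0% are the NIDDM prevalences among haplotype carriers and non-carriers in the unstratified sample, and that the percentages listed by index of Indian heritage are NIDDM prevalences in carriers and non-carriers within each stratum. The ratio of the two crude prevalences reproduces the 0.27 shown on the slide.

```python
def prevalence_ratio(prev_carriers, prev_noncarriers):
    """Ratio of disease prevalence in haplotype carriers vs. non-carriers."""
    return prev_carriers / prev_noncarriers


# Unstratified sample: NIDDM prevalence (%) in Gm3;5,13,14 carriers and non-carriers.
crude_ratio = prevalence_ratio(7.8, 29.0)

# Index of Indian heritage -> (prevalence % in carriers, prevalence % in non-carriers).
by_heritage = {0: (17.8, 19.9), 4: (28.3, 28.8), 8: (35.9, 39.3)}
stratum_ratios = {
    index: prevalence_ratio(carriers, noncarriers)
    for index, (carriers, noncarriers) in by_heritage.items()
}

print(f"crude ratio: {crude_ratio:.2f}")
for index, ratio in stratum_ratios.items():
    print(f"heritage index {index}: ratio {ratio:.2f}")
```

Within each heritage stratum the ratio sits close to 1, so most of the apparent protective effect of the haplotype in the crude comparison tracks ancestry rather than the haplotype itself.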
Unlinked Genetic Markers
in Population Stratification
• Population stratification (or any non-random
mating) allows marker-allele frequencies to
vary among population segments.
• Disease more prevalent in one
subpopulation will be associated with any
alleles in high frequency in that
subpopulation.
• If population stratification exists, can often
be detected by analysis of unlinked marker
loci. [Pritchard JK, Rosenberg NA; AJHG
1999; 65:220-228]
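One widely used way to turn unlinked markers into a stratification check is the genomic control inflation factor (lambda). This is a different technique from the test in the paper cited above and is shown only as a minimal sketch. The chi-square statistics below are hypothetical values, and the function names are illustrative. Lambda is the median association statistic at presumed-null markers divided by the median of a 1-df chi-square distribution (about 0.455); values well above 1 suggest inflation from stratification or other systematic differences.

```python
import statistics

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(null_marker_chi2):
    """Genomic control lambda from 1-df chi-square statistics at unlinked markers."""
    return statistics.median(null_marker_chi2) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chi2_stat, lam):
    """Deflate a test statistic by lambda (never inflate when lambda < 1)."""
    return chi2_stat / max(lam, 1.0)


# Hypothetical 1-df chi-square statistics from markers assumed unrelated to disease.
null_stats = [0.2, 0.5, 0.9, 1.4, 0.7, 0.3, 1.1, 0.6, 2.0]
lam = inflation_factor(null_stats)

# Hypothetical statistic for a candidate SNP, before and after adjustment.
candidate_chi2 = 25.0
adjusted_chi2 = genomic_control_adjust(candidate_chi2, lam)

print(f"lambda = {lam:.2f}")
print(f"candidate chi-square: {candidate_chi2:.1f} -> {adjusted_chi2:.1f} after adjustment")
```

In a real analysis the null set would contain thousands of markers rather than nine; the short list here only keeps the arithmetic easy to follow.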
Adjusting for Population
Stratification in a GWAS of T2DM*
• Case-control study of 661 cases of T2DM and
614 controls from France.
• Genotyping assayed 392,935 SNPs
• SNP 200kb from lactase gene on 2q21:
– Strong association with T2DM
– Strong north-south prevalence gradient in France
• Used 20,323 SNPs not related to T2DM as
measure of population stratification.
• After adjustment for stratification, most of the
association was removed.
*Sladek R et al. Nature 2007; 445: 881-885.
Phenotypic Variation Bias: Are
the cases homogeneous?
• GWAS of Atrial Fibrillation*
– Sample 1: hospital diagnosis of AF “confirmed by 12-lead ECG”.
– Sample 2: patients with ischemic stroke or TIA,
diagnosis of AF “based on 12-lead ECG.”
– Sample 3: patients hospitalized with acute stroke
“diagnosed with AF.”
– Sample 4: patients with lone AF or AF plus
hypertension referred to arrhythmia service, “AF
documented by ECG.”
Gudbjartsson et al, Nature 2007; 448: 353-357
Information Bias: Systematic
differences in data collection
between cases and controls
• Genotyping quality bias: Lack of
genotyping protocol for exclusion of SNPs
for quality control criteria or publication of
call rate.
– Testing for Hardy-Weinberg disequilibrium
– Transmission disequilibrium testing:
differential rate of genotyping error leading to
distortion of allele frequency in cases/controls
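A Hardy-Weinberg check of the kind listed above can be sketched as a 1-df chi-square goodness-of-fit test on genotype counts. The counts in the usage lines are hypothetical, and the function name is illustrative. For one degree of freedom the upper-tail p-value equals erfc(sqrt(chi2 / 2)), which avoids any dependency outside the standard library. Exact tests are often preferred for rare alleles, so treat this as a rough screen rather than a recommended QC rule.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg proportions for one biallelic SNP.

    Returns (chi2, p_value). Counts are numbers of AA, AB and BB genotypes.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        # Monomorphic marker: no departure from HWE can be measured.
        return 0.0, 1.0
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts: one SNP in equilibrium, one with too few heterozygotes.
in_equilibrium = hwe_chi_square(250, 500, 250)
het_deficit = hwe_chi_square(400, 200, 400)

print(f"in equilibrium: chi2 = {in_equilibrium[0]:.1f}, p = {in_equilibrium[1]:.3g}")
print(f"heterozygote deficit: chi2 = {het_deficit[0]:.1f}, p = {het_deficit[1]:.3g}")
```

A marked heterozygote deficit in controls is one pattern that can point to genotype-calling problems, which is why departures from equilibrium are commonly used as a filter before association testing.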
Is DNA Collected and Handled
Identically in Cases and Controls?
• T1DM gene association study: cases from
GRID Study, controls from 1958 British Birth
Cohort Study examining 6322 SNPs.
• Samples from lymphoblastoid cell lines extracted
using same protocol in two different laboratories.
• Case and control DNAs randomly ordered with
teams masked to case/control status.
• Some extreme associations could not be
replicated by second genotyping method.
Clayton DG et al, Nat Genet 2005; 37: 1243-46.
Biases in the Analysis and
Presentation of Data
Environmental exposure information bias:
Lack of collection or presentation of known
environmental causes of the disease or
comparisons between cases and controls.
Confounding control bias: Lack of statistical
adjustment or stratified analysis in
presence of potential confounding.
Characteristics of 109 GWAS:
Confounding
• Few comparisons of environmental
exposures known to predispose to disease
between cases and controls.
– Table comparing cases and controls: 36%
– Statistical comparison of cases/controls: 3.5%
– Statistical adjustment for differences: 16%
– Stratified analysis by confounder group: 16%
Distribution of Three Known Risk
Factors for Neovascular AMD in a
GWA
Covariate       Cases (n = 96)    Controls (n = 130)
Male sex (%)    68                33
Age (yrs)       75                74
Smokers (%)     63                26
DeWan A et al, Science 2006; 314:989-992.
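The size of the case-control imbalance in this table can be expressed as an odds ratio computed directly from the reported percentages. This is a rough sketch: it uses only the rounded percentages shown, not the underlying counts, and the dictionary keys are illustrative labels.

```python
def odds(p):
    """Convert a proportion to odds."""
    return p / (1.0 - p)


# Covariate -> (proportion in cases, proportion in controls), from the table.
covariates = {
    "male sex": (0.68, 0.33),
    "smoking": (0.63, 0.26),
}

# Odds of having the covariate in cases relative to controls.
imbalance = {
    name: odds(in_cases) / odds(in_controls)
    for name, (in_cases, in_controls) in covariates.items()
}

for name, ratio in imbalance.items():
    print(f"{name}: odds ratio (cases vs. controls) = {ratio:.1f}")
```

Both covariates are several times more common in cases than in controls on the odds scale, so any genetic variant correlated with sex or smoking could appear associated with disease unless these factors are adjusted for.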
Confounding
• Confounder: “A factor that distorts the apparent
magnitude of the effect of a study factor on risk.
Such a factor is a determinant of the outcome of
interest and is unequally distributed among the
exposed and the unexposed” (Last, 1983).
– Associated with exposure
– Independent cause or predictor of disease
– Not an intermediate step in causal pathway
[Diagrams: relationships among exposure (E), confounder (C), and disease (D)]
Aschengrau and Seage, Essentials of Epidemiology in Public Health, 2003.
FTO Variants, Type 2 Diabetes,
and Obesity*
Diabetes Association
Cohort           OR [95% CI]         P
WTCCC phase 1    1.27 [1.16-1.37]    2 x 10^-8
WTCCC phase 2    1.22 [1.12-1.32]    5 x 10^-7
DGI              1.03 [0.91-1.71]    0.25
*Frayling, 2007 and Zeggini, 2007
FTO Variants, Type 2 Diabetes,
and Obesity*
BMI Association
                  TT      AT      AA
WTCCC Cases       30.2    30.5    32.0
WTCCC Controls    26.3    26.3    27.1
*Frayling 2007 and Zeggini 2007
FTO Variants, Type 2 Diabetes,
and Obesity*
Diabetes Association
Cohort           OR [95% CI]         P
WTCCC phase 1    1.27 [1.16-1.37]    2 x 10^-8
WTCCC phase 2    1.22 [1.12-1.32]    5 x 10^-7
DGI              1.03 [0.91-1.71]    0.25
Diabetes Association Adjusted for BMI
WTCCC phase 2    1.03 [0.96-1.10]    0.44
*Frayling TM, et al. Science 2007; 316: 889-894.
Zeggini E, et al. Science 2007; 316: 1336-1341.
Dealing with Confounders
• In design
– Randomize
– Restrict: confine study subjects to those within
specified category of confounder
– Match: select cases and controls so confounders
equally distributed
• In analysis
– Standardize: for age, gender, time
– Stratify: separate sample into subsamples according
to specified criteria
– Multivariate analysis: adjust for many confounders
Aschengrau and Seage, Essentials of Epidemiology in
Public Health, 2003
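Stratification can be illustrated with a Mantel-Haenszel pooled odds ratio, one common stratified estimator (the lecture does not name a specific method). All counts below are hypothetical and were chosen so that the exposure has no effect within either stratum of the confounder, yet the crude table shows a clear association.

```python
def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls."""
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Pooled odds ratio across strata, each given as a tuple (a, b, c, d)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical counts in two strata of a confounder (e.g. high vs. low BMI).
strata = [
    (80, 40, 40, 20),    # stratum 1: stratum-specific OR = 1.0
    (10, 40, 30, 120),   # stratum 2: stratum-specific OR = 1.0
]

# Collapsing the strata gives the crude (unadjusted) table.
crude_table = tuple(sum(cells) for cells in zip(*strata))
crude_or = odds_ratio(*crude_table)
adjusted_or = mantel_haenszel_or(strata)

print(f"crude OR: {crude_or:.2f}")
print(f"Mantel-Haenszel OR: {adjusted_or:.2f}")
```

The gap between the crude and pooled estimates is entirely due to the confounder being more common among cases and among the exposed, the same pattern the BMI-adjusted FTO result illustrates with real data.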
Biases in the Analysis and
Presentation of Data (Cont.)
• Alpha error control bias: Lack of correction of
level of alpha error accepted as significant.
• Data dredging bias: Lack of replication studies
testing hypotheses identified in a discovery
study.
• The winner’s curse: The overestimation of the
effect size in discovery GWAS at the extremes
of their range with inability to replicate the odds
ratios due to lack of adequate power to identify
the true odds ratio of smaller magnitude.
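Alpha error control can be made concrete with a Bonferroni correction, used here only as one simple example of a multiple-testing adjustment. The family-wise alpha of 0.05 is an assumption; the SNP counts are the numbers of SNPs assayed in three studies described earlier in this lecture, and the dictionary keys are illustrative labels.

```python
def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test significance threshold that keeps the family-wise error at family_alpha."""
    return family_alpha / n_tests


# Number of SNPs assayed in studies discussed in this lecture.
snp_counts = {
    "T2DM, Mexican-Americans": 112_541,
    "T2DM, France": 392_935,
    "Prostate cancer discovery": 541_129,
}

thresholds = {study: bonferroni_threshold(n) for study, n in snp_counts.items()}

for study, threshold in thresholds.items():
    print(f"{study}: per-SNP threshold = {threshold:.2e}")
```

Bonferroni treats all tests as independent, which is conservative for correlated SNPs, so the thresholds above are best read as a lower bound on how strict the cutoff needs to be.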
Prostate Cancer: 7 Novel SNPs in
Discovery and Replication Studies
              Discovery          Replication
SNP           OR     95%CI       OR     95%CI
rs2660753     1.52   1.30-1.77   1.18   1.06-1.31
rs9364554     1.28   1.16-1.41   1.17   1.08-1.26
rs6465657     1.30   1.19-1.43   1.12   1.05-1.20
rs7920517     1.39   1.27-1.53   1.22   1.14-1.31
rs10993994    1.62   1.47-1.78   1.25   1.17-1.34
rs7931342     0.79   0.72-0.86   0.84   0.79-0.90
rs7931342     1.39   1.23-1.57   1.03   0.94-1.14
Eeles RA et al: Nat Gen 2/10/08
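The winner's curse in this table can be quantified as the fraction of the discovery effect that disappears on replication, measured on the log odds ratio scale. This is a simple descriptive sketch, not an estimator from the cited paper; the odds ratios and SNP labels are copied as they appear in the table, and the function name is illustrative.

```python
import math

# (SNP label as shown in the table, discovery OR, replication OR)
rows = [
    ("rs2660753", 1.52, 1.18),
    ("rs9364554", 1.28, 1.17),
    ("rs6465657", 1.30, 1.12),
    ("rs7920517", 1.39, 1.22),
    ("rs10993994", 1.62, 1.25),
    ("rs7931342", 0.79, 0.84),
    ("rs7931342", 1.39, 1.03),
]


def log_or_shrinkage(discovery_or, replication_or):
    """Fraction of the discovery log odds ratio lost in replication.

    0 means the effect replicated at full size; 1 means it vanished.
    """
    return 1.0 - math.log(replication_or) / math.log(discovery_or)


shrinkage = [log_or_shrinkage(disc, rep) for _, disc, rep in rows]

for (snp, disc, rep), lost in zip(rows, shrinkage):
    print(f"{snp}: OR {disc:.2f} -> {rep:.2f}, {lost:.0%} of log OR lost")
```

Every replication estimate lies closer to 1 than its discovery estimate, which is the pattern expected when SNPs are selected because their discovery effects were extreme.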
Interpretation Biases in
Genomic Research*
• Confirmation bias: evaluating evidence that
supports one’s preconceptions differently from
evidence that challenges these convictions.
• Rescue bias: discounting data by finding
selective faults in the experiments
• Mechanism bias: being less skeptical when
underlying science furnishes credibility for the
data.
*Kaptchuk TJ. BMJ 2003; 326: 1453-5.
Information to be Included in
Initial Report
• Study information:
– Source of cases and controls
– Methods used for defining disease or trait
– Participation rates and flow chart of selection
– Standard “Table 1,” including rates of missing data
– Success rate of DNA acquisition, comparability
• Genotyping and quality control procedures
• Results
– Analysis methods in sufficient detail to understand
and reproduce what was done
– Simple single-locus and multi-marker (haplotype)
association analyses
– Significance of any known 'positive controls'
Chanock, Manolio et al, Nature 2007; 447: 655-660
Controlling Bias in Genomic
Research: Design
• Define population to be studied
• Maximize representativeness
• Use standard, reproducible methods for
assignment of case/control status
• Use incident cases
• Select controls eligible to become cases
• Estimate and maximize participation rates
• Apply standard genotyping QC methods
• Replicate positive findings on different
genotyping platform
Controlling Bias in Genomic
Research: Analysis
• Describe sources and methods of
ascertaining cases and controls
• Compare participants and non-participants
• Compare cases and controls
• Stratify and adjust for important confounders
(including population stratification)
• Stratify and test for important interactions
• Report results of genotyping QC
• Report results of prior known associations