Andrea Baccarelli, MD, PhD, MPH
Laboratory of Environmental Epigenetics
Harvard School of Public Health abaccare@hsph.harvard.edu
Genetics
• Candidate gene approach
• A priori knowledge → candidate genes
• test for association with disease/phenotype
• Genome-wide approach (GWAS)
• Agnostic approach → entire genome
• test for association with disease/phenotype
Graphical representation of GWAS findings
Manhattan plot
Systemic Sclerosis (auto-immune disease) Radstake et al., Nature Genetics 2010
Published Genome-Wide Associations through 12/2013
Published GWA at p ≤ 5×10⁻⁸ for 17 trait categories
NHGRI GWA Catalog www.genome.gov/GWAStudies www.ebi.ac.uk/fgpt/gwas/
Epigenetics
• Candidate gene (gene-specific) approach
• A priori knowledge → candidate genes
• test for association with exposure/risk factor
• test for association with disease/phenotype
• Global (average) level of methylation (5mC content)
• Average methylation of all CpG sites across the genome
• test for association with exposure/risk factor
• test for association with disease/phenotype
• Epigenome-wide approach (EWAS)
• Agnostic approach → entire genome
• test for association with exposure/risk factor
• test for association with disease/phenotype
Examples for DNA methylation
• Candidate gene approach
– AAB’s blood has 26% methylation in the IL6 promoter
(N.B.: any other region of interest can be targeted, e.g.,
CpGi shore, shelf, etc.)
• Global methylation approach
– AAB’s blood has 4.5% methylation (i.e., 4.5% of all cytosines found in blood are methylated; no information on where the methylated cytosines are located)
• Genome-wide approach
– Methylation in AAB’s blood is measured at a large number of CpG sites (e.g., with the Illumina Infinium 450K BeadChip we get ≈486,000 values, one for each CpG site, for AAB’s blood)
GWAS/EWAS
• Screen for 100Ks to millions of loci:
– GWAS: Single nucleotide polymorphisms (SNPs)
– EWAS: CpG sites
• The EWAS field is relatively new
• Several tools and methods are adapted from GWAS
Features covered in the 450k Infinium BeadChip
The 450K BeadChip covers a total of 77,537 CpG Islands and CpG Shores (N+S)
Region Type         Regions   CpG sites covered on 450K BeadChip array   Average # of CpG sites per region
CpG Island          26,153    139,265                                    5.08
N Shore             25,770    73,508                                     2.74
S Shore             25,614    71,119                                     2.66
N Shelf             23,896    49,093                                     1.97
S Shelf             23,968    48,524                                     1.94
Remote/Unassigned   -         104,926                                    -
Total                         485,553
[Figure: CpG island context, labels: N Shelf, N Shore, CpG Island, S Shore, S Shelf]
[Figure: gene regions, labels: TSS1500, TSS200, 5’ UTR, 3’ UTR]
The 450K BeadChip covers a total of 20,617 genes
GWAS vs. EWAS
• Type of data
– GWAS: a SNP can take only 3 values: 0 (wt/wt), 1 (wt/var), 2 (var/var)
– EWAS: measures are quantitative, e.g., the Illumina Infinium β value ranges between 0 and 1
• Changes over time
– GWAS: SNPs (almost) never change
– EWAS: epigenetic marks change over time
• Tissue specificity
– GWAS: SNPs are not tissue specific
– EWAS: epigenetic marks are tissue specific
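The difference in data type can be sketched in a few lines. This is only an illustration: the intensity values are made up, and the offset of 100 in the β formula is an assumption (a stabilizing constant commonly used with Infinium arrays), not something stated on the slide.

```python
def genotype_code(allele1: str, allele2: str, variant: str = "var") -> int:
    """GWAS-style coding: count of variant alleles, so only 0, 1, or 2 is possible."""
    return int(allele1 == variant) + int(allele2 == variant)


def beta_value(methylated: float, unmethylated: float, offset: float = 100.0) -> float:
    """EWAS-style measure: methylated signal as a fraction of total signal.

    The offset (assumed here to be 100) keeps the ratio stable when both
    intensities are low; the result always lies between 0 and 1.
    """
    return methylated / (methylated + unmethylated + offset)


# Hypothetical inputs for illustration only
print(genotype_code("wt", "var"))          # heterozygote
print(round(beta_value(900.0, 100.0), 3))  # continuous value in [0, 1]
```

A genotype is a count, so it is usually modeled as a categorical or additive predictor. A β value is continuous and bounded, which affects the choice of statistical model.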
Volcano plot
Differences between liver cancer cases and controls
Shen et al., Hepatology 2012
Multiple comparisons
• Infinium 450K methylation BeadChip
– Methylation measured at 485,553 CpG sites
– We will do 485,553 statistical tests
– Any problem with that?
• If you conduct 20 tests at α=0.05
– about one significant (positive) result is expected by chance at p<0.05
• If you conduct 485,553 tests
– about 24,277 significant (positive) results are expected by chance at p<0.05
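The arithmetic behind these numbers, as a minimal sketch. It assumes every null hypothesis is true, and the second function additionally assumes the tests are independent, which is a simplification for neighboring CpG sites.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of tests with p < alpha when no true association exists."""
    return n_tests * alpha


def prob_at_least_one(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of one or more false positives, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


print(expected_false_positives(20))            # 1.0
print(int(expected_false_positives(485_553)))  # 24277
print(round(prob_at_least_one(20), 2))         # 0.64
```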
Statistical corrections for multiple comparisons
• Bonferroni correction
– Multiple tests inflate the cumulative α
– Dividing α by the number of tests (485,553) controls the overall type I error
– Threshold for significance commonly set at p = 0.05/485,553 ≈ 1.0e-7
• False discovery rate (FDR)
– Focuses on positive (significant) findings at a ‘nominal’ uncorrected p-value
– FDR is the proportion of false positives among all positive findings
– FDR-controlling procedures have been developed to control the expected proportion of false positives (e.g., Benjamini-Hochberg)
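A minimal sketch of both corrections, using only the standard library. The ten p-values are invented for illustration; in practice an established statistics package would normally be used instead of hand-written code.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Indices of tests declared significant by the Benjamini-Hochberg step-up rule.

    Sort the p-values, find the largest rank k with p_(k) <= (k / m) * q,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])


# Hypothetical p-values for illustration only
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bonf = [i for i, x in enumerate(p) if x <= bonferroni_threshold(0.05, len(p))]
print(bonf)                   # [0]
print(benjamini_hochberg(p))  # [0, 1]
print(bonferroni_threshold(0.05, 485_553))
```

On this toy input Bonferroni keeps one finding and Benjamini-Hochberg keeps two, which reflects the usual trade-off: FDR control is less stringent but tolerates a stated proportion of false positives among the findings.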
                  True association
                  YES              NO
Test positive     True Positive    False Positive
Test negative     False Negative   True Negative

P-value = FP / (TN + FP)
Probability of a false positive finding under the null hypothesis (i.e., no true association)

FDR = FP / (TP + FP)
If I have a number X of significant p-values, how many are false positives? (Proportion of false positives)
Learning from past experience (in genetics)
Relative odds of alcohol dependency associated with Taq1A polymorphism
[Figure: Odds Ratio as a Function of Publication Year; labels: Original, 1990, 1995, 1999, 2004; Final OR=1.4]
Smith et al. (2008), American Journal of Epidemiology, 167(2): 125-138.
The winner’s curse
• On eBay
– Given the lack of information on the true value of the item being auctioned
– High variance in the estimated (dollar) values
• many over- and many under-estimates (bids)
– The “winner” is likely to have made the largest overestimate of value
• i.e., he or she is paying (way) too much
• In genetics
– The winner’s curse has been common
– The first report of an association of genetic variation with disease is likely to overestimate the effect size
• In epigenetics: Does the same apply?
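A small simulation can make the winner's curse concrete. This is a sketch under assumed values (true effect 0.2, standard error 0.1, one-sided significance at z > 1.96), none of which come from the slides: many studies estimate the same true effect, and only those that reach significance get "reported".

```python
import random
import statistics


def simulate_winners_curse(true_effect=0.2, se=0.1, n_studies=10_000,
                           z_crit=1.96, seed=1):
    """Return (mean of all estimates, mean of the significant estimates only)."""
    rng = random.Random(seed)
    # Each study's estimate is the true effect plus sampling noise
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    # Keep only estimates significant in the positive direction
    significant = [b for b in estimates if b / se > z_crit]
    return statistics.mean(estimates), statistics.mean(significant)


mean_all, mean_significant = simulate_winners_curse()
print(f"all studies: {mean_all:.3f}, significant only: {mean_significant:.3f}")
```

Averaged over all studies the estimate is unbiased, but the average among the significant ones is larger than the true effect. This is the same selection effect that makes first reports overestimate effect sizes.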
Replication is needed
Replication
Replication
Hirschhorn & Daly, Nat. Rev. Genet. 6: 95, 2005
NCI-NHGRI Working Group on Replication Nature 447: 655, 2007
Strategies for Discovery and Replication
• We will review different approaches for discovery and replication
• Examples from published studies
– Examples from EWAS when available
– Same concepts apply to both EWAS and GWAS
EWAS validation – Study design
• Discovery only (Single study)
– Prone to false positive findings (and false negatives too)
– 66 cases of hepatocellular carcinoma (HCC) assessed using the 450K BeadChip
– Differences in methylation in cancer tissues vs. adjacent non-cancer tissues
– Bonferroni-corrected p value ≤ 0.05, corresponding to a raw p value ≤ 1.06 × 10⁻⁷
– After Bonferroni adjustment, a total of 130,512 CpG sites differed significantly in methylation level in tumor compared with non-tumor tissues, with 28,017 CpG sites hypermethylated and 102,495 hypomethylated in tumor tissues.
Additional filtering
• Hypermethylated sites
– mean difference in methylation, tumor vs. normal, > 20%
– in > 70% of the tumor tissues, methylation is > 2 SDs above the mean methylation level of all 66 adjacent tissues
– mean methylation for adjacent tissues < 25%
• Hypomethylated sites
– mean difference in methylation, tumor vs. normal, > 20%
– in > 70% of the tumor tissues, methylation is > 2 SDs below the mean methylation level of all 66 adjacent tissues
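The hypermethylation filter can be sketched for a single CpG site as follows. Assumptions not stated on the slide: methylation is expressed as β values on a 0 to 1 scale (so 20% becomes 0.20), the SD is the sample standard deviation of the adjacent tissues, and the function name and example values are hypothetical.

```python
import statistics


def is_hypermethylated(tumor, adjacent, min_diff=0.20,
                       min_fraction=0.70, max_adjacent_mean=0.25):
    """Apply the three hypermethylation criteria to one CpG site.

    tumor, adjacent: lists of methylation values (0-1) across samples.
    """
    adj_mean = statistics.mean(adjacent)
    adj_sd = statistics.stdev(adjacent)  # sample SD (assumed)
    mean_diff = statistics.mean(tumor) - adj_mean
    # Fraction of tumor samples more than 2 SDs above the adjacent-tissue mean
    frac_above = sum(t > adj_mean + 2 * adj_sd for t in tumor) / len(tumor)
    return (mean_diff > min_diff
            and frac_above > min_fraction
            and adj_mean < max_adjacent_mean)


# Hypothetical values for one site, illustration only
adjacent = [0.10, 0.12, 0.08, 0.11, 0.09]
tumor = [0.50, 0.55, 0.45, 0.60, 0.12]
print(is_hypermethylated(tumor, adjacent))  # True
```

The hypomethylation filter would mirror this with the sign of the difference reversed and the 2 SD bound taken below the mean.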
EWAS validation – Study design
• Discovery only (Single study)
– Prone to false positive findings (and false negatives too)
• Internal Replication
– Sample two or more groups from the same population
– Group 1: EWAS; Other groups: candidate gene analysis
– Overall power lower than a same-size discovery-only design (Skol AD, Nat Genet 2006).
• All subjects from the ESTHER cohort in Germany
• Internal Replication
– Discovery on 177 participants from ESTHER (27K Infinium methylation BeadChip analysis)
– Replication on 316 participants from ESTHER (Sequenom MassARRAY)
Discovery and replication groups
Discovery
Discovery → validation → replication (top gene)
EWAS validation – Study design
• Discovery only (Single study)
– Prone to false positive findings (and false negatives too)
• Internal Replication
– Sample two or more groups from the same population
– Group 1: EWAS; Other groups: candidate gene analysis
– Overall power lower than a same-size discovery-only design (Skol AD, Nat Genet 2006).
• Discovery → External (Independent) Replication
– Two (or more) independent studies
– Ensure validation + generalizability
(1) Discovery: Cord blood and peripheral blood samples from 1018 ALSPAC child-mother pairs (450K Infinium methylation BeadChip analysis)
(2) External Replication:
• The WMHP and CANDLE cohort (27K Infinium methylation BeadChip analysis)
• The NB and MoBa cohort (450K Infinium methylation BeadChip analysis)
• And a case–control study (450K Infinium methylation BeadChip analysis)
Discovery → Replication
Gestational Age:
• 224 top hits: GA had a negative association with methylation at 188 probes and a positive association at 36 probes
• 129 were replicated in the NB cohort and 5 were replicated in the WMHP and CANDLE cohort
• 72 previously reported in the case-control study
Birth Weight:
• 23 associations observed between birth weight and cord blood methylation in the discovery study
• 2 out of 23 replicated in the MoBa cohort
EWAS validation – Study design
• Discovery only (Single study)
– Prone to false positive findings (and false negatives too)
• Internal Replication
– Sample two or more groups from the same population
– Group 1: EWAS; Other groups: candidate gene analysis
– Overall power lower than a same-size discovery-only design (Skol AD, Nat Genet 2006).
• Discovery → Replication
– Two (or more) independent studies
– Ensure validation + generalizability
• Meta-analysis
– Uses estimates from multiple populations
– Needed to achieve large sample size
– Allows for evaluating generalizability
• 44,494 participants of European ancestry
– from nine large studies participating in the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
– seven additional studies
• Each study computes association statistics (e.g., ORs and p-values), then results are meta-analyzed
• Only results (not data) are shared
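Combining study-level results can be sketched with a fixed-effect, inverse-variance weighted meta-analysis. The slides do not say which model the consortium used, so this is an assumed, commonly used approach, and the two study estimates are invented. Note that only each study's estimate and standard error are needed, not the individual-level data.

```python
import math


def fixed_effect_meta(estimates, std_errors):
    """Inverse-variance weighted pooled estimate and its standard error.

    estimates: per-study effect estimates (e.g., log odds ratios).
    std_errors: matching per-study standard errors.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se


# Hypothetical log-OR estimates from two studies, illustration only
pooled, pooled_se = fixed_effect_meta([0.10, 0.30], [0.10, 0.10])
print(f"pooled log-OR = {pooled:.3f} (SE {pooled_se:.3f}), OR = {math.exp(pooled):.2f}")
```

More precise studies (smaller standard errors) get larger weights, and the pooled standard error is smaller than that of any single study. This is how meta-analysis achieves the large effective sample size.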
Results for intima media thickness
Forest plot for ZHX2 – rs11781551
(zinc fingers and homeoboxes 2)
Population Stratification*
Each population has unique genetic and social history; ancestral patterns of migration, mating, expansions/bottlenecks, stochastic variation all yield differences in allele frequencies between populations.
Population stratification: cases and controls have different allele frequencies due to diversity in populations of origin and unrelated to outcome, requiring:
1) differences in disease prevalence
2) differences in allele frequencies
*Cardon LR, Palmer LJ, Lancet 2003
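The two requirements above can be illustrated with expected counts. In this sketch all numbers are invented: two subpopulations differ in both disease prevalence and allele (carrier) frequency, and within each subpopulation the allele has no effect on disease.

```python
def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = carrier / non-carrier cases; c, d = carrier / non-carrier controls."""
    return (a * d) / (b * c)


def expected_table(n, prevalence, carrier_freq):
    """Expected 2x2 counts when carrier status is independent of disease."""
    cases = n * prevalence
    controls = n - cases
    return (cases * carrier_freq, cases * (1 - carrier_freq),
            controls * carrier_freq, controls * (1 - carrier_freq))


# Hypothetical subpopulations: A has higher prevalence AND higher allele frequency
table_a = expected_table(10_000, prevalence=0.10, carrier_freq=0.6)
table_b = expected_table(10_000, prevalence=0.02, carrier_freq=0.2)
pooled = tuple(x + y for x, y in zip(table_a, table_b))

print(round(odds_ratio(*table_a), 2))  # within A: no association
print(round(odds_ratio(*table_b), 2))  # within B: no association
print(round(odds_ratio(*pooled), 2))   # pooled: spurious association
```

The odds ratio is 1 within each subpopulation but clearly above 1 when the groups are pooled. Setting either the prevalences or the carrier frequencies equal across subpopulations brings the pooled odds ratio back to 1.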
What is population stratification?
Balding, Nature Reviews Genetics 2010
Unlinked Genetic Markers in Population Stratification
• Population stratification (or any non-random mating) allows marker-allele frequencies to vary among population segments.
• Disease more prevalent in one subpopulation will be associated with any alleles in high frequency in that subpopulation.
• If population stratification exists, can often be detected by analysis of unlinked marker loci.
[Pritchard JK, Rosenberg NA; AJHG 1999; 65:220-228]
Adjusting for Population Stratification in a GWAS of T2DM*
• Case-control study of 661 cases of T2DM and 614 controls from France.
• Genotyping assayed 392,935 SNPs
• SNP 200kb from lactase gene on 2q21:
– Strong association with T2DM
– Strong north-south prevalence gradient in France
• Used 20,323 SNPs not related to T2DM as measure of population stratification.
• After adjustment for stratification, most of the association was removed.
*Sladek R et al. Nature 2007; 445: 881-885.
Sources of analytical variability for methylation EWAS
• Several factors can affect results
– DNA/sample quality
– Plate effects
– Batch effect
– Row/column effect
• How to handle this
– Best laboratory practice
– Randomize/balance samples
– Universal DNA/Replicates
– Bioinformatics/Statistical analysis
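One way to randomize and balance samples across plates is sketched below. The function name, sample IDs, and plate count are hypothetical; the idea is to shuffle within each group and deal samples to plates round-robin, so that case/control status is not confounded with plate.

```python
import random


def assign_plates(sample_ids_by_group, n_plates, seed=0):
    """Balanced random assignment of samples to plates.

    sample_ids_by_group: dict mapping group name (e.g., "case") to sample IDs.
    Returns a list of plates, each a list of (sample_id, group) tuples.
    """
    rng = random.Random(seed)
    plates = [[] for _ in range(n_plates)]
    offset = 0
    for group, ids in sample_ids_by_group.items():
        shuffled = list(ids)
        rng.shuffle(shuffled)  # random order within the group
        for i, sample_id in enumerate(shuffled):
            # Round-robin dealing gives each plate a near-equal share of the group
            plates[(i + offset) % n_plates].append((sample_id, group))
        offset += len(shuffled)  # continue dealing where the last group stopped
    for plate in plates:
        rng.shuffle(plate)  # randomize position (row/column order) within each plate
    return plates


# Hypothetical sample IDs, illustration only
samples = {"case": [f"C{i}" for i in range(8)], "control": [f"N{i}" for i in range(8)]}
for number, plate in enumerate(assign_plates(samples, n_plates=4), start=1):
    print(number, plate)
```

With 8 cases and 8 controls on 4 plates, each plate receives 2 cases and 2 controls. Any remaining plate or position effects can then be handled in the statistical analysis without being mistaken for case-control differences.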
Is DNA Collected and Handled Identically in Cases and Controls?
• T1DM gene association study examining 6322 SNPs: cases from the GRID Study, controls from the 1958 British Birth Cohort Study.
• Samples from lymphoblastoid cell lines extracted using same protocol in two different laboratories.
• Case and control DNAs randomly ordered with teams masked to case/control status.
• Some extreme associations could not be replicated by second genotyping method.
Clayton DG et al., Nat Genet 2005; 37: 1243-46.
In-class Readings
Papers
• Lee et al. Quantitative promoter hypermethylation analysis of RASSF1A in lung cancer: Comparison with methylation-specific PCR technique and clinical significance. Mol Med Report 2011.
• Joubert et al. 450K Epigenome-Wide Scan Identifies Differential DNA Methylation in Newborns Related to Maternal Smoking during Pregnancy. Environ Health Perspect 2012.
In-class Readings
Questions
• DNA methylation analysis:
– Which technique was used?
– How much DNA was used?
– Did it involve bisulfite treatment?
• Aim of the study:
– What was measured?
– Why?
• Results:
– How were DNA methylation results reported?
– Which statistical analysis was used?
Guest Lectures: Reproductive Epigenetics and Prenatal Influences on the Epigenome
Karin Michels, PhD, ScD
Co-Director, Ob/Gyn Epidemiology Center, BWH
Heather Herson Burris, MD, MPH
Neonatology, BIDMC