Lecture7_2015

Andrea Baccarelli, MD, PhD, MPH

Laboratory of Environmental Epigenetics

Harvard School of Public Health

abaccare@hsph.harvard.edu

Lecture 7

From GWAS to EWAS & Interpretation of epigenetic data

Genetics

• Candidate gene approach

• A priori knowledge → candidate genes

• test for association with disease/phenotype

• Genome-wide approach (GWAS)

• Agnostic approach → entire genome

• test for association with disease/phenotype

Graphical representation of GWAS findings

Manhattan plot

Systemic Sclerosis (auto-immune disease) Radstake et al., Nature Genetics 2010

Published Genome-Wide Associations through 12/2013

Published GWA at p ≤ 5 × 10⁻⁸ for 17 trait categories

NHGRI GWA Catalog www.genome.gov/GWAStudies www.ebi.ac.uk/fgpt/gwas/

Epigenetics

• Candidate gene (gene-specific) approach

• A priori knowledge → candidate genes

• test for association with exposure/risk factor

• test for association with disease/phenotype

• Global (average) level of methylation (5mC content)

• Average methylation of all CpG sites across the genome

• test for association with exposure/risk factor

• test for association with disease/phenotype

• Epigenome-wide approach (EWAS)

• Agnostic approach → entire genome

• test for association with exposure/risk factor

• test for association with disease/phenotype

Examples for DNA methylation

• Candidate gene approach

– AAB’s blood has 26% methylation in the IL6 promoter

(N.B.: any other region of interest can be targeted, e.g., CpG island shore, shelf, etc.)

• Global methylation approach

– AAB’s blood has 4.5% methylation (i.e., 4.5% of all cytosines found in blood are methylated; no information on where the methylated cytosines are located)

• Genome-wide approach

– Methylation in AAB’s blood is measured at a large number of CpG sites (e.g., with the Illumina Infinium 450K BeadChip we get ≈486,000 numbers, one for each CpG site, for AAB’s blood)

GWAS/EWAS

• Screen for 100Ks to millions of loci:

– GWAS: Single nucleotide polymorphisms (SNPs)

– EWAS: CpG sites

• The EWAS field is relatively new

• Several tools and methods are borrowed from GWAS

Features covered in the 450k Infinium BeadChip

The 450K BeadChip covers a total of 77,537 CpG Islands and CpG Shores (N+S)

Region type          Regions    CpG sites covered on 450K BeadChip array    Average # of CpG sites per region
CpG Island           26,153     139,265                                     5.08
N Shore              25,770     73,508                                      2.74
S Shore              25,614     71,119                                      2.66
N Shelf              23,896     49,093                                      1.97
S Shelf              23,968     48,524                                      1.94
Remote/Unassigned    -          104,926                                     -
Total                           485,553

The 450K BeadChip covers a total of 20,617 genes

[Figure: probe locations relative to CpG islands (N Shelf, N Shore, CpG Island, S Shore, S Shelf) and to gene regions (TSS1500, TSS200, 5’ UTR, 3’ UTR)]

GWAS vs. EWAS

• Type of data

– GWAS: a SNP can assume only 3 values: 0 (wt/wt); 1 (wt/var); 2 (var/var)

– EWAS: measures are quantitative, e.g., the Illumina Infinium β value ranges between 0 and 1

• Changes over time

– GWAS: SNPs (almost) never change

– EWAS: epigenetic marks change over time

• Tissue specificity

– GWAS: SNPs are not tissue specific

– EWAS: epigenetic marks are tissue specific
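The difference in data type can be made concrete with a short Python sketch. The function names and intensity values below are hypothetical. The β formula, methylated signal divided by total signal plus an offset of 100, follows the usual Illumina convention and is stated here as an assumption, since the slides do not give it.

```python
def additive_genotype(genotype: str, variant_allele: str) -> int:
    """Count copies of the variant allele in a two-letter genotype: 0, 1, or 2."""
    return sum(1 for allele in genotype if allele == variant_allele)


def beta_value(methylated: float, unmethylated: float, offset: float = 100.0) -> float:
    """Beta value M / (M + U + offset); the offset of 100 is an assumed convention."""
    m = max(methylated, 0.0)
    u = max(unmethylated, 0.0)
    return m / (m + u + offset)


# GWAS-style data: a discrete code per subject (wt/wt, wt/var, var/var).
snp_codes = [additive_genotype(g, "T") for g in ["CC", "CT", "TT"]]

# EWAS-style data: a continuous value between 0 and 1 per CpG site.
# The probe intensities are made-up numbers for illustration.
beta = beta_value(methylated=3000.0, unmethylated=900.0)
```

Here `snp_codes` holds the three possible genotype codes, while `beta` is a continuous proportion, which is why EWAS analyses typically use methods for quantitative measures.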

Volcano plot

Differences between liver cancer cases and controls

Shen et al., Hepatology 2012

Multiple comparisons

• Infinium 450K methylation BeadChip

– Methylation measured at 485,553 CpG sites

– We will do 485,553 statistical tests

– Any problem with that?

• If you conduct 20 tests at α=0.05 and no true association exists

– about one is expected to be significant (positive) by chance at p<0.05

• If you conduct 485,553 tests

– about 24,278 are expected to be significant (positives) by chance at p<0.05
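The arithmetic behind these counts can be checked with a minimal Python sketch. It assumes every null hypothesis is true, and the second function additionally assumes the tests are independent, which is a simplification for neighboring CpG sites. The function names are illustrative.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of chance findings when all null hypotheses are true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float) -> float:
    """Family-wise error rate, assuming independent tests (a simplification)."""
    return 1.0 - (1.0 - alpha) ** n_tests


few = expected_false_positives(20, 0.05)        # 20 x 0.05 = 1
many = expected_false_positives(485_553, 0.05)  # 485,553 x 0.05 = 24,277.65
fwer_20 = prob_at_least_one_false_positive(20, 0.05)
```

With 20 independent tests, the chance of at least one false positive is already about 64%, even though only one is expected on average.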

Statistical corrections for multiple comparisons

• Bonferroni correction

– Multiple tests inflate the cumulative (family-wise) α

– Dividing α by the number of tests (485,553) keeps the family-wise α at the nominal level

– Threshold for significance commonly set at p = 0.05/485,553 ≈ 1.03 × 10⁻⁷
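As a quick check of the threshold, here is a minimal Python sketch of the Bonferroni rule. The helper name and the four example p-values are hypothetical.

```python
n_tests = 485_553
alpha = 0.05

# Per-test threshold that keeps the family-wise alpha at 0.05.
bonferroni_threshold = alpha / n_tests  # roughly 1.03e-7


def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that pass the Bonferroni threshold alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Toy example with 4 tests: the per-test threshold is 0.05 / 4 = 0.0125.
flags = bonferroni_significant([0.001, 0.02, 0.01, 0.5])
```

In the toy example only the first and third p-values fall below 0.0125, so `flags` marks those two as significant.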

• False discovery rate (FDR)

– Focuses on positive (significant) findings at a ‘nominal’ uncorrected p-value

– FDR is the proportion of false positives among all positive findings

– FDR-controlling procedures have been developed to control the expected proportion of false positives among significant findings (e.g., Benjamini-Hochberg)
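A minimal Python sketch of the Benjamini-Hochberg step-up procedure follows. It is one common way to write the procedure, not code from the lecture, and the eight p-values are made up. Sort the p-values, find the largest rank k with p(k) ≤ (k/m)·q, and reject the k smallest.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Compare the rank-th smallest p-value with its critical value (rank / m) * q.
        if p_values[idx] <= rank / m * q:
            largest_rank = rank
    rejected = [False] * m
    # Step-up rule: reject every hypothesis ranked at or below the largest passing rank.
    for idx in order[:largest_rank]:
        rejected[idx] = True
    return rejected


p_example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
rejected = benjamini_hochberg(p_example, q=0.05)
```

On this example the procedure rejects the two smallest p-values, whereas a Bonferroni threshold of 0.05/8 = 0.00625 would retain only the first.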

                   True association
                   YES               NO
Significant        True Positive     False Positive
Not significant    False Negative    True Negative

P-value = FP / (TN + FP)

Probability of a false positive finding under the null hypothesis (i.e., no true association)

FDR = FP / (TP + FP)

If I have a number X of significant p-values, how many are false positives? (Proportion of false positives)
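The two ratios answer different questions, as a small worked example in Python shows. The counts are entirely hypothetical: 485,553 sites, of which 1,000 are assumed truly associated, tested at an uncorrected p<0.05.

```python
def false_positive_rate(fp: int, tn: int) -> float:
    """FP / (TN + FP): share of truly null sites that are called significant."""
    return fp / (tn + fp)


def false_discovery_rate(fp: int, tp: int) -> float:
    """FP / (TP + FP): share of significant findings that are false."""
    return fp / (tp + fp)


# Hypothetical counts, chosen only so that they add up to 485,553 sites.
tp, fn = 800, 200          # truly associated sites: detected / missed
fp, tn = 24_000, 460_553   # truly null sites: called significant / not significant

fpr = false_positive_rate(fp, tn)    # close to the nominal 0.05
fdr = false_discovery_rate(fp, tp)   # most "hits" are false
```

With these numbers the false positive rate stays near 5%, yet about 97% of the significant findings are false, which is why uncorrected p-values are not enough in an EWAS.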

Learning from past experience (in genetics)

Relative odds of alcohol dependency associated with the Taq1A polymorphism

[Figure: odds ratio as a function of publication year, from the original report (1990) through 1995, 1999 and 2004; final OR = 1.4]

Smith et al. (2008), American Journal of Epidemiology, 167(2): 125-138.

The winner’s curse

• On eBay

– Given the lack of information on the true value of the item being auctioned, there is high variance in the estimated (dollar) values: many over- and many underestimates (bids)

– The “winner” is likely to have made the largest overestimate of value, i.e., he or she is paying (way) too much

• In genetics

– The winner’s curse has been common

– The first report of an association of genetic variation with disease is likely to overestimate the effect size

• In epigenetics: does the same apply?
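The winner's curse can be illustrated with a small simulation in Python. This is a sketch under simple assumptions: many studies estimate the same true effect with normally distributed error, and only positive estimates that pass a significance cutoff count as "first reports". All parameter values are arbitrary.

```python
import random
import statistics


def simulate_winners_curse(true_effect=0.2, standard_error=0.1,
                           n_studies=10_000, z_crit=1.96, seed=0):
    """Compare the mean estimate over all studies with the mean over significant ones."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, standard_error) for _ in range(n_studies)]
    # Keep only studies whose estimate is positive and significant (z > z_crit).
    significant = [b for b in estimates if b / standard_error > z_crit]
    return statistics.mean(estimates), statistics.mean(significant)


mean_all, mean_significant = simulate_winners_curse()
```

Averaged over all simulated studies the estimate is close to the true effect of 0.2, but the average among the significant ones is noticeably larger, because selection on significance favors overestimates.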

Replication is needed

Hirschhorn & Daly Nat. Rev. Genet. 6: 95, 2005

NCI-NHGRI Working Group on Replication Nature 447: 655, 2007

Strategies for discovery and Replication

• We will review different approaches for discovery and replication

• Examples from published studies

– Examples from EWAS when available

– Same concepts apply to both EWAS and GWAS

EWAS validation – Study design

• Discovery only (Single study)

– Prone to false-positive (and false-negative) findings

– 66 cases of hepatocellular carcinoma (HCC) assessed using the 450K BeadChip

– Differences in methylation in cancer tissues vs. adjacent non-cancer tissues

– Bonferroni-corrected p value ≤ 0.05, which corresponds to a raw p value ≤ 1.06 × 10⁻⁷

– After Bonferroni adjustment, a total of 130,512 CpG sites significantly differed in methylation level in tumor compared with non-tumor tissues, with 28,017 CpG sites hypermethylated and 102,495 hypomethylated in tumor tissues.

Additional filtering

• Hypermethylated sites:

– mean difference in methylation, tumor vs. normal, > 20%

– > 70% of the tumor tissues with methylation > 2 SDs above the mean methylation level of all 66 adjacent tissues

– mean methylation for adjacent tissues < 25%

• Hypomethylated sites:

– mean difference in methylation, tumor vs. normal, > 20%

– > 70% of the tumor tissues with methylation > 2 SDs below the mean methylation level of all 66 adjacent tissues
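The filtering criteria above could be applied per CpG site along the following lines. This Python sketch rests on two assumptions that the slide does not settle: the 20% difference is read as an absolute difference of 0.20 in methylation fraction, and the SD is taken from the adjacent tissues. The function names and toy values are hypothetical.

```python
import statistics


def is_hypermethylated(tumor, adjacent):
    """tumor, adjacent: methylation fractions (0 to 1) at one CpG site."""
    mean_adjacent = statistics.mean(adjacent)
    # Assumption: the SD is computed on the adjacent (non-tumor) tissues.
    cutoff = mean_adjacent + 2 * statistics.stdev(adjacent)
    share_above = sum(t > cutoff for t in tumor) / len(tumor)
    return (statistics.mean(tumor) - mean_adjacent > 0.20
            and share_above > 0.70
            and mean_adjacent < 0.25)


def is_hypomethylated(tumor, adjacent):
    """Mirror image of the rule above, without the adjacent-tissue ceiling."""
    mean_adjacent = statistics.mean(adjacent)
    cutoff = mean_adjacent - 2 * statistics.stdev(adjacent)
    share_below = sum(t < cutoff for t in tumor) / len(tumor)
    return (mean_adjacent - statistics.mean(tumor) > 0.20
            and share_below > 0.70)


# Toy data for a single site (five tissue pairs instead of 66).
adjacent = [0.10, 0.12, 0.08, 0.11, 0.09]
tumor_high = [0.45, 0.50, 0.40, 0.55, 0.12]
tumor_flat = [0.20, 0.22, 0.18, 0.25, 0.21]
```

In the toy data, `tumor_high` passes all three hypermethylation criteria, while `tumor_flat` fails because its mean difference from the adjacent tissues is below 0.20.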

EWAS validation – Study design

• Internal Replication

– Sample two or more groups from the same population

– Group 1: EWAS; Other groups: candidate gene analysis

– Overall power lower than a discovery-only analysis of the same total size (Skol AD, Nat Genet 2006).

• All subjects from the ESTHER cohort in Germany

• Internal Replication

– Discovery on 177 participants from ESTHER (27K Infinium methylation BeadChip analysis)

– Replication on 316 participants from ESTHER (Sequenom MassARRAY)

Discovery and replication groups

Discovery

Discovery → validation → replication (top gene)

EWAS validation – Study design

• Discovery > External (Independent) Replication

– Two (or more) independent studies

– Ensure validation + generalizability

(1) Discovery: cord blood and peripheral blood samples from 1,018 ALSPAC child-mother pairs (450K Infinium methylation BeadChip analysis)

(2) External Replication:

• The WMHP and CANDLE cohorts (27K Infinium methylation BeadChip analysis)

• The NB and MoBa cohorts (450K Infinium methylation BeadChip analysis)

• A case–control study (450K Infinium methylation BeadChip analysis)

Discovery → Replication

Gestational Age:

• 224 top hits: GA had a negative association with methylation at 188 probes and a positive association at 36 probes

• 129 were replicated in the NB cohort and 5 were replicated in the WMHP and CANDLE cohorts

• 72 previously reported in the case-control study

Birth Weight:

• 23 associations observed between birth weight and cord blood methylation in the discovery study

• 2 out of 23 replicated in the MoBa cohort

EWAS validation – Study design

• Meta-analysis

– Uses estimates from multiple populations

– Needed to achieve large sample size

– Allows for evaluating generalizability

• 44,494 participants of European ancestry

– from nine large studies participating in the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium

– seven additional studies

• Each study computes association statistics (e.g., ORs and p-values), then results are meta-analyzed

• Only results (not data) are shared
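Combining study-level results can be sketched with a fixed-effect, inverse-variance meta-analysis in Python. This is one common choice, assumed here for illustration, and not necessarily the model used by the consortium. Odds ratios are pooled on the log scale, and the three study estimates are made up.

```python
import math


def fixed_effect_meta(estimates, standard_errors):
    """Inverse-variance pooled estimate, its standard error, and a two-sided p-value."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled / pooled_se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, p_value


# Hypothetical per-study log odds ratios and standard errors (results, not raw data).
log_ors = [0.10, 0.20, 0.15]
ses = [0.05, 0.10, 0.05]

pooled_log_or, pooled_se, p_value = fixed_effect_meta(log_ors, ses)
pooled_or = math.exp(pooled_log_or)
```

Only the estimates and standard errors enter the calculation, which matches the point that studies share results rather than individual-level data. More precise studies (smaller standard errors) receive more weight.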

Results for intima media thickness

Forest plot for ZHX2 (zinc fingers and homeoboxes 2) – rs11781551

Population Stratification*

Each population has unique genetic and social history; ancestral patterns of migration, mating, expansions/bottlenecks, stochastic variation all yield differences in allele frequencies between populations.

Population stratification: cases and controls have different allele frequencies due to diversity in populations of origin and unrelated to outcome, requiring:

1) differences in disease prevalence

2) differences in allele frequencies

*Cardon LR, Palmer LJ, Lancet 2003

What is population stratification?

Balding, Nature Reviews Genetics 2010

Unlinked Genetic Markers in Population Stratification

• Population stratification (or any non-random mating) allows marker-allele frequencies to vary among population segments.

• A disease that is more prevalent in one subpopulation will be associated with any alleles in high frequency in that subpopulation.

• If population stratification exists, it can often be detected by analysis of unlinked marker loci.

[Pritchard JK, Rosenberg NA; AJHG 1999; 65:220-228]
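A worked example in Python shows how stratification alone can create an association. Two hypothetical subpopulations differ in both allele-carrier frequency and disease prevalence, and within each one carrying the allele is unrelated to disease. All numbers are invented, and expected counts are used instead of sampling. The Mantel-Haenszel estimator is included as one standard way to adjust for the stratum.

```python
def stratum_table(n, carrier_freq, prevalence):
    """Expected 2x2 counts when carrier status and disease are independent.

    Returns (carrier cases, carrier controls, non-carrier cases, non-carrier controls).
    """
    carriers = n * carrier_freq
    non_carriers = n - carriers
    return (carriers * prevalence, carriers * (1 - prevalence),
            non_carriers * prevalence, non_carriers * (1 - prevalence))


def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    """Stratum-adjusted odds ratio across several 2x2 tables."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# Hypothetical subpopulations: one has both a common allele and a common disease.
subpop_1 = stratum_table(10_000, carrier_freq=0.8, prevalence=0.20)
subpop_2 = stratum_table(10_000, carrier_freq=0.2, prevalence=0.05)
pooled = tuple(x + y for x, y in zip(subpop_1, subpop_2))

or_within_1 = odds_ratio(*subpop_1)                     # no association within stratum
or_within_2 = odds_ratio(*subpop_2)                     # no association within stratum
or_pooled = odds_ratio(*pooled)                         # spurious association
or_adjusted = mantel_haenszel_or([subpop_1, subpop_2])  # association disappears
```

The odds ratio is 1 within each subpopulation but about 2.4 when the two are pooled. Both required ingredients, a difference in prevalence and a difference in allele frequency, are present here.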

Adjusting for Population Stratification in a GWAS of T2DM*

• Case-control study of 661 cases of T2DM and 614 controls from France.

• Genotyping assayed 392,935 SNPs

• SNP 200kb from lactase gene on 2q21:

– Strong association with T2DM

– Strong north-south prevalence gradient in France

• Used 20,323 SNPs not related to T2DM as measure of population stratification.

• After adjustment for stratification, most of the association was removed.

*Sladek R et al. Nature 2007; 445: 881-885.

Sources of analytical variability for methylation EWAS

• Several factors can affect results

– DNA/sample quality

– Plate effects

– Batch effect

– Row/column effect

• How to handle this

– Best laboratory practice

– Randomize/balance samples

– Universal DNA/Replicates

– Bioinformatics/Statistical analysis
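Randomizing and balancing samples across plates can be sketched as follows in Python. The function, sample IDs, and plate count are hypothetical. The sketch shuffles within cases and within controls, then deals samples to plates in turn, so that each plate receives a similar case/control mix. It does not handle well positions (row/column effects), which would need a further shuffle within each plate.

```python
import random
from collections import Counter


def assign_plates(sample_ids, status, n_plates, seed=0):
    """Map each sample ID to a plate index, balancing case/control status across plates."""
    rng = random.Random(seed)
    assignment = {}
    next_slot = 0
    for group in sorted(set(status.values())):
        members = [s for s in sample_ids if status[s] == group]
        rng.shuffle(members)  # random order within each status group
        for sample in members:
            assignment[sample] = next_slot % n_plates  # deal round-robin
            next_slot += 1
    return assignment


# Hypothetical study: 48 cases and 48 controls spread over 4 plates.
samples = [f"S{i:03d}" for i in range(96)]
status = {s: ("case" if i < 48 else "control") for i, s in enumerate(samples)}

plates = assign_plates(samples, status, n_plates=4)
per_plate = Counter((plates[s], status[s]) for s in samples)
```

In this example every plate ends up with 12 cases and 12 controls, so a plate or batch effect cannot masquerade as a case/control difference.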

Is DNA Collected and Handled Identically in Cases and Controls?

• T1DM gene association study examining 6322 SNPs: cases from the GRID Study, controls from the 1958 British Birth Cohort Study.

• Samples from lymphoblastoid cell lines extracted using same protocol in two different laboratories.

• Case and control DNAs randomly ordered with teams masked to case/control status.

• Some extreme associations could not be replicated by second genotyping method.

Clayton DG et al., Nat Genet 2005; 37: 1243-46.

Interpretation of epigenetic data

In-class Readings

Papers

• Lee et al. Quantitative promoter hypermethylation analysis of RASSF1A in lung cancer: Comparison with methylation-specific PCR technique and clinical significance. Mol Med Report 2011.

• Joubert et al. 450K Epigenome-Wide Scan Identifies Differential DNA Methylation in Newborns Related to Maternal Smoking during Pregnancy. Environ Health Perspect 2012.

In-class Readings

Questions

• DNA methylation analysis:

– Which technique was used?

– How much DNA was used?

– Did it involve bisulfite treatment?

• Aim of the study:

– What was measured?

– Why?

• Results:

– How were DNA methylation results reported?

– Which statistical analysis was used?

Next lecture

Guest Lectures: Reproductive Epigenetics and Prenatal Influences on the Epigenome

Karin Michels, PhD, ScD

Co-Director, Ob/Gyn Epidemiology Center, BWH

Heather Herson Burris, MD, MPH

Neonatology, BIDMC
