SNPs in association studies

Ivan P. Gorlov, Olga Y. Gorlova
& Christopher I. Amos
Colon cancer results from genetic
alterations in multiple genes
Inherited mutations in the APC gene dramatically increase
risk of colon cancer
Cancer development reflects early germline susceptibility as
well as subsequent mutations that promote cancer
Lu et al., 2014
The continuum of variation
Linkage
Sequencing
Association
Chung et al. Carcinogenesis 31:111-120, 2010
SNPs are the most common type of polymorphism
in the human genome
• SNPs occur roughly every 400 bases
• Minor allele frequency (MAF) describes the prevalence of the rarer variant; it lies in (0, 0.5]
• Very few (<0.01%) SNPs are associated with diseases
• SNPs are the bread and butter of Genome-Wide Association Studies (GWAS)
• Success requires inference about effects from causal variants
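As a small illustration of the MAF definition above, the sketch below counts alleles at one biallelic SNP and reports the rarer one. The input format (two-character genotype strings) and the sample data are assumptions made for this example only.

```python
from collections import Counter

def minor_allele_frequency(genotypes):
    """Return (minor_allele, MAF) for a biallelic SNP.

    `genotypes` is a list of two-character strings such as "AG";
    this input format is an assumption made for the example.
    """
    # Each genotype contributes two alleles to the count.
    counts = Counter(allele for g in genotypes for allele in g)
    if len(counts) != 2:
        raise ValueError("expected exactly two alleles at a polymorphic biallelic site")
    minor, n_minor = min(counts.items(), key=lambda kv: kv[1])
    return minor, n_minor / sum(counts.values())

# Hypothetical sample of 10 individuals (20 alleles): 6 copies of A, 14 of G.
sample = ["GG"] * 5 + ["AG"] * 4 + ["AA"] * 1
allele, maf = minor_allele_frequency(sample)
print(allele, maf)
```

By construction the returned frequency can never exceed 0.5, which is why MAF lies in (0, 0.5].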
GWAS Have Been Highly Successful in Finding Novel Associations
Manhattan plot of all lung cancers from 1000 Genomes imputation
Labeled loci: CHRNA5, hTERT, BRCA2, TP63, hMSH5, CHEK2
There are multiple signals on Chr 15q25: sequencing of 96 people
Exon 4
76,667,807: GG->GA, rs2229961, exon region, Val134->Ile134
Exon 5
76,669,980: GG->AA, rs16969968, exon region, Asp398->Asn398 (negatively charged to uncharged polar)
76,669,876: TT->TA, exon region, Leu363->Glu363 (hydrophobic to hydrophilic)
76,669,781: CC->CT, exon region, Thr331->Thr331
76,669,288: AA->GA, exon region, Lys167->Arg167
76,669,665-6: AC deletion, exon region, reading frame shift after codon 292
Intron 4
76668145
Observed in 4 cases
Additional variation (small microsatellites and
in/dels) in the promoters of both CHRNA5
and CHRNA3
Slightly deleterious SNPs (sdSNPs) are deleterious enough to impair gene
function and increase disease risk.
Their effects, however, are not strong enough
for selection to eliminate them from the
population (genetic drift, founder effects, and
population bottlenecks help
retain sdSNPs).
Proportions of functional SNPs in different MAF
categories
We estimated the proportion of nonsynonymous SNPs predicted to be
functional by PolyPhen and SIFT in different MAF categories
PolyPhen
SIFT
The Erudite Hypothesis:
A large fraction of genetic susceptibility is influenced by
rare (<5%) variants with relatively strong effect sizes
Targeting rarer variants for analysis may detect causal
variants when the sample size is large.
The majority of the significant SNPs detected by GWAS
tag untyped causal variants; this greatly
reduces power but indicates regions for further study
Brute-force analysis will not be sufficient to
overcome the power loss from multiple comparisons; analyses have to
upweight the most likely causal variants
MAF and a Required Sample Size
The numbers of cases and controls that are required in an
association study to detect disease variants with allelic odds ratios
of 1.2 (red), 1.3 (blue), 1.5 (yellow), and 2 (black). Numbers shown
are for a statistical power of 80% at a significance level of P < 10^-6.
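To make the relationship between MAF, odds ratio, and sample size concrete, here is a minimal sketch using the standard normal-approximation formula for comparing two allele frequencies. It is not necessarily the method behind the figure; it assumes equal numbers of cases and controls, a two-sided allelic test, and that the MAF given is the control frequency. The function names are hypothetical.

```python
from math import sqrt, ceil
from statistics import NormalDist

def case_allele_freq(maf, odds_ratio):
    """Risk-allele frequency in cases implied by an allelic odds ratio,
    treating `maf` as the frequency in controls."""
    return odds_ratio * maf / (1.0 - maf + odds_ratio * maf)

def cases_needed(maf, odds_ratio, alpha=1e-6, power=0.80):
    """Approximate number of cases (with an equal number of controls) for a
    two-sided test of two allele proportions; each person contributes 2 alleles."""
    p0 = maf
    p1 = case_allele_freq(maf, odds_ratio)
    p_bar = (p0 + p1) / 2.0
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2.0 * p_bar * (1.0 - p_bar))
                 + z_beta * sqrt(p0 * (1.0 - p0) + p1 * (1.0 - p1))) ** 2
    alleles_per_group = numerator / (p1 - p0) ** 2
    return ceil(alleles_per_group / 2.0)

# Same odds ratios as in the figure; the MAF grid is chosen for illustration.
for odds_ratio in (1.2, 1.3, 1.5, 2.0):
    sizes = {maf: cases_needed(maf, odds_ratio) for maf in (0.01, 0.05, 0.2, 0.5)}
    print(odds_ratio, sizes)
```

Under these assumptions the required sample size climbs steeply as MAF falls or as the odds ratio approaches 1, which is the qualitative pattern the figure describes.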
SNP Replication Rate
• GWAS replication rate is low: about 5%
• This suggests that GWAS produce a
considerable number of false discoveries
• Developing methods to predict
which SNPs will be replicated and which
will not is important, and is also relevant for sequencing studies
• We evaluated this approach using the
results of published GWASs
Evaluating SNP replications
• Retrieved all data from the Catalog of Published GWA
Studies (http://www.genome.gov/26525384/), Sep 13
• Restricted analysis to SNPs mapping to a single gene
and associated with a disease rather than variation
• SNPs related to 106 diseases were studied: 2659 SNPs
from 512 studies
• SNP findings were sorted by date, with the first report
denoted as the discovery and subsequent studies treated as possible
replications
• Reproducibility score modeled as the ratio of
successful replications to the total number of
subsequent studies.
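The reproducibility score described above can be sketched as follows. The record layout (a date plus a boolean flag for a successful replication) is an assumption for illustration; the slides do not state how a successful replication was defined.

```python
from datetime import date

def reproducibility_score(reports):
    """Score for one SNP-disease pair.

    `reports` is a list of (report_date, replicated) tuples (hypothetical
    layout). The earliest report is the discovery, so its flag is ignored.
    """
    ordered = sorted(reports, key=lambda r: r[0])
    follow_ups = ordered[1:]           # everything after the discovery
    if not follow_ups:
        return None                    # no subsequent studies: score undefined
    successes = sum(1 for _, replicated in follow_ups if replicated)
    return successes / len(follow_ups)

# Made-up example: one discovery followed by three later studies, two of which replicate.
example = [
    (date(2010, 6, 1), True),
    (date(2008, 3, 15), True),   # earliest report, treated as the discovery
    (date(2011, 1, 20), False),
    (date(2012, 9, 5), True),
]
score = reproducibility_score(example)
print(score)
```

Returning `None` when there are no follow-up studies keeps never-retested SNPs separate from SNPs that failed every replication attempt (score 0).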
Characteristics Studied
Name; Description; Source of the data
Conservation index; The level of evolutionary conservation of the protein based on the most distant homolog of the human gene; NCBI HomoloGene database: http://www.ncbi.nlm.nih.gov/homologene
eQTL; SNP reported as an eQTL for HapMap data; eQTL SNPs identified in lymphoblastic cell lines from HapMap project [16]
Gene Size; Size of the gene region in nucleotides; NCBI RefSeq database: http://www.ncbi.nlm.nih.gov/refseq/
Growth factor; The protein encoded by the linked gene is a growth factor; Gene Ontology (GO) database http://geneontology.org/
Kinase; The protein encoded by the linked gene is a kinase; Gene Ontology (GO) database http://geneontology.org/
-Log(P); Minus log(P), where P is the P-value reported in CPG; Catalog of Published GWAS (CPG): http://www.genome.gov/26525384
MAF; Minor allele frequency; Catalog of Published GWAS (CPG): http://www.genome.gov/26525384 . The MAF reported in the control group was used.
Nuclear Localization; The protein encoded by the linked gene is localized in the nucleus; Gene Ontology (GO) database http://geneontology.org/
OMIM; Whether the associated gene is in OMIM; http://www.ncbi.nlm.nih.gov/omim
Plasma Membrane; The protein encoded by the linked gene is localized in the plasma membrane; Gene Ontology (GO) database http://geneontology.org/
Receptor; The protein encoded by the linked gene is a receptor; Gene Ontology (GO) database http://geneontology.org/
SNP type; Type of the SNP. For the details see materials and methods; Catalog of the Published GWAS (CPG): http://www.genome.gov/26525384
Tissue specific; The expression of the linked gene is tissue specific; Tissue specific Gene Expression and Regulation (TiGER) database: http://bioinfo.wilmer.jhu.edu/tiger/
Transcription factor; The protein encoded by the linked gene is a transcription factor; Gene Ontology (GO) database http://geneontology.org/
SNP types: 3' UTR, 3' Downstream, 5' Upstream, 5' UTR, Coding nonsynonymous, Coding synonymous,
Intergenic, Intronic, Non-coding, and Non-coding intronic.
Univariate Test Results
Predictor; Test; Statistics; P-value
Conservation Index; Spearman correlation; Spearman R = 0.09; 0.0007
eQTL; Mann-Whitney U test; Z adjusted = -5.39; 1.90E-07
Gene Size; Spearman correlation; Spearman R = -0.11; 0.00001
Growth Factor; Mann-Whitney U test; Z adjusted = -2.79; 0.008
Kinase; Mann-Whitney U test; Z adjusted = -7.84; 2.40E-14
-Log(P); Spearman correlation; Spearman R = 0.36; 2.30E-08
MAF; Spearman correlation; Spearman R = 0.09; 0.0007
Nuclear Localization; Mann-Whitney U test; Z adjusted = -7.18; 2.20E-12
OMIM; Mann-Whitney U test; Z adjusted = 8.16; 2.20E-15
Plasma Membrane; Mann-Whitney U test; Z adjusted = -6.23; 1.80E-09
Receptor; Mann-Whitney U test; Z adjusted = -5.17; 8.90E-07
SNP Type; Kruskal-Wallis (KW) test; Chi-Square = 391.8, df = 9; 5.50E-79
Tissue Specific; Mann-Whitney U test; Z adjusted = -3.97; 0.0001
Transcription Factor; Mann-Whitney U test; Z adjusted = 3.32; 0.008
Further evaluating SNP Type
• Grouping criteria: (1) all pairwise comparisons within a group should
be nonsignificant; and (2) all pairwise comparisons
between the groups should be significant.
• SNP reproducibility was lowest in group 1 (5' UTR,
Intergenic, 5' Upstream, Non-coding intronic,
Intronic), intermediate in group 2 (Coding
synonymous, 3' Downstream, Non-coding), and
highest in group 3 (Coding nonsynonymous,
5' UTR).
• 43 SNPs had >1 annotation
Univariate Predictors of Replication
Multivariable Analysis of Predictors
Characteristic; B; se(B); p-value
SNP_TYPE; 0.034687; 0.006443; 7.31E-08
Pvalue_mlog; 0.000936; 0.000187; 5.34E-07
5_MAF_Groups; 0.005509; 0.001293; 2.03E-05
OMIM; 0.021424; 0.007765; 0.0058
Nuclear localization; 0.025213; 0.008232; 0.002192
Kinases; -0.01213; 0.014694; 0.409105
Conservation index; 0.052759; 0.04749; 0.266581
Growth factor; -0.01789; 0.020355; 0.379
Tissue_specific; -0.00375; 0.007339; 0.609778
eQTL HapMap; -0.00122; 0.007148; 0.864
Receptors; 0.019558; 0.012282; 0.111304
Transcription factors; -0.02182; 0.012491; 0.080629
Gene Size; 0.011653; 0.007745; 0.132435
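The slides do not say how the p-values in the multivariable analysis were obtained. As a sketch, and assuming a two-sided Wald z-test on each coefficient, a p-value can be derived from B and se(B) as below; the SNP_TYPE and OMIM values are taken from the multivariable results, and the function name is hypothetical.

```python
from math import erfc, sqrt

def wald_p_value(b, se):
    """Two-sided p-value for the null hypothesis B = 0, assuming a Wald z-test.

    Uses erfc, which stays accurate in the far tail:
    2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    """
    z = b / se
    return z, erfc(abs(z) / sqrt(2.0))

# B and se(B) reported for SNP_TYPE and OMIM.
z_snp, p_snp = wald_p_value(0.034687, 0.006443)
z_omim, p_omim = wald_p_value(0.021424, 0.007765)
print(f"SNP_TYPE: z = {z_snp:.2f}, p = {p_snp:.2e}")
print(f"OMIM:     z = {z_omim:.2f}, p = {p_omim:.4f}")
```

Under this assumption the computed values come out close to the reported 7.31E-08 and 0.0058; small differences would be expected from rounding of the published B and se(B).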
Interactions among factors
Identification of Causal Variants
• Showing that a variant has a causal effect on a disease process
is challenging
• A good candidate for the causal variant should be linked to a
relevant biological mechanism. For example, it may
affect gene expression or protein structure/folding.
• Functional analysis is needed to prove that (1) the SNP changes
an important biological function, and that (2) the alteration
of function is associated with disease risk.
Conclusions
• Multivariable analysis showed that 11% of variability
was explained by SNP predictors, with 5% attributed to the
-log(P) value alone.
• SNP characteristics are the second most prominent,
followed by MAF (in the wrong direction for weighting)
• Could also define profile measurement
[Figure: Number of GWA Studies Evaluated. Axes: Number of Replicates (1 to 18) and Number of Different Diseases Studied (0 to 35).]
Selecting SNPs for Validation
• Selecting SNPs for the validation stage is tricky
• The common approach is to rank SNPs
by P-value from the discovery stage and select
the top SNPs
• People also take the region into account: SNPs located
in a region detected by linkage analysis, or associated
with genes from a candidate pathway, are considered
good candidates for replication
Are GWAS-Detected SNPs Tagging or Causal Variants?
• Even large genotyping studies (e.g. genotyping 1,000,000
SNPs) cover only a fraction of the genetic variation in the
human genome: there are more than 70,000,000 SNPs in
the human genome
• A causal variant may or may not be on a genotyping
platform but is likely to be captured by sequencing
studies
• Short repeats are not detectable
• Some regions of the genome are not yet mapped
• Inversions are difficult to identify
Improving power to detect
uncommon variants
• Single variant as disease locus*
• Hardy-Weinberg equilibrium in controls
• Affection status modeled by a penetrance table
• Case-control samples
• High-risk allele A with frequency p
• Low-risk allele B with frequency q
• Power calculation
• Actual sample size needed for genotype-phenotype
relationship to risk
Peng et al. Hum Genet. 2010 Jun;127(6):699-704.
*Multiple variants in a region can be modeled using burden tests
Power = 0.8; relative risk of A: 4; disease prevalence: 0.05; disease allele frequency:
p = 0.01 (MAF 0.01); significance 1 x 10^-7
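The single-locus setup above can be sketched numerically. The example builds a penetrance table and derives genotype frequencies in cases and controls for the stated parameters (relative risk 4, prevalence 0.05, p = 0.01). Two assumptions are made that the slide does not specify: Hardy-Weinberg proportions are applied to the general population (at 5% prevalence this is close to applying them to controls), and risk is multiplicative per copy of allele A. Function names are hypothetical and this is not the Peng et al. implementation.

```python
def genotype_freqs_by_status(p, prevalence, relative_risk):
    """Penetrances and genotype frequencies (BB, AB, AA) in cases and controls.

    Assumes Hardy-Weinberg proportions in the general population and a
    multiplicative model: risk is multiplied by `relative_risk` per copy of A.
    """
    q = 1.0 - p
    population = [q * q, 2.0 * p * q, p * p]               # BB, AB, AA under HWE
    risk_ratio = [1.0, relative_risk, relative_risk ** 2]
    # Baseline penetrance chosen so the average risk equals the prevalence.
    baseline = prevalence / sum(f * r for f, r in zip(population, risk_ratio))
    penetrance = [baseline * r for r in risk_ratio]
    if max(penetrance) > 1.0:
        raise ValueError("penetrance exceeds 1; parameters are inconsistent")
    cases = [f * w / prevalence for f, w in zip(population, penetrance)]
    controls = [f * (1.0 - w) / (1.0 - prevalence)
                for f, w in zip(population, penetrance)]
    return penetrance, cases, controls

def allele_a_freq(genotype_freqs):
    """Frequency of allele A from (BB, AB, AA) genotype frequencies."""
    bb, ab, aa = genotype_freqs
    return aa + ab / 2.0

penetrance, cases, controls = genotype_freqs_by_status(
    p=0.01, prevalence=0.05, relative_risk=4.0)
print("penetrance (BB, AB, AA):", penetrance)
print("A frequency in cases:   ", allele_a_freq(cases))
print("A frequency in controls:", allele_a_freq(controls))
```

The difference between the case and control allele frequencies is the quantity a power calculation then works from; a dominant or recessive model would change only the `risk_ratio` list.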
GAME-ON Integrative Projects
Discovery, Expansion,
and Replication
• Find new associations through pooled or
meta-analysis
• Independent replication studies to
confirm genotype-phenotype
associations
• Fine-mapping of association signals
Biological Studies
• Identify risk-enhancing genetic variants
• Examine functional consequences of a
genetic variant
• Determine biological mechanisms of risk
enhancement
Epidemiologic Studies
• Evaluate gene-gene and gene-environment interactions
• Assess penetrance and population
attributable risk
• Develop complex risk models
• Evaluate clinical/analytic validity of risk
models in observational studies
GAME-ON Organizational Structure
• U19s based on cancer sites: Lung, Prostate, Breast, Colon, Ovarian
• External Advisory Committee
• Executive Committee (decisional body): 5 P.I.s, NCI Program Officers
• Steering Committee (policies and processes): 2 voting members from each U19, Working Group Chairs, NCI Program Officers
• Working Groups in areas of interest: Next Generation Genomics Technologies; Analytic and Risk Modeling; Functional Assays; Epigenetics; Epidemiology and Clinical
TERT-CLPTM1L
Clinical and Epidemiological
Working Group
• Established July 2010; Chairs – Roz Eeles, Rayjean Hung
• Major initiatives
‒ Pan-cancer meta-analysis across sites,
‒ Inflammation pathway,
‒ pleiotropy analysis (with PAGE)
• Inventory of data and biospecimens
‒ 96,205 cases/241,880 controls
‒ whole blood derived DNA for 70,585 cases and 81,673 controls
‒ 3,317 fresh frozen tumors
‒ 14,150 FFPE tumors
• Virtual pathology network
• Managing data harmonization across sites
Analytical and Risk Modeling
Working Group
• Established July 2010, Chair – Peter Kraft
• Held regular webinars about innovative methods
• Provided leadership for integration of data from
HapMap 2
• Provided guidelines for ongoing 1000 Genomes
imputation
• Developed and implemented novel meta-analytical
methods
• Evaluated approaches for detecting gene x environment
effects
Sporadic Lung Cancer
Genome Wide Chr 5p Association
Wang, et al. Nature Genetics
40(12): 1407–1409 (2008)
There is Marked Heterogeneity in
Associations for Chromosome 5p region
Evaluation of 5p Genes as Potential Lung Tumorigenesis Modifiers
in an shRNA/KRAS^LSL-G12D/+ Mouse Model
Loss of CLPTM1L sensitizes lung tumor cells to
genotoxic agents
Prevention and Cessation
The Public Health Mission
Initiation
Cigarette Use
Nicotine Dependence
Cessation and Remission
Chromosome 15q25 Is Important for Smoking
CHRNA5-A3-B4
The Tobacco and Genetics Consortium (2010) Nature Genetics
Genetics of Smoking Cessation
• Though the strong genetic effect of the α5
nicotinic receptor contributes to heavy smoking,
little to no effect is seen for smoking cessation.
Survival Analysis
Smoking Cessation N=1,015
University of Wisconsin
Timothy Baker and Colleagues
Chen et al., 2012
Genetic Effect Seen in the Placebo Group
UW-TTURC study. Panels: Entire Sample (N=1015), Placebo group (N=117), Treatment group (N=898)
Haplotypes based on 2 SNPs (rs16969968, rs680244): H1=GC (20.8%), H2=GT (43.7%), H3=AC (35.5%)
Chen et al., 2012
A Significant Genotype by Treatment Interaction
[Figure: OR (Abstinence) by haplotype for the Placebo and Treatment groups. Placebo: H1 1.00 (reference), H2 0.62, H3 0.37. Treatment: H1 0.98, H2 1.11, H3 1.13.]
Haplotypes (rs16969968, rs680244): H1=GC (20.8%), H2=GT (43.7%), H3=AC (35.5%)
Chen et al., 2012
Haplotypes predict abstinence in individuals receiving
placebo medication
[Figure: OR (Abstinence) by haplotype, Placebo bars labeled: H1 1.00 (reference), H2 0.62, H3 0.37.]
Chen et al., 2012
A Significant Genotype by Treatment Interaction
(X2=8.97, df=2, p=0.
[Figure: OR (Abstinence) by haplotype (H1, H2, H3) for the Placebo and Treatment groups; values shown: 1.13, 1.11, 1.00 (reference), 0.98, 0.62, 0.37.]
Chen et al., 2012
Haplotypes do NOT predict abstinence in individuals
receiving pharmacologic treatment
[Figure: OR (Abstinence) by haplotype (H1, H2, H3), Placebo and Treatment groups; values shown: 1.11, 1.13, 1.00 (reference), 0.98.]
Chen et al., 2012
Smokers with the high risk haplotype are more likely to
respond to pharmacologic treatment
[Figure: OR (Abstinence) by haplotype (H1, H2, H3), Placebo and Treatment groups; values shown: 1.13, 1.00 (reference), 0.37.]
Chen et al., 2012
Smokers with the low risk haplotype do NOT benefit
from pharmacologic treatment
[Figure: OR (Abstinence) by haplotype (H1, H2, H3), Placebo and Treatment groups; values shown: 1.00 (reference), 0.98.]
Chen et al., 2012
The genotype by treatment effect does not differ across
pharmacologic treatments.
[Figure: Abstinence (Proportion), 0.00 to 1.00, by haplotype (H1, H2, H3) for each medication group: Placebo, Bupropion only, NRT only, Combined.]
Chen et al., 2012
No difference in haplotypic risks on
cessation across medication groups
Conclusions
• GWAS have been highly productive in
identifying the genetic architecture of common
influences on complex diseases
• Evaluating the actual functional effects of identified
variants can be challenging; relatively few causal
variants are directly identified
• Follow-up studies have identified significant effects of
genetic loci on cancer development
• Genetic loci identified by GWAS analyses have
provided insights into cancer prevention
• Sequencing is useful but expensive; targeted
applications are best
Oncochip - $49/sample
Site; U19; Other; Sources; CIDR; Genotype
Lung; 7,000; 0; -; 40,000; Toronto, China
Ovary; 5,000; 0; -; 40,000; Cambridge
Colorectal; 5,000; 0; -; 40,000; USC
Breast; 6,000; 77,000; Genome Canada, EU, CR-UK, DKFZ, BCFR, MDEIE; 40,000; Cambridge, Genome Quebec
Prostate; 45,000; 46,000; NCI, BH, Movember; 40,000; Cambridge, USC, NHGRI
BRCA1/2 Carriers; 0; 20,000; -; 10,000; Genome Quebec, Cambridge
TOTAL; 68,000; 143,000; -; 210,000; -