Ivan P. Gorlov, Olga Y. Gorlova & Christopher I. Amos

Colon cancer results from genetic alterations in multiple genes
• Inherited mutations in the APC gene dramatically increase the risk of colon cancer
• Cancer development reflects early germline susceptibility as well as subsequent mutations that promote cancer (Lu et al., 2014)

The continuum of variation
• Linkage, sequencing, association
• Chung et al. Carcinogenesis 31:111-120, 2010

SNPs are the most common type of polymorphism in the human genome
• SNPs occur roughly every 400 bases
• Minor allele frequency (MAF) describes the prevalence of the rarer variant and lies in (0, 0.5]
• Very few (<0.01%) SNPs are associated with diseases
• SNPs are the bread and butter of genome-wide association studies (GWAS)
• Success requires inference about effects from causal variants

GWAS Have Been Highly Successful in Finding Novel Associations
[Figure: Manhattan plot of all lung cancers from 1000 Genomes imputation; labeled loci: CHRNA5, hTERT, BRCA2, TP63, hMSH5, CHEK2]

There are multiple signals on Chr 15q25 - sequencing of 96 people
Location | Position | Change | rsID | Region | Effect
Exon 4 | 76,667,807 | GG->GA | rs2229961 | exon region | Val134->Ile134
Exon 5 | 76,669,980 | GG->AA | rs16969968 | exon region | Asp398->Asn398 (negatively charged to neutral)
 | 76,669,876 | TT->TA | | exon region | Leu363->Glu363 (hydrophobic to hydrophilic)
 | 76,669,781 | CC->CT | | exon region | Thr331->Thr331
 | 76,669,288 | AA->GA | | exon region | Lys167->Arg167
 | 76,669,665-6 | AC deletion | | exon region | Reading frame shift after codon 292
Intron 4 | 76668145 | | | | Observed in 4 cases
• Additional variation (small microsatellites and in/dels) in the promoters of both CHRNA5 and CHRNA3

Slightly deleterious SNPs (sdSNPs) are deleterious enough to impair gene function and increase disease risk. Their effects, however, are not strong enough for selection to eliminate them from the population (genetic drift, founder effects, and population bottlenecks are factors that help retain sdSNPs).
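The minor allele frequency defined earlier in this section (the prevalence of the rarer variant, in (0, 0.5]) can be illustrated with a minimal Python sketch. The function names and the example genotype counts are hypothetical and are not taken from the slides.

```python
from typing import Tuple


def allele_counts_from_genotypes(n_aa: int, n_ab: int, n_bb: int) -> Tuple[int, int]:
    """Convert genotype counts (AA, AB, BB) into allele counts (A, B)."""
    # Each homozygote carries two copies of its allele, each heterozygote one of each.
    return 2 * n_aa + n_ab, 2 * n_bb + n_ab


def minor_allele_frequency(count_a: int, count_b: int) -> float:
    """Return the frequency of the rarer allele at a biallelic SNP."""
    total = count_a + count_b
    if total == 0:
        raise ValueError("no alleles observed")
    # A polymorphic site gives a value in (0, 0.5]; a monomorphic site gives 0.
    return min(count_a, count_b) / total


# Hypothetical sample: 10 AA, 40 AB, 50 BB individuals.
a, b = allele_counts_from_genotypes(10, 40, 50)
maf = minor_allele_frequency(a, b)
```

Taking the minimum of the two allele counts keeps the result independent of which allele is labeled A.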
Proportions of functional SNPs in different MAF categories
• We estimated the proportion of nonsynonymous SNPs predicted to be functional by PolyPhen and SIFT in different MAF categories
[Figure panels: PolyPhen, SIFT]

The Erudite Hypothesis: a large fraction of genetic susceptibility is influenced by rare (<5%) variants with relatively strong effect sizes
• Targeting rarer variants for analysis may detect causal variants when the sample size is large
• The majority of the significant SNPs detected by GWAS tag untyped causal variants; this greatly reduces power but indicates regions for further study
• Brute-force analysis will not be sufficient to overcome the power loss from multiple comparisons; analyses have to upweight the most likely causal variants

MAF and Required Sample Size
[Figure: the numbers of cases and controls required in an association study to detect disease variants with allelic odds ratios of 1.2 (red), 1.3 (blue), 1.5 (yellow), and 2 (black). Numbers shown are for a statistical power of 80% at a significance level of P < 10^-6.]

SNP Replication Rate
• The GWAS replication rate is low: about 5%
• This suggests that GWAS produce a considerable number of false discoveries
• Developing methods to predict which SNPs will be replicated and which will not is important, and relevant for sequencing
• We evaluated this approach using the results of published GWASs

Evaluating SNP replications
• Retrieved all data from the Catalog of Published GWA Studies (http://www.genome.gov/26525384/), Sep 13
• Restricted analysis to SNPs mapping to a single gene and associated with a disease rather than variation
• SNPs related to 106 diseases were studied: 2659 SNPs from 512 studies
• SNP findings were sorted by date, with the first report denoted as the discovery and subsequent studies treated as possible replications
• The reproducibility score was modeled as the ratio of successful replications to the total number of subsequent studies
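The reproducibility score described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Report` record and its `replicated` flag are hypothetical, and the slides do not say how a "successful" replication was decided.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Report:
    """One published report of a SNP-disease association (hypothetical record)."""
    published: date
    replicated: bool  # assumed flag: did this study reproduce the association?


def reproducibility_score(reports: List[Report]) -> Optional[float]:
    """Ratio of successful replications to the number of subsequent studies."""
    ordered = sorted(reports, key=lambda r: r.published)
    follow_ups = ordered[1:]  # the earliest report is treated as the discovery
    if not follow_ups:
        return None  # no subsequent studies, so the score is undefined
    return sum(r.replicated for r in follow_ups) / len(follow_ups)


# Made-up example: one discovery followed by two later studies, one successful.
example = [
    Report(date(2010, 5, 1), True),
    Report(date(2008, 3, 1), True),   # earliest report, i.e. the discovery
    Report(date(2011, 9, 1), False),
]
score = reproducibility_score(example)
```

Returning `None` when there are no follow-up studies avoids treating a never-retested SNP as if it had failed to replicate.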
Characteristics Studied (name: description; source of the data)
• Conservation index: the level of evolutionary conservation of the protein, based on the most distant homolog of the human gene; NCBI HomoloGene database: http://www.ncbi.nlm.nih.gov/homologene
• eQTL: SNP reported as an eQTL for HapMap data; eQTL SNPs identified in lymphoblastic cell lines from the HapMap project [16]
• Gene Size: size of the gene region in nucleotides; NCBI RefSeq database: http://www.ncbi.nlm.nih.gov/refseq/
• Growth factor: the protein encoded by the linked gene is a growth factor; Gene Ontology (GO) database http://geneontology.org/
• Kinase: the protein encoded by the linked gene is a kinase; Gene Ontology (GO) database http://geneontology.org/
• -Log(P): minus log(P), where P is the P-value reported in CPG; Catalog of Published GWAS (CPG): http://www.genome.gov/26525384
• MAF: minor allele frequency; Catalog of Published GWAS (CPG): http://www.genome.gov/26525384. The MAF reported in the control group was used.
• Nuclear Localization: the protein encoded by the linked gene is localized in the nucleus; Gene Ontology (GO) database http://geneontology.org/
• OMIM: whether the associated gene is in OMIM; http://www.ncbi.nlm.nih.gov/omim
• Plasma Membrane: the protein encoded by the linked gene is localized in the plasma membrane; Gene Ontology (GO) database http://geneontology.org/
• Receptor: the protein encoded by the linked gene is a receptor; Gene Ontology (GO) database http://geneontology.org/
• SNP type: type of the SNP (for details see materials and methods); Catalog of Published GWAS (CPG): http://www.genome.gov/26525384
• Tissue specific: the expression of the linked gene is tissue specific; Tissue specific Gene Expression and Regulation (TiGER) database: http://bioinfo.wilmer.jhu.edu/tiger/
• Transcription factor: the protein encoded by the linked gene is a transcription factor; Gene Ontology (GO) database http://geneontology.org/
SNP types: 3' UTR, 3' Downstream, 5' Upstream, 5' UTR, Coding nonsynonymous, Coding synonymous, Intergenic, Intronic, Non-coding, and Non-coding intronic.

Univariate Test Results
Predictor | Test | Statistic | P-value
Conservation Index | Spearman correlation | Spearman R = 0.09 | 0.0007
eQTL | Mann-Whitney U test | Z adjusted = -5.39 | 1.90E-07
Gene Size | Spearman correlation | Spearman R = -0.11 | 0.00001
Growth Factor | Mann-Whitney U test | Z adjusted = -2.79 | 0.008
Kinase | Mann-Whitney U test | Z adjusted = -7.84 | 2.40E-14
-Log(P) | Spearman correlation | Spearman R = 0.36 | 2.30E-08
MAF | Spearman correlation | Spearman R = 0.09 | 0.0007
Nuclear Localization | Mann-Whitney U test | Z adjusted = -7.18 | 2.20E-12
OMIM | Mann-Whitney U test | Z adjusted = 8.16 | 2.20E-15
Plasma Membrane | Mann-Whitney U test | Z adjusted = -6.23 | 1.80E-09
Receptor | Mann-Whitney U test | Z adjusted = -5.17 | 8.90E-07
SNP Type | Kruskal-Wallis (KW) test | Chi-Square = 391.8, df = 9 | 5.50E-79
Tissue Specific | Mann-Whitney U test | Z adjusted = -3.97 | 0.0001
Transcription Factor | Mann-Whitney U test | Z adjusted = 3.32 | 0.008

Further evaluating SNP Type
Criteria for grouping SNP types: (1) all pairwise comparisons inside a group should be insignificant; and (2) all pairwise comparisons between groups should be significant.
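The two grouping criteria above can be checked mechanically once pairwise P-values between SNP types are available. The sketch below is illustrative only: the 0.05 threshold, the function name, and the toy P-values are assumptions, not values from the study.

```python
from itertools import combinations, product
from typing import Dict, FrozenSet, List


def grouping_is_consistent(
    groups: List[List[str]],
    pairwise_p: Dict[FrozenSet[str], float],
    alpha: float = 0.05,  # assumed significance threshold
) -> bool:
    """Check criteria (1) and (2) for a candidate grouping of SNP types."""

    def p(a: str, b: str) -> float:
        return pairwise_p[frozenset((a, b))]

    # (1) every comparison inside a group must be non-significant
    within_ok = all(
        p(a, b) >= alpha for g in groups for a, b in combinations(g, 2)
    )
    # (2) every comparison between members of different groups must be significant
    between_ok = all(
        p(a, b) < alpha
        for g1, g2 in combinations(groups, 2)
        for a, b in product(g1, g2)
    )
    return within_ok and between_ok


# Toy P-values for three hypothetical SNP types x, y, z.
toy_p = {
    frozenset(("x", "y")): 0.60,
    frozenset(("x", "z")): 0.001,
    frozenset(("y", "z")): 0.002,
}
ok = grouping_is_consistent([["x", "y"], ["z"]], toy_p)
bad = grouping_is_consistent([["x"], ["y", "z"]], toy_p)
```

In practice the pairwise P-values would come from a two-sample test such as the Mann-Whitney U test used elsewhere in this analysis.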
SNP reproducibility was lowest in group 1 (5' UTR, Intergenic, 5' Upstream, Non-coding intronic, Intronic), intermediate in group 2 (Coding synonymous, 3' Downstream, Non-coding), and highest in group 3 (Coding nonsynonymous, 5' UTR).
• 43 SNPs had >1 annotation

Univariate Predictors of Replication

Multivariable Analysis of Predictors
Characteristic | B | se(B) | p-value
SNP_TYPE | 0.034687 | 0.006443 | 7.31E-08
Pvalue_mlog | 0.000936 | 0.000187 | 5.34E-07
5_MAF_Groups | 0.005509 | 0.001293 | 2.03E-05
OMIM | 0.021424 | 0.007765 | 0.0058
Nuclear localization | 0.025213 | 0.008232 | 0.002192
Kinases | -0.01213 | 0.014694 | 0.409105
Conservation index | 0.052759 | 0.04749 | 0.266581
Growth factor | -0.01789 | 0.020355 | 0.379
Tissue_specific | -0.00375 | 0.007339 | 0.609778
eQTL HapMap | -0.00122 | 0.007148 | 0.864
Receptors | 0.019558 | 0.012282 | 0.111304
Transcription factors | -0.02182 | 0.012491 | 0.080629
Gene Size | 0.011653 | 0.007745 | 0.132435

Interactions among factors

Identification of Causal Variants
• Showing that a variant has a causal effect on a disease process is challenging
• A good candidate for the causal variant should be linked to a relevant biological mechanism; for example, it may affect gene expression or protein structure/folding
• Functional analysis is needed to prove that (1) the SNP changes an important biological function, and (2) the alteration of function is associated with disease risk

Conclusions
• Multivariable analysis showed that 11% of variability was explained by SNP predictors, with 5% attributed to the -log(P) value alone
• SNP characteristics are second most prominent, followed by MAF (in the wrong direction for weighting)
• Could also define profile measurement

[Figure: Number of GWA Studies Evaluated; axes: Number of Different Diseases Studied (0 to 35) by Number of Replicates (1 to 18)]

Selecting SNPs for Validation
• Selecting SNPs for the validation stage is tricky
• The common approach is to rank SNPs by their P-values from the discovery stage and select the top SNPs
• People also take the region into account: SNPs located in a region detected by linkage analysis, or associated with genes from a candidate pathway, are considered good candidates for replication

Are GWAS-Detected SNPs Tagging or Causal Variants?
• Even large genotyping studies (e.g. genotyping 1,000,000 SNPs) cover only a fraction of the genetic variation in the human genome: there are more than 70,000,000 SNPs in the human genome
• The causal variant may or may not be on a genotyping platform but is likely to be captured by sequencing studies
• Short repeats are not detectable
• Some regions of the genome are not yet mapped
• Inversions are difficult to identify

Improving power to detect uncommon variants
• Single variant as disease locus*
• Hardy-Weinberg equilibrium in controls
• Affection status modeled by a penetrance table
• Case-control samples
• High-risk allele A with frequency p
• Low-risk allele B with frequency q
• Power calculation: actual sample size needed for genotype-phenotype relationship to risk
Peng et al. Hum Genet. 2010 Jun;127(6):699-704.
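The single-variant model listed above (a penetrance table, a high-risk allele A with frequency p, and a case-control design) can be made concrete with a rough Python sketch. It is not the method of Peng et al.; it rests on assumptions added here: a multiplicative relative-risk model, Hardy-Weinberg equilibrium imposed on the whole population rather than on controls only, and a standard two-proportion normal approximation on allele counts. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def case_control_allele_freqs(p, prevalence, rr):
    """Frequency of allele A in cases and in controls.

    Assumes HWE in the general population and a multiplicative model:
    genotype relative risks rr*rr (AA), rr (AB), 1 (BB).
    """
    q = 1.0 - p
    genotype_freq = [p * p, 2 * p * q, q * q]  # AA, AB, BB
    relative_risk = [rr * rr, rr, 1.0]
    # Scale penetrances so that the population prevalence is matched.
    baseline = prevalence / sum(g * r for g, r in zip(genotype_freq, relative_risk))
    penetrance = [baseline * r for r in relative_risk]
    if max(penetrance) > 1.0:
        raise ValueError("penetrance exceeds 1; parameters are inconsistent")
    # Bayes' rule gives genotype frequencies conditional on disease status.
    in_cases = [g * f / prevalence for g, f in zip(genotype_freq, penetrance)]
    in_controls = [
        g * (1 - f) / (1 - prevalence) for g, f in zip(genotype_freq, penetrance)
    ]
    p_case = in_cases[0] + in_cases[1] / 2
    p_control = in_controls[0] + in_controls[1] / 2
    return p_case, p_control


def subjects_per_group(p_case, p_control, alpha, power):
    """Approximate cases (= controls) needed for a two-sided allelic test."""
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)
    z_beta = normal.inv_cdf(power)
    p_bar = (p_case + p_control) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_case * (1 - p_case) + p_control * (1 - p_control))
    ) ** 2
    alleles = numerator / (p_case - p_control) ** 2
    return math.ceil(alleles / 2)  # each subject contributes two alleles


# Parameter values as listed on the power slide: RR 4, prevalence 0.05, p = 0.01.
p_case, p_control = case_control_allele_freqs(p=0.01, prevalence=0.05, rr=4.0)
n = subjects_per_group(p_case, p_control, alpha=1e-7, power=0.8)
```

Because the approximation treats alleles as independent observations, the result is best read as an order-of-magnitude guide rather than a replacement for the published calculation.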
*Multiple variants in a region can be modeled using burden tests
Parameters: power = 0.8; relative risk A: 4; disease prevalence: 0.05; disease allele frequency: p = 0.01; MAF 0.01; significance 1 x 10^-7

GAME-ON Integrative Projects
Discovery, Expansion, and Replication
• Find new associations through pooled or meta-analysis
• Independent replication studies to confirm genotype-phenotype associations
• Fine-mapping of association signals
Biological Studies
• Identify risk-enhancing genetic variants
• Examine functional consequences of a genetic variant
• Determine biological mechanisms of risk enhancement
Epidemiologic Studies
• Evaluate gene-gene and gene-environment interactions
• Assess penetrance and population attributable risk
• Develop complex risk models
• Evaluate clinical/analytic validity of risk models in observational studies

GAME-ON Organizational Structure
• U19s based on cancer sites: Lung, Prostate, Breast, Colon, Ovarian
• External Advisory Committee
• Executive Committee (decisional body): 5 P.I.s, NCI Program Officers
• Steering Committee (policies and processes): 2 voting members from each U19, Working Group Chairs, NCI Program Officers
• Working Groups in areas of interest: Next Generation Genomics Technologies; Analytic and Risk Modeling; Functional Assays; Epigenetics; Epidemiology and Clinical; TERT-CLPTM1L

Clinical and Epidemiological Working Group
• Established July 2010; Chairs: Roz Eeles, Rayjean Hung
• Major initiatives
‒ Pan-cancer meta-analysis across sites
‒ Inflammation pathway
‒ Pleiotropy analysis (with PAGE)
• Inventory of data and biospecimens
‒ 96,205 cases/241,880 controls
‒ Whole-blood-derived DNA for 70,585 cases and 81,673 controls
‒ 3,317 fresh frozen tumors
‒ 14,150 FFPE tumors
• Virtual pathology network
• Managing data harmonization across sites

Analytical and Risk Modeling Working Group
• Established July 2010; Chair: Peter Kraft
• Held regular webinars about innovative methods
• Provided leadership for integration of data from HapMap 2
• Provided guidelines for ongoing 1000 Genomes imputation
• Developed and implemented novel meta-analytical methods
• Evaluated approaches for detecting gene x environment effects

Sporadic Lung Cancer Genome Wide Chr 5p Association
Wang, et al. Nature Genetics 40(12): 1407-1409 (2008)

There is Marked Heterogeneity in Associations for Chromosome 5p region

Evaluation of 5p Genes as Potential Lung Tumorigenesis Modifiers in a shRNA/KRAS^LSL-G12D/+ Mouse Model

Loss of CLPTM1L sensitizes lung tumor cells to genotoxic agents

Prevention and Cessation: The Public Health Mission
[Figure: Initiation; Cigarette Use; Nicotine Dependence; Cessation and Remission]

Chromosome 15q25 Is Important for Smoking
• CHRNA5-A3-B4
• The Tobacco and Genetics Consortium (2010) Nature Genetics

Genetics of Smoking Cessation
• Though the strong genetic effect of the α5 nicotinic receptor contributes to heavy smoking, little to no effect is seen for smoking cessation.
[Figure: Survival Analysis, Smoking Cessation, N=1,015; University of Wisconsin, Timothy Baker and colleagues; Chen et al., 2012]

Genetic Effect Seen in the Placebo Group
[Figure panels: Entire Sample (N=1015); Placebo group (N=117); Treatment group (N=898). UW-TTURC study. Haplotypes based on 2 SNPs (rs16969968, rs680244): H1=GC 20.8%, H2=GT 43.7%, H3=AC 35.5%. Chen et al., 2012]

A Significant Genotype by Treatment Interaction
[Figure: OR (abstinence) by haplotype, placebo vs. treatment; haplotypes (rs16969968, rs680244): H1=GC (20.8%), H2=GT (43.7%), H3=AC (35.5%). Placebo: H1 1.00 (reference), H2 0.62, H3 0.37. Treatment: H1 0.98, H2 1.11, H3 1.13. Chen et al., 2012]

Haplotypes predict abstinence in individuals receiving placebo medication
[Figure: OR (abstinence) by haplotype, placebo bars only: H1 1.00 (reference), H2 0.62, H3 0.37. Chen et al., 2012]

A Significant Genotype by Treatment Interaction (χ²=8.97, df=2, p=0.
[Figure: OR (abstinence) by haplotype (H1, H2, H3), placebo vs. treatment; values shown: 1.13, 1.11, 1.00 (reference), 0.98, 0.62, 0.37. Chen et al., 2012]

Haplotypes do NOT predict abstinence in individuals receiving pharmacologic treatment
[Figure: OR (abstinence) by haplotype (H1, H2, H3); values shown: 1.11, 1.13, 1.00 (reference), 0.98. Chen et al., 2012]

Smokers with the high-risk haplotype are more likely to respond to pharmacologic treatment
[Figure: OR (abstinence) by haplotype (H1, H2, H3); values shown: 1.13, 1.00 (reference), 0.37. Chen et al., 2012]

Smokers with the low-risk haplotype do NOT benefit from pharmacologic treatment
[Figure: OR (abstinence) by haplotype (H1, H2, H3); values shown: 1.00 (reference), 0.98. Chen et al., 2012]

The genotype by treatment effect does not differ across pharmacologic treatments
[Figure: abstinence (proportion) by haplotype (H1, H2, H3) across medication groups: placebo, bupropion only, NRT only, combined. Chen et al., 2012]
• No difference in haplotypic risks on cessation across medication groups

Conclusions
• GWAS have been amazingly productive in identifying new genetic architecture of common influences for complex diseases
• Evaluating the actual functional effects of identified variants can be challenging; relatively few causal variants are directly identified
• Follow-up studies have identified significant effects of genetic loci on cancer development
• Genetic loci identified by GWAS analyses have provided insights into cancer prevention
• Sequencing is useful but expensive; targeted applications are best

Oncochip - $49/sample
Site | U19 | Other Sources | CIDR | Genotype
Lung | 7,000 | 0 | 40,000 | Toronto, China
Ovary | 5,000 | 0 | 40,000 | Cambridge
Colorectal | 5,000 | 0 | 40,000 | USC
Breast | 6,000 | 77,000 (Genome Canada, EU, CR-UK, DKFZ, BCFR, MDEIE) | 40,000 | Cambridge, Genome Quebec
Prostate | 45,000 | 46,000 (NCI, BH, Movember) | 40,000 | Cambridge, USC, NHGRI
BRCA1/2 Carriers | 0 | 20,000 | 10,000 | Genome Quebec, Cambridge
TOTAL | 68,000 | 143,000 | 210,000 |