Cancer Sequencing Quality & The ICGC-TCGA DREAM Somatic Mutation Calling Challenge
Dr. Paul C. Boutros
Ontario Institute for Cancer Research
July 14, 2015

Pathway
• The Consequences of Analytical Diversity
• SMC-DNA: Challenge-Based Benchmarking
• SMC-Het & SMC-RNA

General Plan for Data-Analysis
Proteomics, genomics, metabolomics…
Data-Analysis
Results

Different Analysis; Same Conclusions?
Proteomics, genomics, metabolomics…
Data-Analysis
Results

Biomarker for NSCLC: 24 Methods
Starmans et al. Genome Med 2013

Agreement: 151/442 Patients

This holds for all tumour-types: breast cancer
• 74% of genes in 1+
• 16% of genes in 16+
• 0% of genes in 21+
Fox et al. 2014

Why Do We Need Improved Mutation Detection?
[Figure panels: SNVs; SVs. Credit: Singer Ma (UCSC)]

Pathway
• The Consequences of Analytical Diversity
• SMC-DNA: Challenge-Based Benchmarking
• SMC-Het & SMC-RNA

Why Do We Need Improved Mutation Detection?
[Figure panels: SNVs; SVs. Credit: Singer Ma (UCSC)]

DREAM Mutation-Calling Challenge
[Timeline figure: months from Nov 2013 to Nov 2014; phases labelled Competition, Validation, Winner; numbered markers 1 to 5]
In silico data:
● 5 T/N pairs
● For “play” and dry-runs
● Releases of increasing complexity
● Rapid scoring turn-around
● BAMs (Novoalign or BWA)
Real data:
● 10 T/N pairs (50x/30x)
● Two tumour-types:
   ● 5 pancreatic
   ● 5 prostate
● Lane-level FASTQs & BAMs

How Can You Get The Data?
• Register for the Challenge at Synapse
• Complete an ICGC DACO Application
• Download using Annai’s GeneTorrent
• No-cost to download
• Directly access in the Google Compute Engine (Google cloud)
• $2,000 free computing

Initial Results
• So Far:
   o 391 registrants
   o 3,260 entries on 14 genomes
• On-going post-challenge submissions as people try to understand the failures of their algorithms (a living benchmark!)
• Key discussions on scoring SVs and on improving BAMSurgeon (the simulator)

Sample Per-Tumour Summaries
[Figure panels: 100% Cellular Tumours; 80% Cellular Tumours]

No Evidence of Over-Fitting (1/2)

No Evidence of Over-Fitting (2/2)

Coding Regions Had Lower Error Rates

But Recurrent Errors in Coding Regions

Wisdom of the Crowd (Per Tumour)

BUT Parameterization is Critical

Tuning Improvements Hold Across Tumours
Are we thinking about tumour variant calling wrong?

Sequence Context: Trinucleotides

Where are we now?
• Initial SNV analysis complete
   o Ewing et al., in press, Nature Methods
• Initial SV analysis (of in silico tumours) in progress
• No-cost to download
• Experimental validation studies nearly complete

Pathway
• The Consequences of Analytical Diversity
• SMC-DNA: Challenge-Based Benchmarking
• SMC-Het & SMC-RNA

So, What About Heterogeneity?
• As part of TCGA-Prostate we were looking at normal cell contamination
• We = Svitlana Tyekucheva, Syed Haider, Massimo Loda, Francesca Demichelis
• We’d just take a consensus of estimators….

Exactly As Expected….

Opening for registration on November 10, 2014
Opening for submissions in August 2015 (ahem!)
https://www.synapse.org/#!Synapse:syn2813581

SMC Tumour Heterogeneity Challenge
[Image credit: Lcchong, Wikipedia]
Single Sample
• 50 samples
• Simulated from GIAB and a deep-sequenced normal
• Cloud-only (GCE+Galaxy): REB, distribution
• Varying complexity, mutational load, depth, etc.
• ~3 months run-time
Multi-Sample
• Sample number pending
• Similar design, though
• Cloud-based (Galaxy)
• Similar parameter ranges
• 3 months

BAMSurgeon Overview

Validating BAMSurgeon
[Figure panels: Changing Aligner; Changing Cell-Line]

How Are We Going to Simulate?
• Start with a chr-BAM
• Phase and create two ph-chr-BAM [diagram labels: Phase A, Phase B]
• Extract reads for normal & contamination [diagram labels: Contaminating Normal, Sub-clone A, Sub-clone B]
• Spike SNVs, CNAs, GRs

How Are We Going to Simulate?
Final BAM
• CNA Calls: Battenberg, TITAN
• SNV Calls: MuTect, Strelka
Available via Google Cloud / Docker API

Draft of Tumour Design (Not Final!)

What Are We Scoring?
1. Sub-population characteristics
   a) What is the level of normal “contamination”?
   b) How many sub-populations are present?
   c) What are their proportions?
2. What is the phylogenetic order of sub-populations?
3. For each mutation, which sub-populations is it in?

Pathway
• The Consequences of Analytical Diversity
• SMC-DNA: Challenge-Based Benchmarking
• SMC-Het & SMC-RNA

CPC-GENE: The People Involved
Dr. Robert Bristow, Dr. Theodore van der Kwast
PIs & PMs
Genomics
Michael Fraser, Melania Pintilie, Neil Fleshner, Lakshmi Muthuswamy, Colin Collins, Thomas Hudson, Lincoln Stein, Taryne Chong, Andrew Brown, Michelle Sam, Jeremy Johns, Lee Timms, Nicholas Buchner, Ada Wong
Informatics
Clinico-Molecular
Timothy Beck, Fouad Yousif, Robert Denroche, Xuemei Luo, Dominique Trudel, Alice Meng, Gaetano Zafarana
Dr. John McPherson
Boutros Lab
Richard de Borja, Nicholas Harding, Pablo Hennings-Yeomans, Emilie Lalonde, Amin Zia, Jianxin Wang, Francis Nguyen, Natalie Fox, Michelle Chan-Seng-Yue, Lauren Chong, Takafumi Yamaguchi, Veronica Sabelnykova

SMC-DNA Organizing Team
Sage/DREAM Organizers
• Gustavo Stolovitzky
• Stephen Friend
• Adam Margolin
• Thea Norman
• Christine Suver
• Christopher Bare
• Kristen Dang
• Bruce Hoff
• Mike Kellen
External Organizers
• Paul Boutros (OICR)
• Josh Stuart (UCSC)
• Lincoln Stein (OICR)
• Kyle Ellrott (UCSC)
• Adam Ewing (UCSC)
• Anna Lee (OICR)
• Katie Houlahan (OICR)
• Cristian Caloian (OICR)
• Takafumi Yamaguchi (OICR)
Data Contributors:
Funding/Sponsoring/Publication Partners Include:

SMC-Het Organizing Team
Organizers
• Paul Boutros (OICR)
• Amit Deshwar (University of Toronto)
• Josh Stuart (UCSC)
• Minjeong Ko (OICR)
• Gustavo Stolovitzky (IBM)
• Kyle Ellrott (UCSC)
• Stephen Friend (Sage)
• Christopher Bare (Sage)
• David Wedge (Sanger)
• Kristen Dang (Sage)
• Peter Van Loo (UCL)
• Yin Hu (Sage)
• Quaid Morris (University of Toronto)
• Shannon Carter (Sage)
• Thea Norman (Sage)
Data Contributors
Funding/Sponsoring/Publication Partners Include: