Cancer Sequencing Quality & The ICGC-TCGA DREAM Somatic Mutation Calling Challenge
Dr. Paul C. Boutros
Ontario Institute for Cancer Research
July 14, 2015
1
Pathway
• The Consequences of Analytical Diversity
• SMC-DNA: Challenge-Based Benchmarking
• SMC-Het & SMC-RNA
2
General Plan for Data-Analysis
Proteomics, genomics,
metabolomics…
Data-Analysis
Results
3
Different Analysis; Same Conclusions?
Proteomics, genomics,
metabolomics…
Data-Analysis
Results
4
Biomarker for NSCLC: 24 Methods
Starmans et al.
Genome Med 2013
5
Agreement: 151/442 Patients
6
This holds for all tumour-types: breast cancer
74% of genes in 1+
16% of genes in 16+
0% of genes in 21+
Fox et al.
2014
7
Why Do We Need Improved
Mutation Detection?
SNVs
Singer Ma (UCSC)
SVs
8
Pathway
• The Consequences of Analytical Diversity
• SMC-DNA: Challenge-Based Benchmarking
• SMC-Het & SMC-RNA
9
DREAM Mutation-Calling Challenge
[Timeline figure: Nov 2013 to Nov 2014, with phases labelled Competition, Validation and Winner, and items numbered 1 to 5]
In silico data:
● 5 T/N pairs
● For “play” and dry-runs
● Releases of increasing complexity
● Rapid scoring turn-around
● BAMs (Novoalign or BWA)
Real data:
● 10 T/N pairs (50x/30x)
● Two tumour-types:
o 5 pancreatic
o 5 prostate
● Lane-level FASTQs & BAMs
11
How Can You Get The Data?
• Register for the Challenge at Synapse
• Complete an ICGC DACO Application
• Download using Annai’s GeneTorrent
• No-cost to download
• Directly access in the Google Compute Engine (Google cloud)
• $2,000 free computing
12
Initial Results
• So Far:
o 391 registrants
o 3,260 entries on 14 genomes
• On-going post-challenge submissions as
people try to understand the failures of their
algorithms (a living benchmark!)
• Key discussions on scoring SVs and on
improving BamSurgeon (the simulator)
13
Sample Per-Tumour Summaries
100% Cellular Tumours
80% Cellular Tumours
14
No Evidence of Over-Fitting (1/2)
15
No Evidence of Over-Fitting (2/2)
16
Coding Regions Had Lower Error Rates
17
But Recurrent Errors in Coding Regions
18
Wisdom of the Crowd (Per Tumour)
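One simple way to picture a "wisdom of the crowd" result is a majority vote across callers, scored against the known truth set of an in silico tumour. The sketch below is only an illustration under that assumption: the slide does not say which ensemble rule was used, and the function names, the variant tuples and the toy call sets are all hypothetical.

```python
from collections import Counter


def majority_vote(call_sets, min_votes=None):
    """Return the variants reported by at least min_votes call sets.

    By default a strict majority of the call sets is required.
    (Assumed ensemble rule, not taken from the talk.)
    """
    if min_votes is None:
        min_votes = len(call_sets) // 2 + 1
    votes = Counter(v for calls in call_sets for v in set(calls))
    return {v for v, n in votes.items() if n >= min_votes}


def precision_recall_f1(predicted, truth):
    """Score a set of predicted variants against a truth set."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: variants are (chromosome, position, ref, alt) tuples.
truth = {("chr1", 100, "A", "T"), ("chr1", 200, "C", "G"), ("chr2", 50, "G", "A")}
caller_a = {("chr1", 100, "A", "T"), ("chr1", 200, "C", "G"), ("chr3", 10, "T", "C")}
caller_b = {("chr1", 100, "A", "T"), ("chr2", 50, "G", "A"), ("chr3", 99, "A", "G")}
caller_c = {("chr1", 200, "C", "G"), ("chr2", 50, "G", "A"), ("chr4", 7, "G", "T")}

consensus = majority_vote([caller_a, caller_b, caller_c])
for name, calls in [("A", caller_a), ("B", caller_b), ("C", caller_c), ("consensus", consensus)]:
    p, r, f = precision_recall_f1(calls, truth)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

In this toy case each caller misses one true variant and adds one private false positive, so the vote recovers the full truth set. Real callers share systematic errors, which is one reason a vote can fail to help.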
19
BUT Parameterization is Critical
20
Tuning Improvements Hold Across Tumours
Are we thinking about tumour variant calling wrong?
21
Sequence Context: Trinucleotides
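As a rough illustration of what "trinucleotide context" means for an SNV, the sketch below takes the base on either side of the mutated position and reports the change on the pyrimidine strand, a common convention in mutation-signature work. The function names, the 0-based coordinate choice and the toy reference string are assumptions made for this example, not details from the talk.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string of A/C/G/T."""
    return seq.translate(COMPLEMENT)[::-1]


def trinucleotide_context(reference, pos, alt):
    """Return the context of an SNV as e.g. 'A[C>T]G'.

    reference: reference sequence as a string
    pos:       0-based index of the mutated base (assumed convention)
    alt:       the alternate base
    """
    if pos < 1 or pos > len(reference) - 2:
        raise ValueError("position needs one flanking base on each side")
    tri = reference[pos - 1:pos + 2].upper()
    alt = alt.upper()
    if tri[1] == alt:
        raise ValueError("alt base equals reference base")
    # Report purine references (A or G) on the opposite strand, so that
    # the mutated base is always a pyrimidine (C or T).
    if tri[1] in "AG":
        tri = reverse_complement(tri)
        alt = reverse_complement(alt)
    return f"{tri[0]}[{tri[1]}>{alt}]{tri[2]}"


reference = "TACGGA"
print(trinucleotide_context(reference, 2, "T"))  # C>T within ACG
print(trinucleotide_context(reference, 3, "A"))  # G>A within CGG, strand-flipped
```

Counting these labels over false positives and false negatives is one way to look for context-specific error patterns.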
22
Where are we now?
• Initial SNV analysis complete
• Ewing et al. in press Nature Methods
• Initial SV analysis (of in silico tumours) in progress
• No-cost to download
• Experimental validation studies nearly complete
23
Pathway
• The Consequences of Analytical Diversity
• SMC-DNA: Challenge-Based Benchmarking
• SMC-Het & SMC-RNA
24
So, What About Heterogeneity?
As part of TCGA-Prostate we were looking at
normal cell contamination
We = Svitlana Tyekucheva, Syed Haider,
Massimo Loda, Francesca Demichelis
We’d just take a consensus of estimators….
25
Exactly As Expected….
26
Opening for registration on
November 10, 2014
Opening for submissions on
August 2015 (ahem!)
https://www.synapse.org/#!Synapse:syn2813581
27
Lcchong, wikipedia
SMC Tumour Heterogeneity Challenge
Single Sample
• 50 samples
• Simulated from GIAB and a
deep-sequenced normal
• Cloud-only (GCE+Galaxy)
o REB, distribution
• Varying complexity,
mutational load, depth, etc.
• ~3 months run-time
Multi-Sample
• Sample number pending
• Similar design, though
• Cloud-based (Galaxy)
• Similar parameter ranges
• 3 months
28
BAMSurgeon Overview
29
Validating BAMSurgeon
Changing Aligner
Changing Cell-Line
30
How Are We Going to Simulate?
1. Start with a chr-BAM
2. Phase and create two ph-chr-BAMs (Phase A, Phase B)
3. Extract reads for normal & contamination
4. Spike SNVs, CNAs, GRs
Diagram labels: Contaminating Normal, Sub-clone A, Sub-clone B
31
How Are We Going to Simulate?
Final BAM
CNA Calls: Battenberg, TITAN
SNV Calls: MuTect, Strelka
Available via Google Cloud / Docker API
32
Draft of Tumour Design (Not Final!)
33
What Are We Scoring?
1. Sub-population characteristics
a) What is the level of normal “contamination”?
b) How many sub-populations are present?
c) What are their proportions?
2. What is the phylogenetic order of sub-populations?
3. For each mutation, which sub-populations is it in?
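These questions are linked through the read counts. As a minimal sketch, assuming a heterozygous SNV in a copy-neutral diploid region, the expected variant allele fraction (VAF) follows from the normal contamination and from the fraction of cancer cells carrying the mutation. This is textbook arithmetic, not the Challenge's scoring method, and the function and parameter names are hypothetical.

```python
def expected_vaf(normal_contamination, cancer_cell_fraction):
    """Expected VAF of a heterozygous SNV in a diploid, copy-neutral region.

    normal_contamination: fraction of cells in the sample that are normal
    cancer_cell_fraction: fraction of tumour cells that carry the mutation
    Assumes one mutated copy out of two in every carrying cell.
    """
    for name, value in [("normal_contamination", normal_contamination),
                        ("cancer_cell_fraction", cancer_cell_fraction)]:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    purity = 1.0 - normal_contamination
    return purity * cancer_cell_fraction / 2.0


# A clonal mutation versus one present in half of the tumour cells,
# both in a sample with 20% normal contamination.
clonal = expected_vaf(0.2, 1.0)
subclonal = expected_vaf(0.2, 0.5)
print(f"clonal VAF ~ {clonal:.2f}, subclonal VAF ~ {subclonal:.2f}")
```

Copy-number changes break this simple relationship, which is one reason CNA calls sit alongside SNV calls as inputs.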
34
Pathway
• The Consequences of Analytical Diversity
• SMC-DNA: Challenge-Based Benchmarking
• SMC-Het & SMC-RNA
35
CPC-GENE: The People Involved
Dr. Robert Bristow
Dr. Theodore van der Kwast
PIs & PMs
Genomics
Michael Fraser
Melania Pintilie
Neil Fleshner
Lakshmi Muthuswamy
Colin Collins
Thomas Hudson
Lincoln Stein
Taryne Chong
Andrew Brown
Michelle Sam
Jeremy Johns
Lee Timms
Nicholas Buchner
Ada Wong
Informatics
Clinico-Molecular
Timothy Beck
Fouad Yousif
Robert Denroche
Xuemei Luo
Dominique Trudel
Alice Meng
Gaetano Zafarana
Dr. John McPherson
Boutros Lab
Richard de Borja
Nicholas Harding
Pablo Hennings-Yeomans
Emilie Lalonde
Amin Zia
Jianxin Wang
Francis Nguyen
Natalie Fox
Michelle Chan-Seng-Yue
Lauren Chong
Takafumi Yamaguchi
Veronica Sabelnykova
36
SMC-DNA Organizing Team
Sage/DREAM Organizers
• Gustavo Stolovitzky
• Stephen Friend
• Adam Margolin
• Thea Norman
• Christine Suver
• Christopher Bare
• Kristen Dang
• Bruce Hoff
• Mike Kellen
External Organizers
• Paul Boutros (OICR)
• Josh Stuart (UCSC)
• Lincoln Stein (OICR)
• Kyle Ellrott (UCSC)
• Adam Ewing (UCSC)
• Anna Lee (OICR)
• Katie Houlahan (OICR)
• Cristian Caloian (OICR)
• Takafumi Yamaguchi (OICR)
Data Contributors:
Funding/Sponsoring/Publication
Partners Include:
37
SMC-Het Organizing Team
Organizers
• Paul Boutros (OICR)
• Amit Deshwar (University of Toronto)
• Josh Stuart (UCSC)
• Minjeong Ko (OICR)
• Gustavo Stolovitzky (IBM)
• Kyle Ellrott (UCSC)
• Stephen Friend (Sage)
• Christopher Bare (Sage)
• David Wedge (Sanger)
• Kristen Dang (Sage)
• Peter Van Loo (UCL)
• Yin Hu (Sage)
• Quaid Morris (University of Toronto)
• Shannon Carter (Sage)
• Thea Norman (Sage)
Data Contributors
Funding/Sponsoring/Publication
Partners Include:
38