Marth-BCUM-v03

Robust Software Tools for Variant
Identification and Functional Assessment
(Boston College & University of Michigan)
Gabor Marth, Goncalo Abecasis, PIs
Informatics challenges for genomic analysis
• Tool building
• Widening accessibility
• Facilitating analysis
Intentions of the RFA
Our approach
• Complete toolbox including variant
interpretation
• Full pipelines for start-to-finish analysis
• Easily accessible and well documented methods
• Cloud deployment (in addition to single
machine/local compute cluster)
• Open development model
Progress in first 6 months
• Starting with two sets of tools and pipelines, geared
toward high quality local analysis, battle-tested in the
1000GP data and medical sequencing projects
• The two groups follow a “divide and conquer” strategy
to put critical pieces in place for making our algorithms
available for the wider genomics community
• Boston College
– A universal tool/pipeline launcher application
– Infrastructure for dissemination
– Cloud access via Galaxy
• University of Michigan
– Integration of variant annotation/impact assessment
– Pipeline/workflow control infrastructure
– Adaptation for Amazon Cloud Services
FUNCTIONALITY & TOOLS
Scope
Read mapping (FASTQ -> BAM)
• Read mapping with MOSAIK
• Quality recalibration
• Duplicate removal
• INDEL realignment
Data QC (BAM)
• Descriptive statistics
• Sequence quality checking
• Sample mix-ups
[Figure: QC plot for sample JPT:NA18948; axis labels "# Incorrect Genotype Calls" and "# Genotype Calls (Call_gt=Any)"; series labels include Pilot1, Pilot3, BI and BC]
Variant calling (BAM -> VCF)
• SNPs, short INDELs
• Structural variants
• LD aware genotype calling
Variant interpretation (VCF -> VCF/GVF)
• Annotations: SIFT, PolyPhen
• Annotations: genome features
• Statistical association analysis
Include latest versions
• Tools constantly evolving (as they must to remain relevant)
• Our community toolbox to be updated with new tools as
they become available
New algorithms for complex
variant detection (FreeBayes)
[Figure: alignment of reference and alternate haplotypes at a complex variant in a GA-repeat region]
ref: TATAGAGAGAGAGAGAGAGCGAGAGAGAGAGAGAGAGGGAGAGACGGAGTT
alt: TATAGAGAGAGAGAGAG--
Include tools when ready for prime time
MEI type  Sample    RetroSeq             Tangram              Tea
                    Total  Sensitivity   Total  Sensitivity   Total  Sensitivity
ALU       NA12891   719    89%           1192   98%           1127   92%
ALU       NA12892   687    86%           1185   98%           1078   92%
ALU       NA12878   793    82%           1326   99%           1038   89%
L1        NA12891   52     78%           190    81%           286    81%
The BC mobile element insertion
caller performs best in its class
EPACTS variant interpretation tools
(Efficient and Parallelizable Association Container Toolbox)
• Genetic analysis tool based on VCF
o Fast and parallelizable access to large VCF files
o Built-in widely used single variant and burden tests
o R/C++ interface for extending to newer tests
o Binary & quantitative phenotypes with covariates
o Useful visualization tools of association results
• Automated visualization
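To illustrate what a "burden test" collapses, here is a minimal sketch of the scoring step only. It is not EPACTS code: the function names, the frequency threshold and the toy genotypes are all assumptions made for illustration. Each sample's score is the number of alternate alleles it carries across the rare variants in one unit (for example a gene); a real test would then relate these scores to the phenotype, with covariates.

```python
def alt_allele_frequency(genotypes):
    """Alternate allele frequency from diploid alt-allele counts (0, 1 or 2)."""
    return sum(genotypes) / (2 * len(genotypes))


def burden_scores(genotypes_by_variant, max_freq):
    """Collapse rare variants into one score per sample.

    genotypes_by_variant maps a variant id to a list of alt-allele counts,
    one entry per sample, in the same sample order for every variant.
    Variants whose alt allele frequency exceeds max_freq are ignored.
    """
    if not genotypes_by_variant:
        return []
    n_samples = len(next(iter(genotypes_by_variant.values())))
    scores = [0] * n_samples
    for genotypes in genotypes_by_variant.values():
        if alt_allele_frequency(genotypes) <= max_freq:
            for i, count in enumerate(genotypes):
                scores[i] += count
    return scores


# Toy data (made up): three variants, four samples.
toy_genotypes = {
    "v1": [0, 1, 0, 0],  # frequency 1/8, rare
    "v2": [1, 1, 2, 2],  # frequency 6/8, common, skipped
    "v3": [0, 0, 0, 1],  # frequency 1/8, rare
}
toy_scores = burden_scores(toy_genotypes, max_freq=0.2)
```

With the toy data, only v1 and v3 pass the threshold, so the second and fourth samples each get a score of 1 and the others 0.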
PIPELINES & WORKFLOW
The UM pipeline
[Flowchart]
• BAM (per sample) -> samtools -> Genotype Likelihood
• Genotype Likelihoods -> glfMultiples -> Unfiltered VCF
• Unfiltered VCF -> vcfCooker -> Hard-filtered VCF
• Hard-filtered VCF -> SVM -> Filtered VCF
• Optional LD-aware step: Filtered VCF -> Beagle/Thunder -> Filtered/Phased VCF
• Filtered VCF or Filtered/Phased VCF -> EPACTS
UMAKE workflow system
• Makefile based approach
– The Make utility is very good for representing dependencies
– Pick up where left off on failure
• Flexible deployment
– Local Machine
– Local Cluster (Mosix)
– Amazon Web Services Elastic Compute Cloud (EC2)
• Default options
– User configurable
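The "pick up where left off" behaviour comes from Make's rule that a target is rebuilt only when it is missing or older than one of its dependencies. The sketch below imitates that rule in a few lines of Python; it is not UMAKE itself, and `is_up_to_date`, `run_rules` and the rule tuples are hypothetical names chosen for illustration.

```python
import os


def is_up_to_date(target, deps):
    """A target is current if it exists and is no older than each dependency."""
    if not os.path.exists(target):
        return False
    target_mtime = os.path.getmtime(target)
    return all(
        os.path.exists(dep) and os.path.getmtime(dep) <= target_mtime
        for dep in deps
    )


def run_rules(rules):
    """Run (target, deps, action) rules in the given order.

    Rules must be listed so that dependencies come before the targets that
    need them. Targets that are already current are skipped, so a rerun
    after a failure only repeats the unfinished steps.
    Returns the list of targets that were rebuilt.
    """
    rebuilt = []
    for target, deps, action in rules:
        if is_up_to_date(target, deps):
            continue
        action(target, deps)
        rebuilt.append(target)
    return rebuilt
```

Like Make without extra safeguards, this sketch treats a partially written target as finished, so an action should write to a temporary file and rename it once complete.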
Application of UMAKE to large-scale projects
Project    Depth / Region   N        #SNPs   %dbSNP (129)   Known Ts/Tv   Novel Ts/Tv
1000G      4x Genome        1,092    34.5M   24.4           2.14          2.16
1000G      >40x Exome       822      598K    22.1           2.96          2.80
GoT2D      4x Genome        ~2,800   26.7M   25.5           2.16          2.19
ESP        >80x Exome       ~6,900   1.92M   8.6            2.94          2.83
Sardinia   3x Genome        2120     17.6M   38.4           2.15          2.22
Bipolar    10x Genome
Computational cost is ~1 week / 1000 samples in a 5 node mini-cluster
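The Ts/Tv columns report the ratio of transitions (A<->G, C<->T) to transversions (all other single-base changes), a common sanity check on SNP call sets. A minimal sketch of the calculation follows; the function name and the toy SNP list are assumptions for illustration, not part of UMAKE.

```python
BASES = set("ACGT")
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snps):
    """Transition/transversion ratio for an iterable of (ref, alt) base pairs."""
    transitions = 0
    transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            raise ValueError(f"not a single-base substitution: {ref}>{alt}")
        if (ref, alt) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions, ratio is undefined")
    return transitions / transversions


# Toy call set (made up): two transitions and one transversion.
toy_ratio = ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")])
```

Each base has one possible transition and two possible transversions, so purely random substitutions would give a ratio near 0.5; values far below those in the table would suggest a high false-positive rate.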
ACCESSIBILITY
The Boston College tool hub
http://gkno.me
(genome)
Simplified installation & use
• Unified launcher application (gkno)
– single tools (e.g. Mosaik)
– tool “macros” (e.g. map)
– pipelines (e.g. exome variant calling)
• Download and installation
– All tools pulled in a single step from github
– All tools installed
– All tools tested
Easily configurable pipeline system
• Part of our new unified launcher system (gkno)
• Pipeline types (e.g. mapping, variant calling) and
instances (exome, whole-genome)
• User-configurable: tools can be swapped in and
out, parameters configured via config files
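One way to picture the split between pipeline types and instances is a base task list that an instance copies and then adjusts. The sketch below is an assumption-laden illustration and does not reproduce gkno's actual configuration format: the dictionary layout, `make_instance`, and the `target_regions` parameter are hypothetical.

```python
import copy

# Hypothetical pipeline "type": an ordered list of tasks, each bound to a tool.
VARIANT_CALLING = {
    "tasks": [
        {"task": "align", "tool": "mosaik", "params": {}},
        {"task": "call", "tool": "freebayes", "params": {}},
    ]
}


def make_instance(pipeline_type, tool_swaps=None, param_overrides=None):
    """Derive a pipeline instance without modifying the pipeline type.

    tool_swaps maps a task name to a replacement tool name.
    param_overrides maps a task name to parameters merged into that task.
    """
    tool_swaps = tool_swaps or {}
    param_overrides = param_overrides or {}
    instance = copy.deepcopy(pipeline_type)
    for step in instance["tasks"]:
        name = step["task"]
        if name in tool_swaps:
            step["tool"] = tool_swaps[name]
        step["params"].update(param_overrides.get(name, {}))
    return instance


# Hypothetical exome instance: swap the aligner and restrict calling to targets.
exome = make_instance(
    VARIANT_CALLING,
    tool_swaps={"align": "bwa"},
    param_overrides={"call": {"target_regions": "exome_targets.bed"}},
)
```

Copying the type before editing keeps the shared definition unchanged, so several instances (exome, whole-genome) can be derived from the same type.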
Support
• Documentation
• Tutorials / Blog
• User forum
• Bug reports
DEPLOYMENT / CLOUD
Software deployment
• All software is ready for running locally on a
single machine
• UMAKE adds cluster support
• Cloud deployment
– Simple Michigan pipelines ported to Amazon
– Porting of all project software is under way
Cloud-based analysis – Galaxy
OPEN & COLLABORATIVE
DEVELOPMENT MODEL
Integration
• Our workflows leverage
third-party tools for
specific functionality
• All our tools are open source,
available on GitHub (many clones,
community-contributed
code)
• Ensemble approach
(multiple tools for
critical tasks)
Ensemble approach
Called in      # SNPs    %dbSNP   Ts/Tv Novel   Ts/Tv Known   Ts/Tv Total
Union          907,170   22.09    2.22          2.30          2.24
2 of 5         766,608   25.33    2.38          2.33          2.37
3 of 5         696,358   27.05    2.44          2.36          2.42
4 of 5         601,132   29.62    2.49          2.40          2.46
Intersection   520,083   32.20    2.53          2.42          2.49
• Multiple tools usually benefit analysis
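The "Called in" rows of the table correspond to keeping variants reported by at least k of the n call sets: k = 1 gives the union and k = n the intersection. A minimal sketch of that counting step is below; the function name, the variant tuples and the toy call sets are assumptions for illustration, and the SVM/MLP aggregation mentioned elsewhere in the deck is not shown.

```python
from collections import Counter


def consensus_calls(callsets, min_callers):
    """Return the variants present in at least min_callers of the call sets.

    Each call set is an iterable of hashable variant keys, for example
    (chrom, pos, ref, alt) tuples. Duplicates within one call set count once.
    """
    counts = Counter()
    for calls in callsets:
        counts.update(set(calls))
    return {variant for variant, n in counts.items() if n >= min_callers}


# Toy call sets (made up) from three hypothetical callers.
caller_a = [("1", 100, "A", "G"), ("1", 200, "C", "T")]
caller_b = [("1", 100, "A", "G"), ("1", 300, "G", "T")]
caller_c = [("1", 100, "A", "G"), ("1", 200, "C", "T")]
toy_callsets = [caller_a, caller_b, caller_c]

toy_union = consensus_calls(toy_callsets, 1)
toy_majority = consensus_calls(toy_callsets, 2)
toy_intersection = consensus_calls(toy_callsets, len(toy_callsets))
```

Raising k trades call-set size for agreement between callers, which matches the pattern in the table: fewer SNPs and a higher Ts/Tv as k increases.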
Ensemble approach
• Our pipelines will use multiple aligners (BWA, Mosaik)
and variant callers (Freebayes, glfMultiples), developed
by BC/UM
In progress
• Expanding pipelines to integrate all tools
• Michigan tools -> gkno
• BC tools -> Michigan cloud ready pipelines
• Large data set analysis on the cloud
• Integrate variant interpretation tools
• Integrate SV tools as they become more robust
• Integrate consensus analysis (SVM and MLP
approaches to callset aggregation)
• Minimal, functional pipeline -> Galaxy
Team
Boston College
• Alistair Ward
• Derek Barnett
• Chase Miller
• Wan-Ping Lee
• Erik Garrison
• Gabor Marth
University of Michigan
• Mary-Kate Trost
• Tom Blackwell
• Hyun-Min Kang
• Youna Hu
• Adrian Tan
• Xiaowei Zhan
• Dajiang Liu
• Goncalo Abecasis