Robust Software Tools for Variant Identification and Functional Assessment
(Boston College & University of Michigan)
Gabor Marth, Goncalo Abecasis, PIs

Informatics challenges for genomic analysis
Intentions of the RFA:
• Tool building
• Widening accessibility
• Facilitating analysis

Our approach
• Complete toolbox including variant interpretation
• Full pipelines for start-to-finish analysis
• Easily accessible and well documented methods
• Cloud deployment (in addition to single machine/local compute cluster)
• Open development model

Progress in first 6 months
• Starting with two sets of tools and pipelines, geared toward high quality local analysis, battle-tested in the 1000GP data and medical sequencing projects
• The two groups follow a “divide and conquer” strategy to put critical pieces in place for making our algorithms available to the wider genomics community
• Boston College
– A universal tool/pipeline launcher application
– Infrastructure for dissemination
– Cloud access via Galaxy
• University of Michigan
– Integration of variant annotation/impact assessment
– Pipeline/workflow control infrastructure
– Adaptation for Amazon Cloud Services

FUNCTIONALITY & TOOLS

Scope
• Read mapping
– Read mapping with MOSAIK
– Quality recalibration
– Duplicate removal
– INDEL realignment
• Data QC
– Descriptive statistics
– Sequence quality checking
– Sample mix-ups
• Variant calling
– SNPs, short INDELs
– Structural variants
– LD aware genotype calling
• Variant interpretation
– Annotations: SIFT, PolyPhen
– Annotations: genome features
– Statistical association analysis
Data formats along the workflow: FASTQ -> BAM -> VCF -> VCF/GVF
[Figure: JPT:NA18948; # incorrect genotype calls vs. # genotype calls (Call_gt=Any); series Pilot1, Pilot3, BI, BC]

Include latest versions
• Tools constantly evolving (as they must to remain relevant)
• Our community toolbox to be updated with new tools as they become available

New algorithms for complex variant detection (FreeBayes)
[Figure: alignments of reference (ref) and alternate (alt) haplotypes across a GA-repeat region]

Include tools when ready for prime time
MEI type  Sample   RetroSeq             Tangram              Tea
                   Total  Sensitivity   Total  Sensitivity   Total  Sensitivity
ALU       NA12891  719    89%           1192   98%           1127   92%
ALU       NA12892  687    86%           1185   98%           1078   92%
ALU       NA12878  793    82%           1326   99%           1038   89%
L1        NA12891  52     78%           190    81%           286    81%
The BC mobile element insertion caller performs best in its class

EPACTS variant interpretation tools
(Efficient and Parallelizable Association Container Toolbox)
• Genetic analysis tool based on VCF
– Fast and parallelizable access to large VCF files
– Built-in widely used single variant and burden tests
– R/C++ interface for extending to newer tests
– Binary & quantitative phenotypes with covariates
– Useful visualization tools for association results

Automated visualization

PIPELINES & WORKFLOW

The UM pipeline
• BAM files -> samtools -> genotype likelihoods
• Genotype likelihoods -> glfMultiples -> unfiltered VCF
• Unfiltered VCF -> vcfCooker -> hard-filtered VCF
• Hard-filtered VCF -> SVM -> filtered VCF
• Optional LD-aware step: filtered VCF -> Beagle/Thunder -> filtered/phased VCF
• Filtered VCF or filtered/phased VCF -> EPACTS

UMAKE workflow system
• Makefile based approach
– The Make utility is very good for representing dependencies
– Picks up where it left off on failure
• Flexible deployment
– Local machine
– Local cluster (Mosix)
– Amazon Web Services Elastic Compute Cloud (EC2)
• Default options
– User configurable

Application of UMAKE to large-scale projects
Project   Depth / Region  N       #SNPs  %dbSNP (129)  Known Ts/Tv  Novel Ts/Tv
1000G     4x Genome       1,092   34.5M  24.4          2.14         2.16
1000G     >40x Exome      822     598K   22.1          2.96         2.80
GoT2D     4x Genome       ~2,800  26.7M  25.5          2.16         2.19
ESP       >80x Exome      ~6,900  1.92M  8.6           2.94         2.83
Sardinia  3x Genome       2,120   17.6M  38.4          2.15         2.22
Bipolar   10x Genome
Computational cost is ~1 week / 1000 samples in a 5 node mini-cluster

ACCESSIBILITY

The Boston College tool hub
http://gkno.me (genome)

Simplified installation & use
• Unified launcher application (gkno)
– single tools (e.g. Mosaik)
– tool “macros” (e.g. map)
– pipelines (e.g. exome variant calling)
• Download and installation
– All tools pulled in a single step from GitHub
– All tools installed
– All tools tested

Easily configurable pipeline system
• Part of our new unified launcher system (gkno)
• Pipeline types (e.g. mapping, variant calling) and instances (exome, whole-genome)
• User-configurable: tools can be swapped in and out, parameters configured via config files

Support
• Documentation
• Tutorials / Blog
• User forum
• Bug reports

DEPLOYMENT / CLOUD

Software deployment
• All software is ready for running locally on a single machine
• UMAKE adds cluster support
• Cloud deployment
– Simple Michigan pipelines ported to Amazon
– Porting of all project software under way

Cloud-based analysis – Galaxy

OPEN & COLLABORATIVE DEVELOPMENT MODEL

Integration
• Our workflows leverage 3rd party tools for specific functionality
• All our tools are open source, available on GitHub (many clones, community contributed code)
• Ensemble approach (multiple tools for critical tasks)

Ensemble approach
Called in     # SNPs   %dbSNP  Ts/Tv (Novel)  Ts/Tv (Known)  Ts/Tv (Total)
Union         907,170  22.09   2.22           2.30           2.24
2 of 5        766,608  25.33   2.38           2.33           2.37
3 of 5        696,358  27.05   2.44           2.36           2.42
4 of 5        601,132  29.62   2.49           2.40           2.46
Intersection  520,083  32.20   2.53           2.42           2.49
• Multiple tools usually benefit analysis

Ensemble approach
• Our pipelines will use multiple aligners (BWA, Mosaik) and variant callers (FreeBayes, glfMultiples), developed by BC/UM

In progress
• Expanding pipelines to integrate all tools
• Michigan tools -> gkno
• BC tools -> Michigan cloud ready pipelines
• Large data set analysis on the cloud
• Integrate variant interpretation tools
• Integrate SV tools as they become more robust
• Integrate consensus analysis (SVM and MLP approaches to callset aggregation)
• Minimal, functional pipeline -> Galaxy

Team
Boston College
• Alistair Ward
• Derek Barnett
• Chase Miller
• Wan-Ping Lee
• Erik Garrison
University of Michigan
• Mary-Kate Trost
• Tom Blackwell
• Hyun-Min Kang
• Youna Hu
• Adrian Tan
• Xiaowei Zhan
• Dajiang Liu
PIs
• Gabor Marth
• Goncalo Abecasis
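
The "Union / 2 of 5 / ... / Intersection" tiers in the ensemble approach table can be illustrated with a small sketch. This is a hypothetical illustration only, not the project's consensus method (the slides mention SVM and MLP approaches for that). It assumes each caller's output has already been reduced to a set of (chrom, pos, ref, alt) tuples; the caller names and sites below are invented.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# A variant site as (chrom, pos, ref, alt). This representation is an
# assumption for illustration; real callsets would be parsed from VCF files.
Site = Tuple[str, int, str, str]


def consensus_tiers(callsets: Dict[str, Set[Site]]) -> Dict[int, Set[Site]]:
    """Map k -> sites called by at least k of the n callers.

    k = 1 is the union of all callsets and k = n is their intersection.
    """
    # Each callset is a set, so a site counts at most once per caller.
    support = Counter(site for sites in callsets.values() for site in sites)
    n = len(callsets)
    return {
        k: {site for site, count in support.items() if count >= k}
        for k in range(1, n + 1)
    }


# Hypothetical callsets from three made-up callers.
EXAMPLE_CALLSETS: Dict[str, Set[Site]] = {
    "caller_a": {("1", 1000, "A", "G"), ("1", 2000, "C", "T")},
    "caller_b": {("1", 1000, "A", "G")},
    "caller_c": {("1", 1000, "A", "G"), ("2", 500, "G", "T")},
}

if __name__ == "__main__":
    tiers = consensus_tiers(EXAMPLE_CALLSETS)
    for k, sites in tiers.items():
        print(f"called in >= {k} of {len(EXAMPLE_CALLSETS)}: {len(sites)} sites")
```

Stricter tiers keep fewer sites, which matches the trend in the table: the SNP count falls and the Ts/Tv ratio rises from Union to Intersection.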
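
The UMAKE slide notes that a Makefile based workflow represents dependencies well and picks up where it left off on failure. The sketch below mimics that behavior in Python under stated assumptions: a step is rerun only if its target file is missing or older than one of its prerequisites, which is the timestamp rule Make applies. It is not UMAKE code; the `Step` layout, function names, and file names are hypothetical.

```python
import os
import tempfile
from typing import Callable, Iterable, List, Tuple

# (target file, prerequisite files, action that creates the target).
# This tuple layout is an assumption for illustration.
Step = Tuple[str, List[str], Callable[[], None]]


def is_stale(target: str, prerequisites: Iterable[str]) -> bool:
    """True if the target is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in order, skipping up-to-date targets.

    Returns the list of targets that were (re)built in this run.
    """
    rebuilt = []
    for target, prerequisites, action in steps:
        if is_stale(target, prerequisites):
            action()
            rebuilt.append(target)
    return rebuilt


def write_file(path: str) -> Callable[[], None]:
    """Return an action that creates a placeholder output file."""
    def action() -> None:
        with open(path, "w") as handle:
            handle.write("placeholder\n")
    return action


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        bam = os.path.join(workdir, "sample.bam")      # hypothetical input
        glf = os.path.join(workdir, "sample.glf")      # hypothetical output
        vcf = os.path.join(workdir, "unfiltered.vcf")  # hypothetical output
        write_file(bam)()
        steps = [(glf, [bam], write_file(glf)), (vcf, [glf], write_file(vcf))]
        print("first run rebuilt:", len(run_pipeline(steps)), "targets")
        os.remove(vcf)  # simulate a failure that lost the last output
        print("rerun rebuilt:", len(run_pipeline(steps)), "targets")
```

After a failure, rerunning the same step list repeats only the work whose outputs are missing or out of date, which is the property the slide attributes to the Make utility.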