Choosing a Cloud Provider for your Genomic Applications

Phillip Pham
Director, Technology Development
phillip@cyphergenomics.com
November 13th, 2014

The information disclosed in this document, including all designs and related materials, is the valuable property of Cypher Genomics, Inc. and its licensors. Cypher Genomics, Inc. and its licensors, as appropriate, reserve all patent, copyright and other proprietary rights to this document, including all design, manufacturing, reproduction, use, and sales rights, except to the extent said rights are expressly granted to others.

© 2014 Cypher Genomics, Inc. Proprietary and Confidential

Biography
• Developed Cypher Genomics technology at the Scripps Translational Science Institute
• Designed and deployed high performance and parallel computing systems for annotation and genome interpretation
• Expertise in genome analysis workflows
• Interested in the genetic architecture influencing drug efficacy and response in clinical trials
• Bioinformatics: Bioengineering, B.S.
• Patents:
  – Pham P, Deshpande S, 2013. Systems and Methods for Genomic Variant Annotation, U.S. Patent 13/841,575, filed March 2013. Patent Pending.

Agenda
• Cypher Genomics Problem Space
• Simplified Logical Architecture
• Pain Points
• Evaluating Alternatives
• Lessons Learned
• Summary

Our Problem Space
• Cypher Genomics provides automated Annotation, Interpretation and Biomarker Discovery for Whole Genomes, Exomes, CNV, etc.
  – Sequencing = AGCTTGAGGATCAACTAGTGCATGCTATACCTGC…
  – Alignment & Variant Calling
    • Align the appropriate sets of nucleotides to their place in the genome.
      – 150 GB BAM files (compressed)
    • Compare the aligned input genome to the reference genome
    • Identify variants (i.e. mutations) in the input genome.
  – Annotation
    • Each variant annotated with 90+ attributes from 50+ reference data sources as well as Cypher’s proprietary data and prediction algorithms.
    • Web-Based UI for accessing all data, applying analytic filters, etc.
  – Interpretation
    • Human Readable PDF Report with Cypher Synthesis summary of most important findings.
      – Includes all variants with Cypher Synthesis and references to supporting evidence

Mantis™ Workflow – Clinical Use Case
1. Patient Sample Sent to Lab
2. Sample is Sequenced, Aligned & Variant Called
3. Next Gen Sequencing Data Uploaded to Cypher
4. Mantis Interprets Genome Data
5. Report Generated
6. .PDF Clinical Summary Delivered
• 200M+ Variants Annotated and growing
• 90+ Annotations per Variant
• 50+ Reference DB’s
• Data sizes shown along the workflow: 150 GB BAM, 125 MB VCF, 40+ TB (and growing), 3 MB

Automated solutions are essential to provide precision medicine at population scales
YESTERDAY: Nicholas Volker, 2010
• Medical College of Wisconsin
• Life-threatening bowel disease
• Whole exome sequencing (1% of genome)
• Cost: $15,000
• Single mutation in XIAP
TODAY: Lilly Grossman, 2013
• Scripps IDIOM / Cypher Genomics
• Debilitating neuromuscular disease
• Whole genome sequencing
• Automated interpretation
• Mutations in ADCY5
TOMORROW: Every Patient, 2018
• Driven by decrease in sequencing cost (e.g. Illumina X10 - $1,000 genome)
• Whole genome – baseline
• Updated interpretation over time
• Genomic clinical decision support

Coral™ Workflow – Pharma Use Case
1. Patient Data Collected
2. Study Samples Sequenced
3. Next Gen Sequencing Data to Cypher
4. Cypher Runs Genomes at Scale
5. Coral Produces Predictive Models
6. Biomarker Identified
• Data and Compute usage grow in population studies

Logical Deployment Components
• Web Front-End – User Interface
  – Apache Web Server
  – Zend Framework Server
• Cypher Core Services (CCS)
  – REST API Implementation to Service
    • User Interface
    • Automated Integration
  – CDH – Hadoop Ecosystem
    • HDFS – Holds raw reference data
    • HBase – Variant Annotation DB
    • MapReduce – Process variant data and annotation information at scale
  – MongoDB
    • Consolidated Reference Data
    • Demographics Data
  – Vertica
    • Analytics DB for real-time, interactive analytic investigation
• Cypher Analytic Pipeline (CAP) - Novel Variant Annotation Pipeline
  – Penguin on Demand – HPC with Torque scheduler
  – Custom parallelization algorithm for annotating, running predictive algorithms
  – Final annotation of Variant goes into CCS Annotation DB

Simplified Logical Architecture
[Architecture diagram. Recoverable labels:]
• Cloud Provider
  – Front-end Applications: Web Servers, Web App Servers
  – REST APIs
  – Cypher Core Services: Master Application Server and CCS nodes
    • Annotation of known Variants
    • Interactive Analytics Back-End
    • Interpretation & Reporting
  – Demographic Data (Mongo), Annotation DB (HBase), Analytics DB (Vertica), Consolidated Reference DB (Mongo)
  – Source Reference Data: HDFS, Hadoop & Map/Reduce Cluster
• HPC Provider
  – HPC Environment: Master Application Server, TORQUE HPC Job Scheduler, and CAP nodes
  – Cypher Annotation Pipeline (Novel Variants)
    • Annotation of Novel Variants

Pain Points
• Downtime
  – Unscheduled system reboots and upgrades.
    • Temporary loss of data and configurations
  – Poor communication
• Difficult to manage temporary cluster instantiation
• Firewall architecture imposes inflexible rules on our naming conventions
• Limits to cluster and storage quota (either programmatically or through the Management Console).
  – Requires us to call the vendor to raise our limits.
• Ease of getting to HIPAA Compliance

Evaluating Alternatives
• Comparable (or better) Performance per $
• Better Uptime / Less Dramatic Impact during Maintenance
• Start and stop temporary clusters with ease
• Flexible firewall rules
• Unlimited storage scaling on demand
• Strategic options for utilizing service offerings
• HIPAA compliance and business associate agreement

What did we test?
• Most Time Consuming Stages of our Processing
  – Reformat Input Files
  – Query Annotations
  – Process Annotations
  – Split Results
  – Organize Results
• And
  – Total End-to-End processing time

Test Configurations – Hadoop (HBase/MapReduce) Cluster
Current Cloud Vendor
• Test Configuration – “Original Apple”
  – 6 Nodes
    • 8 CPU
    • 32 GB RAM
    • 40 TB Block Storage (Magnetic)
    • No Encryption
New Cloud Vendor
• Configuration 1 – “Pear”
  – 6 Nodes
    • 8 CPU
    • 32 GB RAM
    • 40 TB Block Storage (SSD)
    • All Disks Encrypted
• Configuration 2 – “Melon”
  – 6 Nodes
    • 32 CPUs
    • 60 GB RAM
    • 40 TB Local Disk (Magnetic)
    • All Disks Encrypted
NOTE: The above configuration is smaller than our entire product and is just for performance comparison purposes. You should draw NO conclusions regarding the performance of our product from these comparisons.

Apples to Pears to Melons… Oh my! (Reformat)
[Chart]

Apples to Pears to Melons… Oh my! (Query)
[Chart]

Apples to Pears to Melons… Oh my! (Process Annots)
[Chart]

Apples to Pears to Melons… Oh my! (Split Results)
[Chart]
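
The per-stage comparisons in these charts depend on timing each processing stage separately. The following is a minimal sketch of how such timings could be captured, not the instrumentation Cypher actually used. The class `StageTimer`, the function `run_pipeline`, and the `work` mapping of stage name to callable are all hypothetical; only the stage names come from the "What did we test?" slide.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock durations for named pipeline stages (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock      # injectable so a fake clock can be used in tests
        self.durations = {}      # stage name -> accumulated seconds

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the time even if the stage raises, so failed runs stay visible.
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def total(self):
        # Sum of the recorded stages only; a true end-to-end figure would
        # also wrap the whole run in its own stage.
        return sum(self.durations.values())


# Stage names taken from the "What did we test?" slide.
STAGES = [
    "Reformat Input Files",
    "Query Annotations",
    "Process Annotations",
    "Split Results",
    "Organize Results",
]


def run_pipeline(timer, work):
    """Run one callable per stage, in order, recording each under the timer."""
    for name in STAGES:
        with timer.stage(name):
            work[name]()
    return timer.durations


if __name__ == "__main__":
    timer = StageTimer()
    # Placeholder work: each stage just sleeps briefly.
    work = {name: (lambda: time.sleep(0.01)) for name in STAGES}
    for name, seconds in run_pipeline(timer, work).items():
        print(f"{name}: {seconds:.3f} s")
    print(f"Sum of stages: {timer.total():.3f} s")
```

Running the same harness unchanged on each candidate cluster keeps the stage boundaries identical, which is what makes the per-stage numbers comparable across vendors.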

Apples to Pears to Melons… Oh my! (Organize)
[Chart: y-axis "Time / min" (0 to 1600); x-axis values 5, 10, 50, 80; series: Original Apple, Pear - 6x8x32x40TB (Enc. SSD), Melon - 6x32x60x40TB (Enc. Local Mag)]

Apples to Pears to Melons… Oh my! (Overall)
[Chart]

Lessons Learned
• Environment set-up and tweaking
  – 80% of time spent here
  – Be prepared to iterate on this as you test larger and larger batches.
• Work with your vendor to provide experts in Hadoop, Performance and Capacity Planning to aid in your evaluation.
• Work with your vendor to cost out expenditures
  – Don’t forget security, load balancing, auto-scaling services, etc.
  – Look closely at the advantages of paying a portion up-front to get good discounts on monthly costs.

Summary – Key Requirements of a Genomics Cloud Provider
• Find a Cloud Provider that will be a Strategic Partner!
  – Resources to aid in your evaluation.
  – Resources to aid in optimizing costs.
• Instrument your applications to capture detailed performance stats per logical computational step.
  – Gives delta between current and new Cloud Provider infrastructure.
• During your testing, establish how responsive the Cloud Partner is to issues unrelated to your pilot (e.g. machine upgrades, failures in nodes, networking and/or storage).
• Ensure that you can stop clusters that you are not testing so that you aren’t billed more than once.
• Make sure they will sign a HIPAA BAA and that you know exactly what is covered.
• Pick a Cloud Partner that offers services for automating deployment, scaling, load balancing, security and log management (even if you don’t plan to use them now).
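
Once per-step timings exist for both environments, the delta mentioned in the Summary is simple arithmetic. The sketch below is one way to compute it, assuming timings are held as dictionaries of step name to minutes. The function name `stage_deltas` is hypothetical, and the numbers in the example are made-up placeholders, not measurements from this talk.

```python
def stage_deltas(current, candidate):
    """Compare per-step timings (minutes) of a candidate provider to the current one.

    Returns {step: (delta_minutes, speedup)} for steps present in both inputs.
    A negative delta means the candidate is faster; speedup > 1 means the same.
    """
    report = {}
    for step, current_min in current.items():
        if step not in candidate:
            continue  # skip steps that were not measured on the candidate
        candidate_min = candidate[step]
        delta = candidate_min - current_min
        speedup = current_min / candidate_min if candidate_min > 0 else float("inf")
        report[step] = (delta, speedup)
    return report


if __name__ == "__main__":
    # Made-up placeholder numbers, NOT measurements from the talk.
    current = {"Query Annotations": 100.0, "Split Results": 40.0}
    candidate = {"Query Annotations": 50.0, "Split Results": 50.0}
    for step, (delta, speedup) in stage_deltas(current, candidate).items():
        print(f"{step}: {delta:+.1f} min, {speedup:.2f}x")
```

Reporting both the absolute delta and the ratio helps because a step can look dramatically faster as a ratio while contributing little to total end-to-end time.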

Acknowledgements
• Javier Velazquez-Muriel – Bioinformatics Engineer
• Patrick Ravenel – CTO & VP of Engineering
• Ashley Van Zeeland – CEO & Founder
• Ali Torkamani – CSO & Founder
• Nicholas Schork – Founder
• Eric Topol – Founder

Questions?