Slides - AITP SD Cloud Computing Conference 2014

Choosing a Cloud Provider for your Genomic Applications
Phillip Pham
Director, Technology Development
phillip@cyphergenomics.com
November 13th, 2014
The information disclosed in this document, including all designs and related materials, is the valuable property of Cypher Genomics, Inc. and its licensors. Cypher Genomics, Inc. and its licensors, as appropriate, reserve all patent, copyright and other proprietary rights to this document, including all design, manufacturing, reproduction, use, and sales rights, except to the extent said rights are expressly granted to others.
© 2014 Cypher Genomics, Inc.
Proprietary and Confidential
Biography
• Developed Cypher Genomics technology at the
Scripps Translational Science Institute
• Designed and deployed high performance and
parallel computing systems for annotation and
genome interpretation
• Expertise in genome analysis workflows
• Interested in the genetic architecture influencing
drug efficacy and response in clinical trials
• Bioinformatics: Bioengineering, B.S.
• Patents:
– Pham P, Deshpande S, 2013. Systems and Methods for Genomic Variant Annotation, U.S. Patent 13/841,575, filed March 2013. Patent Pending.
Agenda
• Cypher Genomics Problem Space
• Simplified Logical Architecture
• Pain Points
• Evaluating Alternatives
• Lessons Learned
• Summary
Our Problem Space
• Cypher Genomics provides automated Annotation, Interpretation and Biomarker
Discovery for Whole Genomes, Exomes, CNV, etc.
– Sequencing = AGCTTGAGGATCAACTAGTGCATGCTATACCTGC…
– Alignment & Variant Calling
• Align the appropriate sets of nucleotides to their place in the genome.
– 150 GB BAM files (compressed)
• Compare the aligned input genome to the reference genome
• Identify variants (i.e. mutations) in the input genome.
– Annotation
• Each variant annotated with 90+ attributes from 50+ reference data sources as well as Cypher’s
proprietary data and prediction algorithms.
• Web-Based UI for accessing all data, applying analytic filters, etc.
– Interpretation
• Human Readable PDF Report with Cypher Synthesis summary of most important findings.
– Includes all variants with Cypher Synthesis and references to supporting evidence
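The "compare to the reference, identify variants" step above can be illustrated with a toy sketch. It assumes two already-aligned, equal-length sequences and only reports single-base mismatches; real variant callers work from piles of aligned reads with quality scores, so this is an illustration of the idea and not Cypher's method. The function name `find_snvs` and the sample sequence are hypothetical.

```python
from typing import List, Tuple


def find_snvs(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch.

    Toy assumption: both strings are already aligned and have equal length.
    """
    if len(reference) != len(sample):
        raise ValueError("toy comparison needs equal-length, pre-aligned sequences")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# First 18 bases of the sequence shown on the slide, used as a stand-in reference.
reference = "AGCTTGAGGATCAACTAG"
# Hypothetical input genome with one substitution at position 10 (A -> T).
sample = "AGCTTGAGGTTCAACTAG"

variants = find_snvs(reference, sample)
print(variants)  # [(10, 'A', 'T')]
```

Each tuple returned here corresponds to one "variant" in the sense used on this slide; annotation would then attach attributes to it.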
Mantis™ Workflow – Clinical Use Case
Patient Sample Sent to Lab → Sample is Sequenced, Aligned & Variant Called → Next Gen Sequencing Data Uploaded to Cypher → Mantis Interprets Genome Data → Report Generated (.PDF) → Clinical Summary Delivered
• 200M+ Variants Annotated and growing
• 90+ Annotations per Variant
• 50+ Reference DBs
• Data sizes shown along the workflow: 150GB BAM, 125MB VCF, 40+ TB (and growing), 3MB
Automated solutions are essential to provide precision medicine at population scales
YESTERDAY – Nicholas Volker (2010)
• Medical College of Wisconsin
• Life-threatening bowel disease
• Whole exome sequencing (1% of genome)
• Cost: $15,000
• Single mutation in XIAP
TODAY – Lilly Grossman (2013)
• Scripps IDIOM / Cypher Genomics
• Debilitating neuromuscular disease
• Whole genome sequencing
• Automated interpretation
• Mutations in ADCY5
TOMORROW – Every Patient (2018)
• Driven by decrease in sequencing cost (e.g. Illumina X10 - $1,000 genome)
• Whole genome – baseline
• Updated interpretation over time
• Genomic clinical decision support
Coral™ Workflow – Pharma Use Case
Patient Data Collected → Study Samples Sequenced → Next Gen Sequencing Data to Cypher → Cypher Runs Genomes at Scale → Coral Produces Predictive Models → Biomarker Identified
Data and Compute usage grow in population studies
Logical Deployment Components
• Web Front-End – User Interface
– Apache Web Server
– Zend Framework Server
• Cypher Core Services (CCS)
– REST API Implementation to Service
• User Interface
• Automated Integration
– CDH – Hadoop Ecosystem
• HDFS – Holds raw reference data
• HBase – Variant Annotation DB
• MapReduce – Process variant data and annotation information at scale
– MongoDB
• Consolidated Reference Data
• Demographics Data
– Vertica
• Analytics DB for real-time, interactive analytic investigation
• Cypher Analytic Pipeline (CAP) - Novel Variant Annotation Pipeline
– Penguin on Demand – HPC with Torque scheduler
– Custom parallelization algorithm for annotating, running predictive algorithms
– Final annotation of Variant goes into CCS Annotation DB
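The split between known and novel variants described above (known variants served from the Annotation DB, novel ones sent through CAP and then written back) can be sketched roughly as follows. This is a minimal in-memory illustration under stated assumptions: a plain dict stands in for the HBase Annotation DB, `annotate_novel` stands in for the CAP pipeline, and the variant key layout is hypothetical. None of these names come from Cypher's code.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]
Annotation = Dict[str, str]


def annotate_batch(
    variants: Iterable[Variant],
    annotation_db: Dict[Variant, Annotation],
    annotate_novel: Callable[[Variant], Annotation],
) -> Tuple[Dict[Variant, Annotation], List[Variant]]:
    """Look up known variants; compute and store annotations for novel ones.

    Returns the annotations for the batch and the list of variants that were novel.
    """
    results: Dict[Variant, Annotation] = {}
    novel: List[Variant] = []
    for variant in variants:
        if variant not in annotation_db:
            # Stand-in for the expensive pipeline run; the result is written
            # back so the same variant is a cheap lookup next time.
            annotation_db[variant] = annotate_novel(variant)
            novel.append(variant)
        results[variant] = annotation_db[variant]
    return results, novel


def fake_pipeline(variant: Variant) -> Annotation:
    """Placeholder annotator; a real pipeline would compute many attributes."""
    return {"source": "pipeline", "gene": "UNKNOWN"}


db: Dict[Variant, Annotation] = {("chrX", 100, "A", "T"): {"source": "db", "gene": "XIAP"}}
batch = [("chrX", 100, "A", "T"), ("chr3", 200, "G", "C")]

first_results, first_novel = annotate_batch(batch, db, fake_pipeline)
second_results, second_novel = annotate_batch(batch, db, fake_pipeline)
print(len(first_novel), len(second_novel))  # 1 0
```

The design point is that the expensive path runs once per distinct variant, so the share of lookups should grow as the Annotation DB fills.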
Simplified Logical Architecture
[Architecture diagram; recoverable labels below]
• Cloud Provider
– Front-end Applications: Web Servers and Web App Servers
– REST APIs
– Cypher Core Services: CCS Master Application Server and CCS nodes
• Annotation of known Variants
• Interactive Analytics Back-End
• Interpretation & Reporting
– Data stores: Demographic Data (Mongo), Consolidated Reference DB (Mongo), Annotation DB (Hbase), Analytics DB (Vertica), Source Reference Data
– HDFS Hadoop & Map/Reduce Cluster
• HPC Provider
– Cypher Annotation Pipeline (Novel Variants): CAP Master Application Server and CAP nodes
• Annotation of Novel Variants
– HPC Environment with TORQUE (HPC Job Scheduler)
Pain Points
• Downtime
– Unscheduled system reboots and upgrades.
• Temporary loss of data and configurations
– Poor communication
• Difficult to manage temporary cluster instantiation
• Firewall architecture imposes inflexible rules on our naming
conventions
• Limits on cluster and storage quotas (whether provisioning programmatically or
through the Management Console).
– We had to call the vendor to raise our limits.
• Ease of getting to HIPAA Compliance
Evaluating Alternatives
• Comparable (or better) Performance per $
• Better Uptime / Less Dramatic Impact during Maintenance
• Start and stop temporary clusters with ease
• Flexible firewall rules
• Unlimited storage scaling on demand
• Strategic options for utilizing service offerings
• HIPAA compliance and business associate agreement
What did we test?
• Most Time Consuming Stages of our Processing
– Reformat Input Files
– Query Annotations
– Process Annotations
– Split Results
– Organize Results
• And
– Total End-to-End processing time
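One way to collect timings for the stages listed above is a small context-manager timer wrapped around each stage, with an outer timer for the end-to-end figure. The sketch below is an assumption about how this could be done, not the instrumentation actually used; `StageTimer` is a hypothetical helper and the `time.sleep` calls are placeholders for real work.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

# Stage names taken from the slide; the work inside each stage is a placeholder.
STAGES = [
    "Reformat Input Files",
    "Query Annotations",
    "Process Annotations",
    "Split Results",
    "Organize Results",
]


class StageTimer:
    """Accumulates wall-clock seconds per named stage (hypothetical helper)."""

    def __init__(self) -> None:
        self.seconds: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raises.
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed


timer = StageTimer()
with timer.stage("Total End-to-End"):
    for name in STAGES:
        with timer.stage(name):
            time.sleep(0.001)  # placeholder for the real stage

for name, secs in timer.seconds.items():
    print(f"{name}: {secs:.4f} s")
```

Because the outer timer also covers any gaps between stages, the end-to-end figure can exceed the sum of the individual stages, and that difference is itself worth tracking.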
Test Configurations – Hadoop (Hbase/MapReduce) Cluster
Current Cloud Vendor
• Test Configuration – “Original Apple”
– 6 Nodes
• 8 CPU
• 32 GB RAM
• 40 TB Block Storage (Magnetic)
• No Encryption
New Cloud Vendor
• Configuration 1 – “Pear”
– 6 Nodes
• 8 CPU
• 32 GB RAM
• 40 TB Block Storage (SSD)
• All Disks Encrypted
• Configuration 2 – “Melon”
– 6 Nodes
• 32 CPUs
• 60 GB RAM
• 40 TB Local Disk (Magnetic)
• All Disks Encrypted
NOTE: The above configuration is smaller than our entire product and is just for performance comparison purposes. You should draw NO conclusions regarding the performance of our product from these comparisons.
Apples to Pears to Melons… Oh my! (Reformat)
Apples to Pears to Melons… Oh my! (Query)
Apples to Pears to Melons… Oh my! (Process Annots)
Apples to Pears to Melons… Oh my! (Split Results)
Apples to Pears to Melons… Oh my! (Organize)
[Chart: Time / min (y-axis, 0 to 1600) at x-axis values 5, 10, 50 and 80, comparing Original Apple, Pear - 6x8x32x40TB (Enc. SSD) and Melon - 6x32x60x40TB (Enc. Local Mag)]
Apples to Pears to Melons… Oh my! (Overall)
Lessons Learned
• Environment set-up and tweaking
– 80% of time spent here
– Be prepared to iterate on this as you test larger and larger batches.
• Work with your vendor to provide experts in Hadoop,
Performance and Capacity Planning to aid in your evaluation.
• Work with your vendor to cost out expenditures
– Don’t forget security, load balancing, auto-scaling services, etc.
– Look closely at the advantages of paying a portion up-front to get good
discounts on monthly costs.
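The up-front versus monthly trade-off above comes down to simple arithmetic. The sketch below uses made-up prices (none of the figures come from the talk or from any vendor) to show how a break-even month could be estimated; the function names are hypothetical.

```python
import math
from typing import Optional


def total_cost(months: int, monthly_rate: float, upfront: float = 0.0) -> float:
    """Total spend after a number of months, with an optional up-front payment."""
    return upfront + months * monthly_rate


def break_even_months(
    on_demand_monthly: float, upfront: float, discounted_monthly: float
) -> Optional[int]:
    """First whole month at which the up-front plan is no more expensive.

    Returns None when the discounted rate is not actually lower.
    """
    saving_per_month = on_demand_monthly - discounted_monthly
    if saving_per_month <= 0:
        return None
    return math.ceil(upfront / saving_per_month)


# Hypothetical prices, for illustration only.
ON_DEMAND = 10_000.0   # per month, pay as you go
UPFRONT = 30_000.0     # one-time payment
DISCOUNTED = 6_000.0   # per month after paying up-front

print(break_even_months(ON_DEMAND, UPFRONT, DISCOUNTED))  # 8
print(total_cost(12, ON_DEMAND))                          # 120000.0
print(total_cost(12, DISCOUNTED, UPFRONT))                # 102000.0
```

With these assumed numbers the up-front plan only pays off if the cluster is kept for at least eight months, which is why the expected lifetime of the deployment belongs in the cost discussion with the vendor.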
Summary – Key Requirements of a Genomics Cloud Provider
• Find a Cloud Provider that will be a Strategic Partner!
– Resources to aid in your evaluation.
– Resources to aid in optimizing costs.
• Instrument your applications to capture detailed performance stats per logical
computational step.
– Gives delta between current and new Cloud Provider infrastructure.
• During your testing, establish how responsive the Cloud Partner is to issues
unrelated to your pilot (e.g. machine upgrades, failures in nodes, networking
and/or storage).
• Ensure that you can stop clusters that you are not testing so that you aren’t billed
more than once.
• Make sure they will sign a HIPAA BAA and that you know exactly what is covered.
• Pick a Cloud Partner that offers services for automating deployment, scaling, load
balancing, security and log management (even if you don’t plan to use them now).
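The "delta between current and new Cloud Provider infrastructure" mentioned above can be computed directly once per-step timings exist for both environments. The following is a minimal sketch with invented timings (they are not results from this evaluation); `compare_stage_times` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple


def compare_stage_times(
    current: Dict[str, float], candidate: Dict[str, float]
) -> List[Tuple[str, float, float]]:
    """Return (stage, delta in minutes, percent change) for stages present in both.

    Negative values mean the candidate provider was faster.
    """
    rows = []
    for stage, current_minutes in current.items():
        if stage not in candidate or current_minutes == 0:
            continue  # skip stages that cannot be compared
        delta = candidate[stage] - current_minutes
        rows.append((stage, delta, 100.0 * delta / current_minutes))
    return rows


# Invented timings in minutes, for illustration only.
current_provider = {"Query Annotations": 200.0, "Organize Results": 400.0}
new_provider = {"Query Annotations": 150.0, "Organize Results": 500.0}

report = compare_stage_times(current_provider, new_provider)
for stage, delta, pct in report:
    print(f"{stage}: {delta:+.1f} min ({pct:+.1f}%)")
# Query Annotations: -50.0 min (-25.0%)
# Organize Results: +100.0 min (+25.0%)
```

Reporting per step rather than only end-to-end shows where a new environment helps and where it hurts, which a single total would hide.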
Acknowledgements
• Javier Velazquez-Muriel – Bioinformatics Engineer
• Patrick Ravenel – CTO & VP of Engineering
• Ashley Van Zeeland – CEO & Founder
• Ali Torkamani – CSO & Founder
• Nicholas Schork – Founder
• Eric Topol – Founder
Questions?