Genomics and High Performance Computing
Folker Meyer
Argonne National Laboratory and University of Chicago

Brief intro:
I am a computer scientist turned computational biologist
My CS friends tell me I am a biologist
My BIO friends tell me I am a CS person
So clearly take my comments with a grain of salt
My research interests:
– Metagenomics to study the microbial role in
  • geochemical cycling (Climate, Remediation)
  • human health (Human Microbiome)
Strong emphasis on technology development to allow study
– Algorithms, integrations, tools, genomics tech

Why microbes?
Source: Rob Knight, U. Colorado

Biology changed. From this …
http://www.the-aps.org/education/
http://www.ferrum.edu/majors/biology.jpg
http://www.oneocean.org/ambassadors/track_a_turtle/biology
These are: “biology.png”, “biology.gif” and “biology.jpg”.

… to this. (in ~2003)

DNA Technology advances … FAST
Year    Technology       Read length   Mbp/hr     Mbp/run          Cost/run
1977    Sanger           ~100          ~0.0001    ~0.001           ?
1987    ABI 370          ~500          0.01       0.05             ~$1,000
~2001   ABI 3700         ~1000         0.1        0.5              $200
2005    454 pyroseq      120-450       25-100     500              $13,000
2007    Solexa           50-125bp(1)   ~1000      50,000-200,000   $15,000
2007    ABI Solid        35-50bp       75-250     6,000-20,000     ~$5,000
2010    Ion Torrent(2)   ~200          100        50               $500
2010?   Helicos          25-50         ~50        ?                $20,000
2010?   PacBio           700           120        30               $100
1) Solexa paired-end now 200-220bp
2) Machine cost ~$50k

Ongoing change

1) Democratization of data generation
From factory to bench-top
And >70% of Illumina machines go to “small” customers(1)
1) From Illumina at 2010 GIA meeting

2) Data set size
Instrument output went from <<1 GB to 200 GB in <5 years
24-48 month trend looks … interesting
To put this into perspective:
– All genomics knowledge (“RefSeq”) was 57GB when I last checked
– “Large” Global Ocean Survey study by Craig Venter in 2004 was 600 megabases
When biologists speak of big data, it frequently fits on an iPod

3) Computing costs dominate
[Chart: sequencing vs. bioinformatics cost ($3,000 to $900,000) for data set sizes from 0.5 to 294 GB on 454, GAIIx and HiSeq2000]
• 95GB == 195,600 node hours (on Nehalem 8core, 16GB)
• Illumina HiSeq2000 = 2x100GB/run available today
• cost is purely BLASTX (no storage or transfer cost) on Amazon EC2
• note: 10x or 100x improvements over BLASTX will help, but will not solve the problem
Source: Wilkening et al., Proceedings IEEE Cluster09, 2009

What does genomics look like?

Example: Analysis of pathogen: B. subtilis

Sequencing a genome
Where are we?
Closing gaps between contigs
Validating the sequence via a map

Genome sequencing is now routine
Figures: © Nikos Kyrpides, DOE Joint Genome Institute, G.O.L.D.

Genome sequencing: a success story
Genome assembly used to require large machines and significant manual effort
More data and novel codes make this much easier
– Velvet, newbler, AllPaths, …..
Bacterial genome sequencing and assembly sometimes take only a day
While some quality issues still exist, complete automation on a low-cost machine is likely to happen in the next 12-24 months

A genome:
…CGGGGGAGCCCTCCAGAATACCCATCATATAGCCCCTGAGGTGGCATGGGATGTCTCCATGAGGGAACCCCTTCCCACTTCATACTGTCACGTATATCATAGTGTTC…
[raw nucleotide sequence continues to fill the slide]
…GCG…

From genome to information: Annotation
Bioinformatics
Source: A. Becker, U of Freiburg, Germany

Annotation …
In the old days:
– Find every possible gene
– Run every tool known to mankind
  • BLAST, HMMer, …
– Against every known database
  • RefSeq, PFAM, InterPro, KEGG, COG, …
– Have humans interpret the results
HTGA
Several drawbacks:
– Computationally expensive: fixable with $$
– Requires lots of FTEs: fixable with $$$$
– Subjective factors come into play: fixable with standards?
  • Still an open debate

Resulting compute requirements …
Bacterial genomes contain ~1000 genes per Megabase
  BLAST vs NCBI NR search takes >10 min per gene    =  10,000 min
  annotations often require EC (Enzyme) numbers     +  10,000 min
  protein domains (Pfam) help gain confidence       +  50,000 min
  genomic context comparison (“Clusters”)           + 200,000 min
(roughly 270,000 CPU minutes per Megabase in total)
“Clusters” and “Pfam” have the highest confidence
BLAST is viewed as error prone
CPU investment and quality are correlated
More computing helps
Most groups cannot pay the price
Source: Informal survey of ~20 manual annotators

Things change ….
More sequences are being annotated
Database grows
Human annotator expertise shrinks relatively
[Chart: relative annotator expertise (100%, 50%, 1%) against DNA space over time (Sanger, WGS, Pyrosequencing)]

Ecosystem is unsustainable
As sequencing becomes so cheap
Analysis is the bottleneck
Community needs a scalable solution
Many view standards as the solution:
– It is unclear to me how standards alone can help this problem
As we get more data, computes take longer and everything becomes more complex

Annotation: Another success story
Many expensive solutions exist
But there is also a novel approach: RAST server (Aziz et al, BMC Genomics, 2008)
– Team of CS and Bio experts developed a novel approach
  • Subsystem technology
– Combining domain knowledge from both areas
  • Integrate data curation and annotation using novel approaches
  • requiring far fewer resources, with better accuracy
Annotate several genomes in a day on a laptop
Server has processed over 12k genomes since 2008
– Note: Works only for bacteria
– Extension to other areas possible
– Try it: http://rast.nmpdr.org

The future ..
Bacterial genomics has become easy
Larger genomes remain harder
– But plummeting sequencing cost will help
“traditional” genomics
– Sequencer output is “compressed” to a few contigs
– Via “assembly” to a fraction of the size
– Human genome only has 20,000 genes
Imagine what would happen w/o assembly
Every sequence a “gene”
The next big thing: Metagenomics

Cost per base
Source: Rob Knight, U. Colorado

Metagenomics needs the magic wand..
== “shotgun genomics applied directly to various environments”
– “shotgun metagenomics”
!= sequencing of BAC clones with env. DNA
– “functional metagenomics”
!= sequencing single genes (16S rDNA)
– “gene surveys”
What are they doing? Who are they?

Community Structure and Metabolism
Gene W. Tyson, Jarrod Chapman, Philip Hugenholtz, Eric E. Allen, Rachna J. Ram, Paul M. Richardson, Victor V. Solovyev, Edward M. Rubin, Daniel S. Rokhsar & Jillian F. Banfield
NATURE | VOL 428 | 4 MARCH 2004 | www.nature.com/nature

Size of flagship projects …
Date   Metagenome           Size      Type
2004   Acid Mine Drainage   76 Mbp    Sanger
2004   GOS-I                700 Mbp   Sanger
2009   ANL-Subsurface       12 Gbp    Solexa 2x75PE
2009   JGI-Cow-rumen        17 Gbp    Solexa
2010   JGI-Cow-rumen        255 Gbp   Solexa
2010   MetaHIT              500 Gbp   Solexa
2010   DeepSoil             70 Gbp    Solexa 180bp (125x2)
2010   HMP                  5.7 Tbp   Solexa 100x2

Metagenome tasks
Metabolic functions
Community members
Who is doing what
“binning” genes to genomes

Sync…
We are here:
8492 metagenomes from > 500 groups
Over 10GB per week (rapid growth)
Many centers produce data

MG-RAST – metagenomics RAST server
Data growing over time
• open access, web based
• users upload data sets
• >3500 data submitters
Web UI

Scaling up MG-RAST v2
~3,500 users (data submitters)
~200 daily users (>10 minutes)
V2.0 was a typical bioinformatics app (next slide)
Throughput was becoming a major problem
approaching 1 GB per week in late 2008
Need a mechanism allowing more throughput

Technology choices (typical for BIO)
Tightly integrated system
– Pleasantly parallel code
– NFS for data movement
– central database server
– Workflow management via SGE
Running on ~50 machines locally
40 node Dual-PPC-Cluster shares NFS filesystem with all systems

Performance analysis
[Chart: run time in hours for MG-RAST jobs, from small to large (avg. ~0.1Gb)]
A snapshot with little wait time
Most time is spent in SIMS
Short jobs spend a long time in create jobs
• Careful analysis of all computations
• IO to CPU ratio determines suitability of platforms

Redesign workflow
Enable use of distributed computing platforms
– Including e.g.
BG/P, EC2, Azure and local clusters
Enable users to contribute resources
Be robust, scalable, fault tolerant
Enable replacement of algorithms with more efficient ones
Enable support for staged database updates
Built a prototype workflow engine
– Argonne Workflow Environment (A.W.E.)

A.W.E.
[Diagram: AWE server (webserver + db) hands out work (fna files) to AWE clients on request; clients send back results]
Facebook’s Tornado + SQLAlchemy
RESTful
Diverse clients
• Single analysis operation results in a series of work units
• Client requests a lease on work, with a timeout
• Results reported to the server, with failures resulting in lease expiration
• REST interface scales well and minimizes prerequisites
• Client can size requests to local computational capacity
• Tested up to ~500 clients

What will the future bring?

I lost my crystal ball
BUT:
A lot more data seems a safe bet
A lot more computing is certain
The computing will be non-coupled codes (pleasantly parallel)
I hope for better algorithms
More standards … ah, best practice

M5: a metagenomics data sharing infrastructure for a democratized sequencing world
M5 = metagenomics, metadata, meta-analysis, models, and metainfrastructure

M5 – The initial goal
Enable similarity searches against one non-redundant database covering:
– GenBank, RefSeq, KEGG, UniProt, SEED, IMG, EggNogs
One computation can be used for all tools
Store and transport similarities to avoid recomputing
Proposed MTF – Metagenome Transport Format – v0.1
Benefit: compute once, use in ALL tools

M5 – the long term goal
Establish community best practice
It will also lead to the ability to outsource some computing tasks for the community
Maintaining large metagenomic data sets will overwhelm all bio computing centers I know
Searches against published metagenomes will become impossible if we don’t establish a curated body of metagenomes
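
The "compute once, use in ALL tools" idea from the M5 slides can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say how the non-redundant database is built, so collapsing identical sequences by an MD5 checksum is an assumption here, and every name (`NonRedundantDB`, `add`, `annotate`, the example identifiers and sequences) is hypothetical.

```python
import hashlib
from collections import defaultdict


def checksum(sequence: str) -> str:
    """Key for a sequence: MD5 of the upper-cased residues (an assumption)."""
    return hashlib.md5(sequence.strip().upper().encode("ascii")).hexdigest()


class NonRedundantDB:
    """Hypothetical non-redundant store: one entry per distinct sequence."""

    def __init__(self):
        self.sequences = {}               # checksum -> sequence
        self.sources = defaultdict(set)   # checksum -> {(database, identifier)}

    def add(self, database: str, identifier: str, sequence: str) -> str:
        """Register a database entry; identical sequences share one key."""
        key = checksum(sequence)
        self.sequences.setdefault(key, sequence.strip().upper())
        self.sources[key].add((database, identifier))
        return key

    def annotate(self, hit_key: str, database: str):
        """Translate one stored similarity hit into a given database's IDs."""
        return sorted(i for d, i in self.sources[hit_key] if d == database)


db = NonRedundantDB()
k1 = db.add("RefSeq", "refseq_example_1", "MKTAYIAKQR")
k2 = db.add("KEGG", "kegg_example_1", "mktayiakqr")   # same protein, other database
k3 = db.add("SEED", "seed_example_1", "MSLLTEVETY")

# A similarity computed once against the non-redundant entry ...
stored_hit = {"query": "read_0001", "subject": k1, "identity": 97.5}
# ... can be reported in the namespace of any member database.
refseq_ids = db.annotate(stored_hit["subject"], "RefSeq")
kegg_ids = db.annotate(stored_hit["subject"], "KEGG")
```

In this sketch the expensive step (the similarity search) would run against `db.sequences` only once; each tool then needs just the stored hit and the checksum-to-identifier mapping.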
(A part of) the proposed M5 Platform
[Diagram: HPC (e.g. OLCF, ALCF, TeraGRID, OSG, …) and CLOUD resources; define reference data set; export MTF; SOP/Workflow repository; Standards in Genomics; IMG, MG-RAST, CAMERA, Xyz..]
USERS (large scale)
• Environmental metagenomics: GOS, Terragenome
• Microbiota in human health: HMP
Many smaller user groups

Summary
Overall, genomics is not limited by lack of cycles
Lack of good codes and best practice is more limiting
Adjusting to “large data”
It will become important for the HPC community to recognize good use of cycles
And help avoid “stupid computes”
Remember

Bioinformatics
Interesting novel questions
Which microbes are where on the planet
– Microbial weather
Which genes are where
– Gene migration patterns
– Combinations of genes
– Where do pathways or clusters travel
Integration of climate data
Predictive models
– How will X impact the microbes

Evolution of computing infrastructure for BIO
[Chart: abundance of machines before genomics, in early genomics, and in the genomics era (2010)]

Why should I care?
Microbes determine the climate on the planet! (e.g. Falkowski et al, Science 2008)
Microbes impact human health (e.g. obesity, Turnbaugh et al, Nature 2006)
Computations are pleasantly parallel, but there are a lot of them
Example: Oil spill, integrate gene inventory data with oil spill patterns
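
Because the computations are pleasantly parallel, they can be handed out as independent work units, as described on the A.W.E. slide. The following is a minimal in-memory sketch of that lease idea: a client requests a lease with a timeout, reports results, and work whose lease expires is handed out again. It is not the A.W.E. code. The class and method names (`WorkServer`, `request_work`, `report_result`), the file names and the injected clock are assumptions for illustration, and the REST layer is left out.

```python
class WorkServer:
    """Hypothetical in-memory stand-in for a lease-based work server."""

    def __init__(self, work_units, lease_seconds, clock):
        self.pending = list(work_units)   # units nobody is working on
        self.leases = {}                  # unit -> (client, expiry time)
        self.results = {}                 # unit -> result
        self.lease_seconds = lease_seconds
        self.clock = clock                # callable returning the current time

    def _expire_leases(self):
        """Return units with timed-out leases to the pending list."""
        now = self.clock()
        for unit, (_, expiry) in list(self.leases.items()):
            if expiry <= now:             # client failed or was too slow
                del self.leases[unit]
                self.pending.append(unit)

    def request_work(self, client, max_units):
        """Lease up to max_units units; the client sizes the request."""
        self._expire_leases()
        granted = self.pending[:max_units]
        self.pending = self.pending[max_units:]
        expiry = self.clock() + self.lease_seconds
        for unit in granted:
            self.leases[unit] = (client, expiry)
        return granted

    def report_result(self, client, unit, result):
        """Accept a result only from the current holder of a live lease."""
        self._expire_leases()
        holder = self.leases.get(unit)
        if holder is None or holder[0] != client:
            return False
        del self.leases[unit]
        self.results[unit] = result
        return True


now = [0.0]                               # fake clock keeps the example deterministic
server = WorkServer(["part1.fna", "part2.fna", "part3.fna"], 60, lambda: now[0])

big = server.request_work("cluster-node", max_units=2)
small = server.request_work("laptop", max_units=1)
accepted = server.report_result("cluster-node", "part1.fna", "sims for part1")

now[0] = 120.0                            # the two remaining leases have timed out
late = server.report_result("laptop", "part3.fna", "too late")
retry = server.request_work("cluster-node", max_units=5)
```

The server never has to detect a failed client explicitly in this design: a client that disappears simply lets its lease run out, and the unit goes back into the pool for the next request.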