Computer Science Department of A New Distributed System for Large-Scale Sequence Analysis School of Engineering and Applied Science University of Virginia Douglas Blair and Gabriel Robins www.cs.virginia.edu {dmb4x, robins}@cs.virginia.edu (804) 982-2207 Presented at Intelligent Systems for Molecular Biology 2000, August 19-23, 2000, San Diego, CA Bioinformatics “Central Dogma” of Molecular Biology Sequence Alignment Proteins and Evolution FRVAKFEIDKYANLNRWYENAKKVTPGWEE YRVAFEPTLDAYANLRDFEGVKKITPE Basic mechanisms in all living organisms [Crick, ~1956] Y R M F E P K C L D A F A N L R D F L A R F E G L K K I S A YRVFEPDAYANLRDFLEGVKKITSE YRMFEPKLDAFANLRDFLREGVKKITSA Time Time YRVAKFELDAYANLRWENVKKITPE FRVAKFELDKYANLRWENVKKITPGWE Transcription YRMFEPKLDAFANLRDFLREGVKKITSA Replication RNA FRVAKFELDKYANLRWYENAKKITPGWE AUGCCUAUGAUACUGGGAUAC... YRMFEPKLDAFANLRDFLAREGLKKITSA FRVAKFEIDKYANLNRWYENAKKVTPGWEE Translation YRMFEPKCLDAFANLRDFLARFEGLKKISA FRVAKFE---IDKYANLNRW---YENAKKVTPGWEE .:. :: .: .::: . .:. ::.. YRM--FEPKCLDAFANLRDFLARFEGLKKISA Protein MPMILGY... Data Avalanche Smith-Waterman [Smith & Waterman, 1981] Smith-Waterman with gaps [Gotoh, 1982] FASTA [Pearson & Lipman, 1988] BLAST [Altschul et al., 1990] Statistical Significance: Distribution of Smith-Waterman scores [Karlin & Altschul 1990] Distribution of n SW scores [Karlin & Altschul 1993] Empirical distribution for gapped scores [Altschul & Gish 1996] Paradigm Shift Yesterday New Sequence Analysis Paradigm: Genomics and Comparative Genomics Old Sequence Analysis Paradigm Record new experimentally derived sequence Compare to known sequences in database Determine statistical significance of comparison scores Deduce biological and evolutionary relationships DATABASE OF KNOWN SEQUENCES S. cerevisiae ATGCCTATGATACTGGGATAC... Gene Protein ? TAAGTTATTATTTAGTTAATACTTTTAACAATATTATTAAGGTATTTAAAAAATACTATT ATAGTATTTAACATAGTTAAATACCTTCCTTAATACTGTTAAATTATATTCAATCAATAC ATATATAATATTATTAAAATACTTGATAAGTATTATTTAGATATTAGACAAATACTAATT TTATATTGCTTTAATACTTAATAAATACTACTTATGTATTAAGTAAATATTACTGTAATA CTAATAACAATATTATTACAATATGCTAGAATAATATTGCTAGTATCAATAATTACTAAT ATAGTATTAGGAAAATACCATAATAATATTTCTACATAATACTAAGTTAATACTATGTGT AGAATAATAAATAATCAGATTAAAAAAATTTTATTTATCTGAAACATATTTAATCAATTG AACTGATTATTTTCAGCAGTAATAATTACATATGTACATAGTACATATGTAAAATATCAT TAATTTCTGTTATATATAATAGTATCTATTTTAGAGAGTATTAATTATTACTATAATTAA GCATTTATGCTTAATTATAAGCTTTTTATGAACAAAATTATAGACATTTTAGTTCTTATA ATAAATAATAGATATTAAAGAAAATAAAAAAATAGAAATAAATATCATAACCCTTGATAA CCCAGAAATTAATACTTAATCAAAAATGAAAATATTAATTAATAAAAGTGAATTGAATAA AATTTTGGGAAAAAATGAATAACGTTATTATTTCCAATAACAAAATAAAACCACATCATT CATATTTTTTAATAGAGGCAAAAGAAAAAGAAATAAACTTTTATGCTAACAATGAATACT TTTCTGTCAAATGTAATTTAAATAAAAATATTGATATTCTTGAACAAGGCTCCTTAATTG TTAAAGGAAAAATTTTTAACGATCTTATTAATGGCATAAAAGAAGAGATTATTACTATTC AAGAAAAAGATCAAACACTTTTGGTTAAAACAAAAAAAACAAGTATTAATTTAAACACAA TTAATGTGAATGAATTTCCAAGAATAAGGTTTAATGAAAAAAACGATTTAAGTGAATTTA ATCAATTCAAAATAAATTATTCACTTTTAGTAAAAGGCATTAAAAAAATTTTTCACTCAG TTTCAAATAATCGTGAAATATCTTCTAAATTTAATGGAGTAAATTTCAATGGATCCAATG GAAAAGAAATATTTTTAGAAGCTTCTGACACTTATAAACTATCTGTTTTTGAGATAAAGC AAGAAACAGAACCATTTGATTTCATTTTGGAGAGTAATTTACTTAGTTTCATTAATTCTT TTAATCCTGAAGAAGATAAATCTATTGTTTTTTATTACAGAAAAGATAATAAAGATAGCT TTAGTACAGAAATGTTGATTTCAATGGATAACTTTATGATTAGTTACACATCGGTTAATG AAAAATTTCCAGAGGTAAACTACTTTTTTGAATTTGAACCTGAAACTAAAATAGTTGTTC AAAAAAATGAATTAAAAGATGCACTTCAAAGAATTCAAACTTTGGCTCAAAATGAAAGAA CTTTTTTATGCGATATGCAAATTAACAGTTCTGAATTAAAAATAAGAGCTATTGTTAATA ATATCGGAAATTCTCTTGAGGAAATTTCTTGTCTTAAATTTGAAGGTTATAAACTTAATA TTTCTTTTAACCCAAGTTCTCTATTAGATCACATAGAGTCTTTTGAATCAAATGAAATAA ATTTTGATTTCCAAGGAAATAGTAAGTATTTTTTGATAACCTCTAAAAGTGAACCTGAAC TTAAGCAAATATTGGTTCCTTCAAGATAATGAATCTTTACGATCTTTTAGAACTACCAAC TACAGCATCAATAAAAGAAATAAAAATTGCTTATAAAAGATTAGCAAAGCGTTATCACCC TGATGTAAATAAATTAGGTTCGCAAACTTTTGTTGAAATTAATAATGCTTATTCAATATT AAGTGATCCTAACCAAAAGGAAAAATATGATTCAATGCTGAAAGTTAATGATTTTCAAAA TCGCATCAAAAATTTAGATATTAGTGTTAGATGACATGAAAATTTCATGGAAGAACTCGA ACTTCGTAAGACCTGAGAATTTGATTTTTTTTCATCTGATGAAGATTTCTTTTATTCTCC ATTTACAAAAAACAAATATGCTTCCTTTTTAGATAAAGATGTTTCTTTAGCTTTTTTTCA GCTTTACAGCAAGGGCAAAATAGATCATCAATTGGAAAAATCTTTATTGAAAAGAAGAGA TGTAAAAGAAGCTTGTCAACAGAATAAAAATTTTATTGAAGTTATAAAAGAGCAATATAA CTATTTTGGTTGAATTGAAGCTAAGCGTTATTTCAATATTAATGTTGAACTTGAGCTCAC ACAGAGAGAGATAAGAGATAGAGATGTTGTTAACCTACCTTTAAAAATTAAAGTTATTAA TAATGATTTTCCAAATCAACTCTGATATGAAATTTATAAAAACTATTCATTTCGCTTATC TTGAGATATAAAAAATGGTGAAATTGCTGAATTTTTCAATAAAGGTAATAGAGCTTTAGG Genome D. melan. M. gen. M. gen. Today E. coli S. cerevisiae MPMILGYWNVRG LTHPIRMLLEYT DSSYDEKRYTMG DAPDFDRSQWLN EKFKLGLDFPNL PYLIDGSHKITQ SNAILRAHWSNK GENOMIC DNA CTGAAGCCAGTTTGAGAA GACCACAGCACCAGCACC ATGCCTATGATACTGGGA TACTGGAACGTCCGCGGA CTGACACACCCGATCCGC ATGCTCCTGGAATACACA GACTCAAGCTATGATGAG AAGAGATACACCATGGGT GACGCTCCCGACTTTGAC AGAAGCCAGTGGCTGAAT GAGAAGTTCAAGCTGGGC CTGGACTTTCCCAATCTG CCTTACTTGATCGATGGA TCACACAAGATCACCCAG Needleman-Wunsch [Needleman & Wunsch, 1970] E. coli ATGCCTATGATACTGGGATAC... Sequence Comparison Dynamic Programming Algorithms: D. melan. DNA Algorithms and Statistics MNNVIISNNKIKPHHSYFLIEAKEKEINFYANNEYFSVKCNLNKNIDILEQGSLIVKGKIFNDLINGIKE EIITIQEKDQTLLVKTKKTSINLNTINVNEFPRIRFNEKNDLSEFNQFKINYSLLVKGIKKIFHSVSNNR EISSKFNGVNFNGSNGKEIFLEASDTYKLSVFEIKQETEPFDFILESNLLSFINSFNPEEDKSIVFYYRK DNKDSFSTEMLISMDNFMISYTSVNEKFPEVNYFFEFEPETKIVVQKNELKDALQRIQTLAQNERTFLCD MQINSSELKIRAIVNNIGNSLEEISCLKFEGYKLNISFNPSSLLDHIESFESNEINFDFQGNSKYFLITS MNLYDLLELPTTASIKEIKIAYKRLAKRYHPDVNKLGSQTFVEINNAYSILSDPNQKEKYDSMLKVNDFQ NRIKNLDISVRWHENFMEELELRKTWEFDFFSSDEDFFYSPFTKNKYASFLDKDVSLAFFQLYSKGKIDH QLEKSLLKRRDVKEACQQNKNFIEVIKEQYNYFGWIEAKRYFNINVELELTQREIRDRDVVNLPLKIKVI NNDFPNQLWYEIYKNYSFRLSWDIKNGEIAEFFNKGNRALGWKGDLIVRMKVVNKVNKRLRIFSSFFEND Challenges MEENNKANIYDSSSIKVLEGLEAVRKRPGMYIGSTGEEGLHHMIWEIVDNSIDEAMGGFASFVKLTLEDN FVTRVEDDGRGIPVDIHPKTNRSTVETVFTVLHAGGKFDNDSYKVSGGLHGVGASVVNALSSSFKVWVFR QNKKYFLSFSDGGKVIGDLVQEGNSEKEHGTIVEFVPDFSVMEKSDYKQTVIVSRLQQLAFLNKGIRIDF VDNRKQNPQSFSWKYDGGLVEYIHHLNNEKEPLFNEVIADEKTETVKAVNRDENYTVKVEVAFQYNKTYN QSIFSFCNNINTTEGGTHVEGFRNALVKIINRFAVENKFLKDSDEKINRDDVCEGLTAIISIKHPNPQYE GQTKKKLGNTEVRPLVNSVVSEIFERFMLENPQEANAIIRKTLLAQEARRRSQEARELTRRKSPFDSGSL Proteome MGKLADCTTRDPSISELYIVEGDSAGGTAKTGRDRYFQAILPLRGKILNVEKSNFEQIFNNAEISALVMA IGCGIKPDFELEKLRYSKIVIMTDADVDGAHIRTLLLTFFFRFMYPLVEQGNIFIAQPPLYKVSYSHKDL YMHTDVQLEQWKSQNPNVKFGLQRYKGLGEMDALQLWETTMDPKVRTLLKVTVEDASIADKAFSLLMG MAKQQDQVDKIRENLDNSTVKSISLANELERSFMEYAMSVIVARALPDARDGLKPVHRRVLYGAYIGGMH HDRPFKKSARIVGDVMSKFHPHGDMAIYDTMSRMAQDFSLRYLLIDGHGNFGSIDGDRPAAQRYTEARLS KLAAELLKDIDKDTVDFIANYDGEEKEPTVLPAAFPNLLANGSSGIAVGMSTSIPSHNLSELIAGLIMLI DNPQCTFQELLTVIKGPDFPTGANIIYTKGIESYFETGKGNVVIRSKVEIEQLQTRSALVVTEIPYMVNK TTLIEKIVELVKAEEISGIADIRDESSREGIRLVIEVKRDTVPEVLLNQLFKSTRLQVRFPVNMLALVKG • Computation grows quadratically with data volume • Computing power growing less quickly than data volume Faster Better APVLLNMKQALEVYLDHQIDVLVRKTKFVLNKQQERYHILSGLLIAALNIDEVVAIIKKSANNQEAINTL NTKFKLDEIQAKAVLDMRLRSLSVLEVNKLQTEQKELKDSIEFCKKVLADQKLQLKIIKEELQKINDQFG DERRSEILYDISEEIDDESLIKVENVVITMSTNGYLKRIGVDAYNLQHRGGVGVKGLTTYVDDSISQLLV CSTHSDLLFFTDKGKVYRIRAHQIPYGFRTNKGIPAVNLIKIEKDERICSLLSVNNYDDGYFFFCTKNGI VKRTSLNEFINILSNGKRAISFDDNDTLYSVIKTHGNDEIFIGSTNGFVVRFHENQLRVLSRTARGVFGI SLNKGEFVNGLSTSSNGSLLLSVGQNGIGKLTSIDKYRLTKRNAKGVKTLRVTDRTGPVVTTTTVFGNED LLMISSAGKIVRTSLQELSEQGKNTSGVKLIRLKDNERLERVTIFKEELEDKEMQLEDVGSKQI • Current parallel implementations scale poorly • Heuristic methods are faster but less sensitive Solution: Break the Data Bottleneck Data Transmitted M Genomes and Proteomes 70 Organism Year Completed Mycoplasma Genitalium Escherichia Coli Saccharomyces Cerevisiae Caenorhabditis Elegans Drosophila Melanogaster Homo Sapiens 1995 1997 1997 1998 2000 ~2000 Genome Size (Base Pairs) Computation Proteome Size (# of Proteins) 60 50 ~588,000 480 ~4,600,000 4,289 ~11,000,000 ~6,600 ~86,000,000 ~14,300 ~137,000,000 ~13,500 ~3,100,000,000 ~30,000-60,000 4-1 Compression Data Transmitted N Data 2-1 Compression 40 New Method Old Method 30 32 35 31 complete microbial genomes (87 in progress) 20 Many new microbial genomes every year Computer 10 Many other higher organisms’ genomes being sequenced 0 Runaway Growth M+(N/k) Data/CPU (krM)+N Total Data Advances in sequencing technology Exponentially increasing data volume GenBank: 8.6 billion nucleotides (Jun 2000) 9.5 billion nucleotides (Aug 2000) Data growing faster than computer speeds: 2 k Computers (MrN)/k Work/CPU MrN Total Work (M+N)/k Data/CPU (M+N)rk Total Data 4 9 16 25 36 49 64 # of CPUs Old vs. New Data Work Data << Work -- “Square” tasks minimize data/computation Data expansion takes as long as or longer than comparison operations - Data volume doubles every 12 months - Moore’s Law: 18-month doubling time Data expansion relatively inexpensive - compression becomes worthwhile Entire library required everywhere simultaneously (Poor NFS server…) Tasks self-contained, compact, independent - exquisitely parallel Parallelism constrained to number of machines not starved for data Parallelism constrained only by the available number of machines Paves the way for Massively Parallel Computation Ability to take advantage of more processors encourages use of more sensitive, computationally demanding techniques Source: http://www.ncbi.nlm.nih.gov Implementation & Results Test Platform: Parabon Frontier Scalability Future Directions 15000 Job Code Data 12000 Task Internet Smith-Waterman Seq. Comparisons/Sec “Determine never to be idle. No person will have occasion to complain of the want of time, who never loses any. It is wonderful how much may be done, if we are always doing.” Internet Client (UVa) Task Results Job Results Further Smith-Waterman optimizations Investigation of novel methods for estimating statistical significance Other methods (BLAST, FASTA, HMMs, GeneWise, etc.) Data compression Implementation of DNA-protein and DNA-DNA comparisons Large-scale structure-structure comparison Large-scale sequence-structure threading/comparison Human Genome vs. GenBank scale searches Java 1.3 JVM for Provider Compute Engine (Faster than C!) Other projects (e.g. Maximum Likelihood Tree Searches) 9000 y = 37.412x + 271.65 2 R = 0.9968 6000 3000 -- Thomas Jefferson, May 5, 1787 Results Postprocessing Code & Data Elements Task Definitions Frontier Server (Housed at Exodus) Providers (Idle Internet Machines) 0 0 100 200 CPUs 300 400