A New Distributed System for Large-Scale Sequence Analyses
Douglas Blair
Department of Computer Science
University of Virginia

“Central Dogma” of Molecular Biology
Basic molecular mechanisms in all living organisms [Crick, ~1956]
[Diagram: DNA (Replication), Transcription to RNA, Translation to Protein]
Describes storage, duplication, transmission, and processing of genetic information

Bioinformatics
DNA      ATGCCTATGATACTG...
  Transcription
RNA      AUGCCUAUGAUACUG...
  Translation
Protein  MPMILGY...
• Nucleotide sequences, genes (DNA, RNA)
• Amino acid sequences (proteins)
• 3D molecular structures
• RNA and protein expression profiles

Proteins and Evolution
[Figure: two protein lineages diverging over time from a common ancestor]
Ancestor:   YRVAFEPTLDAYANLRDFEGVKKITPE
Lineage 1:  YRVFEPDAYANLRDFLEGVKKITSE
            YRMFEPKLDAFANLRDFLREGVKKITSA
            YRMFEPKLDAFANLRDFLAREGLKKITSA
            YRMFEPKCLDAFANLRDFLARFEGLKKISA
Lineage 2:  YRVAKFELDAYANLRWENVKKITPE
            FRVAKFELDKYANLRWENVKKITPGWE
            FRVAKFELDKYANLRWYENAKKITPGWE
            FRVAKFEIDKYANLNRWYENAKKVTPGWEE
Alignment of the two present-day sequences:
FRVAKFE---IDKYANLNRW---YENAKKVTPGWEE
.:.  ::   .: .:::  .   .:. ::..
YRM--FEPKCLDAFANLRDFLARFEGLKKISA

Sequence Alignment
FRVAKFE---IDKYANLNRW---YENAKKVTPGWEE
.:.  ::   .: .:::  .   .:. ::..
YRM--FEPKCLDAFANLRDFLARFEGLKKISA
[Figure: two-dimensional comparison matrix with YRMFEPKCLDAFANLRDFLARFEGLKKISA along one axis and FRVAKFEIDKYANLNRWYENAKKVTPGWEE along the other]

Algorithms and Statistics
• Sequence Comparison Dynamic Programming Algorithms:
  – Needleman-Wunsch [Needleman & Wunsch, 1970]
  – Smith-Waterman [Smith & Waterman, 1981]
  – Smith-Waterman with Gaps [Gotoh, 1982]
  – FASTA [Pearson & Lipman, 1988]
  – BLAST [Altschul et al., 1990]
• Statistical Significance:
  – Distribution of S-W scores [Karlin & Altschul, 1990]
  – Distribution of n S-W scores [Karlin & Altschul, 1993]
  – Empirical distribution of scores w/gaps [Altschul & Gish, 1996]

Old Sequence Analysis Paradigm
• Record new experimentally derived sequence
• Compare to known sequences in database
• Determine statistical significance of comparison scores
• Deduce biological and evolutionary relationships
ATGCCTATGATACTGGGATAC... ?
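The "compare to known sequences" step rests on the dynamic programming algorithms listed above. As a rough illustration, here is a minimal sketch of Smith-Waterman local alignment scoring. The function name and the match, mismatch, and gap values are assumptions chosen for readability and are not taken from the slides; real protein comparisons normally use a substitution matrix and affine gap penalties (the Gotoh variant).

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2,
                         mismatch: int = -1,
                         gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    Assumption: a simple match/mismatch score and a linear gap penalty.
    Only two rows of the dynamic programming matrix are kept in memory,
    but all len(a) * len(b) cells are still computed.
    """
    prev = [0] * (len(b) + 1)   # previous row of the matrix
    best = 0                    # best cell seen so far
    for ca in a:
        curr = [0]              # column 0 is always 0 for local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap          # gap in b
            left = curr[j - 1] + gap    # gap in a
            cell = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(cell)
            if cell > best:
                best = cell
        prev = curr
    return best


if __name__ == "__main__":
    # The two present-day sequences from the alignment slides.
    seq1 = "FRVAKFEIDKYANLNRWYENAKKVTPGWEE"
    seq2 = "YRMFEPKCLDAFANLRDFLARFEGLKKISA"
    print(smith_waterman_score(seq1, seq2))
```

The work is proportional to the product of the two sequence lengths, which is why comparison cost grows so quickly when both the query set and the database grow.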
New Sequence Analysis Paradigm
[Figure: a complete genome shown as a long run of genomic DNA with individual genes highlighted (labels: Genome, Gene), next to the corresponding proteome shown as the full set of protein sequences (labels: Proteome, Protein). The example gene ATGCCTATGATACTGGGATAC... and its protein MPMILGYWNVRG... are shown alongside.]

Genomes and Proteomes
Organism                    Year Sequenced    Genome Size       Proteome Size
                            and Annotated     (Base pairs)      (Number of Proteins)
Mycoplasma genitalium       1995              ~588,000          480
Haemophilus influenzae      1995              ~1,500,000        1,709
Escherichia coli            1997              ~4,600,000        4,289
Saccharomyces cerevisiae    1997              ~11,000,000       ~6,600
Caenorhabditis elegans      1998              ~86,000,000       ~14,300
Drosophila melanogaster     2000              ~137,000,000      ~13,500
Homo sapiens                ~Jan 2001         ~3,100,000,000    ~30,000-60,000
• 31 complete microbial genomes (87 in progress)
• Many new microbial genomes every year
• Many other higher organisms’ genomes being sequenced

Data Avalanche
• Advances in sequencing technology
• Exponentially increasing data volume
• GenBank:
  – 8.6 billion nucleotides (June 2000)
  – 9.5 billion nucleotides (August 2000)
• Data growing faster than computer speeds:
  – Data volume doubles every 12 months
  – Moore’s Law: 18-month doubling time
[Chart: Growth of GenBank. X axis: Year. Y axes: Billions of Nucleotides, Millions of Sequences. Source: http://www.ncbi.nlm.nih.gov]

Genomics and Comparative Genomics
Figure: database of known sequences: E. coli, H. influenzae, Fruit Fly, Cholera. Genomic DNA: Cholera, Fruit Fly, H. influenzae, E.
coli

Challenges
• Computing power growing less quickly than data volume
• Computation grows quadratically with data volume
• Heuristic methods are faster but less sensitive
• Current parallel implementations scale poorly
[Diagram: Faster / Better]

Solution: Break the Data Bottleneck
[Diagram: an M × N comparison divided among computers. Legend: Computation, Data Transmitted]

Old method: each CPU receives all of M and a share of N
Computers    Work/CPU      Data/CPU      Total Data
1            M × N         M + N         M + N
2            M × ½N        M + ½N        2M + N
4            M × ¼N        M + ¼N        4M + N
k            (M × N)/k     M + (N/k)     (k × M) + N

New method: each CPU receives a share of both M and N
Computers    Work/CPU      Data/CPU      Total Data
1            M × N         M + N         M + N
4            ½M × ½N       ½M + ½N       2 × (M + N)
k            (M × N)/k     (M + N)/√k    √k × (M + N)

Transmitted Data
[Chart: Transmitted Data. X axis: # of CPUs (2, 4, 9, 16, 25, 36, 49, 64, 128, 256). Y axis: Units of Data (0 to 300). Series: Old Method, New Method, 2-1 Compression, 4-1 Compression]

Test Platform: Parabon Frontier
“Determine never to be idle. No person will have occasion to complain of the want of time, who never loses any. It is wonderful how much may be done, if we are always doing.” -- Thomas Jefferson, May 5, 1787
[Diagram: Client (UVa), Frontier Server (housed at Exodus), and Providers (idle Internet machines), connected over the Internet. Labels: Job (Code, Data), Data Elements, Task Definitions, Task, Task Results, Job Results, Results Postprocessing]

Drosophila Proteome vs. C.
elegans Proteome
[Chart: Power. X axis: Time (Hours), 0 to 8. Y axes: Smith-Waterman Sequence Comparisons/Sec (0 to 15000) and CPUs (0 to 400)]
[Chart: Scalability. X axis: CPUs, 0 to 400. Y axis: Smith-Waterman Sequence Comparisons/Sec (0 to 15000). Idealized 450 MHz Pentium II: y = 43.7x. Linear fit: y = 37.412x + 271.65, R² = 0.9968]

Conclusions
• Not much we can do about the increasing volume of data
• New method, however, allows massive parallelism
• Driven by observation that the problem has changed
• Encourages use of more sensitive methods
[Diagram: Faster / Better]

Future Directions
• Data compression
• Further Smith-Waterman optimizations
• Java 1.3 JVM for Provider Compute Engine (Faster than C!)
• Investigation of novel methods for estimating statistical significance
• Human Genome vs. GenBank scale searches
• Implementation of DNA-protein comparisons
• Other methods (BLAST, FASTA, HMMs, GeneWise, etc.)
• Large-scale structure-structure comparison
• Large-scale sequence-structure threading/comparison

Smith-Waterman: Java vs. C
Mouse GST m1 (218 amino acids) vs. 14548 random sequences
300 MHz Pentium II / Red Hat Linux 6.2
Smith-Waterman w/Miller-Myers optimizations
Sun 1.2.2 JDK: 456 sec
gcc -O3: 185 sec
IBM 1.3 JDK: 116 sec (!)

Demand for Sequence Comparison
GenBank:
June 2000: 8.6 Billion characters
August 2000: 9.5 Billion characters
Difference: 0.9 Billion characters
0.9 Billion × 8.6 Billion ≈ 7.7 Quintillion cells
7.7 Quintillion cells / (60 days × 86400 sec) ≈ 1.5 Trillion cells/sec
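The demand estimate multiplies the newly added characters by the size of the existing database to get the number of dynamic programming cells, then divides by the number of seconds in the 60-day window. A minimal sketch of that arithmetic, useful for checking the orders of magnitude, follows. The function and variable names are illustrative assumptions; the input figures are the GenBank sizes quoted on the slide.

```python
SECONDS_PER_DAY = 86_400


def required_cells_per_second(new_chars: float,
                              db_chars: float,
                              days: float) -> float:
    """Cells/sec needed to compare new_chars against db_chars within `days`.

    Assumption: one dynamic programming cell per pair of characters,
    i.e. a full comparison of every new character against the database.
    """
    cells = new_chars * db_chars
    return cells / (days * SECONDS_PER_DAY)


# GenBank sizes quoted on the slide, in characters.
june_2000 = 8.6e9
august_2000 = 9.5e9

new_chars = august_2000 - june_2000   # growth over the two months
total_cells = new_chars * june_2000   # cells to compare new data to old
rate = required_cells_per_second(new_chars, june_2000, days=60)

print(f"new characters: {new_chars:.3g}")
print(f"total cells:    {total_cells:.3g}")
print(f"cells per sec:  {rate:.3g}")
```

Changing `days` or the two database sizes shows how quickly the required rate climbs when the data volume doubles every 12 months.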