Document 17842611

advertisement
Computer Science
Department of
A New Distributed System for
Large-Scale Sequence Analysis
School of Engineering
and Applied Science
University of Virginia
Douglas Blair and Gabriel Robins
www.cs.virginia.edu
{dmb4x, robins}@cs.virginia.edu
(804) 982-2207
Presented at Intelligent Systems for Molecular Biology 2000, August 19-23, 2000, San Diego, CA
Bioinformatics
“Central Dogma” of Molecular Biology
Sequence Alignment
Proteins and Evolution
FRVAKFEIDKYANLNRWYENAKKVTPGWEE
YRVAFEPTLDAYANLRDFEGVKKITPE
Basic mechanisms in all living organisms [Crick, ~1956]
Y
R
M
F
E
P
K
C
L
D
A
F
A
N
L
R
D
F
L
A
R
F
E
G
L
K
K
I
S
A
YRVFEPDAYANLRDFLEGVKKITSE
YRMFEPKLDAFANLRDFLREGVKKITSA
Time
Time
YRVAKFELDAYANLRWENVKKITPE
FRVAKFELDKYANLRWENVKKITPGWE
Transcription
YRMFEPKLDAFANLRDFLREGVKKITSA
Replication
RNA
FRVAKFELDKYANLRWYENAKKITPGWE
AUGCCUAUGAUACUGGGAUAC...
YRMFEPKLDAFANLRDFLAREGLKKITSA
FRVAKFEIDKYANLNRWYENAKKVTPGWEE
Translation
YRMFEPKCLDAFANLRDFLARFEGLKKISA
FRVAKFE---IDKYANLNRW---YENAKKVTPGWEE
.:. ::
.: .::: .
.:. ::..
YRM--FEPKCLDAFANLRDFLARFEGLKKISA
Protein
MPMILGY...
Data Avalanche
Smith-Waterman [Smith & Waterman, 1981]
Smith-Waterman with gaps [Gotoh, 1982]
FASTA [Pearson & Lipman, 1988]
BLAST [Altschul et al., 1990]
Statistical Significance:
Distribution of Smith-Waterman scores [Karlin & Altschul
1990]
Distribution of n SW scores [Karlin & Altschul 1993]
Empirical distribution for gapped scores [Altschul & Gish
1996]
Paradigm Shift
Yesterday
New Sequence Analysis Paradigm:
Genomics and Comparative Genomics
Old Sequence Analysis Paradigm
Record new experimentally derived sequence
Compare to known sequences in database
Determine statistical significance of comparison scores
Deduce biological and evolutionary relationships
DATABASE OF KNOWN SEQUENCES
S. cerevisiae
ATGCCTATGATACTGGGATAC...
Gene
Protein
?
TAAGTTATTATTTAGTTAATACTTTTAACAATATTATTAAGGTATTTAAAAAATACTATT
ATAGTATTTAACATAGTTAAATACCTTCCTTAATACTGTTAAATTATATTCAATCAATAC
ATATATAATATTATTAAAATACTTGATAAGTATTATTTAGATATTAGACAAATACTAATT
TTATATTGCTTTAATACTTAATAAATACTACTTATGTATTAAGTAAATATTACTGTAATA
CTAATAACAATATTATTACAATATGCTAGAATAATATTGCTAGTATCAATAATTACTAAT
ATAGTATTAGGAAAATACCATAATAATATTTCTACATAATACTAAGTTAATACTATGTGT
AGAATAATAAATAATCAGATTAAAAAAATTTTATTTATCTGAAACATATTTAATCAATTG
AACTGATTATTTTCAGCAGTAATAATTACATATGTACATAGTACATATGTAAAATATCAT
TAATTTCTGTTATATATAATAGTATCTATTTTAGAGAGTATTAATTATTACTATAATTAA
GCATTTATGCTTAATTATAAGCTTTTTATGAACAAAATTATAGACATTTTAGTTCTTATA
ATAAATAATAGATATTAAAGAAAATAAAAAAATAGAAATAAATATCATAACCCTTGATAA
CCCAGAAATTAATACTTAATCAAAAATGAAAATATTAATTAATAAAAGTGAATTGAATAA
AATTTTGGGAAAAAATGAATAACGTTATTATTTCCAATAACAAAATAAAACCACATCATT
CATATTTTTTAATAGAGGCAAAAGAAAAAGAAATAAACTTTTATGCTAACAATGAATACT
TTTCTGTCAAATGTAATTTAAATAAAAATATTGATATTCTTGAACAAGGCTCCTTAATTG
TTAAAGGAAAAATTTTTAACGATCTTATTAATGGCATAAAAGAAGAGATTATTACTATTC
AAGAAAAAGATCAAACACTTTTGGTTAAAACAAAAAAAACAAGTATTAATTTAAACACAA
TTAATGTGAATGAATTTCCAAGAATAAGGTTTAATGAAAAAAACGATTTAAGTGAATTTA
ATCAATTCAAAATAAATTATTCACTTTTAGTAAAAGGCATTAAAAAAATTTTTCACTCAG
TTTCAAATAATCGTGAAATATCTTCTAAATTTAATGGAGTAAATTTCAATGGATCCAATG
GAAAAGAAATATTTTTAGAAGCTTCTGACACTTATAAACTATCTGTTTTTGAGATAAAGC
AAGAAACAGAACCATTTGATTTCATTTTGGAGAGTAATTTACTTAGTTTCATTAATTCTT
TTAATCCTGAAGAAGATAAATCTATTGTTTTTTATTACAGAAAAGATAATAAAGATAGCT
TTAGTACAGAAATGTTGATTTCAATGGATAACTTTATGATTAGTTACACATCGGTTAATG
AAAAATTTCCAGAGGTAAACTACTTTTTTGAATTTGAACCTGAAACTAAAATAGTTGTTC
AAAAAAATGAATTAAAAGATGCACTTCAAAGAATTCAAACTTTGGCTCAAAATGAAAGAA
CTTTTTTATGCGATATGCAAATTAACAGTTCTGAATTAAAAATAAGAGCTATTGTTAATA
ATATCGGAAATTCTCTTGAGGAAATTTCTTGTCTTAAATTTGAAGGTTATAAACTTAATA
TTTCTTTTAACCCAAGTTCTCTATTAGATCACATAGAGTCTTTTGAATCAAATGAAATAA
ATTTTGATTTCCAAGGAAATAGTAAGTATTTTTTGATAACCTCTAAAAGTGAACCTGAAC
TTAAGCAAATATTGGTTCCTTCAAGATAATGAATCTTTACGATCTTTTAGAACTACCAAC
TACAGCATCAATAAAAGAAATAAAAATTGCTTATAAAAGATTAGCAAAGCGTTATCACCC
TGATGTAAATAAATTAGGTTCGCAAACTTTTGTTGAAATTAATAATGCTTATTCAATATT
AAGTGATCCTAACCAAAAGGAAAAATATGATTCAATGCTGAAAGTTAATGATTTTCAAAA
TCGCATCAAAAATTTAGATATTAGTGTTAGATGACATGAAAATTTCATGGAAGAACTCGA
ACTTCGTAAGACCTGAGAATTTGATTTTTTTTCATCTGATGAAGATTTCTTTTATTCTCC
ATTTACAAAAAACAAATATGCTTCCTTTTTAGATAAAGATGTTTCTTTAGCTTTTTTTCA
GCTTTACAGCAAGGGCAAAATAGATCATCAATTGGAAAAATCTTTATTGAAAAGAAGAGA
TGTAAAAGAAGCTTGTCAACAGAATAAAAATTTTATTGAAGTTATAAAAGAGCAATATAA
CTATTTTGGTTGAATTGAAGCTAAGCGTTATTTCAATATTAATGTTGAACTTGAGCTCAC
ACAGAGAGAGATAAGAGATAGAGATGTTGTTAACCTACCTTTAAAAATTAAAGTTATTAA
TAATGATTTTCCAAATCAACTCTGATATGAAATTTATAAAAACTATTCATTTCGCTTATC
TTGAGATATAAAAAATGGTGAAATTGCTGAATTTTTCAATAAAGGTAATAGAGCTTTAGG
Genome
D. melan.
M. gen.
M. gen.
Today
E. coli
S. cerevisiae
MPMILGYWNVRG
LTHPIRMLLEYT
DSSYDEKRYTMG
DAPDFDRSQWLN
EKFKLGLDFPNL
PYLIDGSHKITQ
SNAILRAHWSNK
GENOMIC DNA
CTGAAGCCAGTTTGAGAA
GACCACAGCACCAGCACC
ATGCCTATGATACTGGGA
TACTGGAACGTCCGCGGA
CTGACACACCCGATCCGC
ATGCTCCTGGAATACACA
GACTCAAGCTATGATGAG
AAGAGATACACCATGGGT
GACGCTCCCGACTTTGAC
AGAAGCCAGTGGCTGAAT
GAGAAGTTCAAGCTGGGC
CTGGACTTTCCCAATCTG
CCTTACTTGATCGATGGA
TCACACAAGATCACCCAG
Needleman-Wunsch [Needleman & Wunsch, 1970]
E. coli
ATGCCTATGATACTGGGATAC...
Sequence Comparison Dynamic Programming Algorithms:
D. melan.
DNA
Algorithms and Statistics
MNNVIISNNKIKPHHSYFLIEAKEKEINFYANNEYFSVKCNLNKNIDILEQGSLIVKGKIFNDLINGIKE
EIITIQEKDQTLLVKTKKTSINLNTINVNEFPRIRFNEKNDLSEFNQFKINYSLLVKGIKKIFHSVSNNR
EISSKFNGVNFNGSNGKEIFLEASDTYKLSVFEIKQETEPFDFILESNLLSFINSFNPEEDKSIVFYYRK
DNKDSFSTEMLISMDNFMISYTSVNEKFPEVNYFFEFEPETKIVVQKNELKDALQRIQTLAQNERTFLCD
MQINSSELKIRAIVNNIGNSLEEISCLKFEGYKLNISFNPSSLLDHIESFESNEINFDFQGNSKYFLITS
MNLYDLLELPTTASIKEIKIAYKRLAKRYHPDVNKLGSQTFVEINNAYSILSDPNQKEKYDSMLKVNDFQ
NRIKNLDISVRWHENFMEELELRKTWEFDFFSSDEDFFYSPFTKNKYASFLDKDVSLAFFQLYSKGKIDH
QLEKSLLKRRDVKEACQQNKNFIEVIKEQYNYFGWIEAKRYFNINVELELTQREIRDRDVVNLPLKIKVI
NNDFPNQLWYEIYKNYSFRLSWDIKNGEIAEFFNKGNRALGWKGDLIVRMKVVNKVNKRLRIFSSFFEND
Challenges
MEENNKANIYDSSSIKVLEGLEAVRKRPGMYIGSTGEEGLHHMIWEIVDNSIDEAMGGFASFVKLTLEDN
FVTRVEDDGRGIPVDIHPKTNRSTVETVFTVLHAGGKFDNDSYKVSGGLHGVGASVVNALSSSFKVWVFR
QNKKYFLSFSDGGKVIGDLVQEGNSEKEHGTIVEFVPDFSVMEKSDYKQTVIVSRLQQLAFLNKGIRIDF
VDNRKQNPQSFSWKYDGGLVEYIHHLNNEKEPLFNEVIADEKTETVKAVNRDENYTVKVEVAFQYNKTYN
QSIFSFCNNINTTEGGTHVEGFRNALVKIINRFAVENKFLKDSDEKINRDDVCEGLTAIISIKHPNPQYE
GQTKKKLGNTEVRPLVNSVVSEIFERFMLENPQEANAIIRKTLLAQEARRRSQEARELTRRKSPFDSGSL
Proteome
MGKLADCTTRDPSISELYIVEGDSAGGTAKTGRDRYFQAILPLRGKILNVEKSNFEQIFNNAEISALVMA
IGCGIKPDFELEKLRYSKIVIMTDADVDGAHIRTLLLTFFFRFMYPLVEQGNIFIAQPPLYKVSYSHKDL
YMHTDVQLEQWKSQNPNVKFGLQRYKGLGEMDALQLWETTMDPKVRTLLKVTVEDASIADKAFSLLMG
MAKQQDQVDKIRENLDNSTVKSISLANELERSFMEYAMSVIVARALPDARDGLKPVHRRVLYGAYIGGMH
HDRPFKKSARIVGDVMSKFHPHGDMAIYDTMSRMAQDFSLRYLLIDGHGNFGSIDGDRPAAQRYTEARLS
KLAAELLKDIDKDTVDFIANYDGEEKEPTVLPAAFPNLLANGSSGIAVGMSTSIPSHNLSELIAGLIMLI
DNPQCTFQELLTVIKGPDFPTGANIIYTKGIESYFETGKGNVVIRSKVEIEQLQTRSALVVTEIPYMVNK
TTLIEKIVELVKAEEISGIADIRDESSREGIRLVIEVKRDTVPEVLLNQLFKSTRLQVRFPVNMLALVKG
• Computation grows quadratically with data volume
• Computing power growing less quickly than data volume
Faster
Better
APVLLNMKQALEVYLDHQIDVLVRKTKFVLNKQQERYHILSGLLIAALNIDEVVAIIKKSANNQEAINTL
NTKFKLDEIQAKAVLDMRLRSLSVLEVNKLQTEQKELKDSIEFCKKVLADQKLQLKIIKEELQKINDQFG
DERRSEILYDISEEIDDESLIKVENVVITMSTNGYLKRIGVDAYNLQHRGGVGVKGLTTYVDDSISQLLV
CSTHSDLLFFTDKGKVYRIRAHQIPYGFRTNKGIPAVNLIKIEKDERICSLLSVNNYDDGYFFFCTKNGI
VKRTSLNEFINILSNGKRAISFDDNDTLYSVIKTHGNDEIFIGSTNGFVVRFHENQLRVLSRTARGVFGI
SLNKGEFVNGLSTSSNGSLLLSVGQNGIGKLTSIDKYRLTKRNAKGVKTLRVTDRTGPVVTTTTVFGNED
LLMISSAGKIVRTSLQELSEQGKNTSGVKLIRLKDNERLERVTIFKEELEDKEMQLEDVGSKQI
• Current parallel implementations scale poorly
• Heuristic methods are faster but less sensitive
Solution: Break the Data Bottleneck
Data Transmitted
M
Genomes and Proteomes
70
Organism
Year
Completed
Mycoplasma Genitalium
Escherichia Coli
Saccharomyces Cerevisiae
Caenorhabditis Elegans
Drosophila Melanogaster
Homo Sapiens
1995
1997
1997
1998
2000
~2000
Genome Size
(Base Pairs)
Computation
Proteome Size
(# of Proteins)
60
50
~588,000
480
~4,600,000
4,289
~11,000,000
~6,600
~86,000,000
~14,300
~137,000,000
~13,500
~3,100,000,000 ~30,000-60,000
4-1 Compression
Data Transmitted
N
Data
2-1 Compression
40
New Method
Old Method
30
32 35
31 complete microbial genomes (87 in progress)
20
Many new microbial genomes every year
Computer
10
Many other higher organisms’ genomes being sequenced
0
Runaway Growth
M+(N/k) Data/CPU
(krM)+N Total Data
Advances in sequencing technology
Exponentially increasing data volume
GenBank: 8.6 billion nucleotides (Jun 2000)
9.5 billion nucleotides (Aug 2000)
Data growing faster than computer speeds:
2
k Computers
(MrN)/k Work/CPU
MrN Total Work
(M+N)/k Data/CPU
(M+N)rk Total Data
4
9 16 25 36 49 64
# of CPUs
Old vs. New
Data  Work
Data << Work -- “Square” tasks minimize data/computation
Data expansion takes as long as or longer than comparison operations
- Data volume doubles every 12
months
- Moore’s Law: 18-month doubling
time
Data expansion relatively inexpensive - compression becomes worthwhile
Entire library required everywhere simultaneously (Poor NFS server…)
Tasks self-contained, compact, independent - exquisitely parallel
Parallelism constrained to number of machines not starved for data
Parallelism constrained only by the available number of machines
Paves the way for Massively Parallel Computation
Ability to take advantage of more processors encourages use of more sensitive, computationally demanding techniques
Source: http://www.ncbi.nlm.nih.gov
Implementation & Results
Test Platform: Parabon Frontier
Scalability
Future Directions
15000
Job
Code
Data
12000
Task
Internet
Smith-Waterman
Seq. Comparisons/Sec
“Determine never to be idle.
No person will have occasion
to complain of the want of
time, who never loses any.
It is wonderful
how much may
be done, if we are
always doing.”
Internet
Client
(UVa)
Task Results
Job Results
Further Smith-Waterman optimizations
Investigation of novel methods for estimating statistical significance
Other methods (BLAST, FASTA, HMMs, GeneWise, etc.)
Data compression
Implementation of DNA-protein and DNA-DNA comparisons
Large-scale structure-structure comparison
Large-scale sequence-structure threading/comparison
Human Genome vs. GenBank scale searches
Java 1.3 JVM for Provider Compute Engine (Faster than C!)
Other projects (e.g. Maximum Likelihood Tree Searches)
9000
y = 37.412x + 271.65
2
R = 0.9968
6000
3000
-- Thomas Jefferson, May 5, 1787
Results
Postprocessing
Code & Data
Elements
Task
Definitions
Frontier Server
(Housed at Exodus)
Providers
(Idle Internet Machines)
0
0
100
200
CPUs
300
400
Download