A New Distributed System for Large-Scale Sequence Analyses
Douglas Blair
Department of Computer Science
University of Virginia
“Central Dogma” of Molecular Biology
Basic molecular mechanisms in all living organisms [Crick, ~1956]
Replication: DNA → DNA
Transcription: DNA → RNA
Translation: RNA → Protein
Describes storage, duplication, transmission, and processing of genetic information
Bioinformatics
DNA: ATGCCTATGATACTG...
Transcription
RNA: AUGCCUAUGAUACUG...
Translation
Protein: MPMILGY...
• Nucleotide sequences, genes (DNA, RNA)
• Amino acid sequences (proteins)
• 3D molecular structures
• RNA and protein expression profiles
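The DNA → RNA → protein example on this slide can be reproduced with a short sketch. It assumes the standard genetic code (the slide does not say which code table was used), reads the DNA in frame from its first base, and stops at the first stop codon. The function names `transcribe` and `translate` are illustrative, not taken from the talk.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# Assumption: the slide's example uses the standard table.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Return the RNA spelling of a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(dna: str) -> str:
    """Translate coding-strand DNA in frame 0 until a stop codon or the end."""
    dna = dna.upper()
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    gene_start = "ATGCCTATGATACTGGGATAC"
    print(transcribe(gene_start))
    print(translate(gene_start))
```

Run on the first 21 bases of the example gene, the sketch yields the RNA and protein prefixes shown on the slide.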
Proteins and Evolution
[Figure: sequences diverging along two lineages over time]
YRVAFEPTLDAYANLRDFEGVKKITPE
YRVFEPDAYANLRDFLEGVKKITSE
YRVAKFELDAYANLRWENVKKITPE
YRMFEPKLDAFANLRDFLREGVKKITSA
FRVAKFELDKYANLRWENVKKITPGWE
YRMFEPKLDAFANLRDFLREGVKKITSA
FRVAKFELDKYANLRWYENAKKITPGWE
YRMFEPKLDAFANLRDFLAREGLKKITSA
FRVAKFEIDKYANLNRWYENAKKVTPGWEE
YRMFEPKCLDAFANLRDFLARFEGLKKISA
FRVAKFE---IDKYANLNRW---YENAKKVTPGWEE
.:.  ::   .: .::: .    .:. ::..
YRM--FEPKCLDAFANLRDFLARFEGLKKISA
Sequence Alignment
FRV AK FE--- IDKYANLNRW--- YENAKKVTPGWEE
.:.    ::    .: .::: .     .:. ::..
YRM -- FEPKC LDAFANLRDFLAR FEGLKKISA
[Figure: alignment matrix with FRVAKFEIDKYANLNRWYENAKKVTPGWEE along the horizontal axis and YRMFEPKCLDAFANLRDFLARFEGLKKISA along the vertical axis]
Algorithms and Statistics
• Sequence Comparison Dynamic Programming Algorithms:
– Needleman-Wunsch [Needleman & Wunsch, 1970]
– Smith-Waterman [Smith & Waterman, 1981]
– Smith-Waterman with Gaps [Gotoh, 1982]
– FASTA [Pearson & Lipman, 1988]
– BLAST [Altschul et al., 1990]
• Statistical Significance:
– Distribution of S-W scores [Karlin & Altschul 1990]
– Distribution of n S-W scores [Karlin & Altschul 1993]
– Empirical distribution of scores w/gaps [Altschul & Gish 1996]
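To make the dynamic programming idea concrete, here is a minimal sketch of the Smith-Waterman local alignment score. It is a simplification under stated assumptions: a flat match/mismatch score stands in for a real amino acid substitution matrix, and gaps carry a linear penalty rather than the affine penalties of Gotoh's variant. The function name and default scores are illustrative only.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2,
                         mismatch: int = -1,
                         gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    Assumptions (not from the slides): flat match/mismatch scoring and a
    linear gap penalty. Only two rows of the matrix are kept in memory.
    """
    previous_row = [0] * (len(b) + 1)
    best = 0
    for char_a in a:
        current_row = [0]  # column 0 is always zero for local alignment
        for j, char_b in enumerate(b, start=1):
            diagonal = previous_row[j - 1] + (match if char_a == char_b else mismatch)
            up = previous_row[j] + gap          # gap in b
            left = current_row[j - 1] + gap     # gap in a
            cell = max(0, diagonal, up, left)   # zero floor makes it local
            current_row.append(cell)
            best = max(best, cell)
        previous_row = current_row
    return best


if __name__ == "__main__":
    print(smith_waterman_score("FRVAKFEIDKYANLNRWYENAKKVTPGWEE",
                               "YRMFEPKCLDAFANLRDFLARFEGLKKISA"))
```

Each cell depends only on its left, upper, and upper-left neighbours, so the work for one comparison is proportional to the product of the two sequence lengths.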
Old Sequence Analysis Paradigm
Record new experimentally derived sequence
Compare to known sequences in database
Determine statistical significance of comparison scores
Deduce biological and evolutionary relationships
ATGCCTATGATACTGGGATAC...
?
New Sequence Analysis Paradigm
[Figure: a block of genomic DNA sequence labeled "Genome", containing a "Gene" (ATGCCTATGATACTGGGA TACTGGAACGTCCGCGGA CTGACACACCCGATCCGC ...), shown with a block of protein sequences labeled "Proteome", containing a "Protein" (MPMILGYWNVRG LTHPIRMLLEYT DSSYDEKRYTMG ...)]
Genomes and Proteomes
Organism                  Year Sequenced   Genome Size      Proteome Size
                          and Annotated    (Base pairs)     (Number of Proteins)
Mycoplasma genitalium     1995             ~588,000         480
Haemophilus influenzae    1995             ~1,500,000       1,709
Escherichia coli          1997             ~4,600,000       4,289
Saccharomyces cerevisiae  1997             ~11,000,000      ~6,600
Caenorhabditis elegans    1998             ~86,000,000      ~14,300
Drosophila melanogaster   2000             ~137,000,000     ~13,500
Homo sapiens              ~Jan 2001        ~3,100,000,000   ~30,000-60,000
37
35
90
32
31 complete microbial genomes (87 in progress)
Many new microbial genomes every year
Many other higher organisms’ genomes being sequenced
Data Avalanche
• Advances in sequencing technology
• Exponentially increasing data volume
• GenBank:
– 8.6 billion nucleotides (June 2000)
– 9.5 billion nucleotides (August 2000)
• Data growing faster than computer speeds:
– Data volume doubles every 12 months
– Moore’s Law: 18-month doubling time
[Figure: Growth of GenBank, in billions of nucleotides and millions of sequences, by year]
Source: http://www.ncbi.nlm.nih.gov
Genomics and Comparative Genomics
[Figure: GENOMIC DNA (E. coli, H. influenzae, Fruit Fly, Cholera) compared against a DATABASE OF KNOWN SEQUENCES (E. coli, H. influenzae, Fruit Fly, Cholera)]
Challenges
• Computing power growing less quickly than data volume
• Computation grows quadratically with data volume
• Heuristic methods are faster but less sensitive
• Current parallel implementations scale poorly
[Figure: Faster vs. Better]
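The first two challenges compound each other. A rough sketch, assuming the doubling times quoted in the talk (data every 12 months, CPU speed every 18 months) and that all-against-all comparison work grows with the square of the data volume, shows how quickly run time on a single machine would grow. The function names and the `exponent` parameter are illustrative.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years`, given a doubling time in months."""
    return 2 ** (12 * years / doubling_months)


def relative_run_time(years: float,
                      data_doubling_months: float = 12,
                      cpu_doubling_months: float = 18,
                      exponent: float = 2) -> float:
    """Run time relative to today for work that scales as data ** exponent.

    Assumption: a single machine that gets faster at the CPU doubling rate.
    """
    work = growth_factor(years, data_doubling_months) ** exponent
    speed = growth_factor(years, cpu_doubling_months)
    return work / speed


if __name__ == "__main__":
    for years in (1, 3, 6):
        print(years, "years:", round(relative_run_time(years), 1), "x today's run time")
```

Under these assumptions, hardware improvement alone cannot keep the run time constant, which is the motivation for spreading the work across many CPUs.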
Solution: Break the Data Bottleneck
M, N: the two sequence sets being compared
Computation: Work/CPU and Total Work; Data Transmitted: Data/CPU and Total Data

Old method (each CPU receives all of M and a 1/k share of N):
Computers   Work/CPU   Data/CPU   Total Work   Total Data
1           M×N        M+N        M×N          M+N
2           M×½N       M+½N       M×N          2M+N
4           M×¼N       M+¼N       M×N          4M+N
k           (M×N)/k    M+(N/k)    M×N          (k×M)+N

New method (each CPU receives a 1/√k share of both M and N):
Computers   Work/CPU   Data/CPU    Total Work   Total Data
1           M×N        M+N         M×N          M+N
4           ½M×½N      ½M+½N       M×N          2M+2N
k           (M×N)/k    (M+N)/√k    M×N          √k×(M+N)
Transmitted Data
[Figure: units of data transmitted (0 to 300) vs. # of CPUs (2, 4, 9, 16, 25, 36, 49, 64, 128, 256). Series: Old Method, New Method, 2-1 Compression, 4-1 Compression]
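The shape of this comparison can be sketched numerically. The old-method total, k×M + N, appears on the preceding slide. The new-method total, √k×(M + N), is an assumption here: it follows if the k CPUs form a √k by √k grid and each receives a 1/√k share of both M and N, which is consistent with the perfect-square CPU counts on the axis. Treating compression as a simple divisor on the new method's data is also an assumption, as are the function names.

```python
import math


def old_method_total_data(m: float, n: float, k: int) -> float:
    """Total data sent when every one of k CPUs gets all of M and N/k."""
    return k * m + n


def new_method_total_data(m: float, n: float, k: int,
                          compression: float = 1.0) -> float:
    """Total data sent when each CPU gets M/sqrt(k) and N/sqrt(k).

    Assumptions (not confirmed by the slides): a sqrt(k) x sqrt(k) grid of
    tasks, and compression modeled as a constant ratio (2.0 means 2-to-1).
    """
    return math.sqrt(k) * (m + n) / compression


if __name__ == "__main__":
    m = n = 1.0  # one unit of data per sequence set, chosen for illustration
    for k in (4, 9, 16, 25, 36, 49, 64):
        print(k,
              old_method_total_data(m, n, k),
              new_method_total_data(m, n, k),
              new_method_total_data(m, n, k, compression=2.0))
```

With equal-sized sets, the old method's traffic grows linearly in the number of CPUs while the grid decomposition grows with its square root, so the gap widens as CPUs are added.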
Test Platform: Parabon Frontier
“Determine never to be idle. No person will have occasion
to complain of the want of time, who never loses any. It is
wonderful how much may be done, if we are always doing.”
-- Thomas Jefferson, May 5, 1787
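[Figure: the Client (UVa) sends a Job (Code, Data Elements, Task Definitions) over the Internet to the Frontier Server (Housed at Exodus), which sends each Task (Code, Data) over the Internet to Providers (Idle Internet Machines). Task Results return to the server, and Job Results return to the client for Results Postprocessing]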
Drosophila Proteome vs. C. elegans Proteome
[Figure, panel "Power": Smith-Waterman sequence comparisons/sec (0 to 15,000) and CPUs (0 to 400) vs. time (0 to 8 hours)]
[Figure, panel "Scalability": Smith-Waterman sequence comparisons/sec (0 to 15,000) vs. CPUs (0 to 400)]
Idealized 450 MHz Pentium II: y = 43.7x
Linear fit: y = 37.412x + 271.65, R² = 0.9968
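One way to read the two lines on the scalability panel is to compare their slopes. The sketch below uses only the coefficients reported on the slide; interpreting the slope ratio as a per-CPU efficiency relative to an idealized 450 MHz Pentium II is an assumption of this example, and the names are illustrative.

```python
# Coefficients as reported on the slide (comparisons/sec as a function of CPUs).
IDEAL_RATE_PER_CPU = 43.7   # idealized 450 MHz Pentium II
FIT_SLOPE = 37.412          # measured linear fit
FIT_INTERCEPT = 271.65


def fitted_rate(cpus: float) -> float:
    """Throughput predicted by the measured linear fit."""
    return FIT_SLOPE * cpus + FIT_INTERCEPT


def ideal_rate(cpus: float) -> float:
    """Throughput if every CPU were an idealized 450 MHz Pentium II."""
    return IDEAL_RATE_PER_CPU * cpus


def marginal_efficiency() -> float:
    """Throughput added per extra CPU, as a fraction of the idealized rate."""
    return FIT_SLOPE / IDEAL_RATE_PER_CPU


if __name__ == "__main__":
    print(f"marginal efficiency: {marginal_efficiency():.3f}")
    print(f"fitted vs. ideal at 400 CPUs: {fitted_rate(400):.0f} vs. {ideal_rate(400):.0f}")
```

Since the provider machines are a mix of idle Internet hosts rather than identical 450 MHz machines, the ratio is best read as a rough indicator, not a strict efficiency measurement.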
Conclusions
• Not much we can do about the increasing volume of data
• New method, however, allows massive parallelism
• Driven by observation that the problem has changed
• Encourages use of more sensitive methods
[Figure: Faster vs. Better]
Future Directions
• Data compression
• Further Smith-Waterman optimizations
• Java 1.3 JVM for Provider Compute Engine (Faster than C!)
• Investigation of novel methods for estimating statistical significance
• Human Genome vs. GenBank scale searches
• Implementation of DNA-protein comparisons
• Other methods (BLAST, FASTA, HMMs, GeneWise, etc.)
• Large-scale structure-structure comparison
• Large-scale sequence-structure threading/comparison
Smith-Waterman: Java vs. C
Mouse GST m1 (218 amino acids) vs. 14548 random sequences
300 MHz Pentium II / Red Hat Linux 6.2
Smith-Waterman w/Miller-Myers optimizations
Sun 1.2.2 JDK: 456 sec
gcc -O3: 185 sec
IBM 1.3 JDK: 116 sec (!)
Demand for Sequence Comparison
GenBank:
June 2000: 8.6 Billion characters
August 2000: 9.5 Billion characters
Difference: 0.9 Billion characters
0.9 Billion × 8.6 Billion ≈ 7.7 Quintillion (7.7 × 10^18) cells
7.7 Quintillion cells / (60 days × 86,400 sec/day) ≈ 1.5 Trillion cells/sec
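A short check of this estimate, using only the two GenBank sizes and the 60-day interval given on the slide. It assumes, as the slide does, that only the newly added characters are compared against the June database, with one dynamic programming cell per character pair. The function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def new_cells(old_size: float, new_size: float) -> float:
    """Cells needed to compare the newly added data against the old database."""
    return (new_size - old_size) * old_size


def required_rate(old_size: float, new_size: float, days: float) -> float:
    """Sustained cells/sec needed to keep pace over the given interval."""
    return new_cells(old_size, new_size) / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    june_2000 = 8.6e9    # characters in GenBank, June 2000
    august_2000 = 9.5e9  # characters in GenBank, August 2000
    print(f"cells:     {new_cells(june_2000, august_2000):.3e}")
    print(f"cells/sec: {required_rate(june_2000, august_2000, 60):.3e}")
```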