Improving Performance of Multiple Sequence Alignment in Multi-client Environments
Aaron Zollman
CMSC 838T Presentation

Overview of talk
- CLUSTALW algorithm, speedup opportunities
- Problems with caching
- Parallelizing technique
- Weaknesses
- Applying the technique to other bioinformatics problems

Motivation
- Query overlap in queries submitted to MSA tools
  - Single researcher: new sequences vs. database
  - Multiple researchers: similar subsets

CLUSTALW: Progressive algorithm
- Three steps, progressive refinement
- Opportunities for speedup: caching, query ordering

CLUSTALW: Progressive global alignment
(q sequences, mean length l)
- Step 1: Pairwise alignment, distance matrix
  - A fast technique calculates the distance between two sequences
  - Calculated for all sequence pairs
  - Cost: O(q²l²)
- Step 2: Guide tree
  - Group nearest sequences first; build the tree sequentially
  - Cost: O(q³)
- Step 3: Progressive alignment
  - Align, starting at the leaves of the tree
  - Cost: O(ql²)

Optimization: Query caching
- Step 1 (pairwise alignment, building the distance matrix):
  - Many requests are partially duplicated
  - An individual distance calculation does not depend on the rest of the query
  - Observation: this is the dominant step in execution time
- Steps 2 and 3: output depends on the results of the entire query, so results are less reusable
- Technique: cache the output of step 1 (individual pairwise distances)

[Figure: two overlapping queries (sequences such as MLISHSDLNQ…, GISRETSS…, MSTVTKYFYKGE…, QPAKKTYTW…) reuse cached pairwise distances]

Challenges to cache implementation
- I/O and filesystem overhead
  - Large cache vs. the 2 GB file size limit
  - High seek times within a single file
- Search and insertion overhead
  - Sequence text makes a lengthy key
  - Keyed on each pair of sequences

Technique: 2-level B-Tree cache
- Level 1: map sequence text to a sequence ID
  - Hash of the sequence?
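The two-level scheme can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: plain Python dicts stand in for the B-trees (the evaluation used the GiST library), and the class and method names are hypothetical.

```python
class TwoLevelCache:
    """Sketch of a 2-level distance cache: Level 1 maps sequence text to a
    short sequentially assigned ID; Level 2 maps ID pairs to distances."""

    def __init__(self):
        self.level1 = {}  # sequence text -> sequence ID, size O(q*l)
        self.level2 = {}  # (lower ID, higher ID) -> distance, size O(q^2)

    def seq_id(self, seq: str) -> int:
        # Level 1: replace a lengthy sequence key with a compact integer.
        if seq not in self.level1:
            self.level1[seq] = len(self.level1)
        return self.level1[seq]

    def _key(self, a: str, b: str):
        # Level 2 key: the lower Level-1 ID forms the upper half of the
        # pair key, so (a, b) and (b, a) hit the same entry.
        i, j = self.seq_id(a), self.seq_id(b)
        return (min(i, j), max(i, j))

    def get(self, a: str, b: str):
        return self.level2.get(self._key(a, b))

    def put(self, a: str, b: str, dist: float):
        self.level2[self._key(a, b)] = dist
```

Ordering the pair key by the lower ID is what makes the cache symmetric: a distance computed for one query is found again regardless of the order in which a later query presents the same two sequences.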
  - Or a sequentially assigned number
  - Cache size: O(ql)
- Level 2: map ID pairs to the calculated distance
  - Concatenate the IDs from Level 1; the lower Level-1 ID forms the upper half of the Level 2 key
  - Cache size: O(q²)
- Distribute the Level 2 cache across bins (round-robin or block-allocated)
- Distribute the bins across machines

SMP
- Parallelizable: pairwise searches are performed independently
- Farmed out to query threads
[Diagram: web server dispatching to query threads; Level 1 maps per-machine, Level 2 bins distributed]

SMP Challenge: Cache coherence
- Read-only? Requires advance knowledge of query details
- Online update and serialization? Locking, duplicate updates
- Offline updates? Per-thread list of cache changes

Evaluation: Implementation
- Public B-Tree implementation: the GiST library
- First evaluation on an Intel PC (Pentium III 650 MHz, 75 GB disks)
  - q = 25-1000 sequences, l = 450 amino acids per sequence
- Second evaluation on a Sun Fire 6800 (48 × 750 MHz CPUs, 48 GB main memory)
  - q = 2-200 sequences, l = 417 amino acids per sequence
- Cache seeded with dummy values
- Future work: architectural impact

Evaluation: Results
[Figure: results charts]

Observations
- Simple technique: cheap and easy to implement, cheap and easy to deploy
- Unsupported claim: are queries really similar?
- Concern about distribution across processors
  - The paper mentions latency and workload balancing
  - Also the reliability of the distributed bins
- Cache lifetimes? Proposed solution: a "component-based system"
  - "Hand-wavey"; would like to see more
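The "offline updates" option above can be sketched as follows. This is an assumption-laden illustration, not the paper's code: `pairwise_distance` is a hypothetical stand-in for CLUSTALW's fast step-1 distance, a plain dict stands in for the distributed cache, and threads log new entries to a change list that is merged after the query instead of locking the cache.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def pairwise_distance(a: str, b: str) -> float:
    # Hypothetical stand-in for CLUSTALW's fast pairwise distance:
    # fraction of mismatched positions over the shorter length.
    n = min(len(a), len(b))
    return sum(x != y for x, y in zip(a, b)) / n if n else 0.0

def run_query(seqs, cache, n_threads=4):
    """Farm the independent pairwise distance calculations out to query
    threads. Threads read the shared cache but record new entries in a
    shared change list, merged offline after the query (no locking)."""
    pairs = list(combinations(sorted(seqs), 2))
    changes = []

    def work(pair):
        if pair in cache:              # cache hit: skip the computation
            return pair, cache[pair]
        d = pairwise_distance(*pair)
        changes.append((pair, d))      # list.append is atomic in CPython
        return pair, d

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        matrix = dict(pool.map(work, pairs))

    cache.update(changes)              # offline serialization point
    return matrix
```

Deferring the cache writes to a single merge step trades a small window of duplicate computation (two threads may both miss on the same pair) for the absence of per-lookup locking, which matches the trade-off the slides raise under "online update and serialization".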