Improving Performance of Multiple Sequence Alignment in Multi-client Environments
Aaron Zollman
CMSC 838T Presentation

Overview of talk
- CLUSTALW algorithm, speedup opportunities
- Problems with caching
- Parallelizing technique
- Weaknesses
- Applying the technique to other bioinformatics problems

Motivation
- Query overlap in queries submitted to MSA tools
  - Single researcher: new sequences vs. database
  - Multiple researchers: similar subsets

CLUSTALW: Progressive algorithm
- Three steps, progressive refinement
- Opportunities for speedup: caching, query ordering

CLUSTALW: Progressive global alignment
(q sequences, mean length l)
- Step 1: Pairwise alignment, distance matrix
  - A fast technique calculates the distance between two sequences
  - Calculated for all sequence pairs
  - Cost: O(q²l²)
- Step 2: Guide tree
  - Group nearest sequences first; build the tree sequentially
  - Cost: O(q³)
- Step 3: Progressive alignment
  - Align, starting at the leaves of the tree
  - Cost: O(ql²)

Optimization: Query caching
- Step 1 (pairwise alignment, building the distance matrix):
  - Many requests are partially duplicated
  - An individual distance calculation does not depend on the rest of the query
  - Observation: this is the dominant step in execution time
- Steps 2 and 3: output depends on the results of the entire query, so results are less reusable
- Technique: cache the output of step 1 (individual pairwise distances)

[Figure: two overlapping queries (sequences such as MLISHSDLNQ…, GISRETSS…, MSTVTKYFYKGE…, QPAKKTYTW…) reuse cached pairwise distances]

Challenges to cache implementation
- I/O and filesystem overhead
  - Large cache vs. the 2 GB file size limit
  - High seek times within a single file
- Search and insertion overhead
  - Sequence text makes a lengthy key
  - Keyed on each pair of sequences

Technique: 2-level B-Tree cache
- Level 1: map sequence text to a sequence ID
  - Hash of the sequence?
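The two-level scheme can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: plain Python dicts stand in for the B-trees (the evaluation used the GiST library), and the class and method names are hypothetical.

```python
class TwoLevelCache:
    """Sketch of a 2-level distance cache: Level 1 maps sequence text to a
    short sequentially assigned ID; Level 2 maps ID pairs to distances."""

    def __init__(self):
        self.level1 = {}  # sequence text -> sequence ID, size O(q*l)
        self.level2 = {}  # (lower ID, higher ID) -> distance, size O(q^2)

    def seq_id(self, seq: str) -> int:
        # Level 1: replace a lengthy sequence key with a compact integer.
        if seq not in self.level1:
            self.level1[seq] = len(self.level1)
        return self.level1[seq]

    def _key(self, a: str, b: str):
        # Level 2 key: the lower Level-1 ID forms the upper half of the
        # pair key, so (a, b) and (b, a) hit the same entry.
        i, j = self.seq_id(a), self.seq_id(b)
        return (min(i, j), max(i, j))

    def get(self, a: str, b: str):
        return self.level2.get(self._key(a, b))

    def put(self, a: str, b: str, dist: float):
        self.level2[self._key(a, b)] = dist
```

Ordering the pair key by the lower ID is what makes the cache symmetric: a distance computed for one query is found again regardless of the order in which a later query presents the same two sequences.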
  - Or a sequentially assigned number
  - Cache size: O(ql)
- Level 2: map ID pairs to the calculated distance
  - Concatenate the IDs from Level 1; the lower Level-1 ID forms the upper half of the Level 2 key
  - Cache size: O(q²)
- Distribute the Level 2 cache across bins (round-robin or block-allocated)
- Distribute the bins across machines

SMP
- Parallelizable: pairwise searches are performed independently
- Farmed out to query threads
[Diagram: web server dispatching to query threads; Level 1 maps per-machine, Level 2 bins distributed]

SMP Challenge: Cache coherence
- Read-only? Requires advance knowledge of query details
- Online update and serialization? Locking, duplicate updates
- Offline updates? Per-thread list of cache changes

Evaluation: Implementation
- Public B-Tree implementation: the GiST library
- First evaluation on an Intel PC (Pentium III 650 MHz, 75 GB disks)
  - q = 25-1000 sequences, l = 450 amino acids per sequence
- Second evaluation on a Sun Fire 6800 (48 × 750 MHz CPUs, 48 GB main memory)
  - q = 2-200 sequences, l = 417 amino acids per sequence
- Cache seeded with dummy values
- Future work: architectural impact

Evaluation: Results
[Figure: results charts]

Observations
- Simple technique: cheap and easy to implement, cheap and easy to deploy
- Unsupported claim: are queries really similar?
- Concern about distribution across processors
  - The paper mentions latency and workload balancing
  - Also the reliability of the distributed bins
- Cache lifetimes? Proposed solution: a "component-based system"
  - "Hand-wavey"; would like to see more
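The "offline updates" option above can be sketched as follows. This is an assumption-laden illustration, not the paper's code: `pairwise_distance` is a hypothetical stand-in for CLUSTALW's fast step-1 distance, a plain dict stands in for the distributed cache, and threads log new entries to a change list that is merged after the query instead of locking the cache.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def pairwise_distance(a: str, b: str) -> float:
    # Hypothetical stand-in for CLUSTALW's fast pairwise distance:
    # fraction of mismatched positions over the shorter length.
    n = min(len(a), len(b))
    return sum(x != y for x, y in zip(a, b)) / n if n else 0.0

def run_query(seqs, cache, n_threads=4):
    """Farm the independent pairwise distance calculations out to query
    threads. Threads read the shared cache but record new entries in a
    shared change list, merged offline after the query (no locking)."""
    pairs = list(combinations(sorted(seqs), 2))
    changes = []

    def work(pair):
        if pair in cache:              # cache hit: skip the computation
            return pair, cache[pair]
        d = pairwise_distance(*pair)
        changes.append((pair, d))      # list.append is atomic in CPython
        return pair, d

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        matrix = dict(pool.map(work, pairs))

    cache.update(changes)              # offline serialization point
    return matrix
```

Deferring the cache writes to a single merge step trades a small window of duplicate computation (two threads may both miss on the same pair) for the absence of per-lookup locking, which matches the trade-off the slides raise under "online update and serialization".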