Improving performance of Multiple Sequence Alignment in Multi-client Environments
Aaron Zollman

CMSC 838 Presentation
Overview

Overview of talk

CLUSTALW algorithm, speedup opportunities

Problems with caching

Parallelizing technique

Weaknesses

Applying technique to other bioinformatics problems
CMSC 838T – Presentation
Motivation



Overlap among queries submitted to MSA tools

Single researcher: new sequences vs. database

Multiple researchers: similar subsets
CLUSTALW: Progressive algorithm

Three steps

Progressive refinement
Opportunities for speedup

Caching

Query ordering
CLUSTALW: Progressive global alignment



Step 1: Pairwise alignment, distance matrix

Fast technique calculates distance between two sequences

Calculated for all sequence pairs

Cost: O(q²l²)
Step 2: Guide tree

Group nearest first

Build tree sequentially

Cost: O(q³)
Step 3: Progressive alignment

Align, starting at leaves of tree

Cost: O(ql²)
* q sequences – mean length l
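A minimal sketch of step 1 and the first move of step 2, to make the all-pairs structure concrete. `toy_distance` is a hypothetical stand-in (fraction of mismatched positions), not CLUSTALW's actual pairwise scoring, and `distance_matrix` and `closest_pair` are illustrative names.

```python
from itertools import combinations


def toy_distance(a: str, b: str) -> float:
    """Hypothetical stand-in for a pairwise alignment score:
    fraction of mismatched positions over the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 1.0
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / n


def distance_matrix(seqs):
    """Step 1: one distance per unordered pair, q*(q-1)/2 calls in total."""
    q = len(seqs)
    dist = [[0.0] * q for _ in range(q)]
    for i, j in combinations(range(q), 2):
        d = toy_distance(seqs[i], seqs[j])
        dist[i][j] = dist[j][i] = d
    return dist


def closest_pair(dist):
    """First move of step 2: the nearest pair is grouped first."""
    q = len(dist)
    return min(combinations(range(q), 2), key=lambda p: dist[p[0]][p[1]])
```

Each call to `toy_distance` depends only on its two arguments, which is the property the caching slides rely on.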
Optimization: Query caching

Step 1: Pairwise alignment, building distance matrix



Many requests partially duplicated
Individual distance calculation not dependent on rest of query
Observation: Dominant step in execution time
[Figure: pairwise distance matrix over sequences MLISHSDLNQ… and GISRETSS…, with entries of 0.0 shown]
Steps 2, 3:


Output dependent on results of entire query
Results less reusable
Technique: cache output of step 1

Individual distances
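The caching idea can be sketched as a memo table keyed on the unordered sequence pair. This is an in-memory illustration only: `PairDistanceCache`, `run_step1`, and `toy_distance` are hypothetical names, and the distance function is a placeholder rather than CLUSTALW's scoring.

```python
from itertools import combinations


def toy_distance(a: str, b: str) -> float:
    """Hypothetical symmetric stand-in for the pairwise distance."""
    n = min(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(1 for x, y in zip(a, b) if x != y) / n


class PairDistanceCache:
    """Caches step 1 output: one distance per unordered pair of sequences."""

    def __init__(self, distance_fn):
        self._distance_fn = distance_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, a: str, b: str) -> float:
        # Canonical ordering so (a, b) and (b, a) share one entry.
        key = (a, b) if a <= b else (b, a)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._distance_fn(a, b)
        return self._store[key]


def run_step1(seqs, cache):
    """Distances for every pair in one query, reusing cached values."""
    return {
        (i, j): cache.get(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }
```

Two queries that share a pair of sequences pay for that pair once; the second query's lookup is a hit regardless of what else it contains.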
[Figure: distance matrices for Query 1 and Query 2 over sequences GIS…, MSTVTKYFYKGE…, and QPAKKTYTW…, with entries of 0.0 shown]
Challenges to cache implementation


I/O and filesystem overhead

Large cache vs. 2GB file size limit

High seek times within single file
Search and insertion overhead

Sequence: lengthy key

Keyed on each pair of sequences
Technique: 2-level B-Tree cache



Level 1: Map sequence text to sequence ID

Hash of sequence?

Sequentially assigned number

Cache size: O(ql)
Level 2: Map ID pairs to calculated distance

Concatenate IDs from level 1

Lower Level 1 ID -> upper half of Level 2 key

Cache size: O(q²)
Distribute level 2 cache across bins

Round robin or block allocated

Distribute bins across machines
* q sequences – mean length l
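A rough sketch of the two-level layout, with plain dictionaries standing in for the B-trees. The class name, the 32-bit ID width, and the modulo bin assignment are assumptions made for illustration; the slides specify only sequential IDs, a concatenated key with the lower ID in the upper half, and round-robin or block allocation.

```python
class TwoLevelCache:
    ID_BITS = 32  # assumption: IDs fit in 32 bits, so a pair fits in 64

    def __init__(self, num_bins: int = 4):
        # Level 1: sequence text -> sequentially assigned ID.
        self.seq_to_id = {}
        # Level 2: packed ID pair -> distance, split across bins.
        self.bins = [dict() for _ in range(num_bins)]

    def seq_id(self, seq: str) -> int:
        """Return the ID for seq, assigning the next number if it is new."""
        if seq not in self.seq_to_id:
            self.seq_to_id[seq] = len(self.seq_to_id)
        return self.seq_to_id[seq]

    @classmethod
    def pack(cls, id_a: int, id_b: int) -> int:
        """Lower ID goes in the upper half of the level 2 key."""
        lo, hi = sorted((id_a, id_b))
        return (lo << cls.ID_BITS) | hi

    def _bin_for(self, key: int) -> dict:
        # Round-robin-style assignment: key modulo the number of bins.
        return self.bins[key % len(self.bins)]

    def put(self, seq_a: str, seq_b: str, distance: float) -> None:
        key = self.pack(self.seq_id(seq_a), self.seq_id(seq_b))
        self._bin_for(key)[key] = distance

    def get(self, seq_a: str, seq_b: str):
        """Return the cached distance, or None; lookups never assign IDs."""
        if seq_a not in self.seq_to_id or seq_b not in self.seq_to_id:
            return None
        key = self.pack(self.seq_to_id[seq_a], self.seq_to_id[seq_b])
        return self._bin_for(key).get(key)
```

The long sequence text is used as a key only once, in level 1; level 2 keys are fixed-width integers, which addresses the "lengthy key" concern on the earlier slide.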
SMP

Parallelizable:

Pairwise searches performed independently

Farmed out to query threads
[Diagram: a web server feeds several query threads, which use per-machine Level 1 maps and distributed Level 2 bins]
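Because each pairwise calculation is independent, the work can be farmed out to a pool of workers. The sketch below uses Python's standard `ThreadPoolExecutor`; `toy_distance` and `parallel_step1` are hypothetical names, and the distance is a placeholder. In CPython, threads show the structure rather than a CPU-bound speedup; a process pool would be the usual substitute there.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def toy_distance(a: str, b: str) -> float:
    """Hypothetical stand-in for the pairwise distance calculation."""
    n = min(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(1 for x, y in zip(a, b) if x != y) / n


def parallel_step1(seqs, workers: int = 4):
    """Compute every pairwise distance, one independent task per pair."""
    pairs = list(combinations(range(len(seqs)), 2))

    def task(pair):
        i, j = pair
        return toy_distance(seqs[i], seqs[j])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with pairs.
        distances = list(pool.map(task, pairs))
    return dict(zip(pairs, distances))
```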
SMP

Challenge: Cache coherence

Read-only?

Requires advance knowledge of query details
Online update and serialization?

Locking, duplicate updates
Offline updates?



Per-thread list of cache changes
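One way to picture the offline-update option: query threads read a shared table that is never modified while they run, record new results in a private per-thread list, and a separate merge pass folds those lists in later. This is a sketch under that reading of the slide; `OfflineUpdateCache` and its methods are hypothetical names.

```python
import threading


class OfflineUpdateCache:
    def __init__(self, initial=None):
        self._shared = dict(initial or {})  # read-only while queries run
        self._local = threading.local()     # holds each thread's change list
        self._logs = []                     # every thread's change list
        self._logs_lock = threading.Lock()  # guards registration only

    def _changes(self):
        """Return this thread's private change list, creating it on first use."""
        if not hasattr(self._local, "changes"):
            self._local.changes = {}
            with self._logs_lock:
                self._logs.append(self._local.changes)
        return self._local.changes

    def get(self, key, compute):
        """Read from the shared table; on a miss, compute and log privately."""
        if key in self._shared:
            return self._shared[key]
        changes = self._changes()
        if key not in changes:
            changes[key] = compute()
        return changes[key]

    def merge_offline(self) -> int:
        """Fold all change lists into the shared table.
        Call only when no query threads are running."""
        merged = 0
        for changes in self._logs:
            for key, value in changes.items():
                if key not in self._shared:  # drop duplicate updates
                    self._shared[key] = value
                    merged += 1
            changes.clear()
        return merged
```

The trade-off is that two threads may compute the same missing distance before the next merge; the merge discards the duplicate instead of requiring a lock on every lookup.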
Evaluation: Implementation

Public B-Tree implementation: GIST library

First evaluation on Intel PC



(Pentium III 650, 75GB disks)

q = 25-1000 sequences

l = 450 amino acids per sequence
Second evaluation on Sun Fire

(Sun Fire 6800, 48 × 750 MHz CPUs, 48 GB main memory)

l = 417 amino acids per sequence

q = 2-200 sequences

Seeded cache with dummy values
Future work: architectural impact
Evaluation: Results
Observations



Simple technique

Cheap and easy to implement

Cheap and easy to deploy

Unsupported claim: Are queries really similar?
Concern about distribution across processors

Paper mentions latency, workload balancing

Also reliability of distributed bins

Cache lifetimes?
Proposed solution: “component-based system”

“Hand-wavy”; would like to see more detail.