Parallel EST Clustering
by
Kalyanaraman, Aluru, and Kothari
Nargess Memarsadeghi
CMSC 838 Presentation
Talk Overview

• Overview of talk
• Motivation
• Background
• Techniques
• Evaluation
• Related work
• Observations

Motivation: EST Clustering

• Problem: EST clustering
  • Related to the 'fragment assembly' problem
  • Cluster fragments of cDNA
  • Detecting overlapping fragments
• Overlaps can be computed:
  • Pairwise alignment algorithm
    • Dynamic programming
• Alternative:
  • Approximate overlap detection algorithms
    • Dynamic programming

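To make the dynamic-programming step concrete, here is a minimal sketch of a local alignment score (Smith-Waterman style). It is not the alignment routine of any of the tools discussed in this talk; the function name and the scoring values (match 2, mismatch -1, gap -1) are assumptions chosen only for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (illustrative scoring)."""
    prev = [0] * (len(b) + 1)          # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)      # current row; column 0 stays 0
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("ACGT", "ACGT"))
```

The table has one cell per pair of positions, so the time per pair grows with the product of the two sequence lengths. That is why running this on every pair of ESTs is expensive.
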
Motivation

• Common tools:
  • Take too long
    • Days for 100,000 ESTs
  • Run out of memory
• This paper:
  • PaCE: Parallel Clustering of ESTs
  • Efficient parallel EST clustering
    • Space-efficient algorithm
    • Reduce total work
    • Reduce run-time

Background: EST Clustering Tools

• Three traditional software packages:
  • Originally designed for fragment assembly:
    • TIGR Assembler
    • Phrap
    • CAP3
• One parallel software package:
  • UICLUSTER: assumes ESTs are from the 3' end

EST Clustering Tools

• Basic approach
  • Find pairs of similar sequences
  • Align similar pairs
    • Dynamic programming
• Quality of EST clustering
  • Phrap: fastest
    • Avoids dynamic programming
    • Relies on approximation, lower quality
  • CAP: least # of erroneous clusters

EST Clustering Tools' Performance

• With 50,000 maize ESTs
  • Using a PC with dual Pentium 450 MHz, 512 RAM:
    • TIGR: ran out of memory
    • Phrap: 40 min
    • CAP: > 24 hours
• With 100,000 maize ESTs
  • All ran out of memory
  • CAP would require 4 days

Goal

• Space-efficient algorithm
  • Space requirement linear in the size of the input data set
• Reduce total work
  • Without sacrificing quality of clustering
• Reduce run-time and facilitate the clustering of large data sets
  • Through parallel processing
  • Scale memory with # of processors

Approach

• Expense:
  • Pairwise alignment (time + memory)
  • Promising pairs ≈ (# ESTs)²
    • Common string: |s| = w
    • Cost: if the common string has length |s| = l > w, the same pair is found l - w + 1 times

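The cost of fixed-length matching can be checked with a small sketch: when two sequences share a common string of length l > w, every length-w window inside it matches, so the pair is reported l - w + 1 times. The function name and the two example sequences below are made up for illustration.

```python
def count_wmer_hits(a, b, w):
    """Count position pairs (i, j) with a[i:i+w] == b[j:j+w]."""
    positions = {}                               # w-mer -> start positions in a
    for i in range(len(a) - w + 1):
        positions.setdefault(a[i:i + w], []).append(i)
    hits = 0
    for j in range(len(b) - w + 1):
        hits += len(positions.get(b[j:j + w], []))
    return hits


if __name__ == "__main__":
    # The two sequences share exactly one maximal common string, CGTACGAT (l = 8).
    a = "TTTCGTACGATTT"
    b = "GGGCGTACGATCC"
    print(count_wmer_hits(a, b, 4))  # 5, which is l - w + 1 = 8 - 4 + 1
```
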
Approach (Cont.)

• Approach:
  • Use trie structure
  • Identify promising pairs
  • Merge clusters with strong overlaps
    → Avoid storing/testing all similar pairs
• Parallel EST clustering software:
  • Generalized Suffix Tree (GST)
  • Multiple processors:
    • One maintains and updates the EST clusters
    • Others generate batches of promising pairs and perform alignment

Tries

1) Index for each char
2) N leaves
3) Height N

Suffix Tries (Cont.)

1) Trim suffix trie

Suffix Tries (Cont.)

1) Indices
2) Storage O(n), though the constant is high
3) Common string
4) Longest common substring

Suffix Tries (Cont.)

[Figure: suffix tree with edges labeled a, b, and $, and leaves numbered 1 to 5]

Given a pattern P = ab, we traverse the tree according to the pattern.

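The traversal can be sketched with a naive, uncompacted suffix trie. The input string "abab$" is an assumption made for this example (it is consistent with edge labels a, b, $ and five leaves, but the slide's figure is not recoverable). The function names are hypothetical, and storing start positions at every node is a simplification that costs quadratic space.

```python
def build_suffix_trie(text):
    """Insert every suffix of text; each node records the 1-based suffix starts below it."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)
    return root


def find_occurrences(trie, pattern):
    """Follow the pattern from the root; the node reached lists all occurrences."""
    node = trie
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []                  # fell off the trie: no occurrence
    return sorted(node["starts"])


if __name__ == "__main__":
    trie = build_suffix_trie("abab$")
    print(find_occurrences(trie, "ab"))  # [1, 3]
```

The search time depends only on the pattern length, not on the length of the indexed text.
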
Parallel Generation of GST

• GST: Generalized Suffix Tree
• Compacted trie
• Longest common prefix found in constant time
• Used for on-demand pair generation
• Sequential construction: O(nl), for n ESTs of average length l
• Parallel construction: O(nl/p) on p processors

Parallel Generation of GST (Cont.)

• Previous implementations:
  • CRCW/CREW PRAM model
  • Work-optimal
    • Involves alphabetical ordering of characters
  • Unrealistic assumptions
    • Synchronous operation of processors
    • Infinite network bandwidth
    • No memory contention
  • Not practically efficient

Parallel Generation of GST (Cont.)

• Paper's approach:
  • ESTs equally distributed among processors
  • Each processor
    • Partitions the suffixes of its ESTs into |Σ|^w buckets, based on their first w characters
  • Distribute buckets to the processors:
    • All suffixes in a bucket allocated to the same processor
    • Total # of suffixes allocated to a processor ≈ O(nl/p)

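A sequential sketch of the bucketing step follows. Suffixes are grouped by their first w characters, and whole buckets are then assigned to processors. The greedy "largest bucket to least-loaded processor" rule is an assumption made for this illustration, not necessarily the paper's distribution scheme, and the function names are hypothetical. Suffixes shorter than w are skipped for simplicity.

```python
from collections import defaultdict


def bucket_suffixes(ests, w):
    """Group suffixes, given as (EST id, start position), by their first w characters."""
    buckets = defaultdict(list)
    for est_id, seq in enumerate(ests):
        for pos in range(len(seq) - w + 1):
            buckets[seq[pos:pos + w]].append((est_id, pos))
    return buckets


def assign_buckets(buckets, p):
    """Assign each whole bucket to one of p processors, balancing suffix counts."""
    loads = [0] * p
    owner = {}
    # Largest buckets first; ties broken by prefix so the result is deterministic.
    for prefix, members in sorted(buckets.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        target = loads.index(min(loads))   # least-loaded processor so far
        owner[prefix] = target
        loads[target] += len(members)
    return owner, loads


if __name__ == "__main__":
    buckets = bucket_suffixes(["ACGTAC", "GTACGT"], w=2)
    owner, loads = assign_buckets(buckets, p=2)
    print(owner, loads)
```

Because a bucket is never split, every suffix that starts with the same w characters ends up on the same processor.
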
Parallel Generation of GST (Cont.)

• Each bucket's processor:
  • Computes the compacted trie of all its suffixes
  • Cannot use sequential construction
    • The suffixes of a string are not all in the same bucket
• Each bucket:
  • Subtree in the GST
  • Nodes:
    • Depth-first search traversal of the trie
    • Pointer to the rightmost child

On-demand Pair Generation

• A pair should be generated if the two ESTs
  • Share a substring of length ≥ threshold
  • The shared substring is maximal
• Leaves in a common node
  • Share a substring of length = depth of node
• Parallel algorithm
  • Each processor works with its trie if
    • Depth of its root in GST < threshold

On-demand Pair Generation

• To process
  • Sort internal nodes
    • Decreasing order of depth
• Lists of a node
  • Generated after the node is processed
  • Removed after the parent is processed
  • Limits space to O(nl)
• Run-time ≈ # of pairs generated + cost of sorting
• Rejected pairs increase run-time by a factor of 2
• Eliminating duplicates reduces run-time

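The ordering idea can be illustrated with a naive stand-in for the GST: visit substring lengths ("depths") from longest to shortest, group the ESTs that contain the same substring at that depth, and emit each pair only the first time it appears. This sketch does not build a suffix tree and is far slower than the tree-based procedure; the function name and example sequences are assumptions for illustration.

```python
from collections import defaultdict


def promising_pairs(ests, threshold):
    """Yield (i, j, depth) in decreasing order of longest shared substring length."""
    seen = set()                                   # pairs already emitted
    max_len = max(len(s) for s in ests)
    for depth in range(max_len, threshold - 1, -1):
        groups = defaultdict(set)                  # substring -> ids of ESTs containing it
        for est_id, seq in enumerate(ests):
            for pos in range(len(seq) - depth + 1):
                groups[seq[pos:pos + depth]].add(est_id)
        found = set()
        for ids in groups.values():
            members = sorted(ids)
            for k, a in enumerate(members):
                for b in members[k + 1:]:
                    if (a, b) not in seen:         # duplicate elimination
                        found.add((a, b))
        for pair in sorted(found):
            seen.add(pair)
            yield pair[0], pair[1], depth


if __name__ == "__main__":
    ests = ["AACGTACC", "TTCGTAGG", "CCGTTTAA"]
    print(list(promising_pairs(ests, threshold=3)))
```

A pair is first reached at the length of its longest shared substring, so pairs with the strongest exact matches come out first and pairs below the threshold are never produced.
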
Parallel Clustering

• Master-Slave paradigm:
  • Master processor:
    • Maintains and updates clusters
      • Using union-find data structure
    • Receives messages from slave processors
      • A batch of next promising pairs generated by slave
      • Results of the pairwise alignment
    • Determines which ones to explore
    • Determines if merging should occur
  • Slave processors:
    • Generate pairs on demand
    • Perform pairwise alignments of pairs dispatched by the master processor

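The master's bookkeeping can be sketched with a standard union-find structure. The filtering rule shown here (skip a pair whose two ESTs are already in the same cluster) is one plausible reading of "determines which ones to explore"; the class and function names are hypothetical.

```python
class UnionFind:
    """Disjoint-set forest with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the clusters of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def select_pairs_to_align(clusters, batch):
    """Keep only pairs whose ESTs are still in different clusters."""
    return [(a, b) for a, b in batch if clusters.find(a) != clusters.find(b)]


if __name__ == "__main__":
    clusters = UnionFind(5)
    clusters.union(0, 1)
    clusters.union(1, 2)
    print(select_pairs_to_align(clusters, [(0, 2), (2, 3), (3, 4)]))
```

Skipping pairs that are already co-clustered is what lets the master avoid alignments whose outcome could not change the clustering.
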
Parallel Clustering (Cont.)

[Figure: Organization of Parallel Clustering Software. One Master P exchanges messages with several Slave Ps.]

• Batch of promising pairs generated + results of pairwise alignment
• Batchsize or fewer # of pairs + results of pairwise alignment on each pair

Parallel Clustering (Cont.)

• To start:
  • Slave P starts with 3 × batchsize pairs
    • Sends the 3rd batch to Master P
    • Starts alignment on 1st batch
    • Sends results on 1st + a newly generated batch
    • While waiting to receive results from Master P, aligns 2nd batch
• Processor always has the next batch to work on between:
  • Submitting the results of previous batch
  • Receiving another set of pairs

Parallel Clustering (Cont.)

• Improve and control quality
  • Parameters:
    • Match and mismatch scores
    • Gap penalties
  • Post-processing:
    • Detection of alternative splicing
    • Consulting protein databases
    • Organism specific

Experimental environment

• Used C and MPI
• Tested
  • Quality of software:
    • Arabidopsis thaliana (due to availability of its genome)
  • Run-time behavior:
    • 50,000 maize ESTs with 32-processor IBM SP
    • # of processors
    • Data size
    • (# of promising pairs) vs data size
    • Batchsize vs (# of processors)
    • # of clusters
    • Master processor's time

Quality Assessment

• To assess quality
  • A data set and its correct clustering
  • ESTs from the plant Arabidopsis thaliana
  • Splice program
    • Align ESTs to the genome
    • Discard ESTs that
      • Don't align
      • Align in multiple spots

Quality Assessment (Cont.)

• False negative:
  • A pair in correct clustering is not paired in the output
  • 5%
• False positive:
  • A pair not in correct clustering appears in results
  • Negligible (< 0.04%)
  • Due to conservative nature of algorithm

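The two pair-based error types can be sketched as set differences between the pairs that are co-clustered in the correct clustering and in the output. The function names and the toy clusterings are made up for illustration. The slide does not state the denominators behind its percentages, so this sketch returns raw counts only.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """All unordered pairs of items that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pair_error_counts(correct, predicted):
    """Return (false negatives, false positives) counted over pairs."""
    truth = co_clustered_pairs(correct)
    found = co_clustered_pairs(predicted)
    false_negatives = truth - found    # paired in the correct clustering, not in the output
    false_positives = found - truth    # paired in the output, not in the correct clustering
    return len(false_negatives), len(false_positives)


if __name__ == "__main__":
    correct = [{1, 2, 3}, {4, 5}]
    predicted = [{1, 2}, {3, 4, 5}]
    print(pair_error_counts(correct, predicted))  # (2, 2)
```
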
Quality Assessment

Cluster results   Number of singleton clusters   Number of non-singleton clusters
CAP3              10,803                         18,727
PaCE              17,930                         17,556
Benchmark         14,802                         19,536

Distribution of the number of singleton and non-singleton clusters for the benchmark set of 168,200 Arabidopsis ESTs.

Run-time Assessment

• Experiment with 50,000 maize ESTs:
  • 32-processor IBM SP-2
  • 16 minutes

Run-time Assessment (Cont.)

p     Preprocessing   Clustering   Total
4     273             102          375
8     119             50           169
16    61              26           87
32    38              15           53
64    29              10           39

Run-time (in seconds) spent in various components of PaCE for 20,000 ESTs. p, number of processors.

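The scaling in the table can be summarized by computing speedup and efficiency from the listed run-times. Since the table has no single-processor row, the sketch below measures both relative to the p = 4 row, which is a choice made for this example. The function name is hypothetical.

```python
# (preprocessing, clustering) run-times in seconds, copied from the table above.
RUNTIMES = {4: (273, 102), 8: (119, 50), 16: (61, 26), 32: (38, 15), 64: (29, 10)}


def relative_speedup(runtimes, base_p=4):
    """Rows of (p, total seconds, speedup vs base_p, efficiency vs base_p)."""
    base_total = sum(runtimes[base_p])
    rows = []
    for p in sorted(runtimes):
        total = sum(runtimes[p])
        speedup = base_total / total
        efficiency = speedup / (p / base_p)   # 1.0 means ideal scaling from base_p
        rows.append((p, total, round(speedup, 2), round(efficiency, 2)))
    return rows


if __name__ == "__main__":
    for row in relative_speedup(RUNTIMES):
        print(row)
```

Going from 4 to 64 processors (16 times as many) reduces the total from 375 to 39 seconds, a speedup of about 9.6.
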
Run-time Assessment (Cont.)

• Run-time as a function of batchsize
  • Small batchsize
    • Increase in communication overhead
  • Large batchsize
    • Slaves less responsive to the need of generating pairs
      → Slave does not use latest clustering results
  • Optimal batchsize
    • Determined by experiment
• Master processor's time
  • Fixed batchsize, increase in # of processors
    • Gradual increase in Master P's time
    • With 32 processors, increase < 1%
  • Using 1 master processor is not a bottleneck

Results

• Space linear in the size of the input data set
• Reduced total work without sacrificing quality
• Reduced run-time
  • Parallel processors
  • Eliminating pairs
• Facilitate clustering
  • Scale memory with # of processors

Observations

• PaCE: approaches the EST clustering problem directly
  • Better than
    • CAP3
    • Phrap
    • TIGR Assembler
• Compare time/quality
  • TIGICL (TIGR Indices Clustering Tool)
    • Support for PVM
  • MegaBlast
  • STACK
• Large data sets
  • Lots of processors
• Can improve clustering time?
  • Clustering algorithm
