Parallel EST Clustering
by Kalyanaraman, Aluru, and Kothari
Nargess Memarsadeghi
CMSC 838T Presentation

Talk Overview
• Overview of talk
  – Motivation
  – Background
  – Techniques
  – Evaluation
  – Related work
  – Observations

Motivation: EST Clustering
• Problem: EST clustering
  – Related to the 'fragment assembly' problem
  – Cluster fragments of cDNA
  – Detecting overlapping fragments
• Overlaps can be computed:
  – Pairwise alignment algorithm
  – Dynamic programming
• Alternative: approximate overlap detection algorithms
  – Dynamic programming

Motivation
• Common tools:
  – Take too long: days for 100,000 ESTs
  – Run out of memory
• This paper: PaCE (Parallel Clustering of ESTs)
  – Efficient parallel EST clustering
  – Space-efficient algorithm
  – Reduce total work
  – Reduce run-time

Background: EST Clustering Tools
• Three traditional software packages, originally designed for fragment assembly:
  – TIGR Assembler
  – Phrap
  – CAP3
• One parallel tool:
  – UICLUSTER: assumes ESTs from the 3' end

EST Clustering Tools
• Basic approach
  – Find pairs of similar sequences
  – Align similar pairs
  – Dynamic programming
• Quality of EST clustering
  – Phrap: fastest; avoids dynamic programming, relies on approximation, lower quality
  – CAP: least # of erroneous clusters

EST Clustering Tools' Performance
• With 50,000 maize ESTs, using a PC with dual Pentium 450 MHz, 512 MB RAM:
  – TIGR: ran out of memory
  – Phrap: 40 min
  – CAP: > 24 hours
• With 100,000 maize ESTs:
  – All ran out of memory
  – CAP would require 4 days

Goal
• Space-efficient algorithm
  – Reduce total work
  – Space requirement linear in the size of the input data set
  – Without sacrificing quality of clustering
• Reduce run-time and facilitate the clustering of large data sets
  – Through parallel processing
  – Scale memory with # of processors

Approach
• Expense: pairwise alignment (time + memory)
• Promising pairs ≈ (# ESTs)^2
• Common string: |s| = w
  – Cost: if common |s| = l > w, then
    the same pair repeats l - w + 1 times

Approach (Cont.)
• Approach:
  – Use trie structure
  – Identify promising pairs
  – Merge clusters with strong overlaps
  – Avoid storing/testing all similar pairs
• Parallel EST clustering software:
  – Generalized Suffix Tree (GST)
  – Multiple processors:
    one maintains and updates EST clusters;
    the others generate batches of promising pairs and perform alignment

Approach (Cont.)
[Figure slide; no text recovered]

Tries
1) Index for each char
2) N leaves
3) Height N

Suffix Tries (Cont.)
1) Trim suffix trie

Suffix Tries (Cont.)
1) Indices
2) Storage O(n), constant is high though
3) Common string
4) Longest common substring

Suffix Tries (Cont.)
[Figure: suffix tree of the string abab$, with leaves labeled by suffix start positions 1 to 5]
• Given a pattern P = ab, we traverse the tree according to the pattern.

Parallel Generation of GST
• GST: Generalized Suffix Tree
  – Compacted trie
  – Longest common prefix found in constant time
  – Used for on-demand pair generation
• Sequential: O(nl)
• Parallel: O(nl/p)

Parallel Generation of GST (Cont.)
• Previous implementations:
  – CRCW/CREW PRAM model
  – Work-optimal
  – Involve alphabetical ordering of characters
• Unrealistic assumptions
  – Synchronous operation of processors
  – Infinite network bandwidth
  – No memory contention
• Not practically efficient

Parallel Generation of GST (Cont.)
• Paper's approach:
  – ESTs equally distributed among processors
  – Each processor partitions the suffixes of its ESTs into |Σ|^w buckets
  – Distribute buckets to the processors:
    all suffixes in a bucket are allocated to the same processor
  – Total # of suffixes allocated to a processor ≈ O(nl/p)

Parallel Generation of GST (Cont.)
• Each bucket's processor:
  – Computes the compacted trie of all its suffixes
  – Cannot use sequential construction:
    the suffixes of a string are not all in the same bucket
• Each bucket: subtree in the GST
• Nodes: depth-first
  search traversal of the trie
  – Pointer to the rightmost child

On-demand Pair Generation
• A pair should be generated if the two ESTs
  – Share a substring of length ≥ threshold
  – The shared substring is maximal
• Leaves under a common node
  – Share a substring of length = depth of node
• Parallel algorithm
  – Each processor works with its trie if the depth of its root in the GST < threshold

On-demand Pair Generation
• To process:
  – Sort internal nodes in decreasing order of depth
• Lists of a node
  – Generated when the node is processed
  – Removed after its parent is processed
  – Limits space to O(nl)
• Run-time ≈ # pairs generated + cost of sorting
  – Rejected pairs increase run-time by a factor of 2
  – Eliminating duplicates reduces run-time

Parallel Clustering
• Master-slave paradigm:
• Master processor:
  – Maintains and updates clusters, using a union-find data structure
  – Receives messages from slave processors:
    a batch of the next promising pairs generated by the slave;
    results of the pairwise alignment
  – Determines which pairs to explore
  – Determines if merging should occur
• Slave processors:
  – Generate pairs on demand
  – Perform pairwise alignments of pairs dispatched by the master processor

Parallel Clustering (Cont.)
Organization of Parallel Clustering Software
[Figure: Master P exchanging messages with three Slave Ps; message labels below]
• Batch of promising pairs generated + results of pairwise alignment
• Batchsize or fewer # of pairs + results of pairwise alignment on each pair

Parallel Clustering (Cont.)
• To start:
  – Slave P starts with 3 × batchsize pairs
  – Sends the 3rd batch to Master P
  – Starts alignment on the 1st batch
  – Sends results on the 1st batch + a newly generated batch
  – While waiting to receive results from Master P, aligns the 2nd batch
• A processor always has the next batch to work on between:
  – Submitting the results of the previous batch
  – Receiving another set of pairs

Parallel Clustering (Cont.)
• Improve and control quality
• Parameters:
  – Match and mismatch scores
  – Gap penalties
• Post-processing:
  – Detection of alternative splicing
  – Consulting protein databases
  – Organism specific

Experimental environment
• Used C and MPI
• Tested
  – Quality of software: Arabidopsis thaliana (due to availability of its genome)
  – Run-time behavior: 50,000 maize ESTs with 32-processor IBM SP
    # of processors
    Data size
    (# of promising pairs) vs data size
    Batchsize vs (# processors)
    # of clusters
    Master processor's time

Quality Assessment
• To assess quality
  – A data set and its correct clustering
• ESTs from the plant Arabidopsis thaliana
  – Splice program
  – Align ESTs to the genome
  – Discard ESTs that:
    don't align;
    align in multiple spots

Quality Assessment (Cont.)
• False negative:
  – A pair in the correct clustering is not paired in the output
  – 5%
• False positive:
  – A pair not in the correct clustering appears in the results
  – Negligible (< 0.04%)
  – Due to the conservative nature of the algorithm

Quality Assessment

Cluster results   Number of singleton clusters   Number of nonsingleton clusters
CAP3              10,803                         18,727
PaCE              17,930                         17,556
Benchmark         14,802                         19,536

Distribution of the number of singleton and non-singleton clusters for the benchmark set of 168,200 Arabidopsis ESTs.

Quality Assessment (Cont.)
[Figure slide; no text recovered]

Run-time Assessment
• Experiment with 50,000 maize ESTs:
  – 32-processor IBM SP-2
  – 16 minutes

Run-time Assessment (Cont.)

p     Preprocessing   Clustering   Total
4     273             102          375
8     119             50           169
16    61              26           87
32    38              15           53
64    29              10           39

Run-time (in seconds) spent in various components of PaCE for 20,000 ESTs. p, number of processors.

Run-time Assessment (Cont.)
• Run-time as a function of batchsize
  – Small batchsize: increase in communication overhead
  – Large batchsize:
    slaves are less responsive to the need of generating pairs;
    a slave does not use the latest clustering results
  – Optimal batchsize: determined by experiment
• Master processor's time
  – Fixed batchsize, increase in # of processors
  – Gradual increase in Master P's time
  – With 32 processors, increase < 1%
  – A single master processor is not a bottleneck

Results
• Space
  – Linear in the size of the input data set
• Reduced total work without sacrificing quality
• Reduced run-time
  – Parallel processors
  – Eliminating pairs
• Facilitates clustering
  – Scale memory with # of processors

Observations
• PaCE:
  – Approaches the EST clustering problem directly
  – Better than CAP3, Phrap, TIGR Assembler
• Compare time/quality
  – TGICL (TIGR Gene Indices Clustering Tools): support for PVM, MegaBlast
  – STACK
• Large data sets
• Lots of processors
• Can improve clustering time?
• Clustering algorithm

References
• http://www.cs.berkeley.edu/~kubitron/courses/cs258S02/lectures/eval10-logp.pdf
• A. Apostolico, C. Iliopoulos, G. M. Landau, B. Schieber, and U. Vishkin. Parallel construction of a suffix tree with applications. Algorithmica, 3:347–365, 1988.
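
Supplementary sketch: master-side cluster bookkeeping
The Parallel Clustering slides say the master keeps the clusters in a union-find structure, decides which promising pairs to explore, and merges clusters when an alignment is good enough. The Python sketch below illustrates only that bookkeeping. It is not PaCE's code (the slides say PaCE uses C and MPI), it runs sequentially, and the names UnionFind, process_batch, and aligns_well are hypothetical. In particular, aligns_well stands in for the pairwise alignment that a slave would perform.

```python
from typing import Callable, Iterable, List, Tuple


class UnionFind:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self, n: int) -> None:
        self.parent: List[int] = list(range(n))  # each EST starts as its own cluster
        self.size: List[int] = [1] * n

    def find(self, x: int) -> int:
        """Return the representative of the cluster containing x."""
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a: int, b: int) -> bool:
        """Merge the clusters of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra  # attach the smaller tree under the larger one
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def process_batch(
    uf: UnionFind,
    pairs: Iterable[Tuple[int, int]],
    aligns_well: Callable[[int, int], bool],  # hypothetical stand-in for a slave's alignment
) -> int:
    """Handle one batch of promising pairs; return how many alignments were run."""
    aligned = 0
    for a, b in pairs:
        if uf.find(a) == uf.find(b):
            continue  # both ESTs are already in one cluster: skip the alignment
        aligned += 1
        if aligns_well(a, b):
            uf.union(a, b)  # strong overlap: merge the two clusters
    return aligned
```

With five ESTs and the batch (0,1), (1,2), (0,2), (3,4), the pair (0,2) needs no alignment once the first two pairs have merged, which is how consulting the clusters first reduces total alignment work.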