Whole Genome Alignment using Multithreaded Parallel Implementation
Hyma S Murthy
CMSC 838 Presentation

Talk Overview
- Organization of the paper
  - Motivation
  - Technique: Pairwise Sequence Comparison using Dynamic Programming
  - EARTH Execution Model
  - Evaluation
  - Result Graphs
  - Conclusions
- Related Work (MUMmer)

CMSC 838T – Presentation

Motivation
- Importance of genome alignment: identify important matched and mismatched regions
  - “matches” represent homolog pairs, conserved regions or long repeats
  - “mismatches” represent foreign fragments inserted by transposition, sequence reversal or lateral transfer
  - Detect functional differences between pathogenic and non-pathogenic strains, evolutionary distance, mutations leading to disease, phenotypes, etc.
- Problems
  - Large computational power, memory and execution time
  - Existing algorithms apply dynamic programming only to subsequences
  - Computationally intensive to apply to whole sequences (O(n^2))
  - Thus applicable only to closely related genomes

Solution
- Multithreaded parallel implementation of a sequence alignment algorithm to align whole genomes
  - Parallel implementation of the dynamic programming technique
  - Uses the collective memory of several nodes
  - Uses multithreading to overlap computation and communication
- Applicable to closely related as well as less similar genomes
- Reliable output in reasonable time

Pairwise Sequence Comparison using Dynamic Programming
- Basic idea:
  - Quantify the similarity between pairs of symbols of the target sequences
  - Associate a score with each possible arrangement
  - Similarity is given by the highest score
- Example:

    sequence x   A  T  A  A  G  T
    sequence y   A  T  G  C  A  G  T
    SCORE        1  1 -1 -1 -1 -1 -1    TOTAL = -3

    sequence x   A  T  A  -  A  G  T
    sequence y   A  T  G  C  A  G  T
    SCORE        1  1 -1 -2  1  1  1    TOTAL = 2

- Model mutation by “gaps” (gaps indicate evolution of one sequence into another)

Dynamic Programming
- Smith and Waterman approach:
  - Aligns subsequences of the given sequences
  - Involves:
    (a) calculation of scores indicating similarity
    (b) identification of the alignment(s) corresponding to the score
- Build the solution using previous solutions for smaller subsequences
- Construct a two-dimensional array, the “Similarity Matrix”, to store scores corresponding to partial results
- The matrix represents all possible alignments of the input sequences
- Recurrence equation, where gp is the gap penalty and ss is the substitution score for the symbol pair at (i, j):

    SM[i, j] = max { SM[i, j-1] + gp,
                     SM[i-1, j-1] + ss,
                     SM[i-1, j] + gp,
                     0 }

Dynamic Programming (contd.)
- Each element of the matrix is the maximum of the following four values:
  - left element + gap
  - upper-left element + score of replacing the vertical symbol with the horizontal symbol
  - upper element + gap
  - 0
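The recurrence above can be sketched in a few lines of Python. This is a minimal sequential illustration, not the ATGC implementation: the function names are hypothetical, and the scoring values (+1 match, -1 mismatch, -2 gap) are assumed from the scoring example above.

```python
def similarity_matrix(x, y, match=1, mismatch=-1, gap=-2):
    """Fill the similarity matrix SM for x (vertical) and y (horizontal).

    Scoring values are assumptions taken from the slide's scoring example.
    """
    rows, cols = len(x) + 1, len(y) + 1
    sm = [[0] * cols for _ in range(rows)]  # row 0 and column 0 stay 0
    for i in range(1, rows):
        for j in range(1, cols):
            ss = match if x[i - 1] == y[j - 1] else mismatch
            sm[i][j] = max(
                sm[i][j - 1] + gap,     # left element + gap
                sm[i - 1][j - 1] + ss,  # upper-left element + substitution score
                sm[i - 1][j] + gap,     # upper element + gap
                0,                      # local alignment never drops below 0
            )
    return sm


def best_cells(sm):
    """Return the highest score and every (row, column) cell holding it."""
    best = max(max(row) for row in sm)
    cells = [(i, j) for i, row in enumerate(sm)
             for j, value in enumerate(row) if value == best]
    return best, cells


if __name__ == "__main__":
    for row in similarity_matrix("GATAGG", "TGATGGAGGT"):
        print(" ".join(str(value) for value in row))
```

Each cell depends only on its left, upper, and upper-left neighbours, which is the property a parallel version can exploit: cells on the same anti-diagonal are independent of one another.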
- Consider the following example:

           T  G  A  T  G  G  A  G  G  T
        0  0  0  0  0  0  0  0  0  0  0
     G  0  0  1  0
     A  0  0  0  2
     T  0
     A  0
     G  0
     G  0

    2 = max{0 + (-2), 1 + (1), 0 + (-2), 0}

Identifying alignments
- Alignments with a score above a given threshold are reported
- Start at the end of the alignment and move backwards to the beginning

           T  G  A  T  G  G  A  G  G  T
        0  0  0  0  0  0  0  0  0  0  0
     G  0  0  1  0  0  1  1  0  1  1  0
     A  0  0  0  2  0  0  0  2  0  0  0
     T  0  1  0  0  3  1  0  0  1  0  1
     A  0  0  0  1  1  2  0  1  0  0  0
     G  0  0  1  0  0  2  3  1  2  1  0
     G  0  0  1  0  0  1  3  2  2  3  1

[Figure: four alignments of TGATGGAGGT with GATAGG; the first is shown with a gap (TGAT-GGAGGT).]

EARTH Execution Model
- A program is viewed as a collection of threads
  - execution order is determined by data and control dependencies
- Threads are further divided into fibers
  - fibers are non-preemptive, and all data is ready before their execution
- Each node in EARTH has
  - an execution unit
  - a synchronization unit
  - queues linking the two (RQ and EQ)
  - local memory
  - an interface to the interconnection network

EARTH Architecture
[Figure: EARTH architecture. Nodes are connected by an interconnection network. Each node contains processing elements (PE) on a memory bus, local memory, an execution unit (EU), a synchronization unit (SU), and the RQ and EQ queues between them.]

Multithreaded parallel implementation
- Divide the scoring matrix as follows:
  - into horizontal strips (rows indexed by elements of input sequence X)
  - then divide the strips into rectangular blocks
- Blocks are calculated by two fibers within a thread
  - Each thread is assigned to one horizontal strip
  - Only one fiber is active at any given time
  - The computation is done by the even/odd fibers within the thread
- The initialization delay of reading sequences from the server is minimized
  - Each thread needs only the piece of input sequence X it grabs, not the whole of X
  - After computing a block, a fiber sends a piece of sequence Y, among other information, to the fiber beneath it
- The anti-diagonal elements of the matrix are computed as shown

Computation of similarity matrix on EARTH
[Figure: Threads A and B run on processors P1 and P2. Each thread has even (E) and odd (O) fibers, one active and one inactive. Arrows are labeled Ack, Sync, and Data. Strips are labeled P1 to P4, repeated.]

Evaluation
- Experimental environment
  - Beowulf implementation of EARTH
  - Uses a Beowulf machine consisting of 64 nodes, each containing two 200MHz Pentium Pro processors (a total of 128 processors and 128MB of memory)
- Sequences of lengths ranging from 30K to 900K were tested
- Execution times for the sequential and parallel implementations of the Smith and Waterman algorithm are given below:

    Implementation               Time
    Sequential Smith-Waterman    53 hours
    ATGC on 16 nodes             3.3 hours
    ATGC on 32 nodes             2.1 hours
    ATGC on 64 nodes             1.3 hours

Evaluation (contd.)
- The multithreaded parallel implementation is named ATGC (Another Tool for Genomic Comparison)
- The experiment aligns
  - human and mouse mitochondrial genomes
  - human and drosophila mitochondrial genomes
- Reason for selection
  - human and mouse are closely related, and the other pair is less similar
- The results were confirmed with MUMmer, another whole genome alignment tool
- Result graphs show that ATGC is more accurate than MUMmer (verified using NCBI Blast)

Result Graphs
[Figures: result graphs, not preserved in this text.]

Conclusions
- Comparison of whole genomes requires high computation and memory
  - Made convenient by using a multithreaded parallel implementation of dynamic programming on a cluster of PCs
- Accurate results obtained in a reasonable amount of time
- Aligns closely related as well as less similar genomes
- Slower, but plays an important role where high accuracy is needed (as seen in the comparison with MUMmer for the human and drosophila mitochondrial genomes)

Related work: MUMmer (Maximal Unique Match)
- Given genomes A and B:
  - find all maximal, unique, matching subsequences (MUMs)
  - extract the longest possible set of matches that occur in the same order in both genomes
  - close the gaps
  - output the alignment
- Maximal unique match (MUM):
  - occurs exactly once in both genomes A and B
  - is not contained in any longer MUM
- The key idea in identifying MUMs is to build a suffix tree for genomes A and B
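The MUM definition and the ordering step above can be illustrated with a brute-force Python sketch. It is only an illustration under stated assumptions: MUMmer itself uses a suffix tree, whereas this version compares every pair of positions and is practical only for short strings. The function names, the `min_len` parameter, and the choice to maximize the number of matches (rather than their total length) are all hypothetical.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return (pos_in_a, pos_in_b, length) for each maximal unique match.

    Brute force stand-in for the suffix-tree search (an assumption).
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Not maximal if the match can be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            k = 0  # extend to the right as far as the symbols agree
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            s = a[i:i + k]
            if (k >= min_len and count_occurrences(a, s) == 1
                    and count_occurrences(b, s) == 1):
                mums.append((i, j, k))
    return mums


def ordered_chain(mums):
    """Largest subset of MUMs whose positions increase in both sequences."""
    mums = sorted(mums)  # ascending position in a
    if not mums:
        return []
    best = [1] * len(mums)   # best[n]: longest chain ending at mums[n]
    prev = [-1] * len(mums)  # back-pointers for recovering the chain
    for n in range(len(mums)):
        for m in range(n):
            if mums[m][1] < mums[n][1] and best[m] + 1 > best[n]:
                best[n], prev[n] = best[m] + 1, m
    n = max(range(len(mums)), key=best.__getitem__)
    chain = []
    while n != -1:
        chain.append(mums[n])
        n = prev[n]
    return chain[::-1]
```

For example, `find_mums("ACGTAGGCT", "ACGTCGGCT")` finds the two flanking matches around the single mismatched symbol, and `ordered_chain` keeps both because they occur in the same order in each sequence. Closing the gap between them is the step that a local alignment would then handle.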