Whole Genome Alignment using Multithreaded Parallel Implementation
Hyma S Murthy
CMSC 838 Presentation

Talk Overview
- Organization of the paper
- Motivation
- Technique: Pairwise Sequence Comparison using Dynamic Programming
- EARTH Execution Model
- Evaluation
- Result Graphs
- Conclusions
- Related Work (MUMmer)

CMSC 838T – Presentation

Motivation
- Importance of genome alignment:
  - Identify important matched and mismatched regions
    - "matches" represent homolog pairs, conserved regions or long repeats
    - "mismatches" represent foreign fragments inserted by transposition, sequence reversal or lateral transfer
  - Detect functional differences between pathogenic/non-pathogenic strains, evolutionary distance, mutations leading to disease, phenotypes, etc.

Problems
- Large computational power, memory and execution time
- Existing algorithms apply dynamic programming only to subsequences
  - Computationally intensive to apply to whole sequences (O(n^2))
  - Thus applicable only to closely related genomes

Solution
- Multithreaded parallel implementation of a sequence alignment algorithm to align whole genomes
  - Parallel implementation of the dynamic programming technique
  - Uses the collective memory of several nodes
  - Uses multithreading to overlap computation and communication
- Applicable to closely related as well as less similar genomes
- Reliable output in reasonable time

Pairwise Sequence Comparison using Dynamic Programming
- Basic idea:
  - Quantify the similarity between pairs of symbols of the target sequences
  - Associate a score with each possible arrangement
  - Similarity is given by the highest score
- Example:

  sequence x    A   T   A   A   G   T
  sequence y    A   T   G   C   A   G   T
  score         1   1  -1  -1  -1  -1  -1     total = -3

  sequence x    A   T   A   -   A   G   T
  sequence y    A   T   G   C   A   G   T
  score         1   1  -1  -2   1   1   1     total = 2

- Model mutation by "gaps" (gaps indicate evolution of one sequence into another)

Dynamic Programming
- Smith and Waterman approach:
  - Aligns subsequences of the given sequences
  - Involves:
    (a) calculation of scores indicating similarity
    (b) identification of the alignment(s) corresponding to the score
- Build the solution using previous solutions for smaller subsequences
- Construct a two-dimensional array, the "Similarity Matrix", to store scores corresponding to partial results
- The matrix represents all possible alignments of the input sequences
- Recurrence equation:

$$
SM[i,j] = \max
\begin{cases}
SM[i,j-1] + gp \\
SM[i-1,j-1] + ss \\
SM[i-1,j] + gp \\
0
\end{cases}
$$

Each element of the matrix is the maximum of the following four values:
- left element + gap penalty (gp)
- upper-left element + score of replacing the vertical symbol with the horizontal symbol (ss)
- upper element + gap penalty (gp)
- 0
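The recurrence and the two steps (a) and (b) can be sketched in Python. This is a minimal illustration, not the implementation from the paper: the function names are made up here, the scores (match +1, mismatch -1, gap -2) are assumptions chosen to agree with the examples on the slides, and the traceback follows only one path back from a chosen cell.

```python
def similarity_matrix(x, y, match=1, mismatch=-1, gap=-2):
    """Step (a): fill the similarity matrix; x runs down the rows, y across the columns."""
    sm = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            ss = match if x[i - 1] == y[j - 1] else mismatch
            sm[i][j] = max(sm[i][j - 1] + gap,      # left element + gap
                           sm[i - 1][j - 1] + ss,   # upper-left element + substitution score
                           sm[i - 1][j] + gap,      # upper element + gap
                           0)
    return sm


def traceback(sm, x, y, i, j, match=1, mismatch=-1, gap=-2):
    """Step (b): walk back from cell (i, j) until a zero cell is reached."""
    ax, ay = [], []
    while i > 0 and j > 0 and sm[i][j] > 0:
        ss = match if x[i - 1] == y[j - 1] else mismatch
        if sm[i][j] == sm[i - 1][j - 1] + ss:    # came from the upper-left cell
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif sm[i][j] == sm[i - 1][j] + gap:     # came from above: gap in y
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:                                    # came from the left: gap in x
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay))


x, y = "GATAGG", "TGATGGAGGT"     # vertical and horizontal sequences
sm = similarity_matrix(x, y)
print(sm[2][3])                   # 2 = max{0 + (-2), 1 + 1, 0 + (-2), 0}
print(traceback(sm, x, y, 6, 6))  # ('GATAGG', 'GAT-GG')
```

Both the time and the memory of this sketch grow with the product of the sequence lengths, which is the cost that motivates the parallel implementation.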
Consider the following example:

         T  G  A  T  G  G  A  G  G  T
      0  0  0  0  0  0  0  0  0  0  0
   G  0  0  1  0
   A  0  0  0  2
   T  0
   A  0
   G  0
   G  0

2 = max{0 + (-2), 1 + (1), 0 + (-2), 0}

Identifying alignments
- Alignments with a score above a given threshold are reported
- Start at the end of the alignment and move backwards to the beginning

         T  G  A  T  G  G  A  G  G  T
      0  0  0  0  0  0  0  0  0  0  0
   G  0  0  1  0  0  1  1  0  1  1  0
   A  0  0  0  2  0  0  0  2  0  0  0
   T  0  1  0  0  3  1  0  0  1  0  1
   A  0  0  0  1  1  2  0  1  0  0  0
   G  0  0  1  0  0  2  3  1  2  1  0
   G  0  0  1  0  0  1  3  2  2  3  1

[Figure: the sequence pair TGATGGAGGT / GATAGG shown four times, each with a different reported alignment highlighted; the first introduces a gap (TGAT-GGAGGT).]

EARTH Execution Model
- Program is viewed as a collection of threads
  - execution order is determined by data and control dependencies
- Threads are further divided into fibers
  - fibers are non-preemptive
  - all data is ready before their execution
- Each node in EARTH has
  - an execution unit (EU)
  - a synchronization unit (SU)
  - queues linking the two (RQ and EQ)
  - local memory
  - an interface to the interconnection network

EARTH Architecture
[Figure: EARTH architecture, showing nodes attached to an interconnection network and, within a node, the memory bus, PEs, EU, SU, RQ, EQ and local memory.]
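To make the data-driven firing rule concrete, here is a toy, single-process Python sketch. It is not the EARTH runtime or its API: `Fiber`, `run` and `block_grid` are hypothetical names, the plain queue only stands in for the RQ, and decrementing a counter stands in for the synchronization signals handled by the SU. As a workload it uses a small grid of matrix blocks in which each block waits on its left and upper neighbours, the dependency pattern implied by the similarity-matrix recurrence.

```python
from collections import deque


class Fiber:
    """A non-preemptive unit of work that may run only once its sync count is zero."""

    def __init__(self, name, sync_count):
        self.name = name
        self.sync_count = sync_count  # number of signals still awaited
        self.successors = []          # fibers to signal when this one finishes


def run(fibers):
    """Fire fibers as their data becomes ready; return the names fired in each round."""
    ready = deque(f for f in fibers if f.sync_count == 0)  # stands in for the ready queue
    rounds = []
    while ready:
        current, ready = list(ready), deque()
        rounds.append([f.name for f in current])
        for fiber in current:                 # "execute" the fiber (no real work here)
            for succ in fiber.successors:     # signal every dependent fiber
                succ.sync_count -= 1
                if succ.sync_count == 0:      # all inputs have arrived
                    ready.append(succ)
    return rounds


def block_grid(rows, cols):
    """One fiber per block; block (i, j) waits on its upper and left neighbours."""
    grid = {(i, j): Fiber((i, j), (i > 0) + (j > 0))
            for i in range(rows) for j in range(cols)}
    for (i, j), fiber in grid.items():
        if i + 1 < rows:
            fiber.successors.append(grid[(i + 1, j)])
        if j + 1 < cols:
            fiber.successors.append(grid[(i, j + 1)])
    return list(grid.values())


for step, names in enumerate(run(block_grid(3, 3))):
    print(step, sorted(names))  # blocks with i + j == step fire together
```

Blocks on the same anti-diagonal become ready in the same round, so they could be computed concurrently on different nodes. The sketch processes rounds in lockstep for readability; a real runtime would fire each fiber as soon as its own count reaches zero.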
Multithreaded parallel implementation
- Divide the scoring matrix as follows:
  - into horizontal strips (each element of input sequence X)
  - strips into rectangular blocks
- Blocks are calculated by two fibers within a thread
  - each thread is assigned to one horizontal strip
  - only one fiber is active at any given time
  - the computation is done by even/odd fibers within the thread
- Initialization delay of reading sequences from the server is minimized
  - each thread needs only the piece of input sequence it grabs, not the whole of sequence X
  - after computing a block, a fiber sends a piece of sequence Y, among other information, to the fiber beneath it
- The anti-diagonal elements of the matrix are computed as shown

Computation of similarity matrix on EARTH
[Figure: computation of the similarity matrix on EARTH. Labels: processors P1 to P4, Thread A, Thread B, even (E) and odd (O) fibers, active fiber, inactive fiber, Ack, Sync, Data.]

Evaluation
- Experimental environment
  - Beowulf implementation of EARTH
  - Uses a Beowulf machine consisting of 64 nodes, each containing two 200MHz Pentium Pro processors (a total of 128 processors and 128MB of memory)
  - Sequences of lengths ranging from 30K to 900K were tested
- Execution times for the sequential and parallel implementations of the Smith and Waterman algorithm are given below:

  Implementation                Time
  Sequential Smith-Waterman     53 hours
  ATGC on 16 nodes              3.3 hours
  ATGC on 32 nodes              2.1 hours
  ATGC on 64 nodes              1.3 hours

Evaluation
- The multithreaded parallel implementation is named ATGC (Another Tool for Genomic Comparison)
- The experiment aligns
  - human and mouse mitochondrial genomes
  - human and Drosophila mitochondrial genomes
- Reason for selection: human and mouse are closely related, and the other pair are less similar
- The results were confirmed with MUMmer, another whole genome alignment tool
- Result graphs show that ATGC is more accurate than MUMmer (verified by using NCBI BLAST)

Result Graphs
[Figures: result graphs, two slides.]

Conclusions
- Comparison of whole genomes requires high computation and memory
- Made convenient by using a multithreaded parallel implementation of dynamic programming on a cluster of PCs
- Accurate results obtained in a reasonable amount of time
- Aligns closely related as well as less similar genomes
- Slower, but plays an important role where high accuracy is needed (as seen in the comparison with MUMmer for the human and Drosophila mitochondrial genomes)

Related work: MUMmer (Maximal Unique Match)
- Given genomes A and B:
  - find all maximal, unique, matching subsequences (MUMs)
  - extract the longest possible set of matches that occur in the same order in both genomes
  - close the gaps
  - output the alignment
- Maximal unique match (MUM):
  - occurs exactly once in both genomes A and B
  - is not contained in any longer MUM
- The key idea in identifying MUMs is to build a suffix tree for genomes A and B
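The MUM definition can be illustrated with a brute-force Python sketch. This is deliberately not MUMmer's method: it uses no suffix tree and is only practical for very short strings. The names `find_mums`, `count_occurrences` and the `min_len` parameter are hypothetical, introduced here for illustration.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=2):
    """Return (pos_in_a, pos_in_b, match) for every maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starting points that could still be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right for as long as the two genomes agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            match = a[i:i + k]
            # Keep the match only if it occurs exactly once in each genome.
            if (k >= min_len
                    and count_occurrences(a, match) == 1
                    and count_occurrences(b, match) == 1):
                mums.append((i, j, match))
    return mums


print(find_mums("GATTACA", "CATTAGA"))
# [(0, 5, 'GA'), (1, 1, 'ATTA'), (5, 0, 'CA')]
```

Of these three matches, only ATTA would survive the next step of picking matches that occur in the same order in both genomes together with either neighbour, since GA and CA appear in opposite order in the two strings.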
