Whole Genome Alignment using Multithreaded Parallel Implementation Hyma S Murthy CMSC 838 Presentation

Talk Overview

Organization of the paper

Motivation

Technique:
Pairwise Sequence Comparison using Dynamic Programming

EARTH Execution Model

Evaluation

Result Graphs

Conclusions

Related Work (MUMmer)
CMSC 838T – Presentation
Motivation


Importance of genome alignment:

Identify important matched and mismatched regions

“matches” represent homolog pairs, conserved regions or long repeats

“mismatches” represent foreign fragments inserted by transposition,
sequence reversal or lateral transfer

Detect functional differences between pathogenic and non-pathogenic strains,
evolutionary distance, mutations leading to disease, phenotypes, etc.
Problems

Whole-genome alignment needs large computational power, memory and execution time

Existing algorithms apply dynamic programming only to subsequences

Applying it to whole sequences is computationally intensive (O(n^2))

Thus existing algorithms are applicable only to closely related genomes
Solution

Multithreaded parallel implementation of sequence alignment
algorithm to align whole genomes

Parallel implementation of dynamic programming technique

Uses collective memory of several nodes

Uses multithreading to overlap computation and communication

Applicable to closely related as well as less similar genomes

Reliable output in reasonable time
Pairwise Sequence Comparison using Dynamic Programming

Basic Idea:

Quantify the similarity between pairs of symbols of the target sequences
Associate a score with each possible arrangement
Similarity is given by the highest score

Example:

sequence x   A  T  A  A  G  T
sequence y   A  T  G  C  A  G  T
SCORE        1  1 -1 -1 -1 -1 -1    TOTAL = -3

sequence x   A  T  A  -  A  G  T
sequence y   A  T  G  C  A  G  T
SCORE        1  1 -1 -2  1  1  1    TOTAL = 2

Model mutation by “gaps” (gaps indicate evolution of one sequence into
another)
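The scoring of a fixed arrangement can be sketched in a few lines of Python. This is only an illustration: the function name `alignment_score` is hypothetical, and the values match = +1, mismatch = -1, gap = -2 are assumptions read off the second (gapped) example above, not parameters stated by the slides.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of two equal-length aligned strings.

    A '-' in either string marks a gap. The default scores are assumed
    from the gapped example on the slide.
    """
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for p, q in zip(a, b):
        if p == "-" or q == "-":
            total += gap        # a symbol aligned against a gap
        elif p == q:
            total += match      # identical symbols
        else:
            total += mismatch   # substitution
    return total


# The gapped arrangement from the slide: 1 + 1 - 1 - 2 + 1 + 1 + 1
gapped_total = alignment_score("ATA-AGT", "ATGCAGT")
```

Dynamic programming avoids enumerating every arrangement and scoring each one this way.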
Dynamic Programming

Smith and Waterman approach:

Aligns subsequences of given sequences

Involves: (a) calculation of scores indicating similarity
(b) identification of alignment(s) corresponding to the score

Build solution using previous solutions for smaller subsequences

Construct a two-dimensional array – “Similarity Matrix” to store scores
corresponding to partial results

Matrix represents all possible alignments of the input sequences

Recurrence equation (gp = gap penalty, ss = score of substituting one symbol for the other):
SM[i, j] = max { SM[i, j-1] + gp,
                 SM[i-1, j-1] + ss,
                 SM[i-1, j] + gp,
                 0 }
Contd….
Each element of the matrix is the max of the following four values:
left element + gap, upper-left element + score of replacing the vertical
symbol with the horizontal symbol, upper element + gap, 0.
Consider the following example:

        T  G  A  T  G  G  A  G  G  T
     0  0  0  0  0  0  0  0  0  0  0
  G  0  0  1  0
  A  0  0  0  2
  T  0
  A  0
  G  0
  G  0

2 = max{0 + (-2), 1 + (1), 0 + (-2), 0}
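The fill step of the recurrence can be sketched as follows. This is a minimal sequential illustration, not the implementation from the paper: the function name `similarity_matrix` is hypothetical, the gap penalty (-2) and match score (+1) come from the worked cell above, and the mismatch score (-1) is an assumption.

```python
def similarity_matrix(vertical, horizontal, match=1, mismatch=-1, gap=-2):
    """Fill the Smith-Waterman similarity matrix row by row.

    Row i corresponds to vertical[i-1] and column j to horizontal[j-1];
    row 0 and column 0 are the all-zero border.
    """
    rows, cols = len(vertical) + 1, len(horizontal) + 1
    sm = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            # ss: score of replacing the vertical symbol with the horizontal one
            ss = match if vertical[i - 1] == horizontal[j - 1] else mismatch
            sm[i][j] = max(
                sm[i][j - 1] + gap,     # left element + gap
                sm[i - 1][j - 1] + ss,  # upper-left element + substitution score
                sm[i - 1][j] + gap,     # upper element + gap
                0,
            )
    return sm


sm = similarity_matrix("GATAGG", "TGATGGAGGT")
worked_cell = sm[2][3]  # row "A", column "A": the cell worked out on the slide
```

Every cell depends only on its left, upper and upper-left neighbours, which is what later allows the matrix to be split into blocks.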
Identifying alignments

Alignments with score above a given threshold are reported

Start at end of the alignment and move backwards to the beginning

        T  G  A  T  G  G  A  G  G  T
     0  0  0  0  0  0  0  0  0  0  0
  G  0  0  1  0  0  1  1  0  1  1  0
  A  0  0  0  2  0  0  0  2  0  0  0
  T  0  1  0  0  3  1  0  0  1  0  1
  A  0  0  0  1  1  2  0  1  0  0  0
  G  0  0  1  0  0  2  3  1  2  1  0
  G  0  0  1  0  0  1  3  2  2  3  1

[Figure: four alignments of GATAGG against TGATGGAGGT read back from the matrix; the first inserts a gap (TGAT-GGAGGT), the other three are ungapped. The highlighting of the aligned regions is not preserved.]
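The backward walk can be sketched like this. It is a self-contained illustration under assumed scores (match = +1, mismatch = -1, gap = -2); the function names, the tie-breaking order (diagonal, then up, then left) and the threshold value are hypothetical choices, not taken from the slides.

```python
def similarity_matrix(vertical, horizontal, match=1, mismatch=-1, gap=-2):
    """Fill the similarity matrix (rows: vertical, columns: horizontal)."""
    rows, cols = len(vertical) + 1, len(horizontal) + 1
    sm = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            ss = match if vertical[i - 1] == horizontal[j - 1] else mismatch
            sm[i][j] = max(sm[i][j - 1] + gap, sm[i - 1][j - 1] + ss,
                           sm[i - 1][j] + gap, 0)
    return sm


def traceback(sm, vertical, horizontal, i, j, match=1, mismatch=-1, gap=-2):
    """Walk back from cell (i, j) until a zero cell is reached.

    Returns the aligned pieces of (horizontal, vertical), with '-' for gaps.
    """
    top, bottom = [], []
    while i > 0 and j > 0 and sm[i][j] > 0:
        ss = match if vertical[i - 1] == horizontal[j - 1] else mismatch
        if sm[i][j] == sm[i - 1][j - 1] + ss:    # came from the upper-left cell
            top.append(horizontal[j - 1])
            bottom.append(vertical[i - 1])
            i, j = i - 1, j - 1
        elif sm[i][j] == sm[i - 1][j] + gap:     # came from above: gap in horizontal
            top.append("-")
            bottom.append(vertical[i - 1])
            i -= 1
        else:                                    # came from the left: gap in vertical
            top.append(horizontal[j - 1])
            bottom.append("-")
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


vertical, horizontal = "GATAGG", "TGATGGAGGT"
sm = similarity_matrix(vertical, horizontal)

threshold = 2  # hypothetical: report alignments scoring above this value
end_cells = [(i, j) for i in range(len(sm)) for j in range(len(sm[0]))
             if sm[i][j] > threshold]
gapped = traceback(sm, vertical, horizontal, 6, 6)
```

Starting from the cell in the last row under the sixth column, the walk recovers the gapped alignment of GAT-GG with GATAGG.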
EARTH Execution Model

A program is viewed as a collection of threads

execution order is determined by data and control dependencies

Threads are further divided into fibers

fibers are non-preemptive
all data a fiber needs is ready before it executes

Each node in EARTH has

an execution unit (EU)
a synchronization unit (SU)
queues linking the two: the ready queue (RQ) and the event queue (EQ)
local memory
an interface to the interconnection network
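The interplay between the two units and their queues can be illustrated with a toy, single-node simulation. Everything here is a hypothetical sketch (the class names `Fiber` and `Node`, the `signal` method, the one-fiber-per-step loop); it is not the EARTH runtime API, only a way to see why a fiber never starts before all of its inputs have arrived.

```python
from collections import deque


class Fiber:
    """A non-preemptive unit of work that runs once all its inputs have arrived."""

    def __init__(self, name, sync_count, body):
        self.name = name
        self.pending = sync_count  # number of signals still missing
        self.body = body           # callable taking the node; runs to completion


class Node:
    """Toy node: the SU drains the event queue, the EU drains the ready queue."""

    def __init__(self):
        self.event_queue = deque()  # EQ: signals travelling from the EU to the SU
        self.ready_queue = deque()  # RQ: enabled fibers travelling from the SU to the EU
        self.log = []

    def signal(self, fiber):
        """Record that one input of `fiber` is now available."""
        self.event_queue.append(fiber)

    def run(self):
        while self.event_queue or self.ready_queue:
            # Synchronization unit: count down and enable fibers whose data is ready.
            while self.event_queue:
                fiber = self.event_queue.popleft()
                fiber.pending -= 1
                if fiber.pending == 0:
                    self.ready_queue.append(fiber)
            # Execution unit: run one ready fiber to completion (no preemption).
            if self.ready_queue:
                fiber = self.ready_queue.popleft()
                self.log.append(fiber.name)
                fiber.body(self)


node = Node()
c = Fiber("c", 2, lambda n: None)          # needs results from both a and b
a = Fiber("a", 1, lambda n: n.signal(c))
b = Fiber("b", 1, lambda n: n.signal(c))
node.signal(a)
node.signal(b)
node.run()
execution_order = node.log
```

Fiber c is enabled only after both a and b have signalled it, so it always runs last.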
EARTH Architecture
[Figure: EARTH architecture. Nodes are joined by an interconnection network. Labels in the figure: PE, memory bus, local memory, EU, SU, RQ (feeding the EU), EQ (fed by the EU).]
Multithreaded parallel implementation

Divide the scoring matrix as follows

the matrix into horizontal strips (each covering part of input sequence X)
each strip into rectangular blocks

Each thread is assigned to one horizontal strip

Blocks are calculated by two fibers within a thread

only one fiber is active at any given time
the computation is done by the even and odd fibers within the thread

Initialization delay of reading sequences from the server is minimized

Each thread needs only the piece of input sequence it grabs and not the whole of
sequence X

After computing a block, a fiber sends the fiber beneath it a piece of sequence Y,
among other information

The computation of the anti-diagonal elements of the matrix is as shown
Computation of similarity matrix on EARTH
[Figure: Thread A on P1 and Thread B on P2, each with even (E) and odd (O) fibers, one active and one inactive; Data, Sync and Ack messages pass between the threads; strips are labeled P1, P2, P3, P4, P1, P2, P3, P4.]
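The strip-and-block decomposition can be sketched as a sequential simulation. The sketch only shows the order in which blocks become computable: blocks on the same anti-diagonal of the block grid do not depend on each other, so they could be handled by different threads at the same time. Function names, block sizes and the scores (match = +1, mismatch = -1, gap = -2) are assumptions; no EARTH threads, fibers or messages are modelled.

```python
def fill_block(sm, vertical, horizontal, r0, r1, c0, c1,
               match=1, mismatch=-1, gap=-2):
    """Fill the cells of rows r0..r1-1 and columns c0..c1-1 in place."""
    for i in range(r0, r1):
        for j in range(c0, c1):
            ss = match if vertical[i - 1] == horizontal[j - 1] else mismatch
            sm[i][j] = max(sm[i][j - 1] + gap, sm[i - 1][j - 1] + ss,
                           sm[i - 1][j] + gap, 0)


def blocked_similarity_matrix(vertical, horizontal, strip_height, block_width):
    """Fill the matrix block by block, one anti-diagonal of blocks at a time."""
    rows, cols = len(vertical) + 1, len(horizontal) + 1
    sm = [[0] * cols for _ in range(rows)]
    # Horizontal strips (one per thread on the slides) and column ranges of blocks.
    strips = [(r, min(r + strip_height, rows)) for r in range(1, rows, strip_height)]
    blocks = [(c, min(c + block_width, cols)) for c in range(1, cols, block_width)]
    schedule = []
    for d in range(len(strips) + len(blocks) - 1):
        # Block (s, b) needs blocks (s-1, b), (s, b-1) and (s-1, b-1),
        # all of which lie on earlier anti-diagonals.
        wave = [(s, d - s) for s in range(len(strips)) if 0 <= d - s < len(blocks)]
        for s, b in wave:  # independent of each other; run one after another here
            fill_block(sm, vertical, horizontal, *strips[s], *blocks[b])
        schedule.append(wave)
    return sm, schedule


sm, schedule = blocked_similarity_matrix("GATAGG", "TGATGGAGGT",
                                         strip_height=2, block_width=3)
```

With three strips and four blocks per strip, the second wave already contains two blocks, one in each of the first two strips.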
Evaluation

Experimental environment

Beowulf implementation of EARTH

Uses Beowulf machine consisting of 64 nodes, each containing two
200MHz Pentium Pro processors (a total of 128 processors and 128MB of
memory)

Sequences of lengths ranging from 30K to 900K were tested

Execution times for the sequential and parallel implementations of the Smith and
Waterman algorithm are given below:

Implementation         Time
Seq. Smith-Waterman    53 hours
ATGC on 16 nodes       3.3 hours
ATGC on 32 nodes       2.1 hours
ATGC on 64 nodes       1.3 hours
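The speedup implied by these times can be worked out directly. The short calculation below uses only the four figures in the table; the per-node efficiency (speedup divided by node count) is an added derived quantity, and since each node holds two processors the per-processor efficiency would be half of it.

```python
sequential_hours = 53.0
parallel_hours = {16: 3.3, 32: 2.1, 64: 1.3}  # nodes -> ATGC run time in hours

# Speedup relative to the sequential Smith-Waterman run.
speedup = {nodes: sequential_hours / hours for nodes, hours in parallel_hours.items()}
# Efficiency per node; each node has two processors, so halve for per-processor.
efficiency = {nodes: s / nodes for nodes, s in speedup.items()}

for nodes in sorted(parallel_hours):
    print(f"{nodes} nodes: speedup {speedup[nodes]:.1f}x, "
          f"per-node efficiency {efficiency[nodes]:.2f}")
```

The speedups come out at roughly 16x, 25x and 41x, so each doubling of the node count gives less than a doubling in speed.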
Evaluation

The multithreaded parallel implementation is named ATGC –
Another Tool for Genomic Comparison

The experiment aligns

human and mouse mitochondrial genomes
human and Drosophila mitochondrial genomes

Reason for selection

human and mouse are closely related; the other pair is less similar

The results were confirmed with MUMmer – another whole
genome alignment tool

Result graphs show that ATGC is more accurate than MUMmer
(verified by using NCBI BLAST)
Result Graphs
Conclusions

Comparison of whole genomes requires high computation and
memory

Made convenient by using a multithreaded parallel
implementation of dynamic programming on a cluster of PCs

Accurate results obtained in reasonable amount of time

Aligns closely related as well as less similar genomes

Slower, but plays an important role where high accuracy is needed
(as seen in the comparison with MUMmer for the human and Drosophila
mitochondrial genomes)
Related Work: MUMmer (Maximal Unique Match)

Given genomes A and B:

find all maximal, unique, matching subsequences (MUMs)
extract the longest possible set of matches that occur in the same order in
both genomes
close the gaps
output the alignment

Maximal unique match (MUM):

occurs exactly once in both genomes A and B
not contained in any longer MUM

The key idea in identifying MUMs is to build a suffix tree for
genomes A and B
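The first two steps can be illustrated on short strings. Note the substitutions: this sketch finds MUMs by brute force rather than with a suffix tree, and it selects the in-order set with a simple quadratic longest-increasing-subsequence pass that counts matches rather than weighting them by length. The function names, the `min_len` parameter and the test strings are hypothetical, and this is not MUMmer's code.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included."""
    return sum(1 for k in range(len(text) - len(pattern) + 1)
               if text.startswith(pattern, k))


def find_mums(a, b, min_len=3):
    """Return (pos_in_a, pos_in_b, length) for each maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue  # extendable to the left, so not the start of a maximal match
            n = 0
            while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
                n += 1    # extend to the right as far as possible
            s = a[i:i + n]
            if (n >= min_len and count_occurrences(a, s) == 1
                    and count_occurrences(b, s) == 1):
                mums.append((i, j, n))
    return mums


def chain_in_order(mums):
    """Longest set of matches whose positions increase in both genomes."""
    mums = sorted(mums)  # order by position in A
    if not mums:
        return []
    best = [1] * len(mums)   # best[k]: longest chain ending at match k
    prev = [-1] * len(mums)
    for k in range(len(mums)):
        for p in range(k):
            if mums[p][1] < mums[k][1] and best[p] + 1 > best[k]:
                best[k], prev[k] = best[p] + 1, p
    k = max(range(len(mums)), key=best.__getitem__)
    chain = []
    while k != -1:
        chain.append(mums[k])
        k = prev[k]
    return chain[::-1]


mums = find_mums("ACGTGGA", "TACGTCGGA")
ordered = chain_in_order(mums)
```

A suffix tree yields the same matches without the quadratic pairwise scan, which is what makes the approach practical on whole genomes.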