#5 - Dynamic Programming 8/29/07 Lecture 5 #5_Aug29

BCB 444/544
Lecture 5
Dynamic Programming
#5_Aug29

Required Reading (before lecture)
Mon Aug 27 - for Lecture #4
Pairwise Sequence Alignment
• Chp 3 - pp 31-41
Wed Aug 29 - for Lecture #5
Dynamic Programming
• Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909
Thurs Aug 30 - Lab #2:
Databases, ISU Resources & Pairwise Sequence Alignment
Fri Aug 31 - for Lecture #6
Scoring Matrices and Alignment Statistics
• Chp 3 - pp 41-49

BCB 444/544 F07 ISU
Dobbs #5 - Dynamic Programming
8/29/07

Review:
Chp 2- Biological Databases
• Xiong: Chp 2
Introduction to Biological Databases
• What is a Database?
• Types of Databases
• Biological Databases
• Pitfalls of Biological Databases
• Information Retrieval from Biological Databases

Types of Databases
3 Major types of electronic databases:
1. Flat files - simple text files
  • no organization to facilitate retrieval
2. Relational - data organized as tables ("relations")
  • shared features among tables allow rapid search
3. Object-oriented - data organized as "objects"
  • objects associated hierarchically

Examples of Biological Databases
1- Primary
• DNA sequences
  • GenBank - USA
  • European Molecular Biology Lab - EMBL
  • DNA Data Bank of Japan - DDBJ
• Structures (Protein, DNA, RNA)
  • PDB - Protein Data Bank
  • NDB - Nucleic Acid Data Bank
2- Secondary
• Protein sequences
  • Swiss-Prot, TrEMBL, PIR
  • these recently combined into UniProt
3- Specialized
• Species-specific (or "taxonomic" specific)
  • FlyBase, WormBase, AceDB, PlantDB
• Molecule-specific, disease-specific
See: http://www.oxfordjournals.org/nar/database/c/
SUMMARY:
#2- Biological Databases
BEWARE!
Who was that Icelandic fellow?

Chp 3- Sequence Alignment
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 3
Pairwise Sequence Alignment
• Evolutionary Basis
• Sequence Homology versus Sequence Similarity
• Sequence Similarity versus Sequence Identity
• Methods
• Scoring Matrices
• Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from:
Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

Motivation for Sequence Alignment
"Sequence comparison lies at the heart of bioinformatics analysis."
Jin Xiong
For us:
Sequence comparison is important for drawing functional
& evolutionary inferences re: new genes/proteins
Pairwise sequence alignment is fundamental; it is used to:
• Search for common patterns of characters
• Establish pair-wise correspondence between related sequences
Pairwise sequence alignment is basis for:
• Database searching (e.g., BLAST)
• Multiple sequence alignment (MSA)

Homology
Homology has a very specific meaning in evolutionary &
computational biology - & term is often used incorrectly
Homology = similarity due to descent from a common
evolutionary ancestor
But,
HOMOLOGY ≠ SIMILARITY
When 2 sequences share a sufficiently high degree of
sequence similarity (or identity), we may infer that they
are homologous
We can infer homology from similarity (can't prove it!)

Orthologs vs Paralogs
2 types of homologous sequences:
• Orthologs - "same genes" in different species;
  • result of common ancestry
  • corresponding proteins have "same" functions
    (e.g., human α-globin & mouse α-globin)
• Paralogs - "similar genes" within a species;
  • result of gene duplication events
  • proteins may (or may not) have similar functions
    (e.g., human α-globin & human β-globin)
[Diagram: gene tree in which A gives B and C by Speciation, and C gives C' by Duplication]
A is the parent gene
Speciation leads to B & C
Duplication leads to C'
B and C are Orthologous
C and C' are Paralogous

Sequence Homology vs Similarity
• Homologous sequences - sequences that share a common
evolutionary ancestry
• Similar sequences - sequences that have a high percentage of
aligned residues with similar physicochemical properties
(e.g., size, hydrophobicity, charge)
IMPORTANT:
• Sequence homology:
  • An inference about a common ancestral relationship, drawn when
    two sequences share a high enough degree of sequence similarity
  • Homology is qualitative
• Sequence similarity:
  • The direct result of observation from a sequence alignment
  • Similarity is quantitative; can be described using percentages
Sequence Similarity vs Identity
For nucleotide sequences (DNA & RNA), sequence
similarity and identity have the "same" meaning:
• Two DNA sequences can share a high degree of sequence identity
(or similarity) -- means the same thing
• Drena's opinion: Always use "identity" when making quantitative
comparisons re: DNA or RNA sequences (to avoid confusion!)
For protein sequences, sequence similarity and identity
have different meanings:
• Identity = % of exact matches between two aligned sequences
• Similarity = % of aligned residues that share similar
characteristics (e.g., physicochemical characteristics,
structural propensities, evolutionary profiles)
• Drena's opinion: Always use "identity" when making quantitative
comparisons re: protein sequences (to avoid confusion!)

What is Sequence Alignment?
Given 2 sequences of letters, and a scoring scheme for
evaluating matching letters, find an optimal pairing of
letters in one sequence to letters of other sequence.
Align:
1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A SHORT SENTENCE.

1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A ######SHORT###SENTENCE##############.
OR
1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A ##SHORT###SENT#EN###CE##############.
Is one of these alignments "optimal"?
Which is better?

Goal of Sequence Alignment
Find the best pairing of 2 sequences, such that there
is maximum correspondence between residues
• DNA
4 letter alphabet (+ gap)
TTGACAC
TTTACAC
• Proteins
20 letter alphabet (+ gap)
RKVA-GMA
RKIAVAMA

Statement of Problem
Given:
• 2 sequences
• Scoring system for evaluating match (or
mismatch) of two characters
• Penalty function for gaps in sequences
Find: Optimal pairing of sequences that:
• Retains the order of characters
• Introduces gaps where needed
• Maximizes total score

Types of Sequence Variation
• Sequences can diverge from a common ancestor through
various types of mutations:
  • Substitutions   ACGA → AGGA
  • Insertions      ACGA → ACCGA
  • Deletions       ACGA → AGA
• Insertions or deletions ("indels") result in gaps in
alignments
• Substitutions result in mismatches
• No change? match

Gaps
Indels of various sizes can occur in one sequence relative
to the other
e.g., corresponding to a shortening of the polypeptide
chain in a protein
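The definition "Identity = % of exact matches between two aligned sequences" can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function name `percent_identity` is hypothetical, and the choice to divide by the number of ungapped columns is an assumption (other conventions divide by the full alignment length or by the shorter sequence).

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of ungapped alignment columns that are exact matches.

    Assumption: columns containing a gap in either row are excluded
    from the denominator; other conventions exist.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both rows hold a residue.
    pairs = [(p, q) for p, q in zip(a, b) if p != gap and q != gap]
    if not pairs:
        return 0.0
    matches = sum(p == q for p, q in pairs)
    return 100.0 * matches / len(pairs)


# The two small alignments shown on the "Goal of Sequence Alignment" slide.
print(percent_identity("TTGACAC", "TTTACAC"))    # DNA: 6 of 7 columns match
print(percent_identity("RKVA-GMA", "RKIAVAMA"))  # protein: 5 of 7 ungapped columns match
```

Percent similarity for proteins would need a table of which residue pairs count as "similar", which is what the substitution matrices in the next lecture provide.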
Avoiding Random Alignments with a
Scoring Function
• Introducing too many gaps generates nonsense alignments:
s--e-----qu---en--ce
sometimesquipsentice
• Need to distinguish between alignments that occur due to
homology and those that occur by chance
• Define a scoring function that accounts for mismatches
and gaps
Scoring Function (F):
Match:     + w    e.g. +1
Mismatch:  - x    e.g.  0
Gap:       - y    e.g. -1
F = w(#matches) - x(#mismatches) - y(#gaps)

Not All Mismatches are the Same
• Some amino acids are more "exchangeable" than
others (physicochemical properties are similar)
e.g., Ser & Thr are more similar than Trp & Ala
• Substitution matrix can be used to introduce
"mismatch costs" for handling different types of
substitutions
• Mismatch costs are not usually used in aligning
DNA or RNA sequences, because no substitution is
"better" than any other (in general)

Substitution Matrix
s(a,b) corresponds to score
of aligning character a with
character b
Match scores are often calculated
based on frequency of mutations
in very similar sequences
(more details later)

Methods
• Global and Local Alignment
• Alignment Algorithms
  • Dot Matrix Method
  • Dynamic Programming Method
    • Gap penalties
    • DP for Global Alignment
    • DP for Local Alignment
• Scoring Matrices
  • Amino acid scoring matrices
  • PAM
  • BLOSUM
  • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment

Global vs Local Alignment
Global alignment
• Finds best possible alignment across entire length of 2 sequences
• Aligned sequences assumed to be generally similar over entire length
Local alignment
• Finds local regions with highest similarity between 2 sequences
• Aligns these without regard for rest of sequence
• Sequences are not assumed to be similar over entire length

Global vs Local Alignment - example
S = CTGTCGCTGCACG
T = TGCCGTG
Global alignment / Local alignment (candidate alignments shown on the slide):
CTGTCGCTGCACG
-TGCCG-TG----

CTGTCGCTGCACG
-TG-C-C-G--TG

CTGTCGCTGCACG
-TGCCG-T----G
Which is better?
Global vs Local Alignment
Which should be used when?
Both are important
but it is critical to use right method for a given task!
Global alignment:
• Good for: aligning closely related sequences of similar length
• Not good for: divergent sequences or sequences with different
lengths
Local Alignment:
• Good for: searching for conserved patterns (domains or motifs) in
DNA or protein sequences
• Not good for: generating an alignment of closely related sequences
Global and local alignments are fundamentally similar; they differ only
in optimization strategy used to align similar residues

Alignment Algorithms
3 major methods for pairwise sequence alignment:
1. Dot matrix analysis
2. Dynamic programming
3. Word or k-tuple methods (later, in Chp 4)

Dot Matrix Method (Dot Plots)
• Place 1 sequence along top row of matrix
• Place 2nd sequence along left column of
matrix
• Plot a dot each time there is a match
between an element of row sequence and
an element of column sequence
• For proteins, usually use more
sophisticated scoring schemes than
"identical match"
[Dot plot: A C G C G along the top row, A C A C G along the left column]

Exploring Dot Plots
• Diagonal lines indicate areas of match
• Contiguous diagonal lines reveal
alignment; "breaks" = gaps (indels)

Interpretation of Dot Plots
When comparing 2 sequences:
• Diagonal lines of dots indicate regions of similarity
between 2 sequences
• Reverse diagonals (perpendicular to diagonal) indicate
inversions
• What do similar patterns mean when
comparing a sequence with itself (reverse
complement)?
• e.g.: Reverse diagonals crossing diagonals (X's) indicate
palindromes

Dot Matrix Variations
Compare 2 sequences
• Identify matching regions
  • Identities for DNA seqs
  • Similarities for protein seqs
Compare sequence with itself
• Identify repeated regions
• Identify inverted repeats
• Identify palindromes
For long sequences?
• Too many dots! Noisy!
• Instead of per "residue," plot
one dot per "window" of n
matching residues to reduce
noise

Strengths & Weaknesses of Dot Plots
Strengths:
• Fast and easy
• Allows direct visual identification of regions of similarity
• Repeats, inversions, etc. are readily apparent
• Displays all possible matches
Weaknesses:
• Doesn't generate full alignment - user must "connect the
diagonals"
• No statistical assessment of quality of alignment (score)
• Impractical and noisy for long sequences
• Difficult to scale up to multiple alignment
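The dot matrix method, including the "one dot per window of n matching residues" variation, can be sketched as follows. This is a minimal illustration under stated assumptions: the names `dot_matrix` and `render` are hypothetical, a "match" means exact identity (the slides note that proteins usually need a more sophisticated scheme), and the two short sequences are the ones that appear on the dot plot slide.

```python
def dot_matrix(top: str, left: str, window: int = 1) -> list[list[bool]]:
    """Return one row per start position in `left`, one column per start in `top`.

    A dot (True) at [i][j] means the `window` residues starting at left[i]
    and top[j] are identical; window=1 is the plain per-residue dot plot.
    """
    rows = len(left) - window + 1
    cols = len(top) - window + 1
    return [
        [left[i:i + window] == top[j:j + window] for j in range(cols)]
        for i in range(rows)
    ]


def render(top: str, left: str, window: int = 1) -> str:
    """Draw the dot plot as text: '*' for a dot, '.' for no dot."""
    grid = dot_matrix(top, left, window)
    header = "  " + " ".join(top[: len(top) - window + 1])
    body = [
        left[i] + " " + " ".join("*" if dot else "." for dot in row)
        for i, row in enumerate(grid)
    ]
    return "\n".join([header] + body)


print(render("ACGCG", "ACACG"))            # every single-residue match
print(render("ACGCG", "ACACG", window=2))  # fewer dots: only 2-residue matches
```

Raising `window` removes isolated dots and leaves the diagonals, which is the noise reduction the "For long sequences?" bullets describe.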
Dynamic Programming
For Pairwise sequence alignment
Idea: Display one sequence above another with
spaces inserted in both to reveal similarity
A: C A T - T C A - C
   |   |     | |   |
B: C - T C G C A G C

Global alignment: Scoring
CTGTCG-CTGCACG
-TGC-CG-TG----
Reward for matches: α
Mismatch penalty: β
Space/gap penalty: γ
Score = αw - βx - γy
w = #matches
x = #mismatches
y = #spaces

Global alignment: Scoring
Reward for matches: 10
Mismatch penalty: 2
Space/gap penalty: 5
  C   T   G   T   C   G   -   C   T   G   C
  -   T   G   C   -   C   G   -   T   G   -
 -5  10  10  -2  -5  -2  -5  -5  10  10  -5
Total = 11

Optimum Alignment
• Score of an alignment is a measure of its quality
• Optimum alignment problem: Given a pair of
sequences X and Y, find an alignment (global or
local) with maximum score

Alignment algorithms
• Global: Needleman-Wunsch
• Local: Smith-Waterman
• Both NW and SW use dynamic programming
• Variations:
  • Gap penalty functions
  • Scoring matrices

Dynamic Programming (DP)
• As computer science concept - formalized in early 1950's by
Bellman at RAND Corporation
"Frequently, however, there are only a polynomial number of
subproblems… If we keep track of the solution to each subproblem
solved, and simply look up the answer when needed, we obtain a
polynomial-time algorithm."
----Aho, Hopcroft, Ullman
• Reported to biologists for sequence alignment problems by
Needleman & Wunsch, 1970
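The scoring rule Score = αw - βx - γy can be checked with a short Python sketch. The function name `score_alignment` is hypothetical; the default values (10, 2, 5) are the ones on the second "Global alignment: Scoring" slide, and treating any column that contains a space as one gap is an assumption that matches the per-column scores shown there.

```python
def score_alignment(top: str, bottom: str,
                    alpha: int = 10, beta: int = 2, gamma: int = 5,
                    gap: str = "-") -> int:
    """Score = alpha*w - beta*x - gamma*y for two aligned rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    w = x = y = 0  # matches, mismatches, spaces
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            y += 1
        elif a == b:
            w += 1
        else:
            x += 1
    return alpha * w - beta * x - gamma * y


# The 11-column alignment scored on the slide (4 matches, 2 mismatches, 5 spaces).
print(score_alignment("CTGTCG-CTGC", "-TGC-CG-TG-"))  # 40 - 4 - 25 = 11
# The A/B alignment from the "Idea" slide (5 matches, 1 mismatch, 3 spaces).
print(score_alignment("CAT-TCA-C", "C-TCGCAGC"))      # 50 - 2 - 15 = 33
```

Scoring a given alignment is easy; the optimum alignment problem is to find, among all possible alignments, one that maximizes this score.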
Key Idea
Score of the best possible alignment that ends at a given
pair of positions (i,j) in two sequences is the score of the
best alignment previous to those two positions PLUS the
score for aligning those two positions
Next best alignment = previous best + local best

Problem Formulation and Notations
Given two sequences (strings)
• X = x1x2…xN of length N
  x = AGC, N=3
• Y = y1y2…yM of length M
  y = AAAC, M=4
Construct a matrix with (N+1) x (M+1) elements, where
S(i,j) = score of best alignment of x[1..i]=x1x2…xi with y[1..j]=y1y2…yj
[Matrix diagram: rows x1 x2 x3, columns y1 y2 y3 y4]
S(2,3) = score of best alignment
of AG (x1x2) to AAA (y1y2y3)

Dynamic Programming
4 Components:
1. Recursive definition for optimal score
2. Matrix for storing optimal scores of subproblems
3. Bottom-up approach for filling the matrix, by
solving smallest subproblems first
4. Traceback of path through matrix to recover the
optimal alignment(s) that gave the optimal score

Global Alignment: Algorithm
x[1..i] = Prefix of length i of x
y[1..j] = Prefix of length j of y
S(i,j) = Score of optimal alignment of x[1..i] and y[1..j]
(σ(a,b) = score for aligning a with b; γ = space/gap penalty, a positive number)
Initial conditions:
\[ S(0,0) = 0, \qquad S(i,0) = -i \times \gamma, \qquad S(0,j) = -j \times \gamma \]
Recursive definition:
For 1 ≤ i ≤ N, 1 ≤ j ≤ M:
\[
S(i,j) = \max
\begin{cases}
S(i-1,j-1) + \sigma(x_i, y_j) \\
S(i-1,j) - \gamma \\
S(i,j-1) - \gamma
\end{cases}
\]

Calculating Score of Optimum Alignment
[Matrix diagram: rows 0, 1, …, N and columns 0, 1, …, M; S(0,0)=0 in the top-left corner; S(i,j) is computed from S(i-1,j-1), S(i-1,j) and S(i,j-1); S(N,M) in the bottom-right corner]
Initialization: first column S(i,0) and first row S(0,j), as in the initial conditions
Recursion: every other S(i,j), by the recursive definition

Computing the best current score
S(i,j) satisfies the following relationships: the initial conditions and the recursive definition given under "Global Alignment: Algorithm"
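The recursive definition of S(i,j) can be transcribed almost literally into Python. Note the swap in technique: this sketch uses top-down recursion with memoization (in the spirit of the Aho, Hopcroft, Ullman quote) rather than the bottom-up matrix fill the slides describe; both evaluate each (i,j) subproblem once. The name `optimal_score` is hypothetical, and the scores (+10 match, -2 mismatch, gap penalty 5) are borrowed from the lecture's later example as an assumption, since the AGC/AAAC slide does not state a scoring scheme.

```python
from functools import lru_cache


def optimal_score(x: str, y: str,
                  match: int = 10, mismatch: int = -2, gap: int = 5) -> int:
    """S(N, M): score of the best global alignment of x and y."""

    @lru_cache(maxsize=None)  # remember each S(i, j) after it is first computed
    def S(i: int, j: int) -> int:
        if i == 0:                      # initial condition S(0, j) = -j * gamma
            return -j * gap
        if j == 0:                      # initial condition S(i, 0) = -i * gamma
            return -i * gap
        sigma = match if x[i - 1] == y[j - 1] else mismatch
        return max(
            S(i - 1, j - 1) + sigma,    # x_i aligned to y_j
            S(i - 1, j) - gap,          # x_i aligned to a space
            S(i, j - 1) - gap,          # y_j aligned to a space
        )

    return S(len(x), len(y))


print(optimal_score("AG", "AAA"))    # S(2,3) for x = AGC, y = AAAC
print(optimal_score("AGC", "AAAC"))  # S(3,4), the full global score
```

Without the cache the same recursion would revisit subproblems exponentially often; with it, the work is proportional to (N+1) x (M+1), the size of the matrix.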
What happens at the last step in the
alignment of x[1..i] to y[1..j]?
1 of 3 cases (γ = space/gap penalty, a positive number):
xi aligns to yj
  x1 x2 . . . xi-1 xi
  y1 y2 . . . yj-1 yj
  S(i-1,j-1) + σ(xi,yj)
xi aligns to a gap
  x1 x2 . . . xi-1 xi
  y1 y2 . . . yj   -
  S(i-1,j) - γ
yj aligns to a gap
  x1 x2 . . . xi   -
  y1 y2 . . . yj-1 yj
  S(i,j-1) - γ

DP Implementation - 3 steps:
1. Construct sequence vs sequence matrix and fill in from
[0,0] to [N,M], the best possible scores for alignments
including the residues at [i,j]. Also, keep track of
dependencies of scores (in a pointer matrix).
2. For a global alignment of the sequences, find the score
S(N,M)
3. Trace back through pointer matrix to get the optimal
alignment. Do this position by position to retrieve
alignment of all residues of sequences, including gaps
(i.e., repeat alignment calculations in reverse order,
following path back through matrix, starting from the
position with the highest score; for a global alignment
this is [N,M]).
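The 3 implementation steps (fill the score matrix while recording pointers, read off S(N,M), trace back) can be sketched as below. This is an illustrative sketch, not the lecture's code: the name `needleman_wunsch` and the pointer labels are hypothetical, the scores are the +10 / -2 / -5 of the lecture's example, and on ties the sketch arbitrarily prefers the diagonal move, so it returns only one of possibly several optimal alignments.

```python
def needleman_wunsch(x: str, y: str,
                     match: int = 10, mismatch: int = -2, gap: int = 5):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]     # score matrix
    ptr = [[""] * (m + 1) for _ in range(n + 1)]  # pointer matrix

    # Step 1a: initialization of first column and first row.
    for i in range(1, n + 1):
        S[i][0], ptr[i][0] = -i * gap, "up"
    for j in range(1, m + 1):
        S[0][j], ptr[0][j] = -j * gap, "left"

    # Step 1b: fill from [0,0] to [N,M], remembering which case won.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            choices = [
                (S[i - 1][j - 1] + sigma, "diag"),  # x_i aligned to y_j
                (S[i - 1][j] - gap, "up"),          # x_i aligned to a space
                (S[i][j - 1] - gap, "left"),        # y_j aligned to a space
            ]
            # max() keeps the first best choice, so ties favor "diag".
            S[i][j], ptr[i][j] = max(choices, key=lambda c: c[0])

    # Steps 2 and 3: the score is S[N][M]; trace back from [N,M] to [0,0].
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("CATTCAC", "CTCGCAGC"))
# → (33, 'CAT-TCA-C', 'C-TCGCAGC')
```

Storing every tied pointer instead of a single one would let the traceback enumerate all optimal alignments rather than just this one.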
Example
Case 1: Line up xi with yj
Case 2: Line up xi with space
Case 3: Line up yj with space
[Diagrams: for each case, a partial alignment of x and y ending at positions i and j]

+10 for match, -2 for mismatch, -5 for space
    λ   C   T   C   G   C   A   G   C
λ   0  -5 -10 -15 -20 -25 -30 -35 -40
C  -5  10   5
A -10
T -15
T -20
C -25
A -30
C -35
    λ   C   T   C   G   C   A   G   C
λ   0  -5 -10 -15 -20 -25 -30 -35 -40
C  -5  10   5   0  -5 -10 -15 -20 -25
A -10   5   8   3  -2  -7   0  -5 -10
T -15   0  15  10   5   0  -5  -2  -7
T -20  -5  10  13   8   3  -2  -7  -4
C -25 -10   5  20  15  18  13   8   3
A -30 -15   0  15  18  13  28  23  18
C -35 -20  -5  10  13  28  23  26  33
Traceback can yield both optimal alignments
(the optimal score is S(7,8) = 33; the two tracebacks diverge at S(4,4) = 8, where the diagonal and left moves tie)

Affine Gap Penalty Functions
Gap penalty = h + gk
where
k = length of gap
h = gap opening penalty
g = gap continuation penalty
Can also be solved in O(nm) time using dynamic programming
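The claim that affine gap penalties can also be handled in O(nm) time can be illustrated with a three-matrix version of the recursion. This is a sketch of a common formulation (often attributed to Gotoh), not something taken from the slides: the names `gap_penalty` and `affine_global_score`, the matrix names M, X, Y, and the default values h = 4, g = 1 are all assumptions for illustration. It returns only the score, without traceback.

```python
NEG_INF = float("-inf")


def gap_penalty(k: int, h: int = 4, g: int = 1) -> int:
    """Cost of one gap of length k: opening penalty h plus g per position."""
    return h + g * k


def affine_global_score(x: str, y: str, match: int = 10, mismatch: int = -2,
                        h: int = 4, g: int = 1):
    """Best global alignment score when a gap of length k costs h + g*k."""
    n, m = len(x), len(y)
    # M: best score ending with x_i aligned to y_j
    # X: best score ending with x_i aligned to a space
    # Y: best score ending with y_j aligned to a space
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -gap_penalty(i, h, g)   # one leading gap of length i
    for j in range(1, m + 1):
        Y[0][j] = -gap_penalty(j, h, g)   # one leading gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sigma
            # Opening a gap costs h + g; extending the same gap costs only g.
            X[i][j] = max(M[i - 1][j] - h - g, X[i - 1][j] - g, Y[i - 1][j] - h - g)
            Y[i][j] = max(M[i][j - 1] - h - g, Y[i][j - 1] - g, X[i][j - 1] - h - g)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_global_score("ACGT", "AT"))                       # one gap of length 2
print(affine_global_score("CATTCAC", "CTCGCAGC", h=0, g=5))    # h = 0 reduces to the linear case
```

Each of the three matrices has (n+1) x (m+1) cells and each cell takes constant work, which is where the O(nm) bound comes from. With h = 0 the penalty is linear in gap length and the score agrees with the -5 per space example above.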