Biological Sequence Comparison/Database Searching

Biological Sequence Comparison / Database
Homology Searching
Aoife McLysaght
Summer Intern,
Compaq Computer Corporation
Ballybrit Business Park, Galway, Ireland
Database Homology Searching
• Use algorithms to increase efficiency and to provide a mathematical
basis for searches which can be translated into statistical significance
• Assumes that sequence, structure and function are inter-related
• BLAST (Basic Local Alignment Search Tool) and FastA (Fast
Alignment)
– heuristic approximations of Needleman-Wunsch and Smith-Waterman algorithms
– reduce computation
Needleman-Wunsch Algorithm
• General algorithm for sequence comparison
• Maximise a similarity score, to give ‘maximum match’
• Maximum match = largest number of residues of one sequence that
can be matched with another allowing for all possible deletions
• Finds the best GLOBAL alignment of any two sequences
• N-W involves an iterative matrix method of calculation
– All possible pairs of residues (bases or amino acids) - one from
each sequence - are represented in a 2-dimensional array
– All possible alignments (comparisons) are represented by
pathways through this array
Needleman-Wunsch Algorithm (cont.)
• Three main steps
1. Assign similarity values
2. For each cell, look at all possible pathways back to the beginning
of the sequence (allowing insertions and deletions) and give that
cell the value of the maximum scoring pathway
3. Construct an alignment (pathway) back from the highest scoring
cell to give the highest scoring alignment
Needleman-Wunsch Algorithm (cont.)
Similarity values
• A numerical value is assigned to
every cell in the array depending on
the similarity/dissimilarity of the two
residues
• These may be simple scores or
more complicated, e.g. related to
chemical similarities or frequency of
observed substitutions
• The example shown has
– match = +1
– mismatch = 0
[Figure: similarity matrix comparing the sequences MPRCLCQRJNCBA and PBRCKCRNJCJA; each cell pairing two identical residues holds 1, all other cells are left empty (0).]
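The similarity-value step can be sketched in a few lines of Python. This is a minimal illustration using the example scheme (match = +1, mismatch = 0); the function name `similarity_matrix` and the choice of which sequence indexes the rows are assumptions made for the sketch, not something the slides specify.

```python
def similarity_matrix(seq_a, seq_b, match=1, mismatch=0):
    """Return a len(seq_a) x len(seq_b) grid of per-cell similarity values."""
    return [[match if x == y else mismatch for y in seq_b] for x in seq_a]


SEQ_A = "MPRCLCQRJNCBA"  # indexes the rows (an assumption of this sketch)
SEQ_B = "PBRCKCRNJCJA"   # indexes the columns

S = similarity_matrix(SEQ_A, SEQ_B)

# Print the grid with residue labels; "." stands for a mismatch (0).
print("  " + " ".join(SEQ_B))
for residue, row in zip(SEQ_A, S):
    print(residue + " " + " ".join(str(v) if v else "." for v in row))
```

A substitution table based on chemical similarity or observed substitution frequencies would replace the simple `match`/`mismatch` test with a lookup.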
Needleman-Wunsch Algorithm (cont.)
Score pathways through array
• For each cell want to know the
maximum possible score for an
alignment ending at that point
• Searches subrow and subcolumn,
as shown, for the highest score
• Adds this to the score for the
current cell
• Proceeds row by row through the
array
• Gap penalty for the introduction of
gaps in the alignment (presumed
insertions or deletions into one
sequence) … here = 0
[Figure: partially filled score matrix for MPRCLCQRJNCBA against PBRCKCRNJCJA, filled row by row; the cell currently being scored is marked "?".]
$$H_{i,j} = \max\left\{ H_{i-1,j-1} + s(a_i,b_j),\ \max_{k}\left\{ H_{i-k,j-1} - W_k + s(a_i,b_j) \right\},\ \max_{l}\left\{ H_{i-1,j-l} - W_l + s(a_i,b_j) \right\} \right\}$$
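The scoring pass can be sketched as follows, as a rough Python rendering of the recurrence above. It assumes a single constant gap penalty `gap` (0 in the example), takes k and l to start at 2 so that the gap-free diagonal step is handled separately, and gives cells in the first row and column only their own similarity value. The name `nw_fill` is hypothetical.

```python
def nw_fill(seq_a, seq_b, match=1, mismatch=0, gap=0):
    """Fill the score array row by row, following the recurrence above."""
    n, m = len(seq_a), len(seq_b)
    H = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = match if seq_a[i] == seq_b[j] else mismatch
            best_prev = 0  # first row/column: no earlier cell to extend
            if i > 0 and j > 0:
                best_prev = H[i - 1][j - 1]          # extend without a gap
                for k in range(2, i + 1):            # subcolumn above the diagonal cell
                    best_prev = max(best_prev, H[i - k][j - 1] - gap)
                for l in range(2, j + 1):            # subrow left of the diagonal cell
                    best_prev = max(best_prev, H[i - 1][j - l] - gap)
            H[i][j] = best_prev + s
    return H


H = nw_fill("MPRCLCQRJNCBA", "PBRCKCRNJCJA")

# The maximum match lies in the outer (last) row or column.
maximum_match = max(max(H[-1]), max(row[-1] for row in H))
print(maximum_match)
```

Searching the whole subrow and subcolumn for every cell is cubic in sequence length; it is written that way here only to mirror the slide.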
Needleman-Wunsch Algorithm (cont.)
Construct alignment
• The alignment score is cumulative: cell
values are added along a path through
the array
• The best alignment has the highest
score, i.e. the maximum match
• Maximum match = largest number
obtained by summing the cell
values along any single pathway
• The maximum match will ALWAYS
be somewhere in the outer row or
column shown
• The alignment is constructed by
working backwards from the
maximum match
[Figure: completed score matrix for MPRCLCQRJNCBA against PBRCKCRNJCJA; the maximum match is 8.]
MP-RCLCQR-JNCBA
 | || | | | | |
-PBRCKC-RNJ-CJA
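Working backwards from the maximum match can be sketched as below. Note that this sketch swaps in the now-standard three-neighbour formulation of global alignment with a linear gap penalty rather than the subrow/subcolumn search shown on the slides; with match = +1, mismatch = 0 and gap = 0 it reaches the same maximum match. The function name `needleman_wunsch` and the tie-breaking order are assumptions, so the alignment printed may differ from the one shown while having the same score.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=0):
    """Global alignment: fill the score array, then trace back from the last cell."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] - gap
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - gap, F[i][j - 1] - gap)

    # Traceback: at each cell, step to a predecessor that explains its score.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - gap:
            top.append(a[i - 1])      # residue of a opposite a gap
            bottom.append("-")
            i -= 1
        else:
            top.append("-")           # residue of b opposite a gap
            bottom.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("MPRCLCQRJNCBA", "PBRCKCRNJCJA")
print(score)
print(top)
print(bottom)
```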
Needleman-Wunsch Algorithm (cont.)
Statistical Significance
• Maximum match is a function of sequence relationship and
composition
• Would like to know probability of obtaining result (maximum match)
from a pair of random sequences
• Estimate this experimentally
– build two sets of random sequences, each set with the same
composition as one of the real proteins, and form pairs by randomly
drawing one member from each set
– if the value found for the real proteins is significantly different from
that for the random pairs, then the difference is a function of the
sequences alone and not of their composition
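The experimental estimate can be sketched as a small shuffling test. This is an illustration only: shuffling each real sequence is used as one simple way to obtain random sequences of identical composition, and expressing the result as a z-score is an assumption of this sketch rather than something the slides prescribe. The helper names (`match_score`, `shuffled`, `random_pair_scores`) are hypothetical.

```python
import random
import statistics


def match_score(a, b):
    """Maximum match with match = +1, mismatch = 0 and gap penalty = 0."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(max(prev[j - 1] + (x == y), prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def shuffled(seq, rng):
    """Random sequence with exactly the same composition as seq."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def random_pair_scores(a, b, trials=500, seed=0):
    """Scores for pairs formed from one shuffled copy of each sequence."""
    rng = random.Random(seed)
    return [match_score(shuffled(a, rng), shuffled(b, rng)) for _ in range(trials)]


a, b = "MPRCLCQRJNCBA", "PBRCKCRNJCJA"
real = match_score(a, b)
scores = random_pair_scores(a, b)
mean, sd = statistics.mean(scores), statistics.stdev(scores)
print(f"real = {real}, random mean = {mean:.2f}, sd = {sd:.2f}")
if sd > 0:
    print(f"z = {(real - mean) / sd:.2f}")
```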
Smith-Waterman Algorithm
• Instead of looking at each sequence in its entirety, this compares
segments of all possible lengths (LOCAL alignments) and chooses
whichever maximises the similarity measure
• For every cell the algorithm calculates ALL possible paths leading to
it. These paths can be of any length and can contain insertions and
deletions
Smith-Waterman Algorithm (cont.)
• Only works effectively when gap
penalties are used
• In example shown
– match = +1
– mismatch = -1/3
– gap penalty W_k = 1 + k/3, i.e. a gap
scores -(1 + k/3) (k = extent of gap)
• Start with all cell values = 0
• Looks in subcolumn and subrow
shown and in direct diagonal for
the score that is highest once the
alignment score or gap penalty is
taken into account
[Figure: partially filled Smith-Waterman score matrix for CAGCCUCGCUUAG against AAUGCCAUUGACGG; the cell currently being scored is marked "?".]
$$H_{i,j} = \max\left\{ H_{i-1,j-1} + s(a_i,b_j),\ \max_{k}\left\{ H_{i-k,j} - W_k \right\},\ \max_{l}\left\{ H_{i,j-l} - W_l \right\},\ 0 \right\}$$
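The recurrence can be sketched in Python as follows, using the example scheme (match = +1, mismatch = -1/3, W_k = 1 + k/3). Exact fractions are used so that scores such as 1/3 compare cleanly. The names `sw_fill` and `gap_penalty`, the extra zero row and column used as the starting boundary, and the choice of CAGCCUCGCUUAG as the row sequence are assumptions of this sketch.

```python
from fractions import Fraction

MATCH = Fraction(1)
MISMATCH = Fraction(-1, 3)


def gap_penalty(k):
    """W_k = 1 + k/3 for a gap of length k."""
    return 1 + Fraction(k, 3)


def sw_fill(seq_a, seq_b):
    """Fill the local-alignment score array; row 0 and column 0 stay at zero."""
    n, m = len(seq_a), len(seq_b)
    H = [[Fraction(0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = MATCH if seq_a[i - 1] == seq_b[j - 1] else MISMATCH
            diagonal = H[i - 1][j - 1] + s
            column = max(H[i - k][j] - gap_penalty(k) for k in range(1, i + 1))
            row = max(H[i][j - l] - gap_penalty(l) for l in range(1, j + 1))
            H[i][j] = max(diagonal, column, row, Fraction(0))
    return H


H = sw_fill("CAGCCUCGCUUAG", "AAUGCCAUUGACGG")
best = max(max(r) for r in H)
print(float(best))
```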
Smith-Waterman Algorithm (cont.)
• Four possible ways of forming a path
For every residue in the query sequence
1. Align with next residue of db sequence … score is previous score
plus similarity score for the two residues
2. Deletion (i.e. match residue of query with a gap) … score is
previous score minus gap penalty dependent on size of gap
3. Insertion (i.e. match residue of db sequence with a gap) … score
is previous score minus gap penalty dependent on size of gap
4. Stop … score is zero
• Choose whichever of these is the highest
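As a worked instance of the four options, take the cell marked "?" in the partially filled example matrix. This assumes rows are indexed by CAGCCUCGCUUAG and columns by AAUGCCAUUGACGG, so the cell pairs the seventh row residue (C) with the sixth column residue (C), and that W_k = 1 + k/3. The neighbouring values are read from the example matrix. The labels "column gap" and "row gap" are ours, since the slides do not say which sequence is the query.

```latex
\begin{align*}
\text{align:}      &\quad H_{6,5} + s(\mathrm{C},\mathrm{C}) = \tfrac{1}{3} + 1 = \tfrac{4}{3} \\
\text{column gap:} &\quad \max_k \{ H_{7-k,6} - W_k \}
                    = \max\{ \tfrac{5}{3} - \tfrac{4}{3},\ 3 - \tfrac{5}{3},\ \dots \} = \tfrac{4}{3} \\
\text{row gap:}    &\quad \max_l \{ H_{7,6-l} - W_l \}
                    = \max\{ 1 - \tfrac{4}{3},\ \dots \} < 0 \\
\text{stop:}       &\quad 0 \\
H_{7,6}            &= \max\{ \tfrac{4}{3},\ \tfrac{4}{3},\ \text{(negative)},\ 0 \} = \tfrac{4}{3} \approx 1.3
\end{align*}
```

Two options tie here, which is why the choice of path can be ambiguous even though the cell value is not.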
Smith-Waterman Algorithm (cont.)
Construct Alignment
• The score in each cell is the
maximum possible score for
an alignment of ANY
LENGTH ending at those
coordinates
• Trace pathway back from
highest scoring cell
• This cell can be anywhere
in the array
• Align highest scoring
segment
[Figure: completed Smith-Waterman score matrix for CAGCCUCGCUUAG against AAUGCCAUUGACGG; the highest-scoring cell has the value 3.3.]
Highest-scoring aligned segment:
GCC-UCG
GCCAUUG
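Tracing the pathway back from the highest-scoring cell can be sketched as below, again with the example scheme (match = +1, mismatch = -1/3, W_k = 1 + k/3) and exact fractions. The function name `smith_waterman`, the preference for a diagonal step when several predecessors tie, and the use of CAGCCUCGCUUAG as the row sequence are assumptions of this sketch.

```python
from fractions import Fraction


def W(k):
    """Gap penalty W_k = 1 + k/3."""
    return 1 + Fraction(k, 3)


def smith_waterman(a, b, match=Fraction(1), mismatch=Fraction(-1, 3)):
    n, m = len(a), len(b)
    H = [[Fraction(0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                Fraction(0),
                H[i - 1][j - 1] + s,
                max(H[i - k][j] - W(k) for k in range(1, i + 1)),
                max(H[i][j - l] - W(l) for l in range(1, j + 1)),
            )

    # Start from the highest-scoring cell, which can be anywhere in the array.
    score, i, j = max((H[i][j], i, j) for i in range(n + 1) for j in range(m + 1))
    top, bottom = [], []
    while H[i][j] > 0:  # a zero marks the start of the local alignment
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
            continue
        k = next((k for k in range(1, i + 1) if H[i][j] == H[i - k][j] - W(k)), None)
        if k is not None:  # k residues of a opposite a gap
            top.extend(reversed(a[i - k:i]))
            bottom.extend("-" * k)
            i -= k
            continue
        l = next(l for l in range(1, j + 1) if H[i][j] == H[i][j - l] - W(l))
        top.extend("-" * l)  # l residues of b opposite a gap
        bottom.extend(reversed(b[j - l:j]))
        j -= l
    return score, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = smith_waterman("CAGCCUCGCUUAG", "AAUGCCAUUGACGG")
print(float(score))
print(top)
print(bottom)
```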
Differences
• Needleman-Wunsch
1. Global alignments
2. Requires alignment score for a pair
of residues to be >=0
3. No gap penalty required
4. Score cannot decrease between
two cells of a pathway
• Smith-Waterman
1. Local alignments
2. Residue alignment score may be
positive or negative
3. Requires a gap penalty to work
effectively
4. Score can increase, decrease or
stay level between two cells of a
pathway