Sequence analysis of nucleic acids and proteins: part 1
Similarity search
Based on Chapter 3 of Post-genome Bioinformatics by Minoru Kanehisa, Oxford University Press, 2000
Search and learning problems in sequence analysis

Problems in Biological Science
• Similarity search
- Pairwise sequence alignment
- Database search for similar sequences
- Multiple sequence alignment
- Phylogenetic tree reconstruction
- Protein 3D structure alignment
• Structure/function prediction: ab initio prediction
- RNA secondary structure prediction
- RNA 3D structure prediction
- Protein 3D structure prediction
• Structure/function prediction: knowledge based prediction
- Motif extraction
- Functional site prediction
- Cellular localization prediction
- Coding region prediction
- Transmembrane domain prediction
- Protein secondary structure prediction
- Protein 3D structure prediction
• Molecular classification
- Superfamily classification
- Ortholog/paralog grouping of genes
- 3D fold classification

Math/Stat/CompSci method
• Optimization algorithms
- Dynamic programming (DP)
- Simulated annealing (SA)
- Genetic algorithms (GA)
- Markov Chain Monte Carlo (MCMC: Metropolis and Gibbs samplers)
- Hopfield neural network
• Pattern recognition and learning algorithms
- Discriminant analysis
- Neural networks
- Support vector machines
- Hidden Markov models (HMM)
- Formal grammar
- CART
• Clustering algorithms
- Hierarchical, k-means, etc.
- PCA, MDS, etc.
- Self-organizing maps, etc.
A comparison of the homology search and the motif search for functional interpretation of sequence information.
[Figure. Homology search: new sequence -> retrieval from the sequence database (primary data) -> similar sequence -> expert knowledge -> sequence interpretation. Motif search: new sequence -> inference with the motif library (empirical rules, derived from the sequence database by knowledge acquisition) -> expert knowledge -> sequence interpretation.]
Pairwise sequence alignment by the dynamic programming algorithm. The algorithm involves finding the optimal path in the path matrix (a), which is equivalent to searching for the optimal solution in the search tree (b).
[Figure. (a) Path matrix for the sequences AIMS and AMOS. (b) Search tree, with pruning by an optimization function. Resulting alignment:
AIM-S
A-MOS]
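Methods for computing the optimal score in the dynamic programming algorithm: (a) the gap penalty is a constant; (b) the gap penalty is a linear function of the gap length.
[Figure. (a) $D_{i,j}$ is reached from $D_{i-1,j-1}$ with the weight $w_{s(i),t(j)}$, and from $D_{i-1,j}$ and $D_{i,j-1}$ with the weight $d$. (b) Three values $D_{i,j}^{(1)}$, $D_{i,j}^{(2)}$ and $D_{i,j}^{(3)}$ are kept for each cell, with edges labelled $w_{s(i),t(j)}$, $d$ and $b$.]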
Concepts of global and local optimality in the pairwise sequence alignment. The distinction is made as to how the initial values are assigned to the path matrix.
[Figure. Panels: (a) Global vs. Global, (b) Local vs. Global, (c) Local vs. Local. In the local cases the corresponding border of the path matrix is initialized with zeros (0 0 . . . 0).]
The order of computing matrix elements in the path matrix, which is suitable for (a) sequential processing and (b) parallel processing.
[Figure. Panels (a) and (b) show the cell (i, j) together with its neighbours, such as (i-1, j-1), (i-1, j) and (i, j-1), and the order in which the cells are visited.]
The dynamic programming algorithm can be applied to limited areas, rather than to the entire matrix, after rapidly searching the diagonals that contain candidate markers.
[Figure. An n x m path matrix (row index i, column index j) and its n + m - 1 diagonals (index l).]
The hashing technique for rapid sequence comparison. In this case the horizontal sequence is converted to a hash table, which contains the locations of the four nucleotides.

Query Sequence (positions 1 to 10): A T C A C A C G G C

Hash Table
Key   Address
A     1 4 6
C     3 5 7 10
G     8 9
T     2

[Figure. Dot matrix of the query sequence (horizontal) against the sequence T A T C G C A G T C A A T T C . . . (vertical); asterisks mark matching positions.]
Used in FASTA
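The hash table above can be sketched in a few lines of Python. This is a minimal illustration rather than the FASTA implementation: the function names `build_hash_table` and `diagonal_hits` and the word length parameter `k` are assumptions made for this example. The second function counts matches per diagonal, the quantity used to pick the limited areas for dynamic programming.

```python
from collections import defaultdict

def build_hash_table(sequence, k=1):
    """Map every word of length k to the 1-based positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        table[sequence[pos:pos + k]].append(pos + 1)
    return dict(table)

def diagonal_hits(table, other, k=1):
    """Count word matches on each diagonal (offset = query pos - other pos)."""
    counts = defaultdict(int)
    for j in range(len(other) - k + 1):
        for i in table.get(other[j:j + k], []):
            counts[i - (j + 1)] += 1
    return dict(counts)

table = build_hash_table("ATCACACGGC")
for key in sorted(table):
    print(key, table[key])
```

Diagonals with many hits are the candidates worth rescoring; with k = 1 almost every diagonal has some hits, so a longer word length is usually more selective.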
An example of the finite state automaton for pattern matching
[Figure. State-transition diagram with states Q0, Q1, Q2, Q3 and Q4 and transitions labelled A, B and C. Bold arrows lead to outputs indicating patterns have been found.]
Used in BLAST
The tree-based progressive method for multiple sequence alignment, which utilizes: (a) a dendrogram obtained by cluster analysis and (b) group alignment for pairwise comparison of groups of sequences.
[Figure. (a) Dendrogram of the sequences DEPGG3, DEBYG3, DEZYG3, DEBSGF and DEHUG3. (b) Group alignment; the aligned rows shown are:
L R R - A R T A S A
L - R G A R A A A E
L W R D G R G A L Q
L W R G G R G A A Q
D W R - G R T A S G]
Possible tree topologies in the phylogenetic analysis of: (a) three sequences or (b) four sequences. Filled circles represent extant sequences, while open circles represent common ancestors.
[Figure. Tree topologies over the sequences A, B, C (three sequences) and A, B, C, D (four sequences).]
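Simulated annealing and Metropolis Monte Carlo methods are based on the concept of thermal fluctuations in the energy functions.
\[ \Delta E = E(x'_n) - E(x_n) \]
\[ p = \begin{cases} 1 & \text{when } \Delta E \le 0 \\ \exp(-\Delta E / T_n) & \text{when } \Delta E > 0 \end{cases} \]
[Figure. Energy E plotted against the state x.]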
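The acceptance rule can be written out as a short Python sketch. Only the rule itself (accept with probability 1 when ΔE ≤ 0, otherwise with probability exp(-ΔE/T_n)) comes from the slide; the `anneal` driver, the toy energy function and the geometric cooling schedule are assumptions added for illustration.

```python
import math
import random

def acceptance_probability(delta_e, temperature):
    """Metropolis rule: always accept downhill moves, sometimes uphill ones."""
    if delta_e <= 0:
        return 1.0
    return math.exp(-delta_e / temperature)

def anneal(energy, neighbour, x0, temperatures, seed=0):
    """Run one move per temperature and return the best state visited."""
    rng = random.Random(seed)
    x, best = x0, x0
    for t in temperatures:
        candidate = neighbour(x, rng)
        delta_e = energy(candidate) - energy(x)
        if rng.random() < acceptance_probability(delta_e, t):
            x = candidate
        if energy(x) < energy(best):
            best = x
    return best

# Toy problem (hypothetical): minimise (x - 3)^2 over the integers.
energy = lambda x: (x - 3) ** 2
neighbour = lambda x, rng: x + rng.choice((-1, 1))
schedule = [10.0 * 0.99 ** n for n in range(2000)]  # assumed cooling schedule
print(anneal(energy, neighbour, 20, schedule))
```

At high temperature uphill moves are accepted often, which lets the search leave local minima; as T_n falls the rule approaches plain descent.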
Dynamic programming to find edit distances
- Edit operations: M (match), R (replace), I (insert), D (delete)
- Edit transcript: A string over the alphabet M, R, I, D that describes a transformation of one string into another. Example:
      R D I M D M
      M A - T H S
      A - R T - S
- Edit (Levenshtein) distance: The minimum number of edit operations necessary to transform one string into another. (Note: matches are not counted.) Example: the transcript above costs
      R D I M D M
      1+1+1+0+1+0 = 4
  This is the cost of one particular transcript, so it is an upper bound on the edit distance rather than necessarily the minimum.
The recurrence
- Stage: position in the edit transcript;
- State: I, D, M, or R;
- Optimal value function: D(i, j), where D(i, j) = edit distance of Seq1[1...i] and Seq2[1...j]
- Recurrence relation:
\[ D(i,j) = \min \begin{cases} 1 + D(i-1,j) \\ 1 + D(i,j-1) \\ t(i,j) + D(i-1,j-1) \end{cases} \quad \text{where} \quad t(i,j) = \begin{cases} 1, & \mathrm{Seq1}(i) \neq \mathrm{Seq2}(j) \\ 0, & \mathrm{Seq1}(i) = \mathrm{Seq2}(j) \end{cases} \]
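As a minimal sketch, the recurrence translates directly into a table-filling routine. The function name `edit_distance_table` is an assumption of this example, and the boundary values D(i, 0) = i and D(0, j) = j are taken from the first row and column of the tabulation slides for MATHS and ARTS.

```python
def edit_distance_table(seq1, seq2):
    """Fill D(i, j) = edit distance of seq1[:i] and seq2[:j]."""
    n, m = len(seq1), len(seq2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # delete i characters of seq1
    for j in range(1, m + 1):
        D[0][j] = j                      # insert j characters of seq2
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if seq1[i - 1] == seq2[j - 1] else 1
            D[i][j] = min(1 + D[i - 1][j],       # deletion
                          1 + D[i][j - 1],       # insertion
                          t + D[i - 1][j - 1])   # match or replacement
    return D

for row in edit_distance_table("MATHS", "ARTS"):
    print(row)
```

The bottom-right cell holds the edit distance of the two complete strings.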
The tabulation, D(i, j)
Row 0 and column 0 are initialized with D(0, j) = j and D(i, 0) = i; the remaining cells are then filled in row by row from the recurrence. The completed table:

                  Seq2(j)
                       A  R  T  S
                    0  1  2  3  4
Seq1(i)      0      0  1  2  3  4
          M  1      1  1  2  3  4
          A  2      2  1  2  3  4
          T  3      3  2  2  2  3
          H  4      4  3  3  3  3
          S  5      5  4  4  4  3
The traceback
[The completed table D(i, j) above, with the optimal paths traced back from D(5, 4) = 3 to D(0, 0).]
The solutions - #1
  Cost:        1 0 1 1 0  = 3
  Transcript:  D M R R M
  Seq1:        M A T H S
  Seq2:        - A R T S
The solutions - #2
  Cost:        1 0 1 0 1 0  = 3
  Transcript:  D M I M D M
  Seq1:        M A - T H S
  Seq2:        - A R T - S
The solutions - #3
  Cost:        1 1 0 1 0  = 3
  Transcript:  R R M D M
  Seq1:        M A T H S
  Seq2:        A R T - S

“Life must be lived forwards and understood backwards.”
- Søren Kierkegaard
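The traceback can be sketched as a recursive walk from D(n, m) back to D(0, 0) that follows every predecessor consistent with the recurrence. The function name `optimal_transcripts` and the choice to enumerate all optimal transcripts, rather than stop at the first one, are assumptions of this example.

```python
def optimal_transcripts(seq1, seq2):
    """Return every minimum-cost edit transcript over the alphabet M, R, I, D."""
    n, m = len(seq1), len(seq2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if seq1[i - 1] == seq2[j - 1] else 1
            D[i][j] = min(1 + D[i - 1][j], 1 + D[i][j - 1], t + D[i - 1][j - 1])

    def walk(i, j):
        if i == 0 and j == 0:
            return [""]
        found = []
        if i > 0 and D[i][j] == D[i - 1][j] + 1:        # seq1 character deleted
            found += [t + "D" for t in walk(i - 1, j)]
        if j > 0 and D[i][j] == D[i][j - 1] + 1:        # seq2 character inserted
            found += [t + "I" for t in walk(i, j - 1)]
        if i > 0 and j > 0:
            cost = 0 if seq1[i - 1] == seq2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + cost:       # match or replacement
                found += [t + ("R" if cost else "M") for t in walk(i - 1, j - 1)]
        return found

    return sorted(walk(n, m))

print(optimal_transcripts("MATHS", "ARTS"))
```

For MATHS and ARTS the walk finds exactly the three transcripts listed as solutions #1 to #3; the number of optimal paths can grow quickly for longer sequences, so practical tools usually report only one.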
BLOSUM62 scoring matrix

134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLI
    |     |||         |       |        ||||||    |    || ||
137 LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM

Example entries: D:D = +6, D:R = -2

    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   9
S  -1   4
T  -1   1   5
P  -3  -1  -1   7
A   0   1   0  -1   4
G  -3   0  -2  -2   0   6
N  -3   1   0  -2  -2   0   6
D  -3   0  -1  -1  -2  -1   1   6
E  -4   0  -1  -1  -1  -2   0   2   5
Q  -3   0  -1  -1  -1  -2   0   0   2   5
H  -3  -1  -2  -2  -2  -2   1  -1   0   0   8
R  -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5
K  -3   0  -1  -1  -1  -2   0  -1   1   1  -1   2   5
M  -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5
I  -1  -2  -1  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4
L  -1  -2  -1  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4
V  -1  -2   0  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4
F  -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6
Y  -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7
W  -2  -3  -2  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11

From Henikoff 1996
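To show how such a matrix is used, here is a small sketch that sums residue-pair scores over a gap-free alignment. The dictionary holds only the handful of BLOSUM62 entries needed for the conserved segment of the example alignment; the names `BLOSUM62_SUBSET`, `pair_score` and `ungapped_score` are hypothetical, and a real program would load the full matrix.

```python
# A few BLOSUM62 entries, keyed by the residue pair in alphabetical order.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("A", "P"): -1, ("D", "D"): 6, ("D", "R"): -2,
    ("G", "S"): 0, ("H", "H"): 8, ("L", "L"): 4, ("P", "P"): 7,
}

def pair_score(x, y, matrix=BLOSUM62_SUBSET):
    """Look up a symmetric score; raises KeyError for pairs not in the subset."""
    key = (x, y) if x <= y else (y, x)
    return matrix[key]

def ungapped_score(seq1, seq2, matrix=BLOSUM62_SUBSET):
    """Sum the pair scores of two equal-length, gap-free fragments."""
    if len(seq1) != len(seq2):
        raise ValueError("fragments must have the same length")
    return sum(pair_score(x, y, matrix) for x, y in zip(seq1, seq2))

# Segment taken from the two sequences in the example alignment.
print(ungapped_score("APDHPLAS", "PPDHPLAG"))
```

Identical rare residues such as H:H contribute more than identical common ones such as A:A, which is why the conserved PDHPLA block dominates the total.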
Scoring Matrices
• Physical/chemical similarities
- comparing two sequences according to the properties of their residues may highlight regions of structural similarity
• Identity matrices
- by stressing only identities in the alignment, stretches of sequence that may have diverged will not penalise any remaining common features
Scoring Matrices (ctd)
• As the direct source of residue-by-residue comparison scores, the scoring matrix you choose will have a major impact on the alignment calculated
• The most commonly used will be one of the mutation matrices (PAM, BLOSUM)
• The matrix that performs best will be the matrix that reflects the evolutionary separation of the sequences being aligned
Probability and Likelihood
Some probabilities of observations depend on unknown parameters. E.g. if
O = SFFSFFF
then under independence
$pr(O) = p^2 (1-p)^5$.
We can calculate this for any observation O, so in a sense we have a 2-variable function
$pr(O, p)$ or $pr(O \mid p)$
depending on O and p ($0 < p < 1$).
Likelihood: holds O fixed, varies p.
Maximum Likelihood estimate: the p which maximizes pr(O, p), O fixed, denoted $\hat{p}$.
E.g. above, $\hat{p} = 2/7$.
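A numerical check of the estimate, as a sketch: the likelihood is evaluated on a grid of p values with O held fixed, and the maximizing grid point is returned. The function names, the grid size, and the reading of p as the probability of an S are assumptions of this example; a closed-form argument gives the same answer.

```python
def likelihood(p, observations):
    """pr(O | p) for independent trials, with p the probability of an 'S'."""
    successes = observations.count("S")
    failures = observations.count("F")
    return p ** successes * (1 - p) ** failures

def mle_by_grid(observations, steps=7000):
    """Hold O fixed, vary p over a grid in (0, 1), keep the maximizer."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda p: likelihood(p, observations))

p_hat = mle_by_grid("SFFSFFF")
print(p_hat, 2 / 7)
```

The grid size 7000 is chosen only so that 2/7 lies exactly on the grid; with other sizes the result is the nearest grid point.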
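Statistical motivation for alignment scores

Alignment:
    AGCTGATCA...
    AACCGGTTA...

Hypotheses:
H = homologous (indep. sites, Jukes-Cantor)
R = random (indep. sites, equal freq.)

\[ pr(\text{data} \mid H) = pr\!\left(\tbinom{A}{A} \,\middle|\, H\right) \times pr\!\left(\tbinom{G}{A} \,\middle|\, H\right) \times \cdots = (1-p)^a p^d \]
where d = # disagreements, a = # agreements, and $p = \tfrac{3}{4}\left(1 - e^{-8\alpha t}\right)$.

\[ pr(\text{data} \mid R) = pr\!\left(\tbinom{A}{A} \,\middle|\, R\right) \times pr\!\left(\tbinom{G}{A} \,\middle|\, R\right) \times \cdots = \left(\tfrac{1}{4}\right)^a \left(\tfrac{3}{4}\right)^d \]

\[ \log\left\{ \frac{pr(\text{data} \mid H)}{pr(\text{data} \mid R)} \right\} = a \times \log\frac{1-p}{1/4} + d \times \log\frac{p}{3/4} \]

\[ \text{score} = a \times s + d \times (-m) \]

Since $p < \tfrac{3}{4}$, $\log\frac{p}{3/4} < 0$ and $\log\frac{1-p}{1/4} > 0$:
s > 0 match score, -m < 0 mismatch penalty.

Note that if $\alpha t \approx 0$, then $p \approx 6\alpha t$ and $1-p \approx 1$, and so $s \approx \log 4$, while $-m \approx \log 8\alpha t$ is large and negative: a big difference in the two scores.

Conversely, if $\alpha t$ is large, $p = \tfrac{3}{4}(1-\varepsilon)$, $\frac{p}{3/4} = 1-\varepsilon$, and $-m = \log(1-\varepsilon) \approx -\varepsilon$, while $1-p = \tfrac{1}{4}(1+3\varepsilon)$, $\frac{1-p}{1/4} = 1+3\varepsilon$, and so $s = \log(1+3\varepsilon) \approx 3\varepsilon$. Thus the scores are about 3:1.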
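The two limiting cases can be checked numerically with a short sketch. The function names are assumptions of this example, natural logarithms are used, and the values of αt are arbitrary illustrations.

```python
import math

def jukes_cantor_scores(alpha_t):
    """Return (s, m): match score s > 0 and mismatch penalty m > 0."""
    p = 0.75 * (1.0 - math.exp(-8.0 * alpha_t))   # pr(disagreement | H)
    s = math.log((1.0 - p) / 0.25)
    m = -math.log(p / 0.75)
    return s, m

def log_odds_score(agreements, disagreements, alpha_t):
    """score = a * s + d * (-m) for a gap-free DNA alignment."""
    s, m = jukes_cantor_scores(alpha_t)
    return agreements * s - disagreements * m

for alpha_t in (1e-6, 0.01, 0.1, 1.0):
    s, m = jukes_cantor_scores(alpha_t)
    print(alpha_t, round(s, 6), round(m, 6), round(s / m, 4))

# The nine aligned columns shown above have a = 5 agreements, d = 4 disagreements.
print(log_odds_score(5, 4, 0.1))
```

For small αt the penalty m dwarfs the match score s, and for large αt both shrink toward zero with s/m approaching 3.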
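We can do the same with any other Markov substitution matrix for molecular evolution. E.g. with a PAM or BLOSUM matrix of probabilities, let
\[ \text{data} = \begin{matrix} a_1 \, \ldots \, a_m \\ b_1 \, \ldots \, b_m \end{matrix} \]
be a gap-free alignment of two a.a. sequence fragments. Then
\[ pr(\text{data} \mid H) = \prod_{i=1}^{m} \pi_{a_i} \, p_{a_i b_i}(2t), \qquad pr(\text{data} \mid R) = \prod_{i=1}^{m} \pi_{a_i} \pi_{b_i}, \]
\[ \log\left\{ \frac{pr(\text{data} \mid H)}{pr(\text{data} \mid R)} \right\} = \sum_{i=1}^{m} \log\left\{ \frac{p_{a_i b_i}(2t)}{\pi_{b_i}} \right\}. \]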
The elements of a log-odds score matrix are typically > 0 on the diagonal and < 0 off the diagonal, but not always.
Also, the relative sizes of match scores and mismatch penalties increase as #PAMs ($\alpha t$) decreases. Thus PAM(120) is more stringent than PAM(250), while PAM(360) is less stringent than PAM(250).
PAM(0), the identity matrix, is the toughest.
There are plenty of score matrices based on other principles.
Local alignment
Local alignment aligns only the most similar regions of two sequences.
Why? Often distantly related proteins have only isolated regions (e.g. active sites) of similarity, reflecting the modular nature of proteins.
How? The dynamic programming algorithm we have seen needs only a minor modification to yield the best local alignment between two sequences. This is the Smith-Waterman algorithm, which is named bestfit in GCG.
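A minimal sketch of that modification, assuming a similarity score is maximized rather than a distance minimized: every cell is floored at zero, so an alignment may start anywhere, and the answer is the largest cell rather than the corner cell. The scoring parameters (match, mismatch, gap) and the function name are illustrative assumptions, and only the score is returned, not the alignment.

```python
def local_alignment_score(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a constant per-residue gap penalty."""
    n, m = len(seq1), len(seq2)
    H = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if seq1[i - 1] == seq2[j - 1] else mismatch
            H[i][j] = max(0,                      # start a new local alignment
                          H[i - 1][j - 1] + w,    # align two residues
                          H[i - 1][j] + gap,      # gap in seq2
                          H[i][j - 1] + gap)      # gap in seq1
            best = max(best, H[i][j])
    return best

print(local_alignment_score("XXPDHPLAYY", "ZZZPDHPLAZ"))
```

Recovering the aligned region would need a traceback from the best cell back to the first zero, analogous to the edit-distance traceback.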
Similar Amino Acid Sequences: Chance or Common Ancestry?
Title of paper by Russell F. Doolittle, Science
214 (1981)1
The question arises every time an alignment is done without prior
knowledge of homology.
The usual caveats:
• the scientific goal is not necessarily the same as the mathematical/statistical goal
• significance may not mean homology
• non-significance may not mean non-homology
Early use of statistics
• Generate random permutations of the sequence(s)
• Obtain the average (av) and standard deviation (SD) of the random similarity scores
• Compute z = (observed score - av)/SD
• Think normal (e.g. 4 is a very large z)
This approach is still used for global alignments, but is no longer seen as appropriate for local alignments, since the score is optimized, and random optimal scores do not follow the normal law.
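The permutation procedure can be sketched as follows. The scoring function here, a count of identical positions in an ungapped comparison, is a deliberately simple stand-in chosen for this example; the function names, the number of permutations and the fixed seed are likewise assumptions.

```python
import random
import statistics

def identity_score(seq1, seq2):
    """Toy similarity score: number of identical positions (no gaps)."""
    return sum(1 for x, y in zip(seq1, seq2) if x == y)

def permutation_z_score(seq1, seq2, score=identity_score, n_perm=200, seed=0):
    """z = (observed score - av) / SD over random permutations of seq2."""
    rng = random.Random(seed)
    observed = score(seq1, seq2)
    letters = list(seq2)
    random_scores = []
    for _ in range(n_perm):
        rng.shuffle(letters)                 # keeps composition, destroys order
        random_scores.append(score(seq1, "".join(letters)))
    av = statistics.mean(random_scores)
    sd = statistics.pstdev(random_scores)
    if sd == 0:
        raise ValueError("all permuted scores are equal; z is undefined")
    return (observed - av) / sd

print(permutation_z_score("ACGTTGCAACGTAGCTAGCT", "ACGTTGCAACGTAGCTAGCT"))
```

Shuffling preserves residue composition, so the z-score measures how much of the observed score depends on residue order rather than on composition alone.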
More recent statistical
developments:
Theory developed by Karlin and collaborators
in 1990-4 and, independently, by Waterman
and collaborators in 1988-94. Incorporates
the fact that the score has been optimized.
Immediately implemented in BLAST. Later
appears in a similar form in FASTA and
elsewhere.
The theory applies to the ensemble of random
• pairs of sequences, with fixed
• possibly different lengths,
• possibly different residue distributions
• and ungapped alignments
(extensions to gapped alignments coming now)
The theoretical distribution of random similarity scores
• is universal in form (see diagram)
• with scale parameter depending on the two residue distributions, and the substitution scores used
• and location parameter depending on the above, plus the lengths of the two sequences
For m, n large, the optimal random score S has the extreme-value distribution with cdf
\[ \exp\{-\exp\{-\lambda (s - u)\}\} \]
where $\lambda$ is the unique positive solution (in t) of
\[ \sum_{i,j} p_i \, q_j \exp(s_{ij} t) = 1, \]
and
\[ u = \frac{1}{\lambda} \log(Kmn) \]
and K is given by a series depending on the compositions $(p_i)$ and $(q_j)$ and the scoring matrix $(s_{ij})$.
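As a sketch of how λ might be obtained in practice, the equation can be solved by bisection. This assumes the expected score per aligned pair is negative and at least one score is positive, which is what makes the positive root unique. The function names are hypothetical, and K is treated as a given input with a made-up value because its series is not reproduced here.

```python
import math

def solve_lambda(p, q, score):
    """Positive root t of sum_ij p_i q_j exp(s_ij t) = 1, found by bisection."""
    def f(t):
        return sum(p[a] * q[b] * math.exp(score(a, b) * t)
                   for a in p for b in q) - 1.0
    hi = 1.0
    while f(hi) <= 0.0:        # grow until we are to the right of the root
        hi *= 2.0
    lo = hi
    while f(lo) > 0.0:         # shrink until we are to the left of the root
        lo /= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

def evd_tail(s, lam, K, m, n):
    """pr(S >= s) = 1 - exp{-exp{-lambda (s - u)}} with u = log(K m n) / lambda."""
    u = math.log(K * m * n) / lam
    return -math.expm1(-math.exp(-lam * (s - u)))

uniform = {base: 0.25 for base in "ACGT"}
lam = solve_lambda(uniform, uniform, lambda a, b: 1 if a == b else -1)
print(lam, math.log(3))
print(evd_tail(20, lam, K=0.1, m=100, n=100))  # K = 0.1 is an illustrative value
```

For uniform DNA with scores +1 and -1 the equation reduces to a quadratic in exp(λ), whose positive solution is λ = log 3, which gives a convenient check on the solver.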
Database searches: why do them?
To find exact matches to sequences
To find homologous sequences
To infer structure and/or function
of new protein sequences
To locate genes in ESTs or genomic
sequences
To discover gene structure in DNA
sequence
And much more...
Database searching
Database searching compares a query sequence to each sequence in a database (also called a library).
Because sequence databases are large, comparisons are generally carried out with faster heuristic approximations to the exact Smith-Waterman local alignment algorithm, rather than with the exact algorithm itself. The two most common of these are FASTA and BLAST, where each of these names corresponds to a family of algorithms used in different contexts.
BLAST variants for different searches (a)
(after S. Brenner, Trends Guide to Bioinformatics, 1998)

Program   Query     Database   Comparison      Common use
blastn    DNA       DNA        DNA level       Seek identical DNA sequences and splicing patterns
blastp    Protein   Protein    Protein level   Find homologous proteins
blastx    DNA       Protein    Protein level   Analyze new DNA to find genes and seek homologous proteins
tblastn   Protein   DNA        Protein level   Search for genes in unannotated DNA
tblastx   DNA       DNA        Protein level   Discover gene structure

(a) Similar variant programs are available for FASTA. Protein-level searches of DNA sequences are performed by comparing translations of all six reading frames.
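The six reading frames mentioned in the footnote can be sketched as three offsets on the given strand plus three on its reverse complement. This example assumes the standard genetic code and an input containing only A, C, G and T; the function names are hypothetical and stop codons are shown as `*`.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def reverse_complement(dna):
    """Complement each base and reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons from the start; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Frames +1, +2, +3 of the strand, then frames -1, -2, -3 of its reverse complement."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]

print(six_frame_translations("ATGGCC"))
```

A protein-level search such as blastx would then compare each of the six translations against the protein database.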
cDNA, ORFs and ESTs
• Complementary DNA (cDNA)
– Single-stranded DNA complementary to an RNA, from which it is synthesized by reverse transcription.
• Open reading frames (ORFs)
– Contain a series of triplets coding for amino acids without any termination codons (potentially translatable into proteins)
– Many are derived from sequencing of cDNAs
• Expressed sequence tags (ESTs)
– Short (300-500 bp) single reads from mRNA (cDNA) sequencing survey projects.
– A snapshot of what is expressed in a given tissue at a given developmental stage.