Sequences Alignment Statistics

advertisement
Sequences Alignment Statistics
Ka-Lok Ng
Asia University
Pairwise Sequence Alignment
• The most important class of bioinformatics tools – pairwise
alignment of DNA and protein seqs.
alignment 1
alignment 2
Seq. 1 ACGCTGA
ACGCTGA
Seq. 2 A - - CTGT
ACTGT - Seeks alignments  high seq. identity, few mismatchs and gaps
Assumption – the observed identity in seqs. to be aligned is the result
of either random or of a shared evolutionary origin
Identity ≠ similarity
Sequence identity/similarity = Homology (a risky assumption)
Sequence similarity ≠ Homology
Pairwise Sequence Alignment
Same true alignment arise through different evolutionary events
Scoring scheme: substitution  -1, indel  -5, match  3
indel
Score
9
5
4
4
Figure A Common evolutionary events and their effects on alignment
Pairwise Sequence Alignment
Find the optimal score  the best guess for the true
alignment
Find the optimal pairwise alignment of two seqs. 
inserted gaps into one or both of them  maximize
the total alignment score
Dynamic programming (DP) – Needleman and
Wunsch (1970), Smith and Waterman (1980), this
algorithm guarantees that we find all optimal
alignments of two seqs. of lengths m and n
BLAST is based on DP with improvement on speed
Pairwise Sequence Alignment
The score for alignment of i residues of sequence 1 against
j residues of sequence 2 is given by
S (i  1, j  1)  c(i, j )

S (i, j )  max  S (i  1, j )  c(i,)
 S (i, j  1)  c(, j )

where
c(i,j) = the score for alignment of residues i and j and takes the value 3 for a match or
-1 for a mismatch,
c(-,j) = the penalty for aligning a residue with a gap, which takes the value of -5
Pairwise Sequence Alignment
• The entry for S(1,1) is the maximum of the
following three events:
• S(0,0) + c(A,A) = 0 + 3 = 3
[c(A,A) = c(1,1)]
• S(0,1) + c(A, -) = -5 + -5 = -10 [c(A, -) = c(1, -)]
• S(1,0) + c(-, A) = -5 + -5 = -10 [c(- ,A) = c(-, 1)]
• Similarly, one finds S(2,1) as the maximum of
three values: (-5)-1=-6; 3-5=-2; and (-10)-5=-15
 the best is entry is the addition of the C indel
to the A-A match, for a score of -2 (see next
page).
Pairwise Sequence Alignment
The alignment matrix of sequences 1 and 2
A  35  27  19
9
3
1
G  30  22  14
4
2
2
T
 25  17
9
1
1
7
C
 20  12
4
0
4
3
G
 15
7
1
5
4
1
C
 10
2
6
1
4
9
A
5
3
2
7
 12
 17

0
5

A
 10  15  20  25
C
T
G
T
S(2,1)
= max {S(1,0) + c(2,1),
S(1,1) + c(2,-), S(2,0) + c(-,1)}
= max { S(1,0) + c(C,A),
S(1,1) + c(C,-), S(2,0) + c(-,A) }
= max { -5-1, 3-5, -10-5 }
= -2
Pairwise Sequence Alignment
Traceback  determine the actual alignment
From the top right hand corner  the (7,5) cell
A  35  27  19
9
3
1
G  30  22  14
4
2
2
T
 25  17
9
1
1
7
C
 20  12
4
0
4
3
G
 15
7
1
5
4
1
C
 10
2
6
1
4
9
A
5
3
2
7
 12
 17

0
5

A
 10  15  20  25
C
T
G
T
For example the 1 in the
(7,5) cell could only be
reached by the addition of
the mismatch A-T
ACGCTGA
A - - CTGT
or
ACGCTGA
AC - - TGT
4 matches
1 mismatch
2 indels
Ambiguity – has to do with
which C in seq. 1 aligns
with the C in seq. 2
Pairwise Sequence Alignment
Parameters settings - Gap penalties
• Default settings are the easiest to use but they are not necessarily yield
the correct alignment
• constant penalty  independent of the length of gap, A
• proportional penalty  penalty is proportional to the length L of the gap,
BL (that is what we used in the this lecture)
• affine gap penalty  gap-opening penalty + gap-extension penalty =
A+BL
• There is no rule for predicting the penalty that best suits the alignment
• Optimal penalties vary from seq. to seq.  it is a matter of trial and error
• Usually A > B, because of opening a gap (usually A/B ~ 10)
• Hint: (1) compare distantly related seqs. high A and very low B often
give the best results  penalized more on their existence than on their
length, (2) compare closely related seqs., penalize both of extension and
extension
BLAST
global alignment of a pair of seqs., in which all residues from both seqs. are
included.
BLAST – local alignment
Interpreting BLAST output
-Smith and Waterman algorithm  guaranteed to find the best local alignment of
two seqs.
-Too slow in practice !!
-BLAST  heuristic search method that is not guaranteed to find the best local
alignment, but has been especially effective in practice
-e.g. S45649 (from a fossilized insect)
>gi|256517|gb|S45649.1| 16S rRNA [Mastotermes electrodominicus=termites,
amber-preserved fossil, Mitochondrial, 94 nt]
AATAAAATTTTAATAAATATAAAGATTTATAGGGTCTTCTCGGCCTTTAAAAATA
TTTTAGCCTTTTGAC AAAAAAAAAAAAATCTACAAAAAA
BLAST
http://www.ncbi.nlm.nih.gov/BLAST/
E-value, with the most significant hits listed first
E-value is the number of hits with the same level of
similarity that you would expect by chance
E = 0.01  occur once every 100 searches even when
there is no true match in the database
E-value is similar in spirit to the p-value of statistical
hypothesis tests. E-value depends on the size of the
database.
P-value is the probability of finding a seq. similarity as
similar as the observed match if there were really no
true matches in the database.
E-value ≠ p-value
E-value ~ p-value when it is small (say < 0.1)
Since we are interested in unusual hits, it is safe to
interchange E-value with p-value.
E-value – the lower the better the alignment, matches
above 0.001 are often close to the twilight zone
Score (bits) – the higher the better the alignment, score
below 50 are unreliable
Searching Sequence Databases Using BLAST
Searching Sequence Databases Using BLAST
BLAST
The BLAST output may not be the same every time due to the upgrade of
several components :
Database, the BLAST program, the default parameters of the server
E-value, similarity and homology
Protein : >25 %, > 100 a.a., < 10-4
DNA : >70%, > 100 bp, < 10-4
Gap penalties
- constant penalty independent of the length of gap, A
- proportional penalty, penalty is proportional to the length L of the gap, BL
- Affine (『數』遠交的,『化學』親和的) gap penalty, gap-opening penalty +
gap-extension penalty = A+BL
Remark
• Prediction using similarity is a powerful idea in bioinformatics
• homologue  seqs. evolved by divergence from a common ancestor,
therefore to say two seqs. share 50% homology is nonsense; to say two
seqs. share 50% similarity and that they indicate possible homology is
the correct usage of the terms
• Similarity NOT necessary implied homology
BLAST (choosing the parameters)
BLAST - Most highly cited paper >12000 times
alternative methods  seeds + dynamics programming  speed up, faster
not guaranteed to find the best alignment  less accurate
BLAST (Sequence filters)
http://www.ncbi.nlm.nih.gov/BLAST/
BLAST
What is a coiled-coil?
Coiled-coil domains are characterized by a heptad
(成七的一組) repeat pattern in which residues in
the first and fourth position are hydrophobic, and
residues in the fifth and seventh position are
predominantly charged or polar. This pattern can
be used by computational methods, such as
MultiCoil (MIT) or SOCKET (University of
Sussex)to predict coiled-coil domains in amino
acid sequences.
BLAST programs
BLAST (Scoring matrices)
• How to determine the score scheme ?
• Dynamic Programming do not provide the user with a measure of
statistical similarity when regions of local similarity are found
• Take into account not just the position-position overlap between two
seqs. but the characteristics of the a.a being aligned  define
scoring matrices
• Protein scoring matrices take three major biological factors into
account:
• Conservation – the numbers within the scoring matrix provide a
way of representing what a.a. are capable of substituting for other
a.a. (characteristics such as charge, size, hydrophobicity)
• Frequency – a.a cannot freely substitute for one another, the
matrices need to reflect how often particular a.a occur among the
entire proteins.
• Evolution – scoring matrices implicitly represent evolutionary
patterns, and matrices can be adjusted to favor the detection of
closely related or more distantly related proteins.
BLAST (Scoring matrices)
Scoring matrices and the Log Odds Ratio
Si , j  log[
qi , j
pi p j
]
where pi[pj] = probability with which a.a i [j] occurs
among all proteins
qi,j = how often the two a.a i and j are seen to align
with one another in MSA of protein families or in
seqs. that are known to have a biological
relationship.
BLAST (PAM matrices)
Amino acid substitution matrix (PAM and BLOSUM)
• Leave most adjustable parameters to the default value except the
scoring matrix
• a simple scheme for scoring seq. matches and mismatches (all
mismatches received the same penalty)
• Scoring matrix allows some mismatches to be penalized less then
others
• Leucine-isoleucine mismatch < leucine-tryptophan mismatch
• PAM (Point Accepted Mutations) scoring matrices – derived from
closely related species (evolutionary point of view, avoid the
complications of unobserved multiple substitutions at a single position)
• PAM derived from the likelihood of amino acids substitution during the
evolutionary process
• PAM matrices with a smaller number represent shorter evolutionary
distance
• PAM1 – one a.a change per 100 a.a, or roughly 1% divergence
PAM matrix
Asp  Glu, 0.95%
The values in each column
sum to 10000100%.
The value 9730 at the top
left indicates that alanine
has a 97.30% chance that
the replacement aa will
also be an alanine.
The least mutable a.a,
tryptophan (W) has a
99.41% chance of
remaining the same.
The PAM250 = (PAM1)250
Construction of a PAM matrix – an example
•
1.
2.
3.
4.
5.
6.
7.
Construct a MSA. Below is an example of a MSA:
ACGCTAFKI
GCGCTAFKI
ACGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
GCGCTAFKI
ACACTAFKL
ACGCTAFKL
ASGCTAFKL
ACACTAFKL
GCGCTGFKI GCGCTLFKI
•
•
•
A phylogenetic tree is created on the basis of the alignment. Phylogenetic trees
are discussed in any book on bioinformatics.
The phylogenetic tree shows various substitutions among the amino acids.
For each aa type, calculate its frequency of substitution, and its assume that the
substitutions are equally likely in each direction, so a substitution FG,A = AG
would also count as a GA substitution. We count all AG and GA branches
in the tree. For the tree above, FG,A = 3.
Construction of a PAM matrix –
an example
•
•
Compute the relative mutability, mi, of each aa.
For example, consider the A residue. There are a total of 4 mutation involving A. Divide this number
by the number of mutations in the entire tree time two (6x2=12), times the relative freq. of A residue
(10 A’s out of 63 residues, i.e. 10/63 = 0.159), times 100.
Thus mA = 4/12 x 0.159 x 100 = 5.3.
The value 100 is used so that the PAM-1 matrix will represent 1 substitution per 100 residues in the
phylogenetic tree.
Compute the mutation probability, Mij , for each pair of aa.
•
•
•
M ij 
m j Fij
F
ij
i
for _ our _ example
M G, A 
F
5.3 * 3
 3.975.
4
is the total number of substitutions involving A in the phylogenetic tree, and it is equals
to 4 for A residue.
• Finally each Mij is divided by the frequency of occurrence, fi of residue i, and the log of the
resulting value becomes the entry Rij in the PAM matrix. For example, fG = 10/63=0.1587, i.e.
10 G’s divided by 63 aa., so RGA = log (3.975/0.1587) = 1.40.
• Repeat for each pair of aa  PAM matrix
ij
i
BLAST (BLOSUM matrices)
BLOSUM (BLOck SUM) – there are evidence it outperform PAM
• Block  proteins in the same family can be aligned without
introducing a gap (not the individual seqs.)
• So any given protein can contain one or more blocks, corresponding
to each of its functional or structural motif
• With these protein blocks, it is possible to look for substitution
patterns only in the most conserved regions of a protein  block
substitution matrices are generated
• BLOSUM scoring matrix – based on data from distantly related seqs.
(default BLOSUM62 for general use)
• The most commonly used matrices are PAM120, PAM250,
BLOSUM50 and BLOSUM 62
• BLOSUM matrices with a smaller number represent a longer
evolutionary distance
BLAST (BLOSUM matrices)
The BLOSUM62
substitution matrix
Values below zero
indicate amino acid
changes that are
more likely to have
a functional effect
than values of zero
and above.
BLAST (relating PAM to BLOSUM)
PAM250 ~ BLOSUM45
PAM160 ~ BLOSUM62
PAM120 ~ BLOSUM80
Selecting an appropriate scoring matrix
Matrix
Best use
Similarity(%)
BLOSUM90
Short alignments that are highly similar
70-90
BLOSUM80
Detecting members of a protein family
50-60
BLOSUM62
Most effective in finding all potential
similarities
30-40
BLOSUM30
Longer alignment of more divergent seqs.
<30
BLAST (Sensitivity and Specificity)
BLAST (Sensitivity and Specificity)
BLAST (Sensitivity and Specificity)
BLAST (Sensitivity and Specificity)
Accuracy (Q), Sensitivity (SN), Specificity (SP),and
Correlation coefficient (CC)
TP  TN
,
TP  TN  FP  FN
TP
SN 
,
TP  FN
TP
S PTP 
,
TP  FP
TN
S PTN 
,
TN  FP
TP  TN  FP  FN
CC 
(TP  FP)(TP  FN )(TN  FP)(TN  FN )
Q
If FP = FN = 0,  CC =1 (the prefect prediction),
If TP = TN = 0,  CC = -1 (the worst prediction)
BLASTing DNA sequences
Use of BLASTx to find ORF
AE008569
Use of BLASTx to find ORF
Frame = +1
Frame = -2
Use of BLASTx to find ORF
Use of BLASTx to find ORF
Use of BLASTx to find ORF
The BLAST algorithm
http://cbsu.tc.cornell.edu/resources/seq_comp/webtutorials/
Outside_tutorials.html
The BLAST report
The Gumbel Extreme Value Distribution
•
•
•
Test the significant of a local alignment score
For simplicity, consider un-gap alignment
Base on the distribution of scores expected by
aligning two random seqs. of the same length
and base composition (nucleotides or a.a) as
the two test seqs.
• These random seq. alignment scores follow a
distribution called the extreme value
distribution
• Somewhat like a normal distribution with a
positively skewed tail in the higher score range
Goal
• To valuate the probability that a score between
random or unrelated seqs. will reach the score
found between two real seqs. of interest. If
that probability is very low, the alignment score
between the real seqs. is significant.
• Reference: Mount D.W. Bioinformtics, CSHL,
2001.
The Gumbel Extreme Value Distribution
Yev ( x)  e
 x e  x
Behavior of Yev(x),
x  - ∞, Yev(x)  0,
x  ∞, Yev(x)  0
x =0, Yev(x)  e-1
The expectation value or mean of x is the value of the Euler-Massceroni
constant, g = 0.57722.., and the variance of x, s2 is p2/6 = 1.6449. The
probability that score S less than value x, P(S<x) is obtained by calculating
the area under curve from –∞ to x,
x
P( S  x)   e
 x e x

P( S  x)  1  e
e x
dx  e
e x
The Gumbel Extreme Value Distribution
To facilitate calculation, a sequence alignment score S is normalized to produce
a score S’.
P( S '  x)  e  x
where
S '  S  ln( Kmn)
where
  p /(s 6 )  1.2825 / s
e u
K
mn
u  x  g /   x  0.45s
m, n  length _ of _ seq _ and _ database
Example
Length of query : 250
Length of random query : 250
For PAM250 matrix,
Lambda : 0.229, K : 0.09
The raw score: 75
:
bit-score S' = 0.229*75 – ln(0.09*250*250)
S’ = 8.54 bits
P(S’ ≧8.54) = e-8.54 = 1.99*10-4
This means that the chance that an
alignment between two random
sequences will achieve a score greater
than or equal to 75 using the PAM250
matrix is 0.000199.
E-value = P*size of database
Position-Specific Iterated BLAST (PSI-BLAST)
• BLAST is a fast program quite capable of identifying the
close matches, but it is less sensitive for remote
homologous seq. search
• For example, find the homologous of a mouse protein in
the human genome using BLAST is easy
• But if you want to search for a homologous of a yeast
protein in the human genome (remote homologous seq. ),
the job is more difficult for BLAST
Position-Specific Iterated BLAST (PSI-BLAST)
Position-Specific Iterated BLAST (PSI-BLAST)
Query sequence – human hemoglobin
>gi|57013850|sp|P69905|HBA_HUMAN Hemoglobin alpha subunit (Hemoglobin alpha chain) (Alpha-globin)
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPN
ALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSK YR
0 ≦E-value < 10-40
Position-Specific Iterated BLAST (PSI-BLAST)
Query sequence – human hemoglobin
>gi|57013850|sp|P69905|HBA_HUMAN Hemoglobin alpha subunit (Hemoglobin alpha chain) (Alpha-globin)
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHA
HKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSK YR
Gene or
Structure
information
Position-Specific Iterated BLAST (PSI-BLAST)
More seqs. are identified than
Iteration 1
Position-Specific Iterated BLAST (PSI-BLAST)
Add or remove the hits that seems
to be relevant or irrelevant (non-human seq.)
Position-Specific Iterated BLAST (PSI-BLAST)
http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html
Position-Specific Iterated BLAST (PSI-BLAST)
B~C
Download