CS 5263 Bioinformatics Lecture 6: Sequence Alignment Statistics

Review of last lecture
• How to model gaps more accurately?

    GACGCCGAACG
    |||||   |||
    GACGC---ACG
    Score = 8 x m – 3 x d

    GACGCCGAACG
    |||| | | ||
    GACG-C-A-CG
    Score = 8 x m – 3 x d

• Both alignments get the same score under a linear gap penalty, but gaps usually occur in bunches
– During evolution, chunks of DNA may be lost or inserted entirely
– Aligning genomic sequences vs. cDNAs: cDNAs are spliced versions of the genomic sequences
Model gaps more accurately
• Previous model:
– Gap of length n incurs penalty n x d

• General:
– Gap of length n incurs penalty $\gamma(n)$
– Convex function
– E.g. $\gamma(n) = c \sqrt{n}$

$$F(i,j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ \max_{k=0,\dots,i-1} F(k, j) - \gamma(i-k) \\ \max_{k=0,\dots,j-1} F(i, k) - \gamma(j-k) \end{cases}$$

– Running time: O((M+N)MN) (cubic)
– Space: O(NM)
Compromise: affine gaps
$\gamma(n) = d + (n - 1)e$
– d: gap open
– e: gap extension
[Figure: $\gamma(n)$ plotted against n, a straight line that starts at d and rises with slope e]

Example: match = 2, gap open = -5, gap extension = -1

    GACGCCGAACG
    |||||   |||
    GACGC---ACG
    Score: 8 x 2 - 5 - 2 = 9

    GACGCCGAACG
    |||| | | ||
    GACG-C-A-CG
    Score: 8 x 2 - 3 x 5 = 1

• We want to find the optimal alignment with affine gap penalty in
• O(MN) time
• O(MN) or better O(M+N) memory
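The two example scores can be checked with a small scoring routine. This is only a sketch: the function name `score_alignment` is made up for illustration, and the mismatch score of -2 is an assumption (the two example alignments contain no mismatches, so it does not affect them).

```python
def score_alignment(a, b, match=2, mismatch=-2, gap_open=-5, gap_extend=-1):
    """Score a gapped alignment given as two equal-length rows ('-' marks a gap).

    The first position of a gap run costs gap_open, each further position
    costs gap_extend, i.e. gamma(n) = d + (n - 1) e with d, e given as scores.
    """
    assert len(a) == len(b)
    score = 0
    prev_gap_row = None  # row that had a gap in the previous column, if any
    for ca, cb in zip(a, b):
        if ca == "-" or cb == "-":
            gap_row = "a" if ca == "-" else "b"
            score += gap_extend if prev_gap_row == gap_row else gap_open
            prev_gap_row = gap_row
        else:
            score += match if ca == cb else mismatch
            prev_gap_row = None
    return score


bunched = score_alignment("GACGCCGAACG", "GACGC---ACG")
scattered = score_alignment("GACGCCGAACG", "GACG-C-A-CG")
print(bunched, scattered)
```

Setting `gap_extend` equal to `gap_open` gives the linear model back, under which both alignments score the same.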
Dynamic programming
• Consider three sub-problems when aligning x1..xi and y1..yj
– F(i,j): best alignment (score) of x1..xi & y1..yj if xi aligns to yj
– Ix(i,j): best alignment of x1..xi & y1..yj if yj aligns to gap
– Iy(i,j): best alignment of x1..xi & y1..yj if xi aligns to gap

    Last column of the alignment in each case:

    F(i, j)     Ix(i, j)    Iy(i, j)
    ... xi      ... -       ... xi
    ... yj      ... yj      ... -
Finite state machine (transitions are labeled Input / Output; F is the start state)

    Current state   Input      Output        Next state
    F               (xi,yj)    σ(xi,yj)      F
    F               (-,yj)     d             Ix
    F               (xi,-)     d             Iy
    Ix              (-,yj)     e             Ix
    Ix              (xi,yj)    σ(xi,yj)      F
    Iy              (xi,-)     e             Iy
    Iy              (xi,yj)    σ(xi,yj)      F

Example state paths for x = AAC, y = ACT (each path begins at the start state F):

    F-F-F-F        AAC
                   |||
                   ACT

    F-Iy-F-F-Ix    AAC-
                    ||
                   -ACT

    F-F-Iy-F-Ix    AAC-
                   | |
                   A-CT
Given a pair of sequences, an alignment (not necessarily optimal)
corresponds to a state path in the FSM.
Optimal alignment: a state path to read the two sequences such that
the total output score is the highest
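The correspondence between an alignment and a state path can be sketched in a few lines. The function name `state_path` is hypothetical. The sketch follows the definitions of the three states: a column with a gap in x goes to Ix, a column with a gap in y goes to Iy, and everything else goes to F. It does not check that each transition is one the FSM actually allows.

```python
def state_path(x_row, y_row):
    """Return the state path of a gapped alignment, starting from the start state F."""
    path = ["F"]  # start state
    for cx, cy in zip(x_row, y_row):
        if cx == "-":
            path.append("Ix")  # y_j aligns to a gap
        elif cy == "-":
            path.append("Iy")  # x_i aligns to a gap
        else:
            path.append("F")   # x_i aligns to y_j
    return "-".join(path)


for x_row, y_row in [("AAC", "ACT"), ("AAC-", "-ACT"), ("AAC-", "A-CT")]:
    print(x_row, y_row, state_path(x_row, y_row))
```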
Reading the transitions into each state off the FSM gives one recurrence per state:

$$F(i,j) = \sigma(x_i, y_j) + \max \begin{cases} F(i-1, j-1) & \text{continuing alignment} \\ I_x(i-1, j-1) & \text{closing gaps in } x \\ I_y(i-1, j-1) & \text{closing gaps in } y \end{cases}$$

$$I_x(i,j) = \max \begin{cases} F(i, j-1) + d & \text{opening a gap in } x \\ I_x(i, j-1) + e & \text{gap extension in } x \end{cases}$$

$$I_y(i,j) = \max \begin{cases} F(i-1, j) + d & \text{opening a gap in } y \\ I_y(i-1, j) + e & \text{gap extension in } y \end{cases}$$
Worked example: x = GCAC, y = GCC, with m = 2, s = -2, d = -5, e = -1
("-inf" marks a cell that cannot be reached)

Initialization
– F(0,0) = 0; every other cell in row 0 and column 0 of F is -inf
– Ix(0,j) = d + (j-1)e = -5, -6, -7; column 0 of Ix is -inf
– Iy(i,0) = d + (i-1)e = -5, -6, -7, -8; row 0 of Iy is -inf

Filling the first cells
– F(1,1) = σ(G,G) + max{F(0,0), Ix(0,0), Iy(0,0)} = 2 + 0 = 2
– F(1,2) = σ(G,C) + max{F(0,1), Ix(0,1), Iy(0,1)} = -2 + (-5) = -7
– Ix(1,2) = max{F(1,1) + d, Ix(1,1) + e} = 2 - 5 = -3
– Ix(1,3) = max{F(1,2) + d, Ix(1,2) + e} = max{-12, -4} = -4
– F(2,2) = σ(C,C) + max{F(1,1), Ix(1,1), Iy(1,1)} = 2 + 2 = 4
– F(2,3) = σ(C,C) + max{F(1,2), Ix(1,2), Iy(1,2)} = 2 + (-3) = -1
– Ix(2,3) = max{F(2,2) + d, Ix(2,2) + e} = max{-1, -13} = -1
– Iy(2,1) = max{F(1,1) + d, Iy(1,1) + e} = 2 - 5 = -3

Completed tables

F: aligned on both
              y=  -     G     C     C
    x=  -         0  -inf  -inf  -inf
        G      -inf     2    -7    -8
        C      -inf    -7     4    -1
        A      -inf    -8    -5     2
        C      -inf    -9    -2     1

Ix: insertion on x
              y=  -     G     C     C
    x=  -      -inf    -5    -6    -7
        G      -inf  -inf    -3    -4
        C      -inf  -inf   -12    -1
        A      -inf  -inf   -13   -10
        C      -inf  -inf   -14    -7

Iy: insertion on y
              y=  -     G     C     C
    x=  -      -inf  -inf  -inf  -inf
        G        -5  -inf  -inf  -inf
        C        -6    -3   -12   -13
        A        -7    -4    -1    -6
        C        -8    -5    -2    -3

Optimal score = max{F(4,3), Ix(4,3), Iy(4,3)} = 1. Tracing back from F(4,3) gives

    x  GCAC
       || |
    y  GC-C
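The recurrences for F, Ix and Iy translate almost line for line into code. The following is a minimal sketch that fills the tables, not a full aligner (there is no traceback). The function name `affine_global_align` is made up, and the boundary initialization (leading gaps charged as d + (n-1)e, all other border cells set to minus infinity) is an assumption consistent with the example's first row and column.

```python
NEG_INF = float("-inf")


def affine_global_align(x, y, match=2, mismatch=-2, d=-5, e=-1):
    """Fill F, Ix, Iy for global alignment with affine gaps.

    d (gap open) and e (gap extension) are negative scores that get added,
    as in the recurrences. Tables are indexed [i][j], i over x and j over y.
    Returns (best_score, F, Ix, Iy).
    """
    M, N = len(x), len(y)
    F = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    Ix = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    Iy = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    F[0][0] = 0
    for j in range(1, N + 1):      # y_1..y_j aligned to a leading gap
        Ix[0][j] = d + (j - 1) * e
    for i in range(1, M + 1):      # x_1..x_i aligned to a leading gap
        Iy[i][0] = d + (i - 1) * e

    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = sigma + max(F[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(F[i][j - 1] + d, Ix[i][j - 1] + e)
            Iy[i][j] = max(F[i - 1][j] + d, Iy[i - 1][j] + e)

    best = max(F[M][N], Ix[M][N], Iy[M][N])
    return best, F, Ix, Iy


best, F, Ix, Iy = affine_global_align("GCAC", "GCC")
print(best)  # 1, the score of GCAC / GC-C
```

Each cell is computed in constant time from three neighbors, so the fill takes O(MN) time. Keeping only the previous row of each table would reduce memory for the score alone, at the cost of losing the traceback.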
Today: statistics of alignment
Where does $\sigma(x_i, y_j)$ come from?
Are two aligned sequences actually related?
Probabilistic model of alignments
• We’ll first focus on protein alignments without
gaps
• Given an alignment, we can consider two
possible models
– R: the sequences are related by evolution
– U: the sequences are unrelated
• How can we distinguish these two models?
• How is this view related to amino-acid
substitution matrix?
Model for unrelated sequences
• Assume each position of the alignment is independently
sampled from some distribution of amino acids
• ps: probability of amino acid s in the sequences
• Probability of seeing an amino acid s aligned to an
amino acid t by chance is
– Pr(s, t | U) = ps * pt
• Probability of seeing an ungapped alignment between x = x1…xn and y = y1…yn randomly is
$$\Pr(x, y \mid U) = \prod_i p_{x_i} \prod_i p_{y_i}$$
Model for related sequences
• Assume each pair of aligned amino acids
evolved from a common ancestor
• Let qst be the probability that amino acid s in one
sequence is related to t in another sequence
• The probability of an alignment of x and y is given by
$$\Pr(x, y \mid R) = \prod_i q_{x_i y_i}$$
Probabilistic model of Alignments
• How can we decide which model (U or R) is
more likely?
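• One principled way is to consider the relative likelihood of the two models (the odds ratio)
$$\frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \frac{\prod_i q_{x_i y_i}}{\prod_i p_{x_i} \prod_i p_{y_i}} = \prod_i \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$$
– A higher ratio means that R is more likely than U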
Log odds ratio
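• Taking logarithm, we get
$$\log \frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \sum_i \log \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$$
• Recall that the score of an alignment is given by
$$S = \sum_i \sigma(x_i, y_i)$$
• Therefore, if we define
$$\sigma(s, t) = \log \frac{q_{st}}{p_s p_t}$$
• We are actually defining the alignment score as the log odds ratio between the two models R and U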
How to get the probabilities?
• ps can be counted from the available
protein sequences
• But how do we get qst? (the probability that
s and t have a common ancestor)
• Counted from trusted alignments of related
sequences
Protein Substitution Matrices
• Two popular sets of matrices for protein
sequences
– PAM matrices [Dayhoff et al, 1978]
• Better for aligning closely related sequences
– BLOSUM matrices [Henikoff & Henikoff, 1992]
• For both closely or remotely related sequences
BLOSUM-N matrices
• Constructed from a database called BLOCKS
• Contain many closely related sequences
– Conserved amino acids may be over-counted
• N = 62: the probabilities qst were computed
using trusted alignments with no more than 62%
identity
– identity: % of matched columns
• Using this matrix, the Smith-Waterman algorithm is most effective in detecting real alignments with a similar identity level (i.e. ~62%)

[Figure: a BLOSUM matrix, annotated with the following notes]
• $\lambda$: scaling factor to convert score to integer
• Important: when you are told that a scoring matrix is in half-bits => $\lambda = \frac{1}{2} \ln 2$
• Positive for chemically similar substitutions
• Common amino acids get low weights
• Rare amino acids get high weights
BLOSUM-N matrices
• If you want to detect homologous genes with high identity, you may want a BLOSUM matrix with higher N, say BLOSUM75
• On the other hand, if you want to detect remote homology, you may want to use lower N, say BLOSUM50
• BLOSUM-62: good for most purposes

    N:   45 (weak homology)   ...   62   ...   90 (strong homology)
For DNAs
• No database of trusted alignments to start
with
• Specify the percentage identity you would
like to detect
• You can then get the substitution matrix by
some calculation
For example
• Suppose pA = pC = pT = pG = 0.25
• We want 88% identity
• qAA = qCC = qTT = qGG = 0.22, the rest = 0.12/12 = 0.01
• $\sigma(A, A) = \sigma(C, C) = \sigma(G, G) = \sigma(T, T) = \log(0.22 / (0.25 \times 0.25)) = 1.26$
• $\sigma(s, t) = \log(0.01 / (0.25 \times 0.25)) = -1.83$ for $s \ne t$
Substitution matrix

           A      C      G      T
    A   1.26  -1.83  -1.83  -1.83
    C  -1.83   1.26  -1.83  -1.83
    G  -1.83  -1.83   1.26  -1.83
    T  -1.83  -1.83  -1.83   1.26

• Scale won’t change the alignment
• Multiply by 4 and then round off to get integers

          A    C    G    T
    A     5   -7   -7   -7
    C    -7    5   -7   -7
    G    -7   -7    5   -7
    T    -7   -7   -7    5
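The arithmetic behind both matrices can be reproduced with a short sketch. The function name `dna_substitution_matrix` and its parameters are illustrative only. It assumes, as in the example, uniform base frequencies of 0.25, natural logarithms, and that the non-identity probability is spread evenly over the 12 mismatching pairs.

```python
import math


def dna_substitution_matrix(identity=0.88, scale=4):
    """Log-odds scores for DNA from a target identity, plus a scaled integer version."""
    bases = "ACGT"
    p = 0.25                           # background frequency of every base
    q_match = identity / 4             # identity mass shared by AA, CC, GG, TT
    q_mismatch = (1 - identity) / 12   # the rest shared by the 12 mismatching pairs
    raw, scaled = {}, {}
    for s in bases:
        for t in bases:
            q = q_match if s == t else q_mismatch
            raw[s, t] = math.log(q / (p * p))         # sigma(s, t) = log(q_st / (p_s p_t))
            scaled[s, t] = round(scale * raw[s, t])   # multiply and round to integers
    return raw, scaled


raw, scaled = dna_substitution_matrix()
print(round(raw["A", "A"], 2), round(raw["A", "C"], 2))   # 1.26 -1.83
print(scaled["A", "A"], scaled["A", "C"])                 # 5 -7
```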
Arbitrary substitution matrix
• Say you have a substitution matrix
provided by someone
• It’s important to know what you are
actually looking for when you use the
matrix
    NCBI-BLAST                    WU-BLAST
          A    C    G    T              A    C    G    T
    A     1   -2   -2   -2        A     5   -4   -4   -4
    C    -2    1   -2   -2        C    -4    5   -4   -4
    G    -2   -2    1   -2        G    -4   -4    5   -4
    T    -2   -2   -2    1        T    -4   -4   -4    5
• What’s the difference?
• Which one should I use for my sequences?
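• We had
$$\sigma(s, t) = \log \frac{q_{st}}{p_s p_t}$$
• Scale it, so that
$$\sigma(s, t) = \frac{1}{\lambda} \log \frac{q_{st}}{p_s p_t}$$
• Reorganize:
$$q_{st} = p_s p_t e^{\lambda \sigma(s, t)}$$
• Since all probabilities must sum to 1,
$$\sum_{s,t} q_{st} = 1$$
• We have
$$\sum_{s,t} p_s p_t e^{\lambda \sigma(s, t)} = 1$$
• Suppose again ps = 0.25 for any s
• We know $\sigma(s, t)$ from the substitution matrix
• We can solve the equation for $\lambda$
• Plug $\lambda$ into $q_{st} = p_s p_t e^{\lambda \sigma(s, t)}$ to get qst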
Results for the two matrices (with ps = 0.25):

                      NCBI-BLAST             WU-BLAST
                      (match 1, mismatch -2) (match 5, mismatch -4)
    λ                 1.33                   0.19
    qst for s = t     0.24                   0.16
    qst for s ≠ t     0.004                  0.03
    Translate         95% identity           65% identity
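Details for solving $\lambda$
Known: $\sigma(s, t) = 1$ for $s = t$, and $\sigma(s, t) = -2$ for $s \ne t$.

          A    C    G    T
    A     1   -2   -2   -2
    C    -2    1   -2   -2
    G    -2   -2    1   -2
    T    -2   -2   -2    1

Since $q_{st} = p_s p_t e^{\lambda \sigma(s, t)}$ and $\sum_{s,t} q_{st} = 1$, we have
$$12 \cdot \tfrac{1}{4} \cdot \tfrac{1}{4} \cdot e^{-2\lambda} + 4 \cdot \tfrac{1}{4} \cdot \tfrac{1}{4} \cdot e^{\lambda} = 1$$
Let $e^{\lambda} = x$, we have
$$\tfrac{3}{4} x^{-2} + \tfrac{1}{4} x = 1$$
Hence,
$$x^3 - 4x^2 + 3 = 0$$
• x has three solutions: 3.8, 1, -0.8
• Only the first leads to a positive $\lambda$
• $\lambda = \ln(3.8) = 1.33$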
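For a matrix where the cubic is less convenient, the same equation can be solved numerically. The sketch below uses plain bisection and assumes a DNA matrix with a single match score, a single mismatch score, and uniform base frequencies. The function names `solve_lambda` and `implied_identity` are made up for illustration.

```python
import math


def solve_lambda(match, mismatch, p=0.25, lo=1e-6, hi=10.0, iters=100):
    """Find the positive lambda with sum_{s,t} p_s p_t exp(lambda * sigma(s,t)) = 1.

    Assumes 4 matching and 12 mismatching pairs and a negative expected score,
    so the left side dips below 1 just above lambda = 0 and then grows past 1.
    """
    def f(lam):
        return (4 * p * p * math.exp(lam * match)
                + 12 * p * p * math.exp(lam * mismatch) - 1.0)

    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid   # root lies below mid
        else:
            lo = mid   # root lies above mid
    return (lo + hi) / 2


def implied_identity(match, mismatch, p=0.25):
    """Target identity = sum of q_ss = 4 * p^2 * exp(lambda * match)."""
    lam = solve_lambda(match, mismatch, p)
    return 4 * p * p * math.exp(lam * match)


for m, s in [(1, -2), (5, -4)]:
    print(m, s, round(solve_lambda(m, s), 2), round(implied_identity(m, s), 2))
```

For the 1 / -2 matrix this gives a value close to ln(3.8), and for the 5 / -4 matrix a value close to 0.19, matching the numbers quoted for the two BLAST matrices.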
Today: statistics of alignment
Where does $\sigma(x_i, y_j)$ come from?
Are two aligned sequences actually related?
Statistics of Alignment Scores
• Q: How do we assess whether an
alignment provides good evidence for
homology (i.e., the two sequences are
evolutionarily related)?
– Is a score 82 good? What about 180?
• A: determine how likely it is that such an
alignment score would result from chance
P-value of alignment
• p-value
– The probability that a score at least as high as the observed alignment score is obtained by aligning random sequences
– Small p-value means the score is unlikely to happen by chance
• The most common thresholds are 0.01 and 0.05
– Also depend on purpose of comparison and cost of a false claim
Statistics of global seq alignment
• Theory only applies to local alignment
• For global alignment, your best bet is to do Monte-Carlo
simulation
– What’s the chance you can get a score as high as the real
alignment by aligning two random sequences?
• Procedure
– Given sequence X, Y
– Compute a global alignment (score = S)
– Randomly shuffle sequence X (or Y) N times, obtain
X1, X2, …, XN
– Align each Xi with Y, (score = Ri)
– P-value: the fraction of Ri >= S
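The procedure above can be sketched as follows. To keep the example short it uses a global alignment with a linear gap penalty. The scoring values (match 1, mismatch -1, gap -2), the function names, and the test sequence are all assumptions made for illustration, and any global aligner could be substituted.

```python
import random


def global_score(x, y, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score with a linear gap penalty, keeping two rows only."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def shuffle_p_value(x, y, n_shuffles=200, seed=0):
    """Return (S, fraction of shuffled-x alignments scoring >= S)."""
    rng = random.Random(seed)      # fixed seed so the run is repeatable
    s = global_score(x, y)
    hits = 0
    for _ in range(n_shuffles):
        chars = list(x)
        rng.shuffle(chars)         # shuffling preserves the composition of x
        if global_score("".join(chars), y) >= s:
            hits += 1
    return s, hits / n_shuffles


seq = "ACGTTGCAAGCTTAGGCATCGA"
score, p = shuffle_p_value(seq, seq)
print(score, p)
```

When no shuffled sequence reaches S, the estimate is 0, which should be read as "less than 1/N" rather than as an exact p-value.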
Human HEXA vs. fly HEXO1: score = -74
[Figure: distribution of the alignment scores between fly HEXO1 and 200 randomly shuffled human HEXA sequences. x-axis: alignment score (-95 to -50); y-axis: number of sequences; the real score, -74, is marked.]
There are 88 random sequences with alignment score >= -74.
So: p-value = 88 / 200 = 0.44 => alignment is not significant
Mouse HEXA vs. human HEXA: score = 732
[Figure: distribution of the alignment scores between mouse HEXA and 200 randomly shuffled human HEXA sequences. x-axis: alignment score; y-axis: number of sequences. The shuffled scores fall between about -230 and -150; the real score, 732, lies far to the right.]
• No random sequences with alignment score >= 732
– So: the P-value is less than 1 / 200 = 0.005
• To get smaller p-value, have to align more random sequences
– Very slow
• Unless we can fit a distribution (e.g. normal distribution)
– Such distribution may not be generalizable
– No theory exists for global alignment score distribution
Statistics for local alignment
• Elegant theory exists
• Score for ungapped local alignment follows extreme value
distribution (Gumbel distribution)
[Figure: a normal distribution next to an extreme value distribution]
An example extreme value distribution:
• Randomly sample 100 numbers from a normal distribution, and compute max
• Repeat 100 times.
• The max values will follow extreme value distribution
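The sampling experiment just described takes only a few lines to try. This sketch uses a standard normal (mean 0, standard deviation 1), which is an assumption, and the function name `simulate_maxima` is made up.

```python
import random
import statistics


def simulate_maxima(n_samples=100, n_repeats=100, seed=0):
    """Draw n_samples standard normal values, keep the max, repeat n_repeats times."""
    rng = random.Random(seed)
    return [max(rng.gauss(0.0, 1.0) for _ in range(n_samples))
            for _ in range(n_repeats)]


maxima = simulate_maxima()
print("mean of maxima:", statistics.mean(maxima))
print("smallest and largest maximum:", min(maxima), max(maxima))
```

A histogram of `maxima` should sit well to the right of zero with a longer right tail, which is the qualitative shape of an extreme value distribution.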
Statistics for local alignment
• Given two unrelated sequences of lengths M, N
• Expected number of ungapped local alignments with score at least S can be calculated by
$$E(S) = KMN \exp[-\lambda S]$$
– Known as E-value
– $\lambda$: scaling factor as computed in last lecture
– K: empirical parameter ~ 0.1
• Depend on sequence composition and substitution matrix
P-value for local alignment score
• P-value for a local alignment with score S
$$P(x \ge S) = 1 - \exp(-E(S)) = 1 - \exp\left(-KMN e^{-\lambda S}\right) \approx E(S) \text{ when } P \text{ is small.}$$
Example
• You are aligning two sequences, each has
1000 bases
• m = 1, s = -1, d = -inf (ungapped alignment)
• You obtain a score 20
• Is this score significant?
• $\lambda = \ln 3 = 1.1$ (computed as discussed on slide #41)
• $E(S) = KMN \exp\{-\lambda S\}$
• $E(20) = 0.1 \times 1000 \times 1000 \times 3^{-20} = 3 \times 10^{-5}$
• P-value = $3 \times 10^{-5}$ << 0.05
• The alignment is significant
[Figure: distribution of alignment scores for 1000 random sequence pairs. x-axis: alignment score (9 to 18); y-axis: number of sequences; the observed score of 20 lies beyond the right end of the distribution.]
Multiple-testing problem
• Searching a 1000-base sequence against a database of $10^6$ sequences (each of length 1000)
• How significant is a score 20 now?
• You are essentially comparing 1000 bases with $1000 \times 10^6 = 10^9$ bases (ignore edge effect)
• $E(20) = 0.1 \times 1000 \times 10^9 \times 3^{-20} = 30$
• By chance we would expect to see 30 matches
– The P-value (probability of seeing at least one match with score >= 20) is $1 - e^{-30} \approx 0.9999999999$
– The alignment is not significant
– Caution: it does NOT mean that the two sequences are unrelated.
Rather, it simply means that you have NO confidence to say
whether the two sequences are related.
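Both the pairwise example and the database search can be recomputed with the E-value and P-value formulas. A minimal sketch, with hypothetical function names and K = 0.1 as quoted above:

```python
import math


def e_value(score, m, n, lam, K=0.1):
    """E(S) = K * M * N * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * score)


def p_value(score, m, n, lam, K=0.1):
    """P = 1 - exp(-E(S)); expm1 keeps precision when E(S) is tiny."""
    return -math.expm1(-e_value(score, m, n, lam, K))


lam = math.log(3)  # m = 1, s = -1, ungapped

# One 1000-base sequence against another 1000-base sequence
print(e_value(20, 1000, 1000, lam), p_value(20, 1000, 1000, lam))

# One 1000-base sequence against a database of 10^6 sequences of length 1000
print(e_value(20, 1000, 10**9, lam), p_value(20, 1000, 10**9, lam))
```

The exact values are about 2.9 x 10^-5 and about 29. The figures of 3 x 10^-5 and 30 in the text are these numbers rounded.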
Score threshold to determine
significance
• You want a p-value that is very small (even after
taking into consideration multiple-testing)
• What S will guarantee you a significant p-value?
$$E(S) \approx P(S) \ll 1$$
$$\Rightarrow KMN \exp[-\lambda S] \ll 1$$
$$\Rightarrow \log(KMN) - \lambda S < 0$$
$$\Rightarrow S > T + \log(MN) / \lambda \qquad (T = \log(K) / \lambda, \text{ usually small})$$
Score threshold to determine
significance
• In the previous example
– m = 1, s = -1, d = -inf => $\lambda = 1.1$
• Aligning 1000bp vs 1000bp
$S > \log(10^6) / 1.1 = 13$. So 20 is significant.
• Searching 1000bp against $10^6 \times 1000$bp
$S > \log(10^{12}) / 1.1 = 25$. So 20 is not significant.
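The threshold rule is a one-line computation. In this sketch (function name hypothetical) the T term is dropped unless a value of K is passed in, mirroring the two calculations above, which ignore it.

```python
import math


def score_threshold(m, n, lam, K=None):
    """Smallest score scale for significance: S > T + log(M N) / lambda.

    T = log(K) / lambda is included only when K is given.
    """
    t = math.log(K) / lam if K is not None else 0.0
    return t + math.log(m * n) / lam


lam = 1.1  # m = 1, s = -1, ungapped
print(score_threshold(1000, 1000, lam))    # pairwise: roughly 13
print(score_threshold(1000, 10**9, lam))   # database search: roughly 25
```

With K = 0.1 the term T is about -2, so including it lowers both thresholds slightly without changing either conclusion for a score of 20.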
Statistics for gapped local alignment
• Theory not well developed
• Extreme value distribution works well
empirically
• Need to estimate K and $\lambda$ empirically
– Given the database and substitution matrix, generate some random sequence pairs
– Do local alignment
– Fit an extreme value distribution to obtain K and $\lambda$
In summary
• How to obtain a substitution matrix?
– Obtain qst and ps from established alignments (for DNA: from your knowledge)
– Computing score: $\sigma(s, t) = \frac{1}{\lambda} \log \frac{q_{st}}{p_s p_t}$
• How to understand arbitrary substitution matrix?
– Solve function to obtain $\lambda$ and target qst
– Which tells you what percent identity you are expecting
• How to understand alignment score?
– Probability that a score can be expected from chance.
– Global alignment: Monte-Carlo simulation
– Local alignment: Extreme Value Distribution
• Estimate p-value from a score
• Determine a score threshold without computing a p-value