Information theoretic interpretation of
PAM matrices
Sorin Istrail and Derek Aguiar
Local alignments
• The preferred method for computing regions of
local similarity between two amino acid sequences
is to consider all segment pairs across the full
lengths of the sequences and optimize the score
under a similarity matrix.
• PAM and BLOSUM are both families of matrices
constructed to model similarity between amino
acid sequences at different evolutionary
distances.
• Here, we follow Altschul (1991) to investigate
PAM matrices from an information theoretic
perspective.
Caveats and assumptions
• The following theory only applies to locally
aligned segments that lack gaps.
• Why is this assumption easier to tolerate in local
alignment vs. global alignment?
• Why is this assumption still restrictive for local
alignments?
Notation and definitions
• Amino acids: a_i
• Substitution score of aligned amino acids a_i and
a_j: s_{i,j}
• A Maximal Segment Pair (MSP) is a pair of equal
length segments from two amino acid sequences
that, when aligned, have maximum score.
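For ungapped alignment, the MSP can be computed directly: every pair of equal-length segments lies along one diagonal of the comparison matrix, so a maximum-subarray (Kadane) scan along each diagonal finds the best pair. A minimal sketch; the function name and the +1/−1 scoring are illustrative, not from the slides:

```python
def msp_score(s, t, score):
    """Maximal Segment Pair score for ungapped local alignment.

    Every pair of equal-length segments corresponds to a run along one
    diagonal of the len(s) x len(t) comparison matrix, so we scan each
    diagonal with Kadane's maximum-subarray algorithm.  The empty
    segment pair (score 0) is allowed."""
    best = 0
    for d in range(-(len(t) - 1), len(s)):  # d = i - j indexes diagonals
        run = 0
        i, j = max(d, 0), max(d, 0) - d
        while i < len(s) and j < len(t):
            run = max(0, run) + score(s[i], t[j])  # extend or restart segment
            best = max(best, run)
            i, j = i + 1, j + 1
    return best

# With +1/-1 scoring, two identical 8-mers give an MSP score of 8:
match = lambda a, b: 1 if a == b else -1
msp_score("AGCGCTAC", "AGCGCTAC", match)  # 8
```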
Random model
• For any two amino acid sequences, there exists at
least one MSP.
• It is convenient to compute what MSP scores look
like for random sequences to serve as a basis for
comparison.
• We will consider a very simple model
• Each amino acid a_i appears i.i.d. with probability
p_i, reflecting the actual frequencies of amino acids
in real sequences
• What could be a more biologically accurate (yet
mathematically less feasible) method for generating
amino acid sequences?
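The simple random model above is easy to simulate: draw each residue independently from the background distribution. A sketch; the 4-letter alphabet and its frequencies are placeholders (a real amino acid model would use 20 measured frequencies):

```python
import random

def random_sequence(n, freqs, rng=random):
    """Generate a length-n sequence in which each residue a_i appears
    i.i.d. with probability p_i, as in the simple random model above."""
    letters = list(freqs)
    weights = [freqs[a] for a in letters]
    return "".join(rng.choices(letters, weights=weights, k=n))

# Placeholder background frequencies (illustrative values only).
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
seq = random_sequence(50, background)
```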
More assumptions...
• There is at least one positive score in the
substitution matrix
• Why is this reasonable? (think about what the
optimal alignment would be)
• The expected score for the matrix is negative
• Ξ£_{i,j} p_i p_j s_{i,j} < 0
πœ†
• Previous theory defines a key parameter πœ†
•
πœ†π‘ π‘–,𝑗
𝑝
𝑝
𝑒
=1
𝑖,𝑗 𝑖 𝑗
• But what is πœ†?
• Consider the case of multiplying a similarity
matrix by some constant c, what happens to an
alignment?
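Ξ» is defined only implicitly by Ξ£_{i,j} p_i p_j e^{Ξ» s_{i,j}} = 1, but under the stated assumptions (at least one positive score, negative expected score) the positive root is unique, so simple bisection finds it. A sketch; the helper name is mine:

```python
import math

def solve_lambda(p, s, hi=10.0, tol=1e-12):
    """Find the unique positive lambda with
        sum_ij p_i p_j * exp(lambda * s_ij) = 1.
    f(lambda) = sum - 1 has f(0) = 0, is negative for small positive
    lambda (negative expected score) and eventually positive (some
    s_ij > 0), so there is a single sign change to bracket."""
    f = lambda lam: sum(p[i] * p[j] * math.exp(lam * s[i][j])
                        for i in range(len(p)) for j in range(len(p))) - 1.0
    lo = 1e-9
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid  # past the root: shrink from above
        else:
            lo = mid  # before the root: shrink from below
    return (lo + hi) / 2

# Uniform 4-letter background with +1/-1 scoring gives lambda = ln 3:
p = [0.25] * 4
s = [[1 if i == j else -1 for j in range(4)] for i in range(4)]
solve_lambda(p, s)  # β‰ˆ 1.0986
```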
πœ†
𝑐∗
A
C
G
T
-
A
1
-1
-1
-1
-1
C
-1
1
-1
-1
-1
G
-1
-1
1
-1
-1
T
-1
-1
-1
1
-
-1
-1
-1
-1
MSP score = 8
...AGCGCTAC...
...AGCGCTAC...
A
C
G
T
-
A
c
-c
-c
-c
-c
C
-c
c
-c
-c
-c
G
-c
-c
c
-c
-c
-1
T
-c
-c
-c
c
-c
-∞
-
-c
-c
-c
-c
-∞
=
MSP score = c*8
...AGCGCTAC...
...AGCGCTAC...
The MSP score changes but the MSP and alignment do not.
How does this affect λ?
πœ†
• To preserve 𝑖,𝑗 𝑝𝑖 𝑝𝑗 𝑒 πœ†π‘ π‘–,𝑗 = c after scalar
multiplication of the similarity matrix, we can
πœ†
simply set πœ† =
𝑐
• So, one may view πœ† as a scaling parameter for a
similarity matrix.
Random Model
• Given two random sequences how many MSPs
with score at least S can we expect by chance?
• K N e^{−λS}, where N is the product of the sequences’
lengths and K is a calculable parameter.
• This equation is related to the limiting
probability distribution for M′(N) = M(N) − (ln n)/λ*,
where M(N) is the MSP score.
• Theorem 1 (single sequence, Karlin and Altschul
1990): Prob( Mβ€²(N) > x ) β‰ˆ 1 βˆ’ e^{βˆ’K e^{βˆ’Ξ»x}}
Random Model
• Which leads to a Poisson approximation for the
number of MSPs with scores exceeding (ln n)/λ* + x,
with parameter K* e^{βˆ’Ξ»* x}
• So, the probability of finding m or more distinct
segments with score greater than or equal to S is
approximated by 1 βˆ’ e^{βˆ’K n e^{βˆ’Ξ»S}} Ξ£_{i=0}^{mβˆ’1} (K n e^{βˆ’Ξ»S})^i / i!,
where taking m = 1 (a single segment) yields the
Theorem of the previous slide
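The Poisson approximation above can be evaluated directly once K, Ξ», and n are known; a sketch (the function name is mine):

```python
import math

def prob_at_least_m(m, S, K, n, lam):
    """Poisson approximation for the chance of m or more distinct
    segments scoring >= S, with mean mu = K * n * exp(-lam * S):
        1 - exp(-mu) * sum_{i=0}^{m-1} mu**i / i!"""
    mu = K * n * math.exp(-lam * S)
    return 1.0 - math.exp(-mu) * sum(mu ** i / math.factorial(i)
                                     for i in range(m))

# m = 1 recovers Theorem 1's  1 - exp(-K * n * exp(-lam * S)).
```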
Random Model
• An extension of Theorem 1 to two sequences of
lengths m and n yields an important result:
• Prob( M > ln(nm)/Ξ»* + x ) ≀ K* e^{βˆ’Ξ»* x}, where *
denotes an estimated parameter and M is the MSP score
• Which leads to the result in Altschul (1991):
• The number of MSPs with score at least S is well
approximated by K N e^{−λS}, where N is the product of
the sequences’ lengths and K and λ are calculable
parameters.
Substitution matrices
• Substitution matrices store scores which encode the
target frequencies of amino acids in a true alignment
and the background amino acid probabilities
ln π‘žπ‘–π‘—
𝑝𝑖 𝑝 𝑗
• The scores, 𝑠𝑖,𝑗 =
where π‘žπ‘–π‘— are the target
πœ†
frequencies, 𝑝𝑖 and 𝑝𝑗 are the background amino acid
probabilities, and πœ† is a scaling factor.
• This ratio compares the probability of an alternate
hypothesis (target frequencies) to the probability of
the null (product of frequencies)
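The score formula can be applied mechanically once target and background frequencies are fixed; note that Ξ£_{i,j} p_i p_j e^{Ξ» s_{i,j}} = Ξ£_{i,j} q_{ij} = 1 then holds automatically, which is why every such matrix has an associated Ξ». A sketch with invented two-letter frequencies:

```python
import math

def scores_from_frequencies(q, p, lam):
    """s_ij = ln(q_ij / (p_i * p_j)) / lam: the log-likelihood ratio of
    the target frequencies against the background product, scaled by lam."""
    n = len(p)
    return [[math.log(q[i][j] / (p[i] * p[j])) / lam for j in range(n)]
            for i in range(n)]

# Toy 2-letter example (frequencies invented for illustration).
p = [0.5, 0.5]
q = [[0.3, 0.2], [0.2, 0.3]]
s = scores_from_frequencies(q, p, math.log(2))  # lam = ln 2 -> bit scores
```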
Local alignment and information theory
• Because scaling the substitution matrix changes πœ†
but not the target frequencies, we have the freedom
of adjusting πœ†
• Let’s set πœ† = ln 2
• Then, we can set the number of MSPs with score at
least S to p and solve for S
• K N e^{βˆ’Ξ»S} = p β‡’ S = logβ‚‚(K/p) + logβ‚‚ N
• An alignment is significant when p < 0.05, and K is
typically near 0.1, so the score needed to
distinguish an MSP from chance is approximated by
the number of bits needed to represent the MSP
(logβ‚‚ N)
Relative entropy and substitution matrices
• So, what substitution matrices are the most
appropriate for the comparison of two particular
sequences?
• To answer this question, consider the average
score per residue pair in an alignment
• H = Ξ£_{i,j} q_{ij} s_{ij} = Ξ£_{i,j} q_{ij} logβ‚‚( q_{ij} / (p_i p_j) )
• H is exactly the notion of relative entropy of the
target and background probability distributions
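H can be computed directly from the two distributions; a sketch with invented toy frequencies:

```python
import math

def relative_entropy(q, p):
    """H = sum_ij q_ij * log2(q_ij / (p_i * p_j)) -- the expected score
    per aligned residue pair, in bits, when scores are bit-scaled."""
    n = len(p)
    return sum(q[i][j] * math.log2(q[i][j] / (p[i] * p[j]))
               for i in range(n) for j in range(n))

# H = 0 exactly when the target equals the background product,
# and grows as the two distributions separate:
relative_entropy([[0.25, 0.25], [0.25, 0.25]], [0.5, 0.5])  # 0.0
```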
Relative entropy
• Relative entropy (KL divergence) is a measure of
how different two probability distributions are
• Given two probability distributions Q and P,
relative entropy can be informally stated in
several different ways
• The number of additional bits required to encode
samples from P when using a code optimized for Q
• The amount of information lost when Q is used to
approximate P, the true distribution of the data
Relative entropy and substitution matrices
• But how does this relate to substitution
matrices?
• Well, if the target and background frequency
distributions are closely related, then the relative
entropy is low and it is very difficult to
distinguish between the target and background
frequencies. We would therefore require a much
longer alignment.
• On the other hand, if the target and background
frequency distributions are very different, the
relative entropy is high and we’re able to
compute much shorter alignments.
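This tradeoff can be made roughly quantitative: a significant MSP needs about logβ‚‚ N bits, and each aligned pair contributes H bits on average, so the shortest detectable alignment is on the order of logβ‚‚ N / H. A back-of-the-envelope sketch; the helper name is mine:

```python
import math

def min_alignment_length(N, H):
    """Rough shortest detectable alignment length: about log2(N) bits
    are needed for significance, and each aligned residue pair supplies
    H bits on average (H = relative entropy of target vs background)."""
    return math.ceil(math.log2(N) / H)

# Halving H (more similar distributions) doubles the length needed:
min_alignment_length(2 ** 30, 2.0)  # 15
min_alignment_length(2 ** 30, 1.0)  # 30
```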
Example 1 – cystic fibrosis
• Variants in a transport protein have been
associated with cystic fibrosis
• A search of this gene in the PIR protein sequence
database yields the table on the following slide
Example 1 – cystic fibrosis
Altschul, S.F. (1991) “Amino Acid Substitution Matrices from an Information Theoretic
Perspective”, Journal of Molecular Biology, 219:555-565
Example 1 – cystic fibrosis
• Of note, the best PAM-250 score is not higher
than the highest score of a random alignment
given the background frequencies.
• On the other hand, PAM-120 gives alignments in
the same region with scores higher than the
highest chance alignment
• Why do you think PAM-120 is a better fit here?
References
• Explains the connection between information theory
and substitution matrices
• Altschul, S.F. (1991) “Amino Acid Substitution Matrices from
an Information Theoretic Perspective”, Journal of
Molecular Biology, 219:555-565
• Provides much of the theory for the above article
• Karlin, S., Dembo, A. and Kawabata, T. (1990) “Statistical
Composition of High-Scoring Segments from Molecular
Sequences”, The Annals of Statistics, 18(2):571-581
• Karlin, S. and Altschul, S.F. (1990) “Methods for assessing the
statistical significance of molecular sequence features by
using general scoring schemes”, PNAS, 87(6):2264-2268