Maximal Segment Pair (MSP)

Information theoretic interpretation of PAM matrices
Sorin Istrail and Derek Aguiar

Local alignments
• The preferred method for finding regions of local similarity between two amino acid sequences is to consider segments of all lengths and choose those that optimize a similarity score defined by a substitution matrix.
• PAM and BLOSUM are each a family of matrices constructed to model similarity between amino acid sequences at different evolutionary distances.
• Here, we follow Altschul (1991) to investigate PAM matrices from an information theoretic perspective.

Caveats and assumptions
• The following theory applies only to locally aligned segments that lack gaps.
• Why is this assumption easier to tolerate in local alignment than in global alignment?
• Why is this assumption still restrictive for local alignments?

Notation and definitions
• Amino acids: $a_i$
• Substitution score of aligned amino acids $a_i$ and $a_j$: $s_{ij}$
• A Maximal Segment Pair (MSP) is a pair of equal-length segments from two amino acid sequences that, when aligned, have maximum score.

Random model
• For any two amino acid sequences, there exists at least one MSP.
• It is convenient to compute what MSP scores look like for random sequences, to serve as a basis for comparison.
• We will consider a very simple model:
• Each amino acid $a_i$ appears independently at each position with probability $p_i$, reflecting the actual frequencies of amino acids in protein sequences.
• What could be a more biologically accurate (yet mathematically less tractable) method for generating amino acid sequences?

More assumptions...
• There is at least one positive score in the substitution matrix.
• Why is this reasonable? (Think about what the optimal alignment would be.)
• The expected score for the matrix is negative:
• $\sum_{i,j} p_i p_j s_{ij} < 0$

λ
• Previous theory defines a key parameter $\lambda$ as the unique positive solution of
• $\sum_{i,j} p_i p_j e^{\lambda s_{ij}} = 1$
• But what is $\lambda$?
• Consider the case of multiplying a similarity matrix by some positive constant $c$: what happens to an alignment?

λ
c *

      A    C    G    T    -
 A    1   -1   -1   -1   -1
 C   -1    1   -1   -1   -1
 G   -1   -1    1   -1   -1
 T   -1   -1   -1    1   -1
 -   -1   -1   -1   -1   -∞

MSP score = 8
...AGCGCTAC...
...AGCGCTAC...

=

      A    C    G    T    -
 A    c   -c   -c   -c   -c
 C   -c    c   -c   -c   -c
 G   -c   -c    c   -c   -c
 T   -c   -c   -c    c   -c
 -   -c   -c   -c   -c   -∞

MSP score = c*8
...AGCGCTAC...
...AGCGCTAC...

The MSP score changes, but the MSP and its alignment do not. How does this affect $\lambda$?

λ
• To preserve $\sum_{i,j} p_i p_j e^{\lambda s_{ij}} = 1$ after multiplying the similarity matrix by $c$, we can simply replace $\lambda$ with $\lambda / c$.
• So, one may view $\lambda$ as a scaling parameter for a similarity matrix.

Random Model
• Given two random sequences, how many MSPs with score at least $S$ can we expect by chance?
• $K N e^{-\lambda S}$, where $N$ is the product of the sequences' lengths and $K$ is a calculable parameter.
• This expression is related to the limiting probability distribution of the centered score $\tilde{M}(n) = M(n) - \frac{\ln n}{\lambda^*}$, where $M(n)$ is the MSP score for a sequence of length $n$.
• Theorem 1 (single sequence, Karlin and Altschul 1990): $\mathrm{Prob}\{\tilde{M}(n) > x\} \approx 1 - e^{-K^* e^{-\lambda^* x}}$

Random Model
• Which leads to a Poisson approximation for the number of MSPs with scores exceeding $\frac{\ln n}{\lambda^*} + x$, with parameter $K^* e^{-\lambda^* x}$.
• So, the probability of finding $m$ or more distinct segments with score greater than or equal to $S$ is approximated by $1 - e^{-K^* n e^{-\lambda^* S}} \sum_{i=0}^{m-1} \frac{(K^* n e^{-\lambda^* S})^i}{i!}$. Taking $m = 1$ for a single segment recovers the theorem of the previous slide.

Random Model
• An extension of Theorem 1 to two sequences of lengths $m$ and $n$ yields an important result:
• $\mathrm{Prob}\{M > \frac{\ln nm}{\lambda^*} + x\} \le K^* e^{-\lambda^* x}$, where * denotes an estimated parameter and $M$ is the MSP score.
• Which leads to the result in Altschul (1991):
• The number of MSPs with score at least $S$ is well approximated by $K N e^{-\lambda S}$, where $N$ is the product of the sequences' lengths and $K$ and $\lambda$ are calculable parameters.
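
The scaling argument and the expected count $K N e^{-\lambda S}$ above can be checked numerically. What follows is a minimal Python sketch, not part of the original slides. The function names `solve_lambda` and `expected_msp_count` are hypothetical, the example uses only the ungapped +1/-1 nucleotide scores with uniform background frequencies, and the values K = 0.1 and sequence lengths of 250 are illustrative assumptions.

```python
import math


def solve_lambda(p, s, tol=1e-12):
    """Return the positive root of sum_ij p_i p_j exp(lam * s_ij) = 1 (bisection)."""
    n = len(p)
    pairs = [(p[i] * p[j], s[i][j]) for i in range(n) for j in range(n)]
    # The two assumptions from the slides: negative expected score,
    # and at least one positive score in the matrix.
    if sum(w * x for w, x in pairs) >= 0:
        raise ValueError("expected score must be negative")
    if max(x for _, x in pairs) <= 0:
        raise ValueError("matrix needs at least one positive score")

    def f(lam):
        return sum(w * math.exp(lam * x) for w, x in pairs) - 1.0

    # f(0) = 0 and f is negative just above 0, so bracket the positive root.
    hi = 1.0
    while f(hi) <= 0:
        hi *= 2
    lo = hi
    while f(lo) >= 0:
        lo /= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def expected_msp_count(K, N, lam, S):
    """Expected number of MSPs with score at least S: K * N * exp(-lam * S)."""
    return K * N * math.exp(-lam * S)


# Toy model (assumption): uniform nucleotide frequencies, match +1, mismatch -1.
p = [0.25] * 4
s = [[1 if i == j else -1 for j in range(4)] for i in range(4)]
c = 2.0
s_scaled = [[c * x for x in row] for row in s]

lam = solve_lambda(p, s)                # ln 3 for this matrix
lam_scaled = solve_lambda(p, s_scaled)  # lam / c

# Illustrative values only: K = 0.1, two sequences of length 250, score S = 8.
K, N, S = 0.1, 250 * 250, 8
count = expected_msp_count(K, N, lam, S)
count_scaled = expected_msp_count(K, N, lam_scaled, c * S)
print(lam, lam_scaled, count, count_scaled)
```

For this toy matrix the defining equation reduces to $\frac{1}{4} e^{\lambda} + \frac{3}{4} e^{-\lambda} = 1$, whose positive root is $\lambda = \ln 3$. Scaling the scores by $c$ divides $\lambda$ by $c$ and multiplies $S$ by $c$, so the product $\lambda S$, and with it the expected count, is unchanged.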

Substitution matrices
• Substitution matrices store scores that encode the target frequencies of aligned amino acid pairs in a true alignment, relative to the background amino acid probabilities.
• The scores are $s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p_j}$, where $q_{ij}$ are the target frequencies, $p_i$ and $p_j$ are the background amino acid probabilities, and $\lambda$ is a scaling factor.
• This log ratio compares the probability of the alternative hypothesis (target frequencies) to the probability of the null hypothesis (product of background frequencies).

Local alignment and information theory
• Because scaling the substitution matrix changes $\lambda$ but not the target frequencies, we have the freedom to adjust $\lambda$.
• Let's set $\lambda = \ln 2$, so that scores are measured in bits: $s_{ij} = \log_2 \frac{q_{ij}}{p_i p_j}$.
• Then we can set the expected number of MSPs with score at least $S$ to $p$ and solve for $S$:
• $K N e^{-\lambda S} = p \Rightarrow S = \log_2 \frac{K}{p} + \log_2 N$
• An alignment is typically called significant when $p < 0.05$, and $K$ is typically near 0.1, so $\log_2 \frac{K}{p}$ is only about 1 bit. The score needed to distinguish an MSP from chance is therefore approximated by $\log_2 N$, the number of bits needed to specify where the MSP starts in each of the two sequences.

Relative entropy and substitution matrices
• So, which substitution matrices are the most appropriate for the comparison of two particular sequences?
• To answer this question, consider the average score per residue pair in an alignment:
• $H = \sum_{i,j} q_{ij} s_{ij} = \sum_{i,j} q_{ij} \log_2 \frac{q_{ij}}{p_i p_j}$
• $H$ is exactly the relative entropy of the target distribution with respect to the background distribution.

Relative entropy
• Relative entropy (KL divergence) is a measure of how different two probability distributions are.
• Given two probability distributions $Q$ and $P$, the relative entropy of $Q$ with respect to $P$ can be informally stated in several ways:
• The expected number of additional bits required to code samples from $Q$ when using a code optimized for $P$
• The amount of information lost when $P$ is used and $Q$ is the true distribution of the data

Relative entropy and substitution matrices
• But how does this relate to substitution matrices?
• Well, if the target and background frequency distributions are closely related, then the relative entropy is low and it is very difficult to distinguish the target from the background frequencies. We would therefore require a much longer alignment.
• On the other hand, if the target and background frequency distributions are very different, the relative entropy is high and much shorter alignments suffice.
• Since each aligned pair contributes $H$ bits on average and about $\log_2 N$ bits are needed, the shortest alignment distinguishable from chance has length roughly $\log_2 N / H$.

Example 1 – cystic fibrosis
• Variants in a transport protein have been associated with cystic fibrosis.
• A search with this protein's sequence in the PIR protein sequence database yields the table on the following slide.

Example 1 – cystic fibrosis
Table: Altschul, S.F. (1991) “Amino Acid Substitution Matrices from an Information Theoretic Perspective”, Journal of Molecular Biology, 219:555-565

Example 1 – cystic fibrosis
• Of note, the best PAM-250 score is not higher than the highest score of a random alignment given the background frequencies.
• On the other hand, PAM-120 gives alignments in the same region with scores higher than the highest chance alignment.
• Why do you think PAM-120 is a better fit here?

References
• Explains the connection between information theory and substitution matrices:
• Altschul, S.F. (1991) “Amino Acid Substitution Matrices from an Information Theoretic Perspective”, Journal of Molecular Biology, 219:555-565
• Provide much of the theory for the above article:
• Karlin, S., Dembo, A., and Kawabata, T. (1990) “Statistical Composition of High-Scoring Segments from Molecular Sequences”, The Annals of Statistics, 18(2):571-581
• Karlin, S. and Altschul, S.F. (1990) “Methods for Assessing the Statistical Significance of Molecular Sequence Features by Using General Scoring Schemes”, PNAS, 87(6):2264-2268
