Alignment and Algorithms •Scoring Matrices: models of evolution •Dynamic Program: speed •Pairwise Sequence Alignment, variants •Significance •Pattern recognition, Hidden Markov Models (HMM) •Multiple Sequence Alignment (MSA) •Protein families •Structure features Scoring Models and Evolution • Several systems of scoring matrices. • Statistically based, curated data: I.e., based on confident alignments of multiple sequences. Two best known: PAM matrices: Dayhoff et al., 1967 BLOSUM matrices: Henikoff & Henikoff, 1992 Scoring matrices, cont. Recall: using p(a,b) = prob. that a (amino acid residue) mutated into b. PAM (Point Accepted Mutation) uses sequence data for proteins which can be aligned with 99% identical sequence. In such aligned sequences, one counts the frequency of each residue, and of each pair of residues in a “column”. Scoring matrices, cont. Basically, the score s(a,b) from this data is a “LOD score”: s(a,b) = log(p(a,b)/p(a)p(b)). Comparison of two “models”: significant association vs. independent randomness. Pruned: scaled to make matrices integervalued. Scoring matrices, cont. Interpretation: Dayhoff, et al. define a short unit of evolutionary time, a PAM unit. Only very conservative changes will be seen in the composition of evolving proteins in this scale. To extend to more distant similarities, used time independent Markov assumption. Scoring matrices, cont. p(a,b) = p(a-->b|t=1) gives the transition probability from a --> b (or b --> a: we cannot keep track of directionality) in one unit of PAM time. The time independent Markov model assumes p(a-->b|t=2) in two units is simply the sum over all pairs of intermediate one time unit transitions: a --> c followed by c --> b Scoring matrices, cont. If P(1) is the original 20x20 matrix of amino acid transition probabilities, then the above says simply P(n) = P(1) x P(1)x…x P(1) (n times) (matrix multiplication) Scoring matrices, cont. So, PAM(n) matrix of LOD scores from transitions in PAM time n. Available: n=1 to n=500: smaller n for more closely related sequences. Most commonly used: 50 to 250. Critique: accurate for small time, goes off at larger times. Compare nucleic acid changes to amino acid changes. Scoring matrices, cont. BLOSUM Matrices From the BLOCKS database. BLOSUM = BLOCKS Substitution Matrix DB of aligned sequence. Conserved regions are marked. The BLOSUM(L) transition probabilities are figured from sequences with L% identity of sequence. Score is a LOD score again. So, large BLOSUM L <--> small PAM n. Warning: corrections for oversampling of closely related sequence! Scoring matrices, cont. If the BLOSUM62 matrix is compared to PAM160 (it's closest equivalent) then it is found that the BLOSUM matrix is less tolerant of substitutions to or from hydrophilic amino acids, while more tolerant of hydrophobic changes and of cysteine and tryptophan mismatches. PAM is less sensitive to the effect of amino acid “return” substitutions, over long time periods. Dynamic Programming Problem: find (1) best score and (2) best alignment Naïve: evaluate each alignment and compare Too many alignments! Need more speed: DP Dynamic Programming, cont. No. of alignments is exponentially large. Use: problem has a linear (left to right) structure to it, from the linearity of proteins, nucleic acids. Analogous to the problem of distance; originally arose in CS, “edit distance”. Dynamic Programming, cont. The basic algorithms here are: (1) Needleman-Wunsch (global alignment) (2) Smith-Waterman (local alignment) [BLAST is a heuristic version of SW.] (3) Many variants. Dynamic Programming, cont. NW algorithm can be understood pictorially. It has three basic steps: • Initialization • Iteration • Traceback Basic insight: you only have to optimize the score one step at a time, and can discard looking at most alignments. Dynamic Programming, cont. Missing scoring feature: gap penalties. Not as well understood as substitutions. Use “geometric random variable models”. s(a,-) is independent of a, say s(a,-) = -d = “gap penalty” For longer gaps, two methods: (a) linear: k gaps in a row gives penalty k x (-d). (b) affine: k gaps in a row gives penalty e + (k-1) x (-d), Where e is the “gap initiation penalty”. Dynamic Programming, cont. Notice: affine gap does not just depend on the position, since you have to know whether you are continuing a gap from the left. First example of dependency in scoring. Key to the iterative step is to realize there are only three ways to extend an alignment: R1 R1 r1 r1 We have to update the score by taking the max of three possibilities: max(s0 + s(R1,r1), s0 + s(R1,-), s0+ s(-,r1)). Then to get the scores of all possible subalignments takes just about 5 MN computations, instead of an exponential number. Dynamic Programming, cont. To derive the actual alignment requires storage: one stores a traceback arrow from each position in the comparison array. This information can be translated into the correct alignment. Variants: Smith-Waterman: local alignment. Adjustment is to 0-out and start over again any time a NW score goes below zero, and procede. One chooses the largest scoreof a segment, and traces back as earlier. Linear storage: saves storage. Repeats: allows for duplication of segments. Dynamic Programming, cont. Variants (cont.): Non-overlapping: Repeats SW, avoiding previous best alignments. Sampling of highest hits: the best alignment score maybe an artifact of our scoring system, so many alignments may be high scoring. Affine gap penalties: actually have to have more states carried along, a precursor to Hidden Markov Models. Significance E-value in BLAST discussed last time. Equivalent to statistical concept of a pvalue. S =score for best alignment is a random variable. The p-value for an alignment is the probability that such a high score could have been found randomly. Significance, cont. This can be estimated two ways: Ungapped local alignment: Karlin-Altschul statistics. Generally, KA not available, rely instead on simulations: take two sequences A and B, and randomly “shuffle” one and recompare. This eliminates composition bias in the teest samples, but randomizes the rest of the biological information in the sequence, namely the order of the residues (or nt’s). SW computes this for 100’s of shuffles. HMM’s These are a special case of a general CS problem: pattern recognition, or machine learning. Can one make the machine a master at recognizing or classifying certain things. There are a large family of such graphical models, this is one of the simplest. Basic problem in MB: can one recognize different functional or structural regions in a protein from it’s a.a. sequence? It’s fold? Or less specifically, what are the common features of a group of proteins. HMM’s, cont. For example, there exist very many G-protein coupled receptors, which fall into seven classes. Can one classify a new protein into (a) being a GPCR, and (b) assigning its class to it from sequence data alone? Simplest example problem: dishonest casino. You are given a sequence of observed random variables or states of the system (faces of a die), but you don’t see hidden states (e.g., whether a casino is using a fair or biased die). Notice that this kind of example has a linear structure to it, like our a.a. sequences. HMM’s, cont. The problem is: given an observed path of (observable states) determine the underlying path through the hidden states. I.e., given the record of the coin flips, say when the biased coin was used and when the fair. Notice that in this model we know how many visible states there are and we are assuming we know how many hidden states there are. Of course, we cannot know with cerrtainty when the fair/biased coin is used, we must seek a probable answer, and the algorithms show how to choose a most probable path through the model. The harder problem (biological problem) is when we do not know what the hidden states are, except where we can define them, e.g., structurally. For example, you could run through protein a.a. sequence and ask a HMM to decide which residues are situated in alpha helices of the protein, or when they llie in the cell membrane. HMM’s, cont. In the first problem, if we know all the parameters, meaning mainly the transition problems for the Hidden Markov states, then it is easy to find a dynamic programming algorithm for finding the best path. This is the Viterbi algorithm. [Sometimes called parsing the hidden markov model, because the method originated in speech recognition work.] The harder problem is to figure out what the parameters are for the hidden states. This is called training the model, and is done by the Baum-Welch algorithm. This is a version of the so-called EM method, or algorithm (Expectation Maximization algorithm). BW, or EM, is a heuristic optimization algorithm. That is, it starts from a given initial guess for the parameters involved and performs a calculation which is guaranteed to improve (make no worse, actually) the previous guess. One runs for a certain time and turns things off by a suitable threshold. HMM’s, cont. Finally, a harder problem still is to “learn” the correct architecture of the model. There are packages for this in the case of protein families/ multiple sequence alignments. In the case of structure, for example, one sees a classical problem in statistics: one can let the data tell directly what the patterns should be (“unsupervised learning”) or you could use expertise to guide the data. This allows fainter structural signals to be “heard”. Problem: is this fittingthe data too tightly, making the model generated harder to be generally applicable? Example of TMHMM and unusual, but necessary faint signals.