Evolutionary Models for Multiple Sequence Alignment CBB/CS 261 B. Majoros Part I Evolutionary Sequence Models The Utility of Evolutionary Models Evolutionary sequence models make use of the assumption that natural selection operates more strongly on some genomic features than others (i.e., functional versus non-functional DNA elements), resulting in a detectable bias in sequence conservation for the features of interest. (Siepel et al., 2005) More generally, conservation patterns may differ between levels of DNA organization (i.e., amino acids in coding segments, versus individual nucleotides in conserved noncoding elements). Recall: Multivariate HMMs CCATATAATCCCAGGCTCCGCTTCA AGATATCATCAGATGCTACGTATCT ... push N tracks ACATAATATCCGAGGCTCCGCTTCG single emission from a single state Each state emits N residues, one per track—i.e., a column of a multiple alignment. Thus, each state must have a model for the joint distribution of the tracks (to represent emission probabilities). Q: how might we model dependencies between tracks? Non-independence of Sequences Due to their common ancestry, genomic sequences for related taxa are not independent. We can control for that non-independence by explicitly modeling their dependence structure using a phylogenetic tree: Branch lengths represent evolutionary distance, which conflates the distinct phenomena of elapsed time and substitution rate (as well as selection and drift). We will see later that a phylogenetic tree (or “phylogeny”) can be interpreted as a special type of Bayesian network, in which sequence conservation probabilities are expressed as a function of the branch lengths. PhyloHMMs A PhyloHMM is a discrete multivariate HMM in which each state qi has an associated evolution model i describing the expected rates and patterns of evolution in the class of features represented by that state. (Siepel et al., 2005) Evaluating the Emission Probability Bayesian network alignment P(G) ...A... ...B... ...C... ...D... G human unobservables chimp mouse E rat A P(A,B,C,D) F B C D observables P(G)P(E G)P(A E)P(B E)P(F G)P(C F)P(D F) G,E,F P(observables) P(root) P(v parent(v)) unobservables nonroot v A Recursion for the Emission Likelihood The likelihood can be computed using a recursion known as Felsenstein’s pruning algorithm: (u,a) if u is a leaf Lu (a) Lc (b)P(c b | u a) otherwise c b children(u) =P(descendents of u|u=a). P(c=b|u=a) = probability of observing b in the child, given that we observe a in the parent. We can model this using a matrix of substitution probabilities, parameterized by the evolutionary time t that has passed between the ancestor and descendant taxa: pC G pGC pGG pT C pT G pC T pGT pT T ancestor pC C A C G T A pAA pCA P(t) pGA pT A descendant C p G pT pAC AG AT Substitution Matrix vs. Rate Matrix There is an important distinction between a substitution matrix and an instantaneous rate matrix. Q = Instantaneous rate matrix. Gives instantaneous rates of substitutions (not parameterized by time). P(t) = Substitution matrix. Gives the probabilities of substitutions for a specific branch length, t (“time”). Given Q and a set of phylogeny branch lengths {ti}, we can compute a substitution matrix P(ti) for each branch... One Q, Many P(t)’s P(t4 ) P(t1 ) P(t5 ) P(t2 ) human dolphin P(t3 ) whale kangaroo P(t) e P(t6 ) Qt Continuous-time Markov Chains (CTMCs) Substitution models are typically based on continuous-time Markov chains. The Markov property for CTMCs states that: P(t s ) P(t )P( s ) We can derive an instantaneous rate matrix Q from P(t), where we make use of the fact that P(0)=I. dP(t ) P (t t ) P(t ) P(t )P(t ) P(t )I lim lim t 0 t 0 dt t t P(t ) P(0) P(t ) lim P(t )Q t 0 t does not depend on t The solution to this n n 2 2 Q t Q t I tQ ... differential equation is: P(t ) eQt n 0 n! 2! eQt (the “matrix exponential”) denotes a Taylor expansion, which we can solve via spectral (eigenvector) decomposition: 1 2 Q G 1 G n e t 1 t 2 e P(t ) G 1 G t n e Desirable Properties of Substitution Matrices Reversibility: (“detailed balance”) ij(πiPij(t)=πjPji(t)) where πi is the equilibrium frequency of base i. Transition/Transversion Discrimination: Using different parameters for transition and transversion rates. Transitions: purinepurine, pyrimidinepyrimidine Transversions: purinepyrimidine Some Common Forms for Q Jukes-Cantor: Q JK Kimura: Q K 2 P Felsenstein: C G T A G T Q FEL A C T A C G these assume uniform equilibrium frequencies! Hasegawa, Kishino, Yano: Q HKY A A A C G G C C G T T T General reversible model: C G T A G T Q REV A C T A C G Part II Multiple Alignment Recall: Pairwise alignment with PHMMs • Emission probabilities assess similarity between aligned residues BLOSUM62 • Transition probabilities can be used to penalize gaps δ = gap open ε = gap extend • Viterbi decoding finds the optimal alignment ε ε I δ 1-ε M 1-2δ D δ Pair HMM Triple HMM Quadruple HMM ? for 100bp sequences... pair HMM aligns 2 species triple HMM aligns 3 species #gallons of water in the Pacific ocean quadruple HMM aligns 4 species (Holmes & Bruno, 2001) Progressive Alignment: One Pair at a Time -AAGTAGCATACG--GGCA-ATT -AACT--CATACG--CCCAGATT T-AGT--CATACGGG--CA-ATC T-AGT--CGTATGGGGGCA-GTT profile-to-profile alignment TAGTCATACGGG--CAATC TAGTCGTATGGGGGCAGTT AAGTAGCATACGGGCA-ATT AACT--CATACGCCCAGATT AACTCATACGCCCAGATT AAGTAGCATACGGGCAATT TAGTCATACGGGCAATC sequence-to-sequence alignment TAGTCGTATGGGGGCAGTT • First, align the leaves of the tree (using a Pair HMM) • Then align ancestral taxa, using either a “consensus” sequence for ancestors, or averaging over all pairs of leaf residues A more principled approach: model the ancestral sequences explicitly, using a probabilistic evolutionary model... Networks of Residues Problem: Align the sequences ABC, DE, FG, and HI. Solution: AB-C-D-E--FG---HI PQRS T J KL ABC DE MNO FG H I The multiple alignment problem is precisely the problem of inferring the network of residue homologies—i.e., the evolutionary history of each base. ABC Building the Network FG DE H I ABC DE FG HI ABC Building the Network DE FG FG-HI H I ABC DE FG HI ABC Building the Network DE unobservables FG FG-HI MNO FG H I H I observables ABC DE FG HI ABC Building the Network ABC -DE DE FG FG-HI FG H I ABC DE MNO FG HI H I ABC Building the Network ABC -DE ABC DE FG FG-HI DE FG HI DE MNO FG H I ABC J KL H I ABC Building the Network ABC DE FG MNO FG H I ABC DE DE FG HI H I J KL J KL MNO ABC Building the Network ABC DE FG MNO FG H I ABC DE DE FG HI H I J KL J KL MNO JK-L--MNO ABC Building the Network J KL DE J KL ABC DE FG MNO FG H I MNO H I JK-L--MNO PQRS T J KL ABC DE FG HI ABC DE MNO FG H I ABC Building the Network J KL FG MNO FG H I AB-C-D-E--FG---HI ABC DE J KL ABC DE DE FG HI MNO H I PQRS T J KL ABC DE MNO FG H I Evaluating Emission Probabilities B K X MNO X D DE J KL ABC P(X) N FG H I G P(B,D,G,H) H P(X)P(K X)P(B K)P(DK)P(N X)P(G N)P(H N) X ,K ,N P(observables) P(root) P(v parent(v)) unobservables nonroot v Sampling Alignments = a sequence S = (B,H), a “Branch-HMM” (transducer) B describing the evolutionary process whereby the child evolves from the parent, and the actual indel history H which is a specific realization of this process (a “draw”) Sampling of alignments proceeds by sampling pairwise “branch alignments” (or “indel histories”) H that live within the yellow squares. An indel history is simply a path through a Pair HMM. Sampling branch alignments is simple: just sample from a PHMM via Forward or Backward: B P (y y )P (s ,s y ) i1, j 1,y k 1 k Bi, j,y k1 Bi, j 1,y Pt (y k y k 1)Pe (,s j,2 y k ) k P(y k y k 1,i, j,S1,S2 ) Bi, j,y k1 Bi1, j,y Pt (y k y k 1)Pe (si,1, y k ) k Bi, j,y k1 t k e i,1 j,2 k if y k QM if y k QI if y k QD Posterior Alignment Matrix sequence 2... pixel intensity = posterior probability of a match in that cell (posterior probability: conditional on the full input sequences) sequence 1... Banding: Reduce the Search Space Block Rearrangements are a Problem! The simple case: Translocation Duplication Inversion Summary • Optimal MSA computation is intractible in the general case • Progressive alignment is more tractible, but is greedy • Iterative refinement attempts to undo greedy decisions • PairHMM’s provide a principled way to perform pairwise steps • Felsenstein’s algorithm computes likelihoods on phylogenies • Substitution models can use continuous-time Markov chains • Large-scale rearrangements are a problem • Banding can improve alignment speed