The distance between sequences, Part I. Foundations
M. Elizabeth Corey, Ph.D.
UCSC Extension and UC Berkeley Extension

A simple start
Suppose we have two sequences, A and B,
A = {a1, a2, …, am} and B = {b1, b2, …, bn},
and we want to know how similar they are. What is the basis for their similarity?

The practical measure
What we usually do is obtain an alignment and then score it using the sum of the pairwise scores:
S_AB = Σ_{i=1}^{n} s(a_i, b_i)

The nice metric
Wouldn't it be nice if we could simply say that the distance between sequences was the geometric sum of the distances between loci in the sequences?
D_AB = sqrt( Σ_{i=1}^{n} (a_i − b_i)^2 )

Dynamic programming methods
• Computational method dating from the 1940s, introduced to biology as "Needleman-Wunsch" in 1970.
• A numerical value is assigned to every cell in the array, giving the similarity/dissimilarity of residues.
• The example shown:
  – match = +1
  – mismatch = null (value 0)
[Figure: match array for the sequences a b c n j r q c l c r p (columns) and a j c j n r c k c r b p m (rows); cells where the residues match hold 1, all other cells 0.]

Dynamic programming methods
• GOAL: For each cell, find the maximum possible score for an alignment ending at that point.
• Searches the subrow and subcolumn, as shown, for the highest score.
• Adds this to the score for the current cell.
• Proceeds row by row through the array.
[Figure: the same array after the row-by-row pass, with each cell holding the maximum score of an alignment ending at that point.]

Maximum bipartite matching
Series of solutions, starting with Dijkstra, 1950s.
[Figure: bipartite graph with the residues of A on one side and the residues of B on the other; each edge a_i–b_j carries its pairwise score s(a_i, b_j).]
Find the set of matches that provides maximum flow. Each match, a_i to b_j, has a capacity equal to its pairwise score.

Alignment's not really the problem
• Optimal alignment falls into a set of problems with a long history in computer science.
• The underlying metric for distances between sequences falls in the province of biology.

Beguiled by a matrix (PAM)

PAM
• PAM starts with closely related sequences from 34 superfamilies, grouped into 71 evolutionary trees.
• PAM rests on a measure of amino acid "mutability".
• PAM attempts to capture a representative slice of evolutionary behavior.

PAM (from Dayhoff, Schwartz and Orcutt)
• Obtain alignments for homologous proteins.
• Compute scoring matrix elements using:
  s_ij(t_0) = G · m_i · a_ij / Σ_{j≠i} a_ij
  s_ii(t_0) = 1 − G · m_i
  where a_ij is the substitution frequency, m_i is the mutability of i, and G is a proportionality constant.
• Extrapolate to longer evolutionary distances by using {S(t_0)}^n.

Limitations of PAM matrices
PAM matrices are built from alignments with > 85% identity. The entries in the initial scoring matrix, S(t=1), arise from short-time-interval substitutions; raising S(1) to a higher power may not capture some interesting substitutions with longer rate constants.

The Gutzwiller temptation
• An abstract dynamical system (M, μ, f_t):
  – a measurable space, M, composed of the set of all sequences;
  – a measure μ based on transition probabilities;
  – a group of automorphisms, f_t, that map M onto itself and preserve μ, where the variable t runs through the integers.

What's Bernoulli got to do with it?
• A scheme with subshift
  – The measure on M is generated by the sets A_{i,j,k} = {a | a_i = j, a_{i+1} = k}, whose measure is given by a matrix of transition probabilities p_jk ≥ 0.
  – A future event a_1 depends on a_0; hence, memory.
  – Realized in the geodesic flow on a compact closed surface of constant negative curvature.
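To make the transition-probability measure concrete, here is a minimal sketch (not from the slides) that builds a small nucleotide transition matrix p_jk, checks that each row is a probability distribution, and runs the shift so that each new symbol depends only on its predecessor, the "memory" referred to above. The matrix entries are invented for illustration, not estimated from data.

    import random

    STATES = "ACGT"

    # Illustrative transition probabilities p_jk = P(a_{i+1} = k | a_i = j).
    # The numbers are made up for this sketch; each row must sum to 1.
    P = {
        "A": {"A": 0.91, "C": 0.02, "G": 0.05, "T": 0.02},
        "C": {"A": 0.02, "C": 0.91, "G": 0.02, "T": 0.05},
        "G": {"A": 0.05, "C": 0.02, "G": 0.91, "T": 0.02},
        "T": {"A": 0.02, "C": 0.05, "G": 0.02, "T": 0.91},
    }

    # Each row of the transition matrix is a probability measure over the next symbol.
    for row in P.values():
        assert abs(sum(row.values()) - 1.0) < 1e-9

    def step(current, rng=random):
        """Draw the next symbol from the row of P indexed by the current one."""
        r, cumulative = rng.random(), 0.0
        for k in STATES:
            cumulative += P[current][k]
            if r <= cumulative:
                return k
        return STATES[-1]

    def simulate(start, steps, rng=random):
        """Run the shift: a_{i+1} depends only on a_i (one-step memory)."""
        seq = [start]
        for _ in range(steps):
            seq.append(step(seq[-1], rng))
        return "".join(seq)

    if __name__ == "__main__":
        random.seed(0)
        s = simulate("A", 10_000)
        # Long-run symbol frequencies approximate the stationary measure of the chain.
        print({k: round(s.count(k) / len(s), 3) for k in STATES})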
System behaviors
• Ergodicity: Transition probabilities are positive recurrent and aperiodic.
• Mixing: Inheritance and Mendelian exceptions lead to mixing.
• K-systems: Speciation events rigidly segregate M; other segregations exist.

Our salad days
• Jukes-Cantor
• HKY
• Kimura 2-Parameter
• PAM
• BLOSUM

General Stationary Time-reversible Model

R =
        A          C          G          T
  A     .          pC·rCA     pG·rGA     pT·rTA
  C     pA·rAC     .          pG·rGC     pT·rTC
  G     pA·rAG     pC·rCG     .          pT·rTG
  T     pA·rAT     pC·rCT     pG·rGT     .

(Diagonal elements are chosen so that each row sums to zero.)
Time reversibility: p_i · r_ij = p_j · r_ji

General Stationary Time-reversible Model
P(t) = e^{Rt}
Given rates, one can find transition probabilities, and vice versa.

Jukes-Cantor

R =
       -3a    a     a     a
        a    -3a    a     a
        a     a    -3a    a
        a     a     a    -3a

Kimura 2-Parameter

R =
        A    C    G    T
  A     .    b    a    b
  C     b    .    b    a
  G     a    b    .    b
  T     b    a    b    .

a/b = transition/transversion bias

HKY (Hasegawa, Kishino, Yano)

R =
        A        C        G        T
  A     .        m·pC     m·k·pG   m·pT
  C     m·pA     .        m·pG     m·k·pT
  G     m·k·pA   m·pC     .        m·pT
  T     m·pA     m·k·pC   m·pG     .

k = transition/transversion bias

The BLOSUMn matrices
• Start with multiple, ungapped alignments of proteins found using PROTOMAT.
• Build clusters by placing together sequences with at least n% identity.
• Measure the score for each pair, defined as:
  s_ij = 2 · log2(p_ij / e_ij)
  where e_ij is the expected probability of occurrence of the i,j pair and p_ij is the observed probability of the i,j pair.

Limitations
Naive approach: measure frequencies of aligned pairs and gaps in randomly selected, confirmed alignments to get p_ij, and use a "random" set of sequences to obtain e_ij.
• Difficulty 1: It is difficult to get a good random sample of sequences or alignments – databases are biased.
• Difficulty 2: When sequences diverged from a common ancestor recently, p_ij is small and s is strongly negative. When sequences diverged long ago, p_ij tends to e_ij and s approaches zero.

A short compendium of distances and scores
• Jukes-Cantor distance
• Kimura distance
• Dayhoff evolutionary distance
• BLOSUM scores
• Profile scores
• Average scores

References
• Gu, X. & Li, W.-H., 1996. A general additive distance with time-reversibility and rate variation among nucleotide sites. Proc. Natl. Acad. Sci. USA 93: 4671-4676.
• Hasegawa, M., Kishino, H. & Yano, T., 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22: 160-174.
• Sanderson, M. J. & Shaffer, H. B., 2002. Troubleshooting molecular phylogenetic analyses. Annu. Rev. Ecol. Syst. 33: 49-72.

The distance between sequences, Part II. Careful Measures
M. Elizabeth Corey, Ph.D.
UCSC Extension and UC Berkeley Extension

Exceptions to Mendel's Laws
The theory: a chromosomal basis of inheritance.
Some so-called exceptions:
• linkage and recombination
• gene conversion
• transposition and mobile genetic elements
• a plethora of other mutations: point mutations, reversions, deletions, frameshifts, duplications, inversions
"Exceptions" do not result in rejection of Mendelian genetics but in a better understanding of the mechanisms underlying Mendelian inheritance.

Mutation frequencies (#mutations/generation)
• Frequency of point mutation: 10^-7 to 10^-8.
• Reversion of point mutations: ~10^-8. Sometimes called back mutation, sometimes called convergence.
• Reversion of deletion mutations: undetectably small. "Loss of function" mutations result in grossly lower biological fitness. The rate of extinction due to gross "loss of function" is much greater than the rate of reversion, so the line will die long before reversion can occur. In the aggregate, the record will show a pseudoreversion.
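To get a feel for what a per-generation frequency of 10^-7 to 10^-8 implies, here is a small back-of-the-envelope sketch (the generation counts are illustrative assumptions, not values from the slides): the chance that a given site has mutated at least once after g generations is 1 − (1 − μ)^g, which is well approximated by 1 − e^(−μg).

    import math

    def p_at_least_one(mu, generations):
        """P(a site mutates at least once in g generations) for per-generation frequency mu."""
        return 1.0 - (1.0 - mu) ** generations

    if __name__ == "__main__":
        for mu in (1e-7, 1e-8):              # point-mutation frequencies quoted above
            for g in (1e4, 1e6, 1e8):        # illustrative numbers of generations (assumed)
                exact = p_at_least_one(mu, g)
                poisson = 1.0 - math.exp(-mu * g)   # Poisson approximation
                print(f"mu={mu:.0e}  g={g:.0e}  P(>=1 mutation)={exact:.4f}  (~{poisson:.4f})")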
Mutation frequencies
• Deletions: 10^-6 – dependent on chromosomal region. Caveat: may be underestimated; deletions are less detectable because they are often lethal.
• Frameshifts: 10^-6 – often repaired.
• Duplications: 10^-3 – in E. coli, approximately 0.1% of a culture for a given region of the chromosome.
• Inversions: hard to detect, not always mutations.
• Gene conversions: still unknown. Reparative.
• Mutators increase mutation frequencies by ~100-fold; they work on "hot spots".

Protein-based inheritance – Prions
• Proteins that change their shape in response to fluctuating environmental pressures, and then maintain that shape during mitosis and meiosis, constitute a form of cellular memory.
• Various structural conformations are propagated outside of the traditional genetic framework.

Hsp90 and Sup35
• A buffer for silent polymorphisms: Hsp90
  – promotes the folding of signal transducers
  – buffers the effects of many silent polymorphisms
  – may serve as a capacitor of evolutionary change, storing and releasing genetic variation
• "Epigenetic inheritance": the Sup35 prion

James Joyce's List
Milk
Call mom!
Lettuce
Plumb the smithy of my soul for the unborn race-consciousness…
Rent
Thriving in fluctuating environments by exploiting pre-existing genetic variations.

References
Recent publications on conformational change and evolution:
• Queitsch, C., Sangster, T.A. & Lindquist, S., 2002. Hsp90 as a capacitor of phenotypic variation. Nature 417: 618-624.
• Jensen, M.A., True, H.L., Chernoff, Y.O. & Lindquist, S., 2001. Molecular population genetics and evolution of a prion-like protein in Saccharomyces cerevisiae. Genetics 159: 527-535.
• True, H.L. & Lindquist, S.L., 2000. A yeast prion provides an exploratory mechanism for genetic variation and phenotypic diversity. Nature 407: 477-483.
• Rutherford, S.L. & Lindquist, S., 1998. Hsp90 as a capacitor for morphological evolution. Nature 396: 336-342.

Mutations and time
Take a series of sequences and figure out how different they are by counting up their substitutions.
[Figure: sequences A, B and C; A and C are separated by 6 substitutions, A and B by 5 substitutions, B and C by 3 substitutions.]

Mutations and time
What process takes us from A to B to C?
• A to B: a gene conversion.
• B to C: a frameshift (repairable) plus 2 point accepted mutations.
• A and C: no direct ancestry.

Counting mutations
Consider a counting process {N(t), t ∈ T}, where N(t_i) − N(t_j) is the number of mutations in the time interval (t_j, t_i].
There is no direct ancestry between A and C, but we can still count substitutions:
N(t_AC) = 6 PM
N(t_AB) = 1 GC
N(t_BC) = 1 FS, 2 PM

Times on the edges of the tree
The "interoccurrence" times between mutations, τ_1 = t_1 − t_0, τ_2 = t_2 − t_1, …, τ_i = t_i − t_{i−1} (with t_0 = 0), are exponential random variables with mean 1/β, such that
P[τ_i > h] = e^{−βh} and P[τ_i ≤ h] = 1 − e^{−βh} for h ≥ 0.

Edge times
Gene conversions: β_gc = 1 gc / 2,000* years
Frame shifts: β_fs = 1 shift / 5,000 years
Point mutations: β_pm = 1 pm / 10,000 years
Expected edge times:
A–B: 1/β_gc = 2,000 yrs
A–C: 6/β_pm = 60,000 yrs
B–C: 2/β_pm + 1/β_fs = 25,000 yrs
*Just a wild guess.

Edge times
Population of A = 10^5
Population of B = 10^6
Population of C = I don't care.
A–C: 6/(N_A · β_pm) = 60 × 10^-2 yrs
A–B: 1/(N_A · β_gc) = 2 × 10^-2 yrs
B–C: 2/(N_B · β_pm) + 1/(N_B · β_fs) = 25 × 10^-3 yrs
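Here is a minimal sketch of the waiting-time model above, using the slides' illustrative rates: interoccurrence times are exponential with mean 1/β, so an edge carrying k events of one type has expected duration k/β, and P[τ > h] = e^(−βh).

    import math
    import random

    # Illustrative rates from the slides (events per year).
    B_GC = 1 / 2_000    # gene conversions
    B_FS = 1 / 5_000    # frameshifts
    B_PM = 1 / 10_000   # point mutations

    def survival(rate, h):
        """P[interoccurrence time > h] = exp(-rate * h)."""
        return math.exp(-rate * h)

    def expected_edge_time(event_counts):
        """Expected edge duration: sum over event types of (number of events) / rate."""
        return sum(count / rate for count, rate in event_counts)

    if __name__ == "__main__":
        print("A-B (1 gene conversion):   ", expected_edge_time([(1, B_GC)]), "yrs")
        print("A-C (6 point mutations):   ", expected_edge_time([(6, B_PM)]), "yrs")
        print("B-C (2 pm + 1 frameshift): ", expected_edge_time([(2, B_PM), (1, B_FS)]), "yrs")
        print("P[no point mutation in 10,000 yrs]:", survival(B_PM, 10_000))

        # Sanity check by simulation: the mean of sampled waiting times is about 1/beta.
        random.seed(0)
        draws = [random.expovariate(B_PM) for _ in range(100_000)]
        print("simulated mean waiting time for a point mutation:", sum(draws) / len(draws))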
Calculating divergence times
Doolittle, R.F., Feng, D.-F., Tsang, S., Cho, G. & Little, E., 1996. Determining divergence times of the major kingdoms of living organisms with a protein clock. Science 271: 470-477.

Calculating divergence times
Task: Build a model for evolutionary time based on pairwise distances, d_ij, and the fossil record.
– Start with the vertebrate fossil record – the biogeochemistry gives reliable times.
– Map the fossil-based phylogeny to the sequence-based phylogeny and compare edge lengths.
– Adjust the sequence-based time model to match the vertebrate fossil record.

Using the fossil record
Vertebrates: time of first appearance in the fossil record versus sequence similarity.
[Figure: distance measure (0–30) plotted against time of first appearance in the fossil record (0–600 Ma).]

Readjusting the clock
After sampling the vertebrate fossil record and fitting the sequence data to it, they maintain the same clock.
Result: eukaryotes and prokaryotes diverged about 2.5 billion years ago.

On fitting the fossil record to sequence data
Challenges: unequal rates of change in different species due to
– different reproductive cycles in different species
– different base population sizes in different species.
Obtaining bacterial mutation rates from vertebrate mutation rates when we are looking at the evolution of populations: how viable is it?

Population mutation
Suppose an average rate of mutation per site is about 10^-7 (ignoring duplications).
Compare lengths of reproductive cycles:
– Prokaryotes (blue-green algae and bacteria): 20 minutes to an hour per generation.
– Humans: in the US, the average time to first child is 24.8 years.
How many times does a bacterium reproduce in the time it takes a human being to reproduce? 24 × 365 × 25 = 219,000.
So if we are comparing bacterial mutation rates to human mutation rates and we are looking at aggregate populations, we have to adjust by a factor of about 10^6.

Population mutation
Size of the base population on planet Earth:
5 × 10^30 prokaryotes (Bill Whitman, University of Georgia), including about a mole of bacteria
3 × 10^9 humans
How many bacteria are there, propagating how fast, in comparison to humans? Worst-case ratio?
Calculate using base population × rate of generation × number of mutable genes:
(10^23 × 10^6 × 10^3) / (10^9 × 1 × 10^4) = 10^19

One final issue: The Success Question
When mutations succeed, they succeed within an ecological niche. So when we ask "When did a species arise?", it is not enough to ask about the likelihood of a certain kind of mutation; one must also ask: what is the likelihood that that mutation arose in a niche that would support it?
So, don't forget about acceptance rates.

The FOXP2 point mutation
Enard et al., "Molecular evolution of FOXP2, a gene involved in speech and language", Nature, Vol. 418, August 22, 2002.

Silent/expressed mutations in FOXP2
[Figure: tree relating orangutan, gorilla and human through ancestral nodes OHG and HG; edge labels give amino acid / DNA substitution counts (1/2, 0/7, 2/2, 0/2).]
Edge labels are: amino acid / DNA substitutions.

Selective sweeps
Measures for determining the existence of a sweep:
– Tajima's D: from Genetics, 1989 (conservative)
– Fay and Wu's H: from "Hitchhiking under positive Darwinian selection", Genetics, 2000.
Also, Griffiths and Tavaré estimate selection using linked SNP data.

Population mutation rates
P_ia = 4 · N_a · β_i – the population mutation rate for site i in species a, where N_a is the effective population size of species a and β_i is the mutation rate per generation at site i.
[Figure: distribution of the number of point mutations (0–16), with frequencies up to about 0.2.]

Tajima's D for FOXP2
Tajima's D = −2.20, P = 0.03%
S/a_n = 0.079%, where S is the number of segregating sites and a_n is a normalizing constant computed from the sample size n (a_n = Σ_{i=1}^{n−1} 1/i).

Discovering different rate constants
Finding the time of appearance of the FOXP2 segregation:
• Sample the current human population worldwide.
• Generate trees with different times for the human sequence data.
• Measure the likelihood of the different trees.
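Because Tajima's D carries much of the weight of the sweep argument above, here is a minimal sketch of the statistic built directly from its published definition (Tajima, Genetics 1989); the toy alignment at the bottom is invented and is not the FOXP2 data.

    from itertools import combinations

    def tajimas_d(sequences):
        """Tajima's D from aligned, gap-free sequences, following Tajima (1989)."""
        n = len(sequences)
        length = len(sequences[0])

        # S: number of segregating (polymorphic) sites.
        S = sum(1 for i in range(length) if len({s[i] for s in sequences}) > 1)
        if S == 0:
            return 0.0

        # pi: mean number of pairwise differences.
        pairs = list(combinations(sequences, 2))
        pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

        # Normalizing constants that depend only on the sample size n.
        a1 = sum(1.0 / i for i in range(1, n))
        a2 = sum(1.0 / i ** 2 for i in range(1, n))
        b1 = (n + 1) / (3.0 * (n - 1))
        b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
        c1 = b1 - 1.0 / a1
        c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
        e1 = c1 / a1
        e2 = c2 / (a1 ** 2 + a2)

        # D compares pairwise diversity (pi) with the segregating-site estimate (S / a1).
        return (pi - S / a1) / ((e1 * S + e2 * S * (S - 1)) ** 0.5)

    if __name__ == "__main__":
        # Toy alignment: an excess of rare variants pushes D negative, as after a sweep.
        toy = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAAGA", "AAAAAACAAA", "AAAAATAAAA"]
        print(round(tajimas_d(toy), 3))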
Multiple rates
The automorphism f, mapping M onto itself, used to be a simple shift operation. Now it incorporates several underlying processes, including:
– mutation of the bases (mutation rate)
– expression of the mutations (expression rate)
– stabilization of a conformational phenotype (stabilization rate)
– success of the substitution (acceptance rate)

The distance between sequences, Part III. Algorithms for phylogenies
M. Elizabeth Corey, Ph.D.
UCSC Extension and UC Berkeley Extension

Motivation
• Phylogenies provide measures of similarity and can lay a foundation for scoring alignments.
• Rate structures provide indicators for motifs.
• Branch points allow us to identify and classify interesting bases.
  – If the branch points are in phenotypic trees, the mutating bases can be used as phenotypic identifiers.
  – If the branch points are in genotypic trees, mutating (non-silent) bases can be used as genetic identifiers.

What goes into a phylogeny?
[Diagram: pairwise and multiple alignments feed phylogeny construction through distance measures (UPGMA, NN) and site information (MLE and parsimony); substitution scores, equilibrium distributions for MLE, and transition probability data are also inputs.]

What do we get in return?
[Diagram: phylogenies give back guide trees for pairwise and multiple alignment, rates and probabilities, scoring matrices, and transition probability data.]

Part III: Goals
• Depict methods for finding guide trees for progressive multiple alignment.
• Clarify the differences between MLE, Maximum Parsimony and Distance Methods, and identify the optimization techniques appropriate for each.
• Define a new approach for faster identification of near-optimal phylogenies.

Progressive multiple alignment
• Choose a set of scores for sequence comparison
  – Alignment scores from Needleman-Wunsch, Smith-Waterman and variants
  – Consensus word scores from BLAST, PSI-BLAST and others
  – Substitution (scoring) matrices – PAM, BLOSUM, Jukes-Cantor, etc.
• Construct a reputable guide tree
  – Hierarchical clustering (UPGMA, Neighbor-Joining, Fitch and Margoliash)
  – Maximum Parsimony (simple or weighted)
  – Maximum Likelihood Estimation (MLE)
• Use the guide tree to produce an alignment

Tree evaluation – Parsimony
• Given a semi-labeled tree, it is possible to determine the tree's internal nodes (ancestral sequences) using a parsimony algorithm.
• Evaluation function: a summation of the scored mutations in the parsimonious tree.

Parsimony – Illustrated
[Figure: three leaf sequences ADC, ABE and ACC; internal node 1 = A(B or D)C with cost 1, node 2 = A(B or C)(E or C) with cost 2, and root node 3 = ABC with total cost 3.]

Example: Simple Parsimony
Initialization: Set the cost C = 0. Set k = 2n − 1, where n is the number of sequences.
Recursion, to compute node N_k:
  If k is a leaf node, N_k = sequence k.
  If k is not a leaf node, compute N_i and N_j for the daughter nodes of N_k.
    Where the intersection of N_i and N_j is nonempty, set N_k = N_i ∩ N_j;
    otherwise increment the cost by the number of non-matching residues and set N_k = N_i ∪ N_j.
Termination: minimum cost of the tree = C.

Tree evaluation – Distance methods
• Given a set of alignment scores, but without assuming a tree topology, it is possible to determine a tree and its edge lengths using a distance method. This is sometimes called minimum evolution and includes the hierarchical clustering methods.
• Evaluation function: the sum of the edge lengths.

Hierarchical Clustering – Illustrated: UPGMA
[Figure: UPGMA applied to five leaves (1–5); internal nodes 6–9 are created as clusters merge, with node heights t1 = t2 = ½ d_12, …, up to ½ d_68 at the root. From Durbin et al., 2001.]

Algorithm: UPGMA
Input: N sequences and their relative distances, d_ij.
Initialization: Assign each sequence to its own cluster, C_i. Define a leaf of T for each sequence and place it at height 0.
Iteration:
  Pick two clusters C_i, C_j such that d_ij is minimal.
  Define a new cluster k by C_k = C_i ∪ C_j.
  Define a new set of distances {d_kl} between C_k and all current clusters.
  Define a node k with daughter nodes i and j, and place it at height ½ d_ij.
  Add k to the set of current clusters and remove i and j.
Termination (rooted): When only two clusters i, j remain, add the root at height ½ d_ij.
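A minimal sketch of the UPGMA iteration just described; the four-sequence distance matrix is invented, and for clarity cluster-to-cluster distances are recomputed as plain averages each round rather than updated incrementally.

    from itertools import combinations

    def upgma(labels, dist):
        """UPGMA: repeatedly merge the two closest clusters, placing each new node
        at height 1/2 d_ij; distances between clusters are average leaf-to-leaf distances.

        labels: sequence names; dist: dict mapping frozenset({a, b}) -> distance.
        Returns (nested-tuple topology, root height).
        """
        # Each cluster is (subtree, height of its root, list of leaf members).
        clusters = [(name, 0.0, [name]) for name in labels]

        def d(ci, cj):
            members_i, members_j = ci[2], cj[2]
            total = sum(dist[frozenset((p, q))] for p in members_i for q in members_j)
            return total / (len(members_i) * len(members_j))

        while len(clusters) > 1:
            a, b = min(combinations(range(len(clusters)), 2),
                       key=lambda ij: d(clusters[ij[0]], clusters[ij[1]]))
            ci, cj = clusters[a], clusters[b]
            dij = d(ci, cj)
            new = ((ci[0], cj[0]), dij / 2.0, ci[2] + cj[2])   # node k at height 1/2 d_ij
            clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [new]

        tree, height, _members = clusters[0]
        return tree, height

    if __name__ == "__main__":
        labels = ["s1", "s2", "s3", "s4"]
        d = {frozenset(("s1", "s2")): 2, frozenset(("s1", "s3")): 6, frozenset(("s1", "s4")): 6,
             frozenset(("s2", "s3")): 6, frozenset(("s2", "s4")): 6, frozenset(("s3", "s4")): 4}
        print(upgma(labels, d))   # topology ((s1, s2), (s3, s4)) with root height 3.0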
Tree evaluation – MLE
• Given a tree topology and sequences pre-assigned to each leaf, it is possible to determine a tree's edge lengths using maximum likelihood estimation.
• Evaluation function: the likelihood of the tree.

Estimating Likelihood
• Estimate branch lengths by viewing evolution as a random process.
• Requires a probability model of evolution as a function of time.
  – For DNA one can use the Jukes-Cantor model (all nucleotides have the same substitution rates) or the Kimura model (different rates for transitions and transversions).
  – For proteins one can use Dayhoff, but in the probability form, not the log-odds form.

Estimating Likelihood
s_1, etc. are the bases or residues observed in the extant and ancestral taxa.
v = λt, where λ is the substitution rate and t is absolute time.
P_ij(v) is the probability that the residue at node s_i becomes the residue at node s_j in time v.
p_0 is the prior probability of the bases or residues at any position.
[Figure: a four-leaf tree with leaves 1–4, internal nodes 5 and 6, and root 0; v_i is the length of the edge above node i.]
The likelihood for this tree is:
L = p_0 · P_{0,5}(v_5) · P_{5,1}(v_1) · P_{5,2}(v_2) · P_{0,6}(v_6) · P_{6,3}(v_3) · P_{6,4}(v_4)

Example: Likelihood
For each mutating site u in a set of sequences:
Initialization: Set k = 2n − 1, where n is the number of sequences.
Recursion: compute P(L_k|a) for each symbol a in the alphabet, as follows.
  If k is a leaf node: if x_{k,u} = a, then P(L_k|a) = 1; else P(L_k|a) = 0.
  If k is not a leaf node: compute P(L_i|a) and P(L_j|a) for all a at the daughters i, j, and set
  P(L_k|a) = Σ_{b,c} P(b|a, t_i) · P(L_i|b) · P(c|a, t_j) · P(L_j|c).
Termination: likelihood for site u = Σ_a p_a · P(L_{2n−1}|a)
(p_a is the equilibrium value of the probability distribution for a.)
Concluding step: combine the likelihoods for each site (for independent sites, multiply them).

Maximizing Likelihood Estimation over edge times
Likelihood estimation includes a step for computing the likelihood of some character a at node k given the subtree of k. While we know that there is the possibility of substitutions leading to a, these depend on how long a time we have to make those substitutions, and we do not know the edge times of the tree. We must explore a series of possible times in order to maximize the likelihood.
• A method that maximizes likelihoods over edge times is what is usually referred to as MLE.
• Standard MLE procedures do not maximize likelihoods over all topologies of the tree.
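Here is a minimal sketch of the per-site recursion described above, for the simplest useful case: a single site, one internal node joining two leaves, and Jukes-Cantor transition probabilities standing in for P(b|a, t). The edge lengths and equilibrium frequencies are illustrative assumptions.

    import math

    ALPHABET = "ACGT"

    def jc_prob(a, b, t, rate=1.0):
        """Jukes-Cantor P(b | a, t): probability that residue a has become b after time t."""
        decay = math.exp(-4.0 * rate * t / 3.0)
        return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

    def leaf_conditional(observed):
        """P(L_k | a) at a leaf: 1 if the leaf shows residue a, else 0."""
        return {a: 1.0 if a == observed else 0.0 for a in ALPHABET}

    def internal_conditional(left, t_left, right, t_right):
        """P(L_k | a) = sum_{b,c} P(b|a,t_i) P(L_i|b) P(c|a,t_j) P(L_j|c)."""
        out = {}
        for a in ALPHABET:
            down_left = sum(jc_prob(a, b, t_left) * left[b] for b in ALPHABET)
            down_right = sum(jc_prob(a, c, t_right) * right[c] for c in ALPHABET)
            out[a] = down_left * down_right
        return out

    def site_likelihood(root_conditional, equilibrium=None):
        """Likelihood for the site: sum_a p_a * P(L_root | a)."""
        equilibrium = equilibrium or {a: 0.25 for a in ALPHABET}
        return sum(equilibrium[a] * root_conditional[a] for a in ALPHABET)

    if __name__ == "__main__":
        # Two leaves observed as C and A, joined by edges of (assumed) lengths 0.1 and 0.2.
        root = internal_conditional(leaf_conditional("C"), 0.1, leaf_conditional("A"), 0.2)
        print(site_likelihood(root))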
Comparisons between MLE, Parsimony and Distance Methods
• MLE: requires a semi-labeled tree; does not require scored alignments; order L·a^(2n-1)·2an^2; edge weights come from transition probabilities; internal tree nodes carry subtree probabilities; the resulting tree is ultrametric.
• Parsimony: requires a semi-labeled tree; does not require scored alignments; order 2an^2; edge weights are mutation counts; internal tree nodes carry ancestral sequences; the resulting tree is not ultrametric.
• Distance methods: do not require a semi-labeled tree; require scored alignments; order 2n^2; edge weights are distance measures (e.g. alignment scores); internal tree nodes are clusters of sequences (UPGMA); the resulting tree is ultrametric for UPGMA but not for neighbor-joining.

Exploring different topologies
• Successive addition and rearrangement
  – Very common method (see Phylip programs including PROTPARS, DNAPARS, DNACOMP, DNAML, DNAMLK, RESTML, KITSCH, FITCH, CONTML, MIX and DOLLOP).
  – Sequences are taken in the order in which they appear in the input file and successively added to a tree.
• MCMC

Successive addition
• Initialization:
  – Place the set of sequences into L.
  – Create a tree, T, with one node – the root.
• Iteration: for each sequence in L
  – Remove a sequence from L and add it as a leaf to T.
  – Apply a process of local rearrangement (in Felsenstein's package, there are (n−1)(2n−3) arrangements).
  – Score each locally rearranged tree.
  – Set T to equal the best-scoring tree.
• Termination: Globally rearrange the tree by swapping subtrees, score each globally rearranged tree, and accept the tree with the best score.

Markov Chain Monte Carlo
A Bayesian method for phylogenetic inference
– A moderately new method rooted in molecular dynamics.
– Topologies are randomly generated and scored so that a representative set of the most likely tree topologies can be identified.
Mau, Newton and Larget (1998) apply MCMC to sample trees using Bayes' theorem. The following explanation is based on their methodology – the mistakes are mine; the facts and foundations, theirs.

Introduction to the method
{t} is the set of all semi-labeled trees.
[Figure: sampling the set of trees by successive proposals Q1, Q2, Q3.]

Introduction to the method
[Figure: a chain of accepted samples – topologies t1, t2, … on leaves a, b, c, d, connected by proposals Q01, Q12, Q23.]

Introduction to the method
[Figure: the partitioned tree space with representatives {t1, …, t3}.]

MCMC propaganda
MCMC methods:
• allow exact inference, provided certain convergence criteria are demonstrated
• are efficient and can handle many more taxa or sequences
• measure uncertainty during tree construction (no bootstrapping needed)

Summary of the Algorithm
1. Choose a starting tree.
2. Perturb the current tree's topology and branch lengths to find a new tree.
3. Measure the likelihood for the new tree.
4. Compare the new tree to the last tree and decide whether or not to accept it into the chain.
5. If you've got a sufficiently long chain, check the characteristics of your sample to see if there is convergence to a set of representative topologies. If so, stop. Otherwise, go to 2.

Subproblems to be discussed
1. How do we represent the tree so that it is easy to operate on? Cophenetic matrices.
2. What is our perturbation operator?
3. How do we build our sampling chain?
4. When are we done sampling?

The Cophenetic Matrix
Some notation:
t – a topology
n – a node
a(n) – the ancestor of a node
L – a leaf node (the leaves are the current record)
I – an internal node (the historical record)

Cophenetic Trees
A labeled history (t1, t2) provides an order on coalescent levels.
[Figure: a three-leaf tree with internal nodes I0 (level 0), I1 (level 1) and I2 (level 2) above leaves L1, L2, L3; the times t1 and t2 separate the coalescent levels.]

Example: A Cophenetic Tree
[Figure: an example tree with level times t1 = 0.8, t2 = 0.3, t3 = 0.7, t4 = 0.5, t5 = 0.9, t6 = 1.5; total depth 4.7.]
These trees are described in terms of nodes coalescing, or merging, backwards in time.

Example: Cophenetic Matrix
The cophenetic matrix for the previous tree:

Leaf    5     7     4     1     2     6     3
  5     0    9.4   9.4   9.4   9.4   9.4   9.4
  7           0    1.6   4.6   6.4   6.4   6.4
  4                 0    4.6   6.4   6.4   6.4
  1                       0    6.4   6.4   6.4
  2                             0    3.6   3.6
  6                                   0    2.2
  3                                         0

The tree representation (s, a) is {(5, 7, 4, 1, 2, 6, 3), (4.7, 0.8, 2.3, 3.2, 1.8, 1.1)}.

The Cophenetic Matrix
Theorem: For any weighted binary tree with labeled leaf nodes, the tree topology and branch lengths can be uniquely determined using the within-tree distances between all pairs of leaf nodes (Lapointe and Legendre, 1992).
Note: each permutation of the leaf labels generates a different n × n symmetric matrix of distances.

What is the perturbation operator?
Q is the proposal function, and it has two stages:
• Q1 randomly selects a new leaf order.
• Q2 perturbs the values of the matrix supradiagonals.
The proposal mechanism is symmetric: Q(t_n, t_{n+1}) = Q(t_{n+1}, t_n).
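To make the (s, a) representation above concrete, here is a small sketch that rebuilds the cophenetic matrix from a leaf ordering and the coalescence heights between adjacent leaves: the within-tree distance between two leaves is taken to be twice the deepest coalescence separating them. That reading of Mau, Newton and Larget is an assumption on my part, but it reproduces the matrix shown above.

    def cophenetic_matrix(order, heights):
        """Rebuild the within-tree distance matrix from the (s, a) representation.

        order:   leaf labels read left to right across the tree, e.g. (5, 7, 4, 1, 2, 6, 3)
        heights: heights[i] is the coalescence height between order[i] and order[i+1]
        """
        n = len(order)
        dist = {leaf: {other: 0.0 for other in order} for leaf in order}
        for i in range(n):
            for j in range(i + 1, n):
                # Two leaves are separated by twice the deepest coalescence between them.
                d = 2.0 * max(heights[i:j])
                dist[order[i]][order[j]] = dist[order[j]][order[i]] = d
        return dist

    if __name__ == "__main__":
        order = (5, 7, 4, 1, 2, 6, 3)
        heights = (4.7, 0.8, 2.3, 3.2, 1.8, 1.1)
        matrix = cophenetic_matrix(order, heights)
        for leaf in order:
            print(leaf, [round(matrix[leaf][other], 1) for other in order])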
Details on Q1 and Q2
Q1 samples one of the 2^(n−1) leaf orderings of the current tree model.
Q2 simultaneously and independently modifies the elements of the superdiagonal, drawing each new value uniformly from an interval of half-width d around its current value (a_i − d, a_i + d), where d is constant.
By applying both types of perturbation, Q1 and Q2, all permutations of trees can be reached.

Illustration of Q2
[Figure: Q2 perturbing the superdiagonal elements of the cophenetic matrix.]

Subproblems to be discussed
1. How do we represent the tree so that it is easy to operate on? Use cophenetic matrices.
2. What is our perturbation operator? Q.
3. How do we build our sampling chain? Apply Metropolis-Hastings.
4. When are we done?

Acceptance with Metropolis-Hastings
Given a tree t, Metropolis-Hastings:
1. Applies Q to build a new tree, t*.
2. Always accepts the new tree when it is more likely than the old one, and sometimes accepts it when it is less likely than the old one.

Acceptance with Metropolis-Hastings – the algorithm
If P(t*) > P(t), accept t* into the chain;
else accept t* into the chain with probability P(t*) / P(t).

Acceptance with Metropolis-Hastings
The final step in evaluating the acceptance test is evaluating P(t*) / P(t). This is easy: P(t) is approximated using the likelihood estimate of t.

Size of chain and convergence
• How many trees do you have to propose before you begin to get a good enough sample? Mau et al. (1998) sample about 2,500 trees for Clarkia, a phylogeny with 9 leaves.
• How do you test that you are done? At the end of the run, we say that we have convergence if there is a small set of topologies with high relative frequency in the chain.
• What's the result? The topologies with the highest frequencies are the reported reconstructions.

Mixing
• To obtain a confidence measure, the algorithm must be run more than once: each run generates a chain of accepted trees.
• Chains "mix" well when they come up with the same representative topologies starting from different tree topologies.
• If running a sufficient number of independent chains is computationally prohibitive, Suchard et al., 2002, provide a "poor man's estimate of the uncertainty".

Example with binary data (from Mau et al., 1998)
9 species of the genus Clarkia (California plants); 120 restriction sites.
The data are translated into a 9 × 120 matrix of zeroes and ones, representing the absence or presence of a restriction site in the genome of each species.

Running the MCMC algorithm
Random starting trees.
Chains of length 250,000 were subsampled at a rate of 1/100 = 2,500 trees.
Each run took 20 minutes on a Sparc 10.
Convergence was inferred by reproducibility across runs with very different starting trees.

The most common topologies for Clarkia
[Figure: the highest-frequency sampled topologies, with clades A = {1,2}, B = {3,4}, C = {5,6}, D = {8,9}.]

References
• Smouse and Li (1989) introduced the Bayesian paradigm, but not the notation, to the phylogeny reconstruction problem.
• Goldman (1993) used non-Bayesian Monte Carlo tests of significance to assess the adequacy of evolutionary models.
• Griffiths and Tavaré (1994) constructed Markov chains to compute likelihoods for ancestral inference.
• Mau, Newton and Larget (1998) apply MCMC to sample trees using Bayes' theorem.
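Before the drill-downs, here is a minimal sketch of the accept/reject loop behind the acceptance rule above. To keep it self-contained it runs on a toy one-dimensional state with a made-up log-likelihood and a symmetric uniform perturbation standing in for the cophenetic-matrix proposal Q; only the Metropolis-Hastings mechanics carry over to trees.

    import math
    import random

    def log_likelihood(x):
        """Toy stand-in for log P(t): a single peak near x = 1.0."""
        return -((x - 1.0) ** 2) / 0.02

    def propose(x, d=0.05, rng=random):
        """Symmetric proposal: perturb uniformly within +/- d, so Q(x, x*) = Q(x*, x)."""
        return x + rng.uniform(-d, d)

    def metropolis_hastings(start, steps, rng=random):
        chain = [start]
        current_ll = log_likelihood(start)
        for _ in range(steps):
            candidate = propose(chain[-1], rng=rng)
            candidate_ll = log_likelihood(candidate)
            # Accept if more likely; otherwise accept with probability P(t*) / P(t).
            if candidate_ll >= current_ll or rng.random() < math.exp(candidate_ll - current_ll):
                chain.append(candidate)
                current_ll = candidate_ll
            else:
                chain.append(chain[-1])
        return chain

    if __name__ == "__main__":
        random.seed(0)
        chain = metropolis_hastings(start=0.0, steps=20_000)
        kept = chain[5_000::100]   # discard burn-in and subsample, as in the Clarkia runs
        print("posterior mean of the toy parameter:", sum(kept) / len(kept))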
Drill-down: Rates
The way I use it, and I admit this is quirky, "motif" means the genetic profile for a functional structure. Using the following definitions:
– Let r_G be the rate of mutation for a gene, G.
– Let r_E be the rate of expressed mutation for the protein G encodes.
– Let r_S be the rate of structural mutation for the protein G encodes.
– Let r_F be the rate of functional mutation for the protein G encodes.
Then r_G > r_E > r_S > r_F.
Note that the rate of neutral mutations is r_N = r_G − r_F.
The "true" rate of mutation for a motif is r_F; the observed rate of mutation for members of a motif in a genotypic tree is r_G. If we want motif branchings, we eliminate all branchings in the phylogeny occurring with rates r_N.

Drill-down: Semi-labeled trees
Trees with a defined branching pattern and defined leaf labels but WITHOUT edge lengths or internal node labels.
[Figure: a semi-labeled tree over the leaf sequences nccbac, nacbac, ncbbbc and nccnaa.]
In our terms: phylogenies with known branching patterns but without information about ancestors or mutation times.

Drill-down: Progressive Alignments
• As you move up the tree, add to the sum of characters in the growing alignment.

Progressive Alignments
The sum of characters in the growing alignment can be represented in a table of values called a frequency matrix, or a profile.

Progressive Alignments
Alignments are frozen once they are made. Scores are then calculated between aligned positions tabulated in a frequency matrix, using a scoring table.
[Figure: a small scoring table over A, G, S and T together with a frequency-matrix column; a profile score is a weighted sum of pair scores, e.g. S_ij = 2 × G:G + 1 × A:G.]

Algorithm: Neighbor-joining
Input: N sequences and their relative distances, d_ij.
Initialization: Define a leaf of T for each sequence.
Iteration:
  Pick two nodes i, j such that d_ij − (r_i + r_j) is minimal.
  Define a node k with daughter nodes i and j, with edge lengths e_ik = ½(d_ij + r_i − r_j) and e_jk = d_ij − e_ik.
  Define a new set of distances {d_kl} between k and all current nodes.
  Add k to the set of current nodes and remove i and j.
Termination (unrooted): When only two nodes i, j remain, add an edge of length d_ij between them.

Comparison: Neighbor-joining and UPGMA
Minimization:
– UPGMA picks the pair minimizing d_ij.
– Neighbor-joining picks the pair minimizing d_ij − (r_i + r_j), where r_i = (1 / (|L| − 2)) Σ_{k∈L} d_ik.
Distance measures:
– For distances between leaves i and j, d_ij is the same in both algorithms.
– For distances involving a new node k: UPGMA uses the average distance between clusters, d_kl = (1 / (|C_k||C_l|)) Σ_{p∈C_k, q∈C_l} d_pq; neighbor-joining uses d_km = ½(d_im + d_jm − d_ij), where i and j are the daughters of k.
Edge lengths:
– UPGMA sets the height of node k to half the distance between its daughters i and j (½ d_ij).
– Neighbor-joining sets the edge length between k and daughter i to ½(d_ij + r_i − r_j), and between k and daughter j to d_ij − e_ik.

Drill-down: MLE
[Figure: the simplest case of the recursion, at site u = 3, for the two leaf sequences nccbabc and ncbbcbc. At the leaves, P(L_i|c) = 1 and P(L_j|b) = 1, so at their parent P(L_k|a) = P(c|a, t_i) · P(b|a, t_j).]

Drill-down: Enumerating topologies
|t_unlabeled| = (1/n) · C(2(n−1), n−1), where C(·,·) is the binomial coefficient
|t_semi-labeled| = (2n − 3)!! = (2n − 2)! / (2^(n−1) · (n − 1)!)

Drill-down: Acceptance with Metropolis-Hastings
A proposed tree t* is accepted with probability min{1, [P(t*) · Q(t*, t)] / [P(t) · Q(t, t*)]}.
However, the proposal is symmetric: you can step forward or backward with equal probability, Q(t, t*) = Q(t*, t).
Hence our test becomes min{1, P(t*) / P(t)}.
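As a closing check on the enumeration formulas above, a short sketch that computes both counts for small n and confirms that (2n − 3)!! equals (2n − 2)! / (2^(n−1) (n − 1)!). The growth explains why the MCMC machinery samples the space of topologies instead of enumerating it.

    from math import comb, factorial

    def n_unlabeled(n):
        """(1/n) * C(2(n-1), n-1): the unlabeled-topology count quoted above."""
        return comb(2 * (n - 1), n - 1) // n

    def n_semilabeled(n):
        """(2n-3)!!: the number of semi-labeled (leaf-labeled) binary tree topologies."""
        result = 1
        for k in range(3, 2 * n - 2, 2):   # 3 * 5 * ... * (2n-3)
            result *= k
        return result

    if __name__ == "__main__":
        for n in (4, 5, 9, 20):
            double_factorial = n_semilabeled(n)
            closed_form = factorial(2 * n - 2) // (2 ** (n - 1) * factorial(n - 1))
            assert double_factorial == closed_form
            print(f"n={n:2d}  unlabeled={n_unlabeled(n):,}  semi-labeled={double_factorial:,}")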