The Coalescent and Measurably Evolving Populations

Alexei Drummond
Department of Computer Science
University of Auckland, NZ

Overview
1. Introduction to the Coalescent
2. Hepatitis C in Egypt
   • An example using the coalescent
3. Measurably evolving populations
4. HIV-1 evolution within and among hosts
   • An example using MEP concepts
5. Summary + Conclusions

The coalescent
• The coalescent is a model of the ancestral relationships of a small sample of individuals taken from a large background population.
• The coalescent describes a probability distribution on ancestral genealogies (trees) given a population history.
  – Therefore the coalescent can convert information from ancestral genealogies into information about population history and vice versa.
• The coalescent is a model of ancestral genealogies, not sequences, and its simplest form assumes neutral evolution.

The history of coalescent theory
• 1930-40s: Genealogical arguments well known to Wright & Fisher
• 1964: Crow & Kimura: Infinite Allele Model
• 1966: (Hubby & Lewontin) & (Harris) make first surveys of population allele variation by protein electrophoresis
• 1968: Motoo Kimura proposes neutral explanation of molecular evolution & population variation. So do King & Jukes
• 1971: Kimura & Ohta propose infinite sites model.
• 1975: Watterson makes explicit use of “The Coalescent”
• 1982: Kingman introduces “The Coalescent”.
• 1983: Hudson introduces “The Coalescent with Recombination”
• 1983: Kreitman publishes first major population sequences.
• 1987: Cann et al. trace human origin and migrations with mitochondrial DNA.
• 1988: Hughes & Nei: Genes with positive Darwinian Selection.
• 1989-90: Kaplan, Hudson, Takahata and others: Selection regimes with coalescent structure (MHC, Incompatibility alleles).
• 1991: McDonald & Kreitman: Data with surplus of replacement interspecific substitutions.
• 1994-95: Griffiths-Tavaré + Kuhner-Yamato-Felsenstein introduce sampling techniques to estimate parameters in population models.
• 1997-98: Krone-Neuhauser introduce Ancestral Selection Graph
• 1999: Wiuf & Donnelly use coalescent theory to estimate age of disease allele
• 2000: Wiuf et al. introduce gene conversion into coalescent.
• 2000-: A flood of SNP data & haplotypes are on their way.

[Diagram: Population processes, COALESCENT THEORY, Genealogy]

Coalescent inference
• Randomly sample individuals from population
• Obtain gene sequences from sampled individuals
• Reconstruct tree / trees from sequences
• Infer coalescent results from tree / trees
• Infer coalescent results directly from sequences

Demographic History
• Change in population size through time
• Applications include
  – Estimating history of human populations
  – Conservation biology
  – Reconstructing infectious disease epidemics
  – Investigating viral dynamics within hosts

Idealized Wright-Fisher populations
[Figure: haploid and diploid populations over three generations: grandparents, parents, now]

Random mating in an ideal population
• A constant population size of N individuals
• Each individual in the new generation “chooses” its parent from the previous generation at random

Genetic drift: extinction and ancestry
If you trace the ancestry of a sample of individuals back in time you inevitably reach a single most recent common ancestor.
If you pick a random individual and trace their descendants forward in time, all the descendants of that individual will with high probability eventually die out.
[Figure: discrete generations, with time running back from the present to the past]

A sample genealogy from an idealized Wright-Fisher population
A sample genealogy of 3 sequences from a population (N = 10).
[Figure: genealogy drawn on a time axis from past to present]

The coalescent: distributions and expectations on a sample genealogy
[Figure: sample genealogy with waiting times $t_3$ and $t_2$, past to present]
Here $\mathrm{Exp}(m)$ denotes an exponential distribution with mean $m$.
$t_2 \sim \mathrm{Exp}(N)$, $E[t_2] = N$
$t_3 \sim \mathrm{Exp}(N/3)$, $E[t_3] = N/3$
$E[t_k] = \frac{2N}{k(k-1)}$
$E[t_{root}] = 2N\left(1 - \frac{1}{n}\right)$

The coalescent: probability density distribution
$t_2 \sim \mathrm{Exp}(N)$: $P(t_2 \mid N) = \frac{1}{N}\exp\left(-\frac{t_2}{N}\right)$
$t_3 \sim \mathrm{Exp}(N/3)$: $P(t_3 \mid N) = \frac{3}{N}\exp\left(-\frac{3 t_3}{N}\right)$
$f_G(g \mid N) = \prod_{k=2}^{n} \frac{k(k-1)}{2N}\exp\left(-\frac{k(k-1)\,t_k}{2N}\right)$   Kingman (1982a,b)
$g = \{E_g, \mathbf{t}\}$: the genealogy is an edge graph $E_g$ and a vector of times $\mathbf{t}$.

The coalescent: estimating population size from a sample genealogy
[Figure: sample genealogy with $t_2 = 7$ and $t_3 = 8$, past to present]
$\hat{N}_k = \frac{k(k-1)}{2}\,t_k$, so $\hat{N}_2 = 7$ and $\hat{N}_3 = 24$
$\hat{N} = \frac{1}{n-1}\sum_{k=2}^{n} \frac{k(k-1)}{2}\,t_k = 15.5$   Felsenstein (1992)

The coalescent: estimating population size confidence limits via ML
[Figure: relative log likelihood against population size (N), log scale from 1 to 1000; $\hat{N} = 15.5$ (5.1, 93.1)]
Maximum likelihood can be used to estimate population size by choosing the population size that maximizes the probability of the observed coalescent waiting times.
The confidence intervals are calculated from the curvature of the likelihood. For a single-parameter model the 95% confidence limits are defined by the points where the log-likelihood drops 1.92 log-units below the maximum log-likelihood.
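The estimate and its likelihood-based limits can be reproduced numerically. The following is a minimal sketch, assuming the waiting-time density given above (rate $k(k-1)/(2N)$ while $k$ lineages remain) and the example values $t_2 = 7$, $t_3 = 8$; the function names and the bisection search are illustrative choices, not part of the original slides.

```python
import math

def coalescent_log_likelihood(times, n):
    """Log-likelihood of population size n given waiting times.

    times maps k (number of lineages) to t_k, the time spent with k lineages.
    Each t_k is exponential with rate k(k-1)/(2n).
    """
    total = 0.0
    for k, t in times.items():
        rate = k * (k - 1) / (2.0 * n)
        total += math.log(rate) - rate * t
    return total

def ml_population_size(times):
    """Closed-form ML estimate: the average of k(k-1)/2 * t_k over intervals."""
    return sum(k * (k - 1) / 2.0 * t for k, t in times.items()) / len(times)

def likelihood_interval(times, drop=1.92):
    """Find where the log-likelihood falls `drop` units below its maximum."""
    n_hat = ml_population_size(times)
    target = coalescent_log_likelihood(times, n_hat) - drop

    def gap(n):
        return coalescent_log_likelihood(times, n) - target

    def bisect(lo, hi):
        # gap(lo) and gap(hi) have opposite signs; halve on a log scale.
        for _ in range(200):
            mid = math.sqrt(lo * hi)
            if gap(lo) * gap(mid) <= 0:
                hi = mid
            else:
                lo = mid
        return math.sqrt(lo * hi)

    lower = bisect(n_hat * 1e-3, n_hat)
    upper = bisect(n_hat, n_hat * 1e3)
    return lower, upper

times = {2: 7.0, 3: 8.0}          # t_2 = 7, t_3 = 8 from the example genealogy
n_hat = ml_population_size(times)  # 15.5
lower, upper = likelihood_interval(times)
print(n_hat, round(lower, 1), round(upper, 1))
```

With only two coalescent intervals the interval is very wide and strongly asymmetric, which is why the limits are read off the likelihood curve rather than from a symmetric standard error.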
The coalescent: shapes of gene genealogies
[Figure: genealogies under exponential growth and under constant size]
The coalescent can be used to convert coalescent times into knowledge about population size and its change through time.

Constant population size: N(t)=N0
[Figure: genealogies for small N0 and large N0 on a time axis]

Coalescent and serial samples
[Figure: serially sampled genealogies under a constant population and under exponential growth]

Uncertainty in Genealogies
How similar are these two trees? Both of them are plausible given the data. We can use MCMC to get the average result over all plausible trees.

Coalescent Summary
• The coalescent provides a theory of how population size is related to the distribution of coalescent events in a tree.
• Big populations have old trees
• Exponentially growing populations have star-like trees
• Given a genealogy the most likely population size can be estimated.
• MCMC can be used to get a distribution of trees from which a distribution of population sizes can be estimated.

Markov chain Monte Carlo (MCMC)
• Imagine you would like to estimate two parameters $(\theta, \phi)$ from some data (D).
• You want to find values of $\theta$ and $\phi$ that have high probability given the data: $p(\theta,\phi \mid D)$
• Say you have a likelihood function of the form: $\Pr\{D \mid \theta,\phi\}$
• Bayes rule tells us that:
  – $p(\theta,\phi \mid D) = \Pr\{D \mid \theta,\phi\}\,p(\theta,\phi) / \Pr\{D\}$
  – So that $p(\theta,\phi \mid D) \propto \Pr\{D \mid \theta,\phi\}\,p(\theta,\phi)$
• $p(\theta,\phi \mid D)$ is called the posterior probability (density) of $\theta,\phi$ given D
• In an ideal world we want to know the posterior density for all possible values of $\theta,\phi$.
• Then we could pick a “credible region” in two dimensions that contained values of $(\theta,\phi)$ that account for the majority of the posterior probability mass.
• This credible region would serve as an estimate that incorporates our uncertainty, and this credible set could be used to address hypotheses like: $\theta$ is greater than x.
• In reality we have to make do with a “sample” of the posterior, so that we evaluate $p(\theta,\phi \mid D)$ for a finite number (say 10,000,000) of pairs $(\theta,\phi)$.
• So which pairs should we choose?

Markov chain Monte Carlo (MCMC)
• Let's construct a random walk in 2-dimensional space
• In each step of the random walk we propose to make an (unbiased) small jump from our current position $(\theta,\phi)$ to a new position $(\theta',\phi')$
• If $p(\theta',\phi' \mid D) > p(\theta,\phi \mid D)$ then we make the proposed jump
• However, if $p(\theta',\phi' \mid D) < p(\theta,\phi \mid D)$, then we make the proposed jump with probability $p(\theta',\phi' \mid D) / p(\theta,\phi \mid D)$, otherwise we stay where we are.
• It can be shown (trust me!) that if you proceed in this fashion for an infinite time then the equilibrium distribution of this random walk will be $p(\theta,\phi \mid D)$!
• That is, the random walk will visit a particular region $[\theta_0, \theta_1] \times [\phi_0, \phi_1]$ of the state space this often:
  $\int_{\theta_0}^{\theta_1}\int_{\phi_0}^{\phi_1} p(\theta,\phi \mid D)\,d\phi\,d\theta$

Markov chain Monte Carlo (MCMC)
$p(\theta,\phi \mid D) = \frac{1}{Z}\sum_{g} \Pr\{D \mid \phi, g\}\, f(g \mid \theta)\, f(\theta,\phi)$

Population genetics of Hepatitis C in Egypt

Hepatitis C Virus (HCV)
• Identified in 1989
• 9.6kb single-stranded RNA genome
• Polyprotein cleaved by proteases
• No efficient tissue culture system

How important is HCV?
• 170m+ infected
• ~80% infections are chronic
• Liver cirrhosis & cancer risk
• 10,000 deaths per year in USA
• No protective immunity?
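As an aside on the random-walk sampler from the MCMC slides above, the accept/reject rule is short enough to sketch directly. This is a minimal illustration under stated assumptions: the target `log_posterior` is a made-up toy density (two independent unit-variance normals centred on 1 and -2), the proposal is a symmetric Gaussian jump, and the names `metropolis`, `step_sd` and `seed` are hypothetical. Working with log densities avoids underflow but is otherwise the same rule.

```python
import math
import random

def log_posterior(theta, phi):
    """Toy unnormalized log posterior (assumed purely for illustration)."""
    return -0.5 * ((theta - 1.0) ** 2 + (phi + 2.0) ** 2)

def metropolis(log_post, start, n_steps, step_sd=0.5, seed=1):
    """Random-walk Metropolis sampler in two dimensions."""
    rng = random.Random(seed)
    theta, phi = start
    current = log_post(theta, phi)
    samples = []
    for _ in range(n_steps):
        # Propose an unbiased (symmetric) small jump from the current position.
        theta_new = theta + rng.gauss(0.0, step_sd)
        phi_new = phi + rng.gauss(0.0, step_sd)
        proposed = log_post(theta_new, phi_new)
        # Always accept uphill moves; accept downhill moves with
        # probability equal to the ratio of posterior densities.
        if proposed >= current or rng.random() < math.exp(proposed - current):
            theta, phi, current = theta_new, phi_new, proposed
        # A rejected proposal repeats the current state in the chain.
        samples.append((theta, phi))
    return samples

samples = metropolis(log_posterior, start=(0.0, 0.0), n_steps=20000)
kept = samples[2000:]  # discard an initial burn-in portion
mean_theta = sum(s[0] for s in kept) / len(kept)
mean_phi = sum(s[1] for s in kept) / len(kept)
print(round(mean_theta, 2), round(mean_phi, 2))
```

Only the ratio of posterior densities is ever needed, so the normalizing constant Pr{D} never has to be computed.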
HCV Transmission
Percutaneous exposure to infected blood
• Blood transfusion / blood products
• Injecting & nasal drug use
• Sexual & vertical transmission
• Unsafe injections
• Unidentified routes

Estimating demographic history of HCV using the coalescent
• Egyptian HCV gene sequences
• n=61
• E1 gene, 411bp
• All sequences contemporaneous
• Egypt has highest prevalence of HCV worldwide (10-20%)
• But low prevalence in neighbouring states
• Why is Egypt so seriously affected?
• Parenteral antischistosomal therapy (PAT)

Demographic model
• The coalescent can be extended to model deterministically varying populations.
• The model we used was a const-exp-const model.
• A Bayesian MCMC method was developed to sample the gene genealogy, the substitution model and demographic function simultaneously.
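A const-exp-const model can be written as a piecewise function of time before the present. The sketch below assumes time t runs backwards from the sampling date, a growth rate r > 0, and change points x < y; the default of deriving the ancestral size from continuity at t = y is an assumption made here for illustration, and the function and argument names are hypothetical.

```python
import math

def const_exp_const(t, n_c, r, x, y, n_a=None):
    """Population size at time t before present under a const-exp-const model.

    n_c : current (recent) constant size, for t <= x
    r   : exponential growth rate during the growth phase, x < t < y
    n_a : ancestral constant size, for t >= y; if omitted it is set by
          continuity at t = y (an assumption of this sketch)
    """
    if n_a is None:
        n_a = n_c * math.exp(-r * (y - x))
    if t <= x:
        return n_c
    if t < y:
        # Going back in time the population shrinks exponentially.
        return n_c * math.exp(-r * (t - x))
    return n_a

# Example with made-up values: constant for 10 time units, growth phase to 50.
sizes = [const_exp_const(t, n_c=1000.0, r=0.1, x=10.0, y=50.0) for t in (0, 30, 50, 80)]
print(sizes)
```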
$N(t) = \begin{cases} N_C & \text{if } t \le x \\ N_C \exp[-r(t - x)] & \text{if } x < t < y \\ N_A & \text{if } t \ge y \end{cases}$

Estimated demographic history
[Figure panels: based on a single tree; averaged over all trees]

Parameter estimates

Uncertainty in parameter estimates
[Figure panels: Demographic parameters: growth rate of the growth phase (grey box is the prior). Mutational parameters: rates at different codon positions, all significantly different]

Full Bayesian Estimation
• Marginalized over uncertainty in genealogy and mutational processes
• Yellow band represents the region over which PAT was employed in Egypt

Measurably evolving populations (MEPs)
• MEP pathogens:
  – HIV
  – Hepatitis C
  – Influenza A
• MEPs from ancient DNA
  – Bison
  – Brown Bears
  – Adelie penguins
  – Anything cold and numerous
• Even over short periods (less than a year) HIV sequences can exhibit measurable evolutionary change
• Time-structure cannot be ignored in our models
[Figure: genealogy with an earlier time point (n = 5) and the present time point (n = 5)]

Time structure in samples
[Figure: contemporary sample with no time structure; serial sample with time structure; time axis 1980, 1990, 2000]

Molecular evolution and population genetics of MEPs
• Given sequence data that is time-structured, estimate true values of:
  – substitution parameters
    • Overall substitution rate and relative rates of different substitutions
  – population history: N(t)
  – Ancestral genealogy
    • Topology
    • Coalescent times
[Figure: Ne against time, with a genealogy of tips A to E]

Molecular evolutionary model: Felsenstein’s likelihood (1981)
[Figure: tree with tip sequences AA, GA, AC, GC and branch lengths b1 to b5]
The probability of the sequence alignment, Pr{D | T,Q}, can be efficiently calculated given a tree and branch lengths (T), and a probabilistic model of mutation represented by an instantaneous rate matrix (Q). In phylogenetics, branch lengths are usually unconstrained.

Combining the coalescent with Felsenstein’s likelihood
[Figure: the same four tips (AA, GA, AC, GC) drawn as an unconstrained tree with 2n–3 branch lengths (b1 to b5) and, under the “molecular clock” constraint, as a tree with n–1 waiting times (t2, t3, t4)]
$p(N, g, Q \mid D) \propto \Pr\{D \mid g, Q\}\, f_G(g \mid N)\, f_N(N)\, f_Q(Q)$
The joint posterior probability of the population history (N), the genealogy (g) and the mutation matrix (Q) is estimated using Markov chain Monte Carlo (Drummond et al, Genetics, 2002)

Full Bayesian Model
Probability of what we don’t know given what we do know.
$P(g, \mu, N_e, Q \mid D) = \frac{1}{Z}\, P(D \mid g, \mu, Q)\, f_G(g \mid N_e)\, f(\mu)\, f_N(N_e)\, f_Q(Q)$
• $P(D \mid g, \mu, Q)$: likelihood function
• $f_G(g \mid N_e)$: coalescent prior
• $f(\mu)\, f_N(N_e)\, f_Q(Q)$: other priors
• Z: unknown normalizing constant
• Q = substitution parameters, Ne = population parameters, g = tree, $\mu$ = overall substitution rate
In the software package BEAST, MCMC integration can be used to provide a chain of samples from this density.

HIV-1 (env) evolution in nine infected individuals
[Figure: tree of env sequences, scale bar 10%; clades labelled Pt.1, Pt.2, Pt.3, Pt.5, Pt.6, Pt.7, Pt.8, Pt.9; Patient #6 from Wolinsky et al.; sequence labels HIV1U35926, HIVU95460, HIV1U36148, HIV1U36073, HIV1U36015, HIV1U35980. Shankarappa et al (1999)]

Molecular clock: HIV-1 (env) evolution in 9 individuals
[Figure: viral divergence (0 to 10%) against years post seroconversion (0 to 10). Shankarappa et al (1999)]

MEP Summary
• Most RNA viruses, including HCV and HIV, are measurably evolving
• Most vertebrate populations that have well-preserved recent fossil records are MEPs.
• If sequence data comes from different times the time-structure can’t be ignored
• Time structure permits the direct estimation of:
  – substitution rate
  – Concerted changes in substitution rate
  – coalescent times in calendar units
  – Demographic function N(t) in calendar units

Intermission
My brain is fried!

Population genetics of HIV

What is HIV?
[Figure: HIV life cycle. gp120 on the HIV virion binds to CD4 T cell surface receptors; the viral core is inserted into the host cell; the virus genome is replicated by reverse transcription (ssRNA to dsDNA); the dsDNA migrates to the nucleus; proviral DNA, flanked by LTRs, integrates into the DNA of the infected cell; viral RNA is transcribed and translated (viral genomic RNA, structural proteins and viral enzymes, regulatory genes); RNA is packaged and virions assemble; virus buds from the cell and matures]
• HIV is a retrovirus.
• Within infected individuals HIV exhibits extremely high genetic variability due to:
  – Error-prone reverse transcriptase (RT) that converts RNA to DNA (error rate is about one mutation per genome per replication cycle).
  – DNA-dependent polymerase also error-prone
  – High turnover of virus within infected individual throughout infection.
Patient 2 (Shankarappa et al, 1999)
[Figure: number of sequences obtained per sample against time in months (post seroconversion)]
• 210 sequences collected over a period of 9.5 years
• 660 nucleotides from env: C2-V5 region
• Effective population size and mutation rate were co-estimated using Bayesian MCMC.

A tree sampled from the posterior distribution
[Figure: tree with ‘ladder-like’ appearance; Lineage A and Lineage B marked]

Estimated substitution rate
• Patient 2:
  – 0.77–1.0% per year
• BUT…. Long term rates in HIV
  – Korber et al:
    • 0.24% (0.18-0.28%) per year
    • Only 1/4 of the intrapatient rate

Bayesian MCMC of Shankarappa data

Patient       Best-fitting        Estimated rate        Effective          Rate heterogeneity   Bottleneck (at seroconversion)
              demographic model   (per site per year)   population size*   (alpha parameter)    estimate     upper limit
p1            Logistic            0.0123                882                0.278                1.57%        6.68%
p2 plasma     Logistic            0.0166                1708               0.242                0.63%        3.34%
p2 provirus   Logistic            0.0090                2798               0.278                0.04%        0.18%
p3            Logistic            0.0175                620                0.237                1.80%        4.97%
p5 plasma     Exponential         0.0223                938                0.192                11.80%       27.50%
p5 provirus   Logistic            0.0215                1345               0.293                8.19%        15.20%
p6            Logistic            0.0195                581                0.221                0.52%        1.35%
p7            Logistic            0.0085                3320               0.322                2.94%        13.60%
p8            Exponential         0.0162                2309               0.455                28.70%       48.50%
p9            Logistic            0.0071                2757               0.346                17.90%       44.80%
p11           Logistic            0.0128                2502               0.239                0.15%        0.53%
Overall       Logistic            0.0148                1796               0.282                6.75%        15.15%

* At the time of last sample assuming a generation length of 2.6 days

Intra- and inter- patient rate estimates (C2V3 envelope)
[Figure: rate (per site per year, 0 to 3.00E-02) against sampling interval (months, 0 to 250); intrapatient estimates (p1 - p11) and interpatient estimates (A, B, C)]

Summary: HIV intra-patient evolution
• HIV evolutionary rates appear to be faster intrapatient than across the pandemic
  – Different selection pressure at transmission?
  – Transmitted viruses undergoing fewer rounds of replication?
  – Latent viruses?
  – Reversion of escape mutants?
• Effective population size is changing over time (bottleneck in envelope at least)

Goodness-of-fit tests

But how good is our best model?
• We can use standard statistical model-choice criteria to choose between different models of substitution and demography, but are any of the models we consider any good at all?
• One way to look at this is to ask the following question:
  – Does our real data look anything like what we would expect data from our model to look like?
• So what aspect of the data should we look at?
• And what should we expect?

We could look at branch length distributions…
[Figure: genealogy annotated with $L_n$, $t_{root}$ and $J_n$]
$E[t_{root}] = 2N_e\left(1 - \frac{1}{n}\right)$
$E[L_n] = 2N_e \sum_{k=1}^{n-1} \frac{1}{k}$
$E[J_n] = 2N_e$

Tree imbalance measures might also be interesting…
[Figure: three trees. 4 cherries: $I_c = 0$, $\bar{N} = 3$. 3 cherries: $I_c = 0.24$, $\bar{N} = 3.125$. 2 cherries: $I_c = 0.81$, $\bar{N} = 4.125$]

Posterior predictive simulation
• A method of testing the goodness-of-fit of a Bayesian model.
1. Run a Bayesian MCMC analysis on the data
2. Calculate the value of your favourite summary statistic, T(.), from the data, D
3. For each state in the chain
   1. Simulate a synthetic dataset, $D_i$, using the parameter values of state i.
   2. Calculate $T(D_i)$ from the simulated data set.
4. Compare the T(D) value with the predictive distribution of $T(D_i)$

So we need some summary statistics
• Summary statistics that can be measured directly from a sequence alignment:
  – Mean pairwise distance ($\pi$)
  – Tajima’s D
  – Fu & Li’s D
  – Number of segregating sites (S)
  – …
• Summary statistics that can be measured directly from a genealogy:
  – Genealogical mean pairwise distance
  – Genealogical Tajima’s D
  – Genealogical Fu & Li’s D
  – Tree-imbalance statistics
  – Age of the root
  – Length of the tree

Posterior predictive simulation (2)
• Testing the goodness-of-fit of the neutral coalescent model under variable demographic functions.
1. Run a Bayesian MCMC analysis on the data
2. For each state in the chain
   1. Simulate a coalescent genealogy ($G_i^S$) using the population parameter values of state i.
   2. Calculate $T(G_i^S)$ from the ith simulated genealogy
   3. Calculate $T(G_i^P)$ from the ith posterior genealogy
3. Calculate the predictive probability by comparing the posterior distribution of T(.) with the predictive distribution of T(.).

Human influenza A (HA gene) trees
[Figure: posterior genealogies and predictive simulations. State 5m: $N_e = 9.12$, $t_2 = 11.03$ years. State 10m: $N_e = 5.00$, $t_2 = 15.29$ years]

Human influenza A trees: Genealogical Fu & Li’s D statistic

Puerto Rican Dengue-4 gene trees: multivariate summary statistics

Results of test of neutrality
Table 2. The predictive probabilities ($P_T$) for summary statistics on each of the example data sets are shown.
Significant departures from neutrality are marked (*) and marginally significant departures (x < 0.05 or x > 0.95) are marked with (†). Significant departures on the best fitting model for each data set are in bold.

Predictive probabilities:

Dataset                   Demographic model    T         t_root    D_FL      I_C       C_n     B_1
Brown bear (d-loop)       Constant             0.739     0.815     0.863     0.693     0.163   0.103
Brown bear (d-loop)       Exponential growth   0.615     0.623     0.800     0.679     0.163   0.111
RSVA (g gene)             Constant             0.956†    0.964†    0.946     0.163     0.152   0.134
RSVA (g gene)             Exponential growth   0.693     0.656     0.884     0.206     0.149   0.134
Dengue-4 (E gene)         Constant             0.9574†   0.9958*   0.9997*   0.562     0.608   0.427
Dengue-4 (E gene)         Exponential growth   0.745     0.809     0.9792*   0.559     0.653   0.505
Human influenza A (HA)    Constant             0.9510†   0.900     0.9999*   0.0462†   0.605   0.610
Human influenza A (HA)    Exponential growth   0.620     0.910     0.9995*   0.0866    0.575   0.677

Results for 28 HIV-1 infected individuals
[Figure: Fu & Li's D: P* (0 to 1) against proportion of data sets (0 to 1); series: Sim Constant size, Sim Exponential Growth, Target, Constant size, Exponential Growth]

Is the population size constant?
[Figure, y-axis: Pop size (log scale), annotated “Ne / 30”]
[Figure, continued: mean, lower and upper estimates against months (post seroconversion, 0 to 120) for Patient 2]

Phylodynamics

Virus population dynamics
[Figure panels: Measles virus; Human influenza virus]

Dengue-4: Modeling complex demography
[Figure: Den4 Ne by year and Den4 isolates by year, 1980 to 2000]
N(t) = N0exp(-rt): -10566.421
N(t) = scaled translated case data: -10478.572
Hospital case data courtesy of Shannon Bennett

Population size changes

The generalized skyline plot
• Visual framework for exploring the demographic history of sampled DNA sequences
• Input: a single estimated ancestral genealogy (a tree)
• Output: nonparametric plot of the population size through time
  – Groups adjacent coalescent intervals
  – Converts information within these intervals to estimates of population size
$\hat{N}_k = \frac{k(k-1)}{2}\,t_k$   Estimate of population size from a single coalescent interval
$\hat{N}_{k,l} = \frac{k(k-l)}{2l}\sum_{i=k-l+1}^{k} t_i$   Estimate of population size from l adjacent coalescent intervals.

Generalized Skyline Plot examples
I: Constant population size, N(t) = N(0)
II: Exponential growth, $N(t) = N(0)e^{-rt}$
III: HIV-1 group M (tree estimated in Yusim et al (2001) Phil. Trans. Roy. Soc. Lond. B 356: 855-866)
  – Black curve is a parametric estimate obtained from the same data under the “expansion model”
  – Results follow accepted demographic pattern for the HIV pandemic

The Bayesian skyline plot
Estimate a demographic function that has a certain fixed number of steps (in this example 15) and then integrate over all possible positions of the break points.
[Figure: Dengue-4 Bayesian skyline plot (15 epochs): population size * generation length (log scale, 0.01 to 1000) against years ago (0 to 20); median, lower, upper]
[Figure: Dengue-4 Bayesian skyline plot (15 epochs): population size * generation length (linear scale, 0 to 100) against years ago (0 to 20); median, lower, upper]
Explains the Dengue data quite well (tests of neutrality do not reject the data if we use the Bayesian skyline plot to describe the demographic history).

Prior/Model: population is autocorrelated through time
In addition to this model we also introduce a simple smoothing on $\Theta$ which represents our belief that effective population size is auto-correlated through time. The prior distribution we assume in all subsequent simulations and analyses is that, going back in time, each new population size is drawn from an exponential distribution with a mean equal to the previous population size:
$\theta_j \sim \mathrm{Exp}(\theta_{j-1}), \quad 2 \le j \le m.$   (5)
In addition we introduce a scale-invariant prior (Jeffreys 1946) on the first element, $f_{\theta_1}(\theta_1) \propto 1/\theta_1$, to signify that our prior belief is invariant to changes in time scale. This results in the following simple prior distribution on $\Theta$:
$f_\Theta(\Theta) \propto \frac{1}{\theta_1}\prod_{j=2}^{m} \frac{1}{\theta_{j-1}}\exp\left(-\theta_j / \theta_{j-1}\right).$   (6)

Validating the Bayesian skyline plot (1)
[Figure panels: simulated data, constant population; simulated data, exponential growth]

Validating the Bayesian skyline plot (2)
[Figure: Bayesian skyline (49 or 12 epochs): Theta (log scale, 0.00001 to 100) against time (mutations, 0 to 0.008); median, lower and upper for 49 epochs and for 12 epochs, and the truth]

Comparing Bayesian skyline plot of Dengue-4 with incidence data
[Figure: effective number of infections (log scale, 1 to 100000) against months before November 1998 (0 to 200); median, upper, lower, and smoothed translated incidence]

Example of Bayesian skyline plot
[Figure: band marking (1920-1980) anti-schistosomal needle-based treatment]
Effective population size jumped from 300 to 100 to 10,000

Comparison to parametric model

http://evolve.zoo.ox.ac.uk/BEAST

Structured populations

Coalescent with population structure

Population subdivision - two demes

Stepping stone model of subdivision

Human migration
From Cavalli-Sforza, 2001

Simplified model of human evolution
[Figure: past to present, Africa and Non-Africa; rate of common ancestry = 1; mutation rate = 2.5; 0.2]
Why Bayesian?
• Probabilistic model-based inference
  – Can make simple statements about the probability of alternative hypotheses given the data
• Markov chain Monte Carlo
  – Convenient computational technique
  – Allows for complex models: “if you can simulate you can sample”
• Incorporates prior probabilities
  – $P(\theta \mid D) \propto P(D \mid \theta)\,P(\theta)$
  – Convenient means of assessing alternative sets of assumptions
  – Allows incorporation of independent sources of information
• Easy to include sources of uncertainty
  – Don’t need to assume perfect knowledge of tree (for example)
  – Can treat the tree as a nuisance parameter and focus on parameters of interest (strength of selection, mutation rate, growth rate, etc)

Conclusions & cautionary remarks
• Bayesian MCMC has advantages
  – A useful tool for exploring prior hypotheses
  – Good for assessing levels of uncertainty
  – Complex models can be investigated on practical datasets
• Bayesian MCMC has disadvantages
  – Diagnostics are difficult, and it is essentially impossible to guarantee correctness
  – Model comparison can be difficult
  – Requires large programs that are difficult to optimize and debug.

Conclusions & cautionary remarks (2)
• Population genetics has advantages
  – Provides a framework for objective analysis of genetic data
  – Allows interpretation of genetic data in terms of biological properties of virus
  – Can be extended to include selection, recombination et cetera
• Population genetics has disadvantages
  – Models are still too simple
  – Assumptions are too strong
  – Extending to complex models that include changing selection pressures and recombination is possible in MCMC but still very difficult!