TCGA seminar series
“Introduction to doing genetic history”

• Coalescent Theory
• Models of microsatellite evolution
• Dating methods

Coalescent Theory

The following series of slides shows how you can build up a genealogical tree to relate a sample of 22 individuals, collected in the present day, at a single locus (e.g. the nonrecombining Y chromosome). Because (for the Y chromosome) one son has only one father, but one father can have more than one son, coalescent events occur in the genealogy which inevitably result in a reduction of ancestors. Eventually, one ancestor remains – the Most Recent Common Ancestor (MRCA).

[Figure sequence: the genealogy is built up one generation at a time, with time running backwards from the present. Number of ancestors in successive generations: 22 (present-day individuals), 18, 16, 14, 12, 9, 8, 8, 7, 7, 5, 5, 3, 3, 3, 2, 2, 1 (the MRCA).]

Mutational events can now be added to the genealogical tree, resulting in polymorphic sites.
If these sites are typed in the modern sample, they can be used to split the sample into sub-clades (represented by different colours).

[Figure sequence: mutations (marked *) are added to the genealogy one at a time, each creating a new haplotype among the present-day sequences. Haplotypes shown: TCGAGGTATTAAC, TCTAGGTATTAAC, TCGAGGCATTAAC, TCTAGGTGTTAAC, TCGAGGTATTAGC, TCTAGGTATCAAC.]

If we now imagine that our sample of 22 was in fact drawn from different subpopulations, then allele frequencies will vary between subpopulations, with the degree of difference reflecting how long ago the population split took place. Of course, population splitting is only one model we can think of. Another would be a migration model, where lineages occasionally swap from one subpopulation to another. Note that a population split does not always have to involve just one sub-clade of the tree – in the figure, this is just a limitation of drawing the splitting in two dimensions.
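The link between split time and allele-frequency difference can be illustrated with a minimal sketch. It assumes two daughter populations that drift independently under a haploid Wright-Fisher scheme after the split, with no migration and no new mutation; the function names, population size and starting frequency are illustrative choices, not values from the slides.

```python
import random

def drift(freq, pop_size, generations, rng):
    """Haploid Wright-Fisher drift: each generation resamples pop_size alleles."""
    for _ in range(generations):
        copies = sum(rng.random() < freq for _ in range(pop_size))
        freq = copies / pop_size
    return freq

def mean_squared_difference(generations, pop_size=50, start_freq=0.5,
                            replicates=100, seed=1):
    """Average squared allele-frequency difference between two daughter
    populations, `generations` after they split from a common ancestor."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replicates):
        # Both daughters start from the same ancestral frequency.
        f1 = drift(start_freq, pop_size, generations, rng)
        f2 = drift(start_freq, pop_size, generations, rng)
        total += (f1 - f2) ** 2
    return total / replicates

if __name__ == "__main__":
    for t in (5, 25, 100):
        print(t, round(mean_squared_difference(t), 3))
```

Under these assumptions, older splits give larger average differences, which is the signal that split-based dating exploits. A migration model would need lineages to move between the two populations and is not covered by this sketch.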
[Figure: the genealogy and its six haplotypes divided among Population 1, Population 2 and Population 3.]

The information on mutational differences in the tree can be used to describe a genealogical cladogram or network. The next slide shows an example using Y chromosome haplogroups.

Y chromosome markers (1): Unique Event Polymorphisms (UEPs)

[Figure: Y chromosome haplogroup network drawn from a root, with branches labelled by UEP markers (e.g. M9, M20, 92R7, 12f2) and their allelic states.]

• These are the results of very rare mutation events
• Differences are usually base substitutions, insertions and deletions
• In humans, each mutation event is likely to have happened only once (hence UEPs)
• UEPs are used to classify Y chromosomes into broad ‘haplogroups’

To move beyond mere description, and to attempt such things as estimating the TMRCA (Time to Most Recent Common Ancestor) of the tree, it is necessary to adopt certain modelling assumptions. One classic model is the Wright-Fisher model. It assumes no recombination (i.e. a single locus), no selection, constant population size, random mating and a random number of offspring per individual (strictly binomial, which is approximately Poisson with mean 1 when the population is large).

Wright-Fisher (WF) model
• Forwards-in-time model of a neutral locus in a constant-size, random-mating, haploid population evolving in discrete generations.
• Each individual in generation t has a random number (possibly 0) of offspring in generation t+1. Each offspring is:
  – identical to the parent with probability 1-μ;
  – otherwise a mutant (probability μ).

[Figure: a WF population drawn forwards in time, generation 0 to generation 6.]

When we consider the same set of assumptions but now simulate going “backwards in time”, we arrive at the standard coalescent model.

Coalescent theory
• Backwards-in-time “cousin” of the WF model: similar assumptions, but traces the ancestry of n observed alleles.
• Coalescent theory can be derived from the WF model in the limit as N → ∞.
• Ancestry is represented via a genealogical tree: leaves are observed alleles, root is the most recent common ancestor (MRCA).

[Figure: the same population traced backwards in time, generation 6 to generation 0, with t0 = present, t1 = 1st coalescence, t2 = 2nd coalescence, t3 = TMRCA (Time to Most Recent Common Ancestor).]

Time is measured in units of N generations: 1 coalescent time unit = NG years, where G is the generation time in years. The time back to the next coalescence when there are j lineages has an exponential distribution with mean and standard deviation both equal to 2/(j(j-1)).

e.g. n = 4:
  2 lineages: mean = sd = 1
  3 lineages: mean = sd = 1/3
  4 lineages: mean = sd = 1/6
  Height of tree: mean = 3/2, sd = 1·07
  Total branch length: mean = 11/3, sd = 2·33

Because the intervals are independent, their means and variances add. The height has mean 1 + 1/3 + 1/6 = 3/2 and variance 1 + 1/9 + 1/36 = 41/36 (sd = 1·07). The total branch length weights each interval by its number of lineages, giving mean 2(1) + 3(1/3) + 4(1/6) = 11/3.

The TMRCA under the coalescent
• The TMRCA (height of the genealogical tree) is on average 2(n-1)/n; the average time in which there are just two ancestral lineages is 1.
  – the number of ancestors of a sample drops rapidly (backwards in time);
  – for more than half its history, on average, a sample has only two ancestors;
  – data often clustered.
• The standard deviation of the TMRCA is 1 when n=2, increasing with n to about 1·08.
  – the height of the tree is highly variable from locus to locus.

When we simulate from the standard coalescent, we find that there is considerable variation in the TMRCA from one simulation to the next. Most coalescent events occur in the recent past (at the tips of the tree).

The situation changes if we expand the coalescent model to incorporate exponential population growth. Now there is less variation in the TMRCA between simulations, and more coalescent events occur in the more distant past (near the root of the tree).

Generalisations of the standard coalescent model
• Variable population size: coalescences occur more rapidly when the population size is small.
• Population subdivision with migration.
• Some forms of selection.
• Recombination: the ancestral recombination graph (ARG).

What can we date?
- TMRCA (or subclade TMRCA)
- Demographic parameters

The next two slides briefly consider how we can use a method called BATWING to obtain estimates both of the TMRCA and of demographic parameters, such as the time at which growth starts or the timings of population split events. The first slide describes the demographic model used (population splitting with growth). The second slide briefly describes how BATWING explores “tree space” to arrive at probability distributions for parameters of interest.

[Slide: BATWING Model]
[Slide: BATWING Method (briefly!)]

Mutational models
- UEP (“infinite sites” model)
- Microsatellites (stepwise models)

The “infinite sites” model is appropriate for UEP events on the Y chromosome. Because these mutations occur rarely, and because the Y chromosome is big (~60 Mb), each mutation most likely occurs at a different site (this would hold exactly if the number of sites were infinite). This allows each mutation to be distinguished, as illustrated in the next slide.

[Figure: two MUTATION events on a genealogy; present-day haplotypes coded 10, 00, 00, 00, 01, 01, 01, 01.]

The “stepwise mutation” model (SMM) is appropriate for microsatellites. When a mutation occurs, the new allele length depends on the existing length. In the simplest case, the “single-step” SMM illustrated in the next slide, new length = old length ± 1.

Microsatellites
GAGGCGTAGTAGTAGTAGTAGTAGTAGGCTCTA
mutates to
GAGGCGTAGTAGTAGTAGTAGTAGGCTCTA
or
GAGGCGTAGTAGTAGTAGTAGTAGTAGTAGGCTCTA
• Microsatellites mutate very fast (~1 change every 500 generations)
• Mutation events usually involve a gain or a loss of a single repeat unit

The “stepwise mutation” model (SMM) can be extended in many ways. One useful model that is gaining acceptance is the linear length-dependent SMM. Here, when mutations occur they still go up (+1) or down (-1) with equal probability, but the rate at which mutations occur increases as a linear function of microsatellite length.
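The two rate models just described can be sketched as follows. This is a minimal illustration, not a fitted model: the function names, the base rate of 0.002 per generation (about 1 change every 500 generations), the slope and the reference length of 10 repeats are all assumed values chosen for the example.

```python
import random

def mutation_rate(length, base_rate=0.002, slope=0.0, ref_length=10):
    """Per-generation mutation probability for an allele of `length` repeats.

    slope = 0 gives the simple SMM (constant rate); slope > 0 gives a
    linear length-dependent SMM. All default values are illustrative only.
    """
    rate = base_rate + slope * (length - ref_length)
    return min(max(rate, 0.0), 1.0)  # keep the result a valid probability

def evolve(length, generations, rng, slope=0.0, min_length=1):
    """Follow one lineage forwards in time under a single-step SMM."""
    for _ in range(generations):
        if rng.random() < mutation_rate(length, slope=slope):
            # Gain or loss of one repeat unit with equal probability.
            length = max(min_length, length + rng.choice((-1, 1)))
    return length

if __name__ == "__main__":
    rng = random.Random(1)
    print("simple SMM:", evolve(10, 5000, rng))
    print("length-dependent SMM:", evolve(10, 5000, rng, slope=0.0005))
```

With `slope=0` every allele mutates at the same rate; with a positive slope, longer alleles mutate more often, although each step is still +1 or -1 with equal probability.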
The model has a plausible mechanism: if mutations occur by replication slippage, and slippage can occur between any two adjacent motifs with equal probability, then the linear length-dependent SMM results.

[Figure: mutation rate plotted against repeat size (8 to 12), comparing the length-dependent SMM with the simple SMM.]

“YTime”

“YTime” is a software package written by Mike Weale for estimating TMRCAs using microsatellite data at a single locus (as one would find on the Y chromosome, hence the name). See www.ucl.ac.uk/tcga/software

The software assumes that the haplotype of the MRCA is known. This means that the point estimate of the TMRCA is unaffected by demographic history, although the standard error of the estimate will be affected. The method works by simulating under different TMRCA values until the observed ASD (average squared distance, in repeat units, between the sampled chromosomes and the MRCA haplotype) falls within the 95% confidence interval of the simulated values.

[Figure: “Vary TMRCA until obs. ASD is likely”. Four panels, each showing the simulated distribution of ASD values and the likelihood for a trial TMRCA (T = 150, 200, 80 and 97 generations), with a position marked D.]
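The idea of varying the TMRCA until the observed ASD becomes plausible can be sketched in a few lines. This is not the YTime implementation: it assumes a star genealogy (every sampled chromosome descends independently from the MRCA), a simple single-step SMM with a constant rate, and mutations arriving in continuous time. The function names, the mutation rate and the sample sizes are illustrative assumptions.

```python
import random

def simulate_asd(t_gens, mu, n_chrom, n_loci, rng):
    """One simulated ASD from the known MRCA haplotype (star genealogy)."""
    total = 0
    for _ in range(n_chrom * n_loci):
        # Mutations arrive at rate mu per generation along a lineage
        # of length t_gens; each one adds or removes a single repeat.
        change = 0
        t = rng.expovariate(mu)
        while t < t_gens:
            change += rng.choice((-1, 1))
            t += rng.expovariate(mu)
        total += change ** 2
    return total / (n_chrom * n_loci)

def asd_interval(t_gens, mu, n_chrom, n_loci, n_sims=400, seed=1):
    """Central 95% interval of simulated ASD values for a trial TMRCA."""
    rng = random.Random(seed)
    sims = sorted(simulate_asd(t_gens, mu, n_chrom, n_loci, rng)
                  for _ in range(n_sims))
    return sims[int(0.025 * n_sims)], sims[int(0.975 * n_sims) - 1]

def compatible_tmrcas(obs_asd, candidates, mu, n_chrom, n_loci):
    """Trial TMRCAs whose simulated 95% interval contains the observed ASD."""
    keep = []
    for t_gens in candidates:
        low, high = asd_interval(t_gens, mu, n_chrom, n_loci)
        if low <= obs_asd <= high:
            keep.append(t_gens)
    return keep

if __name__ == "__main__":
    trial_values = [50, 100, 150, 200, 400]
    print(compatible_tmrcas(0.3, trial_values, mu=0.002, n_chrom=20, n_loci=6))
```

Under these assumptions the expected ASD is mu × T, so an observed ASD of 0.3 with mu = 0.002 points to a TMRCA of about 150 generations; the simulation adds the range of trial values that cannot be rejected. A realistic genealogy has shared branches, which leaves this expectation unchanged but widens the interval, in line with the remark above that demographic history affects the standard error rather than the point estimate.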