Population genetics

Population genetics concerns the study of genetic variation and change within a population. For evolving species there is no model of the branching process (speciation), whereas in population genetics there is. This allows detailed modelling of the interplay between mutation, selection, and stochastic effects (genetic drift).

Simplifying assumptions that are initially made include:
- No selection
- No recombination
- No fluctuations in population size
- No population structure (subdivision; migration)
- No assortative mating (individuals mate randomly)
- No interaction between loci (no epistasis; no linkage)
- No environmental effects (e.g. climate or habitat change)

R.A. Fisher, J.B.S. Haldane, Sewall Wright, Motoo Kimura

Kimura’s Neutral Theory

Darwin(ism):
* Something causes minute (phenotypic) variations in a population. Ideas at the time: over-use during a lifetime might cause variations (Lamarckism; think giraffes); traits might be transmitted through blood and blend.
* Natural selection causes adaptive variants to rise in frequency, while non-adaptive ones die out.

Neo-Darwinism:
* The “something” is replaced by Mendelian genetics + random mutations
* Panselectionism; adaptationism: most traits are optimal; selection is the main driving force of evolution (R.A. Fisher; Richard Dawkins; John Maynard Smith)

Population genetics / neutral theory:
* Most mutations are neutral; genetic drift underlies most of evolution (Fisher; Haldane; Wright; Kimura)

Modern evolutionary synthesis:
* Takes on board (parts of) all of the above.
* Neutral theory is relevant for DNA data in populations; it is considered less relevant for phenotypes.
Wright-Fisher Model

- Constant population size: N diploid individuals = 2N alleles
- Each descendant chooses a parent randomly
- Everyone reproduces simultaneously (no overlapping generations)

http://www.stats.ox.ac.uk/~mcvean/Modelling.pdf

Suppose i(t) individuals carry a particular mutation A in generation t. The probability that any individual in generation t+1 is of type A is x = i(t) / 2N. The number of individuals of type A in generation t+1 is binomially distributed:

$P(i(t+1) = k) = \binom{2N}{k} x^k (1-x)^{2N-k}$

This distribution has mean and variance

E(i(t+1)) = i(t)
Var(i(t+1)) = 2N x (1-x)

The expected number of individuals carrying a mutation A does not change, but because sampling variance accumulates from generation to generation, the mutation will eventually either be lost (i=0) or reach fixation (i=2N).

Suppose the initial number of copies of the mutant A is i. Since E( i(t+1) ) = i(t), the expected number of copies remains constant throughout. However, eventually the mutant will either be lost or go to fixation. If the probability of eventual fixation is p, we have

i = E( i(0) ) = E( i(∞) ) = 2N p + 0 (1-p) = 2N p

The probability p that A will go to fixation is therefore p = i / 2N.

A simpler argument is this: without selection all alleles are equivalent; the one that gets fixed is chosen uniformly from the present-day population; the probability that this is an A mutant is i / 2N.

This also means that for neutral sites, the rate ρ of substitution = the rate u of mutation (2N u new mutations arise per generation, and each fixes with probability 1/2N).

Since x = i / 2N and Var(i(t+1)) = 2N x (1-x), we get Var(x) = x (1-x) / 2N. In other words, the sampling variance in the allele frequency x is inversely proportional to the population size. This effect is called (random) genetic drift.

The Wright-Fisher model is highly idealized; e.g. populations do vary in size, there is structure, and individuals do not mate randomly.
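The fixation result p = i / 2N can be checked with a small simulation. The following Python sketch is illustrative only: the function names, the small population (2N = 20), the starting count (i = 4), the replicate number and the seed are all assumptions chosen for speed, not values from the slides. Each generation is one binomial draw, implemented as 2N independent parent choices.

```python
import random


def next_generation(i, two_n, rng):
    """One Wright-Fisher step: each of the 2N alleles picks a parent at random.

    With i copies of A among 2N alleles, the number of A copies in the next
    generation is Binomial(2N, x) with x = i / 2N.
    """
    x = i / two_n
    return sum(1 for _ in range(two_n) if rng.random() < x)


def fixation_probability(i0, two_n, replicates, seed=1):
    """Estimate the probability that A, starting at i0 copies, reaches fixation."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    fixed = 0
    for _ in range(replicates):
        i = i0
        # Run until the mutation is lost (i = 0) or fixed (i = 2N).
        while 0 < i < two_n:
            i = next_generation(i, two_n, rng)
        if i == two_n:
            fixed += 1
    return fixed / replicates


if __name__ == "__main__":
    TWO_N = 20  # hypothetical small population so the run is fast
    I0 = 4      # hypothetical initial number of A copies
    estimate = fixation_probability(I0, TWO_N, replicates=2000)
    print(f"simulated: {estimate:.3f}   expected i/2N: {I0 / TWO_N:.3f}")
```

The simulated fraction should lie close to i / 2N, with a sampling error that shrinks as the number of replicates grows.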
Because the Wright-Fisher model is so idealized, the parameter N does not relate directly to the actual population size. A more accurate way of putting this is to say that N is the Wright-Fisher population size that generates the same amount of genetic drift as there is in the actual population. To emphasize this, the parameter N is often called the effective population size (and written Ne).

The coalescent model

[Figure panels: whole population (Wright-Fisher); ancestry of current population; ancestry of a random sample; coalescent]

Kingman’s coalescent (J.F.C. Kingman)

Probability that two given lineages coalesce in one generation:

P(coalescence) = 1/2N

Expected number of generations before coalescence, i.e. the time to the most recent common ancestor (MRCA):

E( TMRCA ) = 2N

Probability of coalescence (of 2 lineages) when k lineages are present = 1 - P(no coalescence):

$1 - \left(1 - \frac{1}{2N}\right)\left(1 - \frac{2}{2N}\right)\cdots\left(1 - \frac{k-1}{2N}\right) \approx \frac{1 + 2 + \cdots + (k-1)}{2N} = \frac{1}{2N}\binom{k}{2}$

Other argument: the coalescence rate per pair is 1/2N; there are k-choose-2 pairs.

Variation in the population

Suppose the mutation rate is u (per generation, and per locus or site). The expected number of differences between two individuals (diversity) is

π = 2 * u * E( TMRCA ) = 4 N u

(assuming all mutations are unique). The quantity 4 N u often appears in population genetics, and is usually treated as an independent parameter, θ.

Real-life populations do not, of course, follow the Wright-Fisher model. The parameter N that makes the W-F diversity equal to the observed diversity is called the effective population size, Ne. Other definitions (based on other aspects of the model) are used as well.

Allele frequency spectrum

By going to the continuous (diffusion) limit, the equilibrium distribution of allele frequencies can be derived. This is called the “allele frequency spectrum”.
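As an aside on the coalescent results above, the Python sketch below compares the exact probability of a coalescence among k lineages with the k-choose-2 approximation, and estimates the pairwise coalescence time by simulation. The function names and all numerical values (2N = 200, u = 1e-5, the replicate count and the seed) are hypothetical choices for illustration, not taken from the slides.

```python
import math
import random


def p_coal_exact(k, two_n):
    """Exact P(at least two of k lineages share a parent) in one generation."""
    p_none = 1.0
    for j in range(1, k):
        # The (j+1)-th lineage must avoid the j parents already chosen.
        p_none *= 1.0 - j / two_n
    return 1.0 - p_none


def p_coal_approx(k, two_n):
    """Approximation: k-choose-2 pairs, each coalescing at rate 1/2N."""
    return math.comb(k, 2) / two_n


def mean_pairwise_tmrca(two_n, replicates, seed=1):
    """Average number of generations until two lineages coalesce."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    total = 0
    for _ in range(replicates):
        t = 1
        # Each generation the pair coalesces with probability 1/2N.
        while rng.random() >= 1.0 / two_n:
            t += 1
        total += t
    return total / replicates


if __name__ == "__main__":
    TWO_N = 200  # hypothetical number of alleles
    U = 1e-5     # hypothetical mutation rate per generation
    print("exact vs approx, k=5:", p_coal_exact(5, TWO_N), p_coal_approx(5, TWO_N))
    t_mean = mean_pairwise_tmrca(TWO_N, replicates=5000)
    print("mean TMRCA:", t_mean, " expected 2N:", TWO_N)
    print("diversity 2*u*TMRCA:", 2 * U * t_mean, " expected 4Nu:", 2 * U * TWO_N)
```

The approximation is good as long as k is much smaller than 2N, which is the regime in which Kingman's coalescent is normally applied.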
Assuming that mutations and back-mutations occur at the same rate u, the allele frequency spectrum P(x)dx is

$P(x)\,dx = x^{\theta-1} (1-x)^{\theta-1}\,dx$

(apart from normalization). Here θ = 4 N u.

[Figure: plot of P(x) against allele frequency x]

Suppose a mutation occurs at frequency x. The probability of sampling two individuals that are different at that locus is 2 x (1-x). Multiplying with P(x) dx gives the contribution to the heterozygosity (= probability that two random alleles differ) per unit of frequency:

$H(x)\,dx = x^{\theta} (1-x)^{\theta}\,dx$

(again apart from normalization). Since θ is small, every frequency contributes nearly equally to the total heterozygosity.

Under the influence of selection, the allele frequency spectrum becomes skewed towards the advantageous allele, and depleted of intermediate-frequency alleles. This is one way to test for selection.

Linkage disequilibrium (LD)

Richard Lewontin (1929-)

Relates to 2 polymorphic sites:

$D_{AB} = f_{AB} - f_A f_B = f_{AB} f_{ab} - f_{Ab} f_{aB}$

$(D_{AB} = -D_{aB} = -D_{Ab} = D_{ab})$

Correlation coefficient (Hill & Robertson 1968):

$r^2_{AB} = D_{AB}^2 / (f_a f_A f_b f_B)$

Dynamics of LD

Genetic drift causes reduction in diversity, so that (expected) LD → 0 at equilibrium. Recombination decreases LD.

Effect of a selective sweep (rapid increase of frequency of an advantageous allele) on LD:
- Diversity is reduced
- Polymorphisms on the selected haplotype are carried along: hitchhiking
- More correlations between sites: many share ancestry
- Result: LD increases

Prior observations
• “Extent of enzyme polymorphism is surprisingly constant between species. So constant, in fact, that the effective sizes of most species must be within 1 order of magnitude of each other.” (Lewontin 1974; Maynard Smith & Haigh 1974)
• Variation is reduced in regions with low recombination (Aguade 1989; Begun & Aquadro 1992, etc.)

Neutral locus and selected locus

Assumptions:
- Rate of neutral mutations = u
- Rate of advantageous mutations = v
- Selective advantage of adv.
mutations = σ

Without linkage to selected locus:
mean sum-of-site heterozygosities (ssh; diversity) = 4 N u
(= mean time to coalescence * 2 lineages * neutral mutation rate)

Additional assumptions:
- Times of fixation at selected locus: Poisson process, rate ρ
- Fixations are fast compared to drift, and can be regarded as instantaneous

With linkage to selected locus:
Rate of coalescence due to drift = 1/2N
Rate of fixation of advantageous mutations at selected site = ρ
Total coalescence rate: ρ + 1/2N
Average time to coalescence: 1 / (ρ + 1/2N)
ssh = 2 u / (ρ + 1/2N) = 4 N u / (1 + 2 N ρ)
Limit for N → ∞: ssh = 2 u / ρ
Rate of fixation: ρ ≈ v * 2 N σ (provided 1/2N < σ < 1)

Fixation due to hitchhiking:
Current frequency of allele A = x
New frequency of allele A = z

z = 1 with probability ρ x (hitchhiking; allele A)
z = 0 with probability ρ (1-x) (hitchhiking; allele a)
z = x with probability 1-ρ (no hitchhiking)

Δfreq = z - x
E(Δfreq) = 0
Var(Δfreq) = ρ x (1-x) (infinite population)
Var(Δfreq) = (1/2N) x (1-x) (finite population; no hitchhiking)
Var(Δfreq) = (ρ + 1/2N) x (1-x) (finite population + hitchhiking)

Same form as standard W-F model, but with Ne = N / (1 + 2 N ρ).

Now assume some recombination between the neutral and selected loci (instead of total linkage). Suppose the allele linked to the advantageous mutation rises to frequency y (rather than frequency 1).
z = y + (1-y) x with probability ρ x (hitchhiking; allele A)
z = (1-y) x with probability ρ (1-x) (hitchhiking; allele a)
z = x with probability 1-ρ (no hitchhiking)

Δfreq = z - x
E(Δfreq) = 0
Var(Δfreq) = ρ y^2 x (1-x) (infinite population)
Var(Δfreq) = (1/2N) x (1-x) (finite population; no hitchhiking)
Var(Δfreq) = (ρ y^2 + 1/2N) x (1-x) (finite population + hitchhiking)

Same form as standard W-F model, but with Ne = N / (1 + 2 N ρ y^2).

Coalescence rate due to drift = 1/2N
Coalescence rate due to hitchhiking = ρ E( y^2 )

If 2 ρ y^2 > 1/N, “draft” (due to hitchhiking, sweeps) is more important than “drift” (population size effect). In the “draft” regime, nucleotide diversity is independent of population size.

Numerical example: Fruit fly

Limit for N → ∞: ssh = 2 u / (ρ y^2)
Neutral mutation rate: u = 10^-9 per generation, per site
Site heterozygosity: ssh = 0.006
Assume y = 1
Rate of advantageous substitutions: ρ ~ 10^-7, “typical of rate of amino acid substitutions in coding regions”

Questions in (population) genetics
• The effective population size of the human population is ~10000. Why the huge discrepancy with the actual population size?
• The amount of genetic diversity is “surprisingly constant between species” (Lewontin 1974). Is this (i) not a problem / not true, (ii) caused by Gillespie’s “genetic draft”, or (iii) caused by something else?
• What is the cause of the variation in recombination rate (including hotspots) across the human genome? Are the latest measurements accurate?
• Roughly the same 3-5% of the mammalian genome is conserved within the mammalian clade. Does this represent most/all of the functional genome, or is a large fraction functional and fast evolving? What can population genetics (rather than species comparisons) bring to this question?
• Common (high-frequency) genetic variants associated with common disease are hard to find and usually explain only a small fraction (~1%) of the variation in susceptibility.
Are common diseases often caused by rare genetic variants instead? If so, how can these be found? (Not by association studies; linkage studies, however, are expensive and have low resolution.)
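Returning to question (ii) and the fruit fly example, the hitchhiking formulas can be evaluated numerically to see how diversity stops depending on N. The Python sketch below is a minimal illustration: the function names and the range of population sizes are assumptions, while u = 10^-9, ρ = 10^-7, y = 1 and ssh = 0.006 are the values quoted in the example.

```python
import math


def ssh(n, u, rho, y=1.0):
    """Sum-of-site heterozygosity with drift plus hitchhiking.

    ssh = 2u / (rho * y^2 + 1/2N) = 4Nu / (1 + 2N * rho * y^2)
    """
    return 2.0 * u / (rho * y * y + 1.0 / (2.0 * n))


def effective_size(n, rho, y=1.0):
    """Ne = N / (1 + 2N * rho * y^2); tends to 1 / (2 * rho * y^2) for large N."""
    return n / (1.0 + 2.0 * n * rho * y * y)


if __name__ == "__main__":
    U = 1e-9    # neutral mutation rate per generation, per site
    RHO = 1e-7  # rate of advantageous substitutions
    # Hypothetical range of census sizes, chosen to show the saturation.
    for n in (1e4, 1e6, 1e8, 1e10):
        print(f"N = {n:.0e}   ssh = {ssh(n, U, RHO):.3e}   Ne = {effective_size(n, RHO):.3e}")
    print("draft limit 2u/rho:", 2.0 * U / RHO)
    # Rate implied by the quoted heterozygosity in the large-N limit with y = 1.
    print("rho implied by ssh = 0.006:", 2.0 * U / 0.006)
```

For small N the diversity grows in proportion to N (drift regime); once 2 N ρ y^2 is much larger than 1 it levels off at 2 u / (ρ y^2) (draft regime). The rate implied by ssh = 0.006 comes out at the same order of magnitude as the quoted ρ ~ 10^-7.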