Modeling Evolution: The Wright-Fisher Model
Rick Stuart and Samantha Biehl

Introduction to Evolution

Evolution is broadly defined as change over time. Biologically, it can be defined as heritable change that results in diversity. One process that leads to evolution is natural selection, defined as a change in allele frequencies over time due to variation, inheritance, and differential survival. An allele is a variant of a gene. For example, a gene that codes for eye color has several variants: one that codes for blue eyes, one for green eyes, and one for brown eyes. A diploid organism carries two alleles for each gene, and a haploid organism carries only one. One allele may be dominant to another, in which case the dominant allele is expressed in the presence of the less dominant allele, often referred to as the recessive allele.

One of the mechanisms by which evolution can occur is genetic drift. In this mechanism, the change in allele frequency is entirely random and independent of any selection pressure. The event in which an allele is completely eliminated from a population is referred to as extinction. The opposite of extinction, when only one allele remains in the population because the other has gone extinct, is referred to as fixation. When fixation occurs, the genetic variation of that population for that gene is gone. The only ways to regain variation are the migration of new alleles into the population or the mutation of an allele.

Mutation is the ultimate source of variation and is defined as a change to the DNA sequence. Some mutations do not affect the expression of the gene associated with the DNA sequence; these are known as silent or neutral mutations.
Other mutations, however, may affect the expression of a gene and thus how well the organism fares. How well an organism does in its environment is called fitness. Fitness is formally defined as the measure of an individual's reproductive success relative to the average reproductive success in a population. Therefore, an individual who fares better and produces more offspring has a higher fitness than an individual who fares well but does not produce many offspring.

Introduction to Wright-Fisher

The Wright-Fisher model, in its most basic form, is a stochastic model often used to model genetic drift in a finite population [1, 2]. The model can be used for both haploid and diploid individuals, as it only tracks the allele frequencies in the population. The model is usually applied to two alleles, although it can also be used with more than two. The variables used in the basic Wright-Fisher model are as follows:

$N$ = number of individuals
$2N$ = number of alleles (type A and B)
$i$ = number of type A alleles at step $n-1$
$2N - i$ = number of type B alleles at step $n-1$
$X_n$ = number of type A alleles at step $n$
$\frac{i}{2N}$ = probability of getting a type A allele
$1 - \frac{i}{2N}$ = probability of getting a type B allele

We can then write the probability that the number of type A alleles goes from $i$ to $j$ in the next generation:

$$P(X_n = j \mid X_{n-1} = i) = \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(\frac{2N-i}{2N}\right)^{2N-j}.$$

The number of alleles at a particular step depends only on the number of alleles at the previous step; therefore the Wright-Fisher model is a Markov chain. The transition matrix for the Markov chain can be constructed from the entries

$$T_{ij} = \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(\frac{2N-i}{2N}\right)^{2N-j}.$$
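
As a minimal sketch of how the transition matrix could be built numerically, the Python snippet below fills in each entry $T_{ij}$ from the binomial formula. The function name `transition_matrix` and the choice of $2N = 20$ are illustrative assumptions, not part of the original code.

```python
from math import comb


def transition_matrix(two_n):
    """Return the (2N+1) x (2N+1) Wright-Fisher transition matrix as nested lists.

    Entry T[i][j] is P(X_n = j | X_{n-1} = i), a Binomial(2N, i/2N) probability.
    (Hypothetical helper written for illustration.)
    """
    matrix = []
    for i in range(two_n + 1):
        p = i / two_n  # probability of drawing a type A allele
        row = [
            comb(two_n, j) * p**j * (1 - p) ** (two_n - j)
            for j in range(two_n + 1)
        ]
        matrix.append(row)
    return matrix


T = transition_matrix(20)  # 2N = 20 is an assumed example size
print(sum(T[5]))           # each row should sum to (approximately) 1
```

Rows 0 and $2N$ place all of their probability on the diagonal, which is how the absorbing states (extinction and fixation) show up in the matrix.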

Probability of Fixation

One interesting aspect of the model to investigate is the probability that an allele will go to fixation or extinction [2]. Since $X_{t+1}$ given $X_t$ is binomially distributed with $n = 2N$ trials and success probability $p = \frac{X_t}{2N}$, we can apply the general formula for the expected value of a binomial variable, $E(X) = np$, to our model. It then follows that

$$E(X_{t+1} \mid X_t) = 2N \cdot \frac{X_t}{2N} = X_t.$$

With $X_0 = i$ we have $E(X_0) = i$, and by induction it can be shown that $E(X_t \mid X_0 = i) = i$ for all values of $t$. This makes the model a bounded martingale. One property of bounded martingales is that they converge as the number of steps approaches infinity [3]. The model can only converge on one of its absorbing states, extinction ($X = 0$) or fixation ($X = 2N$). Given that the probability of the model being in one of the absorbing states as $n \to \infty$ is 1, the probability of being in any other state is 0. Therefore the expected value of $X_\tau$, where $\tau$ is the step at which the model goes to extinction or fixation, is

$$E(X_\tau \mid X_0 = i) = 0 \cdot P(X_n = 0) + 2N \cdot P(X_n = 2N) \quad \text{as } n \to \infty.$$

Because the expected value remains $i$ at every step and the process is bounded, the limit also has expected value $i$. Therefore,

$$i = 0 \cdot P(X_n = 0) + 2N \cdot P(X_n = 2N) \quad \text{as } n \to \infty,$$

or

$$P(X_n = 2N) = \frac{i}{2N} \quad \text{as } n \to \infty.$$

Assumptions

The assumptions of the basic Wright-Fisher model are as follows:
1) generations do not overlap,
2) population size remains constant,
3) no mutation or recombination is occurring,
4) the sexes of the individuals are ignored,
5) familial ties are not taken into consideration,
6) no selection pressure is acting in the population.

Basic Model

In our basic model, we used a diploid population with $N = 10$ individuals, so we tracked $2N = 20$ alleles. First, we wrote code that fills an array with a specified number of A and B alleles in random order (1 for an A allele and 0 for a B allele). The code then randomly chooses an allele from this initial array to become the first allele in the new generation (Figure 1), and this continues until all 20 alleles have been chosen for the new generation.
The process would then repeat, randomly choosing alleles from this new generation to form the next generation.

Figure 1. An allele is chosen at random from the previous generation and is placed into the new generation.

We ran this process for 20 generations and assumed that both alleles were initially represented equally in the original population (Figure 2). As can be seen, the A allele ends up faring better than the B allele, and the path of the B allele is just a mirror image of the path of the A allele. We chose the A allele as our allele of interest and therefore graphed only the A allele in later simulations to keep our graphs from looking too cluttered.

Figure 2. One simulation of the basic Wright-Fisher model with $2N = 20$, showing both the A and B alleles.

We then wanted to look at the variation that could occur between simulations, so we ran 10 simulations with $2N = 20$ (Figure 3). In 5 of the 10 simulations the A allele goes to fixation, in 2 of the 10 it goes to extinction, and in 3 of the 10 it is still present in the population, but in varying numbers. This demonstrates the randomness associated with genetic drift.

Figure 3. Ten simulations of the basic Wright-Fisher model with $2N = 20$, showing only the A alleles.

We then looked at larger population sizes, starting with $2N = 1000$ (Figure 4) and then $2N = 100000$ (Figure 5), and found that as the population size increased, the randomness of genetic drift had a much smaller effect, and the A allele went to neither fixation nor extinction in the generation span we allowed.

Figure 4. Ten simulations of the basic Wright-Fisher model with $2N = 1000$, showing only the A alleles.

Figure 5. Five simulations of the basic Wright-Fisher model with $2N = 100000$, showing only the A alleles.

Modeling Probability of Fixation

Recall that we derived the probability of fixation of an allele with initial count $X_0 = i$ to be $P(X_n = 2N) = \frac{i}{2N}$ as $n \to \infty$.
To test this we performed 100 simulations in our basic model. Each simulation consisted of 100 trials with $i = 1$ and $2N = 20$. Since $\frac{i}{2N} = \frac{1}{20} = 0.05$, we would expect an average of 5 trials per simulation to result in the allele going to fixation. Running the simulation gave a mean of 5.02 trials that went to fixation over the 100 simulations.

Fitness Model

Once we had our basic model running, we wanted to see if we could add fitness to the model. To do so, we still had our code randomly choose an allele from the previous generation, but this time we set it up so that it was more likely to choose an A allele than a B allele. We derived the probability of choosing an A allele given the fitness of each allele:

$$P(A) = \frac{\frac{X_{n-1}}{2N}\,\omega_A}{\frac{X_{n-1}}{2N}\,\omega_A + \left(1 - \frac{X_{n-1}}{2N}\right)\omega_B}$$

where $\omega_A$ is the fitness of the A allele and $\omega_B$ is the fitness of the B allele. We then set the fitness of the A allele to 1 and the fitness of the B allele to 0.9 and ran 10 simulations (Figure 6). Since the fitness of allele A was greater than the fitness of allele B, we expected to see more of the simulations going to fixation. This was in fact the case, with 6 of the 10 simulations going to fixation and only 1 of the 10 going to extinction.

Figure 6. Ten simulations of the Wright-Fisher model with fitness $\omega_A = 1.0$, $\omega_B = 0.9$, with $2N = 20$, showing only the A alleles.

We ran several simulations with the fitness of A set at 1.0 and the fitness of B set at 0.95, varying the number of alleles and the initial number of A alleles (Figure 7, Figure 8), and found that in each simulation, given a large enough population with a large enough proportion of A alleles, the A alleles eventually went to fixation.

Figure 7. Ten simulations of the Wright-Fisher model with fitness $\omega_A = 1.0$, $\omega_B = 0.95$, with $2N = 1000$, showing only the A alleles. [Plot axes: Generations (x) vs. # of Alleles (y).]
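
As a rough illustration of the fitness-weighted sampling probability $P(A)$ described in this section, the Python sketch below estimates how often the A allele reaches fixation. The function names, the trial count, and the seed are assumptions made for this example. Each of the $2N$ new alleles is drawn as an independent A-or-B choice, which is equivalent to one binomial draw per generation. Setting $\omega_A = \omega_B$ recovers the basic (neutral) model, so the same code can be compared against $\frac{i}{2N}$.

```python
import random


def prob_a(x_prev, two_n, w_a=1.0, w_b=1.0):
    """Probability of drawing a type A allele given x_prev A alleles in the previous step."""
    freq_a = x_prev / two_n
    return freq_a * w_a / (freq_a * w_a + (1 - freq_a) * w_b)


def run_until_absorbed(i, two_n, w_a, w_b, rng):
    """Simulate from X_0 = i until extinction (0) or fixation (2N); return the final count."""
    x = i
    while 0 < x < two_n:
        p = prob_a(x, two_n, w_a, w_b)
        # Each allele in the new generation is type A with probability p.
        x = sum(rng.random() < p for _ in range(two_n))
    return x


def fixation_fraction(i, two_n, w_a=1.0, w_b=1.0, trials=4000, seed=0):
    """Fraction of trials in which the A allele goes to fixation (assumed helper)."""
    rng = random.Random(seed)
    fixed = sum(
        run_until_absorbed(i, two_n, w_a, w_b, rng) == two_n for _ in range(trials)
    )
    return fixed / trials


neutral = fixation_fraction(1, 20)                      # expected to be near 1/20
selected = fixation_fraction(1, 20, w_a=1.0, w_b=0.9)   # A has a fitness advantage
print(neutral, selected)
```

With equal fitness values the estimate should land close to 0.05 for $i = 1$ and $2N = 20$, while giving A a fitness advantage should raise it; the exact numbers depend on the seed and the number of trials.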
Figure 8. Ten simulations of the Wright-Fisher model with fitness $\omega_A = 1.0$, $\omega_B = 0.95$, with $2N = 1000$, showing only the A alleles. [Plot x-axis: Generations.]

Fitness with Mutation Model

Next we wanted to introduce mutation into our model. We set the probability of a mutation occurring to $\mu = 0.01$ and used the same probability formula as in our fitness model, modified to take into account each new mutant allele that arose:

$$P(i) = \frac{\frac{X_{i:n-1}}{2N}\,\omega_i}{\sum_{j \in \text{alleles}} \frac{X_{j:n-1}}{2N}\,\omega_j}$$

where $X_{i:n-1}$ is the number of copies of allele $i$ at step $n-1$ and $\omega_i$ is its fitness. We also set the fitness of the A and B alleles to 1.0 and assigned a random fitness between 0.5 and 1.5 to each new mutant allele. We ran one simulation with $2N = 100$ (Figure 9) and found that in this simulation both of our original alleles went to extinction and a mutant allele came close to going to fixation.

Figure 9. One simulation of the Wright-Fisher model with fitness $\omega_A = 1.0$, $\omega_B = 1.0$, $0.5 \le \omega_{mut} \le 1.5$, and mutation probability $\mu = 0.01$, with the maximum number of mutations set at 10 and $2N = 100$, showing all alleles.

Summary and Conclusion

Our objective in this assignment was to study a stochastic model and report our findings. We chose the Wright-Fisher model for genetic drift in a finite population. Even though other models have been developed for these scenarios, we wanted to modify our basic Wright-Fisher model to include factors such as fitness and mutation. Our results indicate that the model can indeed be extended in these ways. The basic model does a nice job of demonstrating the concept of genetic drift. The model that includes fitness illustrates the power of competition and how an allele with a higher fitness has a competitive advantage. Finally, the model that includes fitness and mutation illustrates a mechanism of natural selection: competitive mutations supplanting existing alleles.
Some limitations of the model include the fact that it is a generation-by-generation model and therefore does not take generational overlap into account. The model also does not take into account the sex of each individual, mating patterns, or sexual selection. Future studies could look at integrating solutions to these limitations into the model.

References

1. "Genetic Drift." Wikipedia: The Free Encyclopedia. 3 May 2013. Web. http://en.wikipedia.org/wiki/Genetic_drift
2. Mitrofanova, Antonina. "Lecture 2: Absorbing States in Markov Chains. Mean Time to Absorption. Wright-Fisher Model. Moran Model." Lecture notes. NYU, Department of Computer Science, New York. 18 Dec. 2007. Web. http://cs.nyu.edu/mishra/COURSES/09.HPGP/scribe2
3. "Martingale Convergence Theorem." Planetmath.org. N.p., n.d. Web. 06 May 2013. http://planetmath.org/martingaleconvergencetheorem

Pseudocode

Basic Model

- Set number of simulations
- Loop 1 to number of simulations
  - Set number of alleles, number of A alleles, generations
  - % Note that an A allele will be 1 and a B allele will be 0
  - Randomly choose the A alleles for generation 1
  - Pre-allocate matrices to track alleles, # of A's and # of B's over generations
  - Seed generation 1
  - Loop 2 to number of generations
    - Loop 1 to number of alleles
      - Randomly choose an allele in step n-1 to copy
  - Fill # of A's and # of B's matrices
  - Plot single simulation

Fitness Incorporation

To determine the type of each allele in step n, the following algorithm was used:

- Loop 1 to number of alleles
  - Get a random variable R
  - Get the number of A alleles in step n-1
  - Calculate the probability that the allele will be type A
  - If R < probability that the allele will be type A
    - Assign it type A
  - else
    - Assign it type B

Mutation Incorporation

Changes included:

- Keeping track of the mutations and the mutation fitness values in matrices
- Altering the fitness algorithm to determine the probability of each of the alleles that are or have been present up to that time step
- Walking through the cumulative probabilities with a continuous random variable to determine which of the alleles that are present will be assigned
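
The mutation changes listed above can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the code used for the figures: it assumes that a mutation can occur each time an allele is drawn (with probability `mu`), that each new mutant receives a fitness drawn uniformly from 0.5 to 1.5, and that no more than `max_mutations` mutant alleles are ever created. All function and parameter names are hypothetical.

```python
import random


def choose_allele(counts, fitness, r):
    """Walk the cumulative fitness-weighted distribution; return the allele index hit by r in [0, 1)."""
    weights = [c * w for c, w in zip(counts, fitness)]
    total = sum(weights)
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight / total
        if r < cumulative:
            return index
    # Guard against floating-point round-off: fall back to the last allele still present.
    return max(k for k, c in enumerate(counts) if c > 0)


def next_generation(counts, fitness, mu, max_mutations, rng):
    """Build the next generation's allele counts; new mutants are appended as new alleles."""
    two_n = sum(counts)
    new_counts = [0] * len(counts)
    new_fitness = list(fitness)
    mutations_so_far = len(counts) - 2  # alleles A and B occupy indices 0 and 1
    for _ in range(two_n):
        # Assumption: mutation is checked once per drawn allele.
        if mutations_so_far < max_mutations and rng.random() < mu:
            new_counts.append(1)
            new_fitness.append(rng.uniform(0.5, 1.5))
            mutations_so_far += 1
        else:
            new_counts[choose_allele(counts, fitness, rng.random())] += 1
    return new_counts, new_fitness


def simulate(two_n=100, initial_a=50, generations=100, mu=0.01, max_mutations=10, seed=1):
    """Run one simulation; return the per-generation counts and the final fitness list."""
    rng = random.Random(seed)
    counts = [initial_a, two_n - initial_a]
    fitness = [1.0, 1.0]
    history = [counts]
    for _ in range(generations):
        counts, fitness = next_generation(counts, fitness, mu, max_mutations, rng)
        history.append(counts)
    return history, fitness


history, fitness = simulate()
print(history[-1])  # allele counts in the final generation: A, B, then any mutants
```

Alleles that have gone extinct contribute zero weight, so the cumulative walk skips over them, and the total number of alleles stays at $2N$ in every generation.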