Population Genetics: Building the Coalescence Theory

2008
USC3002
ANSHUL GUPTA
U067327N
This paper builds a model that traces all the alleles of a gene back to a single ancestral copy. With
various refinements and modifications, coalescence theory has also been used to study phylogeny
and reconstruct phylogenetic trees.
What is the Coalescence Theory?
Coalescence theory is a model in population genetics that provides a statistical method for tracing
biological entities (usually alleles) back to a common ancestral copy. One example would be an attempt
to trace all the alleles that produce the various eye colors back to the single allele in the past from
which all the ones we see today descend. Using this theory, a statistical gene genealogy of that gene
can be constructed, and the time at which all of these alleles coalesce into a single allele can be
predicted.
Bare Beginnings
To appreciate the simplicity of the approach, and to make the more complex coalescence models easier
to understand, we start with the simplest one.
We ignore selection, mutation and recombination. The only source of randomness left is the random
choice of parents, that is, genetic drift.
We assume that generations are discrete, that the population size remains constant at N in every
generation, and that all N members are equally likely to produce an offspring in each generation.
We can picture this as a constant population of N bacteria that reproduce by replication and for which
all of the assumptions above hold.
The probability that any bacterium in generation t has a parent is 1: to exist, it must have been
produced by some bacterium in the previous generation.
The probability that a second bacterium has the same parent as the first is 1 out of the total
population in generation t-1, which is N.
Thus, the probability that two bacteria in generation t have the same parent in generation t-1 is 1/N
(please see figure 1 below).
Figure 1: The probabilities of two offspring sharing the same parent.
Thus we can see that the probability of two bacteria not sharing a parent is 1 – 1/N or (N-1)/N.
Figure 2: The lineages in subsequent generations as we go back for two bacteria. (The time increases as
we go from bottom to top. Please note that the fusion of the two lineages is called coalescence).
Now we can write the probability distribution of the generation in which two bacteria first share an
ancestor.
The probability that the lineages of two bacteria coalesce exactly t generations ago is the probability
that they do not coalesce for t-1 generations multiplied by the probability that they then coalesce in
the next one. Because we are following lineages backward, two lineages that have fused into one cannot
fuse again.
This distribution looks like a single term of a binomial distribution without the binomial coefficient;
it is the geometric distribution.
So we write:
$$P(t) = \left(1 - \frac{1}{N}\right)^{t-1} \frac{1}{N}$$
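For comparison, the binomial probability of k successes in n trials, each with success probability p, is
$$P(k) = \binom{n}{k} p^{k} (1-p)^{n-k}$$
When we make n >> k while keeping λ = np fixed (the following derivation is modeled on the Wikipedia
article on the Poisson distribution), we take the logarithm of the factorial terms and apply Stirling's
approximation.
Thus, the probability distribution becomes a Poisson distribution:
$$P(k) \approx \frac{\lambda^{k} e^{-\lambda}}{k!}$$
In our situation, no coalescence in t-1 generations corresponds to k = 0 with n = t-1 and p = 1/N, so that
$$P(t) \approx \frac{1}{N} e^{-(t-1)/N} \approx \frac{1}{N} e^{-t/N}$$
Thus, we can see that our probability distribution is well approximated by an exponential distribution
with rate λ = 1/N.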
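The quality of this approximation can be checked numerically. The following is a minimal Python sketch, assuming the two-lineage probability (1 - 1/N)^(t-1) (1/N) and the exponential form (1/N) e^(-t/N); the function names and the chosen values of N and t are illustrative and not part of the original derivation.

```python
import math

def geometric_pmf(t, N):
    # Exact probability: no coalescence for t-1 generations, then one.
    return (1 - 1 / N) ** (t - 1) * (1 / N)

def exponential_pdf(t, N):
    # Continuous approximation with rate lambda = 1/N.
    return (1 / N) * math.exp(-t / N)

N = 1000
for t in (1, 100, 1000, 3000):
    exact = geometric_pmf(t, N)
    approx = exponential_pdf(t, N)
    # Print t, both values and the relative error of the approximation.
    print(t, exact, approx, abs(exact - approx) / exact)
```

For a population this large the two expressions should agree to within a fraction of a percent over the whole range.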
An exponential distribution with rate λ has an expected value of 1/λ, which here is simply N.
This can be shown easily:
$$E[t] = \int_0^\infty t \, \lambda e^{-\lambda t} \, dt$$
Using integration by parts, we get
$$E[t] = \left[ -t e^{-\lambda t} \right]_0^\infty + \int_0^\infty e^{-\lambda t} \, dt = \frac{1}{\lambda}$$
Thus, for our probability distribution
$$P(t) \approx \frac{1}{N} e^{-t/N}$$
we have an expected value of N. This means that we expect to go back t = N generations before the
lineages of two sampled bacteria fuse into one (one coalescence event). For example, in a population of
10 bacteria, the expected number of generations is 10.
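The expected value can also be checked by simulation. This is a minimal sketch under the assumptions of the simple model (discrete generations, constant size N, each lineage choosing its parent uniformly at random); the function names, the number of repetitions and the seed are illustrative choices.

```python
import random

def generations_to_coalescence(N, rng):
    """Trace two lineages back until they pick the same parent."""
    t = 0
    while True:
        t += 1
        # Each lineage picks its parent uniformly among the N individuals.
        if rng.randrange(N) == rng.randrange(N):
            return t

def mean_coalescence_time(N, reps, seed=1):
    """Average waiting time over many independent pairs of lineages."""
    rng = random.Random(seed)
    return sum(generations_to_coalescence(N, rng) for _ in range(reps)) / reps

# The average is expected to be near N, here 10 generations.
print(mean_coalescence_time(N=10, reps=20000))
```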
Building a more generalized model
Now we will try to solve for the probability distribution of a coalescence event when we trace back 3
lineages (3 bacteria) and subsequently generalize it to tracing n bacterial lineages in a population of N.
All the previous assumptions hold.
For 2 bacteria, as discussed, the probability that they have distinct parents is (N-1)/N.
That is, once bacterium 1 has a parent, bacterium 2, choosing at random, has a probability of (N-1)/N of
not sharing that parent. If we now add a 3rd bacterium, it must avoid the parents of both the 1st and
the 2nd bacterium, so it has a probability of (N-2)/N of not choosing the same parent as either of the
other two.
Thus the probability of all 3 bacteria having distinct parents in the previous generation is
$$\frac{N-1}{N} \cdot \frac{N-2}{N} = \left(1 - \frac{1}{N}\right)\left(1 - \frac{2}{N}\right)$$
Thus the probability that n bacteria each have a unique parent is given by
$$D(n) = \prod_{i=1}^{n-1} \left(1 - \frac{i}{N}\right) = 1 - \binom{n}{2} \frac{1}{N} + F\left(\frac{1}{N^{2}}\right)$$
We ignore the higher-order term: since N is large, we expect the function F of 1/N^2 to be small.
This approximation holds when the number of lineages n is much smaller than N, which is usually the
case: a handful of sampled alleles is negligible next to a human population of about 6 billion. In other
words, we only trace the lineages of an insignificant number of bacteria out of the total population.
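To see how little is lost by dropping the higher-order term, the exact product can be compared with the first-order approximation. This is a minimal sketch; the function names and the sample values of n and N are illustrative.

```python
from math import comb, prod

def distinct_parents_exact(n, N):
    """Probability that n lineages all pick different parents."""
    return prod(1 - i / N for i in range(1, n))

def distinct_parents_approx(n, N):
    """First-order approximation, dropping terms of order 1/N**2."""
    return 1 - comb(n, 2) / N

N = 10000
for n in (2, 3, 5, 10):
    print(n, distinct_parents_exact(n, N), distinct_parents_approx(n, N))
```

For n = 2 the two expressions coincide, and the gap grows only slowly as n increases while N stays large.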
With this approximation, the probability distribution of a coalescence t generations back is given by
$$P(t) = D(n)^{t-1} \left(1 - D(n)\right)$$
where $D(n)^{t-1}$ is the probability of no coalescence in t-1 consecutive generations and 1-D(n) is the
probability of a coalescence in the next one.
This distribution thus gives us the probability that the first coalescence occurs t generations back
when we trace n lineages.
It has the same form as the two-lineage distribution, and by the same argument as above it is
approximated by an exponential distribution:
$$P(t) \approx \binom{n}{2} \frac{1}{N} \, e^{-\binom{n}{2} t / N}$$
The mean of this distribution is
$$\frac{1}{\lambda} = \frac{2N}{n(n-1)}$$
For two bacteria this reduces to the earlier result, a mean time of N for one coalescence. For three
bacteria the mean time for one coalescence is N/3. This means that if we trace 3 lineages we need to go
back only a third as many generations to see one fusion as we do with 2 lineages.
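The dependence of the mean waiting time on the number of lineages can be checked by simulation. This is a minimal sketch, assuming the same idealized population as before; the function names, repetition count and seed are illustrative. The simulation uses the exact per-generation probabilities, so small deviations from N divided by the number of lineage pairs are expected.

```python
import random
from math import comb

def time_to_first_coalescence(n, N, rng):
    """Generations back until at least two of n lineages share a parent."""
    t = 0
    while True:
        t += 1
        parents = [rng.randrange(N) for _ in range(n)]
        # Fewer distinct parents than lineages means a coalescence occurred.
        if len(set(parents)) < n:
            return t

def mean_time(n, N, reps, seed=0):
    rng = random.Random(seed)
    return sum(time_to_first_coalescence(n, N, rng) for _ in range(reps)) / reps

N = 200
for n in (2, 3, 4):
    # Simulated mean next to the predicted mean N / C(n, 2).
    print(n, mean_time(n, N, 5000), N / comb(n, 2))
```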
The shorter waiting time can be seen intuitively if we rewrite the mean time for one coalescence a
little:
$$\frac{2N}{n(n-1)} = \frac{N}{\binom{n}{2}}$$
When we choose 3 bacteria, and therefore 3 lineages, there are $\binom{3}{2} = 3$ distinct ways these
lineages can fuse, whereas for 2 bacteria there is only one. So the probability of one coalescence will
be $\binom{n}{2}$ times higher for n bacterial lineages. Thus, the mean time for one coalescence will
reduce by a factor of $\binom{n}{2}$.
The model described above is known as the standard coalescent, or Kingman's n-coalescent.
Strictly speaking, Kingman's model is a little different because it applies to alleles. In the diploid
form, every individual carries 2 alleles. When we trace these alleles back, the variation that arises in
their lineages is solely due to genetic drift (mutation is ignored).
Thus, in the formula derived above, we replace N with 2N, meaning 2 alleles for every individual in a
population of size N:
$$P(t) \approx \binom{n}{2} \frac{1}{2N} \, e^{-\binom{n}{2} t / (2N)}$$
with the mean time being
$$\frac{2N}{\binom{n}{2}} = \frac{4N}{n(n-1)}$$
Analysis of our model
Though the model seems limited, as it does not account for migration, mutation, selection and so on,
it is sufficiently powerful to give us the overall topology of the genealogies. In addition, factors
such as mutation, selection and migration can be added to the model later on.
Our model shows that as the number of lineages being traced increases, the time to the next coalescence
decreases.
We also saw that as we decrease the size of the population, the coalescence time also decreases.
However, when we look at our exponential distribution, we see that its standard deviation is very large;
for an exponential distribution it equals the mean. For example, in a population of N = 1000, 90% of
pairwise coalescence events take place between about 50 and 3000 generations back. The upper limit is
about 60 times larger than the lower limit. This is a very large variation.
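The range of roughly 50 to 3000 generations can be checked from the quantiles of an exponential distribution with mean N. This is a minimal sketch, assuming the central 90% of the distribution is meant (the 5% and 95% quantiles); the function name is illustrative.

```python
import math

def exponential_quantile(p, mean):
    """Time by which a fraction p of coalescence events has occurred."""
    return -mean * math.log(1 - p)

N = 1000  # mean pairwise coalescence time in generations
low = exponential_quantile(0.05, N)
high = exponential_quantile(0.95, N)
# Lower bound, upper bound and their ratio.
print(low, high, high / low)
```

The bounds come out near 51 and 2996 generations, which matches the rounded figures in the text.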
When we generate genealogy trees stochastically with a simulator, keeping the same probability
distribution and the same inputs, we unsurprisingly get very different trees.
Figure 3: The large variations in stochastically produced trees using the same probability distribution and
same entries.
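This variation between trees can be reproduced with a few lines of code. The sketch below is an illustration rather than the simulator used for the figure: it assumes the exponential approximation, draws each waiting time with mean N divided by the number of lineage pairs, and reports only the total height of each tree. The function name, sample size and seed are assumptions.

```python
import random
from math import comb

def tree_height(n, N, rng):
    """Sum of waiting times while n lineages coalesce down to one."""
    height = 0.0
    for k in range(n, 1, -1):
        # With k lineages the waiting time is exponential, mean N / C(k, 2).
        height += rng.expovariate(comb(k, 2) / N)
    return height

rng = random.Random(42)
heights = [tree_height(5, 1000, rng) for _ in range(10)]
# Ten trees drawn from the same distribution with the same inputs.
print([round(h) for h in heights])
```

Even with identical inputs, the heights of individual trees are expected to differ by large factors from one run to the next.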
Modifying the model
To bring this model closer to reality, we now accommodate mutations. Mutations are added to the model
stochastically. Let us say that the mutation rate is known to be µ (per generation per individual).
Then a lineage that is L generations long will have undergone approximately µL mutations. So we add
mutations to our lineages stochastically, with a constant probability µ per generation. Take the figure
below as an example.
Figure 4: The coalescent model with mutations accommodated.
The different colors correspond to different mutations that have been added in stochastically using a
simulator.
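Adding mutations to a single lineage can be sketched as follows. This is a minimal illustration, assuming one independent chance µ of a mutation in every generation of the lineage; the function name and the values of L and µ are hypothetical.

```python
import random

def mutations_on_lineage(L, mu, rng):
    """Count mutations on a lineage that is L generations long."""
    # Each generation independently carries a mutation with probability mu.
    return sum(1 for _ in range(L) if rng.random() < mu)

rng = random.Random(7)
L, mu = 2000, 0.001
counts = [mutations_on_lineage(L, mu, rng) for _ in range(1000)]
# The average count is expected to be near mu * L.
print(sum(counts) / len(counts))
```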
We can also observe that, because of the mutations, our alleles now show genetic variation.
If the length of the lineage and the mutation rate are both known, one can calculate the genetic
diversity.
In practice, the coalescent model is usually used the other way around: one examines the genetic
diversity of the current population and works backwards to calculate the mutation rate of a specific
gene.
This is how it is done.
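We first estimate the total length of the genealogy. We do this by summing each coalescent interval
T(i), the time during which there are i lineages, over the i lineages that share this interval:
$$T_{total} = \sum_{i=2}^{n} i \, T(i)$$
Over this total length of the genealogy, we expect the number of mutations S to be the mutation rate
times the number of generations:
$$E[S] = \mu \, T_{total}$$
So now we can substitute the total length into the expected number of mutations to get:
$$E[S] = \mu \sum_{i=2}^{n} i \, T(i)$$
Now we can assume that the length of each coalescent interval is the mean interval that we calculated
above, T(i) = 2N/(i(i-1)):
$$E[S] = \mu \sum_{i=2}^{n} i \, \frac{2N}{i(i-1)} = 2N\mu \sum_{i=1}^{n-1} \frac{1}{i}$$
The factor 2Nµ, estimated from the observed number of mutations as $S / \sum_{i=1}^{n-1} (1/i)$, is
known as Watterson's estimate of genetic diversity (Watterson's theta) and is usually denoted θ.
Since a population of N individuals carries 2N alleles, we can adapt this formula to alleles by
substituting 2N for N, which gives θ = 4Nµ.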
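The calculation can be sketched in a few lines. This assumes that each coalescent interval takes its mean length, so that the expected number of mutations among n lineages is 2Nµ times the sum of 1/i for i from 1 to n-1; the function names and example values are illustrative.

```python
def harmonic(n):
    """Sum of 1/i for i = 1 .. n-1."""
    return sum(1 / i for i in range(1, n))

def expected_mutations(n, N, mu):
    """Expected number of mutations on the genealogy of n lineages."""
    return 2 * N * mu * harmonic(n)

def watterson_theta(S, n):
    """Estimate theta = 2 N mu from S observed mutations among n lineages."""
    return S / harmonic(n)

print(expected_mutations(n=10, N=1000, mu=1e-4))
print(watterson_theta(S=12, n=10))
```

Dividing the estimated theta by 2N (or by 4N when counting alleles) then gives an estimate of the mutation rate.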
Also, one can see that if one can estimate the total number of alleles (more specifically the number of
lineages) and the total number of mutations present among them then one can estimate the mutation
rate for that specific gene.
If you observe figure 4 again, you can see that by looking at different mutations present in current
alleles (different colors), you can approximately guess the number of mutations.
This is obviously not a very accurate way of determining the mutation rate but it does give a lower limit.
It is entirely possible that certain mutations that took place were either fatal or highly selected against
and so do not exist in the present population anymore. These events would mean that the mutation rate
is higher than the one we will estimate.
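As an illustration, suppose we have a constant population size of 10 and 2 lineages (1 coalescent event),
and we observe 4 differences in a base-pair comparison of two individuals. With n = 2 the expected
number of mutations is 2Nµ, so the estimated mutation rate is µ ≈ 4/(2 × 10) = 0.2 per generation. If we
substitute 2N for N to count alleles, the estimate becomes µ ≈ 4/(4 × 10) = 0.1 per generation.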
Relaxing the constant population size N
Real populations either tend to grow, like the human population in recent years, or tend to shrink,
like certain endangered species. This implies that the total number of alleles also grows or shrinks,
which has obvious implications for our coalescent model.
We can accommodate this by assuming that N is not a constant but a function of the number of
generations (time). Usually it is written as an exponential function, so that the population increases
or decreases exponentially. Other population functions, such as the logistic model, can also be
accommodated, but they tend to make the calculations very difficult and must be solved numerically by a
computer.
Since it is outside the scope of this paper, I will not show these modified models, but to a first
approximation one can replace the N in the equations above with N(t).
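A generation-by-generation simulation shows what replacing N with N(t) does to a genealogy. The sketch below is an illustration under stated assumptions, not one of the modified models from the literature: lineages choose parents uniformly among round(N(t)) individuals, the growing population is modeled as shrinking exponentially when traced backward, and the growth rate of 0.01, the function names and the seed are all hypothetical.

```python
import math
import random

def first_and_height(n, pop_size, rng):
    """Return (generations to first coalescence, generations to the last)."""
    k, t, first = n, 0, None
    while k > 1:
        t += 1
        N_t = max(1, round(pop_size(t)))
        # Lineages that pick the same parent merge into one.
        k_new = len({rng.randrange(N_t) for _ in range(k)})
        if k_new < k and first is None:
            first = t
        k = k_new
    return first, t

def mean_fraction(pop_size, reps=300, n=5, seed=3):
    """Average share of the tree height that passes before the first merge."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        first, height = first_and_height(n, pop_size, rng)
        total += first / height
    return total / reps

N0 = 1000
constant = lambda t: N0
growing = lambda t: N0 * math.exp(-0.01 * t)  # smaller in the past

print(mean_fraction(constant), mean_fraction(growing))
```

Under these assumptions the first coalescence takes up a larger share of the tree in the growing population, which corresponds to longer external branches.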
The changing population size has a substantial impact on the genealogy tree as can be seen in the figure
below.
Figure 5: Genealogy trees with coalescent trees with varying population sizes.
It is immediately clear that, relative to the depth of the tree, one has to go farther back in time to
witness the first coalescent events in an expanding population than in a population that was constant.
This is because an expanding population is large in recent generations, so the probability that two
lineages share a parent there is small. The opposite happens in a declining population.
Another interesting thing to note is that in an expanding population there are long external branches.
This means that most mutations fall on external branches and appear as singletons, each carried by a
single sampled individual. In a shrinking population, by contrast, the internal branches are long, so
more mutations fall on them and are shared by several individuals in the population.
This coalescent model can be further modified to include other effects such as:
Mutations (shown above)
Fluctuating populations (explained above)
Migration
Recombination
Sexes
Selection
Meta-populations (extinction/re-colonization)
To find out how these are accommodated, please refer to the references and further readings section
below.
In general, adding these would not change the number of coalescent events (nodes), but it would modify
the time to each coalescent event (tree length).
Conclusion
Coalescent theory may be guilty of oversimplification, but it forms the foundation for this type of
analysis. It has large practical as well as historical significance, although one criticism of the
model is the high standard deviation associated with its predictions.
With major modifications, people have used this theory to estimate how many years one has to go back in
time to find mitochondrial Eve (the most recent common ancestor of all living humans in the maternal
line).
The theory has also been used, again with major modifications, to construct phylogenetic trees (the
separation of ancestor species to give rise to new species); please refer to Crandall and Templeton's
paper.
Overall, it is an important theory to know and forms a core part of population genetics.
References and further readings
C.R. Young. Coalescent Theory. Lecture.
<bio.classes.ucsc.edu/bio107/Class%20pdfs/W05_lecture14.pdf>
Magnus Nordborg. Coalescent Theory. Department of Genetics, Lund University.
<www-cse.ucsd.edu/classes/sp05/cse291-a/doc/nordborg_coalescent.pdf>
Peter Beerli. Population Genetics: Coalescent Theory I.
<https://people.scs.fsu.edu/~beerli/BSC-5936/10-31-05/lecture_16.pdf>
Keith Crandall and Alan Templeton. Empirical Tests of Some Predictions From Coalescent Theory
With Applications to Intraspecific Phylogeny Reconstruction. Genetics Society of America:
1993. <www.genetics.org/cgi/content/abstract/134/3/959>
Noah Rosenberg and Magnus Nordborg. Genealogical Trees, Coalescent Theory and the Analysis of
Genetic Polymorphisms. Nature Publishing Group: 2002.
<rosenberglab.bioinformatics.med.umich.edu/papers/coalnrg.pdf>
John Wakeley. Coalescent Theory: An Introduction. Harvard University. Roberts and Company
Publishers: 2008.
I would advise reading further on coalescent theory. To keep this paper focused, I discussed only the
model of Kingman, who was the first to formulate this theory. The coalescent can also be studied under
other well-known population models, such as the Wright-Fisher model, the Moran model and the Cannings
model. These models are a bit different, offer new approaches and perspectives, and have their own
strengths and weaknesses.