S. chilense
S. peruvianum
Winter Semester 2012-2013
FG Populationsgenetik
Forces shaping molecular diversity: random genomic processes (mutation, duplication, recombination, gene conversion), natural selection, random spatial processes (migration), random demographic processes (drift)
We can calculate many aspects of a genealogical (coalescent) tree for a population of size 2 N
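Time to MRCA: $E[T_{MRCA}] = 4N(1 - 1/n)$
Length of a tree: $E[L] \approx 4N \log(n-1)$
Time of coalescence of the last two lineages: $E[T_2] = 2N$
[Figure: coalescent tree with expected interval lengths 2N, 2N/3, 2N/6 and 2N/10 while 2, 3, 4 and 5 lineages remain]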
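The expectations above can be checked numerically. The sketch below is only an illustration: the function names are made up here, and it assumes the standard result that the expected time during which k lineages remain is 2N divided by the number of pairs k(k-1)/2. It uses the exact harmonic sum for the tree length rather than the logarithmic approximation.

```python
from fractions import Fraction

def expected_interval(k, N):
    """Expected number of generations during which k lineages remain
    (population of 2N gene copies): 2N / (k choose 2)."""
    return 2 * N * Fraction(2, k * (k - 1))

def expected_tmrca(n, N):
    """Sum of all intervals from n lineages down to 2: equals 4N(1 - 1/n)."""
    return sum(expected_interval(k, N) for k in range(2, n + 1))

def expected_tree_length(n, N):
    """Each interval with k lineages contributes k branches of that length."""
    return sum(k * expected_interval(k, N) for k in range(2, n + 1))

N = 1000
for k in range(2, 6):
    print(k, expected_interval(k, N))   # 2N, 2N/3, 2N/6, 2N/10 with N = 1000
print(expected_tmrca(5, N))             # 3200 = 4N(1 - 1/5)
print(expected_tree_length(5, N))       # 25000/3
```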
The real physical population is likely not to behave as in the Wright – Fisher model
Most populations show some kind of structure:
Geographic proximity of individuals,
Social constraints…
The variance of the number of descendants may be > 1, the value expected under the Poisson distribution
Effective population size = size of a Wright – Fisher population that would produce the same rate of genetic drift as the population of interest
One consequence of drift: do two randomly picked offspring individuals have a common ancestor in the parent generation?
We will use here the inbreeding effective size = Ne
Also called the identity-by-descent population size
$N_e = \frac{1}{2 \, P[T_2 = 1]}$
where $T_2$ = time until two lineages coalesce, given in generations
This depends on the immediate previous generation!
An extension is:
$N_e^{(t)} = E[T_2] / 2$
This relates to the number of generations until a MRCA is found in the population
For the haploid Wright – Fisher model
$N_e = \frac{1}{2 \, P[T_2 = 1]}$
with $P[T_2 = 1] = \frac{1}{2N}$
so that $N_e = N$
The extension is:
$N_e^{(t)} = E[T_2] / 2$
with $E[T_2] = 2N$
so that $N_e^{(t)} = N$
For the Wright – Fisher model the two definitions agree
Diploid model with different numbers of males and females
$N_f$ = number of females
$N_m$ = number of males
$N_f + N_m = N$
$P[T_2 = 1] = \left(1 - \frac{1}{2N}\right) \frac{N}{8 N_f N_m}$
$N_e^{(t)} = N_e = \frac{4 N_f N_m}{N_f + N_m}$
For example: when some men have a harem, $N_f = 20$ and $N_m = 1$
What is Ne?
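The harem example can be worked out directly from the formula for unequal numbers of males and females. This is a minimal sketch; the function name is chosen here for illustration only.

```python
def effective_size_two_sexes(n_females, n_males):
    """Ne = 4 * Nf * Nm / (Nf + Nm) for a diploid population with two sexes."""
    return 4 * n_females * n_males / (n_females + n_males)

# Harem example: 20 females, 1 male, census size N = 21
print(round(effective_size_two_sexes(20, 1), 2))   # 3.81

# With equal numbers of each sex, Ne equals the census size
print(effective_size_two_sexes(10, 10))            # 20.0
```

Ne is dominated by the rarer sex: with a single male, Ne stays below 4 however many females there are.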
Example based on the human population: many human genes have an MRCA less than 200,000 years ago
If one generation = 20 years, this is less than 10,000 generations
Since E[T_MRCA] ≈ 4Ne generations, we need 4Ne < 200,000 / 20
Ne < 200,000 / (4*20) => Ne < 2,500!
Of course N is bigger in the human population, but Ne may be very small ☺
We will see how to estimate Ne from sequence data later on
The distribution of mutations amongst individuals can be summarized as a tree (on a genealogy)
“Coalescent theory” John Wakeley, 2009
How to add mutation on a coalescent tree?
In a Wright Fisher model: see drawing
Probability that an offspring changes its genotype by mutation = µ
And P[no mutation] = 1 − µ
This means, for example, for a two-allele model with alleles A and a: mutation goes from a to A, and vice versa
The classical model for DNA sequences is the so-called infinite site model
Definition: each new mutation hits a new site in the genome
So it cannot be masked by back mutation
Not affected by recurrent mutation
Every mutation is visible except if lost by drift
There are other models of sequence evolution, but these will not be used for now.
Infinite allele model
Definition: each mutation creates a new allele
Example on a tree
Finite site model
Definition: mutations fall on a finite number of sites
Example on a tree
Back to adding mutations on a coalescent tree, with mutation probability µ per generation and P[no mutation] = 1 − µ
Do you see where this is going?
After t generations, what is the probability that there was no mutation?
$P[X > t] = (1 - \mu)^t \approx e^{-\mu t}$, where X is the waiting time (in generations) until the first mutation
So we can again draw the time until a new mutation occurs from an exponential distribution
And put this on a tree, drawing for each branch the time to a new mutation
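To see how close the exponential approximation is, the following sketch compares the exact probability of no mutation after t generations with its exponential form, and draws one waiting time. The function names and the value of µ used in the example are illustrative assumptions, not values from the lecture.

```python
import math
import random

def prob_no_mutation_exact(mu, t):
    """Probability of no mutation in t generations: (1 - mu)^t."""
    return (1 - mu) ** t

def prob_no_mutation_approx(mu, t):
    """Exponential approximation: exp(-mu * t), good when mu is small."""
    return math.exp(-mu * t)

def draw_time_to_mutation(mu, rng):
    """Waiting time until the next mutation, exponential with rate mu."""
    return rng.expovariate(mu)

mu, t = 1e-5, 1000          # illustrative values only
print(prob_no_mutation_exact(mu, t), prob_no_mutation_approx(mu, t))

rng = random.Random(42)     # fixed seed so the draw is reproducible
print(draw_time_to_mutation(mu, rng))
```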
The mutation will be visible in all descendants from that branch
[Figure: coalescent tree with 4 sites, ancestral sequence AAAA, sampled sequences AAAA, TTAA, TTTT. After one more mutation there are 5 sites: ancestral sequence AAAAA, sampled sequences AAAAG, TTAAA, TTTTA]
For neutral mutations we can do this process without changing the shape of the tree or the size of the tree
Tree topology = shape and branching of the tree
Branch lengths = length of branches usually in units of 2 N generations
BECAUSE
Forward in time: a neutral mutation does not change the offspring distribution of an individual
Backward in time: mutation does not change the probability to be picked as a parent
Definitions : external branches and internal branches
We classify mutations (SNPs) by their frequency in the sample
[Figure: tree of four sequences 1 to 4 carrying two mutations, a and b]
Mutation a is found in two sequences = doubleton
Mutation b is found in one sequence = singleton
We are now interested in the number of mutations on each branch of the tree
For a branch of length l
The number of mutations follows a Poisson distribution with parameter (l µ)
So for the total tree: Poisson (L µ)
Remember $E[L] = 4N \sum_{i=1}^{n-1} \frac{1}{i}$
So we define S as the total number of mutations on a tree (on a set of sequences)
$E[S] = \mu \, E[L] = \frac{\theta}{4N} \cdot 4N \sum_{i=1}^{n-1} \frac{1}{i} = \theta \sum_{i=1}^{n-1} \frac{1}{i}$
With $\theta = 4 N_e \mu$
This is the crucial parameter: combines mutation and Ne
θ is called the population mutation rate or scaled mutation rate
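The relation between θ and the expected number of segregating sites, and the Poisson distribution of mutations on a branch, can be sketched as follows. The function names and the example values of θ and n are assumptions made for illustration.

```python
import math
from fractions import Fraction

def harmonic(n):
    """Sum of 1/i for i = 1 .. n-1, the factor that scales theta to E[S]."""
    return sum(Fraction(1, i) for i in range(1, n))

def expected_segregating_sites(theta, n):
    """E[S] = theta * sum_{i=1}^{n-1} 1/i for a sample of n sequences."""
    return theta * harmonic(n)

def poisson_pmf(k, lam):
    """Probability of exactly k mutations on a branch with parameter lam = l * mu."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

print(harmonic(5))                         # 25/12
print(expected_segregating_sites(12, 5))   # 25
print(poisson_pmf(0, 0.5))                 # probability of a branch with no mutation
```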
We can estimate θ based on sequence data
Two estimators have been derived:
$\theta_\pi$ derived by Tajima (1983)
$\theta_S$ (or $\theta_W$) derived by Watterson (1975)
$\theta_S = \theta_W$ is based on the number of segregating sites S in a tree, compared to the average branch length of a sample of size n. It is defined as
$\theta_S = \frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}$
remember: $E[L] = 4N \sum_{i=1}^{n-1} \frac{1}{i}$
This is the expected average number of segregating sites per given length of tree branch
$\theta_\pi$ is defined as the average number of differences over all pairs of sequences in a sample
Based on $\pi_{ij}$, which is the number of differences between two sequences i and j
Defined as $\theta_\pi = \frac{1}{\binom{n}{2}} \sum_{i<j} \pi_{ij} = \frac{2}{n(n-1)} \sum_{i<j} \pi_{ij}$
Because there are n(n-1)/2 pairs of sequences
So take all sequences, count the number of differences for every pair,
And then take the average
[Figure: the tree of four sequences 1 to 4 with mutations a and b]
Different mutations count differently
Mutation a is counted in four pairwise comparisons
Mutation b is counted in three comparisons
$\pi_{ij}$, and thus $\theta_\pi$, depends on how many mutations fall on internal or external branches
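The counts four and three follow from a simple rule: a mutation carried by k of the n sequences shows up in every pair made of one carrier and one non-carrier, so in k(n − k) comparisons. A small sketch, with illustrative function names:

```python
from fractions import Fraction

def pairs_showing_mutation(k, n):
    """Number of pairwise comparisons in which a mutation carried by
    k of n sequences appears as a difference."""
    return k * (n - k)

def contribution_to_theta_pi(k, n):
    """Share of that mutation in the average over all n(n-1)/2 pairs."""
    return Fraction(2 * pairs_showing_mutation(k, n), n * (n - 1))

print(pairs_showing_mutation(2, 4))    # doubleton in a sample of 4: 4 pairs
print(pairs_showing_mutation(1, 4))    # singleton in a sample of 4: 3 pairs
print(contribution_to_theta_pi(2, 4))  # 2/3
print(contribution_to_theta_pi(1, 4))  # 1/2
```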
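Example of calculation
4 sites, ancestral sequence AAAA
Sample of n = 3 sequences: ATAA, TAAT, TATA
$\theta_S = \frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}} = \frac{4}{1 + \frac{1}{2}} = \frac{8}{3}$
$\theta_\pi = \frac{2}{n(n-1)} \sum_{i<j} \pi_{ij} = \frac{3 + 3 + 2}{3} = \frac{8}{3}$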
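The same calculation can be done by a short script. This is a sketch under the infinite site assumption (every variable column counts as one segregating site); the function names are chosen here and it expects aligned sequences of equal length.

```python
from fractions import Fraction
from itertools import combinations

def segregating_sites(seqs):
    """Number of alignment columns with more than one state."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def watterson_theta(seqs):
    """theta_S = S / sum_{i=1}^{n-1} 1/i."""
    n = len(seqs)
    a_n = sum(Fraction(1, i) for i in range(1, n))
    return segregating_sites(seqs) / a_n

def pairwise_differences(s1, s2):
    """pi_ij: number of sites at which two sequences differ."""
    return sum(1 for x, y in zip(s1, s2) if x != y)

def tajima_theta(seqs):
    """theta_pi = average of pi_ij over all n(n-1)/2 pairs."""
    n = len(seqs)
    total = sum(pairwise_differences(a, b) for a, b in combinations(seqs, 2))
    return Fraction(2 * total, n * (n - 1))

sample = ["ATAA", "TAAT", "TATA"]
print(segregating_sites(sample))   # 4
print(watterson_theta(sample))     # 8/3
print(tajima_theta(sample))        # 8/3
```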
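Where E[L] comes from, for n = 5: while k lineages remain the tree has k branches, and the expected interval lengths are 1, 1/3, 1/6 and 1/10 (in units of 2N generations) for k = 2, 3, 4 and 5
$E[L] = 2N \left(2 \cdot 1 + 3 \cdot \frac{1}{3} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{10}\right) = 2N \left(2 + 1 + \frac{2}{3} + \frac{1}{2}\right) = 2N \cdot 2 \left(1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4}\right) = 4N \sum_{i=1}^{n-1} \frac{1}{i}$
$\theta = 4 N_e \mu$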
Very important result:
$\theta_S = \theta_\pi$ (in expectation)
if the population follows a neutral coalescent model with constant population size!
It is possible to estimate Ne based on the two estimators
IF and only IF you have independent data on the mutation rate
$N_e = \frac{\theta_\pi}{4\mu} = \frac{\theta_S}{4\mu}$
This assumes:
Infinite site model
Constant Ne over time
Homogeneous population (equal coalescent probability for all pairs)
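A minimal sketch of the estimation step, assuming θ has been expressed per site so that it matches a per-site mutation rate. The function name and the θ value in the example are hypothetical and only show the arithmetic.

```python
def estimate_ne(theta_per_site, mu_per_site):
    """Ne = theta / (4 * mu); theta and mu must use the same unit
    (here: per site and per generation)."""
    return theta_per_site / (4 * mu_per_site)

# Hypothetical theta of 0.001 per site with mu = 1.2e-8 per site per generation
print(round(estimate_ne(0.001, 1.2e-8)))   # 20833
```

If a program reports θ for the whole locus, it needs to be divided by the number of sites before using a per-site mutation rate.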
Exercise: calculate $\theta_\pi$, $\theta_S$ and estimate Ne
For two datasets:
In human populations: TNFSF-5-Humans.fas
In Drosophila populations: 055-Droso.nex
Define populations in DnaSP using: data => define sequence sets
Then => Polymorphism analysis
For Drosophila: Europe and Africa
Mutation rate in humans = 1.2 * 10^-8 per base per generation (Scally and Durbin, Nat Rev Genetics, October 2012)
Mutation rate in Drosophila = 10^-8 per base per generation
What are the differences?
Definition: Heterozygosity H is the probability that two alleles taken at random from a population are different at a random site or locus.
It is a key measure of diversity in populations
If $H_0$ is the heterozygosity at generation 0, then at generation 1, assuming no new mutations:
$H_1 = \frac{1}{2N_e} \cdot 0 + \left(1 - \frac{1}{2N_e}\right) H_0$
First term: with probability 1/(2Ne) the two alleles have the same parent at generation 0, and then the probability that they are different is 0
Second term: with probability 1 − 1/(2Ne) the two alleles have different parents, and these parents have probability $H_0$ (by definition) to be different
By iteration we get at generation t
$H_t = \left(1 - \frac{1}{2N_e}\right)^t H_0$
This means that in the absence of mutation, heterozygosity is lost at a rate of 1/(2Ne) every generation
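The loss of heterozygosity can be followed generation by generation and compared with the closed formula. A minimal sketch, with function names and example values chosen for illustration:

```python
def heterozygosity_closed_form(h0, ne, t):
    """H_t = (1 - 1/(2 Ne))^t * H_0, no mutation."""
    return (1 - 1 / (2 * ne)) ** t * h0

def heterozygosity_iterated(h0, ne, t):
    """Apply the one-generation recursion t times."""
    h = h0
    for _ in range(t):
        # same parent: contributes 0; different parents: contributes h
        h = (1 / (2 * ne)) * 0 + (1 - 1 / (2 * ne)) * h
    return h

h0, ne = 0.5, 100   # illustrative values
for t in (0, 50, 200, 1000):
    print(t, heterozygosity_closed_form(h0, ne, t))
```

Small populations lose diversity much faster: with Ne = 100 the factor per generation is 0.995, with Ne = 10,000 it is 0.99995.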
With the infinite allele model assumption that every new mutation creates a new allele:
Two opposing mechanisms drive the evolution of diversity in a population: genetic drift and mutation
If they have the same strength and balance each other = mutation-drift balance
The change in heterozygosity between two generations is:
$\Delta H = H_{t+1} - H_t = -\frac{1}{2N_e} H_t + 2\mu (1 - H_t)$
First term: change of heterozygosity due to random drift (always negative)
Second term: change of heterozygosity due to new mutations (always positive)
At equilibrium the value of heterozygosity is Ĥ:
$\Delta H = 0 \Rightarrow \hat{H} = \frac{4 N_e \mu}{1 + 4 N_e \mu} = \frac{\theta}{1 + \theta}$
The value at equilibrium increases with increasing µ and Ne
WHY?
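The balance between drift and mutation can be explored numerically by iterating the change in heterozygosity until it stops changing. This is a sketch with illustrative function names and parameter values; it only applies the recursion for ΔH and compares the result with θ / (1 + θ).

```python
def delta_h(h, ne, mu):
    """One-generation change: loss by drift plus gain by new mutations."""
    return -h / (2 * ne) + 2 * mu * (1 - h)

def equilibrium_heterozygosity(ne, mu):
    """H-hat = theta / (1 + theta) with theta = 4 * Ne * mu."""
    theta = 4 * ne * mu
    return theta / (1 + theta)

def iterate_heterozygosity(h0, ne, mu, generations):
    """Apply delta_h repeatedly, starting from heterozygosity h0."""
    h = h0
    for _ in range(generations):
        h += delta_h(h, ne, mu)
    return h

ne, mu = 100, 0.001   # illustrative values, theta = 0.4
print(equilibrium_heterozygosity(ne, mu))
print(iterate_heterozygosity(0.0, ne, mu, 5000))   # from no diversity
print(iterate_heterozygosity(1.0, ne, mu, 5000))   # from maximal diversity
```

Both starting points end at the same value. Raising µ adds more new alleles per generation and raising Ne weakens drift, so either change moves the equilibrium up.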
In the case of such a model, we are interested in:
The probability that a new mutation gets fixed
How long it takes to get fixed
Using a coalescent argument: eventually the whole population descends from a single ancestral gene copy, so the mutation gets fixed if and only if the mutant is that ancestor; this probability = 1/(2N)
The expected time to fixation is equal to the expected time to the MRCA, so it is = 4N generations
What do we expect for selected loci?
Substitution rate = rate at which mutations get fixed in a population/species
It is called k
A new mutation starts with frequency 1/(2N) in a population
The substitution rate is obtained by multiplying the number of new mutations per generation in a population = 2Nµ
by the probability that one mutation gets fixed = 1/(2N)
So k = 2Nµ * (1/(2N)) = µ (Kimura)
Most striking result: k does not depend on the effective population size
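The fixation probability 1/(2N) can be checked with a small forward simulation of a neutral Wright – Fisher population. This is a sketch under simple assumptions (2N gene copies, binomial resampling each generation, one new mutant copy at the start); the function names, population size, number of replicates and seed are all chosen here for illustration.

```python
import random

def mutation_fixes(two_n, rng):
    """Follow one new neutral mutation (1 copy among two_n) until it is
    lost (0 copies) or fixed (two_n copies). Returns True if fixed."""
    count = 1
    while 0 < count < two_n:
        p = count / two_n
        # each of the two_n offspring copies picks a mutant parent with probability p
        count = sum(1 for _ in range(two_n) if rng.random() < p)
    return count == two_n

def estimate_fixation_probability(two_n, replicates, seed=1):
    """Fraction of replicate populations in which the mutation fixes."""
    rng = random.Random(seed)
    fixed = sum(mutation_fixes(two_n, rng) for _ in range(replicates))
    return fixed / replicates

def substitution_rate(two_n, mu):
    """k = (2N * mu new mutations per generation) * (1/(2N) fixation probability)."""
    return two_n * mu * (1 / two_n)

print(estimate_fixation_probability(10, 4000))   # close to 1/10
print(substitution_rate(10, 1e-8), substitution_rate(10**6, 1e-8))
```

The estimate fluctuates around 1/(2N) from seed to seed, while the substitution rate equals µ for any population size, up to rounding.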