Lecture 3: Effective population size, mutations, heterozygosity

Evolutionary Genetics: Part 3

Coalescent 2 – Effective Population size

S. chilense

S. peruvianum

Winter Semester 2012-2013

Prof Aurélien Tellier

FG Populationsgenetik

Color code:

Red = Important result or definition

Purple: exercise to do

Green: some bits of maths

Population genetics: 4 evolutionary forces

Molecular diversity is shaped by:

Random genomic processes (mutation, duplication, recombination, gene conversion)

Natural selection

Random spatial process (migration)

Random demographic process (drift)

Effective population size

The coalescent

We can calculate many aspects of a genealogical (coalescent) tree for a population of size 2N

Time to MRCA: $E[T_{MRCA}] = 4N (1 - 1/n)$

Length of a tree: $E[L] \approx 4N \log(n - 1)$

Time of coalescence of the last two lineages: $E[T_2] = 2N$

Expected time with 2, 3, 4 and 5 lineages (figure labels): 2N, 2N/3, 2N/6, 2N/10
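As a quick numerical check of the expectations above, the following Python sketch sums the expected waiting times $2N / \binom{k}{2}$ over the intervals with k lineages. The function names and the values N = 1000 and n = 5 are illustrative assumptions, not part of the lecture.

```python
from math import comb

def expected_coalescence_time(k, N):
    """Expected number of generations spent with k lineages (population of 2N gene copies)."""
    return 2 * N / comb(k, 2)

def expected_tmrca(n, N):
    """Sum of the expected interval durations from n lineages down to 1."""
    return sum(expected_coalescence_time(k, N) for k in range(2, n + 1))

def expected_tree_length(n, N):
    """Each interval with k lineages contributes k branches of that duration."""
    return sum(k * expected_coalescence_time(k, N) for k in range(2, n + 1))

N, n = 1000, 5  # hypothetical values chosen for illustration
print([expected_coalescence_time(k, N) for k in range(2, n + 1)])
print(expected_tmrca(n, N), 4 * N * (1 - 1 / n))
print(expected_tree_length(n, N), 4 * N * sum(1 / i for i in range(1, n)))
```

The two numbers on each of the last two lines should agree, since the closed forms are just these sums written compactly.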

Definition

The real physical population is likely not to behave as in the Wright – Fisher model

Most populations show some kind of structure:

Geographic proximity of individuals,

Social constraints…

The variance in the number of descendants may be > 1, the value expected under the Poisson distribution

Effective population size = size of a Wright – Fisher population that would produce the same rate of genetic drift as the population of interest

One consequence of drift: do two randomly picked offspring individuals have a common ancestor in the parent generation?

Definition

We will use here the inbreeding effective size, $N_e$

Also called identity by descent population size

$N_e = 1 / (2\,P[T_2 = 1])$

where $T_2$, given in generations, is the time until two lineages coalesce

This depends on the immediate previous generation!

An extension is:

$N_e(t) = E[T_2] / 2$

This relates to the number of generations until a MRCA is found in the population

Definition

For the haploid Wright – Fisher model

$N_e = 1 / (2\,P[T_2 = 1])$

with $P[T_2 = 1] = 1 / (2N)$

so that $N_e = N$

The extension is:

$N_e(t) = E[T_2] / 2$

with $E[T_2] = 2N$

so that $N_e(t) = N$

For the Wright – Fisher model the two definitions agree

Calculating Ne

Diploid model with different numbers of males and females

$N_f$ = number of females

$N_m$ = number of males

$N_f + N_m = N$

$P[T_2 = 1] = (1 - 1/(2N)) \cdot N / (8 N_f N_m)$

$N_e(t) = N_e = 4 N_f N_m / (N_f + N_m)$

For example: when some men have a harem, $N_f = 20$ and $N_m = 1$

What is Ne ?
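The formula for unequal sex ratios is easy to explore numerically. The sketch below is a minimal helper (the function name and the example numbers are hypothetical); the harem case from the slide is left for you to plug in.

```python
def ne_sex_ratio(n_females, n_males):
    """Inbreeding effective size for a diploid population with separate sexes:
    Ne = 4 * Nf * Nm / (Nf + Nm)."""
    return 4 * n_females * n_males / (n_females + n_males)

# Equal sex ratio: Ne equals the census size N = Nf + Nm
print(ne_sex_ratio(50, 50))

# Skewed sex ratio (hypothetical numbers): Ne drops well below N = 100
print(ne_sex_ratio(90, 10))
```

With an equal sex ratio the function returns N itself, and any skew lowers Ne because the rarer sex limits the number of distinct parents.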

Calculating N and Ne

Example based on human population: many human genes have an MRCA less than 200,000 years ago

If one generation = 20 years, this is 200,000 / 20 = 10,000 generations

Since $E[T_{MRCA}] \approx 4 N_e$ generations, we need $4 N_e < 10{,}000$

$N_e < 200{,}000 / (4 \times 20) \Rightarrow N_e < 2{,}500$ !

Of course N is bigger in the human population, but Ne may be very small ☺

We will see how to estimate Ne from sequence data later on

The coalescent – 2 role of mutations

Coalescent tree + mutations

The distribution of mutations amongst individuals can be summarized as a tree (on a genealogy)

“Coalescent theory” John Wakeley, 2009

How to add mutation on a coalescent tree?

In a Wright – Fisher model: see drawing

µ = probability of mutation, i.e. the probability that an offspring changes its genotype

And P[no mutation] = 1 − µ

This means for example, for a two-allele model with alleles A and a: mutation changes a to A, and vice versa

The classical model for DNA sequences is the so-called infinite site model

Definition: each new mutation hits a new site in the genome

So it cannot be masked by back mutation

Not affected by recurrent mutation

Every mutation is visible except if lost by drift

Models of mutation

There are other models of sequence evolution, but these will not be used for now.

Infinite allele model

Definition: each mutation creates a new allele

Example on a tree

Finite site model

Definition: mutations fall on a finite number of sites

Example on a tree

Coalescent tree + mutations

How to add mutation on a coalescent tree?

µ = probability of mutation, i.e. the probability that an offspring changes its genotype

And P[no mutation] = 1 − µ

Do you see where this is going?

After t generations, what is the probability that there were no mutations?

$P[X > t] = (1 - \mu)^t \approx e^{-\mu t}$, where X is the number of generations until the first mutation

So, as for coalescence times, we can draw the time until a new mutation occurs from an exponential distribution

And put this on a tree, drawing for each branch the time to new mutation
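To see how good the exponential approximation is, the following sketch compares the exact geometric probability of no mutation in t generations with $e^{-\mu t}$. The mutation probability used here is a hypothetical, deliberately large value so that the numbers are easy to read.

```python
from math import exp

def prob_no_mutation(mu, t):
    """Return (exact, approximate) probability of no mutation in t generations."""
    exact = (1 - mu) ** t      # t independent generations, each without mutation
    approx = exp(-mu * t)      # exponential approximation, valid for small mu
    return exact, approx

mu = 1e-3  # hypothetical per-generation mutation probability
for t in (10, 100, 1000, 5000):
    exact, approx = prob_no_mutation(mu, t)
    print(t, round(exact, 5), round(approx, 5))
```

The smaller µ is, the closer the two columns get, which is why mutation times on a branch can be drawn from an exponential distribution.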

Coalescent tree + mutations

How to add mutation on a coalescent tree?

The mutation will be visible in all descendants from that branch

4 sites (figure): root sequence AAAA; sampled sequences AAAA, TTAA, TTTT

One more mutation

5 sites (figure): root sequence AAAAA; sampled sequences change from AAAA, TTAA, TTTT to AAAAG, TTAAA, TTTTA

Mutations on a tree

For neutral mutations we can do this process without changing the shape of the tree or the size of the tree

Tree topology = shape and branching of the tree

Branch lengths = length of branches usually in units of 2 N generations

BECAUSE

Forward in time: a neutral mutation does not change the offspring distribution of an individual

Backward in time: mutation does not change the probability to be picked as a parent

Tree topology

Definitions : external branches and internal branches

Tree topology and mutation

We classify mutations (SNPs) by their frequency in the sample

(figure: tree with four sequences 1, 2, 3, 4 and two mutations, a and b)

Mutation a is found in two sequences = doubleton

Mutation b is found in one sequence = singleton

Mutations on a tree

We are now interested in the number of mutations on each branch of the tree

For a branch of length $\ell$

The number of mutations follows a Poisson distribution with parameter $\ell \mu$

So for the total tree: Poisson($L \mu$)

Remember $E[L] = 4N \sum_{i=1}^{n-1} \frac{1}{i}$

So we define S as the total number of mutations on a tree (on a set of sequences)

$E[S] = \mu\,E[L] = 4N\mu \sum_{i=1}^{n-1} \frac{1}{i} = \theta \sum_{i=1}^{n-1} \frac{1}{i}$

With $\theta = 4 N_e \mu$
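The expectation of S can be checked by simulation. The sketch below is a minimal illustration under stated assumptions: time is measured in units of 2N generations, so that the interval with k lineages is exponential with rate $\binom{k}{2}$ and mutations fall on the tree as a Poisson process with rate θ/2 per unit of branch length. The function names, the seed and the values n = 5, θ = 2 are hypothetical.

```python
import random
from math import comb

def simulate_segregating_sites(n, theta, rng):
    """Draw one value of S for a sample of n sequences under the neutral coalescent."""
    total_length = 0.0
    for k in range(n, 1, -1):
        t_k = rng.expovariate(comb(k, 2))  # duration of the interval with k lineages
        total_length += k * t_k            # k branches are present during that interval
    # Count Poisson events along the total branch length via exponential gaps
    rate = theta / 2
    s = 0
    position = rng.expovariate(rate)
    while position < total_length:
        s += 1
        position += rng.expovariate(rate)
    return s

def expected_segregating_sites(n, theta):
    """E[S] = theta * sum_{i=1}^{n-1} 1/i."""
    return theta * sum(1 / i for i in range(1, n))

rng = random.Random(1)
n, theta = 5, 2.0  # hypothetical values
replicates = 20_000
mean_s = sum(simulate_segregating_sites(n, theta, rng) for _ in range(replicates)) / replicates
expected_s = expected_segregating_sites(n, theta)
print(f"simulated mean S = {mean_s:.3f}, expected = {expected_s:.3f}")
```

The simulated mean should be close to the analytical value, while individual trees vary a lot around it.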

The population mutation rate

$\theta = 4 N_e \mu$

This is the crucial parameter: combines mutation and Ne

θ is called the population mutation rate or scaled mutation rate

We can estimate θ based on sequence data

Two estimators have been derived:

$\theta_\pi$, derived by Tajima (1983)

$\theta_S$ (or $\theta_W$), derived by Watterson (1975)

Watterson estimator

$\theta_S = \theta_W$ is based on the number of segregating sites S in a sample, compared to the average tree length for a sample of size n. It is defined as

$\theta_S = \dfrac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}$

remember: $E[L] = 4N \sum_{i=1}^{n-1} \frac{1}{i}$

This is the expected average number of segregating sites per given length of tree branch
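A minimal sketch of the Watterson estimator for a set of aligned sequences is shown below. The function names are hypothetical, and the input is assumed to be a list of equal-length strings; the sample is the set of three four-site sequences ATAA, TAAT, TATA used in the lecture's worked example.

```python
def segregating_sites(sequences):
    """Count alignment columns in which at least two sequences differ."""
    return sum(1 for column in zip(*sequences) if len(set(column)) > 1)

def watterson_theta(sequences):
    """theta_S = S / sum_{i=1}^{n-1} 1/i, for n aligned sequences."""
    n = len(sequences)
    a_n = sum(1 / i for i in range(1, n))
    return segregating_sites(sequences) / a_n

sample = ["ATAA", "TAAT", "TATA"]
print(segregating_sites(sample))  # all 4 sites are segregating
print(watterson_theta(sample))    # 4 / (1 + 1/2), about 2.667
```

This value is for the whole locus; dividing by the number of sites gives a per-site estimate.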

Tajima estimator

$\theta_\pi$ is defined as the average number of differences over all pairs of sequences in a sample

Based on $\pi_{ij}$, which is the number of differences between two sequences i and j

Defined as

$\theta_\pi = \dfrac{1}{\binom{n}{2}} \sum_{i<j} \pi_{ij} = \dfrac{2}{n(n-1)} \sum_{i<j} \pi_{ij}$

Because there are n(n−1)/2 pairs of sequences

So take all sequences, and count the number of differences for every pair,

And then take the average

Tajima estimator

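Based on $\pi_{ij}$, which is the number of differences between two sequences i and j

(figure: tree with four sequences 1, 2, 3, 4 and two mutations, a and b)

Different mutations count differently

Mutation a (doubleton) is counted in four pairwise comparisons: 2 carriers × 2 non-carriers

Mutation b (singleton) is counted in three comparisons: 1 carrier × 3 non-carriers

$\pi_{ij}$, and thus $\theta_\pi$, depend on how many mutations fall on internal or external branches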
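The counting argument above can be reproduced with a few lines of Python. This is a sketch under assumptions: sequences are encoded as strings of "0" (ancestral) and "1" (derived), site 0 stands for mutation a and site 1 for mutation b, and which sequences carry each mutation is a hypothetical choice, since the figure is not available here.

```python
from itertools import combinations

def pairwise_differences(seq_a, seq_b):
    """Number of sites at which two aligned sequences differ (pi_ij)."""
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)

def tajima_theta(sequences):
    """Average number of differences over all n(n-1)/2 unordered pairs."""
    pairs = list(combinations(sequences, 2))
    return sum(pairwise_differences(a, b) for a, b in pairs) / len(pairs)

def pairs_separated_by_site(sequences, site):
    """Number of pairs that differ at one site: carriers times non-carriers."""
    derived = sum(1 for seq in sequences if seq[site] == "1")
    return derived * (len(sequences) - derived)

# Hypothetical encoding: mutation a (site 0) is a doubleton, b (site 1) a singleton
sample = ["10", "10", "01", "00"]
print(pairs_separated_by_site(sample, 0))  # 4 comparisons for the doubleton
print(pairs_separated_by_site(sample, 1))  # 3 comparisons for the singleton
print(tajima_theta(sample))                # (4 + 3) / 6 pairs
```

Mutations at intermediate frequency therefore weigh more in $\theta_\pi$ than rare ones, whereas every segregating site counts once in $\theta_S$.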

Coalescent tree + mutations

Example of calculation

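4 sites, n = 3 sequences (figure): root sequence AAAA; sampled sequences ATAA, TAAT, TATA

All four sites are segregating (S = 4), and the three pairwise differences are 3, 3 and 2 (sum = 8)

$\theta_S = \dfrac{S}{\sum_{i=1}^{n-1} \frac{1}{i}} = \dfrac{4}{1 + \frac{1}{2}} = \dfrac{8}{3}$

$\theta_\pi = \dfrac{2}{n(n-1)} \sum_{i<j} \pi_{ij} = \dfrac{2}{3 \times 2} \times 8 = \dfrac{8}{3}$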

Watterson estimator

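$\theta_S = \theta_W$ is based on the number of segregating sites S in a sample, compared to the average tree length for a sample of size n. It is defined as

$\theta_S = \dfrac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}$

remember: $E[L] = 4N \sum_{i=1}^{n-1} \frac{1}{i}$

For n = 5, each interval contributes (number of lineages) × (expected duration in units of 2N generations): 2 × 1, 3 × 1/3, 4 × 1/6, 5 × 1/10

$E[L] = 2N \left(2 + 1 + \frac{2}{3} + \frac{1}{2}\right) = 2N \cdot 2 \left(1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4}\right) = 4N \sum_{i=1}^{n-1} \frac{1}{i}$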

Neutral model of coalescent

$\theta = 4 N_e \mu$

Very important result:

$\theta_S = \theta_\pi$ (in expectation)

If the population follows a neutral model of coalescent with constant population size!

Estimating Ne

It is possible to estimate Ne based on the two estimators

IF and only IF you have independent data on the mutation rate

$N_e = \theta_\pi / (4\mu) = \theta_S / (4\mu)$

This assumes:

Infinite site model

Constant Ne over time

Homogeneous population (equal coalescent probability for all pairs)
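The conversion from an estimate of θ to Ne is a one-line calculation once an independent mutation rate is available. In the sketch below, the function name and the per-site θ value are hypothetical; θ and µ must be expressed on the same scale (here both per base).

```python
def estimate_ne(theta_per_site, mu_per_site):
    """Ne = theta / (4 * mu), with theta and mu both given per base."""
    return theta_per_site / (4 * mu_per_site)

theta_per_site = 0.001   # hypothetical estimate (theta_pi or theta_S divided by the number of sites)
mu_per_site = 1.2e-8     # per base per generation
print(round(estimate_ne(theta_per_site, mu_per_site)))
```

If θ was computed for a whole locus, divide it by the number of sites analysed before using a per-base mutation rate.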

Estimating Ne

Exercise: calculate $\theta_\pi$, $\theta_S$ and estimate Ne

For two datasets:

In human populations: TNFSF-5-Humans.fas

In Drosophila populations: 055-Droso.nex

Define populations in DnaSP using: data => define sequence sets

Then => Polymorphism analysis

For Drosophila: Europe and Africa

Mutation rate in humans = $1.2 \times 10^{-8}$ per base per generation (Scally and Durbin, Nat Rev Genetics, October 2012)

Mutation rate in Drosophila = $10^{-8}$ per base per generation

What are the differences?

Heterozygosity

Definition: Heterozygosity H is the probability that two alleles taken at random from a population are different at a random site or locus.

It is a key measure of diversity in populations

If $H_0$ is the heterozygosity at generation 0, then at generation 1, assuming no new mutations:

$H_1 = \dfrac{1}{2N_e} \cdot 0 + \left(1 - \dfrac{1}{2N_e}\right) H_0$

First term: with probability $1/(2N_e)$ the two alleles have the same parent at generation 0, and then the probability that they are different is 0

Second term: with probability $1 - 1/(2N_e)$ the two alleles have different parents, and these parents have probability $H_0$ (by definition) to be different

Heterozygosity

By iteration we get at generation t

$H_t = \left(1 - \dfrac{1}{2N_e}\right)^t H_0$

This means that in the absence of mutation, heterozygosity is lost at a rate of $1/(2N_e)$ every generation
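The decay of heterozygosity can be illustrated numerically. The sketch below applies the one-generation recursion repeatedly and compares it with the closed form; the function names and the values Ne = 100 and $H_0$ = 0.5 are hypothetical.

```python
def heterozygosity_closed_form(t, ne, h0):
    """H_t = (1 - 1/(2 Ne))^t * H_0, without mutation."""
    return (1 - 1 / (2 * ne)) ** t * h0

def heterozygosity_iterated(t, ne, h0):
    """Apply H_{t+1} = (1 - 1/(2 Ne)) * H_t for t generations."""
    h = h0
    for _ in range(t):
        h = (1 - 1 / (2 * ne)) * h
    return h

ne, h0 = 100, 0.5  # hypothetical values
for t in (0, 50, 139, 500):
    print(t, round(heterozygosity_closed_form(t, ne, h0), 4),
          round(heterozygosity_iterated(t, ne, h0), 4))
```

Under these assumptions about half of the initial heterozygosity is lost after roughly $2 N_e \ln 2$ generations (about 139 generations for Ne = 100).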

Heterozygosity + mutation

With the infinite allele model assumption that every new mutation creates a new allele:

Two contrary mechanisms drive the evolution of diversity in a population: genetic drift and mutation

If they have the same strength and balance each other = mutation-drift balance

The change in heterozygosity between two generations is:

$\Delta H = H_{t+1} - H_t = -\dfrac{1}{2N_e} H_t + 2\mu (1 - H_t)$

Heterozygosity + mutation

$\Delta H = H_{t+1} - H_t = -\dfrac{1}{2N_e} H_t + 2\mu (1 - H_t)$

First term: change of heterozygosity due to random drift (always negative)

Second term: change of heterozygosity due to new mutations (always positive)

At equilibrium the value of heterozygosity is $\hat{H}$:

$\Delta H = 0 \Rightarrow \hat{H} = \dfrac{4 N_e \mu}{1 + 4 N_e \mu}$

$\hat{H} = \theta / (1 + \theta)$

The value at equilibrium increases with increasing µ and Ne

WHY?
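One way to build intuition for the equilibrium is to iterate the recursion for ΔH until it stops changing and compare the result with θ / (1 + θ). The sketch below does this; the function names and the parameter values (Ne = 1000, µ = 10⁻⁴, so θ = 0.4) are hypothetical.

```python
def next_heterozygosity(h, ne, mu):
    """One generation: loss by drift plus gain by new mutations."""
    return h - h / (2 * ne) + 2 * mu * (1 - h)

def equilibrium_heterozygosity(ne, mu):
    """H_hat = theta / (1 + theta), with theta = 4 * Ne * mu."""
    theta = 4 * ne * mu
    return theta / (1 + theta)

ne, mu = 1000, 1e-4  # hypothetical values, theta = 0.4
h = 0.0              # start from a population without variation
for _ in range(50_000):
    h = next_heterozygosity(h, ne, mu)

print(round(h, 6), round(equilibrium_heterozygosity(ne, mu), 6))
# Larger Ne (weaker drift) or larger mu (more input) both raise the equilibrium
print(equilibrium_heterozygosity(2 * ne, mu) > equilibrium_heterozygosity(ne, mu))
```

Starting from H = 0 or from H = 1 leads to the same value, because the drift term pulls H down and the mutation term pushes it up until they cancel.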

Mutation – Drift balance

In the case of such a model, we are interested in:

What is the probability for a new mutation to get fixed?

How long does it take to get fixed?

Using a coalescent argument: the whole population eventually descends from a single ancestral gene copy, and the mutation gets fixed if and only if the mutant is that ancestor; this probability = 1/(2N)

The expected time to fixation is equal to the expected time to the MRCA, so it is = 4N generations

What do we expect for selected loci?

Mutation – Drift balance

Substitution rate = rate at which mutations get fixed in a population/species

It is called k

A new mutation starts with frequency 1/(2N) in a population

The substitution rate is obtained by multiplying the number of new mutations per generation in a population = 2Nµ

by the probability that one mutation gets fixed = 1/(2N)

So $k = 2N\mu \times \dfrac{1}{2N} = \mu$ (Kimura)

Most striking result: k does not depend on the effective population size
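The fixation probability 1/(2N) of a new neutral mutation can be checked with a small forward Wright – Fisher simulation. This is a minimal sketch under stated assumptions: binomial resampling of 2N gene copies each generation, no further mutation, and hypothetical values for the population size, the seed and the mutation rate.

```python
import random

def mutant_fixes(two_n, rng):
    """Follow one new neutral mutant (1 copy among two_n) until loss or fixation."""
    count = 1
    while 0 < count < two_n:
        freq = count / two_n
        # Each of the two_n offspring copies picks a mutant parent with probability freq
        count = sum(1 for _ in range(two_n) if rng.random() < freq)
    return count == two_n

rng = random.Random(42)
two_n = 20            # hypothetical number of gene copies (2N)
replicates = 20_000
fixed = sum(mutant_fixes(two_n, rng) for _ in range(replicates))
fixation_probability = fixed / replicates
print(fixation_probability, 1 / two_n)

mu = 1e-8                          # hypothetical mutation rate per generation
k = two_n * mu * (1 / two_n)       # substitution rate: the 2N terms cancel
print(k)
```

Changing `two_n` changes the simulated fixation probability but leaves k equal to µ, which is the point of Kimura's result.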
