6 Pop gen

Population genetics
Population genetics concerns the study of genetic variation and change within a population. While for evolving
species there is no model for the branching process (speciation), in population genetics there is. This
allows a detailed modelling of the interplay between mutation, selection, and stochastic effects (genetic
drift).
Simplifying assumptions that are initially made include:
- No selection
- No recombination
- No fluctuations in population size
- No population structure (subdivision; migration)
- No assortative mating (individuals mate randomly)
- No interaction between loci (no epistasis; no linkage)
- No environmental effects (e.g. climate or habitat change)
R.A. Fisher
J.B.S. Haldane
Sewall Wright
Motoo Kimura
Kimura’s Neutral Theory
Darwin(ism):
* Something causes minute (phenotype) variations in a population
(ideas: perhaps over-use during lifetime might cause variations (Lamarckism; think giraffes);
perhaps traits are transmitted through blood and blend)
* Natural selection causes adaptive variants to rise in frequency, while non-adaptive ones die out.
Neo-darwinism:
* The “something” is replaced by Mendelian genetics + random mutations
* Panselectionism; adaptationism: most traits are optimal; selection is the main driving force of evolution
(R.A. Fisher; Richard Dawkins; John Maynard Smith)
Population genetics / neutral theory:
* Most mutations are neutral; genetic drift underlies most of evolution
(Fisher; Haldane; Wright; Kimura)
Modern evolutionary synthesis:
* Takes on board (parts of) all of the above.
* Neutral theory relevant for DNA data in populations; considered less relevant for phenotypes.
Wright-Fisher Model
- Constant population size: N diploid individuals = 2N alleles
- Each descendant chooses a parent randomly
- Everyone reproduces simultaneously (no overlapping generations)
http://www.stats.ox.ac.uk/~mcvean/Modelling.pdf
Wright-Fisher Model
Suppose i(t) individuals carry a particular mutation A in
generation t. The probability of any individual in
generation t+1 to be of type A is
x = i(t) / 2N
The number of individuals of type A in generation t+1 is
binomially distributed:
$$P(i(t+1) = k) = \binom{2N}{k} x^k (1-x)^{2N-k}$$
This distribution has mean and variance
E(i(t+1)) = i(t)
Var(i(t+1)) = 2N x (1-x)
The expected number of individuals carrying a
mutation A does not change, but because the sampling
variance accumulates over generations, eventually the
mutation will either be lost (i=0) or reach fixation (i=2N).
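The one-generation sampling step can be checked numerically. The following is a minimal sketch, assuming each of the 2N alleles in generation t+1 independently picks a type-A parent with probability x; the function names and the example values (2N = 100, i = 30) are illustrative and not part of the original slides.

```python
import random

def next_generation(i, two_n, rng):
    """Draw i(t+1) given i(t) = i: each of the 2N alleles picks a parent at random."""
    x = i / two_n  # probability that a given descendant allele is of type A
    return sum(1 for _ in range(two_n) if rng.random() < x)

def sample_moments(i, two_n, replicates, seed=1):
    """Empirical mean and variance of i(t+1) over independent replicates."""
    rng = random.Random(seed)
    draws = [next_generation(i, two_n, rng) for _ in range(replicates)]
    mean = sum(draws) / replicates
    var = sum((d - mean) ** 2 for d in draws) / (replicates - 1)
    return mean, var

two_n, i = 100, 30  # illustrative values only
mean, var = sample_moments(i, two_n, replicates=5000)
x = i / two_n
print(f"simulated mean {mean:.2f}, theory {i}")
print(f"simulated variance {var:.2f}, theory {two_n * x * (1 - x):.2f}")
```

The simulated moments should lie close to i(t) and 2N x (1-x), up to Monte Carlo noise.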
Wright-Fisher Model
Suppose the initial number of copies of the mutant A is i. Since
E( i(t+1) ) = i(t),
the expected number of copies (and hence the expected
frequency) remains constant throughout. However, eventually
A will either be lost or go to fixation. If the probability of
eventual fixation is p, we have
i = E( i(0) ) = E( i(∞) ) = 2N p + 0 (1-p) = 2 N p
The probability p that A will go to fixation is therefore
p = i / 2N
A simpler argument is this: without selection all alleles are equivalent; the one that gets fixed is chosen uniformly
from the present-day population; the probability that this is an A mutant is i / 2N.
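This also means that for neutral sites, the rate ρ of substitution = the rate u of mutation: each generation
2N u new mutations arise, and each has fixation probability 1 / 2N, so ρ = 2N u * (1 / 2N) = u.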
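The result p = i / 2N can be illustrated by simulation. This is a rough sketch under the neutral Wright-Fisher assumptions above; the helper names and the small population (2N = 20, i = 5) are hypothetical choices made to keep the run fast.

```python
import random

def run_until_absorbed(i, two_n, rng):
    """Iterate binomial resampling until A is lost (i = 0) or fixed (i = 2N)."""
    while 0 < i < two_n:
        x = i / two_n
        i = sum(1 for _ in range(two_n) if rng.random() < x)
    return i == two_n  # True if A reached fixation

def fixation_fraction(i, two_n, replicates, seed=1):
    """Fraction of independent replicates in which A goes to fixation."""
    rng = random.Random(seed)
    fixed = sum(run_until_absorbed(i, two_n, rng) for _ in range(replicates))
    return fixed / replicates

two_n, i = 20, 5  # illustrative values only
print(f"simulated fixation fraction {fixation_fraction(i, two_n, 2000):.3f}")
print(f"theory i / 2N = {i / two_n:.3f}")
```

The simulated fraction should agree with i / 2N up to sampling error of roughly one percentage point at this number of replicates.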
Wright-Fisher Model
Since x=i / 2N and
Var(i(t+1)) = 2N x (1-x)
we get
Var ( x ) = x (1-x) / 2N,
in other words, the sampling variance in the allele frequency x is
inversely proportional to the population size. This effect is
called (random) genetic drift.
The Wright-Fisher model is highly idealized; e.g. populations do
vary in size, there is structure, and individuals do not mate
randomly. Therefore, N does not directly relate to the actual
population size. A more accurate way of putting this is to
say that N is the Wright-Fisher population size that
generates the same amount of genetic drift as there is in the
actual population.
To emphasize this, the parameter N is often called the effective
population size (and written Ne).
The coalescent model
[Figure: the whole population under the Wright-Fisher model; the ancestry of the current population; the ancestry of a random sample (the coalescent)]
Kingman’s coalescent
J.F.C. Kingman
Probability that two given lineages coalesce in one
generation:
P(coalescence) = 1/2N
Expected number of generations before coalescence,
i.e. the time to the most recent common
ancestor (MRCA):
E( TMRCA ) = 2N
Probability of coalescence (of 2 lineages) when k
lineages are present = 1-P(no coalescence):
$$1 - \left(1 - \frac{1}{2N}\right)\left(1 - \frac{2}{2N}\right)\cdots\left(1 - \frac{k-1}{2N}\right) \approx \frac{1 + 2 + \dots + (k-1)}{2N} = \frac{1}{2N}\binom{k}{2}$$
Other argument: Coalescence rate per pair is 1/2N;
there are k-choose-2 pairs.
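The pairwise argument extends to a sample of n lineages. The sketch below assumes k is much smaller than 2N, so that the per-generation coalescence probability with k lineages is well approximated by (k choose 2) / 2N and the expected waiting time at each stage is its inverse; the function names are illustrative.

```python
from math import comb

def coalescence_rate(k, two_n):
    """Approximate per-generation coalescence probability with k lineages."""
    return comb(k, 2) / two_n

def expected_tmrca(n, two_n):
    """Expected generations until n sampled lineages reach their MRCA.

    Sums the expected waiting times while k = n, n-1, ..., 2 lineages remain.
    """
    return sum(1 / coalescence_rate(k, two_n) for k in range(2, n + 1))

two_n = 200  # illustrative value only
for n in (2, 5, 10, 50):
    print(n, expected_tmrca(n, two_n))
```

For n = 2 this reproduces E( TMRCA ) = 2N. For larger samples the sum equals 4N (1 - 1/n), so even a very large sample has an expected TMRCA below 4N generations.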
Variation in the population
Suppose the mutation rate is u (per generation, and per locus or site).
The expected number of differences between two individuals (diversity) is
π = 2 * u * E( TMRCA ) = 4 N u
(assuming all mutations are unique). The quantity 4 N u often appears in population genetics, and is usually
treated as an independent parameter, θ.
Real-life populations do not, of course, follow the Wright-Fisher model. The parameter N that makes the W-F
diversity π equal to the observed diversity is called the effective population size, Ne. Other definitions
(based on other aspects of the model) are used as well.
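The diversity-based definition of Ne amounts to inverting the relation diversity = 4 N u. A minimal sketch follows; the function names and the input values are hypothetical illustrations, not measurements taken from the slides.

```python
def expected_diversity(n, u):
    """Expected pairwise diversity under the neutral Wright-Fisher model: 4 N u."""
    return 4 * n * u

def effective_size_from_diversity(diversity, u):
    """The N for which the Wright-Fisher diversity 4 N u equals the observed diversity."""
    return diversity / (4 * u)

# Hypothetical inputs, chosen only to illustrate the arithmetic.
observed_diversity = 0.001   # differences per site between two sequences
mutation_rate = 2.5e-8       # per site, per generation
ne = effective_size_from_diversity(observed_diversity, mutation_rate)
print(f"Ne = {ne:.0f}")
```

Because this Ne is defined through diversity alone, it need not match estimates based on other aspects of the model.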
Allele frequency spectrum
By going to the continuous (diffusion) limit, the equilibrium
distribution of allele frequencies can be derived. This is
called the “allele frequency spectrum”.
Assuming that mutations and back-mutations occur at the
same rate u, the allele frequency spectrum P(x)dx is
$$P(x)\,dx = x^{\theta-1} (1-x)^{\theta-1}\,dx$$
(apart from normalization). Here θ = 4 N u.
[Figure: allele frequency spectrum plotted against allele frequency x (0 to 1)]
Suppose a mutation occurs at frequency x. The probability of sampling two individuals that are different at that
locus is 2 x (1-x). Multiplying with P(x) dx gives the contribution to the heterozygosity π (= probability that
two random alleles differ) per unit of frequency:
$$H(x)\,dx = x^{\theta} (1-x)^{\theta}\,dx$$
(apart from normalization and the constant factor 2).
Since θ is small, every frequency contributes nearly equally to the total heterozygosity π.
Under the influence of selection, the allele frequency spectrum becomes skewed towards the advantageous allele,
and depleted of intermediate-frequency alleles. This is one way to test for selection.
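The claim that every frequency contributes nearly equally can be checked directly from the unnormalized densities. This sketch assumes the symmetric-mutation forms given above, with 4 N u as the exponent parameter (called theta in the code); the value theta = 0.01 is an illustrative assumption.

```python
def spectrum_density(x, theta):
    """Unnormalized allele frequency spectrum: x^(theta-1) * (1-x)^(theta-1)."""
    return (x * (1 - x)) ** (theta - 1)

def heterozygosity_density(x, theta):
    """Unnormalized contribution to heterozygosity: x^theta * (1-x)^theta."""
    return (x * (1 - x)) ** theta

theta = 0.01  # illustrative small value of 4 N u
for x in (0.05, 0.25, 0.5):
    p_rel = spectrum_density(x, theta) / spectrum_density(0.5, theta)
    h_rel = heterozygosity_density(x, theta) / heterozygosity_density(0.5, theta)
    print(f"x={x}: P(x)/P(0.5) = {p_rel:.2f}, H(x)/H(0.5) = {h_rel:.3f}")
```

The spectrum itself is strongly U-shaped (rare alleles dominate), while the heterozygosity density stays within a few percent of its value at x = 0.5.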
Linkage disequilibrium (LD)
Richard Lewontin (1929-)
Relates to 2 polymorphic sites
$$D_{AB} = f_{AB} - f_A f_B = f_{AB} f_{ab} - f_{Ab} f_{aB}$$
$$(D_{AB} = -D_{aB} = -D_{Ab} = D_{ab})$$
Correlation coefficient (Hill & Robertson 1968): $r^2_{AB} = D_{AB}^2 / (f_a f_A f_b f_B)$
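Both LD measures follow directly from the four haplotype frequencies. A minimal sketch, assuming two biallelic sites with alleles A/a and B/b; the function name and the example frequencies are illustrative.

```python
def ld_stats(f_AB, f_Ab, f_aB, f_ab):
    """Return (D, r^2) for two biallelic sites from haplotype frequencies."""
    total = f_AB + f_Ab + f_aB + f_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    f_A = f_AB + f_Ab  # marginal frequency of allele A
    f_B = f_AB + f_aB  # marginal frequency of allele B
    d = f_AB - f_A * f_B
    denom = f_A * (1 - f_A) * f_B * (1 - f_B)
    if denom == 0:
        raise ValueError("r^2 is undefined when a site is not polymorphic")
    return d, d * d / denom

# Perfect association: only haplotypes AB and ab are present.
print(ld_stats(0.5, 0.0, 0.0, 0.5))
# Linkage equilibrium: all four haplotypes equally frequent.
print(ld_stats(0.25, 0.25, 0.25, 0.25))
```

The cross-product form f_AB f_ab - f_Ab f_aB gives the same D, which is a quick consistency check on input data.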
Dynamics of LD
Genetic drift causes reduction in diversity, so that (expected) LD → 0 at equilibrium.
Recombination decreases LD.
Sweep
Effect of selective sweep (rapid increase of frequency of an advantageous allele) on LD:
– Diversity is reduced
– Polymorphisms on selected haplotype are carried along: hitchhiking
– More correlations between sites: many share ancestry
– Result: LD increases
Prior observations
• “Extent of enzyme polymorphism is surprisingly constant between species. So
constant, in fact, that the effective sizes of most species must be within 1 order of
magnitude of each other.”
(Lewontin 1974; Maynard Smith & Haigh 1974)
• Variation is reduced in regions with low recombination
(Aguade 1989; Begun & Aquadro 1992, etc.)
[Figure: a neutral locus linked to a selected locus]
Assumptions:
- Rate of neutral mutations = u
- Rate of advantageous mutations = v
- Selective advantage of adv. mutations = σ
Without linkage to selected locus:
mean sum-of-site heterozygosities (ssh; diversity) = 4 N u
( = mean time to coalescence * 2 lineages * neutral mutation rate)
Additional assumptions:
- Times of fixation at selected locus: Poisson process, rate ρ
- Fixations are fast compared to drift, can be regarded as instantaneous
With linkage to selected locus:
Rate of coalescence due to drift = 1/2N
Rate of fixation of adv. muts. at selected site = ρ
Total coalescence rate: ρ + 1/2N
Average time to coalescence: 1 / (ρ + 1 / 2N)
ssh = 2 u / (ρ + 1 / 2N) = 4 N u / ( 1 + 2 N ρ )
Limit for N → infinity: ssh = 2 u / ρ
Rate of fixation: ρ ≈ v * 2 N σ
(provided 1/2N < σ < 1 )
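The effect of linked fixations on diversity can be tabulated for a range of population sizes. This is a small sketch of the formula ssh = 4 N u / (1 + 2 N ρ) only; the function name and the parameter values are illustrative assumptions.

```python
def ssh_linked(n, u, rho):
    """Diversity at a neutral locus fully linked to a selected locus.

    n: population size N, u: neutral mutation rate, rho: rate of fixations
    at the selected locus. Implements 4 N u / (1 + 2 N rho).
    """
    return 4 * n * u / (1 + 2 * n * rho)

u, rho = 1e-9, 1e-7  # illustrative values only
for n in (1e4, 1e6, 1e8, 1e10):
    print(f"N={n:.0e}: ssh={ssh_linked(n, u, rho):.3e}, no linkage: {4 * n * u:.3e}")
print(f"limit 2u/rho = {2 * u / rho:.3e}")
```

With ρ = 0 the expression reduces to 4 N u, and for 2 N ρ much larger than 1 it levels off at 2 u / ρ regardless of N.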
Fixation due to hitchhiking:
Current frequency of allele A = x
New frequency of allele = z
z = 1    with probability ρ x        (hitchhiking; allele A)
z = 0    with probability ρ (1-x)    (hitchhiking; allele a)
z = x    with probability (1-ρ)      (no hitchhiking)
Δfreq = z - x
E(Δfreq) = 0
Var(Δfreq) = ρ x (1-x)               (infinite population)
Var(Δfreq) = (1/2N) x (1-x)          (finite population; no hitchhiking)
Var(Δfreq) = (ρ + 1/2N) x (1-x)      (finite population + hitchhiking)
Same form as standard W-F model, but with Ne = N / (1 + 2 N ρ)
Now assume some recombination between neutral & selected loci (instead of total
linkage). Suppose allele linked to advantageous mutation rises to frequency y
(rather than frequency 1).
z = y + (1-y)x    with probability ρ x        (hitchhiking; allele A)
z = (1-y)x        with probability ρ (1-x)    (hitchhiking; allele a)
z = x             with probability (1-ρ)      (no hitchhiking)
Δfreq = z - x
E(Δfreq) = 0
Var(Δfreq) = ρ y^2 x (1-x)               (infinite population)
Var(Δfreq) = (1/2N) x (1-x)              (finite population; no hitchhiking)
Var(Δfreq) = (ρ y^2 + 1/2N) x (1-x)      (finite population + hitchhiking)
Same form as standard W-F model, but with Ne = N / (1 + 2 N ρ y^2)
Coalescence rate due to drift = 1/2N
Coalescence rate due to hitchhiking = ρ E( y^2 )
If 2 ρ y^2 > 1/N, “draft” (due to hitchhiking, sweeps) is more important than “drift”
(population size effect).
In the “draft” regime, nucleotide diversity is independent of population size.
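The mean and variance of the frequency change under hitchhiking can be verified by enumerating the three outcomes. This is a sketch under the stated model (sweep probability ρ per generation, linked allele rising to frequency y); the function names and the example values are illustrative.

```python
def hitchhiking_moments(x, rho, y):
    """Exact mean and variance of z - x over the three hitchhiking outcomes."""
    outcomes = [
        (rho * x, y + (1 - y) * x),    # sweep linked to allele A
        (rho * (1 - x), (1 - y) * x),  # sweep linked to allele a
        (1 - rho, x),                  # no sweep this generation
    ]
    mean = sum(p * (z - x) for p, z in outcomes)
    var = sum(p * (z - x) ** 2 for p, z in outcomes) - mean ** 2
    return mean, var

def effective_size(n, rho, y):
    """Ne such that x(1-x)/2Ne equals (rho y^2 + 1/2N) x(1-x)."""
    return n / (1 + 2 * n * rho * y * y)

x, rho, y = 0.3, 0.01, 0.5  # illustrative values only
mean, var = hitchhiking_moments(x, rho, y)
print(f"mean {mean:.2e}, variance {var:.6f}, formula {rho * y * y * x * (1 - x):.6f}")
print(f"Ne for N=1e6: {effective_size(1e6, rho, y):.1f}")
```

Setting y = 1 recovers the fully linked case, and large N makes Ne approach 1 / (2 ρ y^2), which is the sense in which diversity becomes independent of population size.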
Numerical example: Fruitfly
Limit for N → infinity: π = ssh = 2 u / (ρ y^2)
Assume
Neutral mutation rate: u = 10^-9 per generation, per site
Site heterozygosity: π = 0.006
y = 1
This gives a rate of advantageous substitutions ρ ~ 10^-7,
“typical of rate of amino acid substitutions in coding regions”
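The arithmetic behind this example is a one-line inversion of the large-N limit. A minimal sketch, assuming site heterozygosity = 2 u / (ρ y^2) and the values quoted above; the function name is illustrative.

```python
def sweep_rate_from_diversity(diversity, u, y=1.0):
    """Solve diversity = 2 u / (rho y^2) for the sweep rate rho."""
    return 2 * u / (diversity * y * y)

u = 1e-9            # neutral mutation rate per generation, per site
diversity = 0.006   # site heterozygosity
rho = sweep_rate_from_diversity(diversity, u, y=1.0)
print(f"rho = {rho:.2e} per generation")
```

The result is a few times 10^-7, the same order of magnitude as the figure quoted on the slide.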
Questions in (population) genetics
• Effective population size of human population ~10000. Why the huge discrepancy with actual population
size?
• The amount of genetic diversity is “surprisingly constant between species” (Lewontin 1974). Is this (i) not
a problem / not true, (ii) caused by Gillespie’s “genetic draft”, or (iii) caused by something else?
• What is the cause of the variation in recombination rate (including hotspots) across the human genome?
Are the latest measurements accurate?
• Roughly the same 3-5% of mammalian genome is conserved within the mammalian clade. Does this
represent most/all of the functional genome, or is a large fraction functional and fast evolving? What can
population genetics (rather than species comparisons) bring to this question?
• Common (high-frequency) genetic variants associated with common disease are hard to find and usually
explain only a small fraction (~1%) of the variation in susceptibility. Are common diseases often
caused by rare genetic variants instead? If so, how can these be found?
(Not by association studies, but linkage studies are expensive and have low resolution)