Lecture 25: Tests of Neutrality
April 14, 2014
Out of Africa hypothesis
Neanderthal and Denisovan genomes
Introgression into humans
Sequence data and quantification of variation
Infinite sites model
Nucleotide diversity (π)
Sequence-based tests of neutrality
Ewens-Watterson Test
Tajima's D
Hudson-Kreitman-Aguade
Synonymous versus Nonsynonymous substitutions
McDonald-Kreitman
Equilibrium Heterozygosity under IAM
$H_e = \frac{4N_e\mu}{4N_e\mu + 1} = \frac{\theta}{\theta + 1}$
Frequencies of individual alleles are constantly changing
Balance between loss and gain is maintained
$4N_e\mu \gg 1$: mutation predominates; new mutants persist, H is high
$4N_e\mu \ll 1$: drift dominates; new mutants are quickly eliminated, H is low
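As a quick numerical check of the equilibrium formula, the sketch below evaluates H_e = θ/(θ + 1) with θ = 4N_eμ for a few population sizes. The function name and the chosen N_e values are illustrative assumptions; μ = 10^-5 is the value used on the next slide.

```python
def equilibrium_heterozygosity(n_e, mu):
    """Expected heterozygosity under the infinite alleles model: theta / (theta + 1)."""
    theta = 4 * n_e * mu  # population mutation rate
    return theta / (theta + 1)

# Illustrative values only: mu = 1e-5 and a range of effective population sizes.
for n_e in (100, 10_000, 1_000_000):
    print(n_e, round(equilibrium_heterozygosity(n_e, 1e-5), 4))
```

Small populations sit near H = 0 (drift dominates) and very large ones approach H = 1 (mutation dominates).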
Effects of Population Size on Expected Heterozygosity
Under Infinite Alleles Model (μ = 10^-5)
Rapid approach to equilibrium in small populations
Higher heterozygosity with less drift
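The approach to equilibrium can be sketched with the standard recursion for homozygosity under drift and mutation, F_t = (1 - μ)^2 [1/(2N) + (1 - 1/(2N)) F_(t-1)], with H_t = 1 - F_t. This recursion is not shown on the slide and is assumed here; the starting value H_0 = 0, the population sizes, and the number of generations are illustrative.

```python
def heterozygosity_trajectory(n, mu, generations, h0=0.0):
    """Iterate F_t = (1-mu)^2 * (1/(2N) + (1 - 1/(2N)) * F_{t-1}); return the list of H_t = 1 - F_t."""
    f = 1.0 - h0
    trajectory = [h0]
    for _ in range(generations):
        f = (1 - mu) ** 2 * (1 / (2 * n) + (1 - 1 / (2 * n)) * f)
        trajectory.append(1 - f)
    return trajectory

mu = 1e-5
for n in (100, 1_000, 10_000):
    h = heterozygosity_trajectory(n, mu, generations=5_000)
    target = 4 * n * mu / (4 * n * mu + 1)
    # Fraction of the equilibrium value reached after 5,000 generations.
    print(n, round(h[-1] / target, 3))
```

The smallest population reaches its (low) equilibrium within the simulated period, while the largest is still far from its (higher) equilibrium.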
Fate of Alleles in Mutation-Drift Balance
Generations from birth to fixation
Time between fixation events
Time to fixation of a new mutation is much longer than time to loss
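For a sense of scale, the sketch below uses the classical diffusion approximations for a new neutral mutation (Kimura and Ohta 1969). These are not stated on the slide and are assumed here: mean time to fixation of about 4N_e generations (conditional on fixation), mean time to loss of about 2(N_e/N) ln(2N) generations, and an average of 1/μ generations between fixation events. The population size and mutation rate are illustrative.

```python
import math

def neutral_allele_times(n, n_e, mu):
    """Approximate mean times (in generations) for a new neutral mutation."""
    return {
        "fixation": 4 * n_e,                       # conditional on eventual fixation
        "loss": 2 * (n_e / n) * math.log(2 * n),   # conditional on eventual loss
        "between_fixations": 1 / mu,               # neutral substitution rate equals mu
    }

times = neutral_allele_times(n=1_000, n_e=1_000, mu=1e-5)
print(times)
```

With these values, fixation takes thousands of generations while loss takes only a few tens.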
Fate of Alleles in Mutation-Drift-Selection Balance
Purifying Selection
Which case will have the most time?
Neutrality
Balancing Selection/Overdominance
Assume you take a sample of 100 alleles from a large (but finite) population in mutation-drift equilibrium.
What is the expected distribution of allele frequencies in your sample under neutrality and the Infinite Alleles Model?
[Figure: three candidate distributions, panels A, B, and C; x-axis: Number of Observations of Allele]
Hartl and Clark 2007
Black: Predicted from Neutral Theory
White: Observed (hypothetical)
Neutral theory allows a prediction of the frequency distribution of alleles through the process of birth and demise of alleles through time
Comparison of observed to expected distribution provides evidence of departure from Infinite Alleles model
Depends on f, effective population size, and mutation rate
Ewens Sampling Formula
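Population mutation rate, an index of variability of the population: $\theta = 4N_e\mu$
Probability that the next sampled allele is new, given i alleles already sampled: $\frac{\theta}{\theta + i}$
Probability of sampling a new allele on the first sample: $\frac{\theta}{\theta + 0} = 1$
Probability of observing a new allele after sampling one allele: $\frac{\theta}{\theta + 1} = H_e$
Probability of sampling a new allele on the third and fourth samples: $\frac{\theta}{\theta + 2}$ and $\frac{\theta}{\theta + 3}$
Expected number of different alleles (k) in a sample of 2N alleles is:
$E(k) = \sum_{i=0}^{2N-1} \frac{\theta}{\theta + i} = 1 + \frac{\theta}{\theta + 1} + \frac{\theta}{\theta + 2} + \dots + \frac{\theta}{\theta + 2N - 1}$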
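Example: Expected number of alleles in a sample of 4:
$E(k) = \sum_{i=0}^{2N-1} \frac{\theta}{\theta + i} = \sum_{i=0}^{3} \frac{\theta}{\theta + i} = 1 + \frac{\theta}{\theta + 1} + \frac{\theta}{\theta + 2} + \frac{\theta}{\theta + 3}$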
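The sum is easy to evaluate numerically. The sketch below computes E(k) as the sum of θ/(θ + i) for i = 0 to n - 1 in a sample of n alleles; the function name and the θ values are illustrative assumptions.

```python
def expected_num_alleles(theta, n_alleles):
    """Expected number of distinct alleles in a sample of n_alleles (Ewens sampling formula)."""
    return sum(theta / (theta + i) for i in range(n_alleles))

# Sample of 4 alleles: 1 + theta/(theta+1) + theta/(theta+2) + theta/(theta+3)
for theta in (0.1, 1.0, 10.0):
    print(theta, round(expected_num_alleles(theta, 4), 3))
```

For small θ the result stays close to 1, and for large θ it approaches the sample size.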
Ewens Sampling Formula
$E(n) = 1 + \frac{\theta}{\theta + 1} + \frac{\theta}{\theta + 2} + \dots + \frac{\theta}{\theta + 2N - 1}$
where E(n) is the expected number of different alleles in a sample of N diploid individuals, and $\theta = 4N_e\mu$.
Predicts number of different alleles that should be observed in a given sample size if neutrality prevails under Infinite Alleles Model
Small θ, E(n) approaches 1
Large θ, E(n) approaches 2N
θ can be predicted from number of observed alleles for given sample size
Can also predict expected homozygosity ($f_e$) under this model:
$f_e = \frac{1}{4N_e\mu + 1} = \frac{1}{\theta + 1}$
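Because E(n) increases monotonically with θ, θ can be estimated from an observed allele count by solving E(n) = k numerically. The sketch below uses simple bisection, which is an assumption of this example rather than a method named in the lecture, and then reports f_e = 1/(θ + 1). The sample values (15 alleles in 89 chromosomes) are taken from the Ewens-Watterson example later in the lecture.

```python
def expected_num_alleles(theta, n_alleles):
    """E(n): sum of theta / (theta + i) for i = 0 .. n_alleles - 1."""
    return sum(theta / (theta + i) for i in range(n_alleles))

def estimate_theta(k_observed, n_alleles, tol=1e-10):
    """Solve E(n) = k_observed for theta by bisection (requires 1 < k_observed < n_alleles)."""
    lo, hi = 1e-9, 1.0
    while expected_num_alleles(hi, n_alleles) < k_observed:
        hi *= 2  # expand the bracket until it contains the solution
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_num_alleles(mid, n_alleles) < k_observed:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

theta_hat = estimate_theta(k_observed=15, n_alleles=89)
f_e = 1 / (theta_hat + 1)  # expected homozygosity implied by the estimated theta
print(round(theta_hat, 2), round(f_e, 3))
```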
Ewens-Watterson Test
Compares expected homozygosity under the neutral model to expected homozygosity under Hardy-Weinberg equilibrium using observed allele frequencies
Comparison of allele frequency distributions
$f_e$ comes from infinite allele model simulations and can be found in tables for given sample sizes and observed allele numbers
$f_{HW} = \sum p_i^2$
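Observed homozygosity is the sum of squared allele frequencies. A minimal sketch, using made-up allele counts (the function name and the counts are hypothetical, not data from the lecture):

```python
def observed_homozygosity(allele_counts):
    """f_HW = sum of p_i^2, where p_i is the sample frequency of allele i."""
    total = sum(allele_counts)
    return sum((c / total) ** 2 for c in allele_counts)

# Hypothetical counts for four alleles in a sample of 20 chromosomes.
counts = [12, 5, 2, 1]
print(round(observed_homozygosity(counts), 3))  # 0.435
```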
Ewens-Watterson Test Example
Hartl and Clark 2007
Drosophila pseudoobscura collected from winery
Xanthine dehydrogenase alleles
15 alleles observed in 89 chromosomes
$f_{HW} = 0.366$
Generated $f_e$ by simulation: mean 0.168
How would you interpret this result?
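One way to generate a neutral expectation by simulation is sketched below. It draws samples from the infinite alleles model with a Hoppe urn (after i draws, the next draw is a novel allele with probability θ/(θ + i), otherwise a copy of a randomly chosen earlier draw) and keeps only samples with the observed number of alleles. Rejection sampling, the value of θ, the seed, and the replicate count are assumptions of this example, not the procedure behind the published tables. Conditional on the number of alleles, the neutral distribution of allele counts does not depend on θ, so θ only affects how many draws are rejected.

```python
import random

def hoppe_urn_sample(theta, n_alleles, rng):
    """Return allele counts for one neutral sample of n_alleles under the infinite alleles model."""
    labels = []   # allele label of each sampled gene copy
    n_types = 0
    for i in range(n_alleles):
        if rng.random() < theta / (theta + i):
            labels.append(n_types)             # novel allele (always true for i = 0)
            n_types += 1
        else:
            labels.append(rng.choice(labels))  # copy of an earlier draw
    counts = [0] * n_types
    for label in labels:
        counts[label] += 1
    return counts

def simulate_expected_f(k, n_alleles, theta, replicates, seed=1):
    """Mean and list of sum(p_i^2) over neutral samples that contain exactly k alleles."""
    rng = random.Random(seed)
    values = []
    while len(values) < replicates:
        counts = hoppe_urn_sample(theta, n_alleles, rng)
        if len(counts) == k:
            values.append(sum((c / n_alleles) ** 2 for c in counts))
    return sum(values) / len(values), values

mean_f, sims = simulate_expected_f(k=15, n_alleles=89, theta=4.9, replicates=500)
# Share of simulated neutral samples at least as homozygous as the observed 0.366.
p_value = sum(f >= 0.366 for f in sims) / len(sims)
print(round(mean_f, 3), p_value)
```

An observed homozygosity far in the upper tail of the simulated values indicates fewer evenly distributed alleles than neutrality predicts.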
Most Loci Look Neutral According to Ewens-Watterson Test
Hartl and Clark 2007
DNA sequence is ultimate view of standing genetic variation: no hidden alleles
Is this really true?
What about back mutation?
Signatures of past evolution are contained in DNA sequence
Neutral theory presents null model
Departures due to:
Selection
Demographic events
- Bottlenecks, founder effects
- Population admixture
Sequence alignment: necessary first step for comparing sequences within and between species
Many different algorithms
Tradeoff of speed and accuracy
Quantifying Divergence of Sequences
Nucleotide diversity (π) is average number of pairwise differences between sequences
$\pi = \frac{N}{N-1} \sum_{ij} p_i p_j \pi_{ij}$
where N is the number of sequences in the sample, $p_i$ and $p_j$ are the frequencies of sequences i and j in the sample, and $\pi_{ij}$ is the proportion of sites that differ between sequences i and j
[Figure: alignment of three sequences A, B, and C; positions 1 to 35]
A->B, 1 difference
A->C, 1 difference
B->C, 2 differences
$\pi = \frac{N}{N-1} \sum_{ij} p_i p_j \pi_{ij} = \frac{3}{2}\left[(0.33)(0.33)(1/35) + (0.33)(0.33)(1/35) + (0.33)(0.33)(2/35)\right] = 0.01867$
On average, there are 18.67 polymorphisms per kb between pairs of haplotypes in the population
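The arithmetic of this example can be reproduced with a short sketch. It follows the slide's convention of counting each pair of sequences once and uses the rounded frequency 0.33; the function name and the data layout (a dictionary of pairwise difference counts) are assumptions of this example. If the sum is instead read as running over ordered pairs, every term is counted twice and the value doubles.

```python
from itertools import combinations

def nucleotide_diversity(freqs, pairwise_diffs, seq_length):
    """pi = N/(N-1) * sum over pairs of p_i * p_j * pi_ij, with each pair counted once."""
    n = len(freqs)
    total = 0.0
    for a, b in combinations(sorted(freqs), 2):
        pi_ab = pairwise_diffs[(a, b)] / seq_length  # proportion of differing sites
        total += freqs[a] * freqs[b] * pi_ab
    return n / (n - 1) * total

freqs = {"A": 0.33, "B": 0.33, "C": 0.33}
diffs = {("A", "B"): 1, ("A", "C"): 1, ("B", "C"): 2}
pi = nucleotide_diversity(freqs, diffs, seq_length=35)
print(round(pi, 5))  # 0.01867
```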
Tajima's D Statistic
Infinite Sites Model: each new mutation affects a new site in a sequence
$E(\pi) = \frac{\theta}{m}$
where m is length of sequence, and $\theta = 4N_e\mu m$
Expected number of polymorphic sites in all sequences:
$E(S) = a_1\theta$
$a_1 = \sum_{i=1}^{n-1} \frac{1}{i}$
$\theta_S = \frac{S}{a_1}$
where n is number of different sequences compared
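[Figure: alignment of three sequences A, B, and C; positions 1 to 35]
$a_1 = \sum_{i=1}^{n-1} \frac{1}{i} = \frac{1}{1} + \frac{1}{2} = 1.5$
$\pi = 0.01867$
Two polymorphic sites: S = 2
$\theta_S = \frac{S}{a_1} = \frac{2}{1.5} = 1.33$
$\theta_\pi = \pi m = (0.01867)(35) = 0.65$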
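The two estimates in this example can be checked in a few lines. The sketch takes the slide's values as given (n = 3 sequences, S = 2 polymorphic sites, π = 0.01867, m = 35 sites); the function names are illustrative assumptions.

```python
def harmonic_a1(n_sequences):
    """a1 = sum of 1/i for i = 1 .. n-1."""
    return sum(1 / i for i in range(1, n_sequences))

def theta_from_segregating_sites(s, n_sequences):
    """Estimate of theta based on the number of polymorphic sites: S / a1."""
    return s / harmonic_a1(n_sequences)

def theta_from_diversity(pi, seq_length):
    """Estimate of theta based on nucleotide diversity: pi * m."""
    return pi * seq_length

a1 = harmonic_a1(3)
theta_s = theta_from_segregating_sites(2, 3)
theta_pi = theta_from_diversity(0.01867, 35)
print(a1, round(theta_s, 2), round(theta_pi, 2))  # 1.5 1.33 0.65
```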
Tajima's D Statistic
Two different ways of estimating same parameter:
$\theta_\pi = \pi m$
$\theta_S = \frac{S}{a_1}$
Deviation of these two indicates deviation from neutral expectations
$d = \theta_\pi - \theta_S$
$D = \frac{d}{\sqrt{V(d)}}$
where V(d) is variance of d
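A minimal sketch of the full statistic for a set of aligned sequences follows. The lecture does not give V(d); the code assumes Tajima's (1989) variance estimator with its constants a1, a2, b1, b2, c1, c2, e1, e2, and it measures the diversity-based estimate as the mean number of pairwise differences per sequence. The four short sequences are hypothetical. At least four sequences are required, because the estimated variance is zero for smaller samples.

```python
from itertools import combinations
from math import sqrt

def tajimas_d(sequences):
    """Tajima's D for equal-length aligned sequences (no gaps or missing data)."""
    n = len(sequences)
    if n < 4:
        raise ValueError("need at least 4 sequences")
    # Number of polymorphic (segregating) sites.
    s = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    if s == 0:
        raise ValueError("D is undefined when there are no polymorphic sites")
    # Mean number of pairwise differences per sequence (pi * m).
    pairs = list(combinations(sequences, 2))
    theta_pi = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d = theta_pi - s / a1                       # difference between the two estimates
    return d / sqrt(e1 * s + e2 * s * (s - 1))  # scale by the estimated standard deviation

# Hypothetical sample in which every variant is a singleton (excess of rare alleles).
sample = ["AAAA", "TAAA", "ATAA", "AATA"]
print(round(tajimas_d(sample), 2))
```

Because every variant in this sample is rare, the diversity-based estimate falls below the estimate based on polymorphic sites and D is negative.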
Tajima's D Expectations
$d = \theta_\pi - \theta_S$
D=0: Neutrality
D>0
Balancing Selection: Divergence of alleles (π) increases
OR
Bottleneck: S decreases
D<0
Purifying or Positive Selection: Divergence of alleles decreases
OR
Population expansion: Many low frequency alleles cause low average divergence
Balancing selection
[Figure legend: 'balanced' mutation; neutral mutation]
$d = \theta_\pi - \theta_S$
Should increase nucleotide diversity (π)
Decreases polymorphic sites (S) initially.
D>0
Slide adapted from Yoav Gilad
Recent Bottleneck
$d = \theta_\pi - \theta_S$
Rare alleles are lost
Polymorphic sites (S) more severely affected than nucleotide diversity (π)
D>0
Positive Selection and Purifying Selection
[Figure: standard neutral model, sweep, recovery; axis: Time; legend: advantageous mutation, neutral mutation]
$d = \theta_\pi - \theta_S$
Should decrease both nucleotide diversity (π) and polymorphic sites (S) initially.
S recovers due to mutation
π recovers slowly: insensitive to rare alleles
D<0
Slide adapted from Yoav Gilad
Rapid Population Growth will also result in an excess of rare alleles even for neutral loci
[Figure: standard neutral model compared with rapid population size increase]
Most alleles are rare
Nucleotide diversity (π) depressed
Polymorphic sites (S) unchanged or even enhanced: $\theta = 4N_e\mu$ is large
$d = \theta_\pi - \theta_S$
D<0
Often two main haplotypes, some rare alleles
Slide adapted from Yoav Gilad
How do we distinguish these two forms of divergence (selection vs demography)?
Divergence between species should be of same magnitude as variation within species
Provides a correction factor for mutation rates at different sites
Complex goodness of fit test
Perform test for loci under selection and supposedly neutral loci
Hudson-Kreitman-Aguade (HKA) test
                Neutral Locus    Test Locus A
Polymorphism          8                3
Divergence           20                8
Polymorphism: Variation within species
Divergence: Variation between species
8/20 ≈ 3/8
Slide adapted from Yoav Gilad
Hudson-Kreitman-Aguade (HKA) test
                Neutral Locus    Test Locus B
Polymorphism          8                3
Divergence           20               19
8/20 >> 3/19
Conclusion: polymorphism lower than expected in Test Locus B: Selective sweep?
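The comparison on these two slides reduces to polymorphism-to-divergence ratios. The sketch below only reproduces that arithmetic with the counts from the tables; it is not the HKA goodness-of-fit test itself, and the function name is an assumption.

```python
def poly_div_ratio(polymorphism, divergence):
    """Ratio of within-species polymorphism to between-species divergence."""
    return polymorphism / divergence

loci = {
    "Neutral locus": (8, 20),
    "Test locus A": (3, 8),
    "Test locus B": (3, 19),
}
for name, (poly, div) in loci.items():
    print(f"{name}: {poly}/{div} = {poly_div_ratio(poly, div):.3f}")
```

Test locus A has a ratio close to that of the neutral locus, while the ratio for test locus B is less than half as large.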
Slide adapted from Yoav Gilad
HKA Example: Teosinte Branched
[Figure: Teosinte, Maize, and Maize w/TBR mutation; image: http://www.nsf.gov/news/mmg/media/images/corn-and-teosinte_h1.jpg]
Lab exercise: test Teosinte-Branched Gene for signature of purifying selection in maize compared to Teosinte relative
Compare to patterns of polymorphism and diversity in Alcohol Dehydrogenase gene