Inferring human demographic history from DNA sequence data

Apr. 28, 2009
J. Wall
Institute for Human Genetics, UCSF
Standard model of human evolution
(Origin and spread of genus Homo)
• 2 – 2.5 Mya
• 1.6 – 1.8 Mya
• 0.8 – 1.0 Mya
Origin and spread of ‘modern’ humans
• 150 – 200 Kya
• ~ 100 Kya
• 40 – 60 Kya
• 15 – 30 Kya
Estimating demographic parameters
• How can we translate this qualitative scenario into an explicit model?
• How can we choose a model that is both biologically feasible and computationally tractable?
• How do we estimate parameters and quantify uncertainty in parameter estimates?
• Calculating full likelihoods (under realistic
models including recombination) is
computationally infeasible
• So, compromises need to be made if one is
interested in parameter estimation
African populations
10 populations, 229 individuals
61 autosomal loci, ~ 350 Kb sequence data
• Mandenka (Bantu)
• Biaka (Pygmies)
• San (Bushmen)
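A simple model of African population history
[Diagram: an ancestral population splits at time T into two populations, Mandenka and Biaka (or San), with growth parameters g1 and g2 and migration m between them]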
Estimation method
We use a composite-likelihood method (cf. Plagnol and Wall 2006) that uses information from the joint frequency spectrum such as:
• Numbers of segregating sites
• Numbers of shared and fixed differences
• Tajima’s D
• FST
• Fu and Li’s D*
Estimating likelihoods
[Figure: polymorphisms in samples from Pop1 and Pop2, divided into Pop 1 private polymorphisms, Pop 2 private polymorphisms and shared polymorphisms]
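The three classes in the figure, together with fixed differences, can be counted directly from two samples. The following is a minimal sketch, assuming haplotypes are coded as strings of 0/1 alleles; the function name `classify_sites` and the toy data are illustrative and not part of the original analysis.

```python
def classify_sites(pop1, pop2):
    """Count private, shared and fixed-difference sites.

    pop1, pop2: lists of equal-length haplotype strings over '0'/'1',
    one string per sampled chromosome (an assumed encoding).
    """
    counts = {"private1": 0, "private2": 0, "shared": 0, "fixed": 0}
    for site in range(len(pop1[0])):
        alleles1 = {hap[site] for hap in pop1}
        alleles2 = {hap[site] for hap in pop2}
        poly1 = len(alleles1) == 2   # polymorphic within Pop1
        poly2 = len(alleles2) == 2   # polymorphic within Pop2
        if poly1 and poly2:
            counts["shared"] += 1
        elif poly1:
            counts["private1"] += 1
        elif poly2:
            counts["private2"] += 1
        elif alleles1 != alleles2:
            # monomorphic in both samples, but for different alleles
            counts["fixed"] += 1
    return counts


# Toy data: 3 chromosomes per population, 5 sites.
pop1 = ["01000", "11001", "01000"]
pop2 = ["00100", "10010", "00010"]
print(classify_sites(pop1, pop2))
# {'private1': 1, 'private2': 2, 'shared': 1, 'fixed': 1}
```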
Estimating likelihoods
We assume the other summary statistics (Tajima’s D, FST, Fu and Li’s D*) are multivariate normal.
Then, we run simulations to estimate the means and the covariance matrix.
This accounts (in a crude way) for dependencies across different summary statistics.
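As a rough illustration of this step, the sketch below estimates a mean vector and covariance matrix from simulated summary statistics and evaluates the multivariate normal log-density at the observed values. It is a plain-Python sketch under stated assumptions: the function names and the toy "simulations" are hypothetical, and real simulations would come from a coalescent simulator run at one parameter combination.

```python
import math


def mean_and_cov(sims):
    """Sample mean and (unbiased) covariance of rows of summary statistics."""
    n, k = len(sims), len(sims[0])
    mean = [sum(row[j] for row in sims) / n for j in range(k)]
    cov = [[sum((row[a] - mean[a]) * (row[b] - mean[b]) for row in sims) / (n - 1)
            for b in range(k)] for a in range(k)]
    return mean, cov


def cholesky(matrix):
    """Lower-triangular L with L * L^T = matrix (matrix must be positive definite)."""
    k = len(matrix)
    low = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = matrix[i][j] - sum(low[i][t] * low[j][t] for t in range(j))
            if i == j:
                if s <= 0:
                    raise ValueError("covariance matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low


def mvn_loglik(observed, sims):
    """Log-density of `observed` under a normal fitted to the simulated statistics."""
    mean, cov = mean_and_cov(sims)
    low = cholesky(cov)
    diff = [o - m for o, m in zip(observed, mean)]
    # Forward substitution solves L z = diff; then z.z is the Mahalanobis distance.
    z = []
    for i in range(len(diff)):
        z.append((diff[i] - sum(low[i][t] * z[t] for t in range(i))) / low[i][i])
    maha = sum(v * v for v in z)
    logdet = 2.0 * sum(math.log(low[i][i]) for i in range(len(low)))
    return -0.5 * (len(diff) * math.log(2 * math.pi) + logdet + maha)


# Toy example: four simulated replicates of two statistics.
sims = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
print(mvn_loglik([0.0, 0.0], sims))
```

Estimating the covariance from simulations, rather than treating the statistics as independent, is what lets the off-diagonal terms absorb some of the dependence between statistics.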
Composite likelihood
We form a composite likelihood by assuming these two classes of summary statistics are independent of each other.
We estimate the (composite) likelihood over a grid of values of g1, g2, T and M and tabulate the MLE.
We also use standard asymptotic assumptions to estimate confidence intervals.
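The grid search and interval construction can be sketched for a single parameter as follows. This is only an illustration: the two log-likelihood components are toy quadratic functions, not the ones used in the talk, and the cutoff of 1.92 log-likelihood units (half of 3.84, the 95% point of a chi-square with 1 d.f.) is our assumption about what "standard asymptotic assumptions" means here.

```python
def grid_mle(grid, loglik, cutoff=1.92):
    """Return the grid point maximising `loglik` and the range of grid
    points whose log-likelihood lies within `cutoff` of the maximum."""
    scores = [(loglik(x), x) for x in grid]
    best_ll, best_x = max(scores)
    inside = [x for ll, x in scores if best_ll - ll <= cutoff]
    return best_x, (min(inside), max(inside))


def composite_loglik(x):
    """Toy composite log-likelihood: two components treated as independent,
    so their log-likelihoods are simply added."""
    ll_frequency_spectrum = -0.5 * (x - 3.0) ** 2   # hypothetical component 1
    ll_other_statistics = -0.5 * (x - 5.0) ** 2     # hypothetical component 2
    return ll_frequency_spectrum + ll_other_statistics


grid = [i * 0.5 for i in range(17)]   # 0.0, 0.5, ..., 8.0
mle, interval = grid_mle(grid, composite_loglik)
print(mle, interval)   # 4.0 (3.0, 5.0)
```

With four parameters the same idea applies on a four-dimensional grid, at the cost of one set of simulations per grid point.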
Estimates (with 95% CI’s)
Parameter     Man-Bia            Man-San
g1 (000’s)    0 (0 – 3.8)        0 (0 – 3.8)
g2 (000’s)    4 (0 – 7.9)        2 (0 – 11)
T (000’s)     450 (300 – 640)    100 (77 – 550)
M (= 4Nm)     10 (8.4 – 12)      3 (2.2 – 4)
Fit of the null model
How well does the demographic null model fit the
patterns of genetic variation found in the actual
data?
Quite well. The model accurately reproduces both the summary statistics used in the original fitting (e.g., Tajima’s D in each population) and other aspects of the data (e.g., estimates of ρ = 4Nr).
Population growth
[Figure: population size versus time]
Spread of agriculture and animal husbandry?
Ancestral structure in Africa
At face value, these results suggest that
population structure within Africa is old, and
predates the migration of modern humans out of
Africa.
Is there any evidence for additional (unknown)
ancient population structure within Africa?
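Model of ancestral structure
[Diagram: the simple model of African population history (split time T, growth parameters g1 and g2, migration m, populations Mandenka and Biaka (or San)) with an additional archaic human population]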
Standard model of human evolution
Origin and spread of ‘modern’ humans
~ 100 Kya
Admixture mapping
[Figure: modern human DNA and Neandertal DNA shown as colored chunks along chromosomes]
Orange chunks are ~10 – 100 Kb in length
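Genealogy with archaic ancestry
[Figure: gene genealogy drawn against time up to the present, across modern humans and archaic humans]
Genealogy without archaic ancestry
[Figure: gene genealogy drawn against time up to the present, across modern humans and archaic humans]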
Our main questions
• What pattern does archaic ancestry
produce in DNA sequence polymorphism
data (from extant humans)?
• How can we use data to
– estimate the contribution of archaic humans to
the modern gene pool (c)?
– test whether c > 0?
Genealogy with archaic ancestry
(Mutations added)
[Figure: gene genealogy drawn against time up to the present, across modern humans and archaic humans, with mutations placed on the branches]
Patterns in DNA sequence data
Sequence 1    A T C C A C A G C T G
Sequence 2    A G C C A C G G C T G
Sequence 3    T G C G G T A A C C T
Sequence 4    A G C C A C A G C T G
Sequence 5    T G T G G T A A C C T
Sequence 6    A G C C A T A G A T G
Sequence 7    A G C C A T A G A T G
We call the sites in red congruent sites – these are sites
inferred to be on the same branch of an unrooted tree
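One way to make this concrete is to group sites by the split of the sample that they induce. The sketch below assumes that "on the same branch of an unrooted tree" means two biallelic sites divide the sequences into the same two groups. The function name is hypothetical, and the sequences are the seven from the slide.

```python
from collections import defaultdict


def congruent_groups(sequences):
    """Map each split of the sample to the (1-based) sites that induce it.

    A split is represented by the set of sequences (1-based) whose allele
    differs from that of sequence 1, which is the same for either labelling
    of the two alleles and so suits an unrooted tree.
    """
    groups = defaultdict(list)
    for site in range(len(sequences[0])):
        column = [seq[site] for seq in sequences]
        if len(set(column)) != 2:
            continue  # skip monomorphic or multi-allelic sites
        split = frozenset(i + 1 for i, allele in enumerate(column)
                          if allele != column[0])
        groups[split].append(site + 1)
    return dict(groups)


sequences = [
    "ATCCACAGCTG",
    "AGCCACGGCTG",
    "TGCGGTAACCT",
    "AGCCACAGCTG",
    "TGTGGTAACCT",
    "AGCCATAGATG",
    "AGCCATAGATG",
]
groups = congruent_groups(sequences)
largest = max(groups, key=lambda split: len(groups[split]))
print(sorted(largest), groups[largest])
# [3, 5] [1, 4, 5, 8, 10, 11]
```

Under this reading, six of the eleven sites separate sequences 3 and 5 from the rest, which is the pattern a long, deeply diverged branch would produce.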
Linkage disequilibrium (LD)
LD is the nonrandom association of alleles at different sites.
Low LD (high recombination), eight haplotypes at two sites:
AC  AT  AC  AT  GC  GT  GC  GT
High LD (low recombination), eight haplotypes at two sites:
AC  AC  AC  AC  GT  GT  GT  GT
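The two examples can be quantified with r², a standard pairwise measure of LD (the slides do not say which measure, if any, was used, so this choice is ours). A minimal sketch with a hypothetical function name:

```python
def r_squared(haplotypes):
    """r^2 between two biallelic sites, given two-character haplotypes."""
    n = len(haplotypes)
    ref_a, ref_b = haplotypes[0][0], haplotypes[0][1]   # arbitrary reference alleles
    p_a = sum(h[0] == ref_a for h in haplotypes) / n
    p_b = sum(h[1] == ref_b for h in haplotypes) / n
    p_ab = sum(h[0] == ref_a and h[1] == ref_b for h in haplotypes) / n
    d = p_ab - p_a * p_b                                # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denom


low_ld = ["AC", "AT", "AC", "AT", "GC", "GT", "GC", "GT"]
high_ld = ["AC", "AC", "AC", "AC", "GT", "GT", "GT", "GT"]
print(r_squared(low_ld), r_squared(high_ld))   # 0.0 1.0
```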
Measuring ‘congruence’
To measure the level of ‘congruence’ in SNP data from larger regions we define a score function
$$S^* = \max_{I \subseteq \{1, 2, \ldots, n\}} S(I)$$
where
$$S(i_1, \ldots, i_k) = \sum_{j=1}^{k-1} S(i_j, i_{j+1})$$
and $S(i_j, i_{j+1})$ is a function of both congruence (or near congruence) and physical distance between $i_j$ and $i_{j+1}$.
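Because the score of a subset is a sum over consecutive pairs, the maximum over all subsets can be found by dynamic programming over SNPs in positional order, without enumerating subsets. The sketch below takes the pairwise score as an argument. The `toy_score` used in the example is purely hypothetical and is not the pairwise function used in the talk, which the slide does not specify.

```python
def s_star(n_snps, pair_score):
    """Maximise the sum of pair_score(i, j) over consecutive members of an
    ordered subset of SNPs 0..n_snps-1 (a single SNP scores 0)."""
    best_ending_at = [0.0] * n_snps
    for j in range(n_snps):
        for i in range(j):
            candidate = best_ending_at[i] + pair_score(i, j)
            if candidate > best_ending_at[j]:
                best_ending_at[j] = candidate
    return max(best_ending_at, default=0.0)


# Toy inputs: SNP positions in kb and a label for the split each SNP induces.
positions_kb = [1, 4, 9, 15]
splits = ["a", "b", "a", "a"]


def toy_score(i, j):
    """Hypothetical pairwise score: reward congruent pairs, more so when far apart."""
    if splits[i] == splits[j]:
        return 10 + (positions_kb[j] - positions_kb[i])
    return -50


print(s_star(len(positions_kb), toy_score))   # 34.0
```

In the toy example the best subset is SNPs 0, 2 and 3, which skips the incongruent SNP 1. The double loop costs O(n²) pairwise evaluations.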
An example (CHRNA4)
How often is S* from simulations greater than or equal to the S* value from the actual data?
p = 0.025
S* is sensitive to ancient admixture
General approach
We use the model parameters estimated before
(growth rates, migration rate, split time) as a
demographic null model.
Is our null model sufficient to explain the patterns
of LD in the data?
We test this by comparing the observed S* values
with the distribution of S* values calculated from
data simulated under the null model.
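The comparison amounts to an empirical p-value: the fraction of simulated S* values that are at least as large as the observed one. A minimal sketch, with a hypothetical function name and made-up numbers standing in for simulation output:

```python
def empirical_p_value(observed, simulated):
    """Fraction of simulated values greater than or equal to the observed value."""
    if not simulated:
        raise ValueError("need at least one simulated value")
    return sum(value >= observed for value in simulated) / len(simulated)


# Made-up S* values from eight null simulations and one observed value.
simulated_s_star = [12, 30, 18, 25, 40, 22, 15, 28]
print(empirical_p_value(28, simulated_s_star))   # 0.375
```

In practice the number of simulations sets the smallest p-value that can be resolved, so small p-values need many replicates.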
Distribution of p-values (Mandenka and San)
[Histogram: frequency (0 to 0.2) against p-value (0 to 0.95)]
Global p-value: 2.5 × 10^-5
Estimating ancient admixture rates
The global p-values for S* are highly significant in every population that we’ve studied!
If we estimate the ancient admixture rate in our (composite) likelihood framework, we can reject the hypothesis of no ancient admixture for all populations studied.
A region on chromosome 4
19 mutations (from 6 Kb of sequence) separate 3 Biaka sequences from all of the other sequences in our sample.
Simulations suggest this cannot be caused by recent population structure (p < 10^-3).
This corresponds to isolation lasting ~1.5 million years!
Possible explanations
• Isolation followed by later mixing is a recurrent
feature of human population history
• Mixing between ‘archaic’ humans and modern
humans happened at least once prior to the
exodus of modern humans out of Africa
• Some other feature of population structure is
unaccounted for in our simple models
Acknowledgments
Collaborators:
Mike Hammer (U. of Arizona)
Vincent Plagnol (Cambridge University)
Samples:
Foundation Jean Dausset (CEPH)
Y chromosome consortium (YCC)
Funding:
National Science Foundation
National Institutes of Health