TCGA seminar series “Introduction to doing genetic history” Coalescent Theory

Coalescent Theory
Models of microsatellite evolution
Dating methods
Coalescent Theory
The following series of slides shows how you
can build up a genealogical tree to relate a
sample of 22 individuals, collected in the
present day, at a single locus (e.g. the non-recombining Y chromosome).
Because (for the Y chromosome) one son has
only one father, but one father can have
more than one son, coalescent events occur
in the genealogy, and each one reduces the
number of ancestors. Eventually, one
ancestor remains – the Most Recent
Common Ancestor (MRCA).
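The reduction in ancestors described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each lineage picks its father uniformly at random from a population of fixed size, and the population size of 50 and the function name `ancestor_counts` are hypothetical choices, not values taken from the slides.

```python
import random

def ancestor_counts(sample_size, population_size, seed=0):
    """Trace a sample back one generation at a time until one ancestor remains.

    Each lineage picks its father uniformly at random from the previous
    generation; lineages that pick the same father coalesce. Returns the
    number of distinct ancestors per generation, starting with the sample.
    """
    rng = random.Random(seed)
    counts = [sample_size]
    lineages = sample_size
    while lineages > 1:
        # A set keeps only the distinct fathers chosen by the current lineages.
        fathers = {rng.randrange(population_size) for _ in range(lineages)}
        lineages = len(fathers)
        counts.append(lineages)
    return counts

# Hypothetical population of 50 males; the sample of 22 matches the slides.
counts = ancestor_counts(sample_size=22, population_size=50)
print(counts)
```

The list never increases going back in time and always ends at 1, the MRCA.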
[Figure, built up over successive slides: the genealogy of the 22 sampled individuals, traced back in time from the present. At successive earlier time points the number of ancestors is 22, 18, 16, 14, 12, 9, 8, 8, 7, 7, 5, 5, 3, 3, 3, 2, 2 and finally 1, the Most Recent Common Ancestor (MRCA).]
Mutational events can now be added to the
genealogical tree, resulting in polymorphic
sites. If these sites are typed in the modern
sample, they can be used to split the sample
into sub-clades (represented by different
colours).
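As a small illustration, the six haplotypes shown on the slides can be scanned for polymorphic sites and then split into sub-clades by the allele carried at one of those sites. The helper names `polymorphic_sites` and `split_by_site` are hypothetical, and positions are 0-based.

```python
# The six haplotypes shown on the slides.
haplotypes = [
    "TCGAGGTATTAAC",
    "TCTAGGTATTAAC",
    "TCGAGGCATTAAC",
    "TCTAGGTGTTAAC",
    "TCGAGGTATTAGC",
    "TCTAGGTATCAAC",
]

def polymorphic_sites(seqs):
    """Return the 0-based positions at which the sequences are not all identical."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]

def split_by_site(seqs, site):
    """Group sequences by the allele they carry at one site."""
    groups = {}
    for seq in seqs:
        groups.setdefault(seq[site], []).append(seq)
    return groups

sites = polymorphic_sites(haplotypes)
clades = split_by_site(haplotypes, sites[0])
print(sites)
print(clades)
```

Five sites vary in this sample, and the first of them divides the six haplotypes into two groups of three.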
[Figure, built up over successive slides: mutations are placed on the genealogy one at a time, below the Most Recent Common Ancestor (MRCA). The present-day sample ends up carrying six haplotypes, with asterisks marking the five polymorphic sites:]
TCGAGGTATTAAC
TCTAGGTATTAAC
TCGAGGCATTAAC
TCTAGGTGTTAAC
TCGAGGTATTAGC
TCTAGGTATCAAC
If we now imagine that our sample of 22 was
in fact sampled from different
subpopulations, then allele frequencies will
vary between subpopulations, with the
degree of difference reflecting the time at
which the population split took place.
Of course, population splitting is only one
model we can think of. Another would be a
migration model, where lineages occasionally
swap from one subpopulation to another.
Note that population splits do not always have
to involve just one sub-clade of the tree – in
the figure, this is just a limitation of
drawing the splitting in 2 dimensions only.
[Figure: the same genealogy and six haplotypes, with the present-day sample divided among Population 1, Population 2 and Population 3; the root is the Most Recent Common Ancestor (MRCA).]
The information on mutational differences in
the tree can be used to describe a
genealogical cladogram or network.
The next slide shows an example using Y
chromosome haplogroups.
Y chromosome markers (1):
Unique Event Polymorphisms (UEPs)
• These are the results of very rare mutation events
• Differences are usually base substitutions, insertions and deletions
• In humans, each mutation event is likely to have happened only once (hence UEPs)
• UEPs are used to classify Y chromosomes into broad ‘haplogroups’
[Figure: rooted network of Y chromosome haplogroups, with each branch labelled by the marker that defines it (labels include M9, M20, 92R7 and 12f2) and its allelic states.]
To move beyond mere description, and to
attempt such things as estimating the
TMRCA (Time to Most Recent Common
Ancestor) of the tree, it is necessary to
adopt certain modelling assumptions.
One classic model is the Wright-Fisher model.
It assumes no recombination (i.e. a single
locus), no selection, constant population
size, random mating and a random number
of offspring (approximately Poisson-distributed
when the population is large).
Wright-Fisher (WF) model
• Forwards-in-time model of a neutral locus in a
constant-size, random-mating, haploid population
evolving in discrete generations.
• Each individual in generation t has a random number
(possibly 0) of offspring in generation t+1. Each is:
– identical to the parent with probability 1-μ;
– otherwise a mutation occurs.
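The two bullet points above can be sketched as a short forwards-in-time simulation. This is a minimal sketch, not a reference implementation: it assumes each mutation creates a brand-new allele label, and the population size of 10, the rate `mu = 0.05` and the function name `wright_fisher` are hypothetical.

```python
import random

def wright_fisher(initial, generations, mu, seed=0):
    """Evolve a haploid population of constant size forwards in time.

    `initial` is a list of integer allele labels. Each individual in the next
    generation copies a uniformly chosen parent; with probability `mu` the
    copy mutates to a new, previously unseen allele label (an assumption).
    """
    rng = random.Random(seed)
    population = list(initial)
    next_label = max(population) + 1
    history = [population]
    for _ in range(generations):
        offspring = []
        for _ in range(len(population)):
            allele = rng.choice(population)  # choose a parent at random
            if rng.random() < mu:            # mutation with probability mu
                allele = next_label
                next_label += 1
            offspring.append(allele)
        population = offspring
        history.append(population)
    return history

history = wright_fisher(initial=[0] * 10, generations=6, mu=0.05)
for generation, alleles in enumerate(history):
    print("generation", generation, alleles)
```

Because parents are drawn with replacement, some individuals leave no offspring and others leave several, while the population size stays constant.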
[Figure: a Wright-Fisher population followed forwards in time, from generation 0 to generation 6.]
When we consider the same set of
assumptions but now simulate going
“backwards in time”, we arrive at the
standard coalescent model.
Coalescent theory
• Backwards-in-time “cousin” of the WF model:
similar assumptions, but traces the ancestry of n
observed alleles.
• Coalescent theory can be derived from the WF
model in the limit as N → ∞.
• Ancestry is represented via a genealogical tree:
leaves are observed alleles, root is the most recent
common ancestor (MRCA).
[Figure: the same population traced backwards in time, from generation 6 to generation 0, with the coalescence times marked: t0 = present, t1 = 1st coalescence, t2 = 2nd coalescence, t3 = TMRCA (Time since Most Recent Common Ancestor).]
Time is measured in units of N generations: 1 coalescent time
unit = NG years, where G is generation time in years.
Time back to the next coalescence when there are j
lineages has the exponential distribution with mean
and standard deviation both 2/(j(j-1)); e.g. for n = 4:
– j = 2 lineages: mean = sd = 1
– j = 3 lineages: mean = sd = 1/3
– j = 4 lineages: mean = sd = 1/6
Height of tree: mean = 3/2, sd = 1·07
Total branch length: mean = 11/3, sd = 2·33
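The n = 4 figures above can be checked with exact arithmetic. The sketch below assumes only that the waiting times for different numbers of lineages are independent exponentials, as in the standard coalescent; the function name `coalescent_moments` is hypothetical.

```python
from fractions import Fraction
from math import sqrt

def coalescent_moments(n):
    """Exact means and standard deviations under the standard coalescent.

    T_j, the time during which there are j lineages, is exponential with
    mean 2 / (j (j - 1)), so its variance is the square of that mean.
    The T_j are independent, so variances add.
    """
    means = {j: Fraction(2, j * (j - 1)) for j in range(2, n + 1)}
    height_mean = sum(means.values())
    height_var = sum(m ** 2 for m in means.values())
    # Each interval T_j contributes j branches to the total branch length.
    length_mean = sum(j * m for j, m in means.items())
    length_var = sum((j * m) ** 2 for j, m in means.items())
    return {
        "height_mean": height_mean,
        "height_sd": sqrt(height_var),
        "length_mean": length_mean,
        "length_sd": sqrt(length_var),
    }

m = coalescent_moments(4)
print(m)
```

All values are in coalescent time units of N generations.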
The TMRCA under the coalescent
• The TMRCA (height of the genealogical tree) is
on average 2(n-1)/n; the average time in which
there are just two ancestral lineages is 1.
– the number of ancestors of a sample drops rapidly
(backwards in time);
– for more than half its history, on average, a sample has
only two ancestors;
– data often clustered.
• The standard deviation of the TMRCA is 1 when
n=2, increasing with n to about 1·08.
– the height of the tree is highly variable from locus to
locus.
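The average TMRCA quoted above follows from summing the mean waiting times 2/(j(j-1)) over j = 2, ..., n. A short derivation, writing T_j for the time during which there are j lineages (a symbol introduced here for convenience):

```latex
\mathbb{E}[T_{\mathrm{MRCA}}]
  = \sum_{j=2}^{n} \mathbb{E}[T_j]
  = \sum_{j=2}^{n} \frac{2}{j(j-1)}
  = 2 \sum_{j=2}^{n} \left( \frac{1}{j-1} - \frac{1}{j} \right)
  = 2 \left( 1 - \frac{1}{n} \right)
  = \frac{2(n-1)}{n}.
```

Since the mean of T_2 is 1 and the total is always less than 2, the two-lineage phase accounts for more than half of the expected tree height.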
When we simulate from the standard
coalescent, we find that there is
considerable variation in the TMRCA from
one simulation to the next.
Most coalescent events occur in the recent past
(at the tips of the tree).
The situation changes if we expand the
coalescent model to incorporate a factor of
exponential population growth.
Now there is less variation in the TMRCA
between simulations, and more coalescent
events occur in the more distant past (near
the root of the tree).
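One way to see the reduced variation is a small simulation. The sketch below is illustrative only and rests on stated assumptions: going back in time the population size is N(t) = N0 exp(-beta t), with t in units of N0 generations, so a cumulative standard-coalescent time tau corresponds to real time log(1 + beta tau) / beta. The growth rate `beta = 10`, the sample size of 10 and all function names are hypothetical.

```python
import math
import random
import statistics

def tmrca_constant(n, rng):
    """TMRCA of n lineages under the standard coalescent (units of N generations)."""
    # With j lineages the coalescence rate is j (j - 1) / 2.
    return sum(rng.expovariate(j * (j - 1) / 2) for j in range(2, n + 1))

def tmrca_growth(n, beta, rng):
    """TMRCA when the population has grown exponentially at rate beta.

    Assumes N(t) = N0 * exp(-beta * t) backwards in time, so that cumulative
    standard-coalescent time tau maps to real time log(1 + beta * tau) / beta.
    """
    tau = tmrca_constant(n, rng)
    return math.log1p(beta * tau) / beta

def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

rng = random.Random(1)
constant = [tmrca_constant(10, rng) for _ in range(2000)]
growth = [tmrca_growth(10, beta=10.0, rng=rng) for _ in range(2000)]
cv_constant = coefficient_of_variation(constant)
cv_growth = coefficient_of_variation(growth)
print(cv_constant, cv_growth)
```

The coefficient of variation is used so that the two sets of TMRCA values are compared on a relative scale, because growth also shortens the tree.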
Generalisations of the standard
coalescent model
• Variable population size: coalescences occur more
rapidly when the population size is small.
• Population subdivision with migration.
• Some forms of selection.
• Recombination: the ancestral recombination graph
(ARG)
What can we date?
- TMRCA (or subclade TMRCA)
- Demographic parameters
The next two slides briefly consider how we
can use a method called BATWING to get
estimates both of the TMRCA and of
demographic parameters such as the time
that growth starts or the timings of
population split events.
The first slide describes the demographic
model used (population splitting with
growth).
The second slide briefly describes how
BATWING explores “tree space” to arrive
at probability distributions for parameters of
interest.
BATWING: Model
BATWING: Method (briefly!)
Mutational models
- UEP (“infinite sites” model)
- Microsatellites (stepwise models)
The “infinite sites” model is appropriate for
UEP events on the Y chromosome. Because
these mutations occur rarely, and because
the Y chromosome is big (~60 Mb), each
mutation most likely occurs in a different
site (this would hold perfectly if the number
of sites were infinite). This allows each
mutation to be distinguished as illustrated in
the next slide.
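[Figure: a genealogy carrying two mutations at different sites. Coding each site as 0 (ancestral) or 1 (mutated), the eight sampled chromosomes have haplotypes 10, 00, 00, 00, 01, 01, 01 and 01.]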
The “stepwise mutation” model (SMM) is
appropriate for microsatellites. When a
mutation occurs, the new allele length
depends on the existing length. In the
simplest case of the “single-step” SMM,
illustrated in the next slide, the new length =
old length +/- 1.
Microsatellites
GAGGCGTAGTAGTAGTAGTAGTAGTAGGCTCTA
GAGGCGTAGTAGTAGTAGTAGTAGGCTCTA
or
GAGGCGTAGTAGTAGTAGTAGTAGTAGTAGGCTCTA
• Microsatellites mutate very fast (~1 change every 500
generations)
• Mutation events usually involve a gain or a loss of a single
repeat unit
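The single-step behaviour described above can be sketched for one lineage as follows. This is an illustration under assumptions: mutation is applied once per generation with probability `mu`, no lower bound on allele length is imposed, and the function name `simulate_smm` is hypothetical. The starting length of 7 repeats and the rate of 1 in 500 follow the slide.

```python
import random

def simulate_smm(start_length, generations, mu, seed=0):
    """Follow one lineage under the single-step stepwise mutation model.

    In each generation a mutation occurs with probability `mu`; it adds or
    removes exactly one repeat unit with equal probability. No lower bound
    on the repeat count is imposed in this sketch.
    """
    rng = random.Random(seed)
    length = start_length
    for _ in range(generations):
        if rng.random() < mu:
            length += rng.choice((-1, 1))
    return length

# 7 repeats, as in the first sequence on the slide; about 1 change per 500 generations.
final_length = simulate_smm(start_length=7, generations=5000, mu=1 / 500)
print(final_length)
```

Because steps up and down are equally likely, the expected length stays at its starting value while the spread around it grows with time.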
The “stepwise mutation” model (SMM) can
be expanded in many ways. One useful
model that is gaining acceptance is the
linear length-dependent SMM. Here, when
mutations occur they still go up (+1) or
down (-1) with equal probability, but the
rate at which mutations occur increases as a
linear function of microsatellite length.
The model has a plausible mechanism – if
mutations occur by replication slippage, and
can occur between any two adjacent motifs
with equal probability, then the linear
length-dependent SMM results.
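A minimal sketch of the length-dependent rate, following the slippage argument above: if slippage can occur at any junction between two adjacent motifs, an allele with L repeats has L - 1 junctions. The parameter `rate_per_junction` and both function names are hypothetical, and the slides do not say whether the fitted line has a non-zero intercept.

```python
import random

def mutation_rate(length, rate_per_junction):
    """Linear length-dependent rate: one slippage opportunity per junction
    between adjacent repeats (an assumed parameterisation)."""
    return rate_per_junction * max(length - 1, 0)

def step_length_dependent(length, rate_per_junction, rng):
    """One generation under the linear length-dependent SMM."""
    if rng.random() < mutation_rate(length, rate_per_junction):
        length += rng.choice((-1, 1))  # up or down with equal probability
    return length

rng = random.Random(0)
length = 10
for _ in range(1000):
    length = step_length_dependent(length, rate_per_junction=0.0002, rng=rng)
print(length, mutation_rate(8, 0.0002), mutation_rate(12, 0.0002))
```

The direction of each step is the same as in the simple SMM; only the chance that a step happens depends on the current length.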
[Figure: mutation rate plotted against repeat size (8 to 12 repeats) for the simple SMM and the length-dependent SMM.]
“YTime”
“YTime” is a software package written by Mike
Weale for estimating TMRCAs using
microsatellite data at a single locus (as one would
find on the Y chromosome – hence the name). See
www.ucl.ac.uk/tcga/software
The software assumes that the haplotype of the
MRCA is known. This means that the point
estimate of the TMRCA is unaffected by
demographic history, although the standard error
of the estimate will be affected.
The method works by simulating under different
TMRCA values until the observed ASD falls within
the 95% confidence interval.
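The idea can be illustrated with a rough sketch. It is not the YTime implementation. It takes ASD to mean the average squared distance in repeat count between the sampled chromosomes and the MRCA haplotype (an assumption, since the slides do not expand the abbreviation), it simulates a star-shaped genealogy under the single-step SMM, and the sample data and function names are hypothetical.

```python
import random

def asd(sample, ancestral):
    """Average squared difference in repeat count between sample and ancestral allele."""
    return sum((x - ancestral) ** 2 for x in sample) / len(sample)

def simulate_asd(n, generations, mu, rng):
    """ASD for n lineages that each evolve independently for `generations`
    generations under the single-step SMM (a star-shaped genealogy, assumed)."""
    sample = []
    for _ in range(n):
        offset = 0  # repeat count relative to the ancestral allele
        for _ in range(generations):
            if rng.random() < mu:
                offset += rng.choice((-1, 1))
        sample.append(offset)
    return asd(sample, 0)

def within_central_95(observed, simulated):
    """True if `observed` lies between the 2.5% and 97.5% points of `simulated`."""
    ordered = sorted(simulated)
    lo = ordered[int(0.025 * len(ordered))]
    hi = ordered[int(0.975 * len(ordered)) - 1]
    return lo <= observed <= hi

# Hypothetical single-locus data: six repeat counts and a known ancestral count of 10.
observed = asd([10, 11, 10, 9, 12, 10], ancestral=10)
rng = random.Random(0)
for t in (50, 200, 800):  # candidate TMRCA values in generations
    sims = [simulate_asd(6, t, mu=1 / 500, rng=rng) for _ in range(200)]
    print(t, within_central_95(observed, sims))
```

Candidate TMRCA values whose simulated ASD distribution covers the observed value are retained as plausible; a real analysis would use many more simulations and a finer grid of TMRCA values.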
[Figure: “Vary TMRCA until obs. ASD is likely”. Four panels, for T = 150, 200, 80 and 97 generations, each showing the simulated distribution of ASD values with axes labelled D and Likelihood.]