The Discrete-Time Coalescent

The Geometric Distribution
• Probability of the 1st success on the Nth trial,
given a probability, p, of success:
$$P(N = j) = (1 - p)^{j-1}\, p$$
$$E(N) = \frac{1}{p}, \qquad \mathrm{Var}(N) = \frac{1 - p}{p^{2}}$$
• Example:
– P(Roll 1st 6 on the ith roll) = $(1 - 1/6)^{i-1}\,(1/6)$
– P(1st heads on the ith flip) = $(1 - 1/2)^{i-1}\,(1/2)$
• To show P(N = j) is a proper pdf:
$$\sum_{j=1}^{\infty} (1-p)^{j-1}\, p = \frac{p}{1-p}\sum_{j=1}^{\infty} (1-p)^{j} = \frac{p}{1-p}\cdot\frac{(1-p) - 0}{1-(1-p)} = \frac{p}{1-p}\cdot\frac{1-p}{p} = 1$$
• Difference from “Binomial” distribution
– Binomial(k) = P(k successes in N trials)
– Geometric(k) = P(1st success after k-1 failures)
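The geometric formulas can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names `geometric_pmf` and `truncated_moments` are illustrative, and the infinite sums are truncated at 2000 terms on the assumption that the remaining tail is negligible for p = 1/6.

```python
def geometric_pmf(j: int, p: float) -> float:
    """P(N = j): j - 1 failures followed by the first success."""
    return (1.0 - p) ** (j - 1) * p


def truncated_moments(p: float, terms: int = 2000):
    """Sum the pmf, mean, and variance over j = 1..terms (assumed to cover the mass)."""
    total = sum(geometric_pmf(j, p) for j in range(1, terms + 1))
    mean = sum(j * geometric_pmf(j, p) for j in range(1, terms + 1))
    second = sum(j * j * geometric_pmf(j, p) for j in range(1, terms + 1))
    return total, mean, second - mean ** 2


total, mean, var = truncated_moments(1 / 6)  # rolling the first 6
print(total, mean, var)  # close to 1, 1/p = 6, and (1 - p)/p^2 = 30
```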
4/13/2015
Comp 790– Distributions & Coalescence
Expected Value Proof
• Expected value is the sum of each value times its probability

$$E(N) = \sum_{j=1}^{\infty} j\,(1-p)^{j-1}\, p = \frac{p}{1-p}\sum_{j=1}^{\infty} j\,(1-p)^{j}$$
• Recall the relation:
$$\sum_{j=1}^{\infty} j\,a^{j} = \frac{a}{(1-a)^{2}} \quad \text{for } 0 \le a < 1$$
• Substituting gives:
$$E(N) = \frac{p}{1-p}\cdot\frac{1-p}{p^{2}} = \frac{1}{p}$$
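The relation recalled in this proof is stated without derivation. One standard way to obtain it, sketched here on the assumption that 0 ≤ a < 1 so that the series converges and can be differentiated term by term, is to differentiate the geometric series and multiply by a:

```latex
\begin{align*}
\sum_{j=0}^{\infty} a^{j} &= \frac{1}{1-a} \\
\frac{d}{da}\sum_{j=0}^{\infty} a^{j} = \sum_{j=1}^{\infty} j\,a^{j-1} &= \frac{1}{(1-a)^{2}} \\
\sum_{j=1}^{\infty} j\,a^{j} = a\sum_{j=1}^{\infty} j\,a^{j-1} &= \frac{a}{(1-a)^{2}}
\end{align*}
```

Setting a = 1 - p gives the value (1 - p)/p^2 used in the substitution step.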
Other Properties
• Markov Property
– The probability of the “next step” in a discrete or
continuous process depends only on the process's
present state
– The process is without memory of previous events
$$P(T > t_2 \mid T > t_1) = P(T > t_2 - t_1)$$

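For the geometric distribution, this memoryless identity can be checked directly from the survival function P(N > j) = (1 - p)^j. The sketch below is illustrative only; the helper names `survival` and `conditional_survival` and the values of p, t1, and t2 are assumptions chosen for the example.

```python
def survival(j: int, p: float) -> float:
    """P(N > j): the first j trials are all failures."""
    return (1.0 - p) ** j


def conditional_survival(t2: int, t1: int, p: float) -> float:
    """P(N > t2 | N > t1) for t2 >= t1, from the definition of conditional probability."""
    return survival(t2, p) / survival(t1, p)


p, t1, t2 = 1 / 6, 4, 10
lhs = conditional_survival(t2, t1, p)
rhs = survival(t2 - t1, p)
print(lhs, rhs)  # the two values agree up to floating-point rounding
```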
Continuous Generalization
• Geometric distributions characterize “discrete”
events
• Sometimes we’d like to pose questions about a
continuous variable, for example:
– Probability that a population will be inbred after T years,
rather than after N generations, where T is a real number,
and N is an integer
• The “continuous” counterpart of the geometric
distribution is the “exponential” distribution

Exponential Distribution
• The Exponential density function is characterized by
one parameter, a, called the “rate” or “intensity”
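$$\mathrm{Exp}(a,t) = a e^{-at}$$
$$E(\mathrm{Exp}(a,t)) = \frac{1}{a}, \qquad \mathrm{Var}(\mathrm{Exp}(a,t)) = \frac{1}{a^{2}}$$
To show Exp(a,t) is a proper pdf:
$$\int_{t=0}^{\infty} a e^{-at}\,dt = \Big[\,1 - e^{-at}\Big]_{t=0}^{\infty} = 1 - 0 = 1$$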

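The normalization, mean, and variance of the exponential density can also be checked by numerical integration. This is a rough sketch under stated assumptions: `exp_density` and `midpoint_integral` are illustrative helpers, the rate a = 2 is arbitrary, and the integral is truncated at t = 40/a, beyond which the tail is about exp(-40).

```python
import math


def exp_density(t: float, a: float) -> float:
    """Exponential density a * exp(-a t) for t >= 0."""
    return a * math.exp(-a * t)


def midpoint_integral(f, lo: float, hi: float, steps: int = 200_000) -> float:
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


a = 2.0
upper = 40.0 / a  # truncation point (an assumption; the tail beyond it is tiny)
area = midpoint_integral(lambda t: exp_density(t, a), 0.0, upper)
mean = midpoint_integral(lambda t: t * exp_density(t, a), 0.0, upper)
second = midpoint_integral(lambda t: t * t * exp_density(t, a), 0.0, upper)
print(area, mean, second - mean ** 2)  # close to 1, 1/a, and 1/a^2
```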
Exponential Properties
• Other useful properties of U = Exp(a,t) include:
– Markov property, where t2 > t1:
$$P(U > t_2 \mid U > t_1) = P(U > t_2 - t_1)$$
– Assuming a second independent exponential
process, V = Exp(b,t):
$$P(U < V) = \frac{a}{a + b}$$
$$\min(U,V) \sim \mathrm{Exp}(a + b)$$
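The two properties of competing exponentials can be illustrated with a seeded Monte Carlo simulation. This is a sketch rather than a proof: the function name `simulate_race`, the rates a = 1 and b = 3, the trial count, and the seed are all arbitrary choices for the example. It relies on Python's `random.expovariate`, which takes the rate as its argument.

```python
import random


def simulate_race(a: float, b: float, trials: int = 200_000, seed: int = 1):
    """Estimate P(U < V) and E[min(U, V)] for independent U ~ Exp(a), V ~ Exp(b)."""
    rng = random.Random(seed)
    u_first = 0
    total_min = 0.0
    for _ in range(trials):
        u = rng.expovariate(a)  # expovariate takes the rate, not the mean
        v = rng.expovariate(b)
        if u < v:
            u_first += 1
        total_min += min(u, v)
    return u_first / trials, total_min / trials


p_u_first, mean_min = simulate_race(a=1.0, b=3.0)
print(p_u_first, mean_min)  # near a/(a+b) = 0.25 and 1/(a+b) = 0.25
```

A mean of 1/(a+b) for the minimum is consistent with min(U,V) being exponential with rate a+b, though the mean alone does not establish the full distribution.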
Approximations
• The geometric distribution can be approximated with the
exponential distribution in various ways
• Consider the following geometric distribution
$$P(N > j) = (1 - p)^{j}$$
(there are at least j failures before the first success)
• We can model discrete time as a rational fraction of some
very large number, M, that includes all intervals of interest
(i.e. 1/M, 2/M, …, N/M, …, M/M, rather than 1, 2, 3, …)
• Assuming p is small and N is large, we can approximate
“continuous” time as t = j/M and a = pM
Approximations (cont)
• Recalling t = j/M and a = pM, we can rewrite $(1-p)^{j}$ as:
$$P(N > j) = (1 - p)^{j} = \left(1 - \frac{pM}{M}\right)^{\frac{jM}{M}} = \left(1 - \frac{a}{M}\right)^{tM} = P\!\left(\frac{N}{M} > t\right)$$
• Also note, for large M:
$$\left(1 - \frac{a}{M}\right)^{tM} \approx e^{-at}$$
• Thus, P(T = t) = a P(N/M ≥ t) is approximately exponential
with intensity a.

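The quality of this approximation can be seen by comparing the geometric survival probability with the exponential one as M grows. The sketch below uses illustrative names (`geometric_survival`, `exponential_survival`) and arbitrary values a = 1 and t = 2; it assumes tM is an integer number of trials.

```python
import math


def geometric_survival(t: float, a: float, M: int) -> float:
    """P(N/M > t) = (1 - a/M)^(tM), with p = a/M and j = tM."""
    j = round(t * M)
    return (1.0 - a / M) ** j


def exponential_survival(t: float, a: float) -> float:
    """exp(-a t), the survival function of Exp(a)."""
    return math.exp(-a * t)


a, t = 1.0, 2.0
errors = [abs(geometric_survival(t, a, M) - exponential_survival(t, a))
          for M in (10, 100, 1000, 10000)]
print(errors)  # the gap shrinks as M grows
```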
The Discrete-Time Coalescent
• We consider the N-coalescent, or the coalescent for
a sample of N genes (Kingman 1982)
• N-coalescent: What is the distribution of the number
of generations to find the Most Recent Common
Ancestor (MRCA) for a fixed population of 2N genes?
• We use 2N because we recognize that the diploid
case is more realistic, and it is related to the simpler
haploid case by a factor of 2
MRCA Examples
Coalescence of two genes
• What is the distribution of the number of prior
generations for the MRCA (waiting time)?
• Probability of a common parent (i.e. the MRCA is in the
immediately previous generation) is:
$$\frac{1}{2N}$$
The first gene can choose its ancestor freely,
but the second must choose the same one as the
first, thus it has 1 out of 2N choices
• Probability that 2 genes have different parents is:
$$1 - \frac{1}{2N}$$
Going back further
• Since sampling in successive generations is
independent of the past, the probability that two
genes find a common ancestor j generations back is:
$$\mathrm{MRCA}(j) = \left(1 - \frac{1}{2N}\right)^{j-1} \frac{1}{2N}$$
In the first j-1 generations they chose different
ancestors, and then in generation j they chose
the same ancestor
• This is a geometric distribution with p = 1/(2N)
• Thus, the expected coalescence time for 2 genes is:
$$E(\mathrm{MRCA}(j)) = \frac{1}{p} = 2N$$
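The two-gene result can be illustrated by simulating the parent-choice process directly. This is a minimal sketch, assuming each lineage picks its parent uniformly and independently among the 2N genes of the previous generation; the function names, N = 10, the trial count, and the seed are illustrative choices.

```python
import random


def two_gene_coalescence_time(num_genes: int, rng: random.Random) -> int:
    """Generations back until two lineages pick the same parent among num_genes."""
    generations = 0
    while True:
        generations += 1
        # Each lineage picks its parent uniformly and independently.
        if rng.randrange(num_genes) == rng.randrange(num_genes):
            return generations


def mean_coalescence_time(N: int, trials: int = 20_000, seed: int = 1) -> float:
    """Average two-gene coalescence time in a population of 2N genes."""
    rng = random.Random(seed)
    return sum(two_gene_coalescence_time(2 * N, rng) for _ in range(trials)) / trials


estimate = mean_coalescence_time(N=10)
print(estimate)  # should land near 1/p = 2N = 20
```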
MRCA Examples
N = 10
N-genes, no common parent
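• The waiting time for k ≤ 2N genes to have
fewer than k lineages depends on the probability that
all k genes have different parents:
$$\frac{(2N-1)}{2N}\,\frac{(2N-2)}{2N}\cdots\frac{(2N-(k-1))}{2N} = \prod_{i=1}^{k-1}\left(1 - \frac{i}{2N}\right)$$
The 1st gene can choose its parent freely, but the next
k-1 must choose from the remainder (genes without a child)
• Manipulating a little:
$$\prod_{i=1}^{k-1}\left(1 - \frac{i}{2N}\right) = 1 - \sum_{i=1}^{k-1}\frac{i}{2N} + O\!\left(\frac{1}{N^{2}}\right) = 1 - \binom{k}{2}\frac{1}{2N} + O\!\left(\frac{1}{N^{2}}\right)$$
• Where, for large N, $1/N^{2}$ is negligible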

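The size of the dropped O(1/N^2) term can be checked by comparing the exact product with its first-order approximation. The sketch below uses illustrative names (`prob_all_distinct`, `first_order_approx`) and an arbitrary k = 5; it relies on `math.comb`, available in Python 3.8 and later.

```python
import math


def prob_all_distinct(k: int, N: int) -> float:
    """Exact probability that k genes pick k different parents among 2N."""
    prob = 1.0
    for i in range(1, k):
        prob *= 1.0 - i / (2 * N)
    return prob


def first_order_approx(k: int, N: int) -> float:
    """1 - C(k, 2) / (2N), dropping the O(1/N^2) terms."""
    return 1.0 - math.comb(k, 2) / (2 * N)


k = 5
for N in (50, 500, 5000):
    gap = prob_all_distinct(k, N) - first_order_approx(k, N)
    print(N, gap, gap * N ** 2)  # gap * N^2 stays roughly constant
```

That gap times N^2 stays roughly constant as N increases is what the O(1/N^2) notation predicts.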
N-gene Coalescence
• The probability k-genes have different parents is:
$$1 - \binom{k}{2}\frac{1}{2N}$$
• And one or more have a common parent:
$$1 - \left(1 - \binom{k}{2}\frac{1}{2N}\right) = \binom{k}{2}\frac{1}{2N}$$
• Repeated failures over successive generations lead to a
geometric distribution, with
$$p = \binom{k}{2}\frac{1}{2N}$$
$$P(N = j) = \left(1 - \binom{k}{2}\frac{1}{2N}\right)^{j-1}\binom{k}{2}\frac{1}{2N}$$
(on the left-hand side N is the waiting time in generations, as in the
geometric distribution; 2N on the right is the population size)
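The geometric waiting time for k lineages can be compared against a direct simulation. This is a sketch under stated assumptions: each of the k lineages picks a parent uniformly and independently among 2N genes, the waiting time ends as soon as any two share a parent, and the names, k = 4, N = 50, trial count, and seed are illustrative.

```python
import math
import random


def generations_until_coalescence(k: int, num_genes: int, rng: random.Random) -> int:
    """Generations back until at least two of k lineages share a parent."""
    generations = 0
    while True:
        generations += 1
        parents = {rng.randrange(num_genes) for _ in range(k)}
        if len(parents) < k:  # some pair chose the same parent
            return generations


def mean_waiting_time(k: int, N: int, trials: int = 20_000, seed: int = 1) -> float:
    """Average waiting time for k lineages in a population of 2N genes."""
    rng = random.Random(seed)
    total = sum(generations_until_coalescence(k, 2 * N, rng) for _ in range(trials))
    return total / trials


k, N = 4, 50
simulated = mean_waiting_time(k, N)
approx = 2 * N / math.comb(k, 2)  # 1/p with p = C(k, 2) / (2N)
print(simulated, approx)
```

The simulated mean should sit close to, but not exactly at, 2N / C(k,2), since the approximation drops the O(1/N^2) terms.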
Next Time
• Finish coalescence of N genes
• The effect of approximations
• The continuous-time coalescent
• The effective population size