The Geometric Distribution

• Probability of the 1st success on the Nth trial, given a probability, p, of success

$$P(N = j) = (1 - p)^{j-1}\, p$$

P(Roll 1st 6 on the ith roll) = $(1 - 1/6)^{i-1}\,(1/6)$
P(1st heads on the ith flip) = $(1 - 1/2)^{i-1}\,(1/2)$

$$E(N) = \frac{1}{p} \qquad \mathrm{Var}(N) = \frac{1 - p}{p^2}$$

To show P(N = j) is a proper probability distribution (it sums to 1):

$$\sum_{j=1}^{\infty} (1 - p)^{j-1}\, p = \frac{p}{1 - p} \sum_{j=1}^{\infty} (1 - p)^{j} = \frac{p}{1 - p} \cdot \frac{1 - p}{1 - (1 - p)} = \frac{p}{1 - p} \cdot \frac{1 - p}{p} = 1$$

4/13/2015 Comp 790– Distributions & Coalescence

Example

• Difference from “Binomial” distribution
– Binomial(k) = P(k successes in N trials)
– Geometric(k) = P(1st success after k-1 failures)

Expected Value Proof

• Expected value is the sum of each value times its probability

$$E(N) = \sum_{j=1}^{\infty} j\,(1 - p)^{j-1}\, p = \frac{p}{1 - p} \sum_{j=1}^{\infty} j\,(1 - p)^{j}$$

• Recall the relation:

$$\sum_{j=1}^{\infty} j\,a^{j} = \frac{a}{(1 - a)^2} \quad \text{for } 0 < a < 1$$

• Substituting a = 1 - p gives:

$$E(N) = \frac{p}{1 - p} \cdot \frac{1 - p}{p^2} = \frac{1}{p}$$

Other Properties

• Markov Property
– The probability of the “next step” in a discrete or continuous process depends only on the process's present state
– The process is without memory of previous events

$$P(T > t_2 \mid T > t_1) = P(T > t_2 - t_1)$$

Continuous Generalization

• Geometric distributions characterize “discrete” events
• Sometimes we’d like to pose questions about a continuous variable, for example
– Probability that a population will be inbred after T years, rather than after N generations, where T is a real number, and N is an integer
• The “continuous” counterpart of the geometric distribution is the “exponential” distribution

Exponential Distribution

• The Exponential density function is characterized by one parameter, a, called the “rate” or “intensity”

$$\mathrm{Exp}(a, t) = a\,e^{-at}$$

$$E(\mathrm{Exp}(a, t)) = \frac{1}{a} \qquad \mathrm{Var}(\mathrm{Exp}(a, t)) = \frac{1}{a^2}$$

To show Exp(a,t) is a proper pdf:

$$\int_{t=0}^{\infty} a\,e^{-at}\, dt = \left[ -e^{-at} \right]_{t=0}^{\infty} = 0 - (-1) = 1$$

Exponential Properties

• Other useful properties of U = Exp(a,t) include:
– Markov property, where t2 > t1

$$P(U > t_2 \mid U > t_1) = P(U > t_2 - t_1)$$

– Assuming a second independent exponential process, V = Exp(b,t)

$$P(U < V) = \frac{a}{a + b} \qquad \min(U, V) \sim \mathrm{Exp}(a + b)$$

Approximations

• The geometric distribution can be approximated with the exponential distribution in various ways
• Consider the following tail of the geometric distribution

$$P(N > j) = (1 - p)^{j}$$

That is, there are at least “j” failures before the first success
• We can model discrete time as a rational fraction of some very large number, M, that includes all intervals of interest (i.e. 1/M, 2/M, … N/M … M/M, rather than 1, 2, 3, …)
• Assuming p is small and N is large, we can approximate “continuous” time as t = j/M and a = pM

Approximations (cont)

• Recalling t = j/M and a = pM, we can rewrite $(1-p)^j$ as:

$$P(N > j) = (1 - p)^{j} = \left(1 - \frac{pM}{M}\right)^{\frac{j}{M} M} = \left(1 - \frac{a}{M}\right)^{tM} = P\!\left(\frac{N}{M} > t\right)$$

• Also note, for large M:

$$\left(1 - \frac{a}{M}\right)^{tM} \approx e^{-at}$$

• Thus, the waiting time T = N/M has density approximately $a\,P(N/M > t) \approx a\,e^{-at}$, which is exponential with intensity a.

The Discrete-Time Coalescent

• We consider the N-coalescent, or the coalescent for a sample of N genes (Kingman 1982)
• N-coalescent: What is the distribution of the number of generations to find the Most Recent Common Ancestor (MRCA) for a fixed population of 2N genes?
• We use 2N because we recognize that the diploid case is more realistic, and it is related to the simpler haploid case by a factor of 2

MRCA Examples

[Figure]

Coalescence of two genes

• What is the distribution of the number of prior generations for the MRCA (waiting time)?
• Probability of a common parent (i.e. the MRCA is in the immediately previous generation) is:

$$\frac{1}{2N}$$

The first gene can choose its ancestor freely, but the second must choose the same ancestor as the first, thus it has 1 out of 2N choices
• Probability that 2 genes have different parents is

$$1 - \frac{1}{2N}$$

Going back further

• Since sampling in successive generations is independent of the past, the probability that two genes find a common ancestor j generations back is:

$$\mathrm{MRCA}(j) = \left(1 - \frac{1}{2N}\right)^{j-1} \frac{1}{2N}$$

In the first j-1 generations they chose different ancestors, and then in generation j they chose the same ancestor
• Which is a geometric distribution with p = 1/(2N)
• Thus, the expected coalescence time for 2 genes is:

$$E(\mathrm{MRCA}(j)) = \frac{1}{p} = 2N$$

MRCA Examples, N = 10

[Figure]

N-genes, no common parent

• To find the waiting time for k ≤ 2N genes to have fewer than k lineages, start with the probability that all k genes have different parents:

$$\frac{(2N - 1)}{2N} \cdot \frac{(2N - 2)}{2N} \cdots \frac{(2N - (k - 1))}{2N} = \prod_{i=1}^{k-1} \left(1 - \frac{i}{2N}\right)$$

The 1st gene can choose its parent freely, but the next k-1 must choose from the remainder (genes without a child)
• Manipulating a little

$$\prod_{i=1}^{k-1} \left(1 - \frac{i}{2N}\right) = 1 - \sum_{i=1}^{k-1} \frac{i}{2N} + O\!\left(\frac{1}{N^2}\right) = 1 - \binom{k}{2} \frac{1}{2N} + O\!\left(\frac{1}{N^2}\right)$$

• Where, for large N, $1/N^2$ is negligible

N-gene Coalescence

• The probability k genes have different parents is:

$$1 - \binom{k}{2} \frac{1}{2N}$$

• And one or more have a common parent:

$$1 - \left(1 - \binom{k}{2} \frac{1}{2N}\right) = \binom{k}{2} \frac{1}{2N}$$

• Repeated failures for j generations lead to a geometric distribution, with

$$p = \binom{k}{2} \frac{1}{2N}$$

$$P(N = j) = \left(1 - \binom{k}{2} \frac{1}{2N}\right)^{j-1} \binom{k}{2} \frac{1}{2N}$$

(here the N in P(N = j) is the waiting time in generations, while 2N is the population size)

Next Time

• Finish coalescence of N genes
• The effect of approximations
• The continuous-time coalescent
• The effective population size
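
The formulas in these slides can be checked numerically. The following is a minimal Python sketch, not part of the original lecture: the function names and the parameter name `two_n` (standing for the population size 2N) are assumptions made for illustration. It evaluates the geometric probabilities, compares the geometric tail with its exponential approximation, and compares the exact one-generation coalescence probability for k genes with the large-N approximation.

```python
import math
from math import comb


def geometric_pmf(j, p):
    """P(N = j): first success on trial j, with success probability p."""
    return (1 - p) ** (j - 1) * p


def geometric_tail(j, p):
    """P(N > j): the first j trials are all failures."""
    return (1 - p) ** j


def exponential_tail(t, a):
    """P(T > t) for an exponential waiting time with intensity a."""
    return math.exp(-a * t)


def prob_all_distinct_parents(k, two_n):
    """Exact probability that k genes pick k different parents among two_n."""
    prob = 1.0
    for i in range(1, k):
        prob *= 1 - i / two_n
    return prob


def coalescence_prob_approx(k, two_n):
    """Large-N approximation: (k choose 2) / 2N, ignoring O(1/N^2) terms."""
    return comb(k, 2) / two_n


if __name__ == "__main__":
    p = 1 / 6  # rolling a 6 with a fair die
    total = sum(geometric_pmf(j, p) for j in range(1, 5001))
    mean = sum(j * geometric_pmf(j, p) for j in range(1, 5001))
    print("sum of pmf:", total)
    print("mean (expect about 1/p = 6):", mean)

    # Exponential approximation with t = j/M and a = p*M
    p_small, big_m, j = 0.001, 1000, 2000
    print("geometric tail:  ", geometric_tail(j, p_small))
    print("exponential tail:", exponential_tail(j / big_m, p_small * big_m))

    # k = 3 genes in a population of 2N = 2000
    print("exact coalescence prob: ", 1 - prob_all_distinct_parents(3, 2000))
    print("approx coalescence prob:", coalescence_prob_approx(3, 2000))
```

The sums are truncated at 5000 terms, which is an arbitrary cutoff chosen so that the neglected tail is far below floating-point precision for p = 1/6; a smaller p would need a longer sum.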