Record Values, Poisson Mixtures, and the Joint Distribution of Counts of Strings in Bernoulli Sequences

Fred W. Huffer (joint work with Jayaram Sethuraman and Sunder Sethuraman)

Talk based on:
• Huffer, F.W., Sethuraman, J. and Sethuraman, S. (2009). A study of counts of Bernoulli strings via conditional Poisson processes. Proc. Amer. Math. Soc. 137, No. 6, 2125–2134.
• Huffer, F.W. and Sethuraman, J. (2012). Joint distributions of counts of strings in finite Bernoulli sequences. To appear in J. Appl. Probab. 49, No. 3.

Let:
• U1, U2, U3, . . . be iid continuous random variables.
• Y1, Y2, Y3, . . . be Bernoulli rv's which indicate the positions of the record values in U1, U2, U3, . . ., that is,

    Yj = 1 if Uj sets a new record (i.e., if Ui < Uj for all i < j), and Yj = 0 otherwise.

It is well known that:
• Y1, Y2, Y3, . . . are independent.
• P(Yj = 1) = 1/j for all j.
Since ∑_j 1/j = ∞, we know ∑_j Yj = ∞ almost surely, that is, there will be infinitely many record values.

Example: A sequence of iid exponentials U1, U2, . . . and the corresponding record indicators Y1, Y2, . . .. The first 30 are shown with records colored red.

[Figure: the first 30 values of the exponential sequence plotted against index 1–30, with the record values colored red and the indicators Yj printed above them.]

Example continued: The same sequence with each record marked beneath by its duration (how long it stands).

[Figure: the same plot, with each record labeled beneath by its duration.]

The duration of a record value Ui is how long the record stands, that is, the number of trials required to break the record. Let Zk be the total number of records (over all time) having duration k.

A Surprising Theorem:
• Zk ∼ Poisson(1/k) for all k, and
• Z1, Z2, Z3, . . . are independent.

This theorem (or various parts of it, and also various extensions of it) has been proved by various people.

Original theorem:
• Goncharov (1944) [in Russian], Zk ∼ Poisson
• Kolchin (1971, Theory Probab. Appl.), Zk independent
• Hahlin (1995) [tech report], Z1
• Diaconis (1996?) [unpublished], Z1
• Emery (1998) [unpublished], Z1

Extensions:
• Arratia-Barbour-Tavaré (1992, Ann. Appl. Probab.), Bern(a, 0)
• Joffe-Marchand-Perron-Popadiuk (2004, J. Theoret. Probab.), Z1 for Bern(1, b)
• Sethuraman and Sethuraman (2004, Rubin Festschrift, IMS), Bern(1, b)
• Holst (2007, J. Appl. Probab.), Bern(a, b)

The same sequence Y1, Y2, Y3, . . . arises in other contexts:
• The cycles of random permutations of {1, 2, . . . , n}. Here (Yn, Yn−1, . . . , Y2, Y1) indicate the endings of cycles in Feller's construction of a random permutation.
• A special case of the Polya-Blackwell-MacQueen urn scheme for generating a Dirichlet process. Here Yi indicates the first appearance of a new value on the i-th draw from the urn.

Therefore the same theorem holds with a different interpretation. For example, let Ck^(n) be the number of cycles of length k in a random permutation of the integers {1, 2, . . . , n}. Then

    (C1^(n), C2^(n), . . . , Cm^(n)) →_d (Z1, Z2, . . . , Zm)   as n → ∞.

How to prove the Theorem? There are various analytic and combinatorial approaches. One approach is via moments: show that the moments and joint moments of (Z1, . . . , Zm) coincide with those of independent Poisson rv's. Example: Z1 = ∑_j Yj Yj+1, so that

    EZ1 = ∑_j (1/j) · (1/(j+1)) = 1.

A simpler approach: independent Poisson rv's arise naturally in the context of Poisson processes. So try to embed our problem in a Poisson process.
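Before turning to the Poisson process embedding, here is a quick simulation check of the Surprising Theorem (a sketch of ours, not part of the original talk): it counts records by duration in a long iid sample and verifies that each Zk has mean and variance near 1/k. The sample size, number of replications, and the helper name record_duration_counts are our choices.

    # A simulation sketch (ours): count records of each duration in a long iid
    # sequence and check that Z_k has mean and variance close to 1/k, as a
    # Poisson(1/k) variable should.
    import numpy as np

    def record_duration_counts(n, kmax, rng):
        """Return (Z_1, ..., Z_kmax) for U_1,...,U_n iid Uniform(0,1): Z_k is the
        number of records whose duration (trials needed to break them) equals k."""
        u = rng.random(n)
        run_max = np.maximum.accumulate(u)
        is_record = np.ones(n, dtype=bool)
        is_record[1:] = u[1:] > run_max[:-1]              # U_j beats all earlier values
        durations = np.diff(np.flatnonzero(is_record))    # the final, unbroken record is dropped
        return np.array([(durations == k).sum() for k in range(1, kmax + 1)])

    rng = np.random.default_rng(0)
    zs = np.array([record_duration_counts(50_000, 5, rng) for _ in range(1000)])
    print(zs.mean(axis=0))   # ≈ [1, 1/2, 1/3, 1/4, 1/5]
    print(zs.var(axis=0))    # ≈ the same values, as expected for Poisson counts

Dropping the last (still unbroken) record introduces only an O(1/n) bias in the means, so the agreement with 1/k should be close for n = 50,000.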
Poisson process (PP) with intensity λ(·)

Let 0 < X1 < X2 < X3 < · · · be the arrival times (or points), and let N(B) be the number of arrivals during B ⊂ R:

    N(B) = ∑_i I(Xi ∈ B).

A PP N(·) with intensity λ(·) satisfies:
• For every interval B, N(B) ∼ Poisson(Λ(B)), where Λ(B) ≡ ∫_B λ(u) du.
• For any disjoint intervals B1, B2, . . . , Bk, the counts N(B1), N(B2), . . . , N(Bk) are independent.

Note: λ(t) is the rate of arrivals at time t: P[arrival in (t, t + dt)] ≈ λ(t) dt.

Decomposition of a Poisson process (a marked PP)

Given: a Poisson process N with intensity λ. Suppose that each point Xi is independently assigned a positive integer "mark" Li according to

    P(Li = k | Xi = x) = q(x, k).

The mark assigned to Xi depends only on the value of Xi, and not on the other points Xj or mark values Lj (j ≠ i). Let Nk, k ≥ 1, be the process consisting of those points assigned a mark of k:

    Nk(B) = ∑_i I(Xi ∈ B, Li = k).

The processes N1, N2, N3, . . . are independent and, for each k, Nk is a PP with intensity λk(x) ≡ λ(x) q(x, k). Thus, for any interval B, the counts N1(B), N2(B), N3(B), . . . are independent Poisson rv's with means

    ENk(B) = ∫_B λk(u) du.

Taking B = (0, ∞) counts all the arrivals. Define Zk = Nk((0, ∞)) = ∑_i I(Li = k). We conclude that Z1, Z2, Z3, . . . are independent Poisson rv's with means

    EZk = ∫_0^∞ λk(u) du.

A marked Poisson process obtained from the record values of a sequence of iid exponentials.

[Figure: the record values X1, X2, . . . plotted as points on the time axis 0–100, each labeled with its mark Li, the duration of that record.]

Proof: By the memoryless property of the exponential distribution, when a new record is set, it exceeds the old record by an amount which is Exp(1). Therefore the spacings between the record values are iid exponential, so the record values X1, X2, X3, . . . form a PP with rate 1.

(Remark: If the Ui's are iid from some other continuous distribution, then the record values X1, X2, . . . form a nonhomogeneous Poisson process with intensity function equal to the hazard function of that distribution.)

Let Lj be the duration of the record Xj. Given Xj = x, the number of trials Lj required to break this record is the number of exponential rv's we must observe until one exceeds x. Thus Lj has a geometric distribution with pmf

    P(Lj = k | Xj = x) = (1 − e^{−x})^{k−1} e^{−x} ≡ q(x, k),   k ≥ 1.

This also holds conditionally on the earlier records (Xi, Li), i < j. We have assigned each Poisson point Xj a mark Lj according to q(x, k). Thus the points marked k form a PP with intensity λk(x) = 1 · q(x, k). Since Zk is the number of points marked k, by the result on decomposing Poisson processes, Z1, Z2, Z3, . . . are independent Poisson rv's with

    EZk = ∫_0^∞ (1 − e^{−x})^{k−1} e^{−x} dx = [(1 − e^{−x})^k / k]_0^∞ = 1/k.   QED

By modifications of the Poisson decomposition argument, we can also prove the various extensions of the theorem. Unfortunately, in our proofs of the extensions, we lose the simple "record value" interpretation.
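To make the decomposition step concrete, here is a small simulation sketch (ours, not part of the talk): the record values are generated directly as a rate-1 Poisson process, each point X receives a geometric mark with success probability e^{−X}, and the counts of marks equal to k are checked against 1/k. The horizon of 30 and the function name are our choices; points beyond the horizon contribute to any fixed mark k with probability of order e^{−30}, so they are safely ignored.

    # A thinning/decomposition sketch (ours): simulate the record values as a
    # rate-1 Poisson process (cumulative sums of Exp(1) spacings), attach to each
    # point X a geometric mark with success probability e^{-X}, i.e.
    #   P(L = k | X = x) = (1 - e^{-x})^{k-1} e^{-x},
    # and check that the mark counts Z_k behave like independent Poisson(1/k) rv's.
    import numpy as np

    def marked_pp_counts(kmax, horizon, rng):
        gaps = rng.exponential(1.0, size=4 * horizon)     # far more spacings than needed
        x = np.cumsum(gaps)
        x = x[x < horizon]                                # rate-1 PP restricted to (0, horizon)
        marks = rng.geometric(np.exp(-x))                 # one geometric mark per point
        return np.array([(marks == k).sum() for k in range(1, kmax + 1)])

    rng = np.random.default_rng(1)
    zs = np.array([marked_pp_counts(5, 30, rng) for _ in range(20_000)])
    print(zs.mean(axis=0))                   # ≈ [1, 1/2, 1/3, 1/4, 1/5]
    print(np.corrcoef(zs, rowvar=False))     # ≈ identity matrix: the Z_k are uncorrelated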
Generalizations of the Theorem (to Bern(a, b) sequences)

Assume a > 0, b ≥ 0. Let Y1, Y2, Y3, . . . be independent Bernoulli rv's with

    P(Yn = 1) = a / (a + b + n − 1)   for n = 1, 2, 3, . . .

This is called a Bern(a, b) sequence. (The earlier case was a = 1, b = 0.)

Let Zk be the number of occurrences of pairs of successes which are k apart, that is, pairs of successes separated by exactly k − 1 failures:

    Z1 = ∑_i I{(Yi, Yi+1) = (1, 1)}
    Z2 = ∑_i I{(Yi, Yi+1, Yi+2) = (1, 0, 1)}
    Z3 = ∑_i I{(Yi, Yi+1, Yi+2, Yi+3) = (1, 0, 0, 1)}
    etc.

• Bern(a, 0) sequences arise in connection with the "biased permutations" studied by Arratia-Barbour-Tavaré (which relate to the Ewens sampling formula).
• Bern(a, b) sequences arise in the Polya-Blackwell-MacQueen urn scheme.

Generalizations of the Theorem

• If b = 0, then Z1, Z2, Z3, . . . are independent Poisson rv's with EZk = a/k.
• (Holst, 2007) If b > 0, then the joint distribution of (Z1, Z2, Z3, . . .) can be expressed as a mixture of independent Poissons: one can construct a probability space with a random variable W ∼ Beta(b, a) such that, conditional on W = w, the rv's Z1, Z2, Z3, . . . are independent Poisson rv's with Zk having (conditional) mean

    λk = a(1 − w^k)/k.

This result is an instance of the Mixture of Independent Poissons (MIP) property: there is a random variable (or vector) W conditioned on which the count variables Z1, Z2, . . . are independent Poisson.
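As an illustration of Holst's mixture (a sketch of ours, not from the talk), the snippet below compares direct counting of the strings Zk in a long Bern(a, b) sequence with sampling from the representation W ∼ Beta(b, a), Zk | W ∼ Poisson(a(1 − W^k)/k). The parameter values, the truncation at N terms, and the helper names are our choices.

    # A simulation sketch (ours) of Holst's mixture representation for b > 0:
    # draw W ~ Beta(b, a) and then independent Z_k ~ Poisson(a (1 - W^k) / k);
    # these counts should match the string counts obtained by direct counting.
    # (The direct count is truncated at N terms, which perturbs E Z_k only by O(a^2/N).)
    import numpy as np

    a, b, kmax, N, reps = 1.0, 2.0, 3, 20_000, 2000
    rng = np.random.default_rng(3)
    ks = np.arange(1, kmax + 1)

    def direct_counts():
        n = np.arange(1, N + 1)
        y = rng.random(N) < a / (a + b + n - 1)       # Bern(a, b): P(Y_n = 1) = a/(a+b+n-1)
        gaps = np.diff(np.flatnonzero(y))             # gap of k between 1's  <=>  '1 0^{k-1} 1'
        return np.array([(gaps == k).sum() for k in ks])

    def mixture_counts():
        w = rng.beta(b, a)                            # W ~ Beta(b, a)
        return rng.poisson(a * (1 - w**ks) / ks)      # Z_k | W independent Poisson

    direct = np.array([direct_counts() for _ in range(reps)])
    mixed = np.array([mixture_counts() for _ in range(reps)])
    print(direct.mean(axis=0), mixed.mean(axis=0))    # e.g. E Z_1 = a^2/(a+b) in both cases
    print(direct.var(axis=0), mixed.var(axis=0))      # variances exceed the means (over-dispersion)

The over-dispersion (variance larger than mean) visible in both columns is exactly what distinguishes the b > 0 mixture from genuinely independent Poisson counts.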
Finite Bernoulli sequences

Holst (2007, 2008) studied finite Bernoulli sequences. Let Bern(a, b, n) denote the distribution of the finite initial segment (Y1, . . . , Yn) of a Bern(a, b). Holst obtained expressions for:
• the joint factorial moments of (Z1, Z2, . . . , Zk) from Bern(a, 0, n),
• the factorial moments of Z1 from Bern(a, b, n).

Note: now Z1 ≡ ∑_{i=1}^{n−1} I{(Yi, Yi+1) = (1, 1)}, etc. In this case, the Zi's are no longer exactly Poisson, or exactly independent, or expressible as a MIP.

In our latest work, we:
• consider initial segments (Y1, Y2, . . . , Yζ) of Bern(a, b) which are finite but have a random length ζ;
• describe a general class of random variables ζ for which the counts Z1, Z2, . . . retain the MIP property;
• for this class of ζ, give a general expression for the joint factorial moments of (Z1, Z2, . . . , Zk).

Method of proof: an extension of the Poisson process embedding argument.

Simplest case of our general class of ζ: If ζ is independent of Y1, Y2, . . . and has a geometric distribution (allowing ζ = 0), then the MIP property holds. (We obtain the mixture explicitly; see below.) Suppose P(ζ = k) = (1 − ξ) ξ^k for k ≥ 0, where 0 < ξ < 1. By specializing our results to this case and using a power series expansion in ξ, we deduce the conditional moments of Z1, Z2, . . . given {ζ = n}. This yields the results of Holst for Bern(a, b, n) and Bern(a, 0, n), and also yields a general expression for the joint factorial moments for Bern(a, b, n).

Explicit formulas for the mixture when b = 0 (geometric ζ):
• Conditional on {T = t}, the counts Z1, Z2, . . . are independent Poisson random variables with means

    E(Zk | T = t) = a ξ^k t^k / k   for k = 1, 2, 3, . . .

• The rv T satisfies P(0 ≤ T < 1) = 1 and

    P(T > t) = ξ (1 − t)^a / (1 − ξ t)^a   for 0 ≤ t < 1.

(When b > 0 the formulas are messier.)

Another member of our general class of ζ: Observe the sequence Y1, Y2, . . . and define

    ζ = first time we observe m consecutive zeros = inf{j : Y(j−m+1) = Y(j−m+2) = · · · = Y(j−1) = Yj = 0}.

When the Yj's indicate record values in an iid sequence (i.e., a = 1, b = 0), this ζ corresponds to stopping the competition when a record has stood for m trials.

Explicit formula for the mixture when b = 0 (stop after m consecutive zeros):
• Conditional on {T = t}, the counts Z1, Z2, . . . are independent Poisson random variables with means

    E(Zk | T = t) = a t^k I(k ≤ m) / k.

• The rv T satisfies P(0 < T < 1) = 1 and

    P(T > t) = exp( − ∫_0^t a x^m / (1 − x)^2 dx )   for 0 ≤ t < 1.

The general class of ζ (preserving the MIP)

Let Y1, Y2, Y3, . . . be a Bern(a, b) sequence, and let L0, L1, L2, . . . be the waiting times (numbers of trials) between successes (1's) in this sequence.

Example:  0001 0000001 1 001 000000000001 · · ·   gives L0 = 4, L1 = 7, L2 = 1, L3 = 3, L4 = 12.

Let φ = (φ1, φ2, . . .) and ψ = (ψ1, ψ2, . . .) be arbitrary sequences of constants in [0, 1]. Suppose we observe the values Y1, Y2, Y3, . . . one by one. Every time a '1' is observed, a random decision is made whether or not to halt (stop observing) the sequence at that point, with the halting probability depending only on the waiting time since the previous '1'. Two cases:
• Halt at L0 with probability 1 − φ_{L0}.
• For n ≥ 1, halt at ∑_{i=0}^{n} Li with conditional probability 1 − ψ_{Ln}.

Let τ be the halting point. Now back up by 1: define ζ = τ − 1.

Theorem: The sequence Y1, Y2, . . . , Yζ has counts Z1, Z2, . . . satisfying the MIP property. (We give an explicit mixture.)

In our two examples:
• ζ ∼ Geometric: equivalent to taking φk = ψk = ξ^k for all k.
• Stop after m consecutive zeros: equivalent to taking φk = ψk = 1 for k ≤ m and φk = ψk = 0 for k > m.

Idea of proof (in the Bern(a, 0) case with L0 = 1): Consider a marked Poisson process with points 0 < X1 < X2 < X3 < · · · and marks L1, L2, L3, . . ., with intensity λ*(·) and mark distribution q*(x, k). Define a Bernoulli sequence Y1, Y2, . . . with 1's in positions K0 = L0, K1 = L0 + L1, K2 = L0 + L1 + L2, . . .; the n-th '1' is in position Kn = ∑_{i=0}^{n} Li for n = 0, 1, 2, . . .

Let T > 0 be a rv independent of the marked PP. Suppose we observe only the initial part of the marked PP: just the points Xi less than T. This leads to a finite Bernoulli sequence Y1, Y2, . . . , Yζ consisting of the initial part of Y1, Y2, . . ., where ζ = τ − 1 with τ = inf{Ki : Xi ≥ T}. For this finite sequence, the counts are

    Zk = ∑_{i ≥ 1, Xi < T} I(Li = k)   (count the points marked k occurring before T).

Since
• the points marked k form a PP with intensity λ*(x) q*(x, k) (independent for each k), and
• T is independent of the marked PP,
we obtain the MIP property: conditional on T,
• Z1, Z2, Z3, . . . are independent Poisson rv's, and
• E(Zk | T) = ∫_0^T λ*(x) q*(x, k) dx.

If λ*, q*, and T are chosen in just the right way, the finite sequence Y1, . . . , Yζ coincides with that obtained in the sequential halting process (based on φ and ψ) in our Theorem. Choosing them in "just the right way" requires the "death mark" construction in our forthcoming JAP paper.
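As a rough sanity check on the explicit b = 0 mixture for geometric ζ stated above (a sketch of ours, not from the paper), the snippet below stops a Bern(a, 0) sequence at an independent geometric time, counts the strings Zk, and compares their means with the values implied by the mixture, using EZk = E[a ξ^k T^k / k] and E T^k = ∫_0^1 k t^{k−1} P(T > t) dt. The parameter values, the helper name, and this integral form of the target are our choices.

    # A Monte Carlo check (ours) of the geometric-zeta mixture when b = 0:
    # stop a Bern(a, 0) sequence at an independent geometric zeta, count the
    # strings Z_k, and compare their means with
    #   E Z_k = E[ a xi^k T^k / k ] = a xi^(k+1) * int_0^1 t^(k-1) (1-t)^a (1-xi t)^(-a) dt,
    # which follows from P(T > t) = xi (1-t)^a / (1 - xi t)^a.
    import numpy as np
    from scipy.integrate import quad

    a, xi, kmax = 2.0, 0.9, 4
    rng = np.random.default_rng(2)

    def stopped_counts():
        zeta = rng.geometric(1 - xi) - 1              # P(zeta = j) = (1 - xi) xi^j, j >= 0
        n = np.arange(1, zeta + 1)
        y = rng.random(zeta) < a / (a + n - 1)        # Bern(a, 0): P(Y_n = 1) = a/(a+n-1)
        gaps = np.diff(np.flatnonzero(y))             # gap of k between 1's  <=>  '1 0^{k-1} 1'
        return np.array([(gaps == k).sum() for k in range(1, kmax + 1)])

    zs = np.array([stopped_counts() for _ in range(50_000)])
    for k in range(1, kmax + 1):
        integrand = lambda t: t**(k - 1) * (1 - t)**a * (1 - xi * t)**(-a)
        target = a * xi**(k + 1) * quad(integrand, 0, 1)[0]
        print(k, round(zs[:, k - 1].mean(), 4), round(target, 4))   # the two columns should roughly agree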
References

Arratia, R., Barbour, A.D. and Tavaré, S. (1992). Poisson process approximations for the Ewens sampling formula. Ann. Appl. Probab. 2, 519–535.
Arratia, R., Barbour, A.D. and Tavaré, S. (2003). Logarithmic Combinatorial Structures: A Probabilistic Approach. European Mathematical Society, Zürich.
Hahlin, L.-O. (1995). Double Records. Research Report 1995:12, Department of Mathematics, Uppsala University, Uppsala.
Holst, L. (2007). Counts of failure strings in certain Bernoulli sequences. J. Appl. Probab. 44, 824–830.
Holst, L. (2008a). The number of two consecutive successes in a Hoppe-Pólya urn. J. Appl. Probab. 45, 901–906.
Holst, L. (2008b). A note on embedding certain Bernoulli sequences in marked Poisson processes. J. Appl. Probab. 45, 1181–1185.
Holst, L. (2009). On consecutive records in certain Bernoulli sequences. J. Appl. Probab. 46, 1201–1208.
Holst, L. (2011). A note on records in a random sequence. Ark. Mat. 49, 351–356.
Huffer, F., Sethuraman, J. and Sethuraman, S. (2008). A study of counts of Bernoulli strings via conditional Poisson processes. Available at arXiv:0801.2115.
Huffer, F., Sethuraman, J. and Sethuraman, S. (2009). A study of counts of Bernoulli strings via conditional Poisson processes. Proc. Amer. Math. Soc. 137, 2125–2134.
Huffer, F.W. and Sethuraman, J. (2012). Joint distributions of counts of strings in finite Bernoulli sequences. To appear in J. Appl. Probab. 49, No. 3.
Joffe, A., Marchand, E., Perron, F. and Popadiuk, P. (2004). On sums of products of Bernoulli variables and random permutations. J. Theoret. Probab. 17, 285–292.
Kingman, J.F.C. (1993). Poisson Processes. Oxford University Press, Oxford.
Kolchin, V.F. (1971). A problem of the allocation of particles in cells and cycles of random permutations. Theory Probab. Appl. 16, 74–90.
Resnick, S.I. (1992). Adventures in Stochastic Processes. Birkhäuser, Boston.
Sethuraman, J. and Sethuraman, S. (2004). On counts of Bernoulli strings and connections to rank orders and random permutations. In A Festschrift for Herman Rubin. IMS Lecture Notes Monogr. Ser. 45, 140–152.