Chapter 1

Dirichlet distribution

The Dirichlet distribution is used intensively in various fields: biology [11], astronomy [24], text mining [4], among others. It can be seen as a random distribution on a finite set. The Dirichlet distribution is a very popular prior in Bayesian statistics because the corresponding posterior distribution is again a Dirichlet distribution. In this chapter we give a complete presentation of this law: representation by Gamma distributions, limit distribution in a contamination model (the Polya urn scheme), and so on.

1.1 Random probability vectors

Consider a partition of a nonvoid finite set $E$ of cardinality $\sharp E = n \in \mathbb{N}^*$ into $d$ nonvoid disjoint subsets. To such a partition corresponds a partition of the integer $n$, say $c_1, \dots, c_d$, that is, a finite family of positive integers such that $c_1 + \dots + c_d = n$. Thus, if $p_j = c_j / n$, we have $p_1 + \dots + p_d = 1$. In biology, for example, $p_j$ can represent the proportion of the $j$-th species in a population. We are thus led to introduce the following $(d-1)$-dimensional simplex:
$$\triangle_{d-1} = \Big\{(p_1, \dots, p_d) : p_j \ge 0, \ \sum_{j=1}^d p_j = 1\Big\}.$$
Letting $n$ tend to infinity leads to the following notion:

Definition 1.1.1. A mass-partition is any infinite numerical sequence $p = (p_1, p_2, \dots)$ such that $p_1 \ge p_2 \ge \dots \ge 0$ and $\sum_1^\infty p_j = 1$. The space of mass-partitions is denoted by
$$\nabla_\infty = \Big\{(p_1, p_2, \dots) : p_1 \ge p_2 \ge \dots; \ p_j \ge 0, \ j \ge 1; \ \sum_{j=1}^\infty p_j = 1\Big\}.$$

Lemma 1.1.1. (Bertoin [28], page 63) Let $x_1, \dots, x_{d-1}$ be $d-1$ i.i.d. random variables uniformly distributed on $[0,1]$ and let $x_{(1)} < \dots < x_{(d-1)}$ denote their order statistics. Then the random vector $(x_{(1)}, x_{(2)} - x_{(1)}, \dots, x_{(d-1)} - x_{(d-2)}, 1 - x_{(d-1)})$ is uniformly distributed on $\triangle_{d-1}$.

1.2 Polya urn (Blackwell and MacQueen [3])

We consider an urn that contains $d$ colored balls numbered from $1$ to $d$. Initially there is exactly one ball of each color in the urn. We draw a ball, observe its color, and put it back in the urn together with another ball of the same color. Thus at time $n$ there are $n + d$ balls in the urn: we have added $n = N_1 + \dots + N_d$ balls, with $N_j$ balls of color $j$. We are going to show that the distribution of $(N_1/n, N_2/n, \dots, N_d/n)$ converges to a limit distribution.

1.2.1 Markov chain

Proposition 1.2.1.
$$\lim_{n \to \infty} \Big(\frac{N_1}{n}, \dots, \frac{N_d}{n}\Big) = (Z_1, Z_2, \dots, Z_d),$$
where $(Z_1, Z_2, \dots, Z_d)$ has the uniform distribution on the simplex $\triangle_{d-1}$.

Proof. Denote by $\pi_i : \mathbb{R}^d \to \mathbb{R}$, $x = (x_1, \dots, x_d) \mapsto x_i$, the $i$-th projection, and set $\theta_i(x) = (x_1, \dots, x_{i-1}, x_i + 1, x_{i+1}, \dots, x_d)$. Let
$$S(x) = \sum_{i=1}^d x_i \qquad \text{and} \qquad f_i(x) = \frac{\pi_i(x) + 1}{S(x) + d}.$$
Define a transition kernel by
$$P(x, \theta_i(x)) = \frac{\pi_i(x) + 1}{S(x) + d}, \qquad P(x, y) = 0 \ \text{ if } y \notin \{\theta_1(x), \dots, \theta_d(x)\}.$$
Recall that for any non-negative (resp. bounded) measurable function $g$ defined on $\mathbb{R}^d$, the function $Pg$ is defined as $Pg(x) = \int_{\mathbb{R}^d} g(y) P(x, dy)$. Here we see that
$$Pg(x) = \sum_{j=1}^d g(\theta_j(x)) \frac{\pi_j(x) + 1}{S(x) + d}.$$
First step: consider $Y_n = (Y_n^1, \dots, Y_n^d)$, where $Y_n^i$ is the number of balls of color $i$ added to the urn after $n$ steps. Clearly $Y_{n+1}$ depends only on the state at step $n$, so $(Y_n)_n$ is a Markov chain with transition kernel
$$P(Y_n, \theta_i(Y_n)) = \frac{\pi_i(Y_n) + 1}{S(Y_n) + d}$$
and $Y_0 = (0, \dots, 0)$. Before continuing with the martingale argument, the simulation sketch below illustrates the convergence numerically.
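As an illustrative aside (a sketch, not part of the proof), the urn dynamics are easy to simulate. The snippet assumes only numpy; each run gives one approximate draw of $(Z_1, \dots, Z_d)$, and averaging many runs shows $E(Z_i) = 1/d$, consistent with Proposition 1.2.1.

```python
import numpy as np

def polya_urn(d=3, n_steps=2000, rng=None):
    """Simulate a Polya urn with d colors and one initial ball per color.

    Returns the proportions N_j / n of balls added of each color.
    """
    rng = rng or np.random.default_rng(0)
    counts = np.ones(d)                 # one ball of each color initially
    for _ in range(n_steps):
        # draw a ball with probability proportional to current counts,
        # then return it together with one extra ball of the same color
        color = rng.choice(d, p=counts / counts.sum())
        counts[color] += 1
    added = counts - 1                  # N_j = balls added of color j
    return added / added.sum()

# each run approximates one draw of (Z_1, ..., Z_d)
samples = np.array([polya_urn(rng=np.random.default_rng(s)) for s in range(200)])
print(samples.mean(axis=0))            # close to (1/3, 1/3, 1/3) since E(Z_i) = 1/d
```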
On the other hand,
$$P f_i(Y_n) = \sum_{j=1}^d \frac{\pi_j(Y_n) + 1}{S(Y_n) + d} \cdot \frac{\pi_i(\theta_j(Y_n)) + 1}{S(\theta_j(Y_n)) + d},$$
since
$$\pi_i(\theta_j(Y_n)) = \pi_i(Y_n) \ \text{ if } i \ne j, \qquad \pi_i(\theta_i(Y_n)) = \pi_i(Y_n) + 1, \qquad S(\theta_j(Y_n)) = S(Y_n) + 1.$$
Then
$$P f_i(Y_n) = \sum_{j \ne i} \frac{\pi_j(Y_n) + 1}{S(Y_n) + d} \cdot \frac{\pi_i(Y_n) + 1}{S(Y_n) + d + 1} + \frac{\pi_i(Y_n) + 1}{S(Y_n) + d} \cdot \frac{\pi_i(Y_n) + 2}{S(Y_n) + d + 1}$$
$$= \frac{\pi_i(Y_n) + 1}{(S(Y_n) + d)(S(Y_n) + d + 1)} \Big[\pi_i(Y_n) + 2 + \sum_{j \ne i} (\pi_j(Y_n) + 1)\Big]$$
$$= \frac{\pi_i(Y_n) + 1}{(S(Y_n) + d)(S(Y_n) + d + 1)} \big[\pi_i(Y_n) + 2 + S(Y_n) + d - 1 - \pi_i(Y_n)\big] = f_i(Y_n). \tag{1.1}$$
This implies that $(f_i(Y_n))_n$ is a positive martingale, which converges almost surely to a random variable $Z_i$. Since $f_i(Y_n)$ is bounded by 1, it also converges in every $L^p$ space, by the bounded convergence theorem. We then see that
$$\frac{\pi_i(Y_n)}{n} = \frac{n + d}{n} f_i(Y_n) - \frac{1}{n}$$
converges to the same limit $Z_i$ almost surely and in $L^p$. By the martingale property we moreover have $E(f_i(Y_n)) = E(f_i(Y_0))$. Consequently $E(\lim_{n} f_i(Y_n)) = \lim_{n} E(f_i(Y_n)) = E(f_i(Y_0))$, so
$$E(Z_i) = E(f_i(Y_0)) = \frac{1}{d}.$$
Second step: let
$$\wedge_{d-1} = \Big\{(p_1, \dots, p_{d-1}) : p_i \ge 0, \ \sum_{i=1}^{d-1} p_i \le 1\Big\},$$
and
$$h_u(Y_n) = \frac{(S(Y_n) + d - 1)!}{\prod_{i=1}^d \pi_i(Y_n)!} \, u_1^{\pi_1(Y_n)} \cdots u_d^{\pi_d(Y_n)}.$$
The uniform measure $\lambda_d$ on $\triangle_{d-1}$ is defined as follows: for any bounded Borel function $F(u_1, \dots, u_d)$,
$$\int_{\triangle_{d-1}} F(u) \, \lambda_d(du) = \int_{\wedge_{d-1}} F(u_1, \dots, u_{d-1}, 1 - u_1 - \dots - u_{d-1}) \, du_1 \cdots du_{d-1}.$$
Now let us compare the moments of $(Z_1, Z_2, \dots, Z_d)$ with those of $\lambda_d$. A computation similar to (1.1) gives
$$h_u(\theta_i(Y_n)) = \frac{S(Y_n) + d}{\pi_i(Y_n) + 1} \, u_i \, h_u(Y_n),$$
hence
$$P h_u(Y_n) = h_u(Y_n) \sum_{i=1}^d u_i = h_u(Y_n),$$
so $(h_u(Y_n))_n$ is a martingale, and similarly
$$g_k(Y_n) = \int_{\triangle_{d-1}} h_u(Y_n) \, u_1^{k_1} \cdots u_d^{k_d} \, \lambda_d(du)$$
is a martingale, because
$$P g_k(Y_n) = \sum_{i=1}^d P(Y_n, \theta_i(Y_n)) \int_{\triangle_{d-1}} h_u(\theta_i(Y_n)) \, u_1^{k_1} \cdots u_d^{k_d} \, \lambda_d(du) = \int_{\triangle_{d-1}} P h_u(Y_n) \, u_1^{k_1} \cdots u_d^{k_d} \, \lambda_d(du) = g_k(Y_n).$$
This gives $E(g_k(Y_n)) = E(g_k(Y_0))$. On the other hand,
$$g_k(Y_n) = \frac{\prod_{i=1}^d [\pi_i(Y_n) + 1] \cdots [\pi_i(Y_n) + k_i]}{(n + d) \cdots (n + S(k) + d - 1)} = \frac{\prod_{i=1}^d \frac{\pi_i(Y_n) + 1}{n} \cdots \frac{\pi_i(Y_n) + k_i}{n}}{\frac{n + d}{n} \cdots \frac{n + S(k) + d - 1}{n}},$$
so that, for $n \ge \max_i k_i$,
$$0 \le g_k(Y_n) \le \prod_{i=1}^d 2^{k_i} = 2^{S(k)}.$$
Therefore, by the bounded convergence theorem,
$$\lim_{n \to \infty} E(g_k(Y_n)) = E\big(\lim_{n \to \infty} g_k(Y_n)\big) = E(Z_1^{k_1} \cdots Z_d^{k_d}) = \frac{(d-1)! \prod_{i=1}^d k_i!}{(S(k) + d - 1)!} = c_d \int_{\triangle_{d-1}} u_1^{k_1} \cdots u_d^{k_d} \, \lambda_d(du),$$
where $c_d = (d-1)!$. Indeed, if $m_k = \int_{\triangle_{d-1}} u_1^{k_1} \cdots u_d^{k_d} \, \lambda_d(du)$, integrations and recurrences yield
$$m_k = \frac{\prod_{i=1}^d k_i!}{(S(k) + d - 1)!}.$$
Taking $(k_1, \dots, k_d) = (0, \dots, 0)$, we see that $c_d \lambda_d$ is a probability measure with $c_d = (d-1)!$. Further, if $\mu$ is the distribution of $(Z_1, \dots, Z_d)$, then $c_d \lambda_d$ and $\mu$ have the same moments, and since $\triangle_{d-1}$ is compact, the monotone class theorem yields $\mu = c_d \lambda_d$. □

1.2.2 Gamma, Beta and Dirichlet densities

Let $\alpha > 0$. The gamma distribution with parameter $\alpha$, denoted $\Gamma(\alpha, 1)$, is defined by the probability density function
$$f(y) = \frac{y^{\alpha - 1} e^{-y}}{\Gamma(\alpha)} \mathbb{1}_{\{y > 0\}}.$$
Let $Z_1, \dots, Z_d$ be $d$ independent real random variables with gamma distributions $\Gamma(\alpha_1, 1), \dots, \Gamma(\alpha_d, 1)$, respectively; then it is well known that $Z = Z_1 + \dots + Z_d$ has distribution $\Gamma(\alpha_1 + \dots + \alpha_d, 1)$.

Let $a, b > 0$. The beta distribution with parameters $(a, b)$, denoted $\beta(a, b)$, is defined by the probability density function
$$\frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} x^{a-1} (1 - x)^{b-1} \mathbb{1}_{\{0 < x < 1\}}.$$
From these densities it is easily seen that the following function is a density:

Definition 1.2.1. For any $\alpha = (\alpha_1, \dots, \alpha_d)$ with $\alpha_i > 0$ for every $i = 1, \dots, d$, the density function $d(y_1, y_2, \dots, y_{d-1} \mid \alpha)$ defined as
$$\frac{\Gamma(\alpha_1 + \dots + \alpha_d)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_d)} \, y_1^{\alpha_1 - 1} \cdots y_{d-1}^{\alpha_{d-1} - 1} \Big(1 - \sum_{h=1}^{d-1} y_h\Big)^{\alpha_d - 1} \mathbb{1}_{\wedge_{d-1}}(y) \tag{1.2}$$
is called the Dirichlet density with parameter $(\alpha_1, \dots, \alpha_d)$.

Proposition 1.2.2. Let $(Z_1, Z_2, \dots, Z_d)$ be uniformly distributed on $\triangle_{d-1}$. Then the random vector $(Z_1, Z_2, \dots, Z_{d-1})$ has the Dirichlet density (1.2) with $\alpha_1 = \dots = \alpha_d = 1$.

Proof. Let $\lambda_i \in \mathbb{N}$ for $i \in \{1, \dots, d\}$. Let $(Y_1, \dots, Y_{d-1})$ be a random vector with Dirichlet density (1.2) and set $Y_d = 1 - \sum_{i=1}^{d-1} Y_i$. Then
$$E(Y_1^{\lambda_1} \cdots Y_d^{\lambda_d}) = E\Big(Y_1^{\lambda_1} \cdots Y_{d-1}^{\lambda_{d-1}} \Big[1 - \sum_{i=1}^{d-1} Y_i\Big]^{\lambda_d}\Big)$$
$$= \frac{\Gamma(\alpha_1 + \dots + \alpha_d)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_d)} \int_{\wedge_{d-1}} y_1^{\alpha_1 + \lambda_1 - 1} \cdots y_{d-1}^{\alpha_{d-1} + \lambda_{d-1} - 1} \Big[1 - \sum_{i=1}^{d-1} y_i\Big]^{\alpha_d + \lambda_d - 1} dy_1 \cdots dy_{d-1}$$
$$= \frac{\Gamma(\alpha_1 + \dots + \alpha_d) \, \Gamma(\alpha_1 + \lambda_1) \cdots \Gamma(\alpha_d + \lambda_d)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_d) \, \Gamma\big((\alpha_1 + \dots + \alpha_d) + \sum_{i=1}^d \lambda_i\big)}.$$
Consequently, if the $\lambda_i$, $i \in \{1, \dots, d\}$, are non-negative integers and $\alpha_1 = \dots = \alpha_d = 1$, then
$$E(Y_1^{\lambda_1} \cdots Y_d^{\lambda_d}) = \frac{(d-1)! \prod_{i=1}^d \lambda_i!}{((d-1) + S(\lambda))!}.$$
Now the proof of Proposition 1.2.1 shows that $(Z_1, \dots, Z_d)$ and $(Y_1, \dots, Y_d)$ have the same moments, and thus the same distribution. Consequently $(Z_1, \dots, Z_{d-1})$ has the same distribution as $(Y_1, \dots, Y_{d-1})$, which by construction is $d(y_1, \dots, y_{d-1} \mid \alpha)$. □

1.3 Dirichlet distribution

The Dirichlet density is not easy to handle, and the following theorem gives an interesting construction in which this density appears.

Theorem 1.3.1. Let $Z_1, \dots, Z_d$ be $d$ independent real random variables with gamma distributions $\Gamma(\alpha_1, 1), \dots, \Gamma(\alpha_d, 1)$ respectively, and let $Z = Z_1 + \dots + Z_d$. Then the random vector $(Z_1/Z, \dots, Z_{d-1}/Z)$ has a Dirichlet density with parameters $(\alpha_1, \dots, \alpha_d)$.

Proof. The mapping
$$(y_1, \dots, y_d) \mapsto \Big(\frac{y_1}{y_1 + \dots + y_d}, \dots, \frac{y_{d-1}}{y_1 + \dots + y_d}, \ y_1 + \dots + y_d\Big)$$
is a diffeomorphism from $(0, \infty)^d$ to $\wedge_{d-1} \times (0, \infty)$, with reciprocal function
$$(y_1, \dots, y_d) \mapsto \Big(y_1 y_d, \dots, y_{d-1} y_d, \ y_d \Big[1 - \sum_{i=1}^{d-1} y_i\Big]\Big),$$
whose Jacobian is $y_d^{d-1}$. The density of $(Z_1/Z, \dots, Z_{d-1}/Z, Z)$ at the point $(y_1, \dots, y_d)$ is therefore equal to
$$e^{-y_d} \, y_1^{\alpha_1 - 1} \cdots y_{d-1}^{\alpha_{d-1} - 1} \Big(1 - \sum_{i=1}^{d-1} y_i\Big)^{\alpha_d - 1} \frac{y_d^{\alpha_1 + \dots + \alpha_d - d}}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_d)} \, y_d^{d-1}.$$
Integrating w.r.t. $y_d$ and using the equality $\int_0^\infty e^{-y_d} y_d^{\alpha_1 + \dots + \alpha_d - 1} dy_d = \Gamma(\alpha_1 + \dots + \alpha_d)$, we see that the density of $(Z_1/Z, \dots, Z_{d-1}/Z)$ is a Dirichlet density with parameters $(\alpha_1, \dots, \alpha_d)$. □

Definition 1.3.1. Let $Z_1, \dots, Z_d$ be $d$ independent real random variables with gamma distributions $\Gamma(\alpha_1, 1), \dots, \Gamma(\alpha_d, 1)$, respectively, and let $Z = Z_1 + \dots + Z_d$. The Dirichlet distribution with parameters $(\alpha_1, \dots, \alpha_d)$, denoted $D(\alpha_1, \dots, \alpha_d)$, is the distribution of the random vector $(Z_1/Z, \dots, Z_d/Z)$.

Note that the Dirichlet distribution is singular w.r.t. Lebesgue measure in $\mathbb{R}^d$, since it is supported by $\triangle_{d-1}$, which has Lebesgue measure 0. The following is easily proved.

Proposition 1.3.1. With the same notation as in Theorem 1.3.1, let $Y_i = Z_i/Z$, $i = 1, \dots, d$. Then $Y_i$ has a beta distribution $\beta(\alpha_i, \alpha_1 + \dots + \alpha_{i-1} + \alpha_{i+1} + \dots + \alpha_d)$, and
$$E(Y_i) = \frac{\alpha_i}{\alpha_1 + \dots + \alpha_d}, \qquad E(Y_i Y_j) = \frac{\alpha_i \alpha_j}{(\alpha_1 + \dots + \alpha_d)(\alpha_1 + \dots + \alpha_d + 1)} \quad (i \ne j).$$
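The gamma representation is convenient for simulation. The following sketch (illustrative, with arbitrary parameters; not from the text) draws from $D(\alpha)$ by normalizing independent gammas as in Theorem 1.3.1 and numerically checks the mean, the product moment, and the beta marginal of Proposition 1.3.1.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([2.0, 3.0, 5.0])        # illustrative parameters
n = 200_000

# Theorem 1.3.1 / Definition 1.3.1: normalize independent gammas
Z = rng.gamma(shape=alpha, scale=1.0, size=(n, alpha.size))
Y = Z / Z.sum(axis=1, keepdims=True)     # each row ~ D(alpha)

a0 = alpha.sum()
print(Y.mean(axis=0), alpha / a0)        # E(Y_i) = alpha_i / alpha_0
print((Y[:, 0] * Y[:, 1]).mean(),        # E(Y_i Y_j), i != j
      alpha[0] * alpha[1] / (a0 * (a0 + 1)))

# Proposition 1.3.1: Y_1 should match Beta(alpha_1, alpha_0 - alpha_1)
B = rng.beta(alpha[0], a0 - alpha[0], size=n)
print(np.quantile(Y[:, 0], [0.25, 0.5, 0.75]),
      np.quantile(B, [0.25, 0.5, 0.75]))
```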
1.4 Posterior distribution and Bayesian estimation

Consider the Dirichlet distribution $D(\alpha_1, \dots, \alpha_d)$ as a prior on $p = (p_1, p_2, \dots, p_d) \in \triangle_{d-1}$. Let $X$ be a random variable taking values in $\{1, \dots, d\}$ such that $P(X = i \mid p) = p_i$. Then the posterior distribution of $p$ given $X = i$ is Dirichlet $D(\alpha_1, \dots, \alpha_{i-1}, \alpha_i + 1, \alpha_{i+1}, \dots, \alpha_d)$.

Indeed, let $N_i = \sum_{j=1}^n \mathbb{1}_{X_j = i}$, $1 \le i \le d$. The likelihood of the sample is
$$\prod_{i=1}^{d-1} p_i^{N_i} \Big(1 - \sum_{i=1}^{d-1} p_i\Big)^{N_d}.$$
If the prior distribution of $p$ is $D(\alpha_1, \dots, \alpha_d)$, the posterior density is proportional to
$$\prod_{i=1}^{d-1} p_i^{\alpha_i + N_i - 1} \Big(1 - \sum_{i=1}^{d-1} p_i\Big)^{\alpha_d + N_d - 1}.$$
Thus the posterior distribution of $p$ is $D(\alpha_1 + N_1, \alpha_2 + N_2, \dots, \alpha_d + N_d)$. If $(X_1, \dots, X_n)$ is a sample from the law $p = (p_1, \dots, p_d)$ on $\{1, \dots, d\}$, the posterior mean (Bayes) estimate of $p$ is
$$p' = \Big(\frac{\alpha_1 + N_1}{\sum_{i=1}^d \alpha_i + n}, \frac{\alpha_2 + N_2}{\sum_{i=1}^d \alpha_i + n}, \dots, \frac{\alpha_d + N_d}{\sum_{i=1}^d \alpha_i + n}\Big).$$

Proposition 1.4.1. ([19]) Let $r_1, \dots, r_l$ be $l$ integers such that $0 < r_1 < \dots < r_l = d$.

1. If $(Y_1, \dots, Y_d) \sim D(\alpha_1, \dots, \alpha_d)$, then
$$\Big(\sum_{1}^{r_1} Y_i, \sum_{r_1 + 1}^{r_2} Y_i, \dots, \sum_{r_{l-1} + 1}^{r_l} Y_i\Big) \sim D\Big(\sum_{1}^{r_1} \alpha_i, \sum_{r_1 + 1}^{r_2} \alpha_i, \dots, \sum_{r_{l-1} + 1}^{r_l} \alpha_i\Big).$$

2. If the prior distribution of $(Y_1, \dots, Y_d)$ is $D(\alpha_1, \dots, \alpha_d)$ and if $P(X = j \mid Y_1, \dots, Y_d) = Y_j$ a.s. for $j = 1, \dots, d$, then the posterior distribution of $(Y_1, \dots, Y_d)$ given $X = j$ is $D(\alpha_1^{(j)}, \dots, \alpha_d^{(j)})$, where
$$\alpha_i^{(j)} = \begin{cases} \alpha_i & \text{if } i \ne j, \\ \alpha_j + 1 & \text{if } i = j. \end{cases}$$

3. Let $D(y_1, \dots, y_d \mid \alpha_1, \dots, \alpha_d)$ denote the distribution function of the Dirichlet distribution $D(\alpha_1, \dots, \alpha_d)$, that is, $D(y_1, \dots, y_d \mid \alpha_1, \dots, \alpha_d) = P(Y_1 \le y_1, \dots, Y_d \le y_d)$. Then
$$\int_0^{z_1} \!\!\cdots\! \int_0^{z_d} y_j \, dD(y_1, \dots, y_d \mid \alpha_1, \dots, \alpha_d) = \frac{\alpha_j}{\alpha} \, D(z_1, \dots, z_d \mid \alpha_1^{(j)}, \dots, \alpha_d^{(j)}),$$
where $\alpha = \alpha_1 + \dots + \alpha_d$.

Proof.
1. Recall that if $Z_1 \sim \Gamma(\alpha_1)$, $Z_2 \sim \Gamma(\alpha_2)$, and $Z_1$ and $Z_2$ are independent, then $Z_1 + Z_2 \sim \Gamma(\alpha_1 + \alpha_2)$. We conclude by recurrence, using the representation of Definition 1.3.1.
2. This follows from the posterior computation at the beginning of this section.
3. Using 2,
$$P(X = j, Y_1 \le z_1, \dots, Y_d \le z_d) = P(X = j) \, P(Y_1 \le z_1, \dots, Y_d \le z_d \mid X = j) = E\big(E(\mathbb{1}_{X = j} \mid Y_1, \dots, Y_d)\big) \, D(z_1, \dots, z_d \mid \alpha_1^{(j)}, \dots, \alpha_d^{(j)})$$
$$= E(Y_j) \, D(z_1, \dots, z_d \mid \alpha_1^{(j)}, \dots, \alpha_d^{(j)}) = \frac{\alpha_j}{\alpha} \, D(z_1, \dots, z_d \mid \alpha_1^{(j)}, \dots, \alpha_d^{(j)}).$$
On the other hand,
$$P(X = j, Y_1 \le z_1, \dots, Y_d \le z_d) = E(\mathbb{1}_{\{X = j, \, Y_1 \le z_1, \dots, Y_d \le z_d\}}) = E\big(E(\mathbb{1}_{\{X = j, \, Y_1 \le z_1, \dots, Y_d \le z_d\}} \mid Y_1, \dots, Y_d)\big)$$
$$= E\big(\mathbb{1}_{\{Y_1 \le z_1, \dots, Y_d \le z_d\}} \, E(\mathbb{1}_{\{X = j\}} \mid Y_1, \dots, Y_d)\big) = E\big(\mathbb{1}_{\{Y_1 \le z_1, \dots, Y_d \le z_d\}} Y_j\big) = \int_0^{z_1} \!\!\cdots\! \int_0^{z_d} y_j \, dD(y_1, \dots, y_d \mid \alpha_1, \dots, \alpha_d). \qquad \Box$$
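A minimal sketch of the conjugate update of Section 1.4; the function name and the chosen prior below are illustrative, not from the text.

```python
import numpy as np

def dirichlet_posterior(alpha, observations):
    """Conjugate update: prior D(alpha) on p, i.i.d. draws from {1, ..., d}.

    Returns the posterior parameters alpha + N and the Bayes estimate
    p' = (alpha_i + N_i) / (sum(alpha) + n) of Section 1.4.
    """
    alpha = np.asarray(alpha, dtype=float)
    counts = np.bincount(np.asarray(observations) - 1, minlength=alpha.size)
    post = alpha + counts
    return post, post / post.sum()

# illustrative use: uniform prior D(1, 1, 1), ten categorical observations
alpha0 = [1.0, 1.0, 1.0]
data = [1, 1, 2, 3, 3, 3, 1, 2, 3, 3]
post, p_bayes = dirichlet_posterior(alpha0, data)
print(post, p_bayes)     # D(4, 3, 6) and p' = (4, 3, 6) / 13
```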
Chapter 2

Poisson-Dirichlet distribution

The Poisson-Dirichlet distribution is a probability measure introduced by J.F.C. Kingman [31] on the set
$$\nabla_\infty = \Big\{(p_1, p_2, \dots) : p_1 \ge p_2 \ge \dots, \ p_i \ge 0, \ \sum_{j=1}^\infty p_j = 1\Big\}.$$
It can be considered as a limit of some specific Dirichlet distributions, and it is also, as shown below, the distribution of the sequence of the jumps of a Gamma process arranged in decreasing order and normalized. We will also see how the Poisson-Dirichlet distribution is related to Poisson processes.

2.1 Poisson-Dirichlet distribution (J.F.C. Kingman)

2.1.1 Gamma process and Dirichlet distribution

Definition 2.1.1. We say that $X = (X_t)_{t \in \mathbb{R}_+}$ is a Lévy process if for every $s, t \ge 0$ the increment $X_{t+s} - X_t$ is independent of the process $(X_v, 0 \le v \le t)$ and has the same law as $X_s$; in particular, $P(X_0 = 0) = 1$.

Definition 2.1.2. A subordinator is a Lévy process taking values in $[0, \infty)$, which implies that its sample paths are increasing.

Definition 2.1.3. A Gamma process is a subordinator whose Lévy measure is $\gamma(dx) = x^{-1} e^{-x} dx$.

Remark 2.1.1. Let $\xi$ be a gamma process. Let $\alpha_1, \dots, \alpha_n > 0$, $t_0 = 0$, $t_j = \alpha_1 + \dots + \alpha_j$ for $1 \le j \le n$, and $Y_j = \xi(t_j) - \xi(t_{j-1})$; then $Y_j \sim \Gamma(\alpha_j)$ and $Y_1, \dots, Y_n$ are independent. Let $Y = Y_1 + \dots + Y_n = \xi(t_n)$ and $p = (p_1, \dots, p_n)$ with $p_j = Y_j / Y$; then $p$ is a random vector on $\triangle_{n-1}$ having the $D(\alpha_1, \dots, \alpha_n)$ distribution. Therefore a gamma process yields random vectors having Dirichlet distributions.

2.1.2 The limiting order statistic

Let $D(\alpha_1, \dots, \alpha_d)$ be a Dirichlet distribution defined as in Chapter 1, with density
$$f_{\alpha_1, \dots, \alpha_d}(p_1, p_2, \dots, p_d) = \frac{\Gamma(\alpha_1 + \dots + \alpha_d)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_d)} \, p_1^{\alpha_1 - 1} \cdots p_d^{\alpha_d - 1} \, \mathbb{1}_{\triangle_{d-1}}. \tag{2.1}$$
If all the $\alpha_i$ are equal to some $\alpha$, then $f_{\alpha_1, \dots, \alpha_d}$ reduces to
$$d(p_1, p_2, \dots, p_d \mid \alpha) = \frac{\Gamma(d\alpha)}{\Gamma(\alpha)^d} \, (p_1 \cdots p_d)^{\alpha - 1}. \tag{2.2}$$
In this section we prove the following theorem, which exhibits the limiting joint distribution of the order statistics $p_{(1)} \ge p_{(2)} \ge \dots$, an element of the subset $\nabla_\infty$ of the set $\triangle_\infty$. Consider the mapping
$$\psi : \triangle_\infty \to \nabla_\infty, \qquad (p_1, p_2, \dots) \mapsto (p_{(1)}, p_{(2)}, \dots).$$
If $P$ is any probability measure on $\nabla_\infty$ and $n$ is any positive integer, then the random $n$-vector $(p_{(1)}, p_{(2)}, \dots, p_{(n)})$ has a distribution depending on $P$, which might be called the $n$-th marginal distribution of $P$. The measure $P$ is uniquely determined by its marginal distributions.

Theorem 2.1.1. For each $\lambda \in (0, \infty)$ there exists a probability measure $P_\lambda$ on $\nabla_\infty$ with the following property. If for each $N$ the random vector $p$ is distributed over $\triangle_{N-1}$ according to the symmetric distribution (2.2) with $\alpha = \alpha_N$, and if $N \alpha_N \to \lambda$ as $N \to \infty$, then for any $n$ the distribution of the random vector $(p_{(1)}, p_{(2)}, \dots, p_{(n)})$ converges to the $n$-th marginal distribution of $P_\lambda$ as $N \to \infty$.

Proof. Let $y_1, y_2, \dots, y_N$ be independent random variables, each having a gamma distribution $\Gamma(\lambda, 1)$. We know that if $S = y_1 + y_2 + \dots + y_N$, then $(y_1/S, y_2/S, \dots, y_N/S)$ has a Dirichlet distribution $D(\lambda, \dots, \lambda)$. To exploit this fact, consider as above a gamma process $\xi$, that is, a process with stationary independent increments $(\xi(t), t \ge 0)$ with $\xi(0) = 0$. The process $\xi$ increases only by jumps. The positions of these jumps form a random countable dense subset $J(\xi)$ of $(0, \infty)$, with
$$P\{t \in J(\xi)\} = 0 \tag{2.3}$$
for all $t > 0$. For each value of $N$, write
$$q_j(N) = \frac{\xi(j\alpha_N) - \xi((j-1)\alpha_N)}{\xi(N\alpha_N)}. \tag{2.4}$$
By the result cited above, the vector $q = (q_{(1)}, q_{(2)}, \dots, q_{(N)})$ has the same distribution as $p$, and it therefore suffices to prove the theorem with $p$ replaced by $q$. We shall in fact prove that
$$\lim_{N \to \infty} q_{(j)}(N) = \delta\xi_{(j)} / \xi(\lambda), \tag{2.5}$$
where the $\delta\xi_{(j)}$ are the magnitudes of the jumps of $\xi$ in $(0, \lambda)$, arranged in descending order. This will suffice to prove the theorem, with $P_\lambda$ the distribution of the sequence
$$(\delta\xi_{(j)} / \xi(\lambda); \ j = 1, 2, \dots), \tag{2.6}$$
since this sequence lies in $\nabla_\infty$ as a consequence of the equality
$$\xi(\lambda) = \sum_{j=1}^\infty \delta\xi_{(j)}. \tag{2.7}$$
For any integer $n$, choose $N_0$ so large that, for any $N \ge N_0$, the discontinuities of height $\delta\xi_{(j)}$ $(j = 1, 2, \dots, n)$ are contained in distinct intervals $((i-1)\alpha_N, i\alpha_N)$. Then $\xi(N\alpha_N) q_{(j)} \ge \delta\xi_{(j)}$ for $1 \le j \le n$ and $N \ge N_0$, so that, using $\xi(N\alpha_N) \to \xi(\lambda)$ (which follows from (2.3)),
$$\liminf_N q_{(j)} \ge \delta\xi_{(j)} / \xi(\lambda) \tag{2.8}$$
for $j = 1, 2, \dots, n$. Since $n$ is arbitrary, (2.8) holds for all $j$, and moreover Fatou's lemma and (2.7) give
$$\limsup_N q_{(j)} = \limsup_N \Big\{1 - \sum_{i \ne j} q_{(i)}\Big\} \le 1 - \sum_{i \ne j} \liminf_N q_{(i)} \le 1 - \sum_{i \ne j} \delta\xi_{(i)} / \xi(\lambda) = \delta\xi_{(j)} / \xi(\lambda).$$
Hence
$$\delta\xi_{(j)} / \xi(\lambda) \le \liminf_N q_{(j)} \le \limsup_N q_{(j)} \le \delta\xi_{(j)} / \xi(\lambda),$$
and thus $\lim_N q_{(j)} = \delta\xi_{(j)} / \xi(\lambda)$. □

By definition of the $\delta\xi_{(j)} / \xi(\lambda)$, we have
$$\delta\xi_{(1)} / \xi(\lambda) \ge \delta\xi_{(2)} / \xi(\lambda) \ge \dots \qquad \text{and} \qquad \sum_{k=1}^\infty \delta\xi_{(k)} / \xi(\lambda) = 1.$$
We will write
$$(\delta\xi_{(1)} / \xi(\lambda), \ \delta\xi_{(2)} / \xi(\lambda), \dots) \sim \mathrm{PD}(0, \lambda),$$
where $\mathrm{PD}(0, \lambda)$ is the Poisson-Dirichlet distribution defined as follows:

Definition 2.1.4. Let $0 < \lambda < \infty$. Let $(\xi(t), t \in [0, \lambda])$ be a gamma subordinator and let $J_1 \ge J_2 \ge \dots \ge 0$ be the ordered sequence of its jumps. The distribution on $\nabla_\infty$ of the random variable $(J_1/\xi(\lambda), J_2/\xi(\lambda), \dots)$ is called the Poisson-Dirichlet distribution with parameter $\lambda$ and is denoted by $\mathrm{PD}(0, \lambda)$.

Theorem 2.1.1 shows that if $(p_1, \dots, p_N) \sim D(\alpha_N, \dots, \alpha_N)$, then the distribution of $(p_{(1)}, \dots, p_{(N)})$ approximates $\mathrm{PD}(0, \lambda)$ when $N$ is fairly large, the $\alpha_N$ being uniformly small and $N\alpha_N$ close to $\lambda$.
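Theorem 2.1.1 suggests a simple approximate sampler for $\mathrm{PD}(0, \lambda)$: draw a symmetric Dirichlet vector with many coordinates and small parameters, then sort. A sketch, assuming numpy (the symmetric Dirichlet is built by the gamma normalization of Remark 2.1.1):

```python
import numpy as np

def pd_sample_approx(lam=1.0, N=10_000, rng=None):
    """Approximate draw from PD(0, lam) via Theorem 2.1.1.

    Samples (p_1, ..., p_N) ~ D(alpha_N, ..., alpha_N) with alpha_N = lam / N,
    then returns the decreasing rearrangement of the coordinates.
    """
    rng = rng or np.random.default_rng(0)
    g = rng.gamma(shape=lam / N, scale=1.0, size=N)   # normalized gammas
    p = g / g.sum()
    return np.sort(p)[::-1]

p = pd_sample_approx(lam=1.0)
print(p[:5], p.sum())     # a few largest atoms; total mass 1
```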
Chapter 3

Dirichlet processes

In a celebrated paper [19], Thomas S. Ferguson introduced a random distribution, called a Dirichlet process, such that its marginal w.r.t. any finite partition has a Dirichlet distribution as defined in Chapter 1. A Dirichlet process is a random discrete distribution which is a very useful tool in nonparametric Bayesian statistics. The remarkable properties of a Dirichlet process are described below.

3.1 The Dirichlet process

We want to generalize the Dirichlet distribution. Let $H$ be a set and let $\mathcal{A}$ be a $\sigma$-field on $H$. We define below a random probability on $(H, \mathcal{A})$ by defining the joint distribution of the random variables $(P(A_1), \dots, P(A_m))$ for every $m$ and every finite sequence of measurable sets ($A_i \in \mathcal{A}$ for all $i$). We then verify the Kolmogorov consistency conditions to show that there exists a probability $\mathcal{P}$ on $([0,1]^{\mathcal{A}}, \mathcal{B}_{\mathcal{A}})$ yielding these distributions. Here $[0,1]^{\mathcal{A}}$ represents the space of all functions from $\mathcal{A}$ into $[0,1]$, and $\mathcal{B}_{\mathcal{A}}$ the $\sigma$-field generated by the field of cylinder sets.

It is more convenient to define the random probability $P$ by defining the joint distribution of $(P(B_1), \dots, P(B_k))$ for all $k$ and all finite measurable partitions $(B_1, \dots, B_k)$ of $H$, i.e. $B_i \in \mathcal{A}$ for all $i$, $B_i \cap B_j = \emptyset$ for $i \ne j$, and $\cup_{j=1}^k B_j = H$. From these distributions, the joint distribution of $(P(A_1), \dots, P(A_m))$ for arbitrary measurable sets $A_1, \dots, A_m$ may be defined as follows. Given arbitrary measurable sets $A_1, \dots, A_m$, we define
$$B_{x_1, \dots, x_m} = \cap_{j=1}^m A_j^{x_j}, \qquad x_j \in \{0, 1\},$$
where $A_j^1 = A_j$ and $A_j^0 = A_j^c$. Thus the sets $\{B_{x_1, \dots, x_m}\}$ form a partition of $H$. If we are given the joint distribution of
$$\{P(B_{x_1, \dots, x_m}) : x_j = 0 \text{ or } 1, \ j = 1, \dots, m\}, \tag{3.1}$$
then we may define the joint distribution of $(P(A_1), \dots, P(A_m))$ through
$$P(A_i) = \sum_{\{(x_1, \dots, x_m) : x_i = 1\}} P(B_{x_1, \dots, x_i = 1, \dots, x_m}). \tag{3.2}$$
We note that if $A_1, \dots, A_m$ is a measurable partition to start with, this does not lead to contradictory definitions, provided $P(\emptyset)$ is degenerate at 0. If we are given a system of distributions of $(P(B_1), \dots, P(B_k))$ for all $k$ and all measurable partitions $B_1, \dots, B_k$, there is one consistency criterion that is needed, namely:

CONDITION C. If $(B_1', \dots, B_{k'}')$ and $(B_1, \dots, B_k)$ are measurable partitions, and if $(B_1', \dots, B_{k'}')$ is a refinement of $(B_1, \dots, B_k)$ with
$$B_1 = \cup_1^{r_1} B_i', \quad B_2 = \cup_{r_1 + 1}^{r_2} B_i', \quad \dots, \quad B_k = \cup_{r_{k-1} + 1}^{k'} B_i',$$
then the distribution of
$$\Big(\sum_1^{r_1} P(B_i'), \ \sum_{r_1 + 1}^{r_2} P(B_i'), \ \dots, \ \sum_{r_{k-1} + 1}^{k'} P(B_i')\Big),$$
as determined from the joint distribution of $(P(B_1'), \dots, P(B_{k'}'))$, is identical to the distribution of $(P(B_1), \dots, P(B_k))$.

Lemma 3.1.1. If a system of joint distributions of $(P(B_1), \dots, P(B_k))$ for all $k$ and all measurable partitions $(B_1, \dots, B_k)$ satisfies Condition C, and if for arbitrary measurable sets $A_1, \dots, A_m$ the distribution of $(P(A_1), \dots, P(A_m))$ is defined as in (3.2), then there exists a probability $\mathcal{P}$ on $([0,1]^{\mathcal{A}}, \mathcal{B}_{\mathcal{A}})$ yielding these distributions.

Proof. See [21], page 214.

Definition 3.1.1. Let $\alpha$ be a non-null finite measure on $(H, \mathcal{A})$. We say $P$ is a Dirichlet process on $(H, \mathcal{A})$ with parameter $\alpha$ if for every $k = 1, 2, \dots$ and every measurable partition $(B_1, \dots, B_k)$ of $H$, the distribution of $(P(B_1), \dots, P(B_k))$ is Dirichlet $D(\alpha(B_1), \dots, \alpha(B_k))$.

Proposition 3.1.1. Let $P$ be a Dirichlet process on $(H, \mathcal{A})$ with parameter $\alpha$ and let $A \in \mathcal{A}$. If $\alpha(A) = 0$, then $P(A) = 0$ with probability one. If $\alpha(A) > 0$, then $P(A) > 0$ with probability one. Furthermore, $E(P(A)) = \alpha(A)/\alpha(H)$.

Proof. By considering the partition $(A, A^c)$, it is seen that $P(A)$ has a beta distribution $\beta(\alpha(A), \alpha(A^c))$; this distribution is degenerate at 0 when $\alpha(A) = 0$ and gives no mass to $\{0\}$ when $\alpha(A) > 0$. Therefore $E(P(A)) = \alpha(A)/\alpha(H)$. □

Proposition 3.1.2. Let $P$ be a Dirichlet process on $(H, \mathcal{A})$ with parameter $\alpha$ and let $Q$ be a fixed probability measure on $(H, \mathcal{A})$ with $Q \ll \alpha$. Then, for any positive integer $m$, measurable sets $A_1, \dots, A_m$ and $\varepsilon > 0$,
$$\mathcal{P}\{|P(A_i) - Q(A_i)| < \varepsilon \ \text{ for } i = 1, \dots, m\} > 0.$$

Proof. Write, as in (3.2), $P(A_i) = \sum_{\{(x_1, \dots, x_m) : x_i = 1\}} P(B_{x_1, \dots, x_m})$ and similarly for $Q(A_i)$, and note that
$$\mathcal{P}\{|P(A_i) - Q(A_i)| < \varepsilon, \ i = 1, \dots, m\} \ge \mathcal{P}\{|P(B_{x_1, \dots, x_m}) - Q(B_{x_1, \dots, x_m})| < 2^{-m}\varepsilon \ \text{ for all } (x_1, \dots, x_m)\}.$$
Therefore it is sufficient to show that the right-hand side is positive. If $\alpha(B_{x_1, \dots, x_m}) = 0$, then $Q(B_{x_1, \dots, x_m}) = 0$ and $P(B_{x_1, \dots, x_m}) = 0$ with probability one, so that $|P(B_{x_1, \dots, x_m}) - Q(B_{x_1, \dots, x_m})| = 0$ with probability one. For those $(x_1, \dots, x_m)$ for which $\alpha(B_{x_1, \dots, x_m}) > 0$, the joint distribution of the corresponding $P(B_{x_1, \dots, x_m})$ is a nondegenerate Dirichlet distribution, which gives positive weight to every open subset of the simplex $\{\sum P(B_{x_1, \dots, x_m}) = 1\}$, and the claim follows. □

This proposition states that the support of a Dirichlet process on $(H, \mathcal{A})$ with parameter $\alpha$ contains the set of all probability measures which are absolutely continuous with respect to $\alpha$.

Definition 3.1.2. Let $P$ be a random probability measure on $(H, \mathcal{A})$. We say that $X_1, \dots, X_n$ is a sample of size $n$ from $P$ if for any $m = 1, 2, \dots$ and measurable sets $A_1, \dots, A_m, C_1, \dots, C_n$,
$$\mathcal{P}\{X_1 \in C_1, \dots, X_n \in C_n \mid P(A_1), \dots, P(A_m), P(C_1), \dots, P(C_n)\} = \prod_{j=1}^n P(C_j) \quad \text{a.s.} \tag{3.3}$$
Roughly, $X_1, \dots, X_n$ is a sample of size $n$ from $P$ if, given $P(C_1), \dots, P(C_n)$, the events $\{X_1 \in C_1\}, \dots, \{X_n \in C_n\}$ are independent of the rest of the process and independent among themselves, with $\mathcal{P}\{X_j \in C_j \mid P(C_1), \dots, P(C_n)\} = P(C_j)$ a.s. for $j = 1, \dots, n$. This definition determines the joint distribution of $X_1, \dots, X_n, P(A_1), \dots, P(A_m)$ once the distribution of the process is given, since
$$\mathcal{P}\{X_1 \in C_1, \dots, X_n \in C_n, P(A_1) \le y_1, \dots, P(A_m) \le y_m\} \tag{3.4}$$
may be found by integrating (3.3) with respect to the joint distribution of $P(A_1), \dots, P(A_m), P(C_1), \dots, P(C_n)$ over the set $[0, y_1] \times \dots \times [0, y_m] \times [0, 1] \times \dots \times [0, 1]$. The Kolmogorov consistency conditions may be checked to show that (3.4) determines a probability $\mathcal{P}$ over $(H^n \times [0,1]^{\mathcal{A}}, \ \mathcal{A}^n \times \mathcal{B}_{\mathcal{A}})$.
Proposition 3.1.3. Let $P$ be a Dirichlet process on $(H, \mathcal{A})$ with parameter $\alpha$ and let $X$ be a sample of size 1 from $P$. Then for $A \in \mathcal{A}$,
$$\mathcal{P}(X \in A) = \frac{\alpha(A)}{\alpha(H)}.$$

Proof. Since $\mathcal{P}\{X \in A \mid P(A)\} = P(A)$ a.s.,
$$\mathcal{P}(X \in A) = E\big(\mathcal{P}(X \in A \mid P(A))\big) = E(P(A)) = \frac{\alpha(A)}{\alpha(H)}. \qquad \Box$$

Proposition 3.1.4. Let $P$ be a Dirichlet process on $(H, \mathcal{A})$ with parameter $\alpha$, and let $X$ be a sample of size 1 from $P$. Let $(B_1, \dots, B_k)$ be a measurable partition of $H$ and let $A \in \mathcal{A}$. Then
$$\mathcal{P}\{X \in A, \ P(B_1) \le y_1, \dots, P(B_k) \le y_k\} = \sum_{j=1}^k \frac{\alpha(B_j \cap A)}{\alpha(H)} \, D(y_1, \dots, y_k \mid \alpha_1^{(j)}, \dots, \alpha_k^{(j)}), \tag{3.5}$$
where
$$\alpha_i^{(j)} = \begin{cases} \alpha(B_i) & \text{if } i \ne j, \\ \alpha(B_j) + 1 & \text{if } i = j. \end{cases}$$

Proof. Define $B_{j,1} = B_j \cap A$ and $B_{j,0} = B_j \cap A^c$ for $j = 1, \dots, k$, and let $Y_{j,x} = P(B_{j,x})$ for $j = 1, \dots, k$ and $x = 0$ or 1. Then, from (3.3),
$$\mathcal{P}\{X \in A \mid Y_{j,x}, \ j = 1, \dots, k, \ x = 0 \text{ or } 1\} = P(A) = \sum_{j=1}^k Y_{j,1} \quad \text{a.s.}$$
Hence $\mathcal{P}\{X \in A, P(B_i) \le y_i, i = 1, \dots, k\}$ can be found by integrating $\sum_j Y_{j,1}$:
$$\mathcal{P}\{X \in A, \ P(B_i) \le y_i, \ i = 1, \dots, k\} = E\big(\mathbb{1}_{\{P(B_1) \le y_1, \dots, P(B_k) \le y_k\}} \sum_{j=1}^k Y_{j,1}\big) = \sum_{j=1}^k E\big(\mathbb{1}_{\{P(B_i) \le y_i, \, i = 1, \dots, k\}} P(B_j \cap A)\big)$$
$$= \sum_{j=1}^k \int_0^{y_1} \!\!\cdots\! \int_0^{y_k} y_{j,1} \, dD(y \mid \alpha^{(j)}) = \sum_{j=1}^k \frac{\alpha(B_{j,1})}{\alpha(H)} \, D(y \mid \alpha^{(j)}),$$
by Proposition 1.4.1(3), where $y = (y_{1,0}, \dots, y_{k,0}, y_{1,1}, \dots, y_{k,1})$, $\alpha^{(j)} = (\alpha_{1,0}^{(j)}, \dots, \alpha_{k,0}^{(j)}, \alpha_{1,1}^{(j)}, \dots, \alpha_{k,1}^{(j)})$, and
$$\alpha_{i,x}^{(j)} = \begin{cases} \alpha(B_{i,x}) & \text{if } (i, x) \ne (j, 1), \\ \alpha(B_{j,1}) + 1 & \text{if } (i, x) = (j, 1). \end{cases}$$
Aggregating the refined partition $(B_{i,x})$ back to $(B_1, \dots, B_k)$ by Proposition 1.4.1(1) yields (3.5). □

Monotone Class Theorem

A monotone vector space $\mathcal{W}$ on a space $\Omega$ is defined to be a collection of bounded, real-valued functions on $\Omega$ satisfying:

• $\mathcal{W}$ is a vector space over $\mathbb{R}$;
• the constant functions are in $\mathcal{W}$;
• if $(f_n)_n \subset \mathcal{W}$, $0 \le f_1 \le \dots \le f_n \le \dots$, $f_n \uparrow f$ and $f$ is bounded, then $f \in \mathcal{W}$.

A collection $\mathcal{M}$ of real functions defined on $\Omega$ is said to be multiplicative if $fg \in \mathcal{M}$ for all $f, g \in \mathcal{M}$. The following theorem is quoted from page 7 of P. Protter.

Theorem 3.1.1. Let $\mathcal{M}$ be a multiplicative class of bounded real-valued functions defined on $\Omega$ and let $\mathcal{A} = \sigma(\mathcal{M})$ be the $\sigma$-field generated by $\mathcal{M}$. If $\mathcal{W}$ is a monotone vector space containing $\mathcal{M}$, then $\mathcal{W}$ contains all bounded $\mathcal{A}$-measurable functions.

The following very useful theorem states that the posterior of a Dirichlet process is still a Dirichlet process.

Theorem 3.1.2. ([19]) Let $P$ be a Dirichlet process on $(H, \mathcal{A})$ with parameter $\alpha$, and let $X_1, \dots, X_n$ be a sample of size $n$ from $P$. Then the conditional distribution of $P$ given $X_1, \dots, X_n$ is a Dirichlet process with parameter $\alpha + \sum_1^n \delta_{X_i}$.

Proof. It is sufficient to prove the theorem for $n = 1$, since the theorem then follows by induction upon repeated application of the case $n = 1$. To show that $D(\alpha + \delta_X)$ is the conditional distribution of $P \mid X$, we need to show:

• for fixed $\omega$, $D(\alpha + \delta_{X(\omega)})$ is a probability measure;
• for any fixed measurable $C \subset \mathcal{P}(H)$, $D(\alpha + \delta_X)(C)$ is a version of $\Pr(P \in C \mid X)$, i.e., for all $B \in \mathcal{A}$,
$$E\big(D(\alpha + \delta_X)(C) \, \mathbb{1}_{X \in B}\big) = \Pr(P \in C, \ X \in B).$$
Since $X \sim \bar{\alpha} := \alpha(\cdot)/\alpha(H)$ by Proposition 3.1.3, this is equivalent to: for all $C \in \mathcal{B}(\mathcal{P}(H))$ and $B \in \mathcal{A}$,
$$\int_B D(\alpha + \delta_x)(C) \, \bar{\alpha}(dx) = \int \mathbb{1}_{(P \in C)} P(B) \, D(\alpha)(dP), \tag{3.6}$$
and hence to proving that, for any bounded $\mathcal{B}(\mathcal{P}(H))$-measurable $f : \mathcal{P}(H) \to \mathbb{R}$,
$$\int_B \int f(P) \, D(\alpha + \delta_x)(dP) \, \bar{\alpha}(dx) = \int f(P) P(B) \, D(\alpha)(dP). \tag{3.7}$$
Let $\mathcal{S} = \{f : \mathcal{P}(H) \to \mathbb{R}, \text{ bounded, } \mathcal{B}(\mathcal{P}(H))\text{-measurable and satisfying (3.7)}\}$ and let
$$\mathcal{M} = \{P(B_1)^{r_1} P(B_2)^{r_2} \cdots P(B_k)^{r_k} : k \in \mathbb{N}, \ B_i \in \mathcal{A}, \ r_i \in \mathbb{N}, \ i = 1, \dots, k\}.$$
Since $\mathcal{S}$ is a monotone vector space containing $\mathcal{M}$, $\mathcal{M}$ is a multiplicative class of functions and $\sigma(\mathcal{M}) = \mathcal{B}(\mathcal{P}(H))$, by the monotone class theorem it suffices to verify (3.7) for $f \in \mathcal{M}$. Note further that every $f \in \mathcal{M}$ is a linear combination of functions in
$$\mathcal{M}^* = \{P(B_1)^{r_1} \cdots P(B_k)^{r_k} : k \in \mathbb{N}, \ (B_1, \dots, B_k) \text{ a measurable partition of } H\}.$$
Finally, it suffices to verify (3.7) for $f \in \mathcal{M}^*$. Thus we need to show, for a measurable partition $(B_1, B_2, \dots, B_k)$ of $H$ and $r_i \in \mathbb{N}$, $i = 1, \dots, k$,
$$\int_B \int \prod_{i=1}^k P(B_i)^{r_i} \, D(\alpha + \delta_x)(dP) \, \bar{\alpha}(dx) = \int \prod_{i=1}^k P(B_i)^{r_i} P(B) \, D(\alpha)(dP).$$
First, when some $\alpha(B_i) = 0$, both sides are 0 and the equality holds. For the rest of the proof assume $\alpha(B_i) > 0$ for $i = 1, \dots, k$, and write $\alpha' = (\alpha(B_1), \alpha(B_2), \dots, \alpha(B_k)) = (\alpha_1, \alpha_2, \dots, \alpha_k)$. For the left-hand side,
$$\int_B \int \prod_{i=1}^k P(B_i)^{r_i} \, D(\alpha + \delta_x)(dP) \, \bar{\alpha}(dx) = \sum_{j=1}^k \int_{B \cap B_j} \int \prod_{i=1}^k P(B_i)^{r_i} \, D(\alpha + \delta_x)(dP) \, \bar{\alpha}(dx) = \sum_{j=1}^k \frac{\alpha(B \cap B_j)}{\alpha(H)} \int \prod_{i=1}^k y_i^{r_i} \, D(\alpha' + \delta_j)(dy),$$
since for $x \in B_j$ the vector $(P(B_1), \dots, P(B_k))$ under $D(\alpha + \delta_x)$ has distribution $D(\alpha' + \delta_j)$, where $\delta_j$ adds 1 to the $j$-th coordinate. Writing both Dirichlet densities explicitly gives
$$\int \prod_{i=1}^k y_i^{r_i} \, D(\alpha' + \delta_j)(dy) = \frac{\Gamma(\alpha(H) + 1)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_j + 1) \cdots \Gamma(\alpha_k)} \int y_1^{r_1 + \alpha_1 - 1} \cdots y_j^{r_j + \alpha_j} \cdots y_k^{r_k + \alpha_k - 1} \, dy = \frac{\alpha(H)}{\alpha_j} \int y_1^{r_1} \cdots y_j^{r_j + 1} \cdots y_k^{r_k} \, D(\alpha')(dy),$$
so the left-hand side equals
$$\sum_{j=1}^k \frac{\alpha(B \cap B_j)}{\alpha_j} \int y_1^{r_1} \cdots y_j^{r_j + 1} \cdots y_k^{r_k} \, D(\alpha')(dy).$$
On the other hand, for the right-hand side,
$$\int \prod_{i=1}^k P(B_i)^{r_i} P(B) \, D(\alpha)(dP) = \sum_{j=1}^k \int \prod_{i=1}^k P(B_i)^{r_i} P(B \cap B_j) \, D(\alpha)(dP) = \sum_{j=1}^k \frac{\alpha(B \cap B_j)}{\alpha(B_j)} \int \prod_{i=1}^k P(B_i)^{r_i} P(B_j) \, D(\alpha)(dP)$$
$$= \sum_{j=1}^k \frac{\alpha(B \cap B_j)}{\alpha_j} \int y_1^{r_1} \cdots y_j^{r_j + 1} \cdots y_k^{r_k} \, D(\alpha')(dy),$$
where the middle equality uses $E\big[P(B \cap B_j) \mid P(B_1), \dots, P(B_k)\big] = \frac{\alpha(B \cap B_j)}{\alpha(B_j)} P(B_j)$. The two sides agree, which completes the proof. □

3.2 An alternative definition of the Dirichlet process

In this section we will see how the Dirichlet process can be derived from the Gamma process on $(H, \mathcal{A})$ with parameter $\alpha$. The basic idea is that the Dirichlet distribution is defined as the joint distribution of independent Gamma variables divided by their sum. Hence the Dirichlet process should be defined from a Gamma process with independent "increments" divided by their sum. Using the representation of a process with independent increments as a sum of a countable number of jumps of random height at a countable number of random points, we may divide by the total height of the jumps and obtain a discrete probability measure, which should be distributed as a Dirichlet process. More precisely, let $\Gamma(\alpha, 1)$, $\alpha > 0$, denote the Gamma distribution with characteristic function
$$\varphi(u) = (1 - iu)^{-\alpha} = \exp \int_0^\infty (e^{iux} - 1) \, dN(x), \tag{3.8}$$
where
$$N(x) = -\alpha \int_x^\infty e^{-y} y^{-1} \, dy, \qquad 0 < x < \infty. \tag{3.9}$$
Define the distribution of random variables $J_1, J_2, \dots$ as follows:
$$\mathcal{P}(J_1 \le x_1) = e^{N(x_1)}, \qquad x_1 > 0, \tag{3.10}$$
and for $j = 2, 3, \dots$,
$$\mathcal{P}(J_j \le x_j \mid J_{j-1} = x_{j-1}, \dots, J_1 = x_1) = \exp[N(x_j) - N(x_{j-1})], \qquad 0 < x_j < x_{j-1}. \tag{3.11}$$
In other words, the distribution function of $J_1$ is $\exp N(x_1)$, and for $j = 2, 3, \dots$ the distribution of $J_j$ given $J_{j-1}, \dots, J_1$ is the same as the distribution of $J_1$ truncated above at $J_{j-1}$.

Theorem 3.2.1. Let $G(t)$ be a distribution function on $[0, 1]$. Let
$$\xi_t = \sum_{j=1}^\infty J_j \, \mathbb{1}_{[0, G(t))}(U_j), \tag{3.12}$$
where (i) the distribution of $J_1, J_2, \dots$ is given by (3.10) and (3.11), and (ii) $U_1, U_2, \dots$ are independent identically distributed variables, uniformly distributed on $[0, 1]$ and independent of $J_1, J_2, \dots$. Then, with probability one, $\xi_t$ converges for all $t \in [0, 1]$ and is a gamma process with independent increments, with $\xi_t \sim \Gamma(\alpha G(t), 1)$. In particular, $\xi_1 = \sum_1^\infty J_j$ converges with probability one and $\xi_1 \sim \Gamma(\alpha, 1)$.

If we define
$$P_j = J_j / \xi_1, \tag{3.13}$$
then $P_j \ge 0$ and $\sum_1^\infty P_j = 1$ with probability one. We now define the Dirichlet process. As before, let $(H, \mathcal{A})$ be a measurable space, and let $\alpha(\cdot)$ be a finite non-null measure on $\mathcal{A}$. Let $V_1, V_2, \dots$ be a sequence of independent identically distributed random variables with values in $H$ and with probability measure $Q$, where $Q(A) = \alpha(A)/\alpha(H)$. We identify the $\alpha$ in formulas (3.8) and (3.9) with $\alpha(H)$ and define the random probability measure $P$ on $(H, \mathcal{A})$ as
$$P(A) = \sum_{j=1}^\infty P_j \, \delta_{V_j}(A). \tag{3.14}$$

Theorem 3.2.2. The random probability measure defined by (3.14) is a Dirichlet process on $(H, \mathcal{A})$ with parameter $\alpha$.

Proof. Let $(B_1, \dots, B_k)$ be a measurable partition of $H$. Then
$$(P(B_1), \dots, P(B_k)) = \frac{1}{\xi_1} \sum_{j=1}^\infty J_j \, (\delta_{V_j}(B_1), \dots, \delta_{V_j}(B_k)).$$
By the assumption on the distribution of $V_1, V_2, \dots$, the $M_j = (\delta_{V_j}(B_1), \dots, \delta_{V_j}(B_k))$ are i.i.d. random vectors having a multinomial distribution with probability vector $(Q(B_1), \dots, Q(B_k))$. Hence the distribution of $\sum_1^\infty J_j M_j$ must be the same as the distribution of
$$(\xi_{1/k}, \ \xi_{2/k} - \xi_{1/k}, \ \dots, \ \xi_1 - \xi_{(k-1)/k}) = \sum_{i=1}^\infty J_i \big(\mathbb{1}_{[0, G(1/k))}(U_i), \dots, \mathbb{1}_{[G((j-1)/k), G(j/k))}(U_i), \dots, \mathbb{1}_{[G((k-1)/k), G(1))}(U_i)\big),$$
where $\xi_t$ is the gamma process of Theorem 3.2.1 with $G(t)$ chosen so that $G(j/k) - G((j-1)/k) = Q(B_j)$, $j = 1, \dots, k$. Hence the $\sum_{j=1}^\infty J_j \delta_{V_j}(B_i)$ are, for $i = 1, \dots, k$, independent random variables with
$$\sum_{j=1}^\infty J_j \, \delta_{V_j}(B_i) \sim \Gamma(\alpha(B_i), 1)$$
(because $\alpha(H)\big(G(i/k) - G((i-1)/k)\big) = \alpha(H) Q(B_i) = \alpha(B_i)$). Since $\xi_1$ is the sum of these independent gamma variables,
$$(P(B_1), \dots, P(B_k)) \sim D(\alpha(B_1), \dots, \alpha(B_k))$$
by the definition of the Dirichlet distribution. Thus $P$ satisfies the definition of the Dirichlet process. □

Theorem 3.2.3. Let $P$ be the Dirichlet process defined by (3.14), and let $Z$ be a measurable real-valued function defined on $(H, \mathcal{A})$. If $\int |Z| \, d\alpha < \infty$, then $\int |Z| \, dP < \infty$ with probability one, and
$$E\Big(\int Z \, dP\Big) = \int Z \, dE(P) = \frac{1}{\alpha(H)} \int Z \, d\alpha.$$

Proof. From $P(A) = \sum_{j=1}^\infty P_j \delta_{V_j}(A)$,
$$\int |Z| \, dP = \sum_{j=1}^\infty |Z(V_j)| \, P_j, \tag{3.15}$$
so that the monotone convergence theorem, together with the independence of the $V_j$ and $P_j$, gives
$$E\Big(\int |Z| \, dP\Big) = \sum_{j=1}^\infty E(|Z(V_j)|) \, E(P_j) = \frac{1}{\alpha(H)} \int |Z| \, d\alpha \, \sum_{j=1}^\infty E(P_j) = \frac{1}{\alpha(H)} \int |Z| \, d\alpha.$$
Therefore $\int Z \, dP = \sum_{j=1}^\infty Z(V_j) P_j$ is absolutely convergent with probability one. Since this series is bounded by (3.15), which is integrable, the dominated convergence theorem implies
$$E\Big(\int Z \, dP\Big) = \sum_{j=1}^\infty E(Z(V_j)) \, E(P_j) = \frac{1}{\alpha(H)} \int Z \, d\alpha. \qquad \Box$$
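The jump variables (3.10)-(3.11) can be generated by the inverse Lévy measure method: since $N(x) = -\alpha(H) E_1(x)$, where $E_1$ is the exponential integral, the ranked jumps satisfy $\alpha(H) E_1(J_j) = \Gamma_j$ with $\Gamma_1 < \Gamma_2 < \dots$ the arrival times of a unit-rate Poisson process. A numerical sketch (truncated at finitely many jumps; scipy is assumed, and the function names are mine):

```python
import numpy as np
from scipy.special import exp1        # E_1(x) = integral_x^inf e^{-y} / y dy
from scipy.optimize import brentq

def ferguson_dp_weights(alpha_total, n_jumps=500, rng=None):
    """Ranked jumps J_1 > J_2 > ... of the Section 3.2 gamma process via the
    inverse Levy measure method, then the normalized weights P_j of (3.13).
    """
    rng = rng or np.random.default_rng(0)
    gammas = np.cumsum(rng.exponential(size=n_jumps))   # Poisson arrival times
    jumps = np.array([
        brentq(lambda x, g=g: alpha_total * exp1(x) - g, 1e-300, 50.0)
        for g in gammas                                  # solve alpha * E_1(J) = Gamma_j
    ])
    return jumps / jumps.sum()                           # truncated weights P_j

w = ferguson_dp_weights(alpha_total=2.0)
print(w[:5], w.sum())        # decreasing weights, total mass ~ 1
```

Pairing these weights with i.i.d. atoms $V_j \sim \alpha/\alpha(H)$ gives (a truncation of) the random measure (3.14).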
Theorem 3.2.4. Let $P$ be the Dirichlet process defined by (3.14), and let $Z_1$ and $Z_2$ be measurable real-valued functions defined on $(H, \mathcal{A})$. If $\int |Z_1| \, d\alpha < \infty$, $\int |Z_2| \, d\alpha < \infty$ and $\int |Z_1 Z_2| \, d\alpha < \infty$, then
$$E\Big(\int Z_1 \, dP \int Z_2 \, dP\Big) = \mu_1 \mu_2 + \frac{\sigma_{12}}{\alpha(H) + 1},$$
where
$$\mu_i = \frac{1}{\alpha(H)} \int Z_i \, d\alpha, \quad i = 1, 2, \qquad \text{and} \qquad \sigma_{12} = \frac{1}{\alpha(H)} \int Z_1 Z_2 \, d\alpha - \mu_1 \mu_2. \tag{3.16}$$

Proof. As in Theorem 3.2.3,
$$\int Z_1 \, dP \int Z_2 \, dP = \sum_{j=1}^\infty Z_1(V_j) P_j \, \sum_{i=1}^\infty Z_2(V_i) P_i = \sum_{i=1}^\infty \sum_{j=1}^\infty Z_1(V_j) Z_2(V_i) \, P_j P_i,$$
since both series are absolutely convergent with probability one. This is bounded in absolute value by
$$\sum_{i=1}^\infty \sum_{j=1}^\infty |Z_1(V_j) Z_2(V_i)| \, P_j P_i. \tag{3.17}$$
If this is an integrable random variable, we may take the expectation inside the summation and obtain, using the independence of the $P_i$ and the $V_i$ and the independence of the $V_i$ among themselves,
$$E\Big(\int Z_1 \, dP \int Z_2 \, dP\Big) = \sum_{i} \sum_{j} E(Z_1(V_j) Z_2(V_i)) \, E(P_j P_i) = \mu_1 \mu_2 \sum_{i \ne j} E(P_j P_i) + (\sigma_{12} + \mu_1 \mu_2) \sum_i E(P_i^2).$$
An analogous computation shows that (3.17) is integrable. Since $\sum_{i \ne j} E(P_j P_i) = 1 - \sum_i E(P_i^2)$, the proof will be complete once we show
$$E\Big(\sum_{i=1}^\infty P_i^2\Big) = \frac{1}{\alpha(H) + 1}.$$
This seems difficult to show directly from the definition of the $P_i$, so we proceed as follows. The distribution of the $P_i$ depends on $\alpha$ only through the value of $\alpha(H)$. So choose $H$ to be the real line, $\alpha$ to give mass $\alpha(H)/2$ to $-1$ and mass $\alpha(H)/2$ to $+1$, and $Z_1(x) = Z_2(x) = x$. Then $\mu_1 = \mu_2 = 0$ and $\sigma_{12} = 1$. Hence
$$E\Big(\sum_{i=1}^\infty P_i^2\Big) = E\Big(\Big(\int x \, dP(x)\Big)^2\Big) = E\big((-P(\{-1\}) + P(\{1\}))^2\big) = E\big((2P(\{1\}) - 1)^2\big) = \frac{1}{\alpha(H) + 1},$$
since $P(\{1\}) \sim \beta(\alpha(H)/2, \alpha(H)/2)$. □

3.3 Sethuraman's representation

Let $\alpha$ be a finite measure on $H$ and let
$$V_1, V_2, \dots \overset{\text{iid}}{\sim} \beta(1, \alpha(H)), \qquad Y_1, Y_2, \dots \overset{\text{iid}}{\sim} \frac{\alpha(\cdot)}{\alpha(H)},$$
the two sequences being independent of each other. Define
$$p_1 = V_1, \qquad p_k = (1 - V_1) \cdots (1 - V_{k-1}) V_k, \quad k = 2, 3, \dots$$
Then
$$P = \sum_{i=1}^\infty p_i \, \delta_{Y_i} \sim D(\alpha),$$
where the convergence holds a.e. Note that, since for any $A \in \mathcal{A}$, $P(A) = \sum_{i=1}^\infty p_i \mathbb{1}_{Y_i \in A}$ is measurable, $P$ is measurable. In practice, the above representation is applied as follows:

Corollary 3.3.1. (Stick-breaking construction) Let $\alpha$ be any finite measure on $H$. Let $c = \alpha(H)$ and $\bar{H} = \alpha/c$. For any integer $N$, let $V_1, \dots, V_{N-1}$ be i.i.d. $\mathrm{Beta}(1, c)$ and $V_N = 1$. Let $p_1 = V_1$, $p_k = (1 - V_1) \cdots (1 - V_{k-1}) V_k$, $k = 2, \dots, N$. Let $Y_1, \dots, Y_N$ be i.i.d. $\sim \bar{H}$. Then $P_N = \sum_{i=1}^N p_i \, \delta_{Y_i}$ converges a.e., as $N \to \infty$, to a Dirichlet process $D(\alpha)$. (A simulation sketch is given below, before the proof.)
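As a quick illustration (a sketch, not from [19] or Sethuraman's paper), here is the truncated stick-breaking sampler of Corollary 3.3.1, with a standard normal base measure chosen purely for concreteness:

```python
import numpy as np

def stick_breaking_dp(c=2.0, N=1000, base_sampler=None, rng=None):
    """Truncated stick-breaking draw of a Dirichlet process D(c * H_bar).

    Returns atoms Y_i ~ H_bar and weights p_i with sum(p) = 1 (since V_N = 1).
    """
    rng = rng or np.random.default_rng(0)
    base_sampler = base_sampler or (lambda size: rng.standard_normal(size))
    V = rng.beta(1.0, c, size=N)
    V[-1] = 1.0                                   # V_N = 1 closes the stick
    stick_left = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    p = stick_left * V                            # p_k = V_k * prod_{l<k} (1 - V_l)
    Y = base_sampler(N)
    return Y, p

Y, p = stick_breaking_dp()
print(p.sum())                                    # exactly 1 by construction
print(p[Y < 0].sum())                             # E(P((-inf, 0))) = H_bar((-inf, 0)) = 1/2
```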
Proof. It suffices to show that, for a given measurable partition $(B_1, \dots, B_k)$ of $H$,
$$(P(B_1), \dots, P(B_k)) \sim D(\alpha(B_1), \dots, \alpha(B_k)).$$
Let $U_i = (\mathbb{1}_{(Y_i \in B_1)}, \dots, \mathbb{1}_{(Y_i \in B_k)})$, $i = 1, 2, \dots$. We need to show that
$$\sum_{i=1}^\infty p_i U_i \sim D(\alpha(B_1), \dots, \alpha(B_k)).$$
Let $\bar{P} \sim D(\alpha(B_1), \dots, \alpha(B_k))$ be independent of the $V_i$'s and $Y_i$'s. By a property of the Dirichlet distribution,
$$V_n U_n + (1 - V_n) \bar{P} \sim D(\alpha(B_1), \dots, \alpha(B_k)).$$
Again
$$V_{n-1} U_{n-1} + (1 - V_{n-1})\big(V_n U_n + (1 - V_n)\bar{P}\big) \sim D(\alpha(B_1), \dots, \alpha(B_k)),$$
that is,
$$V_{n-1} U_{n-1} + (1 - V_{n-1}) V_n U_n + (1 - V_{n-1})(1 - V_n)\bar{P} \sim D(\alpha(B_1), \dots, \alpha(B_k)).$$
Applying the same property once more,
$$V_{n-2} U_{n-2} + (1 - V_{n-2})\big[V_{n-1} U_{n-1} + (1 - V_{n-1}) V_n U_n + (1 - V_{n-1})(1 - V_n)\bar{P}\big] \sim D(\alpha(B_1), \dots, \alpha(B_k)),$$
which is
$$V_{n-2} U_{n-2} + (1 - V_{n-2}) V_{n-1} U_{n-1} + (1 - V_{n-2})(1 - V_{n-1}) V_n U_n + (1 - V_{n-2})(1 - V_{n-1})(1 - V_n)\bar{P} \sim D(\alpha(B_1), \dots, \alpha(B_k)).$$
Iterating this operation down to $V_1$ and noting that $1 - \sum_{i=1}^n p_i = \prod_{i=1}^n (1 - V_i)$, we get
$$\sum_{i=1}^n p_i U_i + \Big(1 - \sum_{i=1}^n p_i\Big) \bar{P} \sim D(\alpha(B_1), \dots, \alpha(B_k)).$$
Since $\big(1 - \sum_{i=1}^n p_i\big) \to 0$ a.s., letting $n \to \infty$ yields
$$\sum_{i=1}^\infty p_i U_i \sim D(\alpha(B_1), \dots, \alpha(B_k)). \qquad \Box$$

3.4 Some applications

In this section, $\alpha$ denotes a $\sigma$-additive non-null finite measure on $(H, \mathcal{A})$. We write $P \in D(\alpha)$ as a shorthand for the phrase "$P$ is a Dirichlet process on $(H, \mathcal{A})$ with parameter $\alpha$". In most of the applications we take $(H, \mathcal{A}) = (\mathbb{R}, \mathcal{B})$.

The nonparametric statistical decision problems we consider are typically described as follows. The parameter space is the set of all probability measures $P$ on $(H, \mathcal{A})$. The statistician is to choose an action $a$ in some space, thereby incurring a loss $L(P, a)$. There is a sample $X_1, \dots, X_n$ from $P$ available to the statistician, upon which he may base his choice of action. He seeks a Bayes rule with respect to the prior distribution $P \in D(\alpha)$. With such a prior distribution, the posterior distribution of $P$ given the observations is $D(\alpha + \sum_1^n \delta_{X_i})$. Thus, if we can find a Bayes rule for the no-sample problem (with $n = 0$), a Bayes rule for the $n$-sample problem may be found by replacing $\alpha$ with $\alpha + \sum_1^n \delta_{X_i}$ (Theorem 3.1.2). In the problems considered below, we first find the Bayes rule for the no-sample problem and then state the Bayes rule for the general problem. Mixtures of Dirichlet processes, as introduced by Antoniak, are used intensively in various applied fields.

3.5 Estimation of a distribution function

Let $(H, \mathcal{A}) = (\mathbb{R}, \mathcal{B})$, and let the space of actions of the statistician be the space of all distribution functions on $\mathbb{R}$. Let the loss function be
$$L(P, \hat{F}) = \int (F(t) - \hat{F}(t))^2 \, dW(t),$$
where $W$ is a given finite measure on $(\mathbb{R}, \mathcal{B})$ and $F(t) = P((-\infty, t])$. If $P \sim D(\alpha)$, then $F(t) \sim \beta(\alpha((-\infty, t]), \alpha((t, \infty)))$ for each $t$. The Bayes risk for the no-sample problem,
$$E(L(P, \hat{F})) = \int E\big((F(t) - \hat{F}(t))^2\big) \, dW(t),$$
is minimized by choosing $\hat{F}(t)$, for each $t$, to minimize $E((F(t) - \hat{F}(t))^2)$; this is achieved by choosing $\hat{F}(t) = E(F(t))$. Thus the Bayes rule for the no-sample problem is $\hat{F}(t) = E(F(t)) = F_0(t)$, where
$$F_0(t) = \alpha((-\infty, t]) / \alpha(\mathbb{R})$$
(we used $F(t) \sim \beta(\alpha((-\infty, t]), \alpha((t, \infty)))$) represents our prior guess at the shape of the unknown $F(t)$. For a sample of size $n$, the Bayes rule is therefore
$$\hat{F}_n(t \mid X_1, \dots, X_n) = \frac{\alpha((-\infty, t]) + \sum_1^n \delta_{X_i}((-\infty, t])}{\alpha(\mathbb{R}) + n} = p_n F_0(t) + (1 - p_n) F_n(t \mid X_1, \dots, X_n), \tag{3.18}$$
where
$$p_n = \frac{\alpha(\mathbb{R})}{\alpha(\mathbb{R}) + n} \qquad \text{and} \qquad F_n(t \mid X_1, \dots, X_n) = \frac{1}{n} \sum_1^n \delta_{X_i}((-\infty, t])$$
is the empirical distribution function of the sample. The Bayes rule (3.18) is a mixture of our prior guess at $F$ and of the empirical distribution function, with respective weights $p_n$ and $1 - p_n$. If $\alpha(\mathbb{R})$ is large compared to $n$, little weight is given to the empirical distribution function. One might interpret $\alpha(\mathbb{R})$ as a measure of faith in the prior guess at $F$, measured in units of numbers of observations. As $\alpha(\mathbb{R})$ tends to zero, the Bayes estimate tends to the empirical distribution function; and as $n \to \infty$, $p_n \to 0$, so $\hat{F}_n(t \mid X_1, \dots, X_n)$ converges to the true $F$ uniformly almost surely by the Glivenko-Cantelli theorem. The results for estimating a $k$-dimensional distribution function are completely analogous.
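A sketch of the Bayes estimate (3.18); the prior guess $F_0$ and the mass $\alpha(\mathbb{R})$ below are illustrative choices, not prescribed by the text.

```python
import numpy as np
from scipy.stats import norm

def bayes_cdf(t, data, F0, alpha_R):
    """Posterior-mean CDF (3.18) under a D(alpha) prior.

    F0      -- callable prior guess, F0(t) = alpha((-inf, t]) / alpha(R)
    alpha_R -- total mass alpha(R), interpretable as a prior sample size
    """
    t = np.asarray(t)
    data = np.asarray(data)
    n = data.size
    p_n = alpha_R / (alpha_R + n)
    F_emp = np.mean(data[None, :] <= t[:, None], axis=1)   # empirical CDF
    return p_n * F0(t) + (1.0 - p_n) * F_emp

rng = np.random.default_rng(3)
x = rng.normal(loc=1.0, size=50)            # data centered at 1
t = np.linspace(-3, 4, 8)
print(bayes_cdf(t, x, F0=norm.cdf, alpha_R=5.0))   # prior guess: standard normal
```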
3.6 Estimation of the mean

Again let $(H, \mathcal{A}) = (\mathbb{R}, \mathcal{B})$, and suppose the statistician is to estimate the mean $\mu = \int x \, dP(x)$ with squared error loss
$$L(P, \hat{\mu}) = (\mu - \hat{\mu})^2.$$
We assume $P \sim D(\alpha)$, where $\alpha$ has a finite first moment. The mean of the corresponding probability measure $\alpha(\cdot)/\alpha(\mathbb{R})$ is denoted by $\mu_0$:
$$\mu_0 = \int x \, d\alpha(x) / \alpha(\mathbb{R}). \tag{3.19}$$
By Theorem 3.2.4, the random variable $\mu$ exists. The Bayes rule for the no-sample problem is the expectation of $\mu$, which, by Theorem 3.2.3, is $\hat{\mu} = \mu_0$. For a sample of size $n$, the Bayes rule is therefore
$$\hat{\mu}(X_1, \dots, X_n) = (\alpha(\mathbb{R}) + n)^{-1} \int x \, d\Big(\alpha(x) + \sum_1^n \delta_{X_i}(x)\Big) = p_n \mu_0 + (1 - p_n) \bar{X}_n, \tag{3.20}$$
where $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$ is the sample mean and $p_n = \alpha(\mathbb{R})/(\alpha(\mathbb{R}) + n)$. The Bayes estimate thus lies between the prior guess at $\mu$, namely $\mu_0$, and the sample mean. As $\alpha(\mathbb{R}) \to 0$, $\hat{\mu}_n$ converges to $\bar{X}_n$. Also, as $n \to \infty$, $p_n \to 0$, so that in particular the Bayes estimate (3.20) is strongly consistent within the class of distributions with finite first moment.

More generally, for arbitrary $(H, \mathcal{A})$, if $Z$ is a real-valued measurable function defined on $(H, \mathcal{A})$, and if we are to estimate $\theta = \int Z \, dP$ with squared error loss and prior $P \sim D(\alpha)$, where $\alpha$ is such that $\theta_0 = \int Z \, d\alpha / \alpha(H) < \infty$, then the estimate $\hat{\theta} = \theta_0$ is Bayes for the no-sample problem. For a sample of size $n$,
$$\hat{\theta}_n(X_1, \dots, X_n) = p_n \theta_0 + (1 - p_n) \frac{1}{n} \sum_1^n Z(X_i)$$
is Bayes, where $p_n = \alpha(H)/(\alpha(H) + n)$. Results for estimating a mean vector in $k$ dimensions are completely analogous.
Chapter 4

Mixtures of continuous-time Dirichlet processes

In this chapter we first define, in Section 4.1, continuous-time Dirichlet processes. In Section 4.2 we examine the case of the Brownian-Dirichlet process (BDP), whose parameter is proportional to a standard Wiener measure, and we show that some stochastic calculus formulas (Ito's formula, the local time occupation formula) hold for BDPs. Next, in Section 4.3, we define mixtures of continuous-time Dirichlet processes and extend some rather nontrivial computations of Antoniak (1974) [1].

4.1 Continuous-time Dirichlet processes

From now on, we take for $H$ any standard Polish space of real functions defined on an interval $I \subset [0, \infty)$, for example the space $C(I)$ (resp. $D(I)$) of continuous (resp. cadlag) functions. For any $t \in I$, let $\pi_t : x \mapsto x(t)$ denote the usual projection at time $t$ from the space $H$ to $\mathbb{R}$. Recall that $\pi_t$ maps any measure $\mu$ on $H$ into a measure $\pi_t \mu$ on $\mathbb{R}$ defined by $\pi_t \mu(A) = \mu(\pi_t^{-1}(A))$ for any Borel subset $A$ of $\mathbb{R}$. The following proposition defines a continuous-time process $(X_t)$ such that each $X_t$ is a Ferguson-Dirichlet random distribution.

Proposition 4.1.1. (Emilion, 2005) Let $\alpha$ be any finite measure on $H$, let $X$ be a Ferguson-Dirichlet random distribution $D(\alpha)$ on $H$ and let $X_t = \pi_t X$. Then the continuous-time process $(X_t)_{t \in I}$ is such that for each $t \in I$, $X_t$ is a Ferguson-Dirichlet random distribution $D(\alpha_t)$ on $\mathbb{R}$, where $\alpha_t = \pi_t \alpha$. Moreover, if $(V^{(i)})$ is any i.i.d. sequence on $H$ such that $V^{(i)} \sim \frac{\alpha}{\alpha(H)}$ and
$$X(\omega) = \sum_{i=1}^\infty p_i(\omega) \, \delta_{V^{(i)}(\omega)},$$
where the sequence $(p_i)$ is independent of the $V^{(i)}$'s and has a Poisson-Dirichlet distribution $\mathrm{PD}(\alpha(H))$, then
$$X_t(\omega) = \sum_{i=1}^\infty p_i(\omega) \, \delta_{V^{(i)}(\omega)(t)}.$$

For the sake of simplicity we deal with just one parameter $\alpha$, but it can be noticed that a two-parameter continuous-time Dirichlet process $X_{t, \alpha, \beta}$ can be defined similarly, by using the two-parameter Poisson-Dirichlet distributions introduced in Pitman-Yor (1997) [44].

Proof. Let $k \in \{1, 2, 3, \dots\}$ and let $A_1, \dots, A_k$ be a measurable partition of $\mathbb{R}$. Then for any $t$, $\pi_t^{-1}(A_1), \dots, \pi_t^{-1}(A_k)$ is a measurable partition of $H$, so that, by definition of $X$, the joint distribution of the random vector $(X(\pi_t^{-1}(A_1)), \dots, X(\pi_t^{-1}(A_k)))$ is Dirichlet with parameters $(\alpha(\pi_t^{-1}(A_1)), \dots, \alpha(\pi_t^{-1}(A_k)))$. In other words, $(X_t(A_1), \dots, X_t(A_k))$ is Dirichlet with parameters $(\alpha_t(A_1), \dots, \alpha_t(A_k))$, and $X_t \sim D(\alpha_t)$.

A consequence of the definition of $\pi_t$ is that
$$\pi_t\Big(\sum_{i=1}^\infty \mu_i\Big) = \sum_{i=1}^\infty \pi_t \mu_i$$
for any sequence of positive measures on $H$, and $\pi_t(\lambda \mu) = \lambda \pi_t(\mu)$ for any positive real number $\lambda$. Hence if $(V^{(i)})$ is any i.i.d. sequence on $H$ such that $V^{(i)} \sim \frac{\alpha}{\alpha(H)}$ and $X(\omega) = \sum_{i=1}^\infty p_i(\omega) \delta_{V^{(i)}(\omega)}$, where $(p_i)$ has a Poisson-Dirichlet distribution $\mathrm{PD}(\alpha(H))$, then
$$X_t(\omega) = \pi_t(X(\omega)) = \sum_{i=1}^\infty p_i(\omega) \, \pi_t(\delta_{V^{(i)}(\omega)}) = \sum_{i=1}^\infty p_i(\omega) \, \delta_{V^{(i)}(\omega)(t)},$$
the last equality being due to the fact that $\pi_t(\delta_f) = \delta_{f(t)}$ for any $f \in H$, as is easily seen. In addition, the $V^{(i)}(t)$'s are i.i.d. with $V^{(i)}(t) \sim \pi_t\big(\frac{\alpha}{\alpha(H)}\big) = \frac{1}{\alpha(H)} \pi_t(\alpha) = \frac{1}{\alpha_t(\mathbb{R})} \alpha_t$. Moreover $(p_i)$ has a Poisson-Dirichlet distribution $\mathrm{PD}(\alpha(H)) = \mathrm{PD}(\alpha_t(\mathbb{R}))$, so the preceding expression of $X_t(\omega)$ is exactly the expression of a Ferguson-Dirichlet random distribution $D(\alpha_t)$ as a random mixture of random Dirac masses. □

As a corollary of the above proof and of Sethuraman's stick-breaking construction (1994) (see Chapter 3, Section 3.3), we obtain the following result, which is of interest for simulating continuous-time Dirichlet processes. It shows that such processes of random distributions can be used to generate stochastic paths and to classify random curves.

Corollary 4.1.1. (Continuous-time stick-breaking construction) Let $\alpha$ be any finite measure on $H$ and $\alpha_t = \pi_t \alpha$. Let $c = \alpha(H)$ and $\bar{H} = \alpha/c$. For any integer $N$, let $V_1, \dots, V_{N-1}$ be i.i.d. $\mathrm{Beta}(1, c)$ and $V_N = 1$. Let $p_1 = V_1$, $p_k = (1 - V_1) \cdots (1 - V_{k-1}) V_k$, $k = 2, \dots, N$. Let $Z_1, \dots, Z_N$ be i.i.d. $\sim \bar{H}$. Then $P_{N, t} = \sum_{k=1}^N p_k \, \delta_{Z_{k, t}}$ converges a.e. to a continuous-time Dirichlet process $D(\alpha_t)$.

Corollary 4.1.2. Let $X_t$ be as in the preceding proposition. Then for any Borel subset $A$ of $\mathbb{R}$, $(X_t(A))_{t \ge 0}$ is a Beta process, i.e. for any $t \ge 0$,
$$X_t(A) \sim \mathrm{Beta}(\alpha_t(A), \alpha_t(A^c)).$$

4.2 Brownian-Dirichlet process

We suppose here that the parameter $\alpha$ is proportional to a standard Wiener measure $W$, so that the $V^{(i)}$'s above are i.i.d. standard Brownian motions, which we denote by $B^i$. The sequence $(p_i)$ is assumed to be Poisson-Dirichlet$(c)$, independent of $(B^i)_{i = 0, 1, \dots}$.

Definition 4.2.1. Let $X$ be a Dirichlet process such that $X \sim D(cW)$. The continuous-time process $(X_t)$ defined by $X_t = \pi_t X$, for any $t > 0$, is called a Brownian-Dirichlet process (BDP).

As observed in the previous proposition, $X_t$ is a random probability measure such that $X_t \sim D(c \, \mathrm{N}(0, t))$, and if we have a representation $X(\omega) = \sum_{i=1}^\infty p_i(\omega) \delta_{B^i(\omega)}$, then we also have
$$X_t(\omega) = \sum_{i=1}^\infty p_i(\omega) \, \delta_{B^i_t(\omega)}.$$
We show that stochastic calculus can be extended to such processes $(X_t)$. Consider the filtration defined by $\mathcal{F}_0 = \sigma(p_i, \, i \in \mathbb{N}^*)$ and, for any $s > 0$, $\mathcal{F}_s = \sigma\big(\mathcal{F}_0 \cup (\cup_i \, \sigma(B^i_u, u < s))\big)$.
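Combining stick-breaking weights with i.i.d. Brownian paths gives a direct way to simulate a truncated BDP. A sketch, assuming numpy; the Beta(1, c) stick-breaking weights are the size-biased version of the Poisson-Dirichlet(c) sequence, which is enough for simulating the random measure:

```python
import numpy as np

def simulate_bdp(c=5.0, N=50, n_steps=500, T=1.0, rng=None):
    """Truncated Brownian-Dirichlet process X_t = sum_i p_i delta_{B^i_t}.

    Returns the weights p (length N) and N standard Brownian paths on [0, T];
    at each time index, (p, paths[:, idx]) is a draw of the random measure X_t.
    """
    rng = rng or np.random.default_rng(0)
    V = rng.beta(1.0, c, size=N)
    V[-1] = 1.0                                   # close the stick
    p = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1]))) * V
    dt = T / n_steps
    increments = rng.normal(scale=np.sqrt(dt), size=(N, n_steps))
    paths = np.cumsum(increments, axis=1)         # i.i.d. Brownian paths B^i
    return p, paths

p, B = simulate_bdp()
mean_T = np.sum(p * B[:, -1])    # <X_T, id>, the random mean of X_T
print(mean_T, p.sum())
```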
4.2.1 Ito's formula

Proposition 4.2.1. Let $f \in C^2$ be such that there exists a constant $C \in \mathbb{R}$ with $\int_0^s (f'(B^i_u))^2 \, du < C$ for any $i$ and any $s > 0$. Then:

1. $M_t = \sum_{i=1}^\infty p_i(\omega) \int_0^t f'(B^i_u) \, dB^i_u$ is a well-defined $(\mathcal{F}_s)$-martingale;
2. $V_t = \frac{1}{2} \sum_{i=1}^{+\infty} p_i(\omega) \int_0^t f''(B^i_u) \, du$ is a well-defined process with bounded variation;
3. $\langle X_t - X_0, f \rangle = M_t + V_t$.

Proof. Let
$$M^n_t(\omega) = \sum_{i=1}^n p_i(\omega) \int_0^t f'(B^i_u) \, dB^i_u,$$
and let $s < t$. Let $0 = t_1^{(k)} < t_2^{(k)} < \dots < t_{r_k}^{(k)} = t$ be a sequence of subdivisions of $[0, t]$ such that
$$\int_0^t f'(B^i_u) \, dB^i_u = \lim_{k \to +\infty} \sum_{l=1}^{r_k} f'(B^i_{t_l^{(k)}})\big(B^i_{t_{l+1}^{(k)}} - B^i_{t_l^{(k)}}\big),$$
the limit being taken in $L^2$-norm. We now show that $M^n_t$ is a martingale. Note that we do not use below the fact that the sequence $(p_i)$ has a Poisson-Dirichlet distribution. For the sake of simplicity, in what follows we omit the superscript $(k)$ in $t_l^{(k)}$.
We have
$$E(M^n_t \mid \mathcal{F}_s) = \sum_{i=1}^n E\Big(p_i \int_0^t f'(B^i_u) \, dB^i_u \,\Big|\, \mathcal{F}_s\Big) = \lim_{k \to \infty} \Big\{\sum_{i=1}^n E\Big(p_i \sum_{\{l : t_l < s\}} f'(B^i_{t_l})(B^i_{t_{l+1}} - B^i_{t_l}) \,\Big|\, \mathcal{F}_s\Big) + \sum_{i=1}^n E\Big(p_i \sum_{\{l : t_l > s\}} f'(B^i_{t_l})(B^i_{t_{l+1}} - B^i_{t_l}) \,\Big|\, \mathcal{F}_s\Big)\Big\}.$$
In the case $t_l < s$, if in addition $t_{l+1} < s$, then
$$E\big(f'(B^i_{t_l})(B^i_{t_{l+1}} - B^i_{t_l}) \mid \mathcal{F}_s\big) = f'(B^i_{t_l})(B^i_{t_{l+1}} - B^i_{t_l}),$$
while if $t_{l+1} > s$, writing $B^i_{t_{l+1}} - B^i_{t_l} = B^i_{t_{l+1}} - B^i_s + B^i_s - B^i_{t_l}$, we see that
$$E\big(f'(B^i_{t_l})(B^i_{t_{l+1}} - B^i_{t_l}) \mid \mathcal{F}_s\big) = f'(B^i_{t_l})(B^i_s - B^i_{t_l}).$$
On the other hand, in the case $t_l > s$ we have
$$E\big(f'(B^i_{t_l})(B^i_{t_{l+1}} - B^i_{t_l}) \mid \mathcal{F}_s\big) = E\big(E(f'(B^i_{t_l})(B^i_{t_{l+1}} - B^i_{t_l}) \mid \mathcal{F}_{t_l}) \mid \mathcal{F}_s\big) = E\big(f'(B^i_{t_l}) \, E(B^i_{t_{l+1}} - B^i_{t_l} \mid \mathcal{F}_{t_l}) \mid \mathcal{F}_s\big) = 0.$$
Hence
$$E(M^n_t \mid \mathcal{F}_s) = \sum_{i=1}^n p_i \lim_{k \to \infty} \Big(\sum_{\{l : t_{l+1} < s\}} f'(B^i_{t_l})(B^i_{t_{l+1}} - B^i_{t_l}) + f'(B^i_{t_s})(B^i_s - B^i_{t_s})\Big),$$
where $t_s$ denotes the unique $t_l^{(k)}$ such that $t_l^{(k)} < s$ and $t_{l+1}^{(k)} > s$. Therefore
$$E(M^n_t \mid \mathcal{F}_s) = \sum_{i=1}^n p_i(\omega) \int_0^s f'(B^i_u) \, dB^i_u = M^n_s,$$
proving that $M^n_t$ is a martingale. Moreover, since
$$E\big((M^n_s)^2\big) = 2 \sum_{1 \le i < j \le n} E\Big(p_i p_j \int_0^s f'(B^i_u) dB^i_u \int_0^s f'(B^j_u) dB^j_u\Big) + \sum_{i=1}^n E\Big(p_i^2 \Big(\int_0^s f'(B^i_u) dB^i_u\Big)^2\Big)$$
$$= \sum_{i=1}^n E(p_i^2) \, E\Big(\int_0^s (f'(B^i_u))^2 \, du\Big) \le C \sum_{i=1}^\infty E(p_i^2) \le C$$
(the cross terms vanish by the independence of $B^i$ and $B^j$ and $E \int_0^s f'(B^i_u) dB^i_u = 0$), the martingale convergence theorem implies that $M^n_t$ converges to a martingale
$$M_t = \sum_{i=1}^\infty p_i(\omega) \int_0^t f'(B^i_u) \, dB^i_u.$$
Finally, applying Ito's formula to each $B^i$, we get
$$\langle X_t(\omega) - X_0(\omega), f \rangle = \sum_{i=1}^\infty p_i(\omega)\big(f(B^i_t) - f(B^i_0)\big) = \sum_{i=1}^\infty p_i(\omega) \int_0^t f'(B^i_u) \, dB^i_u + \frac{1}{2} \sum_{i=1}^\infty p_i(\omega) \int_0^t f''(B^i_u) \, du = M_t + V_t,$$
where $V_t$ is obviously a bounded variation process. □

Corollary 4.2.1. (Stochastic integral) Let $X_t$ be a BDP given by $X_t(\omega) = \sum_{i=1}^\infty p_i(\omega) \delta_{B^i_t(\omega)}$. Let $(Y_t)$ be a real-valued stochastic process and $\phi$ a bounded function defined on $\mathbb{R}$. Then the stochastic integral $\int \phi(Y_t) \, dX_t$ is defined as the measure such that
$$\Big\langle \int \phi(Y_t) \, dX_t, f \Big\rangle = \sum_{i=1}^\infty \int \phi(Y_t) \, p_i(\omega) f'(B^i_t) \, dB^i_t + \frac{1}{2} \sum_{i=1}^\infty \int \phi(Y_t) \, p_i(\omega) f''(B^i_t) \, dt,$$
for any function $f$ satisfying the conditions of the preceding proposition.

4.2.2 Local time

The following result exhibits the local time of a Brownian-Dirichlet process as a density of occupation time.

Proposition 4.2.2. Let $(X_t)$ be a BDP such that $X_t(\omega) = \sum_{i=1}^\infty p_i(\omega) \delta_{B^i_t(\omega)}$. Then for each $(T, x) \in \mathbb{R}_+ \times \mathbb{R}$ there exists a random distribution $L(T, x)$ such that
$$\int_{\mathbb{R}} L(T, x) f(x) \, dx = \int_0^T \langle X_s, f \rangle \, ds,$$
for any $f$ Borel measurable and locally integrable on $\mathbb{R}$.

Proof. Let $L^i(T, x)$ be the local time of $B^i$, so that for any $i \in \mathbb{N}$ we have
$$\int_{\mathbb{R}} L^i(T, x) f(x) \, dx = \int_0^T f(B^i_s) \, ds$$
and
$$\int_{\mathbb{R}} \sum_{i=1}^n p_i L^i(T, x) f(x) \, dx = \int_0^T \sum_{i=1}^n p_i f(B^i_s) \, ds.$$
Then, if $f \in L^+_\infty$, the monotone convergence theorem yields
$$\int_{\mathbb{R}} \sum_{i=1}^\infty p_i L^i(T, x) f(x) \, dx = \int_0^T \sum_{i=1}^\infty p_i f(B^i_s) \, ds,$$
and the same holds for $f \in L_\infty$ by using $f = f_+ - f_-$. Letting $L(T, x) = \sum_{i=1}^\infty p_i L^i(T, x)$, we get the desired result. □

4.2.3 Diffusions

Definition 4.2.2. A stochastic process $(\psi_t)$ is called a diffusion w.r.t. the BDP $(X_t)$ if it has a.s. continuous paths and can be represented as
$$\psi_t = \psi_0 + \int_0^t a(s) \, ds + \sum_{i=0}^\infty p_i(\omega) \int_0^t b_{i, s} \, dB^i_s,$$
where $a \in L^1(\mathbb{R}_+)$ and $b_i \in L^2(\mathbb{R}_+)$ for any integer $i$.

The following result can be proved using the Banach fixed point theorem, as in the classical case of a single Brownian motion.

Proposition 4.2.3. Suppose that $f$ and $g_i$, $i = 0, 1, \dots$, are Lipschitz functions from $\mathbb{R}$ to $\mathbb{R}$. Let $u_0$ be an $\mathcal{F}_0$-measurable square integrable random variable. Then there exists a diffusion $(\psi_t)$ w.r.t. the BDP $(X_t)$ such that
$$d\psi_t = f(\psi_t) \, dt + \sum_{i=0}^\infty p_i \, g_i(\psi_t) \, dB^i_t, \qquad \psi_0 = u_0. \tag{4.1}$$

4.3 Mixtures of Dirichlet processes

The following definitions are due to C. Antoniak [1].

4.3.1 Antoniak mixtures

Let $(U, \mathcal{B}, H)$ be a probability space, called the index space, and let $(\Theta, \mathcal{A})$ be a measurable space of parameters.

Definition 4.3.1. A transition measure on $U \times \mathcal{A}$ is a mapping $\alpha$ from $U \times \mathcal{A}$ into $[0, \infty)$ such that

1. for any $u \in U$, $\alpha(u, \cdot)$ is a finite, nonnegative, non-null measure on $(\Theta, \mathcal{A})$;
2. for every $A \in \mathcal{A}$, $\alpha(\cdot, A)$ is measurable on $(U, \mathcal{B})$.

Note that this differs from the definition of a transition probability in that $\alpha(u, \Theta)$ need not be identically one, as we want $\alpha(u, \cdot)$ to be a parameter of a Dirichlet process.

Definition 4.3.2. A random distribution $P$ is a mixture of Dirichlet processes on $(\Theta, \mathcal{A})$ with mixing distribution $H$ and transition measure $\alpha$ if for all $k = 1, 2, \dots$ and any measurable partition $A_1, A_2, \dots, A_k$ of $\Theta$ we have
$$\mathcal{P}\{P(A_1) \le y_1, \dots, P(A_k) \le y_k\} = \int_U D(y_1, \dots, y_k \mid \alpha(u, A_1), \dots, \alpha(u, A_k)) \, dH(u),$$
where $D(y_1, \dots, y_k \mid \alpha_1, \dots, \alpha_k)$ denotes the distribution function of the Dirichlet distribution with parameters $(\alpha_1, \dots, \alpha_k)$. In concise symbols we will use the heuristic notation
$$P \sim \int_U D(\alpha(u, \cdot)) \, dH(u).$$
Roughly, we may consider the index $u$ as a random variable with distribution $H$; given $u$, $P$ is a Dirichlet process with parameter $\alpha(u, \cdot)$. In fact $U$ can be defined as the identity-mapping random variable, and we will write $\mid u$ for "$U = u$". In this alternative notation, with $\alpha_u = \alpha(u, \cdot)$:
$$u \sim H, \qquad P \mid u \sim D(\alpha_u). \tag{4.2}$$

4.3.2 Mixtures of continuous-time Dirichlet processes

We now consider the case where $\alpha_u$ is a finite measure on a function space like $C(I)$ or $D(I)$ (spaces defined in Section 4.1). The following proposition defines a continuous-time process $(P_t)_t$ such that each $P_t$ is a mixture of Dirichlet processes.

Proposition 4.3.1. Let $P$ be a mixture of Dirichlet processes, $P \sim \int_U D(\alpha_u) \, dH(u)$, and let $P_t = \pi_t P$. Then, for each $t \ge 0$, $P_t$ is a mixture of Dirichlet processes:
$$P_t \sim \int_U D(\alpha_{u, t}) \, dH(u), \qquad \text{where } \alpha_{u, t} = \alpha_u(\pi_t^{-1}(\cdot)).$$

Proof. Let $A_1, A_2, \dots, A_k$ be a partition of $\mathbb{R}$. Then
$$\mathcal{P}[P_t(A_1) \le y_1, \dots, P_t(A_k) \le y_k] = \mathcal{P}[P(\pi_t^{-1}(A_1)) \le y_1, \dots, P(\pi_t^{-1}(A_k)) \le y_k] = \int_U D\big(y_1, y_2, \dots, y_k \mid (\alpha_u(\pi_t^{-1}(A_i)))_{1 \le i \le k}\big) \, dH(u),$$
since $\pi_t^{-1}(A_1), \pi_t^{-1}(A_2), \dots, \pi_t^{-1}(A_k)$ is a partition of $\Theta$. Therefore $P_t \sim \int_U D(\alpha_{u, t}) \, dH(u)$. □
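A sketch of sampling from (4.2): first draw the index $u \sim H$, then draw $P \mid u$ by stick-breaking. The particular choices of $H$ and $\alpha_u$ below are illustrative assumptions, not from the text.

```python
import numpy as np

def sample_mixture_of_dps(n_draws=3, N=500, rng=None):
    """Draw realizations of P ~ integral of D(alpha_u) dH(u), as in (4.2).

    Illustrative choices: H is Exp(1) on U = (0, inf), and alpha_u has
    precision u and base measure N(u, 1), so alpha_u = u * N(u, 1).
    """
    rng = rng or np.random.default_rng(0)
    draws = []
    for _ in range(n_draws):
        u = rng.exponential(1.0)                  # index u ~ H
        V = rng.beta(1.0, u, size=N)
        V[-1] = 1.0
        p = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1]))) * V
        atoms = rng.normal(loc=u, scale=1.0, size=N)
        draws.append((u, atoms, p))               # (index, atoms, weights)
    return draws

for u, atoms, p in sample_mixture_of_dps():
    print(u, np.sum(p * atoms))                   # mean of each realized P
```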
4.3.3 Posterior distributions

We suppose now that the sample space of observations is $\mathcal{X} = C(\mathbb{R}_+)$, where $C(\mathbb{R}_+)$ denotes the space of continuous functions from $\mathbb{R}_+$ to $\mathbb{R}$. Let $F$ be a transition probability from $\Theta \times \zeta$ into $[0, 1]$. Let $\theta_t$ be a sample from $P_t$, i.e. $\theta_t \mid P_t, u \sim P_t$, and let $X(t) \mid P_t, \theta_t, u \sim F(\theta_t, \cdot)$. Let $H_x$ denote the conditional distribution of $(\theta_t, u)$ given $X(t) = x$, and let $H_{\theta_t}$ denote the conditional distribution of $u$ given $\theta_t$. The following proposition shows that if $(P_t)$ is a mixture of Dirichlet processes, then for each $t \in \mathbb{R}_+$ the posterior distribution of $P_t$ is also a mixture of Dirichlet processes.

Proposition 4.3.2. If for any $t \in \mathbb{R}_+$
$$P_t \mid u \sim D(\alpha_{u, t}), \quad u \sim H, \quad P_t \sim \int_U D(\alpha_{u, t}) \, dH(u), \quad \theta_t \mid P_t, u \sim P_t, \quad X(t) \mid P_t, \theta_t, u \sim F(\theta_t, \cdot), \tag{4.3}$$
then
$$P_t \mid X(t) = x \ \sim \ \int_{\Theta \times U} D(\alpha_{u, t} + \delta_{\theta_t}) \, dH_x(\theta_t, u).$$

Proof. Let $A_1, A_2, \dots, A_k$ be a partition of $\mathbb{R}$. Then
$$\mathcal{P}[P_t(A_i) \le y_i, \ 1 \le i \le k \mid X(t) = x] = E\big[\mathcal{P}[P_t(A_i) \le y_i, \ i = 1, \dots, k \mid X(t) = x, \theta_t, u] \mid X(t) = x\big]$$
$$= E\big[D(y_1, y_2, \dots, y_k \mid \beta_{u, t}(A_1), \dots, \beta_{u, t}(A_k)) \mid X(t) = x\big] = \int_{U \times \Theta} D(y_1, \dots, y_k \mid \beta_{u, t}(A_1), \dots, \beta_{u, t}(A_k)) \, dH_x(u, \theta_t),$$
where $\beta_{u, t}(A_i) = \alpha_{u, t}(A_i) + \delta_{\theta_t}(A_i)$ for $i = 1, \dots, k$. Therefore
$$P_t \mid X(t) = x \ \sim \ \int_{\Theta \times U} D(\alpha_{u, t} + \delta_{\theta_t}) \, dH_x(\theta_t, u). \qquad \Box$$

As a corollary, let us show that the same result holds if $(P_t)$ is simply a continuous-time Dirichlet process: the posterior distribution of $P_t$ given $X(t) = x$ is still a mixture of continuous-time Dirichlet processes.

Corollary 4.3.1. If
$$P_t \sim D(\alpha_t), \qquad \theta_t \sim P_t, \qquad X(t) \mid P_t, \theta_t \sim F(\theta_t, \cdot), \tag{4.4}$$
then
$$P_t \mid X(t) = x \ \sim \ \int_\Theta D(\alpha_t + \delta_{\theta_t}) \, dH_x(\theta_t).$$

Proof. Let $A_1, A_2, \dots, A_k$ be a partition of $\mathbb{R}$. Then
$$\mathcal{P}[P_t(A_i) \le y_i, \ 1 \le i \le k \mid X(t) = x] = E\big[\mathcal{P}[P_t(A_i) \le y_i, \ 1 \le i \le k \mid X(t) = x, \theta_t] \mid X(t) = x\big]$$
$$= E\big[D(y_1, y_2, \dots, y_k \mid \beta_t(A_1), \beta_t(A_2), \dots, \beta_t(A_k)) \mid X(t) = x\big] = \int_\Theta D(y_1, y_2, \dots, y_k \mid \beta_t(A_1), \dots, \beta_t(A_k)) \, dH_x(\theta_t),$$
where $\beta_t(A_i) = \alpha_t(A_i) + \delta_{\theta_t}(A_i)$, $i \in \{1, 2, \dots, k\}$. Therefore
$$P_t \mid X(t) = x \ \sim \ \int_\Theta D(\alpha_t + \delta_{\theta_t}) \, dH_x(\theta_t). \qquad \Box$$

Corollary 4.3.2. If for any $t \in \mathbb{R}_+$, $P_t \sim \int_U D(\alpha_{u, t}) \, dH(u)$ and $\theta_t \sim P_t$, then for any $t \in \mathbb{R}_+$,
$$P_t \mid \theta_t \ \sim \ \int_U D(\alpha_{u, t} + \delta_{\theta_t}) \, dH_{\theta_t}(u).$$

Proof. Let $A_1, A_2, \dots, A_k$ be a partition of $\mathbb{R}$. Then
$$\mathcal{P}[P_t(A_i) \le y_i, \ i = 1, \dots, k \mid \theta_t] = E\big[\mathcal{P}[P_t(A_i) \le y_i, \ i = 1, \dots, k \mid \theta_t, u] \mid \theta_t\big]$$
$$= E\big[D(y_1, y_2, \dots, y_k \mid \beta_{u, t}(A_1), \dots, \beta_{u, t}(A_k)) \mid \theta_t\big] = \int_U D(y_1, y_2, \dots, y_k \mid \beta_{u, t}(A_1), \dots, \beta_{u, t}(A_k)) \, dH_{\theta_t}(u).$$
Therefore $P_t \mid \theta_t \sim \int_U D(\alpha_{u, t} + \delta_{\theta_t}) \, dH_{\theta_t}(u)$. □

Chapter 5

Continuous-time Dirichlet hierarchical models

In some recent and interesting papers, hierarchical models with a Dirichlet prior, in short Dirichlet hierarchical models, were used in probabilistic classification applied to various fields such as biology [1], astronomy [24] or text mining [4]. Actually, these models can be seen as complex mixtures of real Gaussian distributions fitted to non-temporal data. The aim of this chapter is to extend these models and estimate their parameters in order to deal with temporal data following a stochastic differential equation (SDE).

The chapter is organized as follows. In Section 5.1 we briefly recall Dirichlet hierarchical models. In Section 5.2 we consider the case of a Brownian motion with a Dirichlet prior on its variance, which is shown to be a limit of a random walk in Dirichlet random environment. As an application, we estimate, in Sections 5.3 and 5.4, regime-switching models with stochastic drift and volatility. In Section 5.5 we consider the case of functional data such as signals or solutions of SDEs; computing some posterior distributions in the multivariate case, the preceding method is extended in order to classify such functional data.

5.1 Dirichlet hierarchical models

Let $P \sim D(cH)$ denote a Dirichlet process with precision parameter $c > 0$ and mean parameter $H$, where $H$ is a probability measure on a Polish space $\mathcal{X}$. It is well known that $P$ can be approximated by
$$P = \sum_{k=1}^N p_k \, \delta_{X_k}(\cdot),$$
where
$$X_i \overset{\text{iid}}{\sim} H, \qquad (p_i) \sim \mathrm{SB}(c, N), \qquad (p_i) \perp (X_i), \tag{5.1}$$
$\mathrm{SB}(c, N)$ denoting the stick-breaking scheme of Sethuraman.
We will say that (X_i)_{i=1,2,...} follows a Dirichlet hierarchical model if

X_i | P iid∼ P,   P ∼ D(cH).   (5.2)

5.2 Brownian motion in Dirichlet random environment

5.2.1 Random walks in random Dirichlet environment

Let D(cα) denote a Dirichlet process with parameters c > 0 and α, a finite measure on a Polish space X. Consider a random variable V and a sequence (U_i) of random variables defined by the following hierarchical model:

U_i | V = σ iid∼ N(0, σ²),
V⁻¹ | P ∼ P,
P | c ∼ D(c Γ(ν_1, ν_2)),
c ∼ Γ(η_1, η_2).   (5.3)

Since V⁻¹ is sampled from a Dirichlet process, we have V < ∞ a.s., because

P(V < ∞) = E( P(V ∈ R | P) ) = E(P(R)) = 1.

Hence we may consider the following random walk (S_n)_{n∈N} in Dirichlet random environment, starting from 0:

S_n = U_1 + U_2 + . . . + U_n.

For any real number t ≥ 0, let

S_t^n = n^{-1/2} S_{[nt]},   (5.4)

where [x] denotes the integer part of x. Let B^σ = σB denote a zero-mean Brownian motion with variance σ², B denoting a standard Brownian motion independent of V.

Proposition 5.2.1. (S_t^n)_{t≥0} →^d VB.

Proof. Let E = C(R_+) be the space of real-valued continuous functions defined on R_+. For any bounded continuous function f defined on E we have

∫_E f((S_t^n)) dP = ∫_R ( ∫_E f(x) dP_{(S_t^n) | V=σ} ) dP_V(σ).

But a standard result on the convergence of Gaussian random walks is that

∫_E f(x) dP_{(S_t^n) | V=σ} → ∫_E f(x) dP_{B^σ},

and this integral is dominated by ‖f‖. Hence, by the dominated convergence theorem,

∫ f((S_t^n)_{t≥0}) dP → ∫_R ( ∫_E f(x) dP_{B^σ}(x) ) dP_V(σ) = ∫_R ( ∫_E f(σx) dP_B ) dP_V(σ) = ∫ f(VB) dP,

the last equality being due to the fact that B and V are independent.

Definition 5.2.1. A Brownian motion in Dirichlet random environment (BMDE) is a process Z such that the conditional law of Z given V = σ is L(B^σ), with

V⁻¹ | P ∼ P,
P | c ∼ D(c Γ(ν_1, ν_2)),
c ∼ Γ(η_1, η_2).

So the above random walks in Dirichlet environment converge to a BMDE.

Proposition 5.2.2. If Z is a BMDE then its conditional increments are independent Gaussians:

Z_{t_i} − Z_{t_{i−1}} | V = σ ∼ N(0, (t_i − t_{i−1}) σ²).

The increments Z_{t_i} − Z_{t_{i−1}} are orthogonal mixtures of Gaussians but need not be independent.

5.2.2 Simulation algorithm

In order to simulate M paths Z¹, . . . , Z^M of a BMDE, each observed at times (Z_0 = 0, Z_{t_1}, . . . , Z_{t_n}), proceed as follows (see the sketch below). Let dt = t_{i+1} − t_i > 0 be small enough and let K be the stick-breaking truncation level.

– Draw c from Γ(η_1, η_2) and draw q = (q_1, q_2, . . . , q_K) from SB(c, K).
– Draw x = (x_1, x_2, . . . , x_K) with the x_i iid ∼ Γ(ν_1, ν_2).
– Repeat M times: draw σ⁻¹ from Σ_{i=1}^K q_i δ_{x_i}, set Z_0 = 0 and draw n points Z_{t_i} such that Z_{t_{i+1}} − Z_{t_i} ∼ N(0, σ² dt).
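The algorithm above can be transcribed almost line by line; the Gamma rate convention (Γ(a, b) with rate b, drawn in numpy as gamma(a, 1/b)) and the parameter values below are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_bmde(M, n, dt, K, nu1, nu2, eta1, eta2):
    """Sketch of the simulation algorithm of Section 5.2.2."""
    c = rng.gamma(eta1, 1.0 / eta2)                    # c ~ Gamma(eta1, eta2)
    V = rng.beta(1.0, c, size=K); V[-1] = 1.0          # stick-breaking SB(c, K)
    q = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    x = rng.gamma(nu1, 1.0 / nu2, size=K)              # atoms x_i ~ Gamma(nu1, nu2)
    paths = np.zeros((M, n + 1))
    for m in range(M):
        sigma = 1.0 / x[rng.choice(K, p=q)]            # sigma^{-1} ~ sum_i q_i delta_{x_i}
        increments = rng.normal(0.0, sigma * np.sqrt(dt), size=n)
        paths[m, 1:] = np.cumsum(increments)           # Z_0 = 0
    return paths

Z = simulate_bmde(M=50, n=1000, dt=1e-3, K=100, nu1=2.0, nu2=2.0, eta1=2.0, eta2=4.0)
print(Z.shape, Z[:, -1].std())
```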
5.2.3 Estimation

By Proposition 5.2.2, given an observed path (z_{t_i}) of a BMDE, an estimate of its parameters can be obtained by running the Ishwaran and James blocked Gibbs algorithm, with zero means and equal variances, on the data z_{t_{i+1}} − z_{t_i} (see Ishwaran and James (2002), Section 3).

Figure 5.1: M paths of a BMDE and the non-Gaussian density of (Z_{t_i}¹, . . . , Z_{t_i}^M).

5.3 Description of the model

Let (Ω, F, F_t, P) be a stochastic basis and (W_t) a one-dimensional Wiener process adapted to it. We consider a stochastic process satisfying the SDE

dX_t = b(t, X_t) dt + θ(t) h(X_t) dW_t,   (2)

where the function h(·) is assumed to be known, the volatility coefficient θ(·) is an unknown function of time that has to be estimated, and the drift coefficient b(t, x) may be unknown. We observe one sample path of the process (X_t, t ∈ [0, T]) at the discrete times t_i = i△ for i = 1, . . . , N. The sampling interval △ is small in comparison with T, and we assume that N := T△⁻¹ is an integer. We will use the following assumptions:

• (A0): θ(t) is adapted to the filtration F_t, b(t, ·) is a non-anticipative map, and there exists L_T > 0 such that, for all t ∈ [0, T], E(θ(t)⁴) ≤ L_T and E(θ(t)⁸) ≤ L_T.

• (A1): θ(·) = Σ_{ρ=0}^f θ_ρ 1_{[t_ρ, t_{ρ+1})}(·), where the t_ρ are the volatility jump times.

• (A2): there exists m > 0 such that θ²(·) is almost surely Hölder continuous of order m with a constant K(ω), and E(K(ω)²) < +∞.

If we assume that the volatility jump times coincide with the sampling times t_i = i△, we have:

• (A1'): θ(·) = Σ_{i=0}^N θ_i 1_{[t_i, t_{i+1})}(·); we then denote δθ_i² = θ_{i+1}² − θ_i².

If moreover there is at most one change time in each window, we get (A3):

• (A3): (A1) and (A1') are satisfied and inf_{ρ=0,...,f} |t_{ρ+1} − t_ρ| ≥ A△.

Remark 5.3.1. If θ(t) satisfies an SDE then (A2) is fulfilled; see e.g. Revuz and Yor (1991).

We need to control ∫_{t_i}^{t_{i+1}} b⁴(s, X_s) ds, so we will use:

• (B1): ∃ K_T > 0, ∀t ∈ [0, T], E(b(t, X_t)⁴) ≤ K_T.

In all the sequel we work on the simplified model:

dX_t = b_1(t, X_t) dt + θ(t) dW_t.   (3)

Under some natural assumptions, the model (2) becomes (3) after the following change of variable.

Proposition 5.3.1 (Pierre Bertrand). Assume that there exists a domain D ⊆ R such that h ∈ C(D, R_+ − {0}), the space of continuous functions from D to R_+ − {0}, h⁻¹ ∈ L¹_loc(D), and that the solution (X_t) of (2) satisfies P(X_t ∈ D, ∀t ∈ [0, T]) = 1. Let H(x) = ∫^x h⁻¹(ξ) dξ. Then Y_t = H(X_t) satisfies the SDE (3) with

b_1(t, x) = h⁻¹(x) b(t, x) − (1/2) h′(x) θ²(t).

5.4 Estimation of the volatility using the Haar wavelet basis

Since the size of the window appears in numerical applications as a free parameter to be chosen arbitrarily, we give a description of the estimator introduced by Pierre Bertrand [6]:

H_{A,△}(t) = Σ_{k=0}^{N/A−1} ( A⁻¹ Σ_{i=0}^{A−1} (X_{t_{kA+i+1}} − X_{t_{kA+i}})² ) 1_{[t_{kA}, t_{(k+1)A})}(t).   (5.5)
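A sketch of the windowed estimator (5.5) on simulated data follows. Dividing the windowed average of squared increments by the sampling step △, so that the output estimates θ²(t) itself, is our normalization choice, not spelled out in the text.

```python
import numpy as np

def window_volatility(X, delta, A):
    """Piecewise-constant volatility estimate in the spirit of (5.5): on each
    window of A sampling intervals, average the squared increments, then
    divide by the sampling step to estimate theta^2(t)."""
    dX2 = np.diff(X) ** 2
    nwin = len(dX2) // A
    theta2_hat = dX2[: nwin * A].reshape(nwin, A).mean(axis=1) / delta
    return np.repeat(theta2_hat, A)          # value on each sampling interval

# Synthetic check: theta jumps from 0.5 to 2.0 halfway through the sample.
rng = np.random.default_rng(3)
delta, N = 1e-3, 4000
theta = np.where(np.arange(N) < N // 2, 0.5, 2.0)
X = np.concatenate(([0.0], np.cumsum(theta * rng.normal(0, np.sqrt(delta), N))))
est = window_volatility(X, delta, A=100)
print(np.sqrt(est[:5]), np.sqrt(est[-5:]))   # roughly 0.5, then roughly 2.0
```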
5.5 SDE in Dirichlet random environment

More generally, consider the following model. During the observation time interval [0, T] the process X_t evolves according to various regimes. Regime R_j holds during a random time interval [T_{j−1}, T_j), where 0 = T_0 < T_1 < T_2 < . . . < T_L = T. The drift and the variance are randomly chosen in each regime but do not change during the regime, so

dX_t = Σ_{j=1}^L µ_{R_j} 1_{[T_{j−1}, T_j)}(t) dt + Σ_{j=1}^L σ_{R_j} 1_{[T_{j−1}, T_j)}(t) dB_t,

where the R_j ∈ {1, . . . , N} are random positive integers such that

R_j | p iid∼ Σ_{k=1}^N p_k δ_k(·),
(µ_k, σ_k) | θ ∼ N(θ, σ_µ) ⊗ Γ(η_1, η_2),   k = 1, . . . , N,
p | α ∼ SB(α, N),
α ∼ Γ(ν_1, ν_2),
θ ∼ N(0, A).

5.5.1 Estimation

The above process (X_t) is observed at discrete times i dt, i = 0, 1, 2, . . . , n, and it is assumed that the regime changes occur at these times. The estimation of the above parameters can be done through the Ishwaran and James blocked Gibbs algorithm, where their class-label variable K is our regime R:

∆X_i | R, µ, σ ind∼ N(µ_{R_i}, σ_{R_i}),
R_i | p iid∼ Σ_{k=1}^N p_k δ_k(·),
µ_k | θ ∼ N(θ, σ_µ),
σ_k ∼ Γ(η_1, η_2),
p | α ∼ SB(α, N),
α ∼ Γ(ν_1, ν_2),
θ ∼ N(0, A).
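A simulation sketch of this hierarchical model follows, with a regime label drawn at every sampling time (as the estimation model above allows) and arbitrary hyperparameter values; recall that σ denotes a variance here, and Gamma(a, b) with rate b is drawn as numpy's gamma(a, 1/b).

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_regime_sde(n, dt, N=10, sigma_mu=1.0, eta=(2.0, 4.0),
                        nu=(2.0, 4.0), A=1.0):
    """Sketch of the regime-switching model of Section 5.5."""
    theta = rng.normal(0.0, np.sqrt(A))
    alpha = rng.gamma(nu[0], 1.0 / nu[1])
    V = rng.beta(1.0, alpha, size=N); V[-1] = 1.0          # p | alpha ~ SB(alpha, N)
    p = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    mu = rng.normal(theta, np.sqrt(sigma_mu), size=N)      # mu_k | theta
    sig2 = rng.gamma(eta[0], 1.0 / eta[1], size=N)         # sigma_k (variances)
    R = rng.choice(N, size=n, p=p)                         # regime label per step
    dX = rng.normal(mu[R] * dt, np.sqrt(sig2[R] * dt))     # increments per regime
    return np.concatenate(([0.0], np.cumsum(dX))), R

X, R = simulate_regime_sde(n=480, dt=1.0)
print(X[-1], np.bincount(R, minlength=10))
```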
5.5.2 Option pricing in a regime-switching market

The above setting can be used in the option pricing problem with X_t = log(S_t), where (S_t)_{t≥0} is the stock price process governed by a geometric Brownian motion and σ_{R_i} is a stochastic volatility during regime R_i. Observe that the estimations are done here without using any sliding-window technique and without assuming that T_j − T_{j−1} is exponentially distributed, as is done with Markov chains in regime-switching markets.

Definition 5.5.1. Suppose X is an n × p matrix, each row of which is independently drawn from a p-variate normal distribution with zero mean: X_{(i)} = (x_1^i, . . . , x_p^i)^T ∼ N_p(0, V). Then the Wishart distribution is the probability distribution of the p × p random matrix

W = X^T X = Σ_{i=1}^n X_{(i)} X_{(i)}^T.

One indicates that W has that probability distribution by writing W ∼ W(n, V). The positive integer n is the number of degrees of freedom.

5.6 Classification of trajectories

We consider the problem of classifying a set of n functions representing signals, stock prices and so on. Each function is known through a finite-dimensional vector of observed points. In order to classify these functions, we now extend the blocked Gibbs algorithm to vector data. First let us make our model precise.

5.6.1 Hierarchical Dirichlet model for vector data

In the finite d-dimensional normal mixture problem, we observe data f = (f_1, f_2, . . . , f_n), where the f_i are iid random curves with finite Wiener mixture density; each curve f_i can be represented and approximated by the vector f̃_i = (△_1 f_i, △_2 f_i, . . . , △_L f_i), with

ψ_P(f) = ∫_{R×R_+} φ(f | σ(y)) dP(y) = Σ_{k=1}^d p_{k,0} φ(f | σ_k),   (5.6)

where φ(f | σ) represents a d-dimensional normal distribution with mean 0 and variance matrix σ. Based on the data, we would like to estimate the unknown mixture distribution P. We can devise a Gibbs sampling scheme for exploring the posterior P_N | f. Notice that the model derived from (5.6) also contains hidden variables K = {K_1, . . . , K_m} since it can also be expressed as

f̃_i | K, W, µ iid∼ N_L(µ_{K_i}, △t_i W_{K_i}),
K_i | p ∼ Σ_{k=1}^N p_k δ_k(·),   (5.7)
µ_k | θ ∼ N_L(θ, σ_µ),
W_k ∼ W(s, V),
θ ∼ N(0, A),

where W(s, V) and N_L(µ, σ) denote a Wishart and a multivariate Gaussian distribution respectively, and p ∼ SB(c, N). Note that a similar model for vector data appears in Caron et al. (2006), but in our case the parameters of the Wishart prior are updated at each iteration. In addition, we have a clustering problem, which justifies the use of the hidden variables K_i. In particular we will need to compute the posterior distribution of the class variable K and of the weight variable p. To implement the blocked Gibbs sampler we iteratively draw values from the following conditional distributions:

µ | K, W, θ, f
W | K, µ, f
K | p, µ, W, f
p | K, α
α | p
θ | µ.

5.6.2 Posterior computations

Blocked Gibbs algorithm for vector data. Let {K_1⋆, . . . , K_m⋆} denote the current m unique values of K. In each iteration of the Gibbs sampler we simulate:

• (a) Conditional for µ: for each j ∈ {K_1⋆, . . . , K_m⋆}, draw

µ_j | W, K, θ, f ind∼ N_l(µ_j⋆, W_j⋆),   where µ_j⋆ = Σ_{i: K_i = j} f̃_i + θ and W_j⋆ = σ_µ;

also, for each j ∈ K − K⋆, independently simulate µ_j ∼ N_l(θ, σ_µ).

• (b) Conditional for W: for each j ∈ {K_1⋆, . . . , K_m⋆}, draw

W_j | µ, K, f ind∼ W( s, Σ_{i: K_i = j} (f̃_i − µ_j)(f̃_i − µ_j)^T + V ),

where W(V, p) denotes the Wishart distribution with parameters V and p.

• (c) Conditional for K:

K_i | p, µ, W, f ind∼ Σ_{h=1}^N p_{h,i} δ_h(·),   i = 1, . . . , n,

where for each h = 1, 2, . . . , N

p_{h,i} ∝ p_h ( (2π)^{l/2} det(W_h)^{1/2} )^{−n_h} exp( −(1/2) ⟨ Σ_{d: K_d⋆ = h} (f̃_d − µ_h)(f̃_d − µ_h)^T, W_h ⟩ ),

and ⟨A, B⟩ is the trace of AB.

• (d) Conditional for p: for any integer N, let V_1, . . . , V_{N−1} be iid β(1, c) and V_N = 1. Let p_1 = V_1⋆ and p_k = (1 − V_1⋆) · · · (1 − V_{k−1}⋆) V_k⋆, k = 2, . . . , N, where

V_k⋆ ∼ β( 1 + r_k, α + Σ_{l=k+1}^N r_l ),   for k = 1, . . . , N − 1,

and (as before) r_k records the number of K_i values which equal k.

• (e) Conditional for α:

α | p ∼ Γ( N + η_1 − 1, η_2 − Σ_{k=1}^{N−1} log(1 − V_k⋆) ),

for the same values of V_k⋆ used in the simulation of p.

• (f) Conditional for θ:

θ | µ ∼ N_L(θ⋆, σ⋆),   where θ⋆ = Σ_{k=1}^N µ_k and σ⋆ = A.
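Before turning to the proofs, steps (d) and (e) can be transcribed directly; in the sketch below the labels are zero-indexed, the hyperparameter values are arbitrary, and Gamma(a, b) with rate b is drawn as numpy's gamma(a, 1/b).

```python
import numpy as np

rng = np.random.default_rng(5)

def update_p_and_alpha(K, N, alpha, eta1, eta2):
    """Steps (d) and (e) of the blocked Gibbs sweep: posterior stick-breaking
    weights given the labels K, then the concentration alpha given the sticks."""
    r = np.bincount(K, minlength=N)                       # r_k = #{i : K_i = k}
    suffix = np.cumsum(r[::-1])[::-1]                     # suffix[k] = sum_{l >= k} r_l
    tail = np.concatenate((suffix[1:], [0.0]))            # sum_{l > k} r_l
    V = rng.beta(1.0 + r[:-1], alpha + tail[:-1])         # V*_k ~ Beta(1 + r_k, alpha + ...)
    V = np.concatenate((V, [1.0]))                        # V_N = 1
    p = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    alpha_new = rng.gamma(N + eta1 - 1.0,
                          1.0 / (eta2 - np.log(1.0 - V[:-1]).sum()))
    return p, alpha_new

K = rng.integers(0, 5, size=200)                          # toy labels in {0, ..., 4}
p, a = update_p_and_alpha(K, N=10, alpha=1.0, eta1=2.0, eta2=4.0)
print(p.sum(), a)
```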
Proof. Let φ denote the relevant conditional densities (via characteristic functions where convenient). For every j ∈ {K_1⋆, . . . , K_m⋆}:

• (a) Conditional for µ:

φ_{µ_j | W, K, θ, f}(y) ∝ ( Π_{d: K_d⋆ = j} φ_{f̃_d | µ_j = y, W, K, θ} ) φ_{µ_j | W, K, θ}(y)
∝ e^{i y^T Σ_{d: K_d⋆ = j} f̃_d − (1/2) Σ_{d: K_d⋆ = j} f̃_d^T W_j f̃_d} e^{i y^T θ − (1/2) y^T σ_µ y}
= e^{i y^T (θ + Σ_{d: K_d⋆ = j} f̃_d) − (1/2) y^T σ_µ y − (1/2) Σ_{d: K_d⋆ = j} f̃_d^T W_j f̃_d};

hence

µ_j | W, K, θ, f ind∼ N_l( θ + Σ_{d: K_d⋆ = j} f̃_d, σ_µ ).

• (b) Conditional for W:

φ_{W_j | µ, K, f}(M) ∝ ( Π_{d: K_d⋆ = j} e^{−(1/2) (f̃_d − µ_j)^T M (f̃_d − µ_j)} ) det(M)^{(n−l−1)/2} e^{−(1/2) Tr(V^{-1} M)}
∝ det(M)^{(n−l−1)/2} e^{−(1/2) Tr( ( Σ_{d: K_d⋆ = j} (f̃_d − µ_j)(f̃_d − µ_j)^T + V^{-1} ) M )};

therefore

W_j | µ, K, f ind∼ W( n, ( Σ_{i: K_i = j} (f̃_i − µ_j)(f̃_i − µ_j)^T + V^{-1} )^{-1} ).

• (c) Conditional for K:

P{K_i = s | p, µ, W, f} ∝ P{f | p, W, K_i = s, µ} P{K_i = s | W, µ}
= Π_{d: K_d⋆ = s} p_s (2π)^{−l/2} det(W_s)^{−1/2} e^{−(1/2) (f̃_d − µ_s)^T W_s (f̃_d − µ_s)}.

Hence

p_{s,i} ∝ p_s ( (2π)^{l/2} det(W_s)^{1/2} )^{−n_s} exp( −(1/2) ⟨ Σ_{d: K_d⋆ = s} (f̃_d − µ_s)(f̃_d − µ_s)^T, W_s ⟩ ),

where n_s is the number of times K_s⋆ occurs in K.

• (f) Conditional for θ:

φ_{θ | µ = µ′}(θ) ∝ φ_{µ | θ}(µ′) φ_θ(θ) = ( Π_{j=1}^N e^{i θ^T µ′_j − (1/2) µ′_j{}^T σ_µ µ′_j} ) e^{−(1/2) θ^T A θ} ∝ e^{i θ^T Σ_{j=1}^N µ′_j − (1/2) θ^T A θ};

hence θ | µ ∼ N_L( Σ_{j=1}^N µ_j, A ).

5.6.3 Classes of volatility

Let (S_t) be the stock price process and suppose that X_t = log(S_t) satisfies

dX_t = b(t, X_t) dt + θ(t) h(X_t) dB_t,   (5.8)

where the function h(·) is assumed to be known, the volatility coefficient θ(·) is a random function of time that has to be estimated, and the drift coefficient b(t, x) is unknown. We observe a path of the process (X_t, t ∈ [0, T]) sampled at discrete times t_i = i△, for i = 1, . . . , N. Under some conditions and after a change of variable (see e.g. [5]), equation (5.8) reduces to

dX_t = b_1(t, X_t) dt + θ(t) dB_t.

A refined method to estimate θ(t) consists in using wavelets. Consider (V_j, j ∈ Z) an r-regular multiresolution analysis of L²(R) such that the associated scale function Φ and the wavelet function ψ are compactly supported. For all j, the family {Φ_{j,k}(t) = 2^{j/2} Φ(2^j t − k), k ∈ Z} is an orthogonal basis of V_j. Time being sampled with △ = 2^{−n}, the estimator is then

θ̂²(t) = Σ_k µ_{j(n),k} Φ_{j(n),k}(t)   (5.9)

for j(n) < n, where

µ_{j(n),k} = Σ_{i=1}^{N−1} Φ_{j(n),k}(t_i) (X_{t_{i+1}} − X_{t_i})².   (5.10)

Suppose that we have observed n trajectories X_1, . . . , X_l, . . . , X_n sampled as above, and that we want to classify them according to their volatility components, that is, we want to classify the θ_l estimated by (5.9). We then see that we just have to apply the preceding algorithm to the vectors µ^l_{j(n),k}, which are finite-dimensional representations of the θ_l.

5.7 Conclusion

We have extended Dirichlet hierarchical models in order to deal with temporal data such as solutions of SDEs with stochastic drift and volatility. It can be thought that the process on which these parameters are based belongs to a certain well-known class of processes, such as continuous-time Markov chains. Then, we think that a Dirichlet prior can be put on the path space, that is, a functional space. The estimation procedure in such a context is the topic of the next chapter.

Chapter 6

Markov regime switching with Dirichlet prior. Application to modelling stock prices

We have seen in a preceding chapter some examples of continuous-time Dirichlet processes with parameter proportional to the distribution of a continuous-time process, such as the Wiener measure. In the present chapter, motivated by some mathematical models in finance dealing with "regime-switching markets", we consider the case where the continuous-time process is a continuous-time Markov chain whose state at time t models the state of the market at time t. Indeed, while in Chapter 5 the volatility was constant during time intervals of random length, without any hypothesis on the switching process, here the switching depends on a Markov chain whose states represent the different regimes. Also, the various values of the trend and the volatility depend on the state of this chain, which "chooses" these values among some i.i.d. ones. Clearly, we deal with stochastic volatility.

In our approach, the regimes play the same role as the classes play in classification: each temporal observation belongs to a class, that is, to a regime. Our contribution consists in placing a Dirichlet process prior on the path space of the Markov chain, which is a cadlag function space. This idea is new as it has never been used in the literature.

In the first section, we present our model. Section 2 deals with the estimation procedure; the computations of the posteriors follow from those done in Chapter 5. In the last Section 3, we give some indications on the implementation of the algorithm in the C language and some numerical results are presented.

6.1 Markov regime switching with Dirichlet prior

In this section, we take ᾱ = H, the distribution of a continuous-time Markov chain on a finite set of states, and we propose a new hierarchical model that is specified, as an example, in the setting of mathematical finance. Of course, this can be similarly used in many other cases.
We consider the Black-Scholes SDE in random environment with a Dirichlet prior on the path space of the chain, the states of the chain representing the environment due to the market. We model the stock price using a geometric Brownian motion with drift and variance depending on the state of the market. The state of the market is modeled as a continuous-time Markov chain with a Dirichlet prior. In what follows, the notation σ will be used to denote the variance rather than the standard deviation. The following notations will be adopted:

1. n will denote the number of observed data and also the length of an observed path.
2. M will denote the number of states of the Markov chain.
3. The state space of the chain will be denoted by S = {i : 1 ≤ i ≤ M}.
4. N will denote the number of simulated paths.
5. m will denote the number of distinct states of a path.

– The stock price follows the SDE

dS_t / S_t = β(X_t) dt + √σ(X_t) dB_t,   t ≥ 0,

where B_t is a standard Brownian motion. By Ito's formula, the process Z_t = log(S_t) satisfies the SDE

dZ_t = µ(X_t) dt + √σ(X_t) dB_t,   t ≥ 0,

where µ(X_t) = β(X_t) − (1/2) σ(X_t). The observed data are of the form Z_0, Z_1, . . . , Z_n.

– The process (X_t) is assumed to be a continuous-time Markov process taking values in the set S = {i : 1 ≤ i ≤ M}. The transition probabilities of this chain are denoted by p_{ij}, i, j ∈ S, and the transition rate matrix is Q_0 = (q_{ij})_{i,j∈S} with λ_i > 0, q_{ij} = λ_i p_{ij} if i ≠ j, and

q_{ii} = − Σ_{j≠i} q_{ij},   i, j ∈ S.

Then, conditional on the path {X_s, 0 ≤ s ≤ n}, the increments Y_t = Z_t − Z_{t−1} = log(S_t/S_{t−1}) are independent N(µ_{X_t}, σ_{X_t}), t = 1, 2, . . . , n.

– For each i = 1, 2, . . . , M, the priors on µ_i = µ(i) and σ_i = σ(i) are specified by

µ_i ind∼ N(θ, τ_µ), with θ ∼ N(0, A), A > 0,   (6.1)
σ_i ind∼ Γ(ν_1, ν_2).   (6.2)

– The Markov chain {X_t, t ≥ 0} has prior D(αH), where H is a probability measure on the path space of cadlag functions D([0, ∞), S). The initial distribution according to H is the uniform distribution π_0 = (1/M, . . . , 1/M), and the transition rate matrix is Q with p_{ij} = 1/(M − 1) and λ_i = λ > 0. Thus the Markov chain under Q will spend an exponential time with mean 1/λ in any state i and then jump to a state j ≠ i with probability 1/(M − 1). A realization of the Markov chain from the above prior is generated as follows (see the sketch below). Generate a large number of paths X_i = {x_s^i : 0 ≤ s ≤ n}, i = 1, 2, . . . , N, from H. Generate the vector of probabilities (p_i, i = 1, . . . , N) from a Poisson-Dirichlet distribution with parameter α, using stick-breaking. Then draw a realization of the Markov chain from

p = Σ_{i=1}^N p_i δ_{X_i},   (6.3)

which is a probability measure on the path space D([0, n), S). The parameter λ is chosen to be small so that the variance is large and hence we obtain a large variety of paths to sample from at a later stage. The prior for α is given by

α ∼ Γ(η_1, η_2).   (6.4)
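A sketch of this prior draw, under the stated H (uniform initial state, rate λ, uniform jumps to the other states); the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)

def sample_chain_path(M, lam, horizon):
    """One path from the base measure H: uniform initial state, exponential
    holding times with mean 1/lam, uniform jumps among the other states.
    The path is recorded at the integer times 0, 1, ..., horizon."""
    state = int(rng.integers(M))
    next_jump = rng.exponential(1.0 / lam)
    path = []
    for s in range(horizon + 1):
        while next_jump <= s:
            state = (state + 1 + int(rng.integers(M - 1))) % M  # uniform over j != state
            next_jump += rng.exponential(1.0 / lam)
        path.append(state)
    return np.array(path)

def dirichlet_prior_on_paths(M, lam, horizon, N, alpha):
    """Prior draw (6.3): N paths from H and stick-breaking weights."""
    paths = np.stack([sample_chain_path(M, lam, horizon) for _ in range(N)])
    V = rng.beta(1.0, alpha, size=N); V[-1] = 1.0
    p = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return paths, p

paths, p = dirichlet_prior_on_paths(M=10, lam=0.2, horizon=250, N=100, alpha=2.0)
X = paths[rng.choice(len(p), p=p)]       # one realization of the Markov chain
print(X[:20])
```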
6.2 Estimation

Estimation is done by simulating a large number of paths of the Markov chain, which are selected according to a probability vector (generated by stick-breaking), and then using the blocked Gibbs sampling technique. This technique uses the posterior distributions of the various parameters. To carry out this procedure we need to compute the following conditional distributions. We denote by µ and σ the current values of the vectors (µ_1, µ_2, . . . , µ_n) and (σ_1, σ_2, . . . , σ_n), respectively. Let Y be the vector of observed data (Y_1, . . . , Y_n). Let X = (x_1, x_2, . . . , x_n) be the vector of current values of the states of the Markov chain at times t = 1, 2, . . . , n, respectively, and let X∗ = (x_1∗, . . . , x_m∗) be the distinct values in X.

– Conditional for µ. For each j ∈ X∗, draw

µ_j | σ, X, θ, Y ind∼ N(µ_j∗, σ_j∗),   (6.5)

where

µ_j∗ = σ_j∗ ( Σ_{t: x_t = j} Y_t / σ_j + θ / τ_µ ),   σ_j∗ = ( n_j / σ_j + 1 / τ_µ )^{-1},

n_j being the number of times j occurs in X. For each j ∈ S \ X∗, independently simulate µ_j ∼ N(θ, τ_µ).

– Conditional for σ. For each j ∈ X∗, draw

σ_j | µ, X, Y ind∼ Γ( ν_1 + n_j/2, ν_{2,j}∗ ),   (6.6)

where

ν_{2,j}∗ = ν_2 + Σ_{t: x_t = j} (Y_t − µ_j)² / 2.

Also, for each j ∈ S \ X∗, independently simulate σ_j ∼ Γ(ν_1, ν_2).

– Conditional for X.

X | p, µ, σ, Y ∼ Σ_{i=1}^N p_i∗ δ_{X_i},   (6.7)

where

p_i∗ ∝ p_i Π_{g=1}^m Π_{d: x_d^i = g} (2πσ_g)^{−1/2} e^{−(Y_d − µ_g)²/(2σ_g)},   (6.8)

(x_1^{i,∗}, . . . , x_m^{i,∗}) denoting the current m = m(i) unique values of X_i, i = 1, . . . , N.

– Conditional for p.

p_1 = V_1∗, and p_k = (1 − V_1∗) · · · (1 − V_{k−1}∗) V_k∗, k = 2, 3, . . . , N − 1,   (6.9)

where

V_k∗ ind∼ β( 1 + r_k, α + Σ_{ℓ=k+1}^N r_ℓ ),

r_k being the number of path labels which equal k.

– Conditional for α.

α | p ∼ Γ( N + η_1 − 1, η_2 − Σ_{i=1}^{N−1} log(1 − V_i∗) ),

where the V∗ values are those obtained in the simulation of p in the above step.

– Conditional for θ.

θ | µ ∼ N(θ∗, τ∗),   (6.10)

where

θ∗ = (τ∗/τ_µ) Σ_{j=1}^M µ_j   and   τ∗ = ( M/τ_µ + 1/A )^{-1}.

Proof. (a) The computations of the posterior distributions for µ, σ and θ follow in the same manner as in Ishwaran and James (2002) and Ishwaran and Zarepour (2000); here X_t = s means that the class variable is equal to s.

(b) Conditional for X:

P{X = X_i | p, µ, σ, Y} ∝ P{Y | p, σ, X = X_i, µ} P{X = X_i | σ, µ, p}
= ( Π_{g=1}^m Π_{d: x_d^i = g} (2πσ_g)^{−1/2} e^{−(Y_d − µ_g)²/(2σ_g)} ) p_i.

Hence

p_i∗ ∝ p_i Π_{g=1}^m Π_{d: x_d^i = g} (2πσ_g)^{−1/2} e^{−(Y_d − µ_g)²/(2σ_g)},

where X_i = (x_1^i, . . . , x_n^i) and (x_1^{i,∗}, . . . , x_m^{i,∗}) denote the current m unique values in the path X_i.

(c) Conditional for p: the Sethuraman stick-breaking scheme can be extended to two-parameter Beta distributions; see Walker and Muliere (1997, 1998). Let V_k ind∼ β(a_k, b_k) for each k = 1, . . . , N, and let p_1 = V_1 and p_k = (1 − V_1) · · · (1 − V_{k−1}) V_k, k = 2, 3, . . . , N − 1. We write this random vector, in short, as p ∼ SB(a_1, b_1, . . . , a_{N−1}, b_{N−1}). By Connor and Mosimann (1969), the density of p is

( Π_{k=1}^{N−1} Γ(a_k + b_k) / (Γ(a_k)Γ(b_k)) ) p_1^{a_1−1} · · · p_{N−1}^{a_{N−1}−1} p_N^{b_{N−1}−1} × (1 − P_1)^{b_1−(a_2+b_2)} · · · (1 − P_{N−2})^{b_{N−2}−(a_{N−1}+b_{N−1})},

where P_k = p_1 + . . . + p_k. From this it easily follows that the distribution is conjugate for multinomial sampling, and consequently the posterior distribution of p given X, when a_k = 1 and b_k = α for each k, is SB(a_1∗, b_1∗, . . . , a_{N−1}∗, b_{N−1}∗), where

a_k∗ = 1 + r_k,   b_k∗ = α + Σ_{ℓ=k+1}^N r_ℓ,

and r_k is the number of path labels which equal k, k = 1, . . . , N − 1.
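For concreteness, here is a sketch of the conjugate draws (6.5)-(6.6) and of the path weight (6.8) in log form. Reading the Γ(ν_1 + n_j/2, ν_{2,j}∗) draw as a draw of the precision 1/σ_j, and inverting it, is our interpretation, since σ denotes a variance in this chapter; hyperparameter values are arbitrary, and Gamma(a, b) with rate b uses scale 1/b.

```python
import numpy as np

rng = np.random.default_rng(7)

def gibbs_mu_sigma(Y, X, sigma, theta, tau_mu, nu1, nu2, M):
    """Conjugate draws (6.5)-(6.6); empty states are drawn from their priors."""
    mu_new = rng.normal(theta, np.sqrt(tau_mu), size=M)
    sigma_new = 1.0 / rng.gamma(nu1, 1.0 / nu2, size=M)
    for j in np.unique(X):
        idx = (X == j)
        n_j = idx.sum()
        s_star = 1.0 / (n_j / sigma[j] + 1.0 / tau_mu)          # (6.5)
        m_star = s_star * (Y[idx].sum() / sigma[j] + theta / tau_mu)
        mu_new[j] = rng.normal(m_star, np.sqrt(s_star))
        nu2_star = nu2 + 0.5 * ((Y[idx] - mu_new[j]) ** 2).sum()  # (6.6)
        prec = rng.gamma(nu1 + 0.5 * n_j, 1.0 / nu2_star)       # read as 1/sigma_j
        sigma_new[j] = 1.0 / prec
    return mu_new, sigma_new

def log_path_weight(Y, path, mu, sigma, log_p):
    """Unnormalized log weight of one candidate path, per (6.8)."""
    return log_p - 0.5 * np.sum(np.log(2 * np.pi * sigma[path])
                                + (Y - mu[path]) ** 2 / sigma[path])

# Toy usage with hypothetical current values:
Y = rng.normal(0.0, 0.1, size=50); X = rng.integers(0, 4, size=50)
mu, sig = gibbs_mu_sigma(Y, X, sigma=np.full(4, 0.01), theta=0.0,
                         tau_mu=1.0, nu1=2.0, nu2=4.0, M=4)
print(mu, sig, log_path_weight(Y, X, mu, sig, log_p=np.log(0.01)))
```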
6.3 Implementation

The algorithm presented in the previous section was implemented in the C language. The implementation includes:

– functions that simulate standard probability distributions: uniform, normal, Gamma, Beta, exponential;
– a function that returns an index in {1, . . . , n} according to a vector of probabilities p_1, . . . , p_n;
– a function that simulates a probability vector according to the stick-breaking scheme;
– a function that simulates N paths of a Markov chain;
– a function that records the number of times a state appears in a path;
– a function that chooses one of the paths according to a vector of probabilities;
– a function that updates the parameters of the prior distributions according to the formulas of the posterior distributions.

After having simulated a number of paths, we perform the iterations. At each iteration a path is randomly selected and the parameters are updated according to the posterior formulas. At the end of each iteration of the Gibbs sampling, we obtain a path X of the Markov chain. From this, the parameters π and Q_0 can be re-estimated, and from Q_0 the parameters λ_i and p_{ij} can be derived.

6.3.1 Simulated data

We fit the model, using the algorithm developed above, to a simulated series of length n = 480, with M = 4 states (regimes), the mean and variance in each state being chosen as follows:

(µ_1, σ_1) = (−1.15, 0.450), (µ_2, σ_2) = (−0.93, 0.450), (µ_3, σ_3) = (−0.60, 0.440), (µ_4, σ_4) = (1.40, 0.500).

We have run our algorithm on that series with number of states M = 10, number of paths N = 100 and 25,000 iterations. We have observed that the algorithm is able to put most of the mass (in terms of the stationary distribution of the Markov chain) on 4 regimes, which are close to the ones chosen above. At the end of the iterations we compute a confidence interval for the mean and for the variance of each regime. We can conclude that the algorithm is able to identify the parameters of the simulated data set. The confidence intervals for the mean (I_m) and the variance (I_v) are given below.

Regime 1: I_m = [−1.208, −1.12423] and I_v = [0.431, 0.4738].
Regime 2: I_m = [−0.9351, −0.9296] and I_v = [0.442, 0.4538].
Regime 3: I_m = [−0.63446, −0.5140] and I_v = [0.4319, 0.4491].
Regime 4: I_m = [1.30114, 1.43446] and I_v = [0.4949, 0.5081].

6.3.2 Real data

We have also applied our algorithm to the Bsemidcap index data of the Indian National Stock Exchange (NSE) from 21/12/2006 to 15/11/2007 (www.nseindia.com). For this dataset we have n = 250 and ∆t = 1, we deal with N = 100 paths, and Γ(2, 4) is the prior for α. With the above choice we obtain 6 regimes, for which the estimates of the mean, the variance and the stationary probabilities are as follows:

Regime   µ           σ            π
R1       0.001124    2.9132e-05   20%
R2      -0.009479    7.2166e-05    3%
R3       0.000629    2.3023e-05   29%
R4      -0.004579    7.3800e-05    5%
R5       0.000829    1.186e-05    10%
R6       0.001109    3.3372e-05   33%

The most frequent Markov chain path is the state sequence

35363636165136353366563611416133666313336333 45666646111666661333161335633165413646335636 23613361665511535336165616631631162366633266 61336631366166116153513534133531366613565336 36135666516331166636136366666363664636116461 3 4 3 6.

Its parameters λ_i are

λ_1 = 0.8, λ_2 = 1, λ_3 = 0.7, λ_4 = 1, λ_5 = 0.95, λ_6 = 0.75,

and the matrix of transition probabilities (p_{ij})_{1≤i≠j≤6}, row i listing (p_{ij}, j ≠ i), is

Row 1: 0, 0.48, 0.03, 0.06, 0.42
Row 2: 0, 0.66, 0, 0, 0.33
Row 3: 0.16, 0.02, 0.062, 0.2, 0.54
Row 4: 0.375, 0, 0, 0.125, 0.5
Row 5: 0.157, 0, 0.42, 0.052, 0.36
Row 6: 0.36, 0.038, 0.384, 0.077, 0.134

It is interesting to note that in the high-volatility states the index has a negative drift, as is usually observed in the analysis of empirical data. A by-product of our algorithm is the distribution of the current state of the volatility, which is required to compute the price of an option (see [12] and references therein).
Chapter 7

Conclusion and Perspectives

Our main subject of interest was to investigate Dirichlet processes whose parameter is proportional to the distribution of a stochastic process (Brownian motion, jump processes, ...) and to propose continuous-time hierarchical models involving continuous-time Dirichlet processes. Although this area requires some rather nontrivial techniques, we have shown that such a setting can be of interest in modelling SDEs in random environment and that the proposed estimation procedure works.

Let us finally mention some perspectives. It is clear that it would be interesting to extend the method to other SDEs and to other kinds of processes; we think of replacing, in the last chapter, the Markov chain by a diffusion, a spatio-temporal process or a multivariate process. It would also be of interest to use the estimated model for prediction and to compare this prediction with other models. Concerning the algorithm of the last chapter, it can be observed that at each iteration an option price with respect to the selected path can be computed, using for example the formula in Ghosh and Deshpande [12]. After performing all the iterations, we will have a distribution of option prices that can be used for decision-making on the final option price. This should be compared to other decision procedures.

Bibliography

[1] Antoniak, C.E. (1974). Mixtures of Dirichlet processes. Ann. Statist. 2, 6, 1152-1174.

[2] Bertoin, J. (2006). Random Fragmentation and Coagulation Processes.

[3] Blackwell, D. and MacQueen, J.B. (1973). Ferguson distributions via Polya urn schemes. Ann. Statist. 1, 2, 353-355.

[4] Blei, D. and Jordan, M.I. (2005). Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1, 121-144.

[5] Blei, D., Ng, A. and Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

[6] Bertrand, P. (1996). Estimation of the stochastic volatility of a diffusion process I. Comparison of Haar basis estimator and kernel estimators. INRIA.

[7] Brunner, L.J. and Lo, A.Y. (2002). Bayesian classification. To appear.

[8] Cifarelli, D.M. and Melilli, E. (2000). Some new results for Dirichlet priors. Ann. Statist. 28, 1390-1413.

[9] Cifarelli, D. and Regazzini, E. (1978). Problemi statistici non parametrici in condizioni di scambiabilità parziale e impiego di medie associative. Tech. rep., Quaderni Istituto Matematica Finanziaria dell'Università di Torino.

[10] Cifarelli, D.M. and Regazzini, E. (1990). Distribution functions of means of a Dirichlet process. Ann. Statist. 18, 429-442.

[11] Dahl, D.B. (2003). Modeling differential gene expression using a Dirichlet process mixture model. In Proceedings of the American Statistical Association, Bayesian Statistical Sciences Section. American Statistical Association, Alexandria, VA.

[12] Deshpande, A. and Ghosh, M.K. (2007). Risk minimizing option pricing in a regime switching market. Stochastic Analysis and Applications, Vol. 28, 2008. To appear.

[13] Donnet, S. and Samson, A. (2005). Parametric estimation for diffusion processes from discrete-time and noisy observations. Journal of the American Statistical Association, 92, 894-902.

[14] Doss, H. and Sellke, T. (1982). The tails of probabilities chosen from a Dirichlet prior. Ann. Statist. 10, 1302-1305.
[15] Emilion, R. (2001). Classification and mixtures of processes. SFC 2001 and C.R. Acad. Sci. Paris, Série I 335, 189-193.

[16] Emilion, R. and Pasquignon, D. (2005). Random distributions in image analysis. Preprint.

[17] Emilion, R. (2005). Process of random distributions. Afrika Stat, Vol. 1, 1, pp. 27-46, http://www.ufrsat.org/jas.

[18] Ferguson, T.S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1, 209-230.

[19] Ferguson, T.S. (1974). Prior distributions on spaces of probability measures. Ann. Statist. 2, 615-629.

[20] Ferguson, T.S. (1983). Bayesian density estimation by mixtures of normal distributions. In Recent Advances in Statistics (H. Rizvi and J. Rustagi, eds.). New York: Academic Press.

[21] Daumé III, H. and Marcu, D. (2005). A Bayesian model for supervised clustering with the Dirichlet process prior. Journal of Machine Learning Research 6, 1551-1577.

[22] Huillet, T. and Paroissin, C. (2005). A Bayesian model for supervised clustering with the Dirichlet process prior. Preprint.

[23] Dey, D., Müller, P. and Sinha, D. (eds.) (1998). Computational methods for mixture of Dirichlet process models. Practical Nonparametric and Semiparametric Bayesian Statistics, 23-44. Springer.

[24] Ishwaran, H. and James, L.F. (2002). Approximate Dirichlet processes computing in finite normal mixtures: smoothing and prior information. J. Comp. Graph. Stat. 11, 209-230.

[25] Ishwaran, H., James, L.F. and Sun, J. (2000). Bayesian model selection in finite mixtures by marginal density decompositions. Conditionally accepted by Journal of the American Statistical Association.

[26] Ishwaran, H. and Zarepour, M. (2000). Markov chain Monte Carlo in approximate Dirichlet and Beta two-parameter process hierarchical models. Biometrika 87, 371-390.

[27] Duan, J.A., Guindani, M. and Gelfand, A.E. (2007). Generalized spatial Dirichlet process models. Biometrika, 94, 4, 809-825.

[28] Kottas, A., Duan, J.A. and Gelfand, A.E. (2007). Modeling disease incidence data with spatial and spatio-temporal Dirichlet process mixtures. Biometrical Journal, 5, 114. DOI: 10.1002/bimj.200610375.

[29] Kingman, J.F.C. and James, L.F. (2002). Approximate Dirichlet processes computing in finite normal.

[30] Kingman, J.F.C. (1993). Poisson Processes. Clarendon, Oxford University Press.

[31] Kingman, J.F.C. (1975). Random discrete distributions. J. Roy. Statist. Soc. B, 37, 1-22.

[32] Neal, R.M. (2003). Density modeling and clustering using Dirichlet diffusion trees. Bayesian Statistics 7, 619-629.

[33] Roeder, K. and Wasserman, L. (1997). Practical Bayesian density estimation using mixtures of normals. Journal of the American Statistical Association, 92, 894-902.

[34] Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica 4, 639-650.

[35] Escobar, M.D. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90, 577-588.

[36] Sethuraman, J. and Tiwari, R.C. (1982). Convergence of Dirichlet measures and the interpretation of their parameters. Statistical Decision Theory and Related Topics III, 2, 305-315.

[37] Soule, A., Salamatian, K. and Emilion, R. (2004). Classification of histograms. Sigmetrics 2004, New York. http://www-rp.lip6.fr/site_npa/site_rp/publications.php.
[38] Soule, A., Salamatian, K., Taft, N., Emilion, R. and Papagiannaki, K. (2004). Classification of Internet flows. Sigmetrics '04, New York.

[39] Roeder, K. and Wasserman, L. (1997). Practical Bayesian density estimation using mixtures of normals. Journal of the American Statistical Association, 92, 894-902.

[40] McCloskey, J.W. (1965). Model for the distribution of individuals by species in an environment. Unpublished Ph.D. thesis, Michigan State University.

[41] Muliere, P. and Tardella, L. (1998). Approximating distributions of random functionals of Ferguson-Dirichlet priors. Canadian Journal of Statistics, 26, 283-297.

[42] Johnson, N. and Kotz, S. (1978). Urn Models and Their Applications: An Approach to Modern Discrete Probability Theory. Technometrics, Vol. 20, No. 4, Part 1, p. 501.

[43] Damien, P., Wakefield, J.C. and Walker, S.G. (1999). Gibbs sampling for Bayesian nonconjugate and hierarchical models using auxiliary variables. Journal of the Royal Statistical Society Series B, 61, 331-344.

[44] Pitman, J. and Yor, M. (1996). Random discrete distributions derived from self-similar random sets. EJP 1, 1-28.

[45] Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from stable subordinators. Ann. Probab. 25, 2, 855-900.

[46] Pitman, J. (2003). Poisson-Kingman partitions. In Science and Statistics: A Festschrift for Terry Speed (D.R. Goldstein, ed.). Lecture Notes - Monograph Series 30, 1-34. Institute of Mathematical Statistics, Hayward, California.