Bayesian Nonparametrics: Foundations and Applications
Notes for Stat 718 (Spring 2008)
Jayaram Sethuraman
E-mail: sethuram@math.sc.edu
Modified on April 22, 2008

January 15, 2008

1 Introduction to Bayes Inference

A statistician wishes to make inferences about the unknown state of nature. Suppose that this unknown state of nature is summarized by a real variable $\theta$. We try to observe some data, represented as $X$, which again for simplicity will be assumed to be a real number or a finite dimensional vector of real numbers. Its distribution will depend on $\theta$. This fact allows us to infer something about $\theta$ from the data $X$.

The Bayesian begins by saying that he has some information about $\theta$ even before he has collected data. This is assumed to be summarized by a probability distribution (the prior distribution) for $\theta$ defined by a pdf $\pi(\theta)$. The distribution of the data given that the state of nature is $\theta$ is summarized by a distribution with pdf $p(x|\theta)$. Let $Q$ denote the joint distribution of $(X, \theta)$. The pdf of this distribution is $p(x|\theta)\pi(\theta)$. The pdf of the conditional distribution of $\theta$ given the data $X$ (the posterior distribution) can be given as
\[
\pi(\theta|x) = \frac{p(x|\theta)\pi(\theta)}{\int p(x|\theta')\pi(\theta')\,d\theta'} \propto p(x|\theta)\pi(\theta).
\]

Example 1.1. Let the random variables $X_1, X_2, \ldots, X_n$ be i.i.d. $N(\theta, \sigma^2)$ where $\sigma^2$ is known. Let $X$ stand for $(X_1, \ldots, X_n)$. Then
\[
p(x|\theta) \propto e^{-\frac{n(\bar{x}-\theta)^2}{2\sigma^2}} A(x)
\]
where $\bar{x} = \frac{1}{n}\sum_{1}^{n} x_i$ and $A(x)$ depends only on $x$. Suppose that the prior information on $\theta$ is summarized by the normal distribution with pdf $\pi(\theta) \sim N(\mu, \tau^2)$. Then
\begin{align*}
\pi(\theta|x) &\propto \exp\Big(-\frac{n(\bar{x}-\theta)^2}{2\sigma^2} - \frac{(\theta-\mu)^2}{2\tau^2}\Big) \\
&\propto \exp\Big(-\frac{1}{2}\Big(\frac{n}{\sigma^2}+\frac{1}{\tau^2}\Big)\theta^2 + \theta\Big(\frac{n\bar{x}}{\sigma^2}+\frac{\mu}{\tau^2}\Big)\Big) \\
&\propto \exp\Big(-\frac{1}{2}\Big(\frac{n}{\sigma^2}+\frac{1}{\tau^2}\Big)\Big(\theta - \frac{\frac{n\bar{x}}{\sigma^2}+\frac{\mu}{\tau^2}}{\frac{n}{\sigma^2}+\frac{1}{\tau^2}}\Big)^2\Big) \\
&\sim N\Big(\frac{\frac{n\bar{x}}{\sigma^2}+\frac{\mu}{\tau^2}}{\frac{n}{\sigma^2}+\frac{1}{\tau^2}},\ \frac{1}{\frac{n}{\sigma^2}+\frac{1}{\tau^2}}\Big).
\end{align*}
Under a squared error loss, the optimal estimate of $\theta$ is given by $\hat{\theta} = E(\theta|x)$, which is the expected value of $\theta$ under the posterior distribution $\pi(\theta|x)$. This Bayes estimate is equal to
\[
\hat{\theta} = \frac{\frac{n\bar{x}}{\sigma^2}+\frac{\mu}{\tau^2}}{\frac{n}{\sigma^2}+\frac{1}{\tau^2}}.
\]
This is usually interpreted by saying that the Bayes estimate is a convex combination of the prior mean and the sample mean, with weights proportional to the prior precision ($\frac{1}{\tau^2}$) and the sample precision ($\frac{n}{\sigma^2}$). It is interesting to note that if $\tau \to \infty$, this estimate tends to $\bar{x}$, the sample mean. Thus $\tau \to \infty$ is interpreted as no prior information about $\theta$, and it produces the usual frequentist estimate. This can also be done formally by using the uniform distribution on $(-\infty, +\infty)$ for $\theta$. It should be emphasized that this is not a rigorous Bayesian way of doing things.

Example 1.2. Let $N = (N_1, N_2, \ldots, N_k)$ be a multivariate random vector satisfying $N_i \ge 0,\ i = 1, \ldots, k$ and $\sum_{i=1}^k N_i = n$, whose distribution depends on a parameter $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$ satisfying $\theta_i \ge 0,\ \sum_{i=1}^k \theta_i = 1$. Suppose that this distribution is given by the pmf
\[
p(N|\theta) = \frac{n!}{N_1! \cdots N_k!} \prod_{i=1}^{k} \theta_i^{N_i}.
\]
A popular prior distribution for $\theta$ is given by the pdf of the $k-1$ variables $\theta_1, \ldots, \theta_{k-1}$:
\[
\pi(\theta) \propto \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} = \frac{\Gamma(\alpha_1 + \cdots + \alpha_k)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_k)} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}
\]
with $\theta_k = 1 - \theta_1 - \cdots - \theta_{k-1}$ and where $\alpha_1 > 0, \ldots, \alpha_k > 0$ satisfy $\alpha_1 + \cdots + \alpha_k > 0$. This distribution is called the finite dimensional Dirichlet distribution $D(\alpha_1, \ldots, \alpha_k) = D_\alpha$ where $\alpha = (\alpha_1, \ldots, \alpha_k)$. An alternate description can be given as $\theta \sim D_\alpha$ if $\theta_1 = \frac{Z_1}{Z}, \ldots, \theta_k = \frac{Z_k}{Z}$ where $Z = Z_1 + \cdots + Z_k$ and $Z_1, \ldots, Z_k$ are independent Gamma r.v.'s with parameters $(\lambda, \alpha_1), \ldots, (\lambda, \alpha_k)$, respectively. A r. v.
Z is said to have the Gamma distribution Γ(λ, α) if its pdf is p(z) = λα −λz α−1 e z Γ(α) z≥0 if α > 0 and is δ0 , the degenerate distribution at 0 if α = 0. Notice that π(θ|N) ∝ k Y i=1 θiNi +αi −1 ∼ D(α1 + N1 , . . . , αk + Nk ) = Dα+N and hence the posterior distribution of θ is Dα+N . We will examine this in some more detail below. Lemma 1.1. Let U, V, A be independent random variables with U ∼ Dα, V ∼ P P P P Dβ and A ∼ B( ki=1 αi , ki=1 βi ) (i.e. (A, 1 − A) ∼ D( ki=1 αi , ki=1 βi )). Let W = AU + (1 − A)V. Then W ∼ D(α1 + β1 , . . . αk + βk ) = Dα+β . Z∗ Z∗ Proof: Let U = ( ZZ1 , . . . ZZk ), V = ( Z1∗ , . . . Zk∗ ), A = Z , Z+Z ∗ where Z = Z1 + · · · + Zk , Z ∗ = Z1∗ + · · · + Zk∗ and Z1 , . . . , Zk , Z1∗ , . . . , Zk∗ are independent random Gamma variables with parameters α1 , . . . , αk , β1 , . . . , βk , respectively. By Basu’s theorem U is independent of Z, V is independent of Z ∗ . Also P P U ∼ Dα, V ∼ Dβ and A ∼ B( αi , βi ). Thus the random variables (U, V, A) defined above have the same joint distribution as (U, V, A) in Lemma 1.1. 4 Z +Z ∗ 1 1 Now, W = AU + (1 − A)V = ( Z+Z ∗ ,..., Zk +Zk∗ ). Z+Z ∗ This shows that the distribution of W is Dα+β . Remark 1.1. The conditional probability of A given B denoted by P (A|B) is better viewed as a function of B, with A fixed. Thus we also have P (A|B c ), P (A|φ) and P (A|Ω), in fact we have P (A|C) for C ∈ σ(B) = {φ, B, B c , Ω}. It will have the property P (A ∩ C) = P (A|C)P (C) for all C ∈ σ(B). Similarly the conditional expectation of X given Y denoted by E(X) = E(E(X|Y )) should be viewed as function of Y satisfying the condition Z E(X|Y )I(Y ∈ C)dQ = Z XI(Y ∈ C)dQ for all Borel sets C, where Q is the joint distribution of (X, Y ). As a special case the conditional probability of X ∈ A given Y is a function Q(X ∈ A|Y ) of Y satisfying Z Z Q(X ∈ A|Y )I(Y ∈ C)dQ = I(X ∈ A)I(Y ∈ C)dQ for all Borel sets C. When (X, Y ) possesses a joint pdf p(x, y) under Q and λ(y) is the pdf (w.r.t ν) of Y , the conditional probability Q(X ∈ A|y) can be written as R X∈A or just that p(x,y)dν λ(y) p(x, y)dν λ(y) is the pdf of the conditional distribution of X given Y = y. Back to Example 1.2: The random vector N = (N1 , . . . , Nk ) at the beginning of this example can also be viewed in a different way. Let X be a random variable taking values in {1. . . . , k} with a probability distribution, p(·|θ), with a pmf given by Q(X = i|θ) = p(i|θ) = θi , i = 1, . . . , k. 5 Let X1 , . . . , Xn be i.i.d. with distribution p(·|θ). Then the joint distribution of (X1 , . . . , Xn ) has a pmf given by k Y θiNi 1 where Ni = Pk 1 I(Xj = i), i = 1, . . . , k. Recall that we assumed a finite dimensional Dirichlet prior for θ, namely θ ∼ Dα . The joint dist of ((X1 , . . . , Xn ), θ) is proportional to k Y θiNi 1 k Y θiαi 1 and the conditional distribution of θ given (X1 , . . . , Xn ) has pdf proportional to k Y θiNi +αi 1 which corresponds to the finite dimensional Dirichlet distribution Dα+N . Define e1 = (1, 0, . . . 0), e2 = (0, 1, . . . 0), . . . ek = (0, 0, . . . 1). Assume that n = 1. Then X1 can take k possible values and N = (N1 , . . . , Nk ) can correspondingly take the values e1 , . . . , ek . In fact, N = eX1 . From now on write X for X1 . The posterior distribution of θ given X is Dα+eX . From Remark 1.1 we 6 have Dα(B) = Q(θ ∈ B) k X = Q(θ ∈ B|X = i)Q(X = i) 1 = = k X 1 k X 1 Dα+ei (B)Q(X = i) Dα+ei (B) αi . Pk 1 αj Thus we have proved the following Lemma. Lemma 1.2. Dα = 2 k X 1 αi Dα+ei Pk . 1 αj The nonparametric problem Let X1 , X2 . . . 
Xn be i.i.d with distribution F . One can consider several nonparametric hypotheses concerning the df F . • The parameter F is completely unspecified. • F (x) = G(x − θ) for all x, θ is unknown and G is an unspecified symmetric df. • F has IFR; .. 7 January 17, 2008 We will now give some details on the use of Basu’s theorem to establish the the independence of U = (U1 , ...Uk ) and Z in Lemma 1.1. Note that U1 = Z1 , ...Uk Z = Zk Z where Z = Z1 + ...Zk where Z1 . . . Zk are independent Gamma(λ, α1), . . . , Gamma(λ, αk ). Then Z is a complete sufficient statistic for λ and the distribution of (U1 , . . . , Uk ) is free of λ. Basu’s theorem says that any λ-free statistic is independent of a complete sufficient statistic. This establishes the independence of (U1 , . . . , Uk ) and Z. It is easy to extend Lemma 1.1 to more than two component random vectors as follows. Lemma 2.1. Let (A1 , A2 , A3 ) ∼ D(Pk i=1 αi , Pk i=1 βi , Pk i=1 γi ) , U ∼ Dα , V ∼ Dβ , W ∼ Dγ . Note that A1 +A2 +A3 = 1. Assume that U, V, W and (A1 , A2 , A3 ) are independent. Then A1 U + A2 V + A3 W ∼ Dα+β+γ . The next Lemma is a consequence of Lemma 1.1. Lemma 2.2. Recall ei = (0, . . . , 1, . . . , 0), where the ith co-ordinate is 1 and the rest are 0. Let U ∼ Dei = δei , that is U = ei with probability 1. P Let V ∼ Dβ and let A ∼ B(1, ki=1 βi ). Furthermore, let U, V and A be independent. (We really need only that V and A are independent since U is degenerate.) Then AU + (1 − A)V ∼ Dβ+ei , i.e. Aδei + (1 − A)V ∼ Dβ+ei . 8 Here is another proof of the posterior distribution in Example 1.2 using moments to identify the posterior distribution. The pdf p(u) of the distribution of U ∼ D(α1 ,...αk ) is given by p(u) = Γ(α1 + · · · αk ) α1 −1 u . . . ukαk −1 . Γ(α1 ) . . . Γ(αk ) 1 Where uk = 1 − u1 − . . . uk−1 and α1 > 0, . . . αk > 0 Let U have the Dirichlet distribution Dα. Then the moments are given by E(U1r1 U2r2 ...Ukrk ) = Qk [r ] αi i (α1 + · · · + αk )[r1 +···+rk ] 1 (2.1) where ri ≥ 0 for i = 1 . . . k. This can checked from the pdf of U, ignoring coordinates corresponding to {i : αi = 0}. Note that we have used the notation for ascending factorials: a[r] = a(a + 1)(a + r − 1) if a ≥ 0 and 0[0] = 1,0[r] = 0 if r > 0. Conversely, the distribution of a random variable U on the simplex in Rk whose moments are given by (2.1) is uniquely determined (since U is bounded) and is the Dirichlet distribution Dα. Thus this can be considered as the third definition of a Dirichlet distribution. Let X be a discrete r.v. taking values in {1, . . . , k}, with a pmf given by Q(X = i|p) = p(i), i = 1, . . . k where p = (p(1), ...p(k)). Let p have (prior) distribution Dα . Then Z Z1 Zk Q(p ∈ B) = dDα = Q(( , ... ∈ B)). Z Z B We will now find the posterior distribution Q(p ∈ B|X1 = i) by the moment-characterization of the Dirichlet distribution. This will be another way to derive this posterior distribution. 9 The marginal distribution of X1 is given by αi Q(X1 = i) = E(Q(X1 = i|p)) = E(p(i)) = Pk 1 αj by using the moments of a Dirichlet distribution. Assume that αi > 0. Then Q(X1 = i) > 0. Thus Q(p ∈ B, X1 = i) Q(X1 = i) P Z ( k1 αj ) = p(i)dDα . αi p∈B Q(p ∈ B|X1 = i) = This means that ( E(f (p(1), ...p(k))|X1 = i) = E(p(1)f (p(1), ...p(k))) Pk 1 αj ) αi . (2.2) For any vector of non-negative integers r = (r1 , . . . , rk ), let r∗ = r + ei = Pk ∗ Pk ∗ ∗ (r1 , . . . , ri + 1, . . . , rk ). Let r = 1 rj . Also let α = α + 1 rj , r = Q ei = (α1 , . . . , αi + 1, . . . , αk ). By choosing f (p(1), . . . 
, p(k)) = k1 p(j)rj in equation (2.2) we get E( k Y j=1 rj p(j) |X = i) = E( k Y p(j) ) j=1 Y = rj∗ [r ∗ ] αj j ( Pk 1 αj ) αi P ( k1 αj ) P ( k1 αj )[r∗ ] αi [r ] (α1∗ )[r1 ] . . . (α∗ )k k P . ( αj∗ )[r] = This proves that the conditional distribution of p given X = i is Dα+ei . It also yields another proof of Lemma 1.2 as follows. Dα (p ∈ B) = Q(p ∈ B) = = k X 1 k X 1 Q(p ∈ B)|X = i)Q(X = i) αi Dα+ei (p ∈ B) Pk . 1 αi 10 We will examine this conclusion of Lemma 1.2 further. Let A ∼ B(1, Pk 1 αi ), V ∼ Dα and A, V be independent. Let X be independent of (A, V) with Q(X = i) = α P ki 1 αi for i = 1 . . . k, Let δx = (δx (1), . . . δx (k)) = ei if X = i, i.e. let δX = eX . So what is the distribution of AδX + (1 − A)V? Q(AδX + (1 − A)V ∈ B) = E(Q(AδX + (1 − A)V ∈ B|X = i)) k X = Q(AδX + (1 − A)V ∈ B|X = i)Q(X = i) = 1 k X 1 Q(Aei + (1 − A)V ∈ B|X = i)Q(X = i) αi = Dα+ei Pk 1 αi = Dα . This means that d V = A1 δx1 + (1 − A1 )V d = A1 δx1 + (1 − A1 )[A2 δx2 + (1 − A2 )V] d = A1 δx1 + A2 (1 − A1 )δx2 + A3 (1 − A1 )(1 − A2 )δx3 + . . . where A1 , A2 . . . , X1 , . . . , V are independent, A1 , A2 . . . are i.i.d B(1, X1 , X2 . . . are i.i.d. with Q(X = i) = α Pki 1 αi Pk 1 αi ), for i = 1 . . . k. Thus the right hand side is representation of the finite dimensional Dirichlet distribution Dα . 11 3 The nonparametric problem Let X1 , X2 . . . be i.i.d F or P . Recall that a probability measure is a set function P = P (B), B ∈ B satisfying • P (A) ≥ 0 for all A ∈ B, • P (R1 ) = 1 and P (φ) = 0, and • if A1 , A2 · · · ∈ B are pairwise disjoint, then P (∪∞ 1 Ai ) = P∞ 1 P (Ai ). Question: How to verify if a set function P is a probability measure on the real line? Do we know all the probability measures on the real line? Let P ((−∞, x]) = F (x) The function F (x) is called the distribution function (df) associated with the probability measure P . It satisfies the conditions: • F is monotone non-decreasing, • F (x) → 0 as x → −∞, F (x) → 1 as x → ∞, and • F (x) is right continuous. To show F (x) that satisfies the above conditions, we need to verify an uncountable number of conditions. Consider the restriction F ∗ (x) of F on {x ∈ R∗ }, where R∗ is the subset of rational numbers in R1 . It will satisfy • F ∗ is monotone non-decreasing, 12 • F ∗ (x) → 0 as x → −∞, F ∗ (x) → 1 as x → ∞, and • F ∗ (x) is right continuous when viewed as a function on R∗ . The number of conditions that we have to verify here is countable. Thus there is a 1 ↔ 1 correspondence among df’s F ∗ , F and pm’s P . So, one can determine if P is a probability measure by verifying just a countable number of conditions. This result holds when we look at probability distributions in separable complete metric spaces. 13 January 22, 2008 Let us recall some definitions and theorems: Definition 3.1. A field is a non-empty class of subsets of Ω closed under finite unions, finite intersections and complements, and containing Ω. Definition 3.2. A σ-field is a non-empty class of subsets of Ω closed under countable unions, countable intersections and complements, and containing Ω. Definition 3.3. P0 is countable additive on a field F0 , if P (∪∞ 1 An ) = ∞ X P (An ) 1 for every collection {An ∈ F0 ,n ≥ 1} of disjoint sets whose union ∪∞ 1 An is in F′. Definition 3.4. P is a probability measure (pm) if it is defined on a σ-field F and P (Ω) = 1 and is countable additive on F , i.e. if P (∪∞ 1 An ) = ∞ X P (An ) 1 for every collection {An ∈ F ,n ≥ 1} of disjoint sets. Definition 3.5. 
Let F0 = finite unions of sets of the form {(a, b], −∞ ≤ a < b < ∞} and F = smallest sigma field containing F0 . The σ-field F is called the Borel σ-field in R. 14 Theorem 3.1. Caratheodory Extension Theorem: Let F0 be a field of subsets of Ω, and let P0 : F0 → [0, 1] be a countable additive set function on F0 with P0 (Ω) = 1. Then there is a unique probability measure P on F = σ(F0 ) that extends P0 . Recall the field F0 of finite unions of half-open intervals in R and the Borel σ-field F in R. A function F on R is a distribution function if I limx→−∞ F (x) = 0, limx→∞ F (x) = 1, II F is non-decreasing, i.e. F (x) ≤ F (y) if x ≤ y, and III F is right continuous, i.e limx↓y F (x) =F (y). Let F be a distribution function and define P0 ((a, b]) = F (b)−F (a). Then P0 is well-defined on F0 and P0 is countable additive on F0 . By the Caratheodory Extension Theorem, there is a unique extension P of P0 which is a probability measure on F . Thus there is a 1-1 correspondence between distribution functions and probability measure. If we want to show that some function is a distribution function, there are uncountable conditions to verify. We can decrease that to a countable conditions as follows. Let F ∗ (x), x ∈ R∗ be the restriction of the df F to the rationals R∗ . Then F ∗ will satisfy conditions I, II and III. Conversely, if F ∗ is a function on R∗ , it takes only a countable number of conditions to verify that it satisfies conditions I, II and III. We can extend such a function F ∗ to a unique df F on the real line and to a unique pm P on the Borel sets. 15 Let P = {all probability measures P on (R, F )}. If we are going to place a pm on P we will need a σ-field on this space. Let NA,r = {P : P (A) < r} where A ∈ F , 0 ≤ r ≤ 1. Define σ(P) be the smallest σ-field containing NA,r , for all A ∈ F and r ∈ [0, 1]. This σ-field is also the smallest σ-field under which P (A) is measurable for all R A ∈ F . Further more, {P : f dP < r} will be in σ(P) if f is bounded and measurable. Let A = (A1 , . . . Ak ) be a finite measurable partition of R. Suppose that we can assign ΠA as the probability distribution of P (A) = (P (A1 ), . . . , P (Ak )), for each finite measurable partition A. We will examine conditions under which this defines a pm on P. We will definitely need a consistency condition. Let A∗ be the partition obtained by unions of some disjoint collections of subsets of A. Then we can add the corresponding elements in P (A) to form P (A∗ ). Call this mapping φA,A∗ . The consistency condition we need is Condition A: ΠA φ−1 A,A∗ = ΠA∗ . Assignment problem: Let h(θ1 , . . . θk ) be the pdf of prior distribution for θ. Then the posterior distribution of θ given X1 = i is ∝ θi h(θ1 , . . . θk ). As a particular example which satisfies the consistency condition A, we can postulate that (P (A1 ), . . . , P (Ak )) ∼ D(α(A1 ), . . . , α(Ak )) for each finite partition A, where α(·) is a non-zero finite countable additive measure on (R, F ). The consistency condition A holds because (P (A1 ) + P (A2 ), P (A3), . . . P (Ak )) ∼ D(α(A1) + α(A2 ), α(A3), . . . , α(Ak )) ∼ D(α(A1 ∪ A2 ), α(A3), . . . , α(Ak )) ∼ (P (A1 ∪ A2 ), , P (A3), . . . P (Ak )) 16 January 24, 2008 4 Definition of distributions of (P, σ(P)) It is intuitive that a distribution for a probability measure which is but a set function (P (A), A ∈ F ) could be defined by specifying the finite dimensional distributions of P (A) = (P (A1 ), . . . , P (Ak )) for all finite measurable partitions A = (A1 , . . . , Ak ) of R. 
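As a quick numerical aside (not part of the original notes), the consistency condition A for the Dirichlet assignment ΠA = D(α(A1), . . . , α(Ak)) can be checked by simulation: merging two cells of the partition should produce a vector whose mixed moments agree with formula (2.1) evaluated at the merged parameters. The partition size, the values α(Ai), the moment order r and the Monte Carlo settings below are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def ascending_factorial(a, r):
    """a^[r] = a(a+1)...(a+r-1), with a^[0] = 1."""
    return np.prod(a + np.arange(r)) if r > 0 else 1.0

def dirichlet_moment(alpha, r):
    """E(U1^r1 ... Uk^rk) for U ~ D(alpha), by formula (2.1)."""
    num = np.prod([ascending_factorial(a, ri) for a, ri in zip(alpha, r)])
    return num / ascending_factorial(sum(alpha), sum(r))

# Partition A = (A1, A2, A3, A4) with illustrative weights alpha(Ai).
alpha = np.array([0.5, 1.0, 2.0, 1.5])
samples = rng.dirichlet(alpha, size=200_000)          # (P(A1),...,P(A4)) ~ D(alpha)

# Merge A1 and A2: the map phi_{A,A*} sends (p1,p2,p3,p4) to (p1+p2, p3, p4).
merged = np.column_stack([samples[:, 0] + samples[:, 1], samples[:, 2:]])

# Condition A says the merged vector should be D(alpha(A1)+alpha(A2), alpha(A3), alpha(A4)).
r = (2, 1, 1)                                         # an arbitrary mixed moment
empirical = np.mean(np.prod(merged ** np.array(r), axis=1))
exact = dirichlet_moment([alpha[0] + alpha[1], alpha[2], alpha[3]], r)
print(f"empirical moment {empirical:.5f}  vs  merged-Dirichlet moment {exact:.5f}")
```

The two printed numbers agree up to Monte Carlo error, which is exactly what Condition A asserts for this family.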
In this section we present this idea in a rigorous fashion. Let A∗ = (A∗1 , . . . A∗m ) be a sub-partition of A, i.e. every set in A∗ is a P union of sets in A. Clearly, P (A∗i ) = j:Aj ⊂A∗ P (Aj ), i = 1, . . . , m. This i ∗ defines a function φA,A∗ from P (A) to P (A ). If we know the distribution of P (A), then we know the distribution of P (A∗ ), and it should agree with its assigned distribution. We call this the consistency condition among the finite dimensional distributions of P (A). We can now formally state the following theorem. Theorem 4.1. Consider a pm P on (R, F ), the real line with its Borel σfield. Suppose that we can assign distributions πA for P (A) = (P (A1 ), . . . , P (Ak )) for each finite partition A of R which satisfy the following two conditions. • Consistency: if A∗ is sub-partition of A then πA∗ = πA φ−1 A,A∗ , and • Continuity: if An ց φ then P (An ) ց 0 in distribution. (Note that this condition is equivalent to E(P (An )) ց 0 since {P (An )} is bounded..) 17 Then there is a unique pm ν on (P, σ(P)) such that the distribution of P (A) under ν is πA for all finite partitions A. Proof: Some preliminaries: Consider (R, F ), the Real line with its Borel σ-field. Let P be the class of all probability measures P on (R, F ). Given a pm P we can consider its df F and its restriction F ∗ to the rationals. Let H be the class of all df’s (i.e. functions on R satisfying conditions I, II and III) and let H∗ be the class of all functions F ∗ (i.e. functions on R∗ satisfying conditions I, II and III). We have shown that there is a 1 ↔ 1 relationship between a pm P ∈ P, F ∈ H and F ∗ ∈ H∗ . We can write this as P ↔ F ↔ F ∗. Let us denote the mapping from F ∗ to P as φ: φ H∗ → P. The space H ∗ can also be viewed as a subspace of [0, 1]∞ and incorporate the standard product σ-field into H∗ . The function φ is a measurable map from H∗ to (P, σ(P)). Thus if one can define a pm ν ∗ on H∗ , then ν = ν ∗ φ−1 will be a probability measure on (P, σ(P)). This is one easy way to define pm’s on (P, σ(P)). This is the end of the preliminaries. Consider the space H∗ of df’s F ∗ restricted to the rationals. By identi- fying F ∗ (x) = P ((−∞, x]), x ∈ R∗ one can define the distribution πx1 ,...,xk of (F ∗ (x1 ), . . . F ∗ (xk )) for finite collection of points (x1 , . . . xk ) in R∗ in a consistent fashion because the collection πA is consistent. 18 From the Kolmogorov consistency theorem there exists a unique pm ν ∗ on [0, 1]∞ under which (F ∗ (x1 ), . . . F ∗ (xk )) has distribution πx1 ,...,xk . Let xn ց x where xn , x are in R∗ . Then the distribution of F ∗ (xn )−F (x) is that of P ((x, xn ]) and tends to 0 in distribution and with probability 1 since we have defined a joint distribution of {F ∗ (x), x ∈ R∗ }. To verify that F ∗ is right continuous on R∗ we need to verify only a countable number of such cases. Thus we conclude that F ∗ is right continous in R∗ with ν ∗ -probability 1. It is also clear that F ∗ is non-decreasing in R∗ with ν ∗ -probability 1. Hence the pm ν ∗ is supported on the subset H∗ of [0, 1]∞ . φ We can now use the mapping F ∗ → P from H∗ to P, to obtain ν ∗ φ−1 which is now a pm on (P, σ(P)). Furthermore the distribution of P (A) under ν is πA . This completes the proof of this theorem. 5 First definition of Dirichlet distributions (processes) We will illustrate the method described in the previous section with an example. Let α(·) be a non-zero finite measure on (R, F). Postulate that the distribution of P (A) = (P (A1 ), . . . 
P (Ak )) is the finite dimensional Dirichlet distribution D(α(A1 ), . . . α(Ak )), for all finite partitions A of R. The properties of finite dimensional Dirichlet distributions show that the distributions {πA } are consistent. We know that the distribution of P (A) is the Beta distribution B(α(A), α(Ac )); thus E(P (A)) = α(An ) α(R) α(A) . α(R) Hence, if An ց φ, then E(P (An )) = ց 0 and the distributions {πA } satisfy the continuity condition. 19 Hence there is a unique pm Dα on (P, σ(P)) such that the distribution of P (A∗ ) = (P (A1 ), . . . P (Ak )) is D(α(A1), . . . α(Ak )), for all finite partitions A of R. This is the Dirichlet measure (also called a Dirichlet process) of Ferguson. 6 Posterior distribution under the Dirichlet prior We will now show how to obtain the posterior distribution under a Dirichlet prior. Now let P ∈ P and P ∼ D α . Let X be a r.v. such that X|P ∼ P . The distribution of P |X is the posterior distribution given X. The posterior distribution is also a distribution on (P, σ(P)). We will define such a distribution by employing the technique outlined in Section 4; in other words we will find the posterior finite dimensional distributions of P (A) for all finite partitions A. Let Q be the joint distribution of (X, P ). We will find Q(P (A) ∈ B|X) for all finite partition A and appropriate sets B (namely measurable subsets of Rk ). This conditional probability Q(P (A)|X) is a function satisfying E(Q(P (A) ∈ B|X)I(X ∈ C)) = E(I(P (A) ∈ B)I(X ∈ C)) for all C ∈ F . 20 Simplifying the right hand side (RHS) we get RHS = Q(P (A) ∈ B, X ∈ C) = E[Q(X ∈ C|P )I(P (A) ∈ B)] (6.3) = E[P (C)I((P (A) ∈ B)]. Let Ai1 = Ai ∩ C, Ai2 = Ai ∩ C c . Then C = ∪k1 Ai1 . We can write the RHS in (6.3) as E[P (C)I((P (A1), . . . P (Ak )) ∈ B)] = k X 1 E(P (A)) ∈ B)). (6.4) Using the fact that the distribution of (P (A11 ), P (A12 ), P (A2 , ) . . . P (Ak )) is D(α(A)) = D(α(A11 ), α(A12 ), α(A2 ), . . . , α(Ak )), we can write the first term in the summation in (6.4) as = = = = E(P (A11 )I((P (A11 ) + P (A12 ), P (A2, ) . . . P (Ak )) ∈ B)) Z y11 dD(α(A11 ), α(A12 ), α(A2 ), . . . α(Ak )) (y11 +y12 ,y2 ,...yk )∈B Z α(A11 ) dD(α(A11 ) + 1, α(A12 ), α(A2 ), . . . α(Ak )) α(R) (y11 +y12 ,y2 ,...yk )∈B Z α(A11 ) dD(α(A1 ) + 1, α(A2), . . . α(Ak )) α(R) B E[D((α + δX )(A1 ), . . . , (α + δX )(Ak ))(B)I(X ∈ A11 )]. since ((α + δX )(A1 ), . . . , (α + δX )(Ak )) = (α(A1 ) + 1, α(A2 ), . . . , α(Ak )) when X ∈ A11 Thus equation (6.3) becomes RHS = k X i E[D((α + δX )(A))(B)I(X ∈ Ai1 )] = E[D((α + δX )(A))I(X ∈ C))], and hence Q(P (A) ∈ B|X) = D((α + δX )(A))(B). 21 Therefore the distribution of P (A) under the posterior distribution is the finite dimensional Dirichlet distribution D((α + δX )(A)). From the technique in Section 4, these finite dimensional distributions uniquely define the posterior distribution of P given X to be Dα+δX . 22 January 29, 2008 7 Posterior distribution given n observations Let the parameter in the nonparametric problem, namely the unknown probability measure P , have Dα as its prior distribution . Suppose that the data consists of just one observation X (taking values in (X , F )), which is such that X|P ∼ P . We saw in the last class that the posterior distribution, the conditional distribution of P given X, can be written as P |X ∼ Dα+δx . What if we the data consisted of a a sequence of observations X1 , . . . , Xn ? i.i.d To repeat, suppose that X1 , X2 , . . . Xn |P ∼ P . Then what is the distribution of P |X1 , . . . Xn . 
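Before deriving the answer, here is a small simulation sketch (not from the notes) of what the update looks like on a fixed finite partition, using the one-observation result of Section 6 applied repeatedly (the n-observation form is derived just below). The base measure α = c · N(0, 1) with c = 4, the partition of R into four intervals, and the hypothetical data are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Illustrative base measure alpha(.) = c * N(0,1), with total mass alpha(R) = c.
c = 4.0
edges = np.array([-np.inf, -1.0, 0.0, 1.0, np.inf])    # partition A1,...,A4 of R
alpha_A = c * np.diff(norm.cdf(edges))                 # (alpha(A1),...,alpha(A4))

# Hypothetical data playing the role of X1,...,Xn | P  i.i.d. P.
data = rng.normal(loc=0.5, scale=1.0, size=25)
counts = np.array([np.sum((data > lo) & (data <= hi))
                   for lo, hi in zip(edges[:-1], edges[1:])])   # N_i = #{j : X_j in A_i}

# The posterior of (P(A1),...,P(A4)) is D(alpha(A_i) + N_i): the one-observation
# update of Section 6, applied once per observation.
post_alpha = alpha_A + counts
post_mean = post_alpha / post_alpha.sum()              # = (alpha(A_i) + N_i) / (alpha(R) + n)

print("prior mean of P(A_i):    ", np.round(alpha_A / alpha_A.sum(), 3))
print("posterior mean of P(A_i):", np.round(post_mean, 3))
print("one posterior draw:      ", np.round(rng.dirichlet(post_alpha), 3))
```

On any fixed partition the posterior is again finite dimensional Dirichlet, with α(Ai) replaced by α(Ai) plus the number of observations falling in Ai.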
Let us first obtain the distribution of P given X1 , X 2 . Note that given X1 , P has distribution Dα+δX1 and the distribution of X2 , given (X1 , P ) is P . From this it follows that the distribution of P given X1 , X2 is Dα+δX1 +δX2 . In the same way, it follows that the distribution of P given X1 , . . . , Xn is Dα+Pn1 δXi . 8 A little discussion on conditional probability The conditional probability of A given B denoted by P (A|B) is given by P (AB) P (B) and similarly, P (A|B c ) = P (AB c ) , P (B c ) when P (B) > 0, P (B c ) > 0. However,it is helpful to consider P (A|σ(B)), where σ(B) = {φ, B, B c , Ω) is the σ-field generated by B, as a function. Thus we will say that the 23 conditional probability P (A|σ(B)) is a function of ω, which satisfies the condition P (A|σ(B))(ω) = P (AB) P (B) P (AB) , P (B c ) if ω∈B if ω ∈ Bc. More generally, the conditional probability P (A|σ(B)) is a σ(B) measurable function satisfying E(P (A|σ(B))(ω)I(ω ∈ C)) = E(I(ω ∈ A)I(ω ∈ C)). for each C ∈ σ(B). Let (Ω, F , P) be a probability measure space. Let B be a sub-σ-field of F . Also let A ∈ F. Then conditional probability of A given B is a function measurable with respect to B, such that Z P (A|B)(ω)I(ω ∈ C)dP = P (A ∩ C) for each C ∈ B. Note that the function I(ω ∈ A) also satisfies this condition except that it is not be measurable with respect to B, in the interesting case when A 6∈ B, and so will not be a candidate for a conditional probability. Definition 8.1. If there is a version P (A|B)(ω) of the conditional probability such that it is a probability measure in A for each ω, then we will say that it is a regular conditional probability. If Ω is R1 , or Rk or separate complete metric space then a regular conditional probability exists. 24 9 Martingales Example 9.1. Let X1 , X2 . . . be i.i.d with E(Xi ) = 0. Let Sn = X1 +. . . Xn . Then E(Sn |X1 , . . . Xn−1 ) = E(Sn−1 + Xn |X1 , . . . Xn−1 ) = Sn−1 + E(Xn |X1 , . . . Xn−1 ) = Sn−1 + E(Xn ) = Sn . The sequence {Sn } is a example of a martingale. More formally: Definition 9.1. Let Fn be an increasing sequence of σ-fields and Xn be Fn measurable, n = 1, 2, . . . . Let E(|Xn |) < ∞, n = 1, 2, . . . . Then {Xn , Fn } is said to be a martingale if E(Xn |Fn ) = Xn−1 , n = 2, 3, . . . . Theorem 9.1. Let {(Xn , Fn )} be a martingale which is L1 bounded, i.e, there is K < ∞ such that E(|Xn |) ≤ K for all n. Then there exists a random variable X∞ , such that Xn −→ X∞ with probability 1. In our earlier example, E(|Sn |) is not L1 bounded, and so this theorem will not apply. The following is an example of another martingale which is L1 -bounded and hence convergent. Further more it is uniformly integrable. 25 Theorem 9.2. Let Fn be a sequence of σ-fields increasing to F∞ . Let X be a r.v. with E(|X|) < ∞. Let Xn = E(X|Fn ). Then {Xn } is a uniformly integrable martingale and Xn = E(X1 |Fn ) → E(X1 |F∞ ) with probability 1 and in L1 . Definition 9.2. Let Fn be a decreasing σ-fields and let Xn be Fn measurable with E(|Xn |) < ∞, n = 1, 2, . . . . We say that{Xn } is a reverse martingale if E(Xn |F n+1 ) = Xn+1 for n = 1, 2 . . . . The following is a result for reverse martingales. Theorem 9.3. Let {(Xn , Fn )} be a reverse martingale. Then there exits a random variable X∞ , such that Xn −→ X∞ with probability 1 and in L1 (i.e. E(|Xn − X∞ |) −→ 0.) The following is an example of a reverse martingale. Let Fn be a decreasing σ-fields decreasing to F∞ . Let X1 be F1 measurable and let E(|X1 |) < ∞. 
Then the sequence {Xn = E(X1 |Fn )} is a reverse martingale converging to E(Xn |F∞ ). 10 Product measure spaces Suppose that (X , F ) is a probability measure space. The product σ-field F 2 in the product space X 2 is defined as follows. Consider the rectangle set A × B = {(x1 , x2 ) : x1 ∈ A, x2 ∈ B}. The smallest σ-field containing such rectangle sets is the σ-field F 2 . 26 Question: How to induce a product σ-field in X ∞ ? Notice that a set C ∈ F 2 can be viewed a subset C∗ of X ∞ as follows: C∗ = {(x1 , x2 , . . . ), (x1 , x2 ) ∈ C, x3 ∈ X , x4 ∈ X , . . . } ⊂ X ∞ . Let F∗2 = {C∗ : C ∈ F }. Thus we can define σ-fields F∗2 ⊂ F∗3 ⊂ . . . . Let n ∞ F0 = ∪∞ in X ∞ is defined 1 F∗ , which is not a σ-field. The product σ-field F to be σ(F 0 ), the smallest σ-field containing F0 . Theorem 10.1. Let λn be a probability measure on (X n , F n ), n = 1, 2, . . . and let {λn } be a consistent sequence. Then there exists a unique probability measure λ∞ on (X ∞ , F ∞ ) which extends λn , i.e. λn (C) = λ(C∗ ) for all C ∈ Fn , n = 1, 2, . . . . Let us look at some other σ-fields on X ∞ . Consider the set A = {X : x1 + x2 ≤ 10}. If (x1 , x2 , x3 , . . . ) ∈ A, then (x2 , x1 , x3 , . . . ) ∈ A. We can say that the set A is invariant under τ , the permutation of the first two coordinates of (x1 , x2 , . . . ), i.e. under τ defined by τ (x1 , x2 , x3 , . . . ) = (x2 , x1 , x3 , . . . ). Define G 2 = {C : τ (C) = C, C ∈ F ∞ }. Then G 2 is a σ-field. Define the σ-field G n = {C : τ C = C, C ∈ F ∞ } for each permutation τ of the first n coordinates of (x1 , x2 , . . . ). Then F ∞ = G 1 ⊃ G 2 ⊃ · · · ⊃ G n −→ G ∞ . The σ-field G ∞ is called the invariant σ-field in X ∞ . There is also a σ-field T called the tail σ-field contained in the invariant σ-field G ∞ . Definition 10.1. A sequence of random variables (X1 , X2 . . . ) is said to be exchangeable, if the distribution of (X1 , . . . , Xn ) = (Xi1 , . . . , Xin ) for all 27 permutation (i1 , . . . , in ) of (1, . . . , n), n = 2, 3, . . . . Let Q be the distribution of (X1 , X2 . . . ), then Q is said to be exchangeable. An example of an exchangeable sequence of random variables is a sequence of i.i.d. random variables X1 , X2 , . . . . Let (X1 , X2 . . . ) be exchangeable. Let f be a bounded measurable function on X . Since (X1 , X2 , X3 , . . . ) ∼ (X2 , X1 , X3 . . . ) and f (x1 +x2 ) 2 is G2 - measurable, E(f (X1 )|G 2 ) = E(f (X2 )|G 2 ) f (X1 ) + f (X2 ) 2 = E( |G ) 2 f (X1 ) + f (X2 ) . = 2 More generally, n 1X def E(f (X1 )|G ) = f (xi ) = An (f ). n 1 n Therefore {An (f ), G n } is a reverse martingale with E(|A1 (f )|) < ∞. Hence An (f ) → A∞ (f ) = E(f (X1 )|G ∞ ) with probability 1 and E(|An (f ) − A∞ (f )|) → 0. 28 January 31, 2008 11 Review Let Xi ∈ X for i = 1, 2, . . . . The random variables (X1 , X2 , . . . ) with joint distribution Q is exchangeable, if (X1 , . . . Xn ) ∼ (Xi1 , . . . , Xin ) for all permutations (i1 , . . . , in ) of (1, 2, . . . n), for n = 2, 3, . . . . In other words, (X1 , X2 , . . . ) is exchangeable if Qτ −1 = Q for all finite permutations τ defined as τ (X1 , X2 , . . . , Xn Xn+1 ) = (Xi1 , . . . Xin , Xn+1 , . . . ), for n = 1, 2, . . . . A simple example of exchangeable random variables is a sequence (X1 , X2 , . . . ) of i.i.d. random variables. Another example of exchangeable random variables is given by: Example 11.1. Let X1 , X2 , . . . be r.v.on the (X , F). Let θ be a r.v., perhaps, on another space with distribution π. Suppose that Q(X1 ∈ A1 , . . . , Xn ∈ An |θ) = Pθ (A1 ) . . . 
Pθ (An ) for all A1 ∈ F , . . . An ∈ F , n = 1, 2, . . . where {Pθ } is a family of probability measures indexed by θ. Then Q(X1 ∈ A1 , . . . Xn ∈ An ) = E(Q(X1 ∈ A1 , . . . Xn ∈ An |θ)) Z Y n = Pθ (Ai ) dπ(θ) 1 = Q(X1 ∈ Ai1 , . . . Xn ∈ Ain ) = Q(Xj1 ∈ A1 , . . . Xjn ∈ An ) 29 where (j1 , . . . , jn ) is the anti-permutation of (i1 , . . . , in ). Hence (X1 , X2 , . . . ) is exchangeable. Theorem 11.1. (De-Finetti’s Theorem) Let (X1 , X1 . . . ) ∼ Q be exchangeable. Then there is a random probability measure P = P (·, ω) = P (A, ω) on P the space of probability measures on (X , F ) such that Q(X1 ∈ A1 , . . . Xn ∈ An |P ) = n Y P (Ai) 1 with probability 1 for all (A1 , . . . , An ). In other words, X1 , X2 , . . . |P are i.i.d P. The distribution of P will depend on Q and can be denoted by νQ . One of the consequences of this theorem is that every exchangeable distribution Q of (X1 , X2 , . . . ) determines a probability measure νQ on P, which can serve as a prior distribution for us in a nonparametric problem. To prove the De-Finetti Theorem, we will use the theorem from our last lecture. Recall that G n = {A : A ∈ F ∞ , τ A = A, for all permutations τ of (1, . . . n), n = 1, 2, . . . and F ∞ = G 1 ⊃ G 2 ⊃, . . . , ⊃ G ∞ . Let f be a P bounded measurable function; then An (f ) = n1 n1 f (xi ) = E(f1 (X)|G n ). We showed that {An (f ), G n } is a reverse martingale and An (f ) → A(f ) = E(f (X1 )|G ∞ ) with probability 1 and in L1 . This result can be used to prove the Kolmogorov’s large of law numbers P and also that E(| n1 n1 Xi − E(X1 )|) → 0. Proof of De-Finetti Theorem: Recall that (X , F, P) is a probability 30 space. Let A ∈ F and put f (X1 ) = I(X1 ∈ A). We also know that n 1X I(Xi ∈ A) Fn (A) = n 1 = E(I(X1 ∈ A)|G n ) = Q(X1 ∈ A|G n ) and that Fn (A) is a reverse martingale. Hence, from the reverse martingale convergence theorem Theorem 9.3, Fn (A) → F (A) with probability 1 and in L1 , where F (A) = E(I(X1 ∈ A)|G ∞ ) = Q(X1 ∈ A|G ∞ ). We also know that Q(X1 ∈ A|G ∞ ) exists a regular conditional probability, i.e. it is equal to P (A, ω), where P (A, ω) is measurable in ω for each A and is a probability measure in A for each ω. Thus we can say that Fn (A) → P (A, ·) with probability 1 and in L1 , for each A. We can also say that the random probability measures Fn (·) converge weakly to the random probability measure P (·) with probability 1, since weak convergence can be determined by a countable number of conditions. Let A1 , A2 ∈ F . For n ≥ 2 E(I(X1 ∈ A1 , X2 ∈ A2 )|G n ) = = X 1 I(Xi ∈ A1 , Xj ∈ A2 ) n(n − 1) i6=j n n X X 1 I(Xj ∈ A2 )) I(Xi ∈ A1 ))( ( n(n − 1) 1 1 n X 1 − ( I(Xi ∈ A1 )(I(Xi ∈ A2 ))) n(n − 1) 1 → P (A1 , ω)P (A2, ω) with probability 1 and in L1 . 31 Since Pn I(Xi ∈ A1 ) → P (A1 , ω). n Pn 1 I(Xj ∈ A2 ) → P (A2 , ω), and n n n 1X 1X I(Xi ∈ A1 )(I(Xi ∈ A2 )) = I(Xi ∈ A1 ∩ A2 ) → P (A1 ∩ A2 , ω) n 1 n 1 1 with probability 1 and in L1 , we obtain E(I(X1 ∈ A1 , X2 ∈ A2 )|Gn )(ω) → E(I(X1 ∈ A1 )I(X2 ∈ A2 )|G∞ ))ω) = P (A1 , ω)P (A2, ω) with probability 1 and in L1 , where P (A, ω) is a regular conditional probability and a version of Q(X1 ∈ A|G ∞ )(A, ω). Thus X1 , X2 . . . |G ∞ are i.i.d P (·). We can say a little more. Since the random probability measure P (·) is equal to Q(·|G ∞ ), it is G ∞ -measurable. We can therefore conclude that X1 , X2 , . . . |P are i.i.d. P . Here we have used the following fact - If E(X|S) is T -measurable where T ⊂ S then E(X|S) = E(X|T ). We want to push this discussion some more. 
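As an illustration (not from the notes) of the convergence Fn(A) → P(A, ·) that drives the proof above, one can simulate an exchangeable sequence by the mixture recipe of Example 11.1: first draw θ from a mixing distribution π, then, given θ, draw X1, X2, . . . i.i.d. Pθ. The Beta(2, 2) mixing distribution and the Bernoulli(θ) kernel below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Mixture construction as in Example 11.1: theta ~ Beta(2,2), then X_i | theta i.i.d. Bernoulli(theta).
def exchangeable_path(n):
    theta = rng.beta(2.0, 2.0)
    x = rng.binomial(1, theta, size=n)
    return theta, x

n = 5000
for _ in range(3):                                     # three independent exchangeable paths
    theta, x = exchangeable_path(n)
    Fn_A = np.cumsum(x == 1) / np.arange(1, n + 1)     # F_n(A) for A = {1}
    # F_n(A) settles near P(A, omega) = theta, which varies from path to path:
    print(f"theta = {theta:.3f},  F_n(A) at n = 100, 1000, 5000: "
          f"{Fn_A[99]:.3f}, {Fn_A[999]:.3f}, {Fn_A[-1]:.3f}")
```

Along each path the empirical frequencies converge, but the limit P(A, ω) = θ(ω) is random: it is the directing random measure of De-Finetti's theorem evaluated at A.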
For instance, since Fn (A) → P (A, ·) in L1 we obtain E(P (A)) = lim E(Fn (A)) = E(E(Q(X1 ∈ A|G1 )) = Q(X1 ∈ A). In fact, we can find all moments of the random probability measure P . Let A1 , . . . , Ak be subsets in F. We will evaluate E(P (A1 ) · · · P (A1 ) P (A2) · · · P (A2 ) · · · P (Ak ) · · · P (Ak )) where r1 , . . . , rk are | {z }| {z } | {z } r1 r2 rk 32 non-negative integers. Let n ≥ r1 + · · · + rk . Consider E(I(X1 ∈ A1 , . . . Xr1 ∈ A1 , Xr1 +1 ∈ A2 , . . . , Xr1 +···+rk ∈ Ak |Gn )) r1 +···+r r1 rY 1 +r2 Y Y k ∞ ∞ → Q(Xi ∈ A1 |G ) Q(Xi ∈ A2 |G ) · · · Q(Xi ∈ Ak |G ∞ ) 1 = k Y r1 +1 r1 +···rk−1 P ri (Ai ) 1 with probability 1 and in L1 . Thus Q(x1 ∈ A1 , . . . Xr1 ∈ A1 , Xr1 +1 ∈ A2 , . . . Xr1 +r2 ∈ A2 , . . . Xr1 +···+rk ∈ Ak ) k Y = E( Pir (Ai )). 1 12 Pólya sequences We will look at examples of exchangeable sequences called Pólya sequences. Let X = (1, . . . , k) and let α = (α(1), . . . α(k)) where α(1), . . . α(k) are nonnegative numbers and α(1) + · · · + α(k) = α(X ) > 0. The joint distribution Qα of X1 , X2 , . . . is defined as follows: Qα (X1 = i) = Qα (Xn+1 = i|X1 , . . . Xn ) = α(i) , α(X ) P α(i) + nj=1 δXj (i) (12.5) α(X ) + n n = 2, 3, . . . . We can understand the distribution Qα as a sampling from an urn with balls of k colors with weights α(i), i = 1, . . . , k. The probability of drawing a ball of color i is proportional to the weight of that ball. Let X1 be the color of the first ball drawn. Thus Q(X1 = i) = 33 α(i) . α(X ) Every ball drawn is replaced by a ball of the same color with its weight is increased by 1. Then the next ball is drawn. The distribution of X1 , X2 , . . . satisfies (12.5) and this provides a model for a Póly sequence in a finite space. We will now evaluate the joint distributions of this Pólya sequence. Q((X1 , . . . , Xn ) = (1, . . . , 1, 2, . . . , 2, . . . , k, . . . , k )) | {z } | {z } | {z } r1 r2 rk α(1)(α(1) + 1) . . . (α(1) + r1 − 1) . . . (α(k) + rk − 1) = α(X )(α(X + 1)) · · · (α(X ) + r1 + rk − 1) [r1 ] α (1) . . . α[rk ] (k) = α[r1+···+rk ] (X ) Where a[r] = a(a + 1) . . . (a + r − 1). Since this probability remains unchanged if the values of the colors X1 , . . . , Xn are permuted, the probability distribution Qα is exchangeable. We can extend this example to general spaces X . Let α be a non-negative finite measure on the probability space (X , F). Define Qα the distribution of the Pólya sequence X1 , X2 , . . . by α(A) α(X ) P α(A) + n1 δXi (A) ∈ A|X1 , X2 , . . . , Xn ) = α(X ) + n Q(X1 ∈ A) = Q(Xn+1 (12.6) for n = 2, 3, . . . . Let φ : X → Y, and let Y1 = φ(X1 ),Y2 = φ(X2 ), . . . . Then Y1 , Y2 , . . . is a Pólya sequence with parameter β = αφ−1 . In particular, let φ : X → {1, 2, . . . , k} with φ(x) = i ↔ x ∈ Ai , i = 1, . . . , k where (A1 , . . . , Ak ) is a finite partition of X . Then Y1 , Y2 , . . . has 34 the Pólya distribution Q(α(A1 ),...,α(Ak )) . From our previous calculations, Q(X1 ∈ A1 , . . . , Xr1 ∈ A1 , Xr1 +1 ∈ A2 , . . . , Xr1 +···+rk ∈ Ak ) = 35 Qk [ri ] (Ai ) 1α . [r +···+r ] (X ) 1 k α Feb. 5, 2008 13 Review Let the random variables X1 , X2 , . . . with values in (X ∞ , F ∞ ) and with joint P distribution Q be exchangeable. Let Fn (A) = n1 n1 I(Xi ∈ A), A ∈ F be the empirical distribution of X1 , . . . , Xn . We established that Fn (A) → P (A) with probability 1 and in L1 , for each A ∈ F , where P (A) = Q(X1 ∈ A|G ∞ ) is a regular conditional probability measure. Recall that G ∞ is the invariant σ-field in X ∞ . We also established that ∞ Q(X1 ∈ A1 , . . . 
Xn ∈ An |G ) = n Y 1 ∞ Q(Xi ∈ Ai |G ) = n Y P (Ai) 1 with probability 1 for each A1 , . . . , An . Since both sides exist as regular conditional probabilities, and two probability measures agree with each other if they agree on a countable determining class of sets, we can also conclude that Q(X1 ∈ ·, . . . Xn ∈ ·|G ∞ ) = n Y 1 Q(Xi ∈ ·|G ∞ ) = n Y P (·) 1 ∞ with probability 1. And finally since P (·) is G -measurable, we can further state this result as Q(X1 ∈ ·, . . . Xn ∈ ·|P ) = n Y 1 In other words, X1 , X2 , . . . |P are i.i.d P . 36 Q(Xi ∈ ·|P = n Y 1 P (·). Can we also say that Fn → P with probability 1 in some sense? Let us examine the case where X1 , X2 , . . . are i.i.d. F and also, hence, exchangeable. The Glivenko-Cantelli theorem states that sup |Fn (x) − F (x)| → 0 x∈X with probability 1. But it is not true that sup |Fn (A) − F (A)| → 0. A∈F For instance, let F be the standard normal distribution. Choose A = {X1 , . . . , Xn }. Then Fn (A) = 1, F (A) = 0 and supA∈F |Fn (A) − F (A)| 6→ 0. However, we can say that w Fn → F with probability 1. In the same way, we can also say that w Fn → P with probability 1 in the case of exchangeable random variables, since µn → µ R R if f dµn → f dµ for a countable collection of functions f . 14 Posterior distibutions in Pólya distributions Going back the case of exchangeable random variables, let νQ be the distribution of P , which will depend on the joint distribution Q. What is the 37 posterior distribution of P , i.e. what is the distribution of P given X1 ? We first look at the distribution of X2 , X2 3, . . . given X1 . This will also be exchangeable; denote it by QX1 . Note that, under this conditional joint P distribution, the sequence n1 n2 I(Xi ∈ A) also converges to P (A) with probability 1. Thus the distribution of P given X1 is the νQX1 . Thus we have found the posterior distribution just by clever notation. Definition 14.1. (Pólya sequence) Let X = {1, . . . , k},α = (α(1), . . . , α(k)). We say X1 , X2 , . . . is a Pólya sequence on X with parameter α and joint distribution Qα on X , if Qα (X1 ∈ A) = α(A) α(X ) Qα (Xn+1 ∈ A|X1 , . . . , Xn ) = α(A) + nFn (A) α(X ) + n for n = 1, 2, . . . . Let α = (α(1), . . . , α(k)) and denote this joint distribution as Q = Qα. We can understand the joint distribution of X1 , X2 , . . . as the distribution of the colors of balls chosen from an urn with k balls as follows. Initially the urn contains balls of k colors with weights (α(1), . . . , α(k)). The probability that a ball drawn from this urn is drawn is proportional to its weight. Each time a ball is drawn, it is replaced and the weight of that ball is increased by 1 before the next ball is drawn. Let X1 X2 , . . . be the colors of the balls that are drawn. Then the distribution of this sequence is the same as the Pólya sequence with parameter α = (α(1), . . . , α(k)). Hence Q(X1 = i1 , . . . Xn = in ) = 38 Qk 1 α[ri ] (i) α[n] where ri = Pn 1 I(Xj = i),n = Pk 1 ri ,α = Pk 1 α(i). This probability depends only on r1 , . . . , rk and thus Qα is exchangeable. We will now this calculation to show that general Pólya sequences are also exchangeable. Let B = (B1 , . . . Bk ) be a partition of X . What is Qα (X1 ∈ A1 , . . . Xn ∈ An ) where A1 , . . . , An ∈ B? ? Let φ : X → Y = {1, 2, . . . , k},i.e. φ(x) = i if x ∈ Bi . Then Y1 , Y2 , . . . is a Pólya sequence on Y with parameter β = (αφ−1 ({1}), . . . , α−1 ({k})). Let A1 = Bi1 , . . . , An = Bin . Then Qα (X1 ∈ A1 , . . . Xn ∈ An ) = Q(Y1 = i1 , . . . 
, Yk = in ) which is equal to Qα (X1 ∈ Aj1 , . . . Xn ∈ Ajn ) for any permutation (j1 , . . . , jn ) of (1, . . . , n). This is also true if the sets A1 , . . . , An are not sets from a partition as seen below. Let the sets A1 , . . . , An be arbitrary and let B be the partition generated by A1 , . . . , An . We can write Qα (X1 ∈ A1 , . . . , Xn ∈ An ) as sums of the form Qα (X1 ∈ B1 , . . . , Xn ∈ Bn ) where B1 , . . . , Bn come from B. Since each of this is invariant under permutations, the joint distribution Qα is exchangeable. We will now calculate moments from a Pólya sequence. Let Qα be the distribution of the Pólya sequence with parameter α, where α is a non- 39 negative finite measure. Let (A1 , . . . Ak ) be a partition of X . Then EQα (P r1 (A1 ) . . . P rk (Ak )) = Qα (X1 ∈ A1 , . . . , X· ∈ A1 . . . , X· ∈ Ak , . . . , . . . , Xn ∈ Ak ) | {z } | {z } r1 rk = Qα (Y1 = 1, . . . , Y· = 1, . . . , Y· = k, . . . , Yn = k ) | {z } {z } | r1 = Qk rk [ri ] α (Ai ) . α(X )[n] 1 From the characterization of a Dirichlet distribution by its moments, we get (P (A1 ), . . . , P (Ak )) ∼ D(α(A1 ), . . . , α(Ak )). Therefore we can identify vQα with Dα that we defined earlier. We will now find the posterior distribution of P . For this we will first find the distribution of X2 , X3 , . . . given X1 . We know that α(A) + nFn (A) α(X ) + n P α(A) + δX1 + n2 δXi = . α(X ) + 1 + n − 1 Qα (Xn+1 ∈ A|X1 , . . . Xn ) = Hence, the distribution of X2 , X3 , . . . given X1 is that of a Pólya sequence with parameter α + δX1 . Thus P |X1 ∼ vQX1 = Dα+δX1 . 15 The random pm P is dicrete with probability 1 We will push these ideas some more. We know that if X1 , X2 , . . . is a Pólya sequence with parameter α, then Pn 2 I(Xi ∈A) n P (A) ∼ B(α(A), α(Ac )). 40 → P (A) with probability 1 and Hence Pn 2 I(Xi ∈ {X1 }) → P ({X1 }) n and P ({X1 }) ∼ B(α({X1 }) + 1, α(X − {X1 })), and E(P ({X1})|X1 ) = α({X1 }) + 1 1 ≥ . α(X ) + 1 α(X ) + 1 Since this lower bound does not depend on X1 , we also have E(P ({X1 })) ≥ 1 . α(X ) + 1 In other words, E(P (X − {X1 })) ≤ α(X ) . α(X ) + 1 Similarly, E(P (X − {X1 , . . . , Xn })) ≤ α(X ) →0 α(X ) + n as n → ∞. Therefore, P is a measure that puts all its mass on a countable number of points or the random probability measure P is discrete with probability 1. We have used special properties of a Pólya sequence in this proof. Consider the example of exchangeable random variables X1 , X2 , . . . which are i.i.d with distribution F where F is not discrete. This is not a Pólya sequence. P The limiting random measure P (A) = lim n1 n1 I(Xi ∈ A) is degenerate at F and is not sitting on the class of discrete pm’s. 41 February 7, 2008 16 Solution to a homework problem Problem: Let X1 , X2 , . . . be a Pólya sequence on (X , F) with parameter α. Denote its joint distribution by Qα . Let φ be a measurable function from φ X to Y, i.e. let φ : (X , F ) → (Y, G). Let Y1 = φ(X1 ), Y2 = φ(X2 ), . . . . Then Y1 , Y2 , . . . is a Pólya sequence on (Y, G) with parameter β = αφ−1 . Solution: We know that Q(Xn+1 ∈ A|X1 , X2 , . . . , Xn ) = P α(A)+ n 1 δXi (A) . α(X )+n We need to show that Q(Yn+1 P αφ−1 (B) + n1 δYi (B) ∈ B|Y1, Y2 , . . . , Yn ) = . αφ−1(Y) + n (16.7) Let B ∈ G. Then Q(Yn+1 ∈ B|X1 , X2 , . . . , Xn ) = Q(φ(Xn+1 ∈ B|X1 , . . . , Xn )) = Q(Xn+1 ∈ φ−1 (B)|X1 , . . . , Xn ) P α(φ−1(B)) + n1 δXi (φ−1 (B)) = α(X ) + n P −1 αφ (B) + n1 δYi (B) = αφ−1 (Y) + n because x ∈ φ−1 (B) ↔ y = φ(x) ∈ B. Since this conditional probability is a function of Y1 , Y2 , . . . , Yn , (16.7) is true. 
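As an aside to the solution just given, here is a simulation sketch (not part of the notes) that ties together the urn description (12.6), the discreteness of the limiting P established in Section 15, and the mapping property proved above. The choices α = 2 · N(0, 1), n = 2000 and φ(x) = I(x ≤ 0) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def polya_sequence(n, c=2.0):
    """Sketch of a Polya sequence on R with parameter alpha = c * N(0,1):
    X_{i+1} | X_1,...,X_i  ~  (alpha + sum_{j<=i} delta_{X_j}) / (c + i)."""
    xs = []
    for i in range(n):
        if rng.random() < c / (c + i):     # with prob alpha(X)/(alpha(X)+i): fresh draw from alpha/alpha(X)
            xs.append(rng.normal())
        else:                              # otherwise repeat a past value chosen uniformly
            xs.append(xs[rng.integers(i)])
    return np.array(xs)

n = 2000
x = polya_sequence(n)
print("distinct values among", n, "draws:", len(np.unique(x)))

# Map through phi(x) = I(x <= 0): Y_i = phi(X_i) is a Polya sequence on {0,1}
# with parameter beta = alpha phi^{-1} = (c/2, c/2) = (1, 1) here.
y = (x <= 0).astype(int)
print("limiting frequency of A = (-inf, 0]:", round(y.mean(), 3))
```

The number of distinct values is far smaller than n (a fresh value appears with probability c/(c + i), so on average it grows only like c log n), reflecting the discreteness of P, and the long-run frequency of A changes from run to run: it is a single draw of P(A) ~ B(α(A), α(Ac)).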
In the above we have used the well known answer to the question of how to obtain Q(W ∈ A|Y ) from Q(W ∈ A|X, Y ): The conditional probability Q(W ∈ A|Y ) is a function which is σ(Y )measurable and satisfies Z Z Q(W ∈ A|Y )I(Y ∈ C)dQ = I(W ∈ A)I(Y ∈ C)dQ 42 (16.8) for all C ∈ σ(Y ). We know that Q(W ∈ A|X, Y ) satisfies Z Q(W ∈ A|X, Y )I(X ∈ D)I(Y ∈ C)dQ = Q(W ∈ A, X ∈ D, Y ∈ C) (16.9) for all appropriate C, D, and in particular when D is the whole space. This means that Q(W ∈ A|X, Y ) satisfies (16.8), but it may not be σ(Y )measurable. If if was σ(Y ) measurable (i.e. a measurable function of Y ) then Q(W ∈ A|Y ) = Q(W ∈ A|X, Y ). If Q(W ∈ A|X, Y ) is not σ(Y )-measurable we put equations (16.8) and (16.9) together (with D equal to the whole space) and get Z Q(W ∈ A, Y ∈ C) = Q(W ∈ A|X, Y )I(Y ∈ C)dQ Z = Q(W ∈ A|Y )I(Y ∈ C)dQ for all C ∈ σ(Y ). Thus, in this case, Q(W ∈ A|Y ) = E(Q(W ∈ A|X, Y )|Y ). 17 Support of Dirichlet measures In the last class, we proved that in a Pólya sequence with parameter α, the distribution of the random probability measure P , which is the limit of the empirical measure Fn , is the Dirichlet distribution Dα . We also showed that Dα ({P : P is discrete}) = 1. So, what is the support of Dα ? The support of a probability measure is the smallest closed set with probability 1. The set {P : P is discrete } is not a closed set. Its closure is P, the set of all probability measures. So we must do some more work to find the support of Dα . We will do this later. 43 18 Another way to introduce random probability measures We will look for another way to introduce probability measures on P. Consider X = (0, 1]. Every point in X can be written as x → z = (z1 , z2 , . . . ) in a unique way, where 1 if z1 = 0, if .. . zn = [2n x] 1 2 <x≤1 x≤ 1 2 mod 2, n = 1, 2 . . . As an example, when x = 7/8 you can check z1 (x) = 1, z2 (x) = 1. When x = 1/2, one can take z1 (x) = 0, z2 (x) = 1, z3 (x) = 1, . . . or as z1 (x) = 1, z2 (x) = 0, z3 (x) = 0, . . . . To make things unique, we will use only the former representation, namely that we will allow a recurring 1 but not a recurring 0. This means that we have a transformation from (0, 1] into Z ∞ where Z = {0, 1}. The transformation is 1 ↔ 1 if we remove sequences in Z ∞ with recurring 0’s. Thus to define a pm on (0, 1] it is enough to define a pm on Z ∞. How to define a pm on Z ∞ ? The singletons of Z n , which are zn = (z1 , . . . , zn ), zi = 0, 1, i = 1, . . . , n, can also be viewed as rectangle sets in Z ∞ . We will therefore allow zn to denote both the first n coordinates of z and also all sequences in Z ∞ whose first n coordinates agree with zn . The product σ-field in Z ∞ is the smallest 44 σ-field containing all these rectangle sets. A probability function p defined on rectangle sets satisfies the consistency condition if p(zn ) = p(zn 1) + p(zn 0) for all zn ∈ Z n , n = 1, 2, . . . , where zn 1 means that the (n + 1)th coordinate is 1. For instance, p(1) = p(10) + p(11). Such a consistent probability function defines a unique probability measure on Z ∞ by the Kolmogorov consistency theorem. One can also use the idea of compact fields to arrive at this conclusion. Here is a way to define a consistent probability function on rectangle sets. Let ue = p(1), 1 − ue = p(0), u1 = p(00) p(0) . . . u xn = p(xn 1) p(xn ) p(11) , u0 p(1) = p(01) ,1 p(0) − u1 = p(10) ,1 p(1) − u0 = for xn ∈ Z n , n = e, 1, 2, . . . . Conversely, p(1) = ue p(0) = 1 − ue p(11) = u0 ue p(10) = u1 (1 − ue ) .. . 
n−1 Y (ux(r−1) )xr p(xn ) = r=1 for xn ∈ Z where we define a = a, a0 = 1 − a and x0 = e. n 1 Let U ∞ = Ue × U1 × · · · × Un × . . . , where Ue = {ue }, Un = {uxn : xn ∈ Z n }, n = 1, 2, . . . . Then there is a one to one transformation between all probability measures P = P(Z ∞ ) and U. Thus a pm on U ∞ gives rise to a pm on P. Since U ∞ is a product space, the first task will be easier. 45 Let λ be the distribution of u taking values in U ∞ . This gives rise to a distribution ν of pm P taking values in P(Z ∞ ). Let z1 , z2 , . . . , zn be random variables in Z ∞ , whose distribution given P is i.i.d P , and P has the distri- bution ν. We want the conditional distribution ν ∗ of P given z1 , z2 , . . . , zn . This ν ∗ will arise from the conditional distribution λ∗ of u given z1 , z2 , . . . , zn . This can be obtained from looking at L(u|z1k , z2k , . . . , znk ), where zik is the first k coordinate of zi , i = 1, . . . , n. We will do this in the next class. 46 Feb 12, 2008 We begin with a quick review of what we did in the previous class. Let X = (0, 1]. There is a one-to-one transformation x ↔ z from X to Z ∞ = {0, 1}∞, the space of sequences of 0’s and 1’s (minus a few points like se- quences with recurring 0’s). The space of probability measures P ∗ = P(X ) on X is in a one-to-one relation to the space of probability measures P = P(Z ∞ ) on Z ∞ . Thus our nonparametric Bayes problem is the problem of finding probability measures ν on P. Let us explore the nature of a element P ∈ P. Such a P is completely defined by the probabilities it gives to rectangle sets in Z ∞ , i.e. by a specification of {P (zn ), zn ∈ Z n , n = 1, 2, . . . } (where zn = (z1 , . . . , zn ) stands for both a point in Z n and the corresponding cylinder set in Z ∞ ) satisfying the consistency conditions P (zn ) = P (zn 1) + P (zn 0), n = 1, 2, . . . We can now define constants u = (ue , u1 , . . . , un , . . . ) with ue = ue , u1 = (u1 , u0), . . . un = (uxn , xn ∈ Z n ), . . . (for convenience we will set x0 = e, Z 0 = {e}, with e standing for “empty”) as follows: ue = P (1) P (11) u1 = = P (z2 = 1|z1 = 1) P (1) P (01) = P (z2 = 1|z1 = 0) u0 = P (0) .. . uxn = P (zn+1 = 1|zn = xn ) for xn ∈ Z n , n = 1, 2, . . . 47 where one can choose to define the ratio 0 0 as equal to 1. Conversely, given constants u = ({uxn , xn ∈ Z n }, n = 0, 1, . . . ) in [0, 1], one can recover the probability measure P by its probability on cylinder sets as follows: P (xn ) = uxe 1 uxx21 . . . uxxnn−1 , n = 1, 2, . . . (where a1 = a, a0 = (1 − a)). Thus there is a one-one-map from P to U = n U⌉ × U1 × U2 × · · · = [0, 1]∞ . Note also that Un = [0, 1]2 , n = 0, 1, . . . . It will be easy to define a probability measure on U since it is just a product space. Each such probability measure λ will correspond to a probability measure ν on P. This will complete our program to introduce probability measures on P for nonparametric Bayesian analysis. We will still have questions on posterior distributions, Bayes estimates, etc. Thus we have the following picture. Basic Space Space of 0’s & 1’s X = (0, 1] ↔ Z ∞ = {0, 1}∞ x ↔ z = (z1 , z2 , . . . ) Space of pm’s P ∗ = P(X ) P∗ Space of pm’s Space of Cond. Prob’s ↔ P = P(Z ∞ ) U = Ue × U1 × · · · = [0, 1]∞ ↔ P ↔ u = (ue , u1 , . . . ) P (xn ) x n−1 ×j=0 uxjj−1 ν∗ ↔ ν ↔ λ Dα ↔ Dα ↔ λα Let λ be a pm on U corresponding to a pm ν on P. Let the prior distribution of the unknown pm P be ν i.e. let P ∼ ν. This P corresponds 48 to a rv U on U which has distribution λ. Conditional on P (i.e. 
on U), let the observations Y 1 , Y 2 , . . . , Y n be i.i.d P . We will now find the conditional distribution of P given Y 1 , Y 2 , . . . , Y n . We will do this by first finding the conditional distribution of (Ue , U1 , . . . , Um ) given Y 1 , Y 2 , . . . , Y n for all m. By the Kolmogorov’s consistency theorem, this leads us to the conditional distribution of the sequence U given Y 1 , Y 2, . . . , Y n , which is the posterior distribution. To find the conditional distribution of (Ue , U1 , . . . , Um ) given (Y 1 , Y 2, . . . , Y n ) we will find the conditional distribution of (Ue , U1 , . . . , Um ) given (Yki = (Y1i , . . . , Yki ), 1 ≤ i ≤ n) for each k. This can be done with the joint distribution of just a finite number of random variables. We then will first allow m → ∞ and then let k → ∞ to obtain the posterior distribution. We now proceed to do the calculations, where will assume that (Ue , U1 , . . . , Um ) has a joint pdf q(ue , u1 , . . . , um ). With Q standing for the joint distributions of all the random variables, we have Q(Yn = yn |P ) = Q(Yn = yn |U) = uye1 uyy21 · · · uyynn−1 where, as before, we use the notation a1 = a and a0 = 1 − a. 49 Let N(e) = n n X N(1) = I(Y1i = 1) 1 N(0) = n X I(Y10 , i = 0) N(11) = n X I(Y2i = (1, 1)), etc. 1 N(yr ) = 1 n X 1 I(Yri = yr ) for yr ∈ Z r , r = 0, 1, . . . The function N(yr ) is just the frequency of yr in the sample. With this notation, Q(Yk i = yki , 1 ≤ i ≤ n|Ue , U1 , . . . , Um ) Y = uxNr(xr 1) (1 − uxr )N (xr 0) , xr ∈Z r ,0≤r≤k−1 where we have taken m to be larger than k. The joint distribution of ((Yki , 1 ≤ i ≤ n), (Uj , 0 ≤ j ≤ m)) is Q(((Yki = yki , 1 ≤ i ≤ n), (Ur , 0 ≤ r ≤ m))) Y = uxNr(xr 1) (1 − uxr )N (xr 0) q((ur , 0 ≤ r ≤ m)). xr ∈Z r ,0≤r≤k−1 The conditional distribution of (Ur , 0 ≤ r ≤ m) given (Yki , 1 ≤ i ≤ n) is proportional to xr Y ∈Z r ,0≤r≤k−1 (xr 1) uN (1 − uxr )N (xr 0) q((ur , 0 ≤ r ≤ m)). xr Let us consider three classes of probability measures on U: 50 (18.10) • Λ1 = {λ : Ue , U1 , U2 , U3 . . . are independent under λ}. • Λ2 = {λ : Uxn , xn ∈ Z n , n = 0, 1, . . . are all independent under λ}. • Λ3 = {λ : Uxn ≡ Vn , n = 0, 1, . . . , and Ve , V1 , . . . independent under λ}. It is easy to see that Λ2 ⊂ Λ1 and Λ3 ⊂ Λ1 . We will show that Λ1 is a conjugate family. The proofs that λ2 and Λ3 are conjugate families are similar. The density of {Uxj , xj ∈ Z j , 0 ≤ j ≤ m} under a typical λ ∈ Λ1 is a product of densities of the form given below j q((uxj , xj ∈ Z j , 0 ≤ j ≤ m)) = ×m 0 qj ((u)xj , xj ∈ Z ). (18.11) The conditional density of (Ue , . . . , Um ) given ((Yk1 , . . . , Ykn ), the sample restricted to the first k coordinates is proportional to the joint density given in (18.10) Y xr ∈Z r ,0≤r≤k−1 j uxNr(xr 1) (1 − uxr )N (xr 0) ×m 0 qj ((u)xj , xj ∈ Z ). This is a product of densities similar to the prior density given in (18.11). Thus the posterior density is similar in form to the density of a pm in Λ1 . Thus Λ1 is a conjugate family. We will now look at particular λ = λα ∈ Λ2 where α is a non-zero finite measure on Z ∞ . The pm λα is defined as follows: under λα , Uxj are independent (18.12) Uxj ∼ B(α(xj 1), α(xj 0)) (18.13) and 51 as xj varies over Z j , j = 0, 1, . . . As before, the conditional distribution of {Uxj , xj ∈ Z j , 0 ≤ j ≤ m} given (Yk1 , . . . , Ykn ) is given by Q(Uxj , xj ∈ Z j , 0 ≤ j ≤ m|Yk1 = yk1 , . . . , Ykn = ykn ) Y Y r 1)−1 = uxNr(xr 1) (1 − uxr )N (xr 0) uα(x (1 − uxr )α(xr 0)−1 . 
xr xr ∈Z r ,0≤r≤k−1 xr ∈Z r ,0≤r≤m Under this posterior distribution, the random variables {Uxj , xj ∈ Z j , 0 ≤ j ≤ k − 1} are independent and uxr ∼ B(α(xr 1) + N(xr 1), α(xr 0) + N(xr 0)), xr ∈ Z r , 0 ≤ r ≤ k − 1. This means that U has the distribution λα+N (·) . We will now identify λα defined in (18.12) and (18.13) with the Dirichlet pm Dα on P(Z ∞ ). Note that P (1) = Ue . Thus P (1) ∼ B(α(1), α(0)) from the distribution of Ue . One can use independent Gamma random variables to model the joint distribution of (Ue , (U1 , U0 )) as follows. Let Z(11), Z(10), Z(01), Z(00) be independent Gamma random variables with parameters α(11), α(10), α(01), α(00), respectively. Define Z(1) = Z(11) + Z(10), Z(0) = Z(01) + Z(00), Z(e) = Z(1) + Z(0). Then Ue = Z(1)/Z(e), U1 = Z(11)/Z(1), U0 = Z(01)/Z(0) are independent random variables with the Beta distributions specified in (18.12) and (18.13) (proved by a repeated use Basu’s Theorem). Thus the probabilities on the partition of Z ∞ defined by the two-dimensional cylinder sets ((11), (10), (01), (00)) given by (P (11), P (10), P (01), P (00)) = (Ue1 U11 , Ue1 U10 , Ue0 U01 , Ue0 U00 ) 52 is equal to (Z(11), Z(10), Z(01), Z(00))/Z(e) which has a Dirichlet distribution with parameters (α(11), α(10), α(01), α(00)). Similarly, the probabilities (P (xm ), xm ∈ Z m ) are distributed as D(α(xm ), xm ∈ Z m ) for m = 1, 2, . . . This is the Dirichlet pm on P(Z ∞ ). This is the third definition of the Dirichlet pm on P. Since the posterior distribution expressed as the distribution of U was λα+N (·) , it follows that the posterior distribution of P in Dα+N (·) . 53 February 14, 2008 19 Examples and properties of pm’s on Z ∞ Let δ be the fair coin tossing probability measure on Z ∞ = {0, 1}∞ , i.e. Under δ, (z1 , z2 , . . . ) are i.i.d with δ(zi = 1) = 1/2, δ(zi = 0) = 1/2, i = 1, 2, . . . This corresponds to the Lebesgue measure δ ∗ on the space X = (0, 1], since δ(z1 = 0) = 1/2 translates to δ ∗ ((0, 1/2]) = 1/2, δ(zi = 1) = 1/2 translates to δ ∗ ((1/2, 1]) = 1/2, etc. Under δ ∗ the measure of an interval with diadic rational end points is the length of that interval. This means that δ ∗ is the Lebesgue measure on (0, 1] or the uniform distribution on (0, 1]. What the a point in U which corresponds to this δ? Notice that ue = P (1) = 12 , P (11) = 14 , u1 = P (11) P (1) = 21 , u0 = 12 , . . . . Thus the point corresponding to δ is u = (1/2, (1/2, 1/2), . . . ). 20 The set of all discrete pm’s P We will give a necessary and sufficient condition for a pm P ∈ P(Z ∞ ) to be discrete. Fix a P ∈ P(Z ∞ ). This can be considered as the distribution of random variables (Y1 , Y2 , . . . ), the coordinates in Z ∞ . It can also be considered as the vector u ∈ U which corresponds to P . Theorem 20.1. The pm P is discrete if and only if P (E) = 1 where E is a 54 countable set defined by ∞ E = ∪∞ m=1 ∩n=m Dn where Dn = {yn = I(uyn−1 )}, n = 1, 2, . . . (20.14) A sufficient condition for P to be discrete is yj X ∈Z j ,1≤j<∞ n−1 yj ×j=1 uyj−1 uyn−1 (1 − uyn−1 ) < ∞. (20.15) A pm ν on P gives measure 1 to the collection of all discrete pm’s if Eν yj X ∈Z j ,1≤j<∞ n−1 yj ×j=1 uyj−1 uyn−1 (1 − uyn−1 ) < ∞. (20.16) The random pm P is discrete with probability 1 under the Dirichlet pm να . Proof: Let y be a point in Z ∞ for which P (y) > 0. This is equivalent to P (y) = uye1 uyy21 uyy32 . . . 
> 0 which implies the following successively weaker conditions: uyynn−1 → 1 as n → ∞ 1 uyynn−1 ≥ for all large n 2 1 yn = I(uyn−1 ≥ ) for all large n 2 Dn occurs for all large n y∈E with Dn and E as defined in the statement of the theorem. Thus {y : P (y) > 0} ⊂ {y ∈ E}. 55 m−1 The set ∩∞ points, since once ym−1 is fixed, m=n Dm contains at most 2 the yk ’s for k ≥ m are uniquely determined. This implies that E is a countable set. Note that P is discrete is equivalent to P (y : P (y) > 0}) = 1 and this implies P (E) = 1. Conversely, since E is countable, P (E) = 1 implies that P is discrete. This completes the proof of (20.14). We will explore the condition P (E) = 1 further. Note that E = {Dnc occurs only finitely often } ∞ X I(Dnc ) < ∞}. = { 1 P c If we want to show that P (E) = 1, it is enough to show that P ( ∞ 1 I(Dn ) < P c ∞) = 1 or show that the stronger condition ∞ 1 P (Dn ) < ∞ holds. Now P (Dnc ) = EP (P (Dnc |yn−1 )) = EP (E(yn = I(uyn−1 ≤ 1/2)|yn−1)) = EP (min(uyn−1 , 1 − uyn−1 )). ≤ 2EP (uyn−1 (1 − uyn−1 )) X yj ×1n−1 uj−1 uyn−1 (1 − uyn−1 ). = 2 yn−1 ∈Z n−1 since u(1 − u) ≤ min(u, 1 − u) ≤ 2u(1 − u). Hence P (E) = 1 if X yj ∈Z j ,1≤j<∞ y j uyn−1 (1 − uyn−1 ) < ∞. ×1n−1 uj−1 56 This establishes the assertion in (20.15). Consider the Dirichlet pm να on P arising from the pm λα on U, under which all the uyn independent with the Beta distributions B(α(yn 1, α(yn 0)). We now verify that the series in (20.16) is finite with probability 1 under να : X Eνα yn−1 ∈Z n−1 ,1≤n<∞ = n−1 ×j=1 X α(yn 1)α(yn 0) α(e) yn−1 ∈Z n−1 ,1≤n<∞ ≤ α(e) yn−1 X α(yj ) α(yn 1)α(yn 0) α(yj−1 ) α(yn−1 )(α(yn−1) + 1) X yn−1 ∈Z n−1 ,1≤n<∞ = n−1 yj ×j=1 uyj−1 uyn−1 (1 − uyn−1 ) ᾱ(yn 1)ᾱ(yn 0) ∈Z n−1 ,1≤n<∞ = α(e)ᾱ(Y 1 6= Y 2 ) < ∞, where ᾱ(·) = α(·) α(e) is the normalized pm of α and Y 1 , Y 2 are i.i.d. with pm ᾱ on Z ∞ . Thus condition (20.16) is verified and P is discrete with probability 1 under να . This completes the proof of the theorem. 21 Absolute continuous pm’s P wrt Lebesgue measure We already saw that the fair coin tossing measure δ corresponds to the Lebesgue measure δ ∗ on (0, 1]. Thus we are interested in finding conditions for P to be absolutely continuous wrt δ. 57 Define fn (x) = P (xn ) = 2n P (xn ) δ(xn ) be the Radon-Nikodym derivative of P wrt δ on Z n , n = 1, 2, . . . . From standard results, we know that fn (x) → f (x), and that P is absolutely R continuous wrt δ if and only if f (x)dδ = 1. In this case, dP = f (x). dδ Recall that we P can also be viewed as u = (ue , u1 , . . . ). Notice that P (A) = P (1)P1(A) + P (0)P0 (A) where Px (·) is the conditional distribution of P on Z2 × Z3 × · conditional on Y1 = x. We can also write this as P = ue Pue (A) + u0e Pu0e (A) where for any yn , Puyn (·) denotes the pm arising from the tail of the sequence u whose initial segment is uyn . It is easy to see that P is absolutely continuous wrt δ if and only if Pue and Pu0e are absolutely continuous wrt δ. This depends only (U2 , U3 , . . . ), and by extension, this is a tail event for the sequence (Ue , U1 , . . . ). Let λ is a p.m. on P. The probabilities of such tail events under λ are 0 or 1. Hence λ({P << δ}) = 0 or We will develop this further in the next class. 58 1. Feb 19, 2008 22 Is the random P absolutely continuous, singular or discrete? Review from the previous class: Basic Space Space of 0’s & 1’s X = (0, 1] ↔ Z ∞ = {0, 1}∞ x ↔ z = (z1 , z2 , . . . ) Space of pm’s P ∗ = P(X ) P∗ Space of pm’s Space of Cond. Prob’s ↔ P = P(Z ∞ ) U = Ue × U1 × · · · = [0, 1]∞ ↔ P ↔ u = (ue , u1 , . . . 
) P (xn ) x n−1 ×j=0 uxjj−1 ν∗ ↔ ν ↔ λ Dα ↔ Dα ↔ λα Conditional on P (i.e. on U), let the observations Y 1 , Y 2 , . . . , Y n be i.i.d P . We have P (Yn = yn ) = uye1 uyy21 · · · uyynn−1 where, as before, we use the notation a1 = a and a0 = 1 − a. We also considered the following three classes of probability measures on U: • Λ1 = {λ : Ue , U1 , U2 , U3 . . . are independent under λ}. 59 • Λ2 = {λ : Uxn , xn ∈ Z n , n = 0, 1, . . . are all independent under λ}. • Λ3 = {λ : Uxn ≡ Vn , n = 0, 1, . . . , and Ve , V1 , . . . are independent under λ}. We also defined a pm λα in Λ2 under which uyn ∼ B(α(yn 1), α(yn 0)). Under this pm λα , {P (yn) = uye1 uyy21 · · · uyynn−1 } ∼ D(α(yn ), yn ∈ Z n ). Thus, λα corresponds to the Dirichlet pm να on P(Z ∞ ). We also showed that να (P is discrete ) = 1. We can view P as u = (ue , u1 , . . . ) and so we can write P = Pu . Again, we can write P (A) = P (1)P1 (A) + P (0)P0(A) = ue Pu1 + u0e Pu0 and more generally, Pu = X uye1 Puy1 y=0,1 = X yn ∈Zn uye1 uyy21 · · · uyynn−1 Puyn where uyn = (uyn , (uyny1 , y1 ∈ Z), (uyn y1 y2 , (y1 , y2) ∈ Z 2 ), . . . ). Let δ be a fair coin tossing measure, and δ(yn ) = 1 , yn 2n ∈ Z n, n = 1, 2, . . . . We know that correspondingδ ∗ in (0, 1] is Lebesgue measure and also that the corresponding uyn = 21 , yn ∈ Z n , n = 1, 2, . . . . We will examine what it means to say that P << δ, i.e.P is absolutely continuous wrt δ. One says that P is absolutely continuous wrt δ if δ(A) = 0 implies P (A) = 0. This is equivalent to saying that P1 and P0 are absolutely continuous 60 wrt δ. Note that P (A) = P (1)P1 (A) + P (0)P0(A) where the conditional distributions P1 and P0 are intuitively clear. Thus P << δ implies that P1 << δ and P0 << δ, since δ(A) = 0 ⇒ P (A) = 0 which implies P1 (A) = 0 and P0 (A) = 0. Trivially, conversely P1 << δ and P0 << δ imply P << δ. Similarly, P << δ if and only if Puyn << δ for all yn ∈ Z n . Thus the event P << δ expressed in term of u depends only (Un , Un+1 , . . . ), for n = 1, 2, . . . , and hence this event is a tail event for the sequence (Ue , U1 , . . . ). Suppose that P ∼ λ, where λ ∈ Λ1 . Then by the Kolmogorov 0-1 law λ(P << δ) = 0 or 1. Note: λ(P singular wrt δ) = 0 or 1 is same with λ(P << δ)0 or 1. Define fn (y) = P (yn ) = 2n P (yn ) = 2n uye1 uyy21 · · · uyynn−1 . δ(yn ) From standard theory, we know that, under δ, fn (y) is a martingale and converges to f (y) with probability 1. We also know that Z P << δ ⇔ {fn }u.i. ⇔ f (y)dδ(y) = 1. p p Now we consider fn (y) which converges to f (y). p If limn Eδ ( fn (y)) = 0, then f (y) = 0 and P ⊥ δ, i.e. P and δ are supported on disjoint sets. p If limn Eδ ( fn (y)) > 0, then P is not singular wrt δ. If we knew a priori p that P is such that P << δ or P is singular wrt δ, then limn Eδ ( fn (y)) > 0 implies that P << δ. 61 We compute v v u n u n uY uY yr Eδ (t 2uyr−1 ) = Eδ (Eδ (t 2uyyrr−1 )|yn−1 ) r=1 r=1 v un−1 q uY y r t = Eδ ( 2uyr−1 Eδ ( 2uyynn−1 |yn−1 )) r=1 v un−1 q uY y 1 1p = Eδ (t 2uyrr−1 ( 2uyn−1 + 2(1 − uyn−1 ))). 2 2 r=1 Define Hn = sup yn−1 ∈Z n−1 1 2 1p 1 2uyn−1 + 2 2 q 2(1 − uyn−1 ). + θ. Then 1√ 1p 2u + 2(1 − u) 2 2 1√ 1√ 1 + 2θ + 1 − 2θ = 2 2 1 2θ 1 1 2θ 1 = (1 + − (2θ)2 + · · · ) + (1 − − (2θ)2 + · · · ) 2 2 8 2 2 8 2 ∼ 1 − cθ if θ ∼ 0. Let u = Thus Hn ≤ 1 − cθn2 where θn = max|uyn−1 − 21 |, and v u n uY Eδ (t 2uyyrr−1 ) ≤ H1 H2 · · · Hn r=1 n Y (1 − cθr2 ) ≤ r=1 → 0 if → X θr2 = ∞ a limit which is > 0 if 62 X θr2 < ∞. One can also look at a special case where uyn ≡ Vn for all yn ∈ Z n , n = 0, 1, . . . 
, in which case, we get the estimate v u n n Y uY yr ∗ ∗ ∗ ∗ 2 t Eδ ( 2uyr−1 ) = H1 H2 · · · Hn = (1 − c(θr−1 )) r=1 r=1 p √ where Hr∗ = 2 2Vr + 2 2(1 − Vr ) and θr∗ = |Vr − 12 |, r = 1, 2, . . . . Example 22.1. Let λ ∈ Λ3 . Then Un = {uyn , yn ∈ Zn } ≡ Vn and V0 , V1 , . . . are independent. Suppose further that Vn ∼ B(an + rn , an + sn ) with |rn | ≤ K < ∞, |sn | ≤ K < ∞ and an → ∞. Then an + rn 2an + rn + sn rn − sn 1 1 ∼ E(Vn − ) = 2 2(2an + rn + sn ) an 1 (an + rn )(an + sn ) ∼ V (Vn ) = 2 (2an + rn + sn ) (2an + rn + sn + 1) an 1 1 E(Vn − )2 ∼ , 2 an E(Vn ) = and X X 1 1 1 E(Vn − )2 < ∞ ⇔ < ∞. (Vn − )2 < ∞ ⇔ 2 2 an P 1 Therefore, when an additional condition < ∞ holds, we obtain λ(P << an X δ) = 1. Example 22.2. Consider λ ∈ Λ3 with Vn , n = 0, 1, . . . i.i.d. B(1, 1) i.e. with a has a uniform distribution. Then X 1 (Vn − )2 = ∞ 2 63 with probability 1 under λ. Therefore, λ(P ⊥ δ) = 1 P is singular. We can actually show that it is singular continuous. 23 Other mappings from R to Z ∞ Originally, we took X to be (0, 1]. How about X = (0, 1) or X is real line? Let B0 = (B0 , B1 ) be a two-set partition of the real line. We partition B0 into two sets B01 ,B00 and partition B1 into two sets B10 ,B11 . We then have the partition B2 = (B11 , B10 , B01 , B00 ) which is a subpartition of B1 . Continuing this way, we will get partitions B1 , B2 , . . . of the real line. Byn 1 ∪ Byn 0 = Byn . Then we can map the real line to Z ∞ as follows: x ∈ y with y1 (x) = 1 if x ∈ B1 , y1 (x) = 0 if x ∈ B0 . y2 (x) = 1 if x ∈ B11 ∪ B01 , y2 (x) = 0 if x ∈ B01 ∪ B00 , etc. As an example, let F be a distribution function on real line. We can choose the the partitions as follows: 1 B0 = (−∞, F −1( )] 2 1 B1 = (F −1( ), ∞] 2 1 B00 = (−∞, F −1 ( )] 4 1 1 B01 = (F −1 ( ), F −1 ( )] 4 2 etc. 64 In this case the fair coin tossing measure δ on Z ∞ corresponds to the distribution F of the real line. From our earlier results, we can now choose prior distributions λ which picks pm’s absolutely continuous wrt F . 65 Feb 21, 2008 24 Support of Dirichlet pm’s Let us go back to our unknown parameter P in the nonparametric problem. Let α be a non-zero finite measure on real line and let Dα be the prior distribution for P . Let D∗ = {P : P is discrete }. We know that Dα (D ∗ ) = 1. Is D ∗ the support of Dα ? Consider N(0, 1). Let Iir be the set all irrationals on R. Then N(0, 1)(Iir ) = 1. Can we say that the Normal distribution sits just on the irrationals? We can remove a countable number of irrationals, and the Normal distribution will give probability to the new set. Is there a smallest set which has probability 1. The answer is no. However, if we define the support of the normal distribution to be the smallest closed set with probability 1, then the set of irrationals is not closed and its closure is the whole real line R. Any closed set strictly included in R will have probability less than 1, and thus the support of N(0, 1) is R. Definition 24.1. Support: Let µ be a pm on R. A set K is called the support of µ, if 1 K is closed. 2 µ(K) = 1. 3 K is the smallest such closed set, i.e. if K ∗ is closed and K ∗ ⊂ K then µ(K ∗ ) < 1. 66 In the above definition one can replace 3 above by either of the conditions below 3a If U is open and U ⊂ K, then µ(U) > 0. 3b If x ∈ K and N(x, ǫ) ⊂ K for some ǫ > 0, then µ(N(x, ǫ)) > 0. (Here N(x, ǫ) = {y : |y − x| < ǫ}). The support of the Poisson distribution with parameter λ > 0 is the set of non-negative integers. Coming back to the Dirichlet pm Dα , we know that Dα (D ∗ ) = 1. 
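This discreteness is easy to see numerically from the binary-tree construction of the last few sections. In the sketch below (with the fair coin tossing measure as ᾱ, total mass c = α(Z∞) = 5, and the depth and seed all chosen purely for illustration) we sample the independent variables u_{y_n} ∼ B(α(y_n 1), α(y_n 0)) only along the path y_n = I(u_{y_{n−1}} ≥ 1/2) and track the mass of the corresponding cylinder sets, i.e. the running product of max(u_{y_r}, 1 − u_{y_r}). By essentially the computation used to verify (20.16), this product converges with probability 1 to a strictly positive limit, namely the mass that P gives to that single point of Z∞.

```python
import numpy as np

rng = np.random.default_rng(0)

c = 5.0       # total mass alpha(Z^infinity); illustrative choice
depth = 50    # how far down the binary tree we go; illustrative choice

mass = 1.0    # running value of P(y_1 ... y_n) along the chosen path
for n in range(depth):
    # for alpha = c * (fair coin), alpha(y_n 1) = alpha(y_n 0) = c / 2^(n+1)
    a = c / 2.0 ** (n + 1)
    u = rng.beta(a, a)           # u_{y_n} ~ B(alpha(y_n 1), alpha(y_n 0))
    mass *= max(u, 1.0 - u)      # always follow the heavier branch
    if (n + 1) % 10 == 0:
        print(f"depth {n + 1:2d}: mass of the chosen cylinder = {mass:.4f}")
```

The printed values settle down after a few levels, exhibiting one atom of P explicitly.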
However, since $D^*$ is not closed under weak convergence, it is not the support of $D_\alpha$. So the question of the support of $D_\alpha$ is still unanswered.

Let us examine open balls in $\mathcal{P}$. We know that a sequence $P_n$ converges weakly to $P$ if and only if
$$\int f\, dP_n \to \int f\, dP$$
for all bounded continuous functions $f$ vanishing outside a compact set of $R$. Consider the open set defined by such an $f$:
$$N(P^*; f, \epsilon) = \Big\{P : \Big|\int f\, dP - \int f\, dP^*\Big| < \epsilon\Big\}.$$
Finite intersections of such sets form a base for the open sets in the weak topology on $\mathcal{P}$. Given $\delta > 0$, we can find $a_0 < \cdots < a_T$ such that $f(x) = 0$ outside $[a_0, a_T]$ and $|f(x) - f(a_i)| < \delta$ for $a_i \le x \le a_{i+1}$, $i = 0, \dots, T-1$. Let $A_i = (a_i, a_{i+1}]$. If $|P(A_i) - P^*(A_i)| < \theta$, $i = 0, \dots, T-1$, then
$$\Big|\int f\, dP - \int f\, dP^*\Big| < Tc\theta + \delta,$$
where $c = \sup_x |f(x)|$, and hence
$$\{P : |P(A_i) - P^*(A_i)| < \theta,\ i = 0, \dots, T-1\} \subset \Big\{P : \Big|\int f\, dP - \int f\, dP^*\Big| < \epsilon\Big\}$$
whenever $Tc\theta + \delta < \epsilon$. Similarly, we can find a set of the form $\{P : |P(A_i) - P^*(A_i)| < \theta,\ i = 0, \dots, T-1\}$ contained in any given open set defined by a finite intersection of sets of the form $N(P^*; f, \epsilon)$.

Let $\bar\alpha(A) = \alpha(A)/\alpha(\mathcal{X})$, so that $\alpha(A) = \bar\alpha(A)\alpha(\mathcal{X})$. Let $M$ be the support of $\bar\alpha$ and consider $M^*$ defined by
$$M^* = \{P : \text{support of } P \subset M\} = \{P : P(M) = 1\}.$$

Theorem 24.1. The support of $D_\alpha$ is $M^*$.

Proof. Let $P_n \in M^*$ and let $P_n \to P$ weakly. Since $M$ is closed, $P(M) \ge \limsup_n P_n(M) = 1$, which implies that $P \in M^*$. Thus $M^*$ is closed.

We will show that $D_\alpha(M^*) = 1$. Under $D_\alpha$, $P(M) \sim B(\alpha(M), \alpha(M^c))$. Since $M$ is the support of $\bar\alpha$, we have $\bar\alpha(M) = 1$, so $\alpha(M) = \alpha(\mathcal{X})$ and $\alpha(M^c) = 0$. Thus $P(M) \sim B(\alpha(M), 0)$, which means that $P(M) = 1$ with probability 1 under $D_\alpha$, i.e. $D_\alpha\{P : P(M) = 1\} = D_\alpha(M^*) = 1$.

We will now show that $M^*$ is the smallest such closed set in $\mathcal{P}$. Let $P^* \in M^*$ and consider a basic neighborhood $N(P^*; f, \epsilon)$ of $P^*$, where $f$ is a bounded continuous function vanishing outside a bounded set in $R$. By the approximation above there exist $\theta > 0$, $\delta > 0$ and intervals $A_0, \dots, A_{T-1}$ such that
$$\{P : |P(A_i) - P^*(A_i)| \le \theta,\ i = 0, \dots, T-1\} \qquad (24.17)$$
is a subset of $\{P : |\int f\, dP - \int f\, dP^*| < \epsilon\}$. Since both $P$ (with $D_\alpha$-probability 1) and $P^*$ give mass 1 to $M$, we can write the set in (24.17) as
$$\{P : |P(A_i \cap M) - P^*(A_i \cap M)| < \theta,\ i = 0, \dots, T-1\}. \qquad (24.18)$$
Let $A_{T+1} = [a_0, a_T]^c$ and let $I = \{i : \alpha(A_i \cap M) > 0\}$, where $i$ ranges over $0, \dots, T-1$ and $T+1$. The joint distribution of $(P(A_i \cap M), i \in I)$ under $D_\alpha$ is a finite dimensional Dirichlet with all parameters positive. Such a distribution gives positive probability to every open set in the simplex, and in particular to a set of the form (24.18). Hence $D_\alpha(N(P^*; f, \epsilon)) > 0$. This proves that $M^*$ is the smallest such closed set.

25 Bayes estimates

Let $\theta$ be the unknown state of nature. A standard statistical problem is to estimate some desired function $g(\theta)$ of $\theta$. Assume that $\theta$ has a prior distribution $\pi$. If I claim that $\hat g$ is an estimator of $g(\theta)$, I should use some loss (or cost) function $L(g(\theta), \hat g)$, or the average of this loss, to evaluate the performance of the estimate. To do the latter, I should choose $\hat g$ to minimize $\int L(g(\theta), \hat g)\, d\pi(\theta)$.

If $L(g(\theta), \hat g) = (g(\theta) - \hat g)^2$ and $\int g(\theta)^2\, d\pi < \infty$, the minimizer is $\hat g = \int g(\theta)\, d\pi$, the expectation of $g(\theta)$ under $\pi$. This is the Bayes estimator of $g(\theta)$ based just on the prior distribution, before collecting any data. If we use $L(\hat g, g(\theta)) = |\hat g - g(\theta)|$, then the expected loss is minimized when $\hat g$ is a median of the distribution of $g(\theta)$ under $\pi$. For more realistic and complicated loss functions, we can use a computer to do this minimization.

Now, back to our nonparametric Bayes problem where $P \sim D_\alpha$, with $\alpha$ a non-zero finite measure, is our prior distribution for the unknown distribution $P$.

Example 25.1.
Consider the function g(P ) = P ((−∞, t]) = FP (t) (the 69 distribution function at the point t). The Bayes estimator is F̂P (t) = EDα (P ((−∞, t])) α((−∞, t]) = α(X) = ᾱ(t) under square error loss function, since P ((−∞, t]) ∼ B(α((−∞, t]), α((t, ∞))). Here we do not need to check the second moment condition since Fp (t) is a bounded function. Suppose that X1 , . . . , Xn |P are i.i.d P and P ∼ Dα . What is the Bayes estimate of FP (t)? It is just the expectation of FP (t) under the distribution of P given X1 . . . . , Xn , the posterior distribution. We already know that P |X1, . . . , Xn ∼ Dα+Pn1 δXi = Dα+nFn . Thus α((−∞, t]) + nFn (t) α(X) + n α(X)ᾱ((−∞, t]) + nFn (t) = α(X) + n F̂P (t) = R Question: What is the Bayes estimator of g(P ) = h(x)dP ? R R If EDα (( h(x)dP )2 ) < ∞, it is the expectation EDα ( h(x)dP ). Do we have a simple expression for this Bayes estimator? But before that, do we know that g(P ) is well defined on the whole space? 70 Feb 26, 2008 Let us look at the definition of a random variable. Usually, we say that it is a measurable mapping X from a probability space (Ω; A, Q) to the real line R. Note that the function f (x) = 1 x from (R, B, P ) to R is not well defined at x = 0, but f (x) will be considered as a random variable if P ({x}) = 0. Thus random variables need to be well defined on sets of probability 1 under the underlying probability measure. We can then talk about expectations, distributions, etc. of such random variables. We can also allow values −∞ and ∞ for a random variable X as long as the probability that X is finite is 1 under P . How do we define the expectation of X, or more generally of g(X) where g is a measurable function. Denote the distribution of X by P . This exR pectation E(g(X)) is defined to be g(x)dP . How do we define the integral of g? If g is an indicator function, more specifically if g(x) = I(x, A), then we define E(g) = P (A). Consider a non-negative measurable function g(x). Define gn (x) = i , 2n 0, i 2n ≤ g(x) < (i+1) ,i 2n = 0, 1, . . . n2n g(x) ≥ n Then gn (x) ≤ g(x) for all n and gn (x) → g(x) as n → ∞, for all x ∈ {x : g(x) < ∞}. Further more n n2 X i i (i + 1) E(gn (x)) = P ({ n ≤ x < }) n 2 2 2n 1 increases with n and its limit (finite or infinite) will be the definition of E(g(X)). Thus for any non-negative measurable function g(x) we can find a sequence of simple functions which increase to it and the integral of g is 71 the limit of the integrals of these simple functions. In this sense the integral of a non-negative random variable always exists. However, we say that g is integrable only when the integral is finite. If g is not non-negative, let g = g + − g − where g + , g − are the positive and negative parts of g. We say that g is integrable if both g + and g − are integrable. If only one g + and g − is integrable, then we say that the integral of g exists and it may be +∞ or −∞. Thus E(g) is integrable if and only if both g − and g + have finite expectations which amounts to saying that |g| is integrable. Thus we see the following statement in text books: |E(g(X))| < ∞ ⇔ E(|g(X)|) < ∞. Let P be the space of probability measures on R. Remember that we endowed P with a σ-field. We want to consider rv’s taking values on this space; they will be measurable functions from some probability space into P. In particular, we would like to consider functions like Z Z φg (P ) = φ(P ) = g(x)dP (x) = lim gn (x)dP (x) where gn are simple functions converging to g. Clearly, the function φg (P ) is not defined for all P ’s. 
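For instance, with $g(x) = x$ and the discrete pm $P = \sum_{i \ge 1} 2^{-i}\,\delta_{4^i}$, we get $\int g(x)\, dP(x) = \sum_{i \ge 1} 2^{-i}\, 4^i = \sum_{i \ge 1} 2^i = \infty$, so $\phi_g(P)$ fails to be finite at this $P$.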
Thus we can consider φg (P ) as a random variable under a pm ν if ν({P : φg (P ) is finite }) = 1. R Let m(P ) = xdP (x) and consider the Dirichlet pm Dα . Under what conditions of α is m(P ) a well defined random variable? We will answer this question later or make it into a project for this class. It can be shown that Z m(P ) is a genuine rv if and only if log(1 + |x|)dᾱ(x) < ∞ 72 where ᾱ is the normalized pm arising from α. Or, more generally, we can state the following theorem: Theorem 25.1. Let g be a measurable function on R. The random variable R φg (P ) = g(x)dP (x) well defined and finite under Dα , i.e. Dα (P : φg (P ) finite) = 1 if and only if Z log(1 + g(x)2 )dᾱ < ∞ R ⇔ log(1 ∧ |g(x)|)dᾱ < ∞ R ⇔ |g(X)|>100 log|g(x)|dᾱ < ∞ i.e. log|g(x)| is integrable in the tail under ᾱ. If we also wish to require that m(P ) be integrable we will need a stronger condition. In this connection, we can state the following theorem. Theorem 25.2. A necessary and sufficient condition that EDα (φg (P )) is R finite is that |g(x)|dᾱ(x) < ∞. Proof: We will establish this result when g(x) ≥ 0. The general case will follow immediately. We will use the sequence {gn (x)} which approximates g(x) and defined as previously gn (x) = i , 2n 0, i 2n ≤ g(x) < (i+1) ,i 2n = 0, 1, . . . n2n − 1 g(x) ≥ n Suppose that φg (P ) is well defined and EDα (φg (P )) < ∞. Then Z Z φgn (P ) = gn (x)dP (x) ր φg (P ) = g(x)P (x). By the monotone convergence theorem EDα (φgn (P )) ր EDα (φg (P )) 73 which, by our assumption, is finite. However, EDα (φgn (P )) = X i=0n2n −1 = ր Hence R Z Z i i (i + 1) ᾱ( n ≤ g(x) < ) n 2 2 2n gn (x)dᾱ(x) g(x)dᾱ(x) g(x)dᾱ(x) is finite and is equal to EDα (φg (P )). To prove the converse assertion, assume that EDα (φg (P )) < ∞. This automatically means that φg (P ) is well defined with Dα probability 1. Since φgn (P ) ր φg (P ) we have EDα (φgn (P )) ր EDα (φg (P )). We also have R R EDα (φgn (P )) = gn (x)dᾱ(x) ր g(x)dᾱ(x) and this is finite. This com- pletes the proof of this theorem. R Let φg (P ) = gdP , P ∼ Dα . Then under square error loss function, the R Bayes estimator φ̂g for φg (P ) under the prior is EDα (φg (P )) = g(x)dᾱ(x), R provided that g(x)2 dᾱ)(x) < ∞. We can derive this result in another way. Let P ∼ Dα and X1 |P ∼ P . R Notice that we may write φg (P ) = gdP = EQ (g(X1 )|P ) where Q is the joint distribution of P and X1 . We can compute this as follows. Z Z EDα ( gdP ) = EQ ( gdP ) = EQ (EQ (g(X1)|P )) = EQ (g(X1)). Since Q(X1 ∈ A|P ) = P (A) we have Q(X1 ∈ A) = EQ (P (A)) = ᾱ(A). Thus Z EQ (g(X1 )) = g(x)dᾱ 74 provided that R |g|dᾱ < ∞. Thus for the case of the population mean m(P ) = R estimate is m̂ = xdᾱ. R xdP , the Bayes Suppose that we have date X1 , . . . , Xn and X1 , X2 , . . . , Xn |P are i.i.d P , then the Bayes estimate of the population mean is R Z α(X ) xdα + nX̄n m̂X1 ,...,Xn = XdαX1 ¯,...,Xn = α(X ) + n Where ᾱX1 ,...,Xn = α(X )ᾱ+nFn α(X )+n and Fn is the empirical df of X1 , . . . , Xn . For the no-sample case, what is the Bayes risk of this estimator? EDα (m̂ − m(P ))2 = EDα (m2 (P )) − m̂2 . Let P ∼ Dα and X1 , X2 |P be i.i.d. P . Denote their joint distribution by Q. Then Z g(x)dP Z h(x)dP = EQ (g(X1 )h(X2 )|P ). Therefore Z Z EQ ( gdP hdP ) = EQ (EQ (g(X1)h(X2 )|P )) = EQ (g(X1)h(X2 )). Also since Q(X2 ∈ A|X1 ) = α(A) + δx1 (A) , α(X ) + 1 we get EQ ( Z gdP Z hdP ) = EQ (EQ (g(X1 )h(X2 )|P )) = EQ (EQ (g(X1 )h(X2 ))) = EQ (EQ (g(X1 )h(X2 )|X1 )) R hdα + h(X1 ) = EQ (g(X1) ) α(X ) + 1 R R R gdα hdα + ghdα . 
= α(X )(α(X ) + 1) 75 Thus the Bayes risk in estimating m(P ) is EDα (m̂ − m(P ))2 = EDα (m2 (P )) − m̂2 R R Z α(X )( xdα̂)2 + x2 dα̂ − ( xdα̂)2 = α(X ) + 1 R 2 R x dα̂ − ( xdα̂)2 = . α(X ) + 1 Now the posterior Bayes risk, EDα+P δX (m̂X1 ,...,Xn − m(P ))2 , after iid i observations X1 , . . . , Xn is 1 α(X )Vᾱ + ns2n + α(X )n(Eᾱ − X̄n )2 ( ) α(X ) + 1 + n (α(X ) + n)2 R R 2 P where Eᾱ = xdᾱ, Vᾱ = x dᾱ − Eᾱ2 , X̄n = (1/n) n1 Xi , and P s2n = (1/n)( n1 Xi2 ) − X̄n2 . 76 Feb 28, 2008 In the last class, we found the Bayes estimator m̂X1 ,...,Xn for m(P ) = R xdP based on sample X1 , . . . , Xn and with a prior distribution Dα , i.e when • P ∼ Dα . • X1 , X2 , . . . , Xn |P are i.i.d. P . R R Let us now look at the Bayes estimator for V (P ), where V (P ) = (x − xdP )2 dP . One can express V (P ) in another way. Let P ∼ Dα and let X1 , X2 |P ∼ P and let Q be their joint distribution. Note that X1 ∼ ᾱ, and X2 |X1 ∼ α+δX1 . α(X )+1 The alternate expression is Z Z (x − xdP )2 dP Z 1 (x1 − x2 )2 dP (x1 )dP (x2 ) = 2 1 = EQ ((X1 − X2 )2 |P ) 2 V (P ) = 77 provided all the moments exist. We can calculate = = = = = = = 1 V̂ = EDα (V (P )) = EQ (V (P )) = EQ ( EQ ((X1 − X2 )2 |P )) 2 1 1 EQ (X1 − X2 )2 = EQ [(EQ ((X1 − X2 )2 )|X1 )] 2 2 1 EQ [EQ (X22 |X1 ) − 2X1 EQ (X2 |X1 ) + X12 )] 2 R R α(X ) xdᾱ + X1 α(X ) x2 dᾱ + X12 1 EQ [ − 2X1 + X12 ] 2 α(X ) + 1 α(X ) + 1 R 2 R 2 R R R Z α(X ) xdᾱ xdᾱ + x2 dᾱ 1 α(X ) x dᾱ + x dᾱ [ −2 + x2 dᾱ] 2 α(X ) + 1 α(X ) + 1 R R 1 (α(X ) + 1 − 2 + α(X ) + 1) x2 dᾱ − 2α(X )( xdᾱ)2 2 (α(X ) + 1) Z Z α(X ) ( x2 dᾱ − ( xdᾱ)2 ) α(X ) + 1 α(X ) V (ᾱ). α(X ) + 1 The Bayes estimator V̂X1 ,...,Xn based on a sample X1 , . . . , Xn , where P ∼ Dα and X1 , . . . , Xn |P are i.i.d. P is given by α(X )ᾱ + nFn α(X ) + n V( ) α(X ) + n + 1 α(X ) + n R α(X ) + n α(X )2V (ᾱ) + n2 s2n + 2α(X )n( xdᾱ − X̄n )2 = α(X ) + n + 1 (α(X ) + n)2 V̂X1 ,...,Xn = where X̄n and sn are the sample mean and variance. If let α(X ) → 0, then m̂X1 ,...,Xn R α(X ) xdᾱ + nX̄n α(X )→0 = −→ X̄n . α(X ) + n Therefore, sample mean is the limit of the Bayes estimator as α(X ) → 0. Again, V̂X1 ,...,Xn n n 2 1 X −→ (Xi − X̄)2 sn = n+1 n+1 1 α(X )→0 78 which is the best equi-variant estimate for the variance. Now let us look at the median, med(P ) = median of P . A value a is said to be a median of P , if P ((−∞, a]) ≥ 1 1 and P ((a, ∞)) ≥ . 2 2 The collection of all medians of P forms an interval of the form [c, d]. Further more, 1 med(P ) ≥ g ⇔ P ((−∞, g]) ≤ , 2 and 1 med(P ) ≤ g ⇔ P ((−∞, g]) ≥ . 2 ˆ of med(P ) under absolute error loss, based Let the Bayes estimator med on just the prior Dα is the median of med(P ). Then {median of med(P ) under Dα ≥ g} 1 ⇔ Dα (med(P ) ≤ g) ≤ 2 1 1 ⇔ Dα (P (−∞, g] ≥ ) ≤ 2 2 1 1 ⇔ Q(B(α(−∞, g], α(g, ∞)) ≥ ) ≤ 2 2 1 α(−∞, g]) ≤ ⇔ α(X ) 2 ⇔ g ≥ med(ᾱ) and g ≤ med(ᾱ) Thus the median of ᾱ is our Bayes estimator. Let P ∼ Dα and X1 , . . . , Xn |P be i.i.d.P , then Q(Xn |X1 , . . . , Xn−1 ) = 79 α + (n − 1)Fn−1 α(X ) + n − 1 where Fn−1 is the empirical df of X1 , . . . , Xn−1 . Thus α({X1, . . . , Xn }) + (n − 1) α(X ) + n − 1 n−1 . ≥ α(X ) + n − 1 Q(Xn ∈ {X1 , . . . , Xn−1 }|X1 , . . . , Xn−1 ) = There will be an equality in the above if α is non-atomic. Thus in all cases, there is positive probability that Xn will be a repeated observation. We will assume that α is non-atomic and define Bernoulli random variables D1 , D2 , . . . to indicate that observations number 1, 2, . . . are new observations. More precisely, Example 25.2. Let D1 = 1 and 0 if X ∈ {X , . . . 
, X } n 1 n−1 Dn = n = 2, 3, . . . 1 if X 6∈ {X , . . . , X }, n 1 n−1 We will also keep track of the observation number that corresponds to a new observation and keep track of its magnitude as follows. Let τ1 = 1, Y1 = Xτ1 = X1 and τn = max {k : k ≥ τn−1 and Xn 6∈ {X1 , . . . , Xn−1 }}, Yn = Xτn , n = 2, 3, . . . . A simple example will be: {Xn } = {1, 2, 1, 6, 2, 1, 1, . . . } {Dn } = {1, 1, 0, 1, 0, 0, 0, . . . } {τn } = {1, 2, 4, . . . } {Yn } = {1, 2, 6, . . . } Notice that Pn 1 Di is the number of distinct observations among {X1 , . . . , Xn }. 80 Theorem 25.3. Assume that the finite measure α is non-atomic. Then D1 , D2 , . . . are independent, and P (Dn = 1) = α(X ) , n = 1, 2, . . . α(X ) + n − 1 Proof: Notice that Q(Dn = 1|X1 , . . . , Xn−1 ) = Q(Xn 6∈ {X1 , . . . , Xn−1 }|X1 , . . . , Xn−1 ) α(X − {X1 , . . . , Xn−1 }) = α(X ) + n − 1 α(X ) . = α(X ) + n − 1 Since this conditional probability is a constant, it is also equal to Q(Dn = 1|D1 , . . . , Dn−1 ). P (Dn = 1) = Thus D1 , D2 , . . . are independent and α(X ) . α(X )+n−1 When α(X ) = 1, P (Dn = 1) = 1 ,n n = 1, 2, . . . . An example of inde- pendent Bernoulli random variables with such a distribution with arises in a different situation. as described below. Example 25.3. Let X1 , X2 , are i.i.d. with continuous distribution. Let D1 = 1, and 1 Dn = 0 if Xn > max {X1 , . . . , Xn−1 } n = 2, 3, . . . otherwise If X1 , X2 , . . . are the flood levels on a river in successive years, the Bernoulli random variables Dn give the year numbers on which records occur. Assume that X1 , X2 , . . . are i.i.d. and the common distribution is continuous. The probability of a record on year n is P (Xn > max (X1 , . . . , Xn−1 )) is n1 , independent of previous history. This is since all possible vector of ranks of the 81 first n observations are equally likely. The event {Xn > max (X1 , . . . , Xn−1 )} is just the event that the rank of Xn among the first n observations is equal to n. This is therefore equal to 1 n and does not depend the values of the previous observations. 82 March 4, 2008 26 The sequence Y1, Y2, . . . Recalling from our last class, let X1 , X2 , . . . |P be i.i.d P and P ∼ Dα where α is non-atomic. Let Dn = I{Xn 6∈ {X1 , . . . , Xn−1}}, i.e. Dn = 1 if Xn is a new observation, and Dn = 0, if it is equal to a previous observation. Define τ1 = 1 and Y1 = Xτ1 , τn = inf{m, m > τn−1 , Xm 6∈ {X1 , . . . , Xm−1 }} = inf{m, m > τn−1 , Xm 6∈ {Y1 , . . . , Ym−1 }}, and Yn = Yτn . We established that D1 , D2 , . . . are independent and Q(Dn = 1) = α(X ) α(X ) + n − 1 We will now establish the following. Theorem 26.1. Under the assumption that α is non-atomic, the random variables Y1 , Y2, . . . are i.i.d ᾱ. 83 Proof: Note that τn ≥ n and n − 1 ≤ τn−1 ≤ τn − 1. We calculate Q(Yn ∈ A, τn = m|Y1 , . . . , Yn−1) m−1 X = Q(Yn ∈ A, τn = m, τn−1 = r|Y1 , . . . , Yn−1) r=n−1 = m−1 X r=n−1 Q(Yn ∈ A|τn = m, τn−1 = r, Y1 , . . . , Yn−1 )Q(τn = m, τn−1 = r|Y1, . . . , Yn−1 ). ∗ The event {τn = m, τn−1 = r} and also be written as {Xi ∈ Yn−1 ,r+1 ≤ i ≤ ∗ ∗ m − 1, Xm 6∈ Yn−1 , τn−1 = r} where Yn−1 = {Y1, . . . , Yn−1 }. It follows that Q(Yn ∈ A|τn = m, τn−1 = r, X1 , . . . , Xm−1 ) ∗ α(A − Yn−1 ) = α(X ) + m − 1 α(A) , since α is non-atomic = α(X ) + m − 1 = Q(Yn ∈ A|τn = m, τn−1 = m, Y1 , . . . , Yn−1) since the previous expression depends only on m. Substituting this in the previous calculation, we get Q(Yn ∈ A, τn = m|Y1 , . . . , Yn−1 ) m−1 X α(A) = Q(τn = m, τn−1 = r|Y1 , . . . , Yn−1 ) α(X ) + m − 1 r=n−1 and = Q(Yn ∈ A|Y1 , . . . 
, Yn−1) ∞ m−1 X X α(A) α(X ) + m − 1 m=n r=n−1 (26.19) Q(τn = m, τn−1 = r|Y1 , . . . , Yn−1 ). 84 Since ∞ X m=n Q(Yn ∈ X |Y1, . . . , Yn−1 ) = 1, we obtain, by letting A = X in (26.19), ∞ m=1 X X α(X ) Q(τn = m, τn−1 = r|Y1 , . . . , Yn−1 ) = 1. α(X ) + m − 1 m=n r=n−1 Substituting this in (26.19) we get Q(Yn ∈ A|Y1 , . . . , Yn−1) = α(A)) = ᾱ(A). α(X ) This completes the proof of theorem 26.1. 27 The sequence D1, D2, . . . We will now study some properties of the sequence D1 , D2 , . . . which indicates occurrences of a new observations in the sequence X1 , X2 , . . . . Recall that we are assuming that α is non-atomic. Notice that ∞ X E(Di ) = ∞ X 1 i=1 α(X ) = ∞. α(X ) + n − 1 This means that ∞ X i=1 Q(Di = 1) = ∞ i. e. P ({Di = 1 infinitely often }) = 1. We can strengthen this in the form of the following theorem: Theorem 27.1. Pn Di → α(X ) with probability 1. log n 1 85 Proof: Recall two standard results. P Theorem 27.2. If Xn < ∞ and an ր ∞, then Pn ai Xi an 1 → 0. Theorem 27.3. Let X1 , X2 , . . . iid and let E(Xi ) = 0, i = 1, 2, . . . . Assume P P that V (Xi ) < ∞. Then Xi < ∞ with probability 1. ) ). From last class, we know that Dn ∼Bin(1, α(Xα(X )+n−1 Dn −E(Dn ) . log n Let Wn = V (Wn ) = and hence P n Then 1 α(X ) α(X ) (1 − ) 2 (log n) α(X ) + n − 1 α(X ) + n − 1 α(X ) 1 ≤ 2 (log n) α(X ) + n − 1 1 ∼ n(log n)2 V (Wn ) < ∞. Thus X and Wn = X Dn − E(Dn ) log n n < ∞. n 1 X (Di − E(Di )) → 0 log n 1 with probability 1. Since n n 1 X 1 X α(X ) E(Di ) = → α(X ). log n 1 log n 1 α(X ) + i − 1 Therefore, Pn 1 Di log n → α(X ) with probability 1. 86 28 Estimation of the parameters of the Dirichlet distribution Consider the standard Bayes nonparametric problem, where the unknown distribution (parameter) P has distribution Dα and, given P , the observations X1 , X2 , . . . are i.i.d. P . In general, one will not be able to estimate the parameter of the prior distribution from such data, but when α is non-atomic, we will show that the parameter α of the prior distribution consistently from X1 , X2 , . . . . We will later amplify on the reasons why this is possible. We already saw in Theorem 27.1 that n 1 X Di → α(X ) with probability 1. log(n) 1 Again, from Theorem 26.1, we also have, from the Glivenko-Cantelli theorem, that n 1X δXi (A) → α(A) with probability 1. n 1 Combining these two results we get n n 1 X 1X d Di · δXi → α with probability 1 log(n) 1 n 1 d (where → means converges weakly) which can also be written as Dα P : P ∞ n n 1X 1 X d Di · δXi → α = 1 log(n) 1 n 1 = 1. (28.20) Let us examine this result further. We have data X1 , . . . , Xn , which given the unknown common distribution P , are i.i.d.P . Further, from our 87 prior knowledge, P has some prior distribution; in this case, a Dirichlet distribution with parameter α. We have established that we can estimate the parameter α of the prior distribution, consistently from the data, when α is non-atomic. This does not happen in standard problems. Here is an example where this can happen Example 28.1. Let the data X1 , X2 , . . . be i.i.d. P . Suppose that there are two possible prior distributions for P , one giving all its mass to {P : P = Unif(a, b), 0 ≤ a < b ≤ 1}, and the other assigning all its mass to {P : P = Unif(a, b), 2 ≤ a < b ≤ 3}. We can tell which is the prior distribution from just the first observation X1 . This phenomenon occurred because the two prior distributions are singular with respect to one another. 
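Here is a small simulation sketch of this consistency; we take ᾱ = N(0, 1), α(X) = 3, n = 20000 and a fixed seed purely for illustration. The sequence X1, X2, ... is generated through the predictive rule Q(Xn | X1, ..., Xn−1) = (α + (n − 1)Fn−1)/(α(X) + n − 1) noted earlier: a new value is drawn from ᾱ with probability α(X)/(α(X) + n − 1), and otherwise one of the previous observations is repeated. By Theorem 27.1 the number of distinct values divided by log n estimates α(X), and by Theorem 26.1 the distinct values Y1, Y2, ... are i.i.d. ᾱ, so their empirical distribution estimates ᾱ; multiplying the two recovers α as in (28.20).

```python
import numpy as np

rng = np.random.default_rng(1)

c = 3.0      # alpha(X), the total mass of alpha; illustrative choice
n = 20000    # number of observations X_1, ..., X_n; illustrative choice

xs = [rng.normal()]          # X_1 ~ alpha-bar = N(0, 1); it is also Y_1
ys = [xs[0]]                 # the distinct observations Y_1, Y_2, ...
for i in range(2, n + 1):
    # predictive rule: new draw from alpha-bar w.p. alpha(X)/(alpha(X)+i-1),
    # otherwise repeat one of X_1, ..., X_{i-1} chosen uniformly
    if rng.random() < c / (c + i - 1):
        x = rng.normal()
        ys.append(x)                  # D_i = 1: a new (distinct) observation
    else:
        x = xs[rng.integers(i - 1)]   # D_i = 0: a repeated observation
    xs.append(x)

ys = np.array(ys)
print("estimate of alpha(X)     :", len(ys) / np.log(n))   # Theorem 27.1: close to c
print("mean, var of distinct Y's:", ys.mean(), ys.var())   # Theorem 26.1: Y's are i.i.d. alpha-bar
```

Since the rate is only logarithmic, the estimate of α(X) is still somewhat rough even at this sample size, while the empirical moments of the distinct values are already close to those of ᾱ.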
In the same way, for two dirichlet priors, Dα and Dα′ , where α and α′ are non-atomic and α 6= α′ , the sample data can consistently tell which of the two priors is the correct Pn Pn 1 1 prior. In fact, the limit of log(n) 1 Di · n 1 δXi is correct α. Moreover, let n Aα = {P : P ∞ n 1X 1 X d Di · δXi → α = 1. log(n) 1 n 1 Then the subsets Aα and Aα′ of P are disjoint, and Dα (Aα ) = 1, Dα′ (Aα′ ) = 1. This clearly shows that Dα and Dα′ are singular with respect to each other in this case. For later discussion, we will use the following result, which is being assigned as a homework problem. 88 Let A ∈ X . Then (A, Ac ) is a partition of X . Any probability measure P on X can be expressed as: P (B ∩ Ac ) P (B ∩ A) + P (Ac ) P (A) P (Ac ) c = P (A)PA (B) + P (A )PAc (B) P (B) = P (A) where PA and PAc are the restrictions of the pm P to A and Ac . Theorem 28.1. If P ∼ Dα then (P (A), P (Ac )),PA and PAc are independent and P (A) ∼ Beta(α(A), α(Ac )), PA ∼ DαA , PAc ∼ DαAc . Here αA (B) = α(A ∩ B), whereas PA (B) = P (A∩B) ; P (A) we define restricted measure and restricted probability measures differently. Let α be a finite measure. Let M be the set of points where α puts positive mass. We always have the decomposition α = α1 + α2 where α1 is non-atomic (continuous but may not have a density) and α2 is discrete or singular with respect to Lebesgue measure. Clearly α1 (M) = 0. Taking M to be A in Theorem (28.1) and writing P = P (M)PM + P (M c )PM c we have PM ∼ DαM = Dα2 and PM c ∼ DαM c = Dα1 . Let X1 , X2 , . . . given P be iid P . Let X1M , X2M , . . . be the elements of c c X1 , X2 , . . . which fall in M. Define X1M , X2M , . . . similarly. Then X1M , X2M , . . . c c given P are i.i.d. PM and X1M , X2M , . . . given P are i.i.d. PM c . If α′ is another finite measure, we can also look at M ′ the set of points of positive mass under α′ and consider the decomposition α′ = α1′ + α2′ where α1′ is the non-atomic part and α2′ is the discrete part. 89 We can state the following result. Theorem 28.2. 1 If M 6= M ′ , then Dα and Dα′ are singular with respect to each other, P since for an x ∈ M ′ ∩ M c , n1 n1 I(Xi = x) → P ({x}), and P ({x}) = 0 under Dα and P ({x}) > 0 under Dα′ . 2 If M = M ′ and α1 6= α1′ , then Dα and Dα′ are singular with respect to each other, because Dalpha and Dα′ are singular with respect to each c c other arguing as before using the sequence X1M , X2M , . . . . 3 If M = M ′ , α1 = α1′ and α2 6= α2′ , one can take M to be {1, 2, . . . }. This is the case of looking at two Dirichlet distributions on P = {(p1 , p2 , . . . ) : P 0 ≤ pi ≤ 1, pi = 1}. These Dirichlet measures can be expressed in terms of countable number of independent Gamma random variables. From Kakutani’s theorem it will follow that Dα and Dα′ are either absolutely continuous with respect to each other or singular. Necessary and sufficient conditions can be given for absolute continuity and singularity. 29 Other ways to introduce prior distributions on P Consider a distribution function F (t). It is a right-continuous non-decreasing function and we want a probability measure on the space of such functions, i.e. we want a random function F (t). We will assume, for convenience, that X = [0, ∞). 90 Consider a right continuous non-decreasing stochastic process {X(t) : t ≥ 0 with X(t) ր X(∞) < ∞}. Define F (t) = X(t) . X(∞) Then F (t) is a random distribution function, and its distribution can be used as a prior distribution in a nonparametric problem. Here is an example of such a stochastic process. 
Example 29.1. Let 0 < t1 < · · · < t∞ < ∞, and assume that the increments {X(t1 ), X(t2 ) − X(t1 ), . . . , X(tk ) − X(tk−1 )} are non-negative and independent. In particular assume that these increments have independent Gamma distributions with shape parameters {α(t1 ), α(t2 )−α(t1 ), . . . , α(tk )− α(tk−1)},where α is a non-decreasing function with α(∞) < ∞. Such a stochastic process exists and is called a Gamma process. The function α can also be considered as a finite measure on [0, ∞). It is clear that the distribution of (F (t1 ), . . . , F (tk ) − F (tk−1 )) is D(α(t1 ), α(t2 ) − α(t1 ), . . . , α(tk ) − α(tk−1)), in other words, the distribution of the pm P derived from F is our Dirichlet distribution Dα . This is therefore another definition of Dirichlet Distribution. Another property of the Gamma process is that it is a pure jump process. This implies that F (and hence P ) is discrete with probability 1 under Dα . We can then ask for the distribution of the jumps of P . We cannot talk about the first jump of P . However, we can talk about the largest jump 91 p∗1 , the second largest jump p∗2 , etc. and also the locations of those jumps X1∗ , X2∗ . . . . In that case, P = ∞ X p∗n δXn∗ . 1 The exact distribution of (p∗1 , p∗2 , . . . ) has been obtained in Klass and Ferguson, and it is independent of X1∗ , X2∗ , . . . which are i.i.d. ᾱ. This representation is different from our representation to be given in the next class. However, when our probability masses (p1 , p2 , . . . ) are rearranged in increasing order we get the distribution of (p∗1 , p∗2 , . . . ). 30 More on the estimation of the median m(P ) We already discussed the estimation of the median m(P ) = median of (P ). The median is any number x satisfying P ((−∞, x]) ≥ 1 , P ([x, ∞)) 2 ≥ 1 . 2 Also, {P : m(P ) ≥ x} = {P : P ((−∞, x]) ≤ 21 }. We will now obtain the distribution of m(P ) rather than the median of this distribution, which is the Bayes estimate of m(P ) under the absolute deviation loss function. Under the prior distribution Dα 1 Dα ({m(P ) ≥ x}) = Dα ({P (−∞, x]) ≤ }) 2 Z 1 α(x)−1 α(X )−α(x)−1 2 u (1 − u) du. = B(α(x), α(X ) − α(x)) 0 The following two graphs give the survival function and density function of m(P ) under Dα for the two cases: • α(x) is k times the uniform distribution on [0, 1], i.e. α(x) = kx. • α(x) = kΦ(x), where Φ(x) is normal distribution function. 92 with k = 10. 2.5 2 pdf. of m(P) survival fn. of m(P) 1.5 1 0.5 0 0 0.2 0.4 0.6 0.8 1 Figure 1: Survival function and pdf of m(P ) under Dirichlet with paramter 10 times uniform. 93 1.4 1.2 1 pdf.of m(P) 0.8 0.6 0.4 survival fn. of m(P) 0.2 0 −3 −2 −1 0 1 2 3 Figure 2: Survival function and pdf of m(P ) under Dirichlet with paramter 10 times the standard Normal. March 18, 2008 31 Under the assumption that α is non-atomic Let us recap some results from previous classes. Let P ∼ Dα and X1 , X2 , . . . , given P , be i.i.d P . Let α be non-atomic and let Y1 .Y2 , . . . be the distinct observations with Y1 = X1 . We also established that 1 P is discrete distribution with probability 1 2 limn Fn = P with probability 1, where the limit is “in distribution”, and Fn is the empirical distribution of X1 , X2 , . . . , Xn 94 We also saw that P ({X1 }) is well defined measurable random variable and P ({X1 }) = lim Fn ({X1 }) = lim F2n ({X1 }) where F2n is the empirical distribution of X2 , X3 , . . . , Xn . To obtain the conditional distribution of F2n ({X1 }) given X1 , we should look at the joint distribution of X2 , X3 , . . . 
given X1 , which is a Polyá sequence with parameter α + δX1 . Hence the conditional distribution of P (A) = lim F2n (A) given X1 is Beta(α(A) + δX1 (A), α(Ac ) + δX1 (Ac )). Putting A = {X1 }, we find that the conditional distribution of P ({X1 }) given X1 is B(1, α(X )) since α is non-atomic. Again, since this does not depend on X1 , the marginal distribution of P ({X1}) is B(1, α(X )) and P ({X1 }) is independent of X1 . Let us now look at Y1 , Y2 , . . . the distinct observations. Then Y1 = X1 and P ({Y1}) is independent of Y1 , and P ({Y1}) ∼ Beta(1, α(X)) What is the distribution of Y2 ? Let {X1′ , X2′ , . . . } be the sequence {X1 , X2 , . . . } after removing all mem- ′ } be the sequence bers of this sequence equal to X1 . Let {X1′ , X2′ , . . . , Xm n X1 , X2 , . . . , Xn after removing all members equal to X1 ’s in it. It is easy to see that mn ∞ as n → ∞. Notice that X1′ = Y2 . Let Gmn be the empirical ′ and nFn ({X1′ }) be the number of X1′ that distribution of X1′ , X2′ , . . . , Xm n occur in X1 , X2 , . . . , Xn . Then nFn ({X1′ }) n(1 − Fn ({X1′ })) Fn ({X1′ }) = (1 − Fn ({X1′ })) Gmn ({X1′ }) = 95 Since X1′ , X2′ , . . . |P are i.i.d. P(X −{X1 }) and P(X −{X1 }) ∼ Dα(X −{X1 }) and α(X −{X1 }) (X − {X1 }) = α(X ), it follows that Gmn (Y2 ) → P ({Y2}) ∼ Beta(1, α(X )) 1 − P ({Y1}) and independent of Y2 from the same arguments as before and using the fact that α is non-atomic. Define pi = P ({Yi}), i = 1, 2, . . . and θ1 , θ2 , . . . as follows θ1 = p1 p2 (1 − p1 ) p3 θ3 = (1 − p1 − p2 ) .. . θ2 = Then, from the above results, θ1 , θ2 , . . . are i.i.d. with Beta(1, α(X )) distributions and independent of Y1 , Y2, . . . . Since Fn is concentrated on Y1 , Y2 , . . . , the distinct observations, and converges to P , we also have P = ∞ X i=1 P ({Yi})δYi = ∞ X pi δYi i=1 with (p1 , p2 , . . . ) independent of (Y1 , Y2 , . . . ). Further more the distribution of (p1 , p2 , . . . ) is given through the iid random variables (θ1 , θ2 , . . . ) and Y1 , Y2 , . . . are iid with common distribution ᾱ. All this was established under the curious assumption that α is nonatomic. 96 32 Sethuraman’s definition of the Dirichlet distribution We will now establish that the above results are true without any assumption on α giving a new and direct definition of the Dirichlet distribution on the space of probability measures on an arbitrary measurable space X . Consider random variables, (on some probability space (Ω, A, Q)), θ = (θ1 , θ2 , . . . ) taking values in [0, 1] which are iid Beta(1, α(X )) and random variables Y = (Y1 , Y2 , . . . ) taking values in X which are iid ᾱ. Assume that θ and Y are independent. Define p1 = θ1 p2 = θ2 (1 − θ1 ) p3 = θ3 (1 − θ1 )(1 − θ2 ) .. . Then P n pn = 1 with probability 1 since 1 − p1 − · · · − pn = Qn 1 (1 − θi ) →0 with probability 1. We can say that the discrete pm defined by p1 , p2 , . . . on the integers has discrete failure rates θ1 , θ2 , . . . which are iid Beta(1, α(X )). Define a random probability measure P on X by P (B) = P ((θ, Y), B) = ∞ X pi δYi (B). (32.21) i=1 Theorem 32.1. The random measure P defined in (32.21) has the Dirichlet distribution Dα , i.e. P ∼ Dα . 97 Proof: 1 Note that the random measure δX1 is a measurable map into P, since {ω : δX1 (ω) (A) < r} = {ω : X1 (ω) 6∈ A} if r < 1 {ω : X1 (ω) ∈ X } if r ≥ 1 and these are measurable sets. Hence P is a measurable map into P and is a random probability measure. 2 It is also clearly a discrete pm for each ω. 
3 To show that P has the Dirichlet distribution Dα we have to show that (P (B1 ), . . . , P (Bk )) ∼ D(α(B1 ), . . . , α(Bk )) for each partition (B1 , . . . , Bk ). Define θ ∗ = (θ2 , θ3 , . . . ) and Y ∗ = (Y2 , Y3 , . . . ). Then (θ ∗ , Y ∗) has the same distribution as (θ, Y) and is independent of θ1 and Y1 , which are independent of each other. We see that P (B) = P ((θ, Y), B) ∞ X = pi δYi (B) 1 = p1 δY1 (B) + ∞ X (pj δYj (B)) j=2 = θ1 δY1 (B) + (1 − θ1 ) ∞ X = θ1 δY1 (B) + (1 − θ1 ) ∞ X 98 2 2 pj δY (B) 1 − θ1 j p′j δYj where pj 1 − θ1 Q θj 1j−1 (1 − θr ) = 1 − θ1 j−1 Y (1 − θr ), j ≥ 2 = θj p′j = 2 and hence, P = P ((θ, Y)) = θ1 δY1 + (1 − θ1 )P ((θ∗ , Y ∗)) d = θ1 δY1 + (1 − θ1 )P since P ((θ, Y)) and P ((θ ∗ , Y ∗ )) have the same distribution. Also, notice in the last line above that θ1 ,Y1 and P are independent. Hence P (B1 ) .. d = θ . 1 P (Bk ) δY1 (B1 ) .. + (1 − θ ) . 1 δY1 (Bk ) P (B1 ) .. . P (Bk ) (32.22) Where θ1 ∼ Beta(1, α) , Y1 ∼ ᾱ, and (θ1 , Y1 , (P (B1), P (B2 ), . . . , P (Bk ))) are independent. We will now check that (P (B1 ), . . . , P (Bk )) ∼ D(α(B1 ), . . . , α(Bk )) satisfies the distributional equation (32.22), by finding the distribution of the RHS of (32.22). Conditional on Y1 ∈ Bi , the distribution of δY1 (B1 ) .. . δY1 (Bk ) 99 is degenerate at ei and is also therefore equal to Dei . The distribution of P (B1 ) .. . P (Bk ) is Dα∗ where α(B1 ) .. α∗ = . α(Bk ) . From a standard result on finite dimensional Dirichlet distributions, the conditional distribution of the RHS, given that Y1 ∈ Bi , becomes Dα∗ +ei . Since P (Y1 ∈ Bi ) = αi∗ /α(X ), 1 ≤ i ≤ k, it follows from an- other standard fact on finite Dirichlet distributions that the distribution of the RHS is Dα∗ . This checks this as a solution of the distributional equation (32.22). 3 We will now show that this is the unique solution by proving the following theorem. Theorem 32.2. Consider the distribution equation d V = U + WV (32.23) where U and V are in Rk (or in some linear space), W in R1 and (U, V ) is independent of V . Assume that P (|W | < 1) = 1. Then if there is a solution for the distribution equation of V , it is unique. Proof of Theorem 32.2 100 Suppose that V ,V ′ are two random variables whose unequal distributions solve the distributional equation (32.23). Let (U1 , W1 ),(U2 , W2 ) be i.i.d.with the distribution of (U.W ), and independent of V ,V ′ . Define V1 = V V1′ = V ′ V2 = U1 + W1 V1 V2′ = U1 + W1 V1′ V3 = U2 + W2 V2 .. . V3′ = U2 + W2 V2′ .. . Then V2 ∼ V1 , V2′ ∼ V1′ ,V3 ∼ V1 ,V3′ ∼ V2′ ,etc. d |V1 − V1′ | = |Vn − Vn′ | ′ = |Wn ||Vn−1 − Vn−1 | = |W1 ||W2 | · · · |Wn ||V1 − V1′ | d → 0, as n → ∞ since W are i.i.d. with |Wi | < 1. d Hence, |V1 − V1′ | = 0, which leads to a contradiction. This completes the proof of Theorem 32.2. Continuation of proof of Theorem (32.1) Thus (P (B1 ), P (B2), . . . , P (Bk )) ∼ Dα∗ is only solution for (32.22). This shows that the distribution of P defined in (32.21) is Dα . 101 March 25, 2008 33 Finiteness of integrals of random probability measures We defined Dirichlet distributions to use them as priors in the standard nonparametric problem, where the unknown distribution P is the parameter. In such problems, we would also like to estimate functions of this parameter, R especially functions like g(x)dP (x) where g is some measurable function on X . It will be useful to ask a question even before asking for a Bayes estimator R R of g(x)dP (x). Is the function of interest, namely g(x)dP (x) finite under the prior distribution Dα ? 
Only if we know that it is finite can we ask for an estimator of that function. We showed in a previous lecture that the Bayes estimator of this function, with no data and under a prior which is Dα is Z Z EDα ( g(x)dP (x)) = g(x)dᾱ(x) under the assumption that |g| is integrable under ᾱ. This means that R R g(x)dP (x) is not only finite but it is also integrable if g(x)dᾱ(x) < ∞. Several researchers have discovered a necessary and sufficient condition R for the finiteness of g(x)dP (x) under a Dirichlet distribution. We will give a simple and new proof of that result, and also obtain similar necessary and sufficient conditions for a larger class of distributions that include Dirichlet distributions. 102 Theorem 33.1. Let g be a measurable function on X . Let Ag = {P : R g(x)dP (x) is finite}. Then Dα (Ag ) = 1 if and only if Z Proof: Since R log(1 + |g(x)|)dᾱ(x) < ∞ where ᾱ = g(x)dP (x) is finite if and only if R α . α(X ) (33.24) |g(x)|dP (x) is finite, we can, without loss of generality, assume that g(x) is positive to prove this theorem. Also, there is no loss of generality in assuming that g(x) = x, and we will do so to complete the proof. Recall our constructive definition of the Dirichlet distribution of a random probability measure P : P = ∞ X pi δYi 1 where P = P (θ, Y), with p1 = θ1 , pn = θn Qn−1 1 (1 − θi ) and θ = (θ1 , θ2 , . . . ) are i.i.d. random variables with a Beta(1, α) distribution and Y = (Y1 , Y2 , . . . ) are i.i.d. random variables with distribution ᾱ. Aside: The random probability measure P above is measurable since the random probability measure δY1 is measurable. This is so because {Y 6∈ A} if r < 1 1 {ω : δY1 (A) < r} = X if r > 1 and both the sets on the right hand side are measurable. This makes δY1 a measurable map into P. Since P is a convex combination of such measures with measurable coefficients, it is a measurable map into P. Notice that Z x dP (x) = ∞ X 1 103 pi Y i . R P x dP is finite if and only if ∞ 1 pi Yi converges, and Dα (Ag ) = 1 is P P same as Dα ( Pi Yi converges) = 1 or Q( Pi Yi converges) = 1, where, as Thus usual, Q is the joint distribution of (θ, Y). We will recall the famous Kolmogorov’s three-series theorem: Theorem 33.2. Let X1 , X2 , . . . be independent random variables. Define X if |X| ≤ ǫ ǫ X = = XI(|X| ≤ ǫ) 0, if |X| > ǫ Then P Xi converges with probability 1 if and only if ∞ X 1 P (|Xi| > ǫ) < ∞, ∞ X E(Xiǫ ) < ∞, (33.26) ∞ X V (Xiǫ ) < ∞ (33.27) 1 and (33.25) 1 Conditional on (p1 , p2 , . . . ), the random variables p1 Y1 , p2 Y2 , . . . are independent. To continue the proof of Theorem 33.1, we will establish the following theorem. Theorem 33.3. Let X1 , X2 , . . . be i.i.d. random variables and let 0 < ρ < 1. Then ∞ X ρn Xn converges with probability 1 (33.28) 1 if and only if E(log(1 + |X1 |)) < ∞. 104 (33.29) Let (an , n = 1, 2, . . . ) be random variables independent of X1 , X2 , . . . and suppose that 1 log(an ) → −K > 0 with probability 1 and K is a constant. n Then series ∞ X (33.30) an Xn converges with probability 1 1 if and only if (33.29) holds. Proof: We can assume without loss of generality that the Xn ’s are positive also that Xn ≥ 1, n = 1, 2, . . . . Let ǫ > 0. 
Then ∞ > ∞ X n=1 n P (|ρ Xn | > ǫ) = = ∞ X ∞ X = ∞ X n=1 i=1 ≤ E( and ∞ X n=1 P (|X1| > ǫ ) ρn ∞ i X ǫ ǫ ǫ ǫ X P ( i < |X1 | < i+1 ) = P ( i < |X1 | < i+1 ) 1 ρ ρ ρ ρ i=n i=1 n=1 iP (i < log |X1 | − log(ǫ) ≤ i + 1) − log ρ log |X1 | − log(ǫ) ) − log(ρ) log |X1 | − log(ǫ) )−1 − log(ρ) ⇔ E(log |X1 |) < ∞ ≥ E( This shows that condition (33.25) is equivalent to condition (33.29). Consequently, (33.28) implies condition (33.29). Conversely let (33.29) hold. Then the condition analogous to (33.25) P n holds for the series ρ Xn . We will now show that condition (33.29) also 105 implies the conditions analogous to (33.26) and (33.27). This will complete the proof that (33.29) implies (33.28). We show the details for just one of these. (Recall that, WLOG, we assumed X1 > 0.) ∞ X n ǫ E(ρ Xn ) = n=1 ∞ X n=1 n n E(ρ Xn I(|ρ Xn | ≤ ǫ)) = ∞ X E(ρn X1 , X1 < n=1 ǫ ) ρn ∞ X n X ǫ ǫ = [ E(ρn X1 , i−i < X1 ≤ i ) + E(ρn X1 , X1 ≤ ǫ)] ρ ρ n=1 i=1 ≤ ∞ X E(X1 , i=1 ǫ ρi−1 ∞ < X1 ≤ ∞ ǫ X n ρǫ ) ρ + ρi n=i (1 − ρ) 1 X i X1 ǫ ǫ ρǫ = ρ E( log X1 , i−1 < X1 ≤ i ) + 1 − ρ i=1 log X1 ρ ρ (1 − ρ) ∞ ρi ρǫi ǫ ǫ ρǫ 1 X E(log X1 , i−1 < X1 ≤ i ) + ≤ 1 − ρ i=1 log(ǫ) − i log(ρ) ρ ρ (1 − ρ) ∞ ǫ ǫ ρǫ 1 X ǫ E(log X1 , i−1 < X1 ≤ i ) + ≤ 1 − ρ i=1 log(ǫ) ρ ρ (1 − ρ) ǫ ρǫ = E(log X) + <∞ (1 − ρ) log(ǫ) (1 − ρ) This completes the proof of the first part of Theorem 33.3. The second part is immediate. Continuation of the proof of Theorem 33.1: From Theorem 33.3 it is enough to show that 1 log(pn ) → −K n where K > 0 is a constant. Note that n−1 log(θn ) 1 X 1 log(1 − θi ) log pn = + n n n i=1 → E(log(1 − θ1 )) < 0 106 since E(log(1 − θ1 )) > −∞ when θ1 ∼ Beta(1, α(X )). We have also used the fact log(θn ) n → 0 with probability 1 since E(| log(θ1 )|) < ∞. We can state the following easy generalization. Theorem 33.4. Let P be a random probability measure defined by P = ∞ X pn δYn 1 where (p1 , p2 , . . . ) is independent of the i.i.d. random variables {Y1 , Y2 , . . . } satisfying the following condition: 1 log(pn ) → −K with probability 1 (33.31) n R where K > 0 is a constant. Then g(x)dP (x) is finite with probability 1 if and only if E(log(1 + |g(Y1)|)) < ∞ We can now ask “What is the distribution of (33.24) R xdP (x)” under the DirichR let distribution Dα . Yamato showed in a paper that xdP (x) has a Cauchy distribution of if ᾱ = Cauchy. The result does not depend on the value of α(X ). For a quick proof of this result, notice that EDα (eit R P = E(e XdP ) pn itYn P ) = E(E(e pn itYn |p1 , p2 , . . . )) = E(E(e− P )) pn |t| = e−|t| 107 since P pn = 1. This result does not depend on the actual distributions of (p1 , p2 , . . . ); it is enough if (p1 , p2 , . . . ) is independent of (Y1 , Y2 , . . . . If we assume that α ∼ k ∗ N(0, 1), then exists Z, such that Z and Z ∼ P XdP |Z ∼ N(0, Z) p2n . The moments of for instance the first moment is P p2n can be determined with determination; 1 . α(X )+1 108 March 27, 2008 34 Convergence of random probability measures We begin with describing the several objects that we have defined with this sketch. Basic space pm’s on X pm’s on P (X , B) (P, σ(P)) M x P µ X ∼P P ∼µ element rv’s and dist’s w w convergence xn → x Pn → P µn → µ We know the meaning of convergence in the first two columns. How do we define the convergence in the last column? Let us first look at the definition of convergence in P. This is often done through random variables taking values on X even though it is a property of only their distributions. 
w We say that Pn → P if Fn (x) = P ((−∞, x]) → F (x) = P ((−∞, x]) for all x ∈ C(F ) where C(F ) is the set of continuity points of F . If Xn , X w are random variables on X with Xn ∼ P, X ∼ P and Pn → P , we will also w say that Xn → X or that Xn converges to Xin distribution. (Note that all convergences here are n → ∞.) 109 Another equivalent definition is Z Z w Pn → P if and only if g(x)dPn (x) → g(x)dP (x) for all bounded continuous functions g on X . We have other notions of convergence of random variables and probability measures. For instance, we say that Pn → P in variation norm if sup ||Pn (B) − P (B)|| → 0. B For random variables, we have the following notions of convergence. Let Xn , X be measurable functions from a probability space (Ω, A, λ) into (X , B). We say that Xn → X with probability 1, if λ({ω : Xn (ω) → X(ω)}) = 1. p We say that Xn → X in probability (Xn → X) if λ(|Xn − X| > ǫ) → 0 for each ǫ > 0. If Xn ∼ Pn , X ∼ P and Xn → with probability 1 or in probability, we w can conclude that Pn → P . We cannot strengthen this conclusion and claim that Pn → P in variation norm. With these standard ideas in mind we will now define convergence of random probability measures, that is, of elements of M. Let µn , n = 1, 2, . . . , µ be random pm’s, i.e. be elements of M. In a way w analogous to our previous definition, we say that µn → µ if Z Z H(P )dµn(P ) → H(P )dµ(P ) P P 110 for all bounded functions H(P ) on P which are continuous with respect to weak convergence in P. This definition is not totally clear since we do not know all the continuous functions H(P ) with reselect to weak convergence in P. However, the following results are true, which will allow us to show weak convergence of random pm’s. Let Pn , n = 1, 2, . . . , P be random mappings from some probability space (Ω, A, λ) into (P, σ(P)). w Let Pn ∼ µn , n = 1, 2, . . . and P ∼ µ. Suppose that Pn → P with λ-probability 1, i.e. w λ(ω : Pn (ω) → P (ω)) = 1. w Then µn → µ. The stronger condition ||Pn − P || → 0 with λ − probability 1 w will also only imply µn → µ. One generally shows that a sequence converges by first showing that it is pre-compact. Let {Pn } be a sequence of probability measures. It is said to be precompact if every subsequence of {Pn } has a further subsequence that converges weakly to some probability measure. The sequence {Pn } is said to be tight if for any δ > 0, there exists a compact set Kδ ∈ X , such that Pn (Kδ ) ≥ 1 − δ for all n 111 which is the same as Pn (Kδc ) ≤ δ for all n. The famous Prohorov theorem states Theorem 34.1. {Pn } is pre-compact if and only if {Pn } is tight. Similar definitions and results hold for random pm’s, i.e. for elements of M. Suppose that µ ∈ M and P ∼ µ, we can compute Z def µ̄(B) = Eµ (P (B)) = P (B)dµ(P ) P and it is easy to see that µ̄(·) is a probability measure on (X , B), i.e. an element in P. The pm µ̄is called the mean measure of the random pm µ. The following theorem gives a nice characterization of tightness for random pm’s. Theorem 34.2. Let {µn } be a sequence inM. Then {µn } is tight if and only if {µ̄n } is tight. Proof: We first describe some sets in P which are compact under weak convergence. Fix δ > 0. For k = 1, 2, . . . , find a compact set Kk . Such that {µ̄n }(Kkc ) ≤ δ 6 . k3 π2 Let C = ∩k {P : P (Kkc ) ≤ 112 1 } k From one of the equivalent conditions in the Portmanteau theorem for weak convergence, it is easy to see that C is closed. We now claim that C is compact. Let {Pm } ∈ C. Fix ǫ > 0, find k, such that 1 k < ǫ. 
c Then {Pm } ∈ {P : P (KK ) < ǫ}, i.e. Pm (Kkc ) < ǫ for all m. Hence {Pm } is tight and a subsequence converges to some probability measure. Hence C is compact. Let {µ̄n } be tight and C be as above.. Then µn ({P : P (Kkc ) > 1 }) k ≤ kEµn (P (Kkc )) = k µ̄n (Kkc ) δ 6 ≤ 2 2 k π and µn (C c ) ∞ X 1 ≤ µn (P : P (Kkc ) > ) k k=1 ∞ X 1 6 ≤δ k2 π2 1 =δ for all n. This shows that {µn } is tight. Conversely, suppose that {µn } is tight. Then exists {nr }, such that w µ nr → µ Then Eµnr (P (A)) → Eµ (P (A)) 113 if P (A) is continuous with respect to µ. Hence µ̄nr (A) → µ̄(A) for all A which are such that P (A) is continuous with respect to ]mu. The w class of such sets A is sufficiently rich, and thus µ̄nr → µ̄. This shows that {µ̄n } is tight. Now we explore the convergence for Dαn for sequences αn . Example 34.1. Let αn = αn (X )β for n = 1, 2, . . . , where β = ᾱn is a pm. P Let Pn = ∞ m=1 pn,m δYm , where Y1 , Y2 , . . . are i.i.d. β, and θn,1 , θn,2 , . . . are i.i.d. Beta(1, αn (X )) and pn,m = θn,m (1 − θn,1 ) . . . (1 − θn,m−1 ), m = 1, 2, . . . . From our constructive definition, Pn ∼ Dαn and Pn = ∞ X pn,m δYm m=1 = pn,1 δY1 + (1 − pn,1 )Pn∗ where Pn∗ ∼ Dαn . Let P = δY1 . Then ||Pn − P || = ||pn,1δY1 + (1 − pn,1)Pn∗ − δY1 || = ||(1 − pn,1 )(Pn∗ − δY1 )|| ≤ 2(1 − pn,1 ) because that Pn and δY1 are probability measures. Since pn,1 ∼ Beta(1, αn (X )) and αn (X ) → 0, E(1 − pn,1 ) = αn (X ) →0 αn (X ) + 1 114 and V ar(1 − pn,1 ) ≤ αn (X ) → 0. (αn (X ) + 1)2 (αn (X ) + 2) This means that (1 − pn,1 ) → 0 in probability and ||Pn − δY1 || → 0 in probaw bility, and that Dαn → δY1 . 115 April 1, 2008 35 Convergence of sequences of Dirichlet distributions Let X be the basic space, P be the space for all the probability measure on X and M be the space of probability measures on P. We will also refer µ as a random probability measure since an example of such a µ is the distribution of a random variable P taking values in P. A random pm µ is well defined if µ({P : P ∈ C}) is defined for all Borel sets C in P. This is not a convenient definition. We have already seen that if the distribution of (P (B1 ), . . . , P (Bk )) under µ is defined for all partitions (B1 , . . . , Bk ) (or partitions based on special classes of sets like intervals with rational end points, etc.) of X , then it identifies µ uniquely. A random pm µ is uniquely defined if the distribution of R g(x)dP (x) under µ is defined for all bounded continuous functions g on X . Let Y be a random variable on X with distribution ᾱ. Then δY is a mapping into P and its distribution will be a random pm in M. We will not give a name for this random pm and will simply use δY to denote it. We restate a few results from the last class. w Theorem 35.1. Let {µn } ∈ M. Then µn → µ if Z Z g(P )dµn(P ) → g(P )dµ(P ) for every bounded continuous function g on P. 116 Theorem 35.2. A sequence {µn } in M is tight if and only if the sequence of mean pm’s {µ̄n } is tight. The mean pm µ̄n is defined by the relation Z µ̄n (B) = P (B)dµn (P ). We will now consider the convergence of sequences of Dirichlet distributions Dαn under different conditions on αn . The next result was already established in the last class. Theorem 35.3. Let µn = Dαn , with αn = αn (X )ᾱ and let αn (X ) → 0. Then w µn → δY where Y is a random variable on X with distribution barα. Proof: Note that the mean measures µ̄n (B) = EDα (P (B)) = ᾱ(B) do not depend on n. Hence {µ̄n } is tight and from Theorem 34.2 {µn } is tight. However, we do not know if the sequence converges. 
April 1, 2008

35 Convergence of sequences of Dirichlet distributions

Let X be the basic space, let P be the space of all probability measures on X, and let M be the space of probability measures on P. We will also refer to µ ∈ M as a random probability measure, since an example of such a µ is the distribution of a random variable P taking values in P. A random pm µ is well defined if µ({P : P ∈ C}) is defined for all Borel sets C in P. This is not a convenient definition. We have already seen that if the distribution of (P(B_1), . . . , P(B_k)) under µ is defined for all partitions (B_1, . . . , B_k) of X (or for partitions based on special classes of sets, like intervals with rational end points, etc.), then it identifies µ uniquely. A random pm µ is also uniquely defined if the distribution of ∫ g(x) dP(x) under µ is defined for all bounded continuous functions g on X.

Let Y be a random variable on X with distribution ᾱ. Then δ_Y is a mapping into P and its distribution is a random pm in M. We will not give a name to this random pm and will simply use δ_Y to denote it.

We restate a few results from the last class.

Theorem 35.1. Let {µ_n} ⊂ M. Then µ_n →_w µ if

∫ g(P) dµ_n(P) → ∫ g(P) dµ(P)

for every bounded continuous function g on P.

Theorem 35.2. A sequence {µ_n} in M is tight if and only if the sequence of mean pm's {µ̄_n} is tight. The mean pm µ̄_n is defined by the relation

µ̄_n(B) = ∫ P(B) dµ_n(P).

We will now consider the convergence of sequences of Dirichlet distributions D_{α_n} under different conditions on α_n. The next result was already established in the last class.

Theorem 35.3. Let µ_n = D_{α_n}, with α_n = α_n(X) ᾱ, and let α_n(X) → 0. Then µ_n →_w δ_Y, where Y is a random variable on X with distribution ᾱ.

Proof: Note that the mean measures

µ̄_n(B) = E_{D_{α_n}}(P(B)) = ᾱ(B)

do not depend on n. Hence {µ̄_n} is tight, and from Theorem 34.2 {µ_n} is tight. However, tightness alone does not tell us whether the sequence converges. Using the constructive definition, define random probability measures P_n, n = 1, 2, . . . , as follows:

P_n = Σ_{j=1}^∞ p_{j,n} δ_{Y_j},

where Y_1, Y_2, . . . are i.i.d. ᾱ, the variables θ_{1,n}, θ_{2,n}, . . . are i.i.d. Beta(1, α_n(X)), and p_{j,n} = θ_{j,n}(1 − θ_{j−1,n}) · · · (1 − θ_{1,n}), j ≥ 1, n ≥ 1. We know that P_n ∼ D_{α_n}. Then

|P_n(B) − δ_{Y_1}(B)| = |p_{1,n} δ_{Y_1}(B) + (1 − p_{1,n}) P_n^*(B) − δ_{Y_1}(B)| ≤ (1 − p_{1,n})(|P_n^*(B)| + δ_{Y_1}(B)) ≤ 2(1 − p_{1,n}).

Therefore sup_B |P_n(B) − δ_{Y_1}(B)| ≤ 2(1 − p_{1,n}). Also, (1 − p_{1,n}) ∼ Beta(α_n(X), 1) and, since α_n(X) → 0,

E(1 − p_{1,n}) = α_n(X)/(α_n(X) + 1) → 0 and Var(1 − p_{1,n}) = α_n(X)/((α_n(X) + 1)^2 (α_n(X) + 2)) → 0.

This shows that ||P_n − δ_{Y_1}|| → 0 in probability and thus µ_n →_w δ_{Y_1}.

Notice that the limit µ = δ_Y is not a Dirichlet distribution: for any partition (B_1, . . . , B_k) of X,

(δ_{Y_1}(B_1), . . . , δ_{Y_1}(B_k)) = e_i with probability P(Y_1 ∈ B_i), i = 1, 2, . . . , k.

Thus µ is a mixture of degenerate Dirichlet distributions.

We will now look at another example.

Theorem 35.4. Let µ_n = D_{α_n} and let α_n = α_n(X) β_n, with α_n(X) → α(X), 0 < α(X) < ∞, and β_n →_w β. Write α = α(X) β. Then D_{α_n} →_w D_α.

Proof: We already know that {D_{α_n}} is tight, since the mean measures converge. We need to identify a unique limit. From the constructive definition we can write P_n = Σ_i p_{i,n} δ_{Y_{i,n}}, where (Y_{1,n}, Y_{2,n}, . . .) are i.i.d. β_n, p_{i,n} = θ_{i,n}(1 − θ_{i−1,n}) · · · (1 − θ_{1,n}), i = 1, 2, . . . , and (θ_{1,n}, θ_{2,n}, . . .) are i.i.d. Beta(1, α_n(X)), n = 1, 2, . . . , with all the random variables defined on a single space Ω. We already know that P_n ∼ D_{α_n} = µ_n. Since α_n(X) → α(X) and β_n →_w β, we can choose versions such that θ_{i,n} → θ_i, i = 1, 2, . . . , with probability 1 and Y_{i,n} → Y_i, i = 1, 2, . . . , with probability 1, where θ_1, θ_2, . . . are i.i.d. Beta(1, α(X)) and Y_1, Y_2, . . . are i.i.d. β. Thus P_n(B) → P(B) with probability 1 for each B. This identifies D_α as the unique limit of all convergent subsequences, and hence the whole sequence D_{α_n} →_w D_α.

A third example is given by the following theorem.

Theorem 35.5. Let µ_n = D_{α_n}, with α_n = α_n(X) β_n, α_n(X) → ∞ and β_n →_w β. Let µ be the degenerate random pm δ_β. Then µ_n →_w µ.

Proof: Note that P = β with probability 1 under µ. Since the mean measures {β_n} converge, the sequence {µ_n} is tight. Once again, we need to identify the unique limit. Under µ_n, the distribution of (P(B_1), . . . , P(B_k)) is the finite dimensional Dirichlet distribution with parameter (α_n(X) β_n(B_1), . . . , α_n(X) β_n(B_k)). Further,

E_{µ_n}(P(B_1)) = β_n(B_1) → β(B_1),
Var_{µ_n}(P(B_1)) = β_n(B_1) β_n(B_1^c)/(α_n(X) + 1) → 0,

since α_n(X) → ∞. Hence the distribution of (P(B_1), . . . , P(B_k)) under µ_n converges to the distribution degenerate at (β(B_1), . . . , β(B_k)). This shows that µ_n →_w µ.

Corollary 35.1. The conclusion of the previous theorem is also true if µ_n = D_{α_n}, where α_n = α_n(X) β_n is random, with α_n(X) → ∞ and β_n →_w β in probability.

Consider now the standard nonparametric problem: P ∼ D_α and X_1, X_2, . . . | P are i.i.d. P. We know that the posterior distribution of P given X_1, . . . , X_n is D_{α + Σ_1^n δ_{X_i}}. Suppose that the random variables X_1, X_2, . . . are actually i.i.d. P_0. From the Glivenko-Cantelli lemma we know that (1/n) Σ_1^n δ_{X_i} →_w P_0 with probability 1. Since α(X) + n → ∞, we have

(α(B) + Σ_1^n δ_{X_i}(B))/(α(X) + n) → P_0(B),

and thus

D_{α + Σ δ_{X_i}} →_w δ_{P_0} with probability 1 under P_0.

This is called posterior consistency.
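Posterior consistency is easy to see numerically through the posterior mean. The sketch below (mine; the true P_0, the prior parameter α, and the evaluation points are illustrative assumptions) compares the posterior mean distribution function (α((−∞, t]) + Σ_i δ_{X_i}((−∞, t]))/(α(X) + n) with the true F_0(t) as n grows.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Illustrative choices: alpha = 4 * Exponential(rate 1) as the prior parameter,
# true data-generating distribution P0 = Gamma(shape 3, rate 1).
alpha_mass = 4.0
alpha_cdf = lambda t: stats.expon.cdf(t)          # normalized alpha, i.e. alpha_bar
true_cdf = lambda t: stats.gamma.cdf(t, a=3.0)

t_grid = np.array([1.0, 2.0, 4.0, 6.0])
for n in [10, 100, 10_000]:
    x = rng.gamma(shape=3.0, size=n)              # X_1, ..., X_n i.i.d. P0
    emp = (x[:, None] <= t_grid).mean(axis=0)     # empirical df at the grid points
    post_mean = (alpha_mass * alpha_cdf(t_grid) + n * emp) / (alpha_mass + n)
    err = np.max(np.abs(post_mean - true_cdf(t_grid)))
    print(f"n = {n:6d}   max |posterior mean df - F0| on grid = {err:.4f}")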
36 An example with censoring

Let us look at another application. Let P ∼ D_α and X|P ∼ P. Suppose that we cannot observe X in its entirety, but that we can only observe Y defined by

Y = X if X ∈ A^c, and Y = θ if X ∈ A,

where A is a known subset of X. What is the distribution of P|Y?

If Y ∈ A^c, then Y = X, so P|Y = P|X ∼ D_{α + δ_X}. If Y = θ, then we need to find the distribution of P|(Y = θ). Even though we do not know this conditional distribution in closed form, we can write down the conditional distributions of (P, X) given Y = θ as follows:

P|(Y = θ, X) ∼ D_{α + δ_X} and X|(P, Y = θ) ∼ P_A, where P_A(B) = P(A ∩ B)/P(A).

Now we can employ Gibbs sampling, which can be described as follows. Consider a bivariate random variable (X, Y). Suppose we know L(X|Y) and L(Y|X) and can generate observations from them. Then we can start with some initial value (X_0, Y_0) and proceed as follows to generate (X_1, Y_1), (X_2, Y_2), . . . :

X_1 ∼ L(X|Y = Y_0), Y_1 ∼ L(Y|X = X_1), . . .

This generates a Markov chain {(X_n, Y_n)} starting from (X_0, Y_0), and (X_n, Y_n) →_w (X_∞, Y_∞) where (X_∞, Y_∞) ∼ (X, Y). This is generally proved by showing that the distribution of (X, Y) is an invariant distribution for this Markov chain and that the chain is ergodic. For more details see my paper on this topic.

Going back to our problem, we can use Gibbs sampling to find the distribution of (P, X)|Y = θ and obtain the distribution of P|Y = θ as the marginal distribution of P. To generate an observation P from L(P|X, Y = θ) = D_{α + δ_X}, we can take P = Σ_{i=1}^∞ p_i δ_{Y_i}, where Y_1, Y_2, . . . are i.i.d. (α(·) + δ_X(·))/(α(X) + 1) and are independent of (p_1, p_2, . . .), which is generated as usual from i.i.d. random variables with distribution Beta(1, α(X) + 1). To generate the next X, choose J = i with probability p_i; this can be done by taking a U uniform on [0, 1] and choosing J = i if p_1 + · · · + p_{i−1} < U ≤ p_1 + · · · + p_i. Then Y_J will have the distribution of X | P, Y = θ.

Thus, starting from an initial value (P_0, X_0), we can generate X_1, X_2, . . . , X_n without ever having to store P_1, P_2, . . . , P_n. The conditional distribution of P_{n+1} given X_n is D_{α + δ_{X_n}}, and for large n it approximates L(P|Y = θ). To repeat, the steps are:

1. Generate a U from the uniform distribution on [0, 1].
2. Generate p_1 ∼ Beta(1, α(X) + 1).
3. Find J: if U ≤ p_1 put J = 1; if U > p_1, generate p_2, and if p_1 < U ≤ p_1 + p_2, put J = 2; etc.

Put X_1 = Y_J. Generate X_2, X_3, . . . , X_{10,000} successively. Then D_{α + δ_{X_{10,000}}} is an approximation to L(P|Y = θ).
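Here is a sketch of a Gibbs sampler along the lines of the two conditional distributions displayed above. Everything concrete in it is my own illustrative assumption: the space X = [0, ∞), the parameter α = 2 × Exponential(1), the censoring set A = [3, ∞), the stick-breaking truncation (an approximation; see the truncation discussion later in these notes), and the check at the end, which uses the fact that for a single censored observation P(A) | Y = θ is Beta(α(A) + 1, α(A^c)), so E(P(A) | Y = θ) = (α(A) + 1)/(α(X) + 1). The Gibbs output should be close to this, up to truncation and Monte Carlo error.

import numpy as np

rng = np.random.default_rng(3)

c = 2.0                                                    # alpha(X)
a_lo = 3.0                                                 # censoring set A = [3, inf)
base_sampler = lambda m: rng.exponential(1.0, size=m)      # draws from alpha_bar
n_atoms = 200                                              # stick-breaking truncation

def draw_p_given_x(x):
    """P | (Y = theta, X = x) ~ D_{alpha + delta_x}: truncated stick breaking with
    total mass c + 1 and base measure (alpha + delta_x)/(c + 1)."""
    theta = rng.beta(1.0, c + 1.0, size=n_atoms)
    theta[-1] = 1.0
    p = theta * np.concatenate(([1.0], np.cumprod(1.0 - theta)[:-1]))
    atoms = base_sampler(n_atoms)
    atoms[rng.random(n_atoms) < 1.0 / (c + 1.0)] = x       # atoms coming from delta_x
    return atoms, p

def draw_x_given_p(atoms, p):
    """X | (P, Y = theta) ~ P_A: renormalize the weights of the atoms lying in A."""
    in_A = atoms >= a_lo
    if not in_A.any():                                     # essentially never happens here
        return None
    w = p * in_A
    return rng.choice(atoms, p=w / w.sum())

x_current = a_lo + 1.0                                     # arbitrary initial value in A
pa_samples = []
for it in range(3000):
    atoms, p = draw_p_given_x(x_current)
    x_new = draw_x_given_p(atoms, p)
    if x_new is not None:
        x_current = x_new
    if it >= 500:                                          # discard burn-in
        pa_samples.append(p[atoms >= a_lo].sum())

alpha_A = c * np.exp(-a_lo)
print("Gibbs estimate of E(P(A) | Y = theta) :", np.mean(pa_samples))
print("Exact (alpha(A) + 1)/(alpha(X) + 1)   :", (alpha_A + 1.0) / (c + 1.0))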
April 3, 2008

37 Variants of the constructive definition of Dirichlet distributions

A random pm P with the Dirichlet distribution D_α was given as follows:

P = Σ_{i=1}^∞ p_i δ_{Y_i},    (37.32)

where (θ_1, Y_1), (θ_2, Y_2), . . . are i.i.d. with Y_1, Y_2, . . . i.i.d. ᾱ and θ_1, θ_2, . . . i.i.d. Beta(1, α(X)), and p_i = θ_i(1 − θ_1) · · · (1 − θ_{i−1}), i = 1, 2, . . . .

This constructive definition has inspired others to make several variants, at least formally defining more random pm's. The distributions of such random pm's may not be well understood or easily described. For instance, one may change just the common distribution of θ_1, θ_2, . . . to Beta(2, α(X)) and thus define a new random pm P through (37.32); its distribution, however, is not well understood and is not described in terms of Dirichlet distributions. On the other hand, if the common distribution of θ_1, θ_2, . . . is taken to be Beta(1, 2α(X)), then the random pm P defined by (37.32) is just D_{2α}.

Another generalization arises by allowing the independent random variables θ_1, θ_2, . . . to have non-identical distributions. For instance, when ᾱ is non-atomic and θ_k ∼ Beta(1 − δ, γ + kδ), k = 1, 2, . . . , with 0 ≤ δ < 1 and γ > −δ, the distribution of the random pm P defined by (37.32) is called the "two-parameter" Dirichlet distribution, denoted by D_{γ,δ,ᾱ}. This distribution has been studied in other contexts, but has not found many applications in Bayesian nonparametrics.

Another generalization comes from assuming that (Y_1, Y_2, . . .) are exchangeable with a Polya distribution Q_α. As before, it will be assumed that (Y_1, Y_2, . . .) and (θ_1, θ_2, . . .) are independent and that the latter sequence consists of i.i.d. Beta(1, δ) random variables. The distribution of the random pm P defined by (37.32) is not Dirichlet, but it can be described as follows. From the properties of a Polya distribution Q_α, there exists a random pm R with distribution D_α such that, given R, the random variables Y_1, Y_2, . . . are i.i.d. R. Thus, from our constructive definition of a Dirichlet distribution, the distribution of P given R is D_{δR}, and hence the distribution of P is the mixture of Dirichlet distributions

∫ D_{δR}(·) D_α(dR).

This is an unusual mixture of Dirichlet distributions. We will now study the usual mixtures of Dirichlet distributions and also show that there is a constructive definition of a random pm with a Dirichlet mixture distribution.

Let P ∼ D_α. Given P, let the random variables Y_1, Y_2, . . . be i.i.d. P. Given P, Y_1 = y_1, Y_2 = y_2, . . . , let the random variables X_1, X_2, . . . be independent with distributions given by the df's K(x_1|y_1), K(x_2|y_2), . . . , where K(x|y) is a df. Then, given P, the random variables X_1, X_2, . . . are i.i.d. with common distribution given by the df ∫ K(x|y) P(dy). Thus ∫ K(x|y) P(dy) is a random df, and its distribution is usually referred to as a Dirichlet mixture distribution. One may describe this model by saying that there is a random pm R = ∫ K(·|y) P(dy) whose distribution is a Dirichlet mixture distribution and that, given R, the random variables X_1, X_2, . . . are i.i.d. R.

In the special case when K(x|y) = I(x ≥ y), i.e. the kernel is the discrete measure concentrated at y, the random df R((−∞, x]) reduces to

∫ I(x ≥ y) P(dy) = P((−∞, x]),

which is the random df with the usual Dirichlet distribution D_α. From the constructive definition (37.32), one can give the following constructive definition of the Dirichlet mixture random pm R:

R((−∞, x]) = Σ_n p_n K(x|Y_n),

where p_n, Y_n, n = 1, 2, . . . , are as defined in (37.32).

The statistical problem is to determine the posterior distribution of P given the data X_1, . . . , X_n in a Dirichlet mixture model. Let X = (X_1, . . . , X_n), Y = (Y_1, . . . , Y_n) and let ν(Y|X) denote the conditional distribution of Y given X. Since the conditional distribution of P given X, Y = y is D_{α + Σ_1^n δ_{y_i}}, one can write the posterior distribution as

L(P|X) = ∫ D_{α + Σ_1^n δ_{y_i}} dν(y|X).

To obtain ν(y|x), we first write down the joint distribution of (X, Y):

Q(Y_1 ∈ A_1, . . . , Y_n ∈ A_n, X_1 ∈ B_1, . . . , X_n ∈ B_n)
= E( ∫_{A_1 × · · · × A_n} Π_{i=1}^n K(B_i|y_i) P(dy_1) · · · P(dy_n) )
= ∫_{A_1 × · · · × A_n} Π_{i=1}^n K(B_i|y_i) · [Π_{i=1}^n (α(dy_i) + Σ_{j=1}^{i−1} δ_{y_j}(dy_i))] / [α(X)(α(X) + 1) · · · (α(X) + n − 1)].

Assuming that the df's K(x|y) have pdf's k(x|y), it follows that

ν(y|x) ∝ Π_{i=1}^n k(x_i|y_i) Π_{i=1}^n (α(dy_i) + Σ_{j=1}^{i−1} δ_{y_j}(dy_i)).

Some simplifications can be made to the above if one assumes that α is non-atomic. To repeat, the posterior distribution of P given X = x is given by

Q(P ∈ A|x) = ∫ D_{α + Σ_1^n δ_{y_i}}(A) ν(dy|x).

It should be noted that this posterior distribution is not a Dirichlet mixture distribution.
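The constructive definition of the Dirichlet mixture random pm given earlier in this section is easy to turn into a sampler. The sketch below is mine; the normal kernel K(x|y) = N(y, 0.5^2), the base measure ᾱ = N(0, 2^2), the total mass α(X) = 3 and the truncation level are all illustrative assumptions. It draws one random df R(x) = Σ_n p_n K(x|Y_n) and evaluates it on a grid; a single draw fluctuates around its expectation, which is the df of the convolution of ᾱ with the kernel.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

total_mass = 3.0
n_atoms = 500                                   # stick-breaking truncation (approximation)

theta = rng.beta(1.0, total_mass, size=n_atoms)
theta[-1] = 1.0                                 # close off the weights
p = theta * np.concatenate(([1.0], np.cumprod(1.0 - theta)[:-1]))
Y = rng.normal(0.0, 2.0, size=n_atoms)          # Y_n i.i.d. alpha_bar

def R(x):
    """One realization of the random df R(x) = sum_n p_n K(x | Y_n)."""
    return np.sum(p[:, None] * stats.norm.cdf(x, loc=Y[:, None], scale=0.5), axis=0)

x_grid = np.linspace(-5.0, 5.0, 5)
print("one draw of R on grid:", np.round(R(x_grid), 3))
print("E R on grid          :",
      np.round(stats.norm.cdf(x_grid, loc=0.0, scale=np.sqrt(4.0 + 0.25)), 3))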
38 Approximations to Dirichlet distributions

One way to approximate a Dirichlet distribution is to approximate the random pm given in the constructive definition (37.32) by another random pm. If the two random pm's are close to one another, then their distributions will also be close to one another. One such choice is

P_N = Σ_{i=1}^N p_i δ_{Y_i},

where N is chosen to be large and p_N is adjusted to make Σ_1^N p_i = 1; this is the same as taking θ_N = 1. Clearly ||P_N − P|| ≤ Π_1^{N−1}(1 − θ_i), which goes to 0 with probability 1 as N → ∞. Thus the distribution of P_N converges weakly to D_α.

If we want more control on the actual error in this approximation, we can take a random N as follows. Given ǫ > 0, let

N_ǫ = inf{n : Π_1^{n−1}(1 − θ_i) < ǫ}.

As before, put θ_{N_ǫ} = 1 and look at the truncated P_{N_ǫ}. This random measure satisfies ||P_{N_ǫ} − P|| < ǫ. One can write down the distribution of N_ǫ and give bounds on the tails of its distribution. One can use such approximations either with the prior distribution or with the posterior distribution D_{α + Σ_1^n δ_{Y_i}}, since both are Dirichlet distributions.

What, then, is the distribution of (p_1, . . . , p_N) when we put θ_N = 1? We see that

p_1, p_2/(1 − p_1), . . . , p_{N−1}/(1 − p_1 − · · · − p_{N−2})

are i.i.d. Beta(1, α(X)). Thus the joint distribution of (p_1, . . . , p_N) is the Connor-Mosimann distribution, which was given as a homework in this class.

Another approximation to the Dirichlet distribution is given by

P_N^* = Σ_{i=1}^N p_i^* δ_{Y_i},

where (p_1^*, . . . , p_N^*) ∼ D(α(X)/N, . . . , α(X)/N) and is independent of (Y_1, . . . , Y_N), which are i.i.d. ᾱ. Note that each δ_{Y_i} can be viewed as having the degenerate Dirichlet distribution D_{(α(X)/N) δ_{Y_i}}, so that, conditional on (Y_1, . . . , Y_N),

Σ_{i=1}^N p_i^* δ_{Y_i} ∼ D_{(α(X)/N) Σ_1^N δ_{Y_i}},

from the standard result on Dirichlet distributions that AU + (1 − A)V ∼ D_{α+β} if U ∼ D_α, V ∼ D_β and A ∼ B(Σ_i α_i, Σ_i β_i) are independent. From the Glivenko-Cantelli theorem,

(1/N) Σ_1^N δ_{Y_i} →_w ᾱ

with probability 1. From the theorems on the convergence of Dirichlet distributions, it follows that, with probability 1 (over the Y's), the conditional distribution of P_N^* given (Y_1, . . . , Y_N) converges weakly to D_α.
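A sketch (mine) of the random truncation N_ǫ described above: keep breaking the stick until the un-allocated mass Π_{i<n}(1 − θ_i) drops below ǫ and then set the last θ equal to 1, which by construction keeps P_{N_ǫ} within ǫ of the untruncated P in the norm used above. The base measure ᾱ = Uniform[0, 1] and the values of α(X) and ǫ are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(5)

def truncated_dirichlet(total_mass, base_sampler, eps):
    """P_{N_eps}: stop at N_eps = inf{n : prod_{i<n}(1 - theta_i) < eps}, set theta_{N_eps} = 1.
    Guarantees ||P_{N_eps} - P|| < eps for the corresponding untruncated P."""
    weights, atoms = [], []
    remaining = 1.0                         # prod_{i<n} (1 - theta_i)
    while remaining >= eps:
        theta = rng.beta(1.0, total_mass)
        weights.append(theta * remaining)
        atoms.append(base_sampler())
        remaining *= (1.0 - theta)
    weights.append(remaining)               # theta_{N_eps} = 1 absorbs what is left
    atoms.append(base_sampler())
    return np.array(atoms), np.array(weights)

base = lambda: rng.uniform(0.0, 1.0)        # alpha_bar = Uniform[0, 1] (illustrative)

for total_mass in [1.0, 10.0]:
    sizes = [len(truncated_dirichlet(total_mass, base, eps=1e-3)[1]) for _ in range(2000)]
    print(f"alpha(X) = {total_mass:4.1f}:  average N_eps = {np.mean(sizes):6.1f}  (eps = 1e-3)")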
April 8, 2008

39 Example of an inconsistent Bayes estimator

I will now present an example of an inconsistent Bayes estimator due to Ferguson, Phadia and Tiwari. Let I ∈ {1, 2} with Q(I = 1) = Q(I = 2) = 1/2. Let P_1 = δ_β, where β is the distribution function of U, a uniform random variable on [0, 1], and let P_2 = D_β. Let the data consist of X_1, . . . , X_n, and assume that, given (P, I), the random variables X_1, . . . , X_n are i.i.d. P, where P ∼ P_I. Another way to describe this model is the following: given I = 1, the data X_1, . . . , X_n are i.i.d. P with P ≡ β, and given I = 2 they are i.i.d. P where P ∼ D_β.

What is the posterior distribution of P given X_1, . . . , X_n? Clearly, if any two X_i's are equal, then I = 2 and hence the posterior distribution of P is D_{β + Σ_1^n δ_{X_i}}. We will now compute the posterior distribution of P given that the X_i's are distinct. Note that

Q(I = 1, X_1, . . . , X_n, X_i's are distinct) = 1 · (1/2) = 1/2.

Here Q stands for the joint of the probability mass function of I and the pdf of X_1, . . . , X_n. We know that, when I = 2, we have P ∼ D_β and X_1, . . . , X_n given P are i.i.d. P, so that the joint distribution of X_1, . . . , X_n is

Π_{i=1}^n [β(A_i) + Σ_{j=1}^{i−1} δ_{X_j}(A_i)] / n!,

which, on the event that the X_i's are distinct, has density 1/n!. Hence

Q(I = 2, X_1, . . . , X_n, X_i's are distinct) = (1/n!) · (1/2).

This leads to

Q(I = 1 | X_1, . . . , X_n, X_i's are distinct) = 1/(1 + 1/n!) = n!/(n! + 1),
Q(I = 2 | X_1, . . . , X_n, X_i's are distinct) = 1/(n! + 1).

Hence the posterior distribution of P is given by

Q(P | X_1, . . . , X_n, X_i's are distinct)
= Q(P, I = 1 | X_1, . . . , X_n, X_i's are distinct) + Q(P, I = 2 | X_1, . . . , X_n, X_i's are distinct)
= Q(P | X_1, . . . , X_n, X_i's are distinct, I = 1) · Q(I = 1 | X_1, . . . , X_n, X_i's are distinct)
  + Q(P | X_1, . . . , X_n, X_i's are distinct, I = 2) · Q(I = 2 | X_1, . . . , X_n, X_i's are distinct)
= [n!/(n! + 1)] δ_β + [1/(n! + 1)] D_{β + Σ_1^n δ_{X_i}}
→_w δ_β as n → ∞,

when the X_i's are distinct. If our data X_1, X_2, . . . are i.i.d. from a continuous distribution G on [0, 1], then the above formula tells us that the posterior distribution of P goes to δ_β, where β is the uniform distribution function. Thus the Bayes estimator converges to β irrespective of the common continuous distribution of X_1, . . . , X_n, and it is therefore not consistent.

April 10, 2008

40 Bayesian Analysis of Failure Models

Consider a system consisting of a single item that is maintained, after each failure, by an instantaneous repair or by a replacement with a new item. We will study several models for the inter-failure times X_1, X_2, . . . of such a system. Any failure model has to specify the joint distribution of X_1, X_2, . . . . Let F be the distribution function of the life of a new item; such an F is a df on [0, ∞) with F(0) = 0. For a ≥ 0 define the residual life distribution function F_a as follows:

F_a(x) = (F(a + x) − F(a))/(1 − F(a)), x > 0.

The following are several repair models which are standard in the literature. The distribution of X_n after the (n − 1)th failure describes the nature of the repair at that stage. Here are some examples.

• Minimal repair:

Q(X_n ≤ x | X_1, . . . , X_{n−1}) = F_{X_{n−1}}(x), x > 0.

• Perfect repair:

Q(X_n ≤ x | X_1, . . . , X_{n−1}) = F(x) ≡ F_0(x), x > 0.

• Brown-Proschan model of repair. A number p in (0, 1) is fixed in advance. With probability p, Q(X_n ≤ x | X_1, . . . , X_{n−1}) = F_0(x), and with probability 1 − p, Q(X_n ≤ x | X_1, . . . , X_{n−1}) = F_{X_{n−1}}(x), x > 0. This can be restated as follows. Define i.i.d. uniform random variables U_1, U_2, . . . in advance. Then

Q(X_n ≤ x | X_1, U_1, . . . , X_{n−1}, U_{n−1}) = F_0(x) if U_{n−1} ≤ p, and = F_{X_{n−1}}(x) if U_{n−1} > p,

that is, it equals F_{ǫ_{n−1}}(x), where ǫ_{n−1} = 0 if U_{n−1} ≤ p and ǫ_{n−1} = X_{n−1} if U_{n−1} > p. (A simulation sketch of this model is given after this list.)

• The Block-Borges-Savits model. Fix a function p(t) : [0, ∞) → (0, 1). Define

Q(X_n ≤ x | X_1, U_1, . . . , X_{n−1}, U_{n−1}) = F_0(x) if U_{n−1} ≤ p(X_{n−1}), and = F_{X_{n−1}}(x) if U_{n−1} > p(X_{n−1}),

that is, F_{ǫ_{n−1}}(x), where ǫ_{n−1} = 0 if U_{n−1} ≤ p(X_{n−1}) and ǫ_{n−1} = X_{n−1} if U_{n−1} > p(X_{n−1}).

• The Kijima models. Together with the inter-failure times X_1, X_2, . . . we also have random variables D_1, D_2, . . . which represent the degree of repair, and

Q(X_n ≤ x | X_1, D_1, . . . , X_{n−1}, D_{n−1}) = F_{ǫ_{n−1}}(x),

where ǫ_{n−1} is a function of X_1, . . . , X_{n−1}, D_1, . . . , D_{n−1}. The two Kijima models are

– Kijima Model I: ǫ_{n−1} = Σ_{i=1}^{n−1} D_i X_i, and

– Kijima Model II: ǫ_{n−1} = Σ_{i=1}^{n−1} (Π_{j=i}^{n−1} D_j) X_i.
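As promised above, here is a simulation sketch of the Brown-Proschan model (with the perfect- and minimal-repair models as the special cases p = 1 and p = 0). Everything concrete is an illustrative assumption of mine: the life distribution F is taken to be a Weibull with shape 2 and scale 10, because its residual life df F_a can be inverted in closed form, and the ages follow the convention of the notes, where a minimal repair carries forward the previous inter-failure time.

import numpy as np

rng = np.random.default_rng(6)

k, lam = 2.0, 10.0          # Weibull shape and scale: F(x) = 1 - exp(-(x/lam)^k)

def sample_residual_life(a):
    """Draw from F_a(x) = (F(a + x) - F(a)) / (1 - F(a)) by inverting the Weibull df."""
    u = rng.random()
    return lam * ((a / lam) ** k - np.log(1.0 - u)) ** (1.0 / k) - a

def brown_proschan(n, p):
    """Inter-failure times X_1, ..., X_n: with probability p the repair is perfect
    (the age epsilon resets to 0), otherwise it is minimal and, as in the notes,
    the age used next is the previous inter-failure time."""
    x, eps = [], 0.0
    for _ in range(n):
        x_new = sample_residual_life(eps)
        x.append(x_new)
        eps = 0.0 if rng.random() <= p else x_new
    return np.array(x)

for p in [1.0, 0.5, 0.0]:   # p = 1: perfect repair only; p = 0: minimal repair only
    sims = brown_proschan(20_000, p)
    print(f"p = {p:3.1f}   mean inter-failure time = {sims.mean():6.3f}")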
We will now show how all these repair models can be described within a single general repair model. Let (X, A) be our basic space; in the above examples with life distributions, the space X was (0, ∞). Let P be a pm on (X, A). Define the joint distribution of the dependent random variables (inter-failure times) X_1, X_2, . . . , together with some environmental variables (or covariates) U_1, U_2, . . . , by

X_1 ∼ P,
L(X_2 | X_1) = P_{A_1},
. . .
L(X_n | X_1, . . . , X_{n−1}, U_1, . . . , U_{n−1}) = P_{A_{n−1}},

where A_{n−1} is a set in X that depends (measurably) on (X_1, . . . , X_{n−1}, U_1, . . . , U_{n−1}); i.e. Q(X_n ∈ A | X_1, . . . , X_{n−1}, U_1, . . . , U_{n−1}) should be well defined for A ∈ A. When there are no covariates U_1, U_2, . . . , this is accomplished if A_{n−1} is an (x_1, . . . , x_{n−1})-section of a set in A^n. In the above we have used the restricted measure P_A defined by

P_A(B) = P(A ∩ B)/P(A), B ∈ A.

Our goal is to estimate P, given X_1, . . . , X_n.

41 Bayesian methods with covariates

A Bayesian will consider the covariates to be constants if their distribution does not depend on the parameter of interest, namely P. To make this idea precise, let us examine the following two models.

Theorem 41.1. The data consist of (X, Y) and their distribution depends on two parameters (θ, δ).

First model: (θ, δ) are independent, with

θ ∼ π_1, δ ∼ π_2, Y | δ, θ ∼ g(y|δ), X | Y, δ, θ ∼ k(x|y, θ).

Second model: θ ∼ π_1, Y is fixed, and X | Y, θ ∼ k(x|y, θ).

In both models the posterior distribution of θ given the data (X, Y) is

π_1(θ|x, y) ∝ π_1(θ) k(x|y, θ),

which is what one gets by treating Y as a constant.

Proof: In the first model, the joint density of (X, Y, θ, δ) is

q(x, y, δ, θ) = π_1(θ) π_2(δ) g(y|δ) k(x|y, θ),

and the posterior distribution of θ given (X, Y) is proportional to this joint density after retaining only those factors that involve θ. Doing this we get π_1(θ|x, y) ∝ π_1(θ) k(x|y, θ). In the second model, the joint density is q(x, θ) = π_1(θ) k(x|y, θ), and the posterior density of θ given X is again π_1(θ|x) ∝ π_1(θ) k(x|y, θ). This completes the proof.

42 Partition based (PB) distributions

Let B = (B_1, . . . , B_k) be a measurable partition of X. Then we can write

P(·) = Σ_{i=1}^k P(B_i) P_{B_i}(·).    (42.33)

Define the vector P(B) = (P(B_1), . . . , P(B_k)), which varies in the simplex of R^k. There is a one-to-one relationship P ↔ (P(B), P_{B_1}, . . . , P_{B_k}) in view of (42.33). We denote the random vector P(B) by (Y_1, . . . , Y_k) and assume that it has a pdf c · h(y_1, . . . , y_k), where c is a normalizing constant, with respect to the Lebesgue measure on the simplex of R^k. Assume that P(B), P_{B_1}, . . . , P_{B_k} are independent, and further assume that the random pm's P_{B_1}, . . . , P_{B_k} have distributions G_1, . . . , G_k, respectively. This gives a distribution for P, which we define to be a Partition Based (PB) distribution and denote by H(B, h, G_1 × · · · × G_k).

Theorem 42.1. The Dirichlet measure D_α is a partition based (PB) distribution with respect to every partition B. In fact, D_α = H(B, h, G_1 × · · · × G_k) with h(y_1, . . . , y_k) = Π_{i=1}^k y_i^{α(B_i)−1} and G_i = D_{α_{B_i}}, provided α(B_i) > 0 for i = 1, . . . , k. Here α_B denotes the restriction of α to B, i.e. α_B(A) = α(A ∩ B).

We have already proved this theorem as a homework. Recall also: if P ∼ D_α, then (P(A), P(A^c)), P_A and P_{A^c} are independent.

We define the Partition Based Dirichlet (PBD) distribution D(B, h, α) to be the PB distribution H(B, h, G_1 × · · · × G_k) with G_i = D_{α_{B_i}}, i = 1, . . . , k; the pdf h of P(B) is allowed to be general. We will show in the next class that if P has the PBD distribution D(B, h, α), then it is also a PBD distribution on any sub-partition of B.
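To make the PB construction concrete, here is a sketch (mine) of drawing one P from the partition based decomposition of D_α in Theorem 42.1: the vector P(B) is drawn from the finite dimensional Dirichlet with parameters (α(B_1), . . . , α(B_k)) and, independently, each restricted pm P_{B_i} is drawn from D_{α_{B_i}} by truncated stick breaking. The space X = [0, ∞), the parameter α = 3 × Exponential(1), the cut points and the truncation level are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(7)

c = 3.0
cuts = [0.0, 1.0, 2.5, np.inf]                               # partition B_1, B_2, B_3
alpha_cdf = lambda t: 1.0 - np.exp(-t)                       # alpha_bar = Exponential(1)
alpha_B = c * np.diff([alpha_cdf(t) for t in cuts])          # (alpha(B_1), ..., alpha(B_k))

def sample_restricted_dp(lo, hi, mass, n_atoms=200):
    """P_{B_i} ~ D_{alpha_{B_i}}: stick breaking with total mass alpha(B_i) and atoms
    drawn from alpha_bar restricted to B_i = [lo, hi) by inverting the exponential df."""
    theta = rng.beta(1.0, mass, size=n_atoms)
    theta[-1] = 1.0
    p = theta * np.concatenate(([1.0], np.cumprod(1.0 - theta)[:-1]))
    u = rng.uniform(alpha_cdf(lo), alpha_cdf(hi), size=n_atoms)
    atoms = -np.log(1.0 - u)
    return atoms, p

# Step 1: P(B) ~ finite dimensional Dirichlet D(alpha(B_1), ..., alpha(B_k)).
y = rng.dirichlet(alpha_B)
# Step 2: independently, P_{B_i} ~ D_{alpha_{B_i}}; combine P = sum_i y_i P_{B_i}.
pieces = [sample_restricted_dp(cuts[i], cuts[i + 1], alpha_B[i]) for i in range(len(alpha_B))]
atoms = np.concatenate([a for a, _ in pieces])
weights = np.concatenate([y[i] * p for i, (_, p) in enumerate(pieces)])

print("total weight of the draw:", weights.sum())            # 1 up to rounding
print("P([0, 1)) from this draw:", weights[atoms < 1.0].sum(), "  E P([0, 1)) =", alpha_B[0] / c)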
April 15, 2008

43 Partition based distributions

In the last class we defined a large class of prior distributions for P, called Partition Based (PB) priors and denoted by H(B, h, G), where G = G_1 × · · · × G_m, as follows.

Definition 43.1. Under H(B, h, G),
(a) the vector P(B) and the m restricted pm's P_{B_1}, P_{B_2}, . . . , P_{B_m} are independently distributed,
(b) P(B) has the pdf c · h(y_1, . . . , y_m) for some normalizing constant c, and
(c) P_{B_r} has distribution G_r, r = 1, 2, . . . , m.

Here and later we will not worry about the non-uniqueness of h arising from a multiplicative constant. The Dirichlet distribution D_α is the PB distribution H(B, h, G) with h(y) = Π_{i=1}^m y_i^{α(B_i)−1} and G = D_{α_{B_1}} × · · · × D_{α_{B_m}}. A PB distribution H(B, h, G) is called the Partition Based Dirichlet (PBD) distribution D(B, h, α) if G = D_{α_{B_1}} × · · · × D_{α_{B_m}}. These are more general than Dirichlet distributions.

Let B* be a sub-partition of B. A PBD prior D(B, h, α) can also be represented as the PBD prior D(B*, h*, α), with an explicit expression for h* coming from the properties of the Dirichlet measure, as can be seen from the following theorem.

Theorem 43.1. Let B = (B_1, . . . , B_m) be a partition and consider the PBD distribution D(B, h, α), where h is a pdf on the simplex of R^m and α is a measure on (X, A) with α(B_r) > 0 for r = 1, . . . , m. Split the set B_r as B_{r1} ∪ B_{r2} with α(B_{r1}) > 0 and α(B_{r2}) > 0. Denote the partition (B_1, . . . , B_{r−1}, B_{r1}, B_{r2}, B_{r+1}, . . . , B_m) by B*. Let

h*(y_1, . . . , y_{r−1}, y_{r1}, y_{r2}, y_{r+1}, . . . , y_m) = c* · h(y_1, . . . , y_{r−1}, y_r, y_{r+1}, . . . , y_m) · y_{r1}^{α(B_{r1})−1} y_{r2}^{α(B_{r2})−1} / y_r^{α(B_r)−1},    (43.34)

where y_r = y_{r1} + y_{r2} and c* is a normalizing constant. Then

D(B, h, α) = D(B*, h*, α).    (43.35)

Proof: We can write

P = Σ_{1 ≤ i ≤ m} P(B_i) P_{B_i} = Σ_{1 ≤ i ≤ m, i ≠ r} P(B_i) P_{B_i} + P(B_{r1}) P_{B_{r1}} + P(B_{r2}) P_{B_{r2}}.

Since P ∼ D(B, h, α), the vector P(B) has pdf proportional to h(y_1, . . . , y_m) and is independent of P_{B_1}, . . . , P_{B_m}, which are themselves independently distributed as D_{α_{B_1}}, . . . , D_{α_{B_m}}. Note that (P_{B_r})_{B_{rj}} = P_{B_{rj}} and (α_{B_r})_{B_{rj}} = α_{B_{rj}}, j = 1, 2. Since P_{B_r} ∼ D_{α_{B_r}}, it follows that (P(B_{r1})/P(B_r), P(B_{r2})/P(B_r)) has pdf proportional to y_{r1}^{α(B_{r1})−1} y_{r2}^{α(B_{r2})−1} / y_r^{α(B_r)−1}, with y_r = y_{r1} + y_{r2}. This vector is also independent of P_{B_{r1}} and P_{B_{r2}}, which are themselves independent with distributions D_{α_{B_{r1}}} and D_{α_{B_{r2}}}, respectively. This means that P(B*) and (P_{B_j}, B_j ∈ B*) have the requisite distributions to say that P ∼ D(B*, h*, α), where h* is as defined in (43.34).

44 Posterior distributions under PB priors when the data come from the general repair model

The theorem below shows that, in the general repair model, when the prior is a PB distribution the posterior distribution is also PB.

Theorem 44.1. Let B = (B_1, . . . , B_m) be a partition and let G = G_1 × · · · × G_m, where G_r is the distribution of a random pm such that G_r({P : P(B_r) = 1}) = 1, r = 1, . . . , m. Let the random pm P have distribution H(B, h, G). Let A be a measurable set that is a union of some sets in the partition B, say A = ∪_{s ∈ E} B_s, where E is a subset of the indexes {1, 2, . . . , m}. Let the random variable Y be such that its distribution given P is P_A, i.e. Y | P ∼ P_A. Let r = r(Y) be the random index in E such that Y ∈ B_r. Then the posterior distribution of P given Y can be described as follows:

1. The random vector P(B) and the random pm's P_{B_1}, . . . , P_{B_m} are independently distributed.
2. The density of P(B) is c · h^Y(y_1, . . . , y_m), where h^Y(y_1, . . . , y_m) = h(y_1, . . . , y_m) · y_r / (Σ_{s ∈ E} y_s) and c is a normalizing constant.
3. For t ≠ r the distribution of P_{B_t} is G_t, unchanged from the prior distribution.
4. The distribution of P_{B_r} is G_r^Y (the posterior distribution in a standard nonparametric problem).

This result may be succinctly stated as: the posterior distribution is

H^Y(B, h, G) = H(B, h^Y, G^Y),    (44.36)

where G^Y =_def G_1 × · · · × G_{r−1} × G_r^Y × G_{r+1} × · · · × G_m.
Proof: Let Q be the joint distribution of (P(B), (P_{B_t}, t = 1, . . . , m), Y, r(Y)). Notice that

Q(r(Y) = j | P) = P(B_j) / Σ_{t ∈ E} P(B_t) and Q(Y ∈ C | P, r(Y) = j) = P_{B_j}(C).

This immediately shows that, given Y and r(Y) = j, the random vector P(B) is independent of P_{B_1}, . . . , P_{B_m} and has pdf proportional to h(y) · y_j / Σ_{t ∈ E} y_t. Also, since the distribution of Y given P and r(Y) = j is P_{B_j}, the distribution of P_{B_j} given Y and r(Y) = j is G_j^Y, and the distribution of P_{B_t}, t ≠ j, given Y and r(Y) = j is unchanged. Thus the distribution of P given Y is H(B, h^Y, G^Y), as stated in (44.36).

In Theorem 44.2 below we show that when the prior is a PBD distribution the posterior is also a PBD distribution, whether or not the restriction set A is a union of sets in the partition B.

Theorem 44.2. Let P have a PBD distribution D(B, h, α). Let the random variable Y be such that its distribution given P is P_A, i.e. Y | P ∼ P_A. Let B* = {B*_1, . . . , B*_{m*}} be the partition generated by B and A. The distribution of P can be written as D(B*, h*, α) for some pdf h* on the simplex in R^{m*}, in view of Theorem 43.1. Now A is a union of some sets B*_s in the partition B*, i.e. there is a subset E* ⊂ {1, 2, . . . , m*} such that A = ∪_{s ∈ E*} B*_s. Let r = r(Y) denote the random index in E* such that Y ∈ B*_r. Then the posterior distribution of P given Y can be written as

D(B*, h*^Y, α + δ_Y),

where

h*^Y(y) = c* · h*(y) · y_r / Σ_{s ∈ E*} y_s.    (44.37)

Proof: This follows immediately from Theorem 44.1. We have implicitly assumed that α(B*_t) > 0 for all t; otherwise we simply remove the sets B*_t with α(B*_t) = 0 from the partition B*.

Theorem 44.1 obtains, under a PB prior, the posterior distribution in the general repair model when the restriction set A is a union of sets in the partition B. When A is not necessarily such a union, Theorem 44.2 shows how to obtain, under a PBD prior, the posterior distribution by enlarging the partition B with the restriction set A to the larger partition B*, which ensures that A is a union of sets in B*. This step is valid because the restriction set A is non-random. The same step can be justified, under PBD priors, even if we enlarge the partition with yet another set C that depends on the observation Y; this is described in Theorem 44.3.
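Before stating Theorem 44.3, here is a small numerical check of Theorem 44.1 (my own sketch; the partition size, the Dirichlet form of h, and the index set E are illustrative assumptions, and the G_r are taken to be degenerate so that only the weight update in item 2 is being tested). It verifies that the conditional distribution of P(B) given that Y landed in B_1 can be recovered by reweighting draws from h with the factor y_1/Σ_{s ∈ E} y_s, exactly as the formula for h^Y prescribes.

import numpy as np

rng = np.random.default_rng(8)

# Illustrative setup: m = 3 blocks, h = Dirichlet(2, 3, 4) density, A = B_1 u B_2
# (E = {0, 1} in 0-based indexing), and Y | P ~ P_A, so that
# Pr(r(Y) = B_1 | P(B) = y) = y_0 / (y_0 + y_1).
a = np.array([2.0, 3.0, 4.0])
N = 400_000

y = rng.dirichlet(a, size=N)                       # P(B) drawn from the prior density h
probs = y[:, 0] / (y[:, 0] + y[:, 1])              # Pr(Y in B_1 | y)
r = np.where(rng.random(N) < probs, 0, 1)          # simulate which block Y fell into

# Direct conditioning: E[P(B) | Y in B_1].
direct = y[r == 0].mean(axis=0)

# Theorem 44.1: the posterior density of P(B) is proportional to h(y) * y_0/(y_0 + y_1),
# so the same expectations can be computed by reweighting draws from h.
w = y[:, 0] / (y[:, 0] + y[:, 1])
reweighted = (y * w[:, None]).sum(axis=0) / w.sum()

print("E[P(B) | Y in B_1], direct conditioning:", np.round(direct, 4))
print("E[P(B) | Y in B_1], h^Y reweighting    :", np.round(reweighted, 4))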
Theorem 44.3. Let P ∼ D(B, h, α). Let the conditional distribution of Y given P be P_A, where we may assume that A = ∪_{i=1}^r B_i without loss of generality, in view of Theorem 43.1. Let r(Y) = j if Y ∈ B_j. The posterior distribution of P given Y, r(Y) = j is

D(B, h^Y, α + δ_Y),    (44.38)

where, from Theorem 44.2,

h^Y(y) = [y_j / Σ_{i=1}^r y_i] h(y).    (44.39)

Suppose that C is a set that depends on Y in a measurable way, i.e. {(x, y) : x ∈ C(y)} belongs to the product σ-field A × A. Consider the enlarged partition B* = (B_{i1} = B_i ∩ C, B_{i2} = B_i ∩ C^c, 1 ≤ i ≤ m). After Y has been observed, C is a non-random set, and hence we can use Theorem 43.1 to rewrite the posterior distribution in (44.38) and (44.39) as

D(B*, h*^Y, α + δ_Y),    (44.40)

where

h*^Y(y*) = [y_j / Σ_{i=1}^r y_i] · Π_{i=1}^m [y_{i1}^{α(B_{i1})−1} y_{i2}^{α(B_{i2})−1} / y_i^{α(B_i)−1}] · h(y),    (44.41)

with y_{i1} + y_{i2} = y_i, 1 ≤ i ≤ m. The same answer is also obtained if we incorrectly express the prior D(B, h, α) (by an invalid application of Theorem 43.1, since C is random before Y is observed) as D(B*, h*, α) with

h*(y*) = Π_{i=1}^m [y_{i1}^{α(B_{i1})−1} y_{i2}^{α(B_{i2})−1} / y_i^{α(B_i)−1}] · h(y)    (44.42)–(44.43)

and formally calculate the posterior distribution using Theorem 44.2.

Proof: Note that {r(Y) = j} = {Y ∈ B_j} = {Y ∈ B_{j1}} ∪ {Y ∈ B_{j2}}, 1 ≤ j ≤ r. Using the (unjustified) distribution in (44.42)–(44.43) as the prior, a blind application of Theorem 44.2 leads to the following as the posterior distribution:

D(B*, h**, α + δ_Y),    (44.44)

with

h**(y*) = [(y_{j1} + y_{j2}) / Σ_{i=1}^r (y_{i1} + y_{i2})] · Π_{i=1}^m [y_{i1}^{α(B_{i1})−1} y_{i2}^{α(B_{i2})−1} / y_i^{α(B_i)−1}] · h(y),    (44.45)

which is the same as (44.40) and (44.41).

April 17, 2008

45 Example illustrating Theorem 44.3

We will illustrate Theorem 44.3 with a simple example. Let P ∼ D_α and X|P ∼ P. Let α have a positive non-atomic part and let A(x) = (x − ǫ, x + ǫ); note that α(A(x)) > 0 for each x. We know that the distribution of P given X is D_{α + δ_X}. We will obtain this answer by using the steps outlined in Theorem 44.3. Consider the random set A = A(X) and the partition B = (A, A^c). By a blind application of Theorem 43.1 one can write the prior as a PBD distribution on the partition B as follows:

D_α = D(B, h(y_1, y_2), α),    (45.46)

where h(y_1, y_2) = y_1^{α(A)−1} y_2^{α(A^c)−1}. Now X|P ∼ P = P_X (there is no censoring, the restriction set being the whole space X) and r(X) = 1, since X ∈ A. From another blind application of Theorem 44.2, the posterior distribution of P given X is D(B, h^X, α + δ_X), where

h^X = h(y_1, y_2) · y_1/(y_1 + y_2) = y_1^{α(A)} y_2^{α(A^c)−1} = y_1^{α(A)+δ_X(A)−1} y_2^{α(A^c)+δ_X(A^c)−1}.

From the standard properties of the Dirichlet distribution, the above PBD distribution is equal to D_{α + δ_X}, which is the correct answer.

46 Bayesian methods with censored observations

Let P have some prior distribution. Let X be a sample from P, i.e. X|P ∼ P. Let A be a measurable subset, which can depend on previous observations and covariates, if any. Then, when X is censored by the set A, we observe only the random variable Y defined by

Y = X if X ∈ A^c, and Y = θ if X ∈ A.    (46.47)

The question is: what is the posterior distribution of P given the censored value Y?

Theorem 46.1. Let the prior distribution of P be the PB distribution H(B, h, G), where B = (B_1, . . . , B_k), h is the pdf of P(B) and G is the joint distribution of (P_{B_1}, . . . , P_{B_k}). Suppose that the censoring set A satisfies

A = ∪_{i=1}^r B_i.

Let X|P ∼ P and let Y, as defined in (46.47), be the censored value. Then the posterior distribution of P given Y = θ is H(B, h^Y, G), where

h^Y(y) = y_A h(y), with y_A = Σ_{i=1}^r y_i.    (46.48)

Proof: If Y ∈ A^c, then P|Y ∼ D_{α + δ_Y} (the uncensored case, for a Dirichlet prior). For the case Y = θ, note that

Q((P(B_1), . . . , P(B_k)) ∈ C, (P_{B_1}, . . . , P_{B_k}) ∈ D, X ∈ A) = ∫_{y ∈ C} y_A h(y) dy ∫_D G(dP_{B_1}, . . . , dP_{B_k}).

Therefore L((P_{B_1}, . . . , P_{B_k}) | X ∈ A) = G, and the pdf of (P(B_1), . . . , P(B_k)) given X ∈ A is proportional to y_A h(y). This completes the proof of the theorem.

The following theorem, stated without proof, is immediate.

Theorem 46.2. Let P ∼ H(B, h, G). Let A_1 ∈ A. For n = 2, 3, . . . , let A_n depend measurably on (X_1, . . . , X_{n−1}), i.e. {(x_1, . . . , x_n) : x_n ∈ A_n(x_1, . . . , x_{n−1})} ∈ A^n. Let X_1, X_2, . . . , X_n be such that

X_1 | P ∼ P, with Y_1 = θ_1 if X_1 ∈ A_1,
X_2 | P, X_1 ∼ P, with Y_2 = θ_2 if X_2 ∈ A_2,
X_3 | P, X_2, X_1 ∼ P, with Y_3 = θ_3 if X_3 ∈ A_3,
. . .

For each i assume that A_i is a union of sets in the partition B. Then

P | (Y_1 = θ_1, . . . , Y_n = θ_n) ∼ H(B, h*, G), where h* = h · y_{A_1} · · · y_{A_n}.

Note that, in the above two theorems, we have slightly extended the definition of PB priors from the previous lecture by allowing P_{B_1}, . . . , P_{B_k} to be dependent.

47 Application

We will now apply the above results to the estimation of the distribution function in the standard censoring problem. The Kaplan-Meier estimate is the frequentist estimate in this problem. Susarla and Van Ryzin gave the Bayes estimate under a Dirichlet prior. Let P ∼ D_α and let X_1, . . . , X_n | P be i.i.d. P.
Suppose that X_1, . . . , X_r are uncensored, and therefore observed, and let the rest be censored as follows: X_{r+1} ∈ A_{r+1}, . . . , X_n ∈ A_n, where A_{r+1}, . . . , A_n can depend on previous observations or on independent covariates. We will now obtain the posterior distribution of P given such data.

Consider the partition B defined from A_{r+1}, . . . , A_n, and view D_α as the PBD distribution D(B, h, α). An extension of Theorem 46.1 along the lines of Theorem 44.3 is valid, so we may use the above PBD distribution (even though it is based on sets depending on the sample) as a prior and blindly apply Theorem 46.1 to obtain the posterior distribution

D(B, h*, α + Σ_{i=1}^r δ_{X_i}), where h* = h · y_{A_{r+1}} · · · y_{A_n}.

The Bayes estimate of P(A) is just the expectation of P(A) under the above PBD posterior distribution. Since P(A) = Σ_i P(B_i) P_{B_i}(A), this is calculated as

P̂(A) = E(P(A)) = Σ_i E(y_i) α(A ∩ B_i)/α(B_i),

where

E(y_i) = ∫ y_i h(y) y_{A_{r+1}} · · · y_{A_n} dy / ∫ h(y) y_{A_{r+1}} · · · y_{A_n} dy = E(Z_i Z_{A_{r+1}} · · · Z_{A_n}) / E(Z_{A_{r+1}} · · · Z_{A_n}),

with (Z_1, . . . , Z_k) ∼ D(α(B_1), . . . , α(B_k)), which can be generated as Z_i = w_i/Σ_j w_j with w_1, . . . , w_k independent Gamma random variables with parameters (1, α(B_1)), . . . , (1, α(B_k)), and with Z_A = Σ_{B_i ⊂ A} Z_i. (Here, and in the formulas above, α should be understood as already updated by the uncensored observations, i.e. as α + Σ_{i=1}^r δ_{X_i}, exactly as is done explicitly in the example of the next section.) If a closed form is not available, one can generate independent vectors (Z_1^{(j)}, . . . , Z_k^{(j)}), j = 1, . . . , N, as above and use the Law of Large Numbers (LLN) to approximate E(y_i) by

Σ_j Z_i^{(j)} Z_{A_{r+1}}^{(j)} · · · Z_{A_n}^{(j)} / Σ_j Z_{A_{r+1}}^{(j)} · · · Z_{A_n}^{(j)}.

In the right censoring example of Susarla and Van Ryzin, the sets A_j, j = r + 1, . . . , n, are intervals of the form (x, ∞). Then the pdf h* of the posterior distribution is the pdf of a Connor-Mosimann distribution, and quantities like E(y_i) can be evaluated in closed form. The Bayes estimate of P((x, ∞)) is the same as the Susarla-Van Ryzin estimate.
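The LLN method just described is easy to code. The sketch below (mine) redoes the computation in miniature: a partition with k = 4 cells, a parameter vector (α(B_1), . . . , α(B_4)) chosen purely for illustration, and two censored observations whose censoring sets are unions of the later cells. It approximates E(y_i) by the ratio of Monte Carlo averages of Z_i Z_{A_1} Z_{A_2} and Z_{A_1} Z_{A_2} over draws of (Z_1, . . . , Z_k) from the finite dimensional Dirichlet, generated through independent Gamma variables as in the notes.

import numpy as np

rng = np.random.default_rng(9)

# Illustrative miniature problem: partition B_1, ..., B_4 with the parameters below,
# and two censoring sets  A_1 = B_2 u B_3 u B_4  and  A_2 = B_4.
alpha_B = np.array([1.5, 1.0, 0.8, 0.7])
censor_sets = [[1, 2, 3], [3]]            # cells (0-based) making up A_1 and A_2
N = 200_000

# (Z_1, ..., Z_k) ~ D(alpha(B_1), ..., alpha(B_k)) via independent Gamma variables.
w = rng.gamma(shape=alpha_B, size=(N, len(alpha_B)))
Z = w / w.sum(axis=1, keepdims=True)

# Product of the censoring-set sums Z_{A_1} * Z_{A_2} for each draw.
prod_ZA = np.ones(N)
for cells in censor_sets:
    prod_ZA *= Z[:, cells].sum(axis=1)

# E(y_i) = E(Z_i Z_{A_1} Z_{A_2}) / E(Z_{A_1} Z_{A_2}), approximated by sample means (LLN).
E_y = (Z * prod_ZA[:, None]).mean(axis=0) / prod_ZA.mean()

print("approximate E(y_i):", np.round(E_y, 4), "  (sum =", round(E_y.sum(), 4), ")")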
48 Reexamining the Susarla-Van Ryzin example

Susarla and Van Ryzin considered the data set

0.8, 1.0+, 2.7+, 3.1, 5.4, 7.0+, 9.2, 12.1+,

where a+ denotes that the censoring set was [a, ∞) and that the potential observation was censored. They used a Dirichlet prior for P with parameter α equal to 8 times the exponential distribution with failure rate 0.12. The posterior distribution given the uncensored observations is Dirichlet with parameter

α* = α + δ_{0.8} + δ_{3.1} + δ_{5.4} + δ_{9.2}.

We can take this to be the prior distribution and say that the remaining data 1.0+, 2.7+, 7.0+, 12.1+ are all censored. The partition formed by the censoring sets is

B = (B_1, . . . , B_5) = ([0, 1.0), [1.0, 2.7), [2.7, 7.0), [7.0, 12.1), [12.1, ∞)).

The posterior distribution given all the data is therefore

D(B, h(y) Y_2 Y_3 Y_4 Y_5, α*),

where Y_j = y_j + · · · + y_5, j = 2, . . . , 5, are the tail sums of the y's and h(y) ∝ Π_{j=1}^5 y_j^{α*(B_j)−1}.

We used our method to recalculate the estimates of F(t) based on the Susarla-Van Ryzin data. When the transformation to independent Beta variables is used, we get the same results as Susarla and Van Ryzin; however, this depends on the fact that all the censoring was right censoring. We also used the LLN method to compute estimates of F(t). One can also employ the MCMC method outlined earlier to obtain these Bayes estimates. The table below compares these methods and also gives the Kaplan-Meier estimator (KME).

Estimates F̂(t):

t       0.80    1.00+   2.70+   3.10    5.40    7.00+   9.20    12.10+
Exact   0.1083  0.1190  0.2071  0.3006  0.4719  0.5256  0.6823  0.7501
LLN     0.1082  0.1189  0.2068  0.3000  0.4708  0.5244  0.6810  0.7487
MCMC    0.1083  0.1190  0.2084  0.3011  0.4706  0.5261  0.6802  0.7476
KME     0.1250  0.1450  0.1250  0.3000  0.4750  0.4750  0.7375  0.7375

49 Bayes estimates with left and right censoring

Suppose that we write the data from the example of Susarla-Van Ryzin as

0.8−, 1.0+, 2.7+, 3.1−, 5.4−, 7.0+, 9.2−, 12.1+,

where a− indicates that a potential observation was censored to the interval [0, a) (i.e. left censored). The data generate a partition B = (B_1, . . . , B_9) of 9 intervals. If we use a Dirichlet prior D_α, it can also be viewed as a PB Dirichlet distribution on this partition. The prior distribution of P(B) = (P(B_1), . . . , P(B_9)) is the finite dimensional Dirichlet with pdf proportional to

h(y) = Π_{i=1}^9 y_i^{α(B_i)−1}.

The posterior distribution of P is the PB Dirichlet distribution

D(B, h(y) Z_1 Y_3 Y_4 Z_5 Z_5 Y_7 Z_8 Y_9, α),

where Z_j = Σ_{i=1}^j y_i and Y_j = Σ_{i=j}^9 y_i. This also illustrates how to handle any kind of censoring in an arbitrary space. A closed form expression for E(y_j) under the posterior distribution is not available. However, it is just

E_h(y_j Z_1 Y_3 Y_4 Z_5 Z_5 Y_7 Z_8 Y_9) / E_h(Z_1 Y_3 Y_4 Z_5 Z_5 Y_7 Z_8 Y_9),

where E_h denotes expectation under the finite dimensional Dirichlet distribution D(α(B_1), . . . , α(B_9)). We can generate samples (y_1^r, . . . , y_9^r), r = 1, . . . , N, from h(y) by using independent Gamma random variables and approximate the above by

Σ_{r=1}^N y_j^r Z_1^r Y_3^r Y_4^r Z_5^r Z_5^r Y_7^r Z_8^r Y_9^r / Σ_{r=1}^N Z_1^r Y_3^r Y_4^r Z_5^r Z_5^r Y_7^r Z_8^r Y_9^r.

This method is justified by the Law of Large Numbers, and good rates of convergence are well known. It does not use imputation and MCMC methods, which add another level of randomness to the final answer. The method will handle any kind of censoring and is applicable to distributions on multidimensional spaces. Of course, we can estimate many features other than the distribution function, e.g. the median, quantiles, etc.

We used the same Dirichlet prior for P, namely the Dirichlet prior with parameter α equal to 8 times the exponential distribution with failure rate 0.12. Here is the Bayes estimate of F(t) under this two-way censoring.

t      0.80−   1.00+   2.70+   3.10−   5.40−   7.00+   9.20−   12.10+
LLN    0.1737  0.1937  0.3625  0.3968  0.5274  0.5990  0.6796  0.7721

[Figure: Estimates of the df from censored (all types) data, based on PB priors. The plot shows the Bayes estimate of F(t) under two-way censoring, the Bayes estimate under right censoring only, and the prior df.]

50 Revisiting Bayes estimation for repair models

We will now give an example of Bayes estimation of the distribution (survival) function when the data come from a repair model. The following data consist of inter-failure times, with observations from a χ²-distribution with 5 degrees of freedom and some maintenance schedule. When a repair is made it is a minimal repair, indicated by a 0 in the column on the left; at random times replacements (perfect repairs) are made, indicated by a 1.

new/repair   age
1            5.0093
0            8.9197
1            9.2638
0            12.7893
0            14.6230
0            16.9651
0            19.4155
1            2.4406
0            7.7570
0            8.2593
0            9.8226
0            11.7806
0            13.2864

Based on the order statistics of these inter-failure times (ages), one can partition (0, ∞) into 15 intervals and rearrange the data as follows.
new/repair   age        partition no. (rank)   starting partition no.
1            5.0093     2                      1
0            8.9197     5                      3
1            9.2638     6                      1
0            12.7893    9                      7
0            14.6230    11                     10
0            16.9651    12                     12
0            19.4155    13                     13
1            2.4406     1                      1
0            7.7570     3                      2
0            8.2593     4                      4
0            9.8226     7                      5
0            11.7806    8                      8
0            13.2864    10                     9

The last two columns indicate the partition cell into which the age falls, and the number of the first partition cell into which it could have fallen. Thus, if the previous repair was a perfect repair, the age could have fallen in any partition cell starting from the first one; if it was a minimal repair, the age can fall in any partition cell starting from the one containing the age at the previous observation. The following figure displays the same information.

[Figure: the censoring/partition structure of the repair data displayed graphically.]

We used a Dirichlet prior with parameter equal to 2 times a χ²-distribution with 2 degrees of freedom and calculated the Bayes estimate of the survival function. The Bayes estimate, the frequentist estimate of Whitaker and Samaniego, and the true survival function are shown in the following figure.

[Figure: Bayes estimate of the survival function, the Whitaker-Samaniego frequentist estimate, and the true survival function.]

April 22, 2008

51 Students and the papers presented - I

Student        Paper title                                                   Authors
Laura Taylor   Bayesian nonparametric estimation in a series system          Salinas-Torres, Pereira and Tiwari
               or a competing risks model
Alex McLain    On choosing the centering distribution in Dirichlet           Hanson, Sethuraman and Xu
               process mixture models

April 24, 2008

52 Students and the papers presented - II

Student        Paper title                                                   Authors
Na Yang        Gibbs sampling methods for stick-breaking priors              Ishwaran and James
Shuang Li      Gibbs sampling methods for stick-breaking priors              Ishwaran and James
Peng Chen      Variational methods for the Dirichlet process                 Blei and Jordan