A survey of the Dirichlet process and its role in nonparametrics

Jayaram Sethuraman
Department of Statistics, Florida State University, Tallahassee, FL 32306
sethu@stat.fsu.edu
May 9, 2013

Summary
• Elementary frequentist and Bayesian methods
• Bayesian nonparametrics
• Nonparametric priors in a natural way
• Nonparametric priors and exchangeable random variables; Pólya urn sequences
• Nonparametric priors through other approaches
• Sethuraman construction of Dirichlet priors
• Some properties of Dirichlet priors
• Bayes hierarchical models

Frequentist methods
Data $X_1, \ldots, X_n$ are i.i.d. $F_\theta$, where $\{F_\theta : \theta \in \Theta\}$ is a family of distributions. For instance, the data can be $X_1, \ldots, X_n$ i.i.d. $N(\theta, \sigma^2)$, with $\theta \in \Theta = (-\infty, \infty)$ and $\sigma^2$ known. Then the frequentist estimate of $\theta$ is $\bar X_n = (1/n)\sum_1^n X_i$.

Bayesian methods
However, a Bayesian would consider $\theta$ to be random (by seeking experts' opinions, etc.) and would put a distribution on $\Theta$, say a $N(\mu, \tau^2)$ distribution, and call it the prior distribution. Then the posterior distribution of $\theta$, the conditional distribution of $\theta$ given the data $X_1, \ldots, X_n$, is
$$N\Big(\frac{n\bar X_n/\sigma^2 + \mu/\tau^2}{n/\sigma^2 + 1/\tau^2},\ \frac{1}{n/\sigma^2 + 1/\tau^2}\Big).$$
The Bayes estimate of $\theta$ is the expectation of $\theta$ under its posterior distribution, namely
$$\frac{n\bar X_n/\sigma^2 + \mu/\tau^2}{n/\sigma^2 + 1/\tau^2}.$$

Beginning nonparametrics
Suppose that $X_1, \ldots, X_n$ are i.i.d. and take on only a finite number of values, say in $\mathcal X = \{1, \ldots, k\}$. The most general probability distribution for $X_1$ is $p = (p_1, \ldots, p_k)$ taking values in the unit simplex in $R^k$, i.e. $p_j \ge 0$, $j = 1, \ldots, k$, and $\sum p_j = 1$. Let $N = (N_1, \ldots, N_k)$, where $N_j = \#\{i : X_i = j,\ i = 1, \ldots, n\}$ is the observed frequency of $j$. The frequentist estimate of $p_j$ is $N_j/n$, and the estimate of $p$ is $(N_1, \ldots, N_k)/n$.

Bayes analysis
The class of all priors for $p$ is just the class of distributions of the $(k-1)$-dimensional vector $(p_1, \ldots, p_{k-1})$. A special distribution is the finite dimensional Dirichlet distribution $D(a)$ with parameter $a = (a_1, \ldots, a_k)$, with a density function proportional to
$$\prod_{j=1}^k p_j^{a_j - 1}, \qquad a_j \ge 0,\ j = 1, \ldots, k, \quad A = \sum a_j > 0.$$
One should really use ratios of independent Gammas to their sum to take care of cases where $a_j = 0$.

Bayes analysis
The posterior distribution of $p$ given the data $X_1, \ldots, X_n$ then has a density proportional to
$$\prod_{j=1}^k p_j^{a_j + N_j - 1},$$
which is the finite dimensional Dirichlet distribution with parameter $a + N$. Thus the Bayes estimate of $p$ is $\frac{a + N}{A + n}$.
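As a quick illustration of this conjugate update, here is a minimal numpy sketch; the prior vector `a` and the data-generating `p_true` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 4
a = np.array([1.0, 1.0, 1.0, 1.0])       # Dirichlet prior parameter a
p_true = np.array([0.1, 0.2, 0.3, 0.4])  # "true" p generating the data

# Observe n i.i.d. draws (labels 0..k-1 stand in for {1,...,k})
# and tabulate the frequency vector N.
n = 100
X = rng.choice(k, size=n, p=p_true)
N = np.bincount(X, minlength=k)

# Posterior is Dirichlet(a + N); the Bayes estimate of p is (a + N)/(A + n).
A = a.sum()
bayes_estimate = (a + N) / (A + n)
freq_estimate = N / n
print(bayes_estimate, freq_estimate)
```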
The standard nonparametric problem
Let $X = (X_1, \ldots, X_n)$ be i.i.d. random variables on $R^1$ with a common distribution function (df) $F$ (or with a common probability measure (pm) $P$). Let $F_n(x) = (1/n)\sum I(X_i \le x)$ and $P_n(A) = (1/n)\sum I(X_i \in A)$ be the empirical distribution function (edf) and empirical probability measure (epm) of $X$. Then $F_n$ ($P_n$) is the frequentist estimate of $F$ ($P$). In fact $F_n(x) \to F(x)$ and $P_n(A) \to P(A)$ for each $x \in R^1$ and each $A$ in the Borel sigma field $\mathcal B$ in $R^1$.

Nonparametric priors in a natural way
The parameter $F$ or $P$ in this nonparametric problem is different from the finite parameter case. The first and most natural attempt to introduce a distribution for $P$ was to mimic the case of random variables taking a finite number of values. Consider a finite partition $A = (A_1, \ldots, A_k)$ of $R^1$. Ferguson (1973) defined the distribution of $(P(A_1), \ldots, P(A_k))$ to be a finite dimensional Dirichlet distribution $D(\alpha\beta(A_1), \ldots, \alpha\beta(A_k))$, where $\alpha > 0$ and $\beta(\cdot)$ is a pm on $(R^1, \mathcal B)$.

Nonparametric priors in a natural way
By a distribution for $P$ we mean a distribution on $\mathcal P$, the space of all probability measures on $R^1$. More precisely, we will also have to specify a $\sigma$-field $\mathcal S$ in $\mathcal P$. We will take as our $\mathcal S$ the smallest $\sigma$-field containing sets of the form $\{P : P(A) \le r\}$ for all $A \in \mathcal B$ and $0 \le r \le 1$. It can be shown that Ferguson's choice is a probability distribution on $(\mathcal P, \mathcal S)$; it is called the Dirichlet distribution with parameter $\alpha\beta(\cdot)$ and is denoted by $D(\alpha\beta(\cdot))$. It is also called the Dirichlet process, since it is the distribution of $F(x)$, $x \in R^1$.
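To make Ferguson's partition definition concrete, here is a small sketch realizing $(P(A_1), \ldots, P(A_k))$ as ratios of independent Gammas to their sum, as suggested earlier; the partition of $R^1$ and the base measure $\beta = N(0, 1)$ are illustrative choices.

```python
import numpy as np
from scipy import stats  # only for the N(0,1) base measure cdf

rng = np.random.default_rng(1)

alpha = 5.0
# Partition of R^1 into A1 = (-inf,-1], A2 = (-1,1], A3 = (1,inf),
# with base measure beta = N(0,1).
cuts = np.array([-np.inf, -1.0, 1.0, np.inf])
beta_mass = np.diff(stats.norm.cdf(cuts))  # (beta(A1), beta(A2), beta(A3))

# (P(A1), P(A2), P(A3)) ~ D(alpha*beta(A1), ..., alpha*beta(A3)),
# realized as independent Gammas divided by their sum.
# (A Gamma with shape 0 is a point mass at 0, which covers beta(Aj) = 0.)
G = rng.gamma(shape=alpha * beta_mass)
P_cells = G / G.sum()
print(beta_mass, P_cells)
```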
Nonparametric priors in a natural way
Current usage calls $\alpha$ the scale factor and $\beta(\cdot)$ the base measure of the Dirichlet distribution $D(\alpha\beta(\cdot))$. Ferguson showed that the posterior distribution of $P$ given the data $X$ is the Dirichlet distribution $D(\alpha\beta(\cdot) + nP_n(\cdot))$, and thus the Bayes estimate of $P$ is $\frac{\alpha\beta(\cdot) + nP_n(\cdot)}{\alpha + n}$. This is analogous to the results in the finite sample space case. Under $D(\alpha\beta)$, $(P(A_1), \ldots, P(A_k))$ has a finite dimensional Dirichlet distribution.

Nonparametric priors in a natural way
Two assertions were made in the last slide.
• $D(\alpha\beta)$ is a probability measure on $(\mathcal P, \mathcal S)$.
• The posterior distribution given $X$ is $D(\alpha\beta + nP_n)$.
Another property of the Dirichlet prior that was disconcerting at first:
• $D(\alpha\beta)$ gives probability 1 to the subset of discrete probability measures on $(R^1, \mathcal B)$.
The proofs needed some effort and were sometimes mystifying.

Nonparametric priors and exchangeable random variables
The class of all nonparametric priors is the same as the class of all exchangeable sequences of random variables! This follows from De Finetti's theorem (1931). See also Hewitt and Savage (1955), Kingman (1978). Let $X_1, X_2, \ldots$ be an infinite exchangeable sequence of random variables (i.e., one whose joint distribution is invariant under permutations of finitely many coordinates) with joint distribution $Q$. Then:
1. The empirical distribution functions satisfy $F_n(x) \to F(x)$ with probability 1 for all $x$; in fact, $\sup_x |F_n(x) - F(x)| \to 0$ with probability 1. (Note that $F(x)$ is a random distribution function.)
2. The empirical probability measures satisfy $P_n \overset{w}{\to} P$. This $P$ is a random probability measure.
3. Given $P$, the variables $X_1, X_2, \ldots$ are i.i.d. $P$.
4. Thus the distribution of $P$ under $Q$, denoted by $\nu^Q$, is a nonparametric prior.
5. The class of all nonparametric priors arises in this fashion.
6. The distribution of $X_2, X_3, \ldots$ given $X_1$ is also exchangeable; it will be denoted by $Q_{X_1}$.
7. The limit $P$ of the empirical probability measures of $X_1, X_2, \ldots$ is also the limit of the empirical probability measures of $X_2, X_3, \ldots$. Thus the distribution of $P$ given $X_1$ (the posterior distribution) is the distribution of $P$ under $Q_{X_1}$ and, by mere notation, is $\nu^{Q_{X_1}}$.

Dirichlet prior based on Pólya urn sequences
The Pólya urn sequence is an example of an infinite exchangeable sequence of random variables. Let $\beta$ be a pm on $R^1$ and let $\alpha > 0$. Define the joint distribution $\mathrm{Pol}(\alpha, \beta)$ of $X_1, X_2, \ldots$ through
$$X_1 \sim \beta, \qquad X_n \mid (X_1, \ldots, X_{n-1}) \sim \frac{\alpha\beta + \sum_1^{n-1}\delta_{X_i}}{\alpha + n - 1}, \quad n = 2, 3, \ldots$$
This defines $\mathrm{Pol}(\alpha, \beta)$ as an exchangeable probability measure.

Dirichlet prior based on Pólya urn sequences
The nonparametric prior $\nu^{\mathrm{Pol}(\alpha,\beta)}$ is the same as the Dirichlet prior $D(\alpha\beta)$!
• That is, the distribution of $(P(A_1), \ldots, P(A_k))$ for any partition $(A_1, \ldots, A_k)$, under $\mathrm{Pol}(\alpha, \beta)$, is the finite dimensional Dirichlet $D(\alpha\beta(A_1), \ldots, \alpha\beta(A_k))$. In particular, $P(A) \sim B(\alpha\beta(A), \alpha\beta(A^c))$, where $B(a, b)$ denotes the Beta distribution.
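The defining predictive rule translates directly into a sampler. A minimal sketch, with an illustrative $N(0,1)$ base measure; the helper name `polya_urn` is ours.

```python
import numpy as np

def polya_urn(n, alpha, base_sampler, rng):
    """Draw X1, ..., Xn from Pol(alpha, beta): X1 ~ beta, and X_m
    given the past is a fresh draw from beta with probability
    alpha/(alpha + m - 1), else a uniformly chosen past value."""
    X = [base_sampler(rng)]
    for m in range(2, n + 1):
        if rng.random() < alpha / (alpha + m - 1):
            X.append(base_sampler(rng))       # new draw from beta
        else:
            X.append(X[rng.integers(m - 1)])  # repeat an old value
    return np.array(X)

rng = np.random.default_rng(2)
X = polya_urn(1000, alpha=2.0, base_sampler=lambda g: g.normal(), rng=rng)
print(len(np.unique(X)))  # number of distinct values (clusters)
```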
Dirichlet prior based on Pólya urn sequences
• The conditional distribution of $(X_2, X_3, \ldots)$ given $X_1$ is $\mathrm{Pol}\big(\alpha + 1, \frac{\alpha\beta + \delta_{X_1}}{\alpha + 1}\big)$. Thus the posterior distribution given $X_1$ is $D(\alpha\beta + \delta_{X_1})$.
• Each $P_n$ is a discrete rpm and the limit $P$ is also a discrete rpm. For this case of a Pólya urn sequence, we can show that $P(\{X_1, \ldots, X_n\}) \to 1$ with probability 1, and thus $P$ is a discrete rpm.

Dirichlet prior based on Pólya urn sequences
The conditional distribution of $P(\{X_1\})$ given $X_1$ is $B(1 + \alpha\beta(\{X_1\}), \alpha\beta(R^1 \setminus \{X_1\}))$. This is tricky: is $P(\{X_1\})$ measurable to begin with? For the moment assume that $\beta$ is nonatomic. The above conditional distribution then does not depend on $X_1$, so $X_1$ and $P(\{X_1\})$ are independent and $P(\{X_1\}) \sim B(1, \alpha)$.

Dirichlet prior based on Pólya urn sequences
Let $Y_1, Y_2, \ldots$ be the distinct values among $X_1, X_2, \ldots$, listed in the order of their appearance. Then $Y_1 = X_1$; $Y_1$ and $P(\{Y_1\})$ are independent, with $Y_1 \sim \beta$ and $P(\{Y_1\}) \sim B(1, \alpha)$.

Dirichlet prior based on Pólya urn sequences
Consider the sequence $X_2, X_3, \ldots$ with all occurrences of $X_1$ removed. This reduced sequence is a Pólya urn sequence $\mathrm{Pol}(\alpha, \beta)$ and is independent of $Y_1$. As before, $Y_2$ and $\frac{P(\{Y_2\})}{1 - P(\{Y_1\})}$ are independent, with $Y_2 \sim \beta$ and $\frac{P(\{Y_2\})}{1 - P(\{Y_1\})} \sim B(1, \alpha)$. Thus
$$P(\{Y_1\}),\ \frac{P(\{Y_2\})}{1 - P(\{Y_1\})},\ \frac{P(\{Y_3\})}{1 - P(\{Y_1\}) - P(\{Y_2\})},\ \ldots \quad \text{are i.i.d. } B(1, \alpha),$$
and all of these are independent of $Y_1, Y_2, Y_3, \ldots$, which are i.i.d. $\beta$.
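One can check the $B(1, \alpha)$ claim empirically: for a nonatomic $\beta$, the long-run frequency of the value $X_1$ in the urn approximates $P(\{X_1\})$. A rough sketch, reusing `polya_urn` from the previous block; truncating at $n$ steps makes this only approximate.

```python
import numpy as np

# For nonatomic beta (here N(0,1)), the frequency of X1 among
# X1, ..., Xn approximates P({X1}), which should be B(1, alpha).
rng = np.random.default_rng(3)
alpha, n, reps = 2.0, 2000, 500
freqs = np.empty(reps)
for r in range(reps):
    X = polya_urn(n, alpha, lambda g: g.normal(), rng)
    freqs[r] = np.mean(X == X[0])

# Compare simulated moments with those of B(1, alpha):
# mean 1/(1+alpha), second moment 2/((1+alpha)(2+alpha)).
print(freqs.mean(), 1 / (1 + alpha))
print((freqs**2).mean(), 2 / ((1 + alpha) * (2 + alpha)))
```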
Dirichlet prior based on Pólya urn sequences
Since $P$ is discrete and just sits on the set $\{X_1, X_2, \ldots\}$, which is $\{Y_1, Y_2, \ldots\}$, it is also equal to $\sum_1^\infty P(\{Y_i\})\,\delta_{Y_i}$. In other words, we have the Sethuraman construction of the Dirichlet prior (if $\beta$ is nonatomic). Blackwell and MacQueen (1973) do not obtain this result; they show just that the rpm $P$ is discrete for all Pólya urn sequences.

Dirichlet prior based on Pólya urn sequences
The Pólya urn sequence also gives the predictive distribution under a Dirichlet prior by its very definition. Let the distribution of $X_1, X_2, \ldots$ given $P$ be i.i.d. $P$, where $P$ has the Dirichlet prior $D(\alpha\beta)$. Then $X_1, X_2, \ldots$ is the Pólya urn sequence $\mathrm{Pol}(\alpha, \beta)$. Hence the distribution of $X_n$ given $X_1, \ldots, X_{n-1}$ is $\frac{\alpha\beta + \sum_1^{n-1}\delta_{X_i}}{\alpha + n - 1}$.

Dirichlet prior based on Pólya urn sequences
The awkwardness of the rpm $P$ being discrete has turned into a gold mine for finding clusters in data. Let $Y_1, \ldots, Y_{M_n}$ be the distinct values among $X_1, \ldots, X_n$ and let their multiplicities be $k_1, \ldots, k_{M_n}$. From the predictive distribution given above we can show that the probability of $M_n = m$ and $(k_1, \ldots, k_m)$ is
$$\frac{\alpha^m \Gamma(\alpha) \prod_1^m \Gamma(k_j)}{\Gamma(\alpha + n)}.$$
From this we can obtain the marginal distribution of $M_n$, the number of distinct values (or clusters), and also the conditional distributions of the multiplicities.

Dirichlet prior based on Pólya urn sequences
A curious property: the number of distinct values (clusters) $M_n$ goes to $\infty$ slower than $n$; in fact, $\frac{M_n}{\log(n)} \to \alpha$. Also, from the LLN, $\frac{1}{M_n}\sum_{j=1}^{M_n} I(Y_j \le x) \to \beta(x)$ for all $x$ with probability 1. Note that $\frac{1}{n}\sum_{j=1}^{n} I(X_j \le x) \to F(x)$ with probability 1, where $F$ is a rdf with distribution $D(\alpha\beta)$. Note that all of this required $\beta$ to be nonatomic.

Nonparametric priors through completely random measures
Kingman (1967) introduced the concept of completely random measures. $X(A)$, $A \in \mathcal B$, is a completely random measure if $X(\cdot)$ is a random measure and $X(A_1), \ldots, X(A_k)$ are independent whenever $A_1, \ldots, A_k$ are disjoint. If $X(R^1) < \infty$ with probability 1, then $P(A) = \frac{X(A)}{X(R^1)}$ will be a random probability measure. Kingman also characterized the class of all completely random measures (subject to a $\sigma$-finiteness condition) and showed how to generate them from Poisson processes and transition functions. The Dirichlet prior is a special case of this.
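Returning to the cluster-growth property two slides above: the rate $M_n/\log(n) \to \alpha$ can be checked numerically, since in the urn a new cluster appears at step $m$ with probability $\alpha/(\alpha + m - 1)$, so $E[M_n] = \sum_{m=1}^n \alpha/(\alpha + m - 1) \approx \alpha\log(n)$. A small sketch:

```python
import numpy as np

# E[M_n] versus alpha*log(n): a new cluster appears at step m
# with probability alpha/(alpha + m - 1).
alpha = 2.0
for n in (100, 1000, 10000):
    m = np.arange(1, n + 1)
    expected_Mn = np.sum(alpha / (alpha + m - 1))
    print(n, expected_Mn, alpha * np.log(n))
```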
Nonparametric priors and independent increment processes
A df is just an increasing function on the real line; consider $[0, 1]$ instead. The class of processes $X(t)$, $t \in [0, 1]$, with nonnegative independent increments is well known from the theory of infinitely divisible laws. When some simple cases are excluded, such a process has only a countable number of jumps, which are independent of their locations.

Nonparametric priors and independent increment processes
A special case is when $X(t) \sim \mathrm{Gamma}(\alpha\beta(t))$ for $t \in [0, 1]$. Then $F(t) = \frac{X(t)}{X(1)}$ is a random distribution function (since $X(1) < \infty$ with probability 1). Ferguson (in a second definition) shows that its distribution is the nonparametric prior $D(\alpha\beta(\cdot))$. The properties of the Gamma process show that the rpm is discrete and has the representation
$$P([0, t]) = F(t) = \sum_1^\infty p_i^*\,\delta_{Y_i}([0, t]),$$
where $p_1^* > p_2^* > \cdots$ are the jumps of the Gamma process in decreasing order. The jumps are independent of the locations, and thus $(p_1^*, p_2^*, \ldots)$ is independent of $Y_1, Y_2, \ldots$, which are i.i.d. $\beta$.

Sethuraman construction of Dirichlet priors
Sethuraman (1994). Let $\alpha > 0$ and let $\beta(\cdot)$ be a pm; we do not assume that $\beta$ is nonatomic. Let $V_1, V_2, \ldots$ be i.i.d. $B(1, \alpha)$ and let $Y_1, Y_2, \ldots$ be independent of $V_1, V_2, \ldots$ and i.i.d. $\beta(\cdot)$. Let
$$p_1 = V_1, \quad p_2 = (1 - V_1)V_2, \quad p_3 = (1 - V_1)(1 - V_2)V_3, \ \ldots$$
Note that $(p_1, p_2, \ldots)$ is a random discrete distribution with i.i.d. discrete failure rates $V_1, V_2, \ldots$: "stick breaking." In other contexts, like species sampling, this is called the $\mathrm{GEM}(\alpha)$ distribution. Then the convex mixture
$$P(\cdot) = \sum_1^\infty p_i\,\delta_{Y_i}(\cdot) = p_1\,\delta_{Y_1}(\cdot) + (1 - p_1)\sum_2^\infty \frac{p_i}{1 - p_1}\,\delta_{Y_i}(\cdot)$$
is a random discrete probability measure and its distribution is the Dirichlet prior $D(\alpha\beta)$.
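The construction is straightforward to code. A truncated sketch follows; the truncation level $M$, the $N(0,1)$ base measure, and the helper name `stick_breaking` are our illustrative choices.

```python
import numpy as np

def stick_breaking(alpha, base_sampler, M, rng):
    """Truncated Sethuraman construction: the first M stick-breaking
    weights p_i = V_i * prod_{j<i}(1 - V_j) with V_i i.i.d. B(1, alpha),
    and atoms Y_i i.i.d. beta."""
    V = rng.beta(1.0, alpha, size=M)
    p = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    Y = np.array([base_sampler(rng) for _ in range(M)])
    return p, Y

rng = np.random.default_rng(4)
p, Y = stick_breaking(alpha=2.0, base_sampler=lambda g: g.normal(),
                      M=200, rng=rng)
print(p.sum())  # close to 1; the leftover mass is prod_i (1 - V_i)
```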
Sethuraman construction of Dirichlet priors
The random variables $Y_1, Y_2, \ldots$ can take values in any measure space. The proof proceeds by rewriting the constructive definition as
$$P = p_1\,\delta_{Y_1} + (1 - p_1)\,P^*,$$
where all the random variables are independent, $p_1 \sim B(1, \alpha)$, $Y_1 \sim \beta$, and the two rpm's $P, P^*$ have the same distribution. For the distribution of $P$ we thus have the distributional identity
$$P \overset{d}{=} p_1\,\delta_{Y_1} + (1 - p_1)\,P.$$
We first show that $D(\alpha\beta)$ is a solution to this distributional equation and that the solution is unique.

Sethuraman construction of Dirichlet priors
To summarize,
$$P(\cdot) = \sum_1^\infty p_i\,\delta_{Y_i}(\cdot)$$
is a concrete representation of a random probability measure, and its distribution is $D(\alpha\beta(\cdot))$. It was not assumed that $\beta$ is a nonatomic probability measure, and the $Y_i$'s can be general random variables.

Sethuraman construction of Dirichlet priors
The Sethuraman construction is similar to
$$P(\cdot) = \sum_1^\infty p_i^*\,\delta_{Y_i}(\cdot)$$
obtained when using a Gamma process to define the Dirichlet prior. It can be shown that $(p_1^*, p_2^*, \ldots)$ is the same, in distribution, as $(p_1, p_2, \ldots)$ arranged in decreasing order.
Definition: a one-time size biased sampling converts $(p_1^*, p_2^*, \ldots)$ to $(p_1^{**}, p_2^{**}, \ldots)$ as follows. Let $J$ be an integer valued random variable with $\mathrm{Prob}(J = n \mid p_1^*, p_2^*, \ldots) = p_n^*$. If $J = 1$, define $(p_1^{**}, p_2^{**}, \ldots)$ to be the same as $(p_1^*, p_2^*, \ldots)$. If $J > 1$, put $p_1^{**} = p_J^*$ and let $p_2^{**}, p_3^{**}, \ldots$ be $p_1^*, p_2^*, \ldots$ with $p_J^*$ removed.

Sethuraman construction of Dirichlet priors
As a converse result, the distribution of $(p_1, p_2, \ldots)$ is the limiting distribution after repeated size biased sampling of $(p_1^*, p_2^*, \ldots)$. McCloskey (1965), Pitman (1996), etc. The distribution of $(p_1, p_2, \ldots)$ does not change after a single size biased sampling. This property can be used to establish that, in the nonparametric problem with one observation $X_1$, the posterior distribution of $P$ given $X_1$ is $D(\alpha\beta + \delta_{X_1})$; this is simpler than the proof given in Sethuraman (1994).
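A one-time size biased sampling step is easy to implement. The sketch below applies it repeatedly to a truncated, renormalized weight vector sorted in decreasing order; the finite vector is only a stand-in for $(p_1^*, p_2^*, \ldots)$, so this is a rough illustration, reusing `stick_breaking` from the earlier sketch.

```python
import numpy as np

def size_biased_step(p, rng):
    """One size-biased sampling step: pick index J with
    Prob(J = j) = p[j], move p[J] to the front, and keep the
    remaining weights in their original order."""
    J = rng.choice(len(p), p=p)
    rest = np.delete(p, J)
    return np.concatenate(([p[J]], rest))

# GEM(alpha) weights are invariant in distribution under this step;
# repeated steps applied to the decreasing weights converge in
# distribution to GEM(alpha).
rng = np.random.default_rng(5)
p, _ = stick_breaking(2.0, lambda g: g.normal(), 200, rng)
p = np.sort(p / p.sum())[::-1]   # decreasing order, renormalized
for _ in range(50):
    p = size_biased_step(p, rng)
print(p[:5])
```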
Sethuraman construction of Dirichlet priors
Ferguson showed that the support of $D(\alpha\beta)$ is the collection of probability measures in $\mathcal P$ whose support is contained in the support of $\beta$. If the support of $\beta$ is $R^1$, then the support of $D(\alpha\beta)$ is $\mathcal P$. This assures us that even though $D(\alpha\beta)$ gives probability 1 to the class of discrete pm's, it gives positive probability to every neighborhood of every pm.

Dirichlet priors are not discrete
The Dirichlet prior $D(\alpha\beta)$ is not a discrete pm; it just sits on the set of all discrete pm's. The rpm $P$ with distribution $D(\alpha\beta)$ is a random discrete probability measure.

Absolute continuity
Consider $D(\alpha_i\beta_i)$, $i = 1, 2$. These two measures are either mutually absolutely continuous or orthogonal. Let $\beta_{i1}, \beta_{i2}$ be the atomic and nonatomic parts of $\beta_i$, $i = 1, 2$. The priors $D(\alpha_i\beta_i)$, $i = 1, 2$, are orthogonal to each other if $\alpha_1 \ne \alpha_2$, or if $\beta_{11} \ne \beta_{21}$, or if the support of $\beta_{12}$ differs from the support of $\beta_{22}$. There is a necessary and sufficient condition for orthogonality when all of these are equal.

Absolute continuity
The curious result that we can consistently estimate the parameters of the Dirichlet prior from the sample happens because of the orthogonality of Dirichlet distributions with different parameters. Another curious result: when $\beta$ is nonatomic, the prior $D(\alpha\beta)$, the posterior given $X_1$, the posterior given $X_1, X_2$, the posterior given $X_1, X_2, X_3$, etc., are all orthogonal to one another!

Some properties of Dirichlet priors
A simple problem is the estimation of the "true" mean, i.e. $\int x\,dP(x)$, from data $X_1, X_2, \ldots, X_n$ which are i.i.d. $P$. In the Bayesian nonparametric problem, $P$ has a prior distribution $D(\alpha\beta)$ and, given $P$, the data $X_1, \ldots, X_n$ are i.i.d. $P$. The Bayes estimate of $\int x\,dP(x)$ is its mean under the posterior distribution. However, before estimating $\int x\,dP(x)$, one should first check that our prior and posterior give probability 1 to the set $\{P : \int |x|\,dP(x) < \infty\}$.

Some properties of Dirichlet priors
Feigin and Tweedie (1989), and others later, gave a necessary and sufficient condition for this, namely $\int \log(\max(1, |x|))\,d\beta(x) < \infty$. From the constructive definition,
$$\int |x|\,dP(x) = \sum_1^\infty p_i\,|Y_i|.$$
The Kolmogorov three series theorem gives a simple direct proof. Sethuraman (2010).

Some properties of Dirichlet priors
The Bayes estimate of the mean is the mean under the posterior distribution $D(\alpha\beta + nP_n)$, namely
$$\frac{\alpha\int x\,d\beta(x) + n\bar X_n}{\alpha + n}.$$
One should first check that $\int \log(\max(1, |x|))\,d\beta(x) < \infty$. Note that the Bayesian can estimate $\int x\,dP(x)$ in this case even though the base measure $\beta$ may not have a mean. (What does the frequentist estimate $\bar X_n$ estimate in this case?)

Some properties of Dirichlet priors
The actual distribution of $\int x\,dP(x)$ under $D(\alpha\beta)$ is a vexing problem. Regazzini, Lijoi and Prünster (2003) and Lijoi and Prünster (2009) have the best results. When $\beta$ is the Cauchy distribution, it is easy to see that
$$\int x\,dP(x) = \sum_1^\infty p_i Y_i,$$
where $Y_1, Y_2, \ldots$ are i.i.d. Cauchy, and hence $\int x\,dP(x)$ is Cauchy. One does not need the GEM property of $(p_1, p_2, \ldots)$ for this; it is enough for it to be independent of $(Y_1, Y_2, \ldots)$. Yamato (1984) was the first to prove this.
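A simulation consistent with the Cauchy result, using the truncated `stick_breaking` sketch from earlier; the truncated tail is simply ignored, so this is approximate.

```python
import numpy as np

# The random mean  integral x dP(x) = sum_i p_i Y_i  should be Cauchy
# when the atoms Y_i are i.i.d. Cauchy.
rng = np.random.default_rng(6)
reps, M, alpha = 2000, 500, 2.0
means = np.empty(reps)
for r in range(reps):
    p, Y = stick_breaking(alpha, lambda g: g.standard_cauchy(), M, rng)
    means[r] = np.sum(p * Y)

# Compare sample quartiles with the standard Cauchy quartiles (-1, 1).
print(np.quantile(means, [0.25, 0.75]))
```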
Some properties of Dirichlet priors
Convergence of Dirichlet priors when their parameters converge is easy to establish using the constructive definition. When $\alpha \to \infty$ and $\beta$ is fixed, $D(\alpha\beta) \overset{w}{\to} \delta_\beta$. When $\alpha \to 0$ and $\beta$ is fixed, $D(\alpha\beta) \overset{w}{\to}$ the distribution of $\delta_Y$, where $Y \sim \beta$. Sethuraman and Tiwari (1982).

Some properties of Dirichlet priors
The constructive definition
$$P(\cdot) = \sum_1^\infty p_i\,\delta_{Y_i}(\cdot)$$
leads to the inequality
$$\Big\|P - \sum_1^M p_i\,\delta_{Y_i}\Big\| \le \prod_1^M (1 - p_i).$$
So one can allow several kinds of random stopping to stay within chosen errors. One can also stop at nonrandom times and have probability bounds for the errors. Muliere and Tardella (1998) have several results of this type.

Some properties of Dirichlet priors
Let $P \sim D(\alpha\beta)$ and let $f$ be a function on $\mathcal X$. Then $Pf^{-1}$, the distribution of $f$ under $P$, has the $D(\alpha\beta f^{-1})$ distribution, where $\beta f^{-1}$ is the distribution of $f$ under $\beta$.

Some properties of Dirichlet priors
Regression problems are studied when the data are of the form $(Y_1, X_1), (Y_2, X_2), \ldots$ and the model is $Y_i = f(X_i, \epsilon_i)$. If the distribution of $\epsilon$ is $P$, then the conditional distribution of $Y$ given $X = x$ is the distribution of $f(x, \cdot)$ under $P$ and is denoted by $M_x$. Let $P \sim D(\alpha\beta)$ and, given $P$, let $\epsilon_1, \epsilon_2, \ldots$ be i.i.d. $P$. The constructive definition gives
$$P = \sum_1^\infty p_i\,\delta_{Z_i},$$
where $Z_1, Z_2, \ldots$ are i.i.d. $\beta$.

Some properties of Dirichlet priors
The df $M_x$ is the distribution of $f(x, \cdot)$ under $P$; it thus has a Dirichlet distribution and can be represented by
$$P_x = \sum_1^\infty p_i\,\delta_{Z_{i,x}},$$
where $Z_{1,x}, Z_{2,x}, \ldots$ are i.i.d. from the distribution of $f(x, \cdot)$ under $\beta$. MacEachern (1999) calls this the dependent Dirichlet prior (DDP).

Some properties of Dirichlet priors
Ishwaran and James (2001) allow the $V_i$'s appearing in the definition of the $p_i$'s to be independent $B(a_i, b_i)$ random variables. Rodriguez (2011) allows $V_i = \Phi(N_i)$, where $N_1, N_2, \ldots$ are i.i.d. $N(\mu, \sigma^2)$, and calls it "probit stick breaking". Bayes analysis with such priors can be handled only by computational methods.

Bayes hierarchical models
A popular Bayes hierarchical model is usually stated as follows. The data $X_1, \ldots, X_n$ are independent with distributions $K(\cdot, \theta_1), \ldots, K(\cdot, \theta_n)$, where $K(\cdot, \theta)$ is a nice continuous df for each $\theta$. Next it is assumed that there is a rpm $P$ and, given $P$, $\theta_1, \ldots, \theta_n$ are i.i.d. $P$. Finally, it is assumed that $P$ has the $D(\alpha\beta)$ distribution. This is also called the DP mixture model, and it has lots of applications. One wants the posterior distribution of $P$ given the data $X_1, \ldots, X_n$.

Bayes hierarchical models
One should state the Bayes hierarchical model more carefully. Let the rpm $P$ have the $D(\alpha\beta)$ distribution. Conditional on $P$, let $\theta_1, \ldots, \theta_n$ be i.i.d. $P$. Conditional on $(P, \theta_1, \ldots, \theta_n)$, let $X_1, \ldots, X_n$ be independent with distributions $K(\cdot, \theta_1), \ldots, K(\cdot, \theta_n)$, where $K(\cdot, \theta)$ is a df for each $\theta$.
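A generative sketch of this hierarchical model, with an illustrative kernel $K(\cdot, \theta) = N(\theta, 1)$ and base measure $\beta = N(0, 10^2)$, again reusing the truncated `stick_breaking` helper from earlier.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, M, n = 2.0, 200, 500

# P ~ D(alpha*beta), truncated at M atoms.
p, Y = stick_breaking(alpha, lambda g: 10.0 * g.normal(), M, rng)

# theta_i | P i.i.d. P, then X_i | theta_i ~ K(., theta_i) = N(theta_i, 1).
theta = Y[rng.choice(M, size=n, p=p / p.sum())]
X = rng.normal(loc=theta, scale=1.0)

# The marginal density of each X_i given P is the countable mixture
# K(x, P) = sum_i p_i K(x, Y_i), here a mixture of normals.
print(X[:5])
```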
Bayes hierarchical models
Then $(X_1, \ldots, X_n, \theta_1, \ldots, \theta_n, P)$ will have a well defined joint distribution, and all the necessary conditional distributions are also well defined. Ghosh and Ramamoorthi (2003) are careful to state the model this way throughout their book.

Bayes hierarchical models
Several computational methods have been proposed in the literature. West, Müller and Escobar (1994), Escobar and West (1998), and MacEachern (1998) have studied computational methods. They integrate out the rpm $P$ and deal with the joint distribution of $(X_1, \ldots, X_n, \theta_1, \ldots, \theta_n)$. The hierarchical model can also be expressed as follows, suppressing the latent variables $\theta_1, \ldots, \theta_n$: for any pm $P$, let $K(\cdot, P) = \int K(\cdot, \theta)\,dP(\theta)$; let $P \sim D(\alpha\beta)$ and, given $P$, let $X_1, \ldots, X_n$ be i.i.d. $K(\cdot, P)$.

Bayes hierarchical models
To perform an MCMC one can use the constructive definition and put
$$K(\cdot, P) = \sum_1^\infty p_i\,K(\cdot, Y_i),$$
where $(p_1, p_2, \ldots)$ is the usual $\mathrm{GEM}(\alpha)$ and $Y_1, Y_2, \ldots$ are i.i.d. $\beta$. To do the MCMC here one has to truncate this infinite series appropriately. Doss (1994), Ishwaran and James (2001), and others.

More
More Bayes hierarchical models: Neal (2000), Rodriguez, Dunson and Gelfand (2008).
Clustering applications: Blei and Jordan (2006), Wang, Shan and Banerjee (2009).
Two parameter Dirichlet process: Perman, Pitman and Yor (1992), Pitman and Yor (1997).
Partition based priors, useful in reliability and repair models: Hollander and Sethuraman (2009).