A survey of the Dirichlet process and its role in nonparametrics

Jayaram Sethuraman
Department of Statistics
Florida State University
Tallahassee, FL 32306
sethu@stat.fsu.edu
May 9, 2013
Summary
• Elementary frequentist and Bayesian methods
• Bayesian nonparametrics
• Nonparametric priors in a natural way
• Nonparametric priors and exchangeable random variables;
Pólya urn sequences
• Nonparametric priors through other approaches
• Sethuraman construction of Dirichlet priors
• Some properties of Dirichlet priors
• Bayes hierarchical models
Frequentist methods
Data X1, ..., Xn are i.i.d. Fθ where {Fθ : θ ∈ Θ} is a family of distributions.
For instance, the data X1, ..., Xn can be i.i.d. N(θ, σ²), with θ ∈ Θ = (−∞, ∞) and known σ².
Then the frequentist estimate of θ is X̄n = (1/n) ∑_{i=1}^n Xi.
Bayesian methods
However, a Bayesian would consider θ to be random (by seeking experts' opinion, etc.) and would put a distribution on Θ, say a N(μ, τ²) distribution, and call it the prior distribution.
Then the posterior distribution of θ, the conditional distribution of θ given the data X1, ..., Xn, is
N( (nX̄n/σ² + μ/τ²) / (n/σ² + 1/τ²),  1 / (n/σ² + 1/τ²) ).
The Bayes estimate of θ is the expectation of θ under its posterior distribution and is
(nX̄n/σ² + μ/τ²) / (n/σ² + 1/τ²).
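As a quick numerical illustration (a sketch in Python, not part of the talk; the helper name normal_posterior is ours):

    # Normal-normal conjugate update: theta ~ N(mu, tau2) prior,
    # X_1..X_n i.i.d. N(theta, sigma2) with sigma2 known.
    import numpy as np

    def normal_posterior(x, sigma2, mu, tau2):
        """Return the posterior mean and variance of theta."""
        n = len(x)
        precision = n / sigma2 + 1.0 / tau2               # posterior precision
        mean = (n * np.mean(x) / sigma2 + mu / tau2) / precision
        return mean, 1.0 / precision

    rng = np.random.default_rng(0)
    x = rng.normal(2.0, 1.0, size=50)                     # true theta = 2, sigma2 = 1
    print(normal_posterior(x, sigma2=1.0, mu=0.0, tau2=10.0))

The posterior mean is the Bayes estimate above; for large n it is close to X̄n.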
Beginning nonparametrics
Suppose that X1, ..., Xn are i.i.d. and take on only a finite number of values, say in X = {1, ..., k}. The most general probability distribution for X1 is p = (p1, ..., pk) taking values in the unit simplex in R^k, i.e. pj ≥ 0, j = 1, ..., k, and ∑ pj = 1.
Let N = (N1, ..., Nk) where Nj = #{i : Xi = j, i = 1, ..., n} is the observed frequency of j.
The frequentist estimate of pj is Nj/n and the estimate of p is (N1, ..., Nk)/n.
Bayes analysis
The class of all priors for p is just the class of distributions of the (k−1)-dimensional vector (p1, ..., pk−1).
A special distribution is the finite dimensional Dirichlet distribution D(a) with parameter a = (a1, ..., ak), with a density function proportional to
∏_{j=1}^k pj^{aj − 1},  aj ≥ 0, j = 1, ..., k,  A = ∑ aj > 0.
One should really use ratios of independent Gammas to their sum to take care of cases when aj = 0.
Bayes analysis
Then the posterior distribution of p given the data X1, ..., Xn has a density proportional to
∏_{j=1}^k pj^{aj + Nj − 1},
which is the finite dimensional Dirichlet distribution with parameter a + N.
Thus the Bayes estimate of p is (a + N)/(A + n).
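The conjugate update can be computed in one line; a minimal sketch (the helper name dirichlet_update is ours):

    # Prior D(a), observed counts N, posterior D(a + N); Bayes estimate (a + N)/(A + n).
    import numpy as np

    def dirichlet_update(a, counts):
        """Return the posterior parameter a + N and the Bayes estimate of p."""
        a_post = np.asarray(a, dtype=float) + np.asarray(counts, dtype=float)
        return a_post, a_post / a_post.sum()

    a = [1.0, 1.0, 1.0]          # prior parameter for k = 3 cells
    counts = [4, 7, 9]           # observed frequencies N, so n = 20
    print(dirichlet_update(a, counts))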
The standard nonparametric problem
Let X = (X1, ..., Xn) be i.i.d. random variables on R¹ with a common distribution function (df) F (or with a common probability measure (pm) P).
Let Fn(x) = (1/n) ∑ I(Xi ≤ x) and Pn(A) = (1/n) ∑ I(Xi ∈ A) be the empirical distribution function (edf) and empirical probability measure (epm) of X.
Then Fn (Pn) is the frequentist estimate of F (P).
In fact,
Fn(x) → F(x), Pn(A) → P(A)
for each x ∈ R¹ and each A in the Borel sigma field B in R¹.
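A small simulation (a sketch, not from the talk; the function name edf is ours) shows this convergence:

    # F_n(t) = (1/n) * #{i : X_i <= t} for an N(0, 1) sample; F(0) = 0.5.
    import numpy as np

    def edf(x, t):
        """Empirical distribution function evaluated at t."""
        return np.mean(np.asarray(x) <= t)

    rng = np.random.default_rng(1)
    for n in (10, 100, 10000):
        print(n, edf(rng.normal(size=n), 0.0))   # approaches F(0) = 0.5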
Nonparametric priors in a natural way
The parameter F or P in this nonparametric problem is different from the finite parameter case.
The first and most natural attempt to introduce a distribution for P was to mimic the case of random variables taking a finite number of values.
Consider a finite partition A = (A1, ..., Ak) of R¹. Ferguson (1973) defined the distribution of (P(A1), ..., P(Ak)) to be a finite dimensional Dirichlet distribution D(αβ(A1), ..., αβ(Ak)), where α > 0 and β(·) is a pm on (R¹, B).
Nonparametric priors in a natural way
By a distribution for P we mean a distribution on P, the space of all probability measures on R¹. More precisely, we will also have to specify a σ-field S in P. We will take the smallest σ-field containing sets of the form {P : P(A) ≤ r} for all A ∈ B and 0 ≤ r ≤ 1 as our S.
It can be shown that Ferguson's choice is a probability distribution on (P, S); it is called the Dirichlet distribution with parameter αβ(·) and denoted by D(αβ(·)).
This is also called the Dirichlet process, since it is the distribution of F(x), x ∈ R¹.
Nonparametric priors in a natural way
Current usage calls α the scale factor and β(·) the base measure of the Dirichlet distribution D(αβ(·)).
Ferguson showed that the posterior distribution of P given the data X is the Dirichlet distribution D(αβ(·) + nPn(·)), and thus the Bayes estimate of P is
(αβ(·) + nPn(·)) / (α + n).
This is analogous to the results in the finite sample space case.
Nonparametric priors in a natural way
Under D(αβ), (P(A1), ..., P(Ak)) has a finite dimensional Dirichlet distribution.
Two assertions were made in the last slide:
• D(αβ) is a probability measure on (P, S).
• The posterior distribution given X is D(αβ + nPn).
Another property of the Dirichlet prior that was disconcerting at first is that
• D(αβ) gives probability 1 to the subset of discrete probability measures on (R¹, B).
The proofs needed some effort and were sometimes mystifying.
Nonparametric priors and exchangeable random variables
The class of all nonparametric priors is the same as the class of all exchangeable sequences of random variables!
This followed from De Finetti's theorem (1931). See also Hewitt and Savage (1955), Kingman (1978).
Let X1, X2, ... be an infinite exchangeable sequence of random variables (i.e., one whose joint distribution is invariant under permutations of finitely many coordinates) with a joint distribution Q. Then
1. The empirical distribution functions Fn(x) → F(x) with probability 1 for all x. In fact, sup_x |Fn(x) − F(x)| → 0 with probability 1. (Note that F(x) is a random distribution function.)
Nonparametric priors and exchangeable random variables
2. The empirical probability measures Pn → P weakly. This P is a random probability measure.
3. Given P, X1, X2, ... are i.i.d. P.
4. Thus the distribution of P under Q, denoted by ν^Q, is a nonparametric prior.
5. The class of all nonparametric priors arises in this fashion.
6. The distribution of X2, X3, ..., given X1 is also exchangeable; it will be denoted by Q_{X1}.
7. The limit P of the empirical probability measures of X1, X2, ... is also the limit of the empirical probability measures of X2, X3, .... Thus the distribution of P given X1 (the posterior distribution) is the distribution of P under Q_{X1} and, by mere notation, is ν^{Q_{X1}}.
Dirichlet prior based on Pólya urn sequences
The Pólya urn sequence is an example of an infinite exchangeable sequence of random variables.
Let β be a pm on R¹ and let α > 0. Define the joint distribution Pol(α, β) of X1, X2, ... through
X1 ∼ β,  Xn | (X1, ..., Xn−1) ∼ (αβ + ∑_{i=1}^{n−1} δ_{Xi}) / (α + n − 1),  n = 2, 3, ....
This defines Pol(α, β) as an exchangeable probability measure.
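The definition translates directly into a sampler; a sketch with a standard normal base measure (the function name polya_urn is ours):

    # X_1 ~ beta; thereafter X_n is a fresh draw from beta with probability
    # alpha/(alpha + n - 1), and otherwise repeats a uniformly chosen earlier X_i.
    import numpy as np

    def polya_urn(n, alpha, base_sampler, rng):
        x = [base_sampler(rng)]
        for m in range(2, n + 1):
            if rng.uniform() < alpha / (alpha + m - 1):
                x.append(base_sampler(rng))       # new atom from beta
            else:
                x.append(x[rng.integers(m - 1)])  # repeat an old value
        return np.array(x)

    rng = np.random.default_rng(2)
    x = polya_urn(1000, alpha=2.0, base_sampler=lambda r: r.normal(), rng=rng)
    print(len(np.unique(x)))                      # number of distinct values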
Dirichlet prior based on Pólya urn sequences
The nonparametric prior ν^{Pol(α,β)} is the same as the Dirichlet prior D(αβ)!
• That is, the distribution of (P(A1), ..., P(Ak)) for any partition (A1, ..., Ak), under Pol(α, β), is the finite dimensional Dirichlet D(αβ(A1), ..., αβ(Ak)). In particular, P(A) ∼ B(αβ(A), αβ(A^c)).
• The conditional distribution of (X2, X3, ...) given X1 is Pol(α + 1, (αβ + δ_{X1})/(α + 1)). Thus the posterior distribution given X1 is D(αβ + δ_{X1}).
• Each Pn is a discrete rpm and the limit P is also a discrete rpm. For this case of a Pólya urn sequence, we can show that P({X1, ..., Xn}) → 1 with probability 1 and thus P is a discrete rpm.
Dirichlet prior based on Pólya urn sequences
The conditional distribution of P({X1}) given X1 is
B(1 + αβ({X1}), αβ(R¹ \ {X1})).
This is tricky. Is P({X1}) measurable to begin with?
For the moment assume that β is nonatomic.
Then the above conditional distribution does not depend on X1, and thus X1 and P({X1}) are independent, with
P({X1}) ∼ B(1, α).
Dirichlet prior based on Pólya urn sequences
Let Y1, Y2, ... be the distinct values among X1, X2, ... listed in the order of their appearance.
Then Y1 = X1; Y1 and P({Y1}) are independent; and Y1 ∼ β, P({Y1}) ∼ B(1, α).
Dirichlet prior based on Pólya urn sequences
Consider the sequence X2, X3, ... with all occurrences of X1 removed. This reduced sequence is again a Pólya urn sequence Pol(α, β) and is independent of Y1.
As before, Y2 and P({Y2})/(1 − P({Y1})) are independent, with Y2 ∼ β and P({Y2})/(1 − P({Y1})) ∼ B(1, α).
Thus
P({Y1}), P({Y2})/(1 − P({Y1})), P({Y3})/(1 − P({Y1}) − P({Y2})), ...
are i.i.d. B(1, α), and all of these are independent of Y1, Y2, Y3, ..., which are i.i.d. β.
Dirichlet prior based on Pólya urn sequences
Since P is discrete and just sits on the set {X1, X2, ...}, which is {Y1, Y2, ...}, it is also equal to ∑_{i=1}^∞ P({Yi}) δ_{Yi}. In other words, we have the Sethuraman construction of the Dirichlet prior (if β is nonatomic).
Blackwell and MacQueen (1973) do not obtain this result; they show just that the rpm P is discrete for all Pólya urn sequences.
Dirichlet prior based on Pólya urn sequences
The Pólya urn sequence also gives the predictive distribution under a Dirichlet prior by its very definition.
Let the distribution of X1, X2, ... given P be i.i.d. P where P has the Dirichlet prior D(αβ). Then X1, X2, ... is the Pólya urn sequence Pol(α, β).
Hence, the distribution of Xn given X1, ..., Xn−1 is
(αβ + ∑_{i=1}^{n−1} δ_{Xi}) / (α + n − 1).
Dirichlet prior based on Pólya urn sequences
The awkwardness of the rpm P being discrete has turned into a gold mine for finding clusters in data.
Let Y1, ..., Y_{Mn} be the distinct values among X1, ..., Xn and let their multiplicities be k1, ..., k_{Mn}. From the predictive distribution given above we can show that the probability of Mn = m and (k1, ..., km) is
α^m Γ(α) ∏_{j=1}^m Γ(kj) / Γ(α + n).
From this we can obtain the marginal distribution of Mn, the number of distinct values (or clusters), and also the conditional distributions of the multiplicities.
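The displayed probability is easy to evaluate on the log scale; a sketch assuming SciPy is available (the function name log_cluster_prob is ours):

    # log of alpha^m Gamma(alpha) prod_{j=1}^m Gamma(k_j) / Gamma(alpha + n).
    import numpy as np
    from scipy.special import gammaln

    def log_cluster_prob(alpha, k):
        """k = (k_1, ..., k_m) are the cluster multiplicities, n = sum(k)."""
        k = np.asarray(k, dtype=float)
        m, n = len(k), k.sum()
        return (m * np.log(alpha) + gammaln(alpha)
                + gammaln(k).sum() - gammaln(alpha + n))

    print(np.exp(log_cluster_prob(2.0, [3, 1, 1])))   # m = 3 clusters among n = 5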
Dirichlet prior based on Pólya urn sequences
A curious property:
The number of distinct values (clusters) Mn goes to ∞ slower than n; in fact, Mn/log(n) → α.
Also, from the LLN, (1/Mn) ∑_{j=1}^{Mn} I(Yj ≤ x) → β(x) for all x with probability 1.
Note that (1/n) ∑_{j=1}^n I(Xj ≤ x) → F(x) with probability 1, where F is a rdf and has distribution D(αβ).
Note that all this required that β be nonatomic.
Nonparametric priors through completely random measures
Kingman (1967) introduced the concept of completely random measures.
X(A), A ∈ B, is a completely random measure if X(·) is a random measure and X(A1), ..., X(Ak) are independent whenever A1, ..., Ak are disjoint.
If X(R¹) < ∞ with probability 1, then P(A) = X(A)/X(R¹) will be a random probability measure.
Kingman also characterized the class of all completely random measures (subject to a σ-finiteness condition) and also showed how to generate them from Poisson processes and transition functions.
The Dirichlet prior is a special case of this.
Nonparametric priors and independent increment processes
A df is just an increasing function on the real line. Consider [0, 1] instead.
The class of processes X(t), t ∈ [0, 1], with nonnegative independent increments is well known from the theory of infinitely divisible laws.
When some simple cases are excluded, such a process has only a countable number of jumps, which are independent of their locations.
Nonparametric priors and independent increment processes
A special case is when X(t) ∼ Gamma(αβ(t)) for t ∈ [0, 1].
Then F(t) = X(t)/X(1) is a random distribution function (since X(1) < ∞ with probability 1). Ferguson (second definition) shows that its distribution is the nonparametric prior D(αβ(·)).
The properties of the Gamma process show that the rpm is discrete and also has the representation
P([0, t]) = F(t) = ∑_{i=1}^∞ p*_i δ_{Yi}([0, t]),
where p*_1 > p*_2 > ... are the jumps of the Gamma process in decreasing order. The jumps are independent of the locations, and thus (p*_1, p*_2, ...) is independent of Y1, Y2, ..., which are i.i.d. β.
Sethuraman construction of Dirichlet priors
Sethuraman (1994)
Let α > 0 and let β(·) be a pm. We do not assume that β is nonatomic. Let V1, V2, ... be i.i.d. B(1, α) and let Y1, Y2, ... be independent of V1, V2, ... and be i.i.d. β(·).
Let p1 = V1, p2 = (1 − V1)V2, p3 = (1 − V1)(1 − V2)V3, .... Note that (p1, p2, ...) is a random discrete distribution with i.i.d. discrete failure rates V1, V2, .... "Stick breaking." In other contexts, like species sampling, this is called the GEM(α) distribution.
Then the convex mixture
P(·) = ∑_{i=1}^∞ pi δ_{Yi}(·) = p1 δ_{Y1}(·) + (1 − p1) ∑_{i=2}^∞ (pi / (1 − p1)) δ_{Yi}(·)
is a random discrete probability measure and its distribution is the Dirichlet prior D(αβ).
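The construction is immediate to code; a sketch with a fixed truncation level M (the names stick_breaking and base_sampler are ours):

    # p_i = V_i * prod_{j<i} (1 - V_j) with V_i i.i.d. B(1, alpha); Y_i i.i.d. beta.
    import numpy as np

    def stick_breaking(M, alpha, base_sampler, rng):
        v = rng.beta(1.0, alpha, size=M)
        p = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
        y = base_sampler(rng, M)
        return p, y                        # truncated weights and atoms of P

    rng = np.random.default_rng(3)
    p, y = stick_breaking(200, alpha=2.0,
                          base_sampler=lambda r, m: r.normal(size=m), rng=rng)
    print(p.sum())                         # close to 1 for large M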
Sethuraman construction of Dirichlet priors
The random variables Y1, Y2, ... can take values in any measurable space.
The proof proceeds by rewriting the constructive definition as
P = p1 δ_{Y1} + (1 − p1) P*,
where all the random variables are independent, p1 ∼ B(1, α), Y1 ∼ β, and the two rpm's P, P* have the same distribution.
For the distribution of P we have the distributional identity
P =_d p1 δ_{Y1} + (1 − p1) P.
We first show that D(αβ) is a solution to this distributional equation and then that the solution is unique.
Sethuraman construction of Dirichlet priors
To summarize,
P(·) = ∑_{i=1}^∞ pi δ_{Yi}(·)
is a concrete representation of a random probability measure and its distribution is D(αβ(·)).
It was not assumed that β is a nonatomic probability measure, and the Yi's can be general random variables.
Sethuraman construction of Dirichlet priors
The Sethuraman construction is similar to
P(·) = ∑_{i=1}^∞ p*_i δ_{Yi}(·)
obtained when using a Gamma process to define the Dirichlet prior.
It can be shown that (p*_1, p*_2, ...) is the same, in distribution, as (p1, p2, ...) arranged in decreasing order.
Definition: A one-time size biased sampling converts (p*_1, p*_2, ...) to (p**_1, p**_2, ...) as follows. Let J be an integer valued random variable with Prob(J = n | p*_1, p*_2, ...) = p*_n.
If J = 1, define (p**_1, p**_2, ...) to be the same as (p*_1, p*_2, ...).
If J > 1, put p**_1 = p*_J and let p**_2, p**_3, ... be p*_1, p*_2, ... with p*_J removed.
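For a finite weight vector the operation looks as follows (a toy sketch of the infinite-sequence definition above; size_biased_once is our name):

    # Draw J with Prob(J = n | p*) = p*_n and move p*_J to the front.
    import numpy as np

    def size_biased_once(p, rng):
        j = rng.choice(len(p), p=p)
        return np.concatenate(([p[j]], np.delete(p, j)))

    rng = np.random.default_rng(4)
    print(size_biased_once(np.array([0.5, 0.3, 0.15, 0.05]), rng))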
Sethuraman construction of Dirichlet priors
As a converse result, the distribution of (p1, p2, ...) is the limiting distribution under repeated size biased sampling of (p*_1, p*_2, ...). McCloskey (1965), Pitman (1996), etc.
The distribution of (p1, p2, ...) does not change after a single size biased sampling.
This property can be used to establish that in the nonparametric problem with one observation X1, the posterior distribution of P given X1 is D(αβ + δ_{X1}). This proof is simpler than the one given in Sethuraman (1994).
Sethuraman construction of Dirichlet priors
Ferguson showed that the support of the D(αβ) is the collection of
probability measures in P whose support is contained in the
support of β.
If the support of β is R1 then the support of Dαβ is P.
This assures us that even though D(αβ) gives probability 1 to the
class of discrete pm’s, it gives positive probability to every
neighborhood of every pm.
Dirichlet priors are not discrete
The Dirichlet prior D(αβ) is not a discrete pm; it just sits on the set of all discrete pm's.
The rpm P with distribution D(αβ) is a random discrete probability measure.
Absolute continuity
Consider D(αi βi), i = 1, 2. These two measures are either absolutely continuous with respect to each other or orthogonal.
Let βi1, βi2 be the atomic and nonatomic parts of βi, i = 1, 2.
D(αi βi), i = 1, 2, are orthogonal to each other if α1 ≠ α2, or if β11 ≠ β21, or if the support of β12 differs from the support of β22. There is a necessary and sufficient condition for orthogonality when all of these are equal.
Absolute continuity
The curious result that we can consistently estimate the parameters of the Dirichlet prior from the sample happens because of the orthogonality of Dirichlet distributions when their parameters are different.
Another curious result: When β is nonatomic, the prior D(αβ), the posterior given X1, the posterior given X1, X2, the posterior given X1, X2, X3, etc. are all orthogonal to one another!
Some properties of Dirichlet priors
A simple problem is the estimation of the "true" mean, i.e. ∫ x dP(x), from data X1, X2, ..., Xn which are i.i.d. P.
In the Bayesian nonparametric problem, P has a prior distribution D(αβ) and, given P, the data X1, ..., Xn are i.i.d. P.
The Bayesian estimate of ∫ x dP(x) is its mean under the posterior distribution.
However, before estimating ∫ x dP(x), one should first check that our prior and posterior give probability 1 to the set {P : ∫ |x| dP(x) < ∞}.
Some properties of Dirichlet priors
Feigin and Tweedie (1989), and others
R later, gave necessary and
sufficient conditions for this, namely log(max(1, |x|))dβ(x) < ∞.
From our constructive definition,
Z
|x|dP(x) =
∞
X
p1 |Yi |.
1
The Kolomogorov three series theorem gives a simple direct proof.
Sethuraman (2010).
Some properties of Dirichlet priors
The Bayes estimate of the mean is the mean under the posterior distribution D(αβ + nPn) and it is
(α ∫ x dβ(x) + n X̄n) / (α + n).
One should check first that ∫ log(max(1, |x|)) dβ(x) < ∞. Note that the Bayesian can estimate ∫ x dP(x) in this case even though the base measure β may not have a mean. (What does the frequentist estimate X̄n estimate in this case?)
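A sketch of this estimate, here with a N(0, 1) base measure so that ∫ x dβ(x) = 0 (the function name bayes_mean is ours):

    import numpy as np

    def bayes_mean(x, alpha, beta_mean):
        """(alpha * int x dbeta + n * Xbar) / (alpha + n)."""
        n = len(x)
        return (alpha * beta_mean + n * np.mean(x)) / (alpha + n)

    rng = np.random.default_rng(5)
    x = rng.normal(3.0, 1.0, size=100)
    print(bayes_mean(x, alpha=2.0, beta_mean=0.0))   # Xbar shrunk toward 0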
Some properties of Dirichlet priors
R
The actual distribution of xdP(x) under D(αβ) is a vexing
problem. Regazzini, Lijoi and Prünster (2003), Lijoi and Prünster
(2009) have the best results.
When β is the Cauchy distribution, it is easy that
Z
xdP(x) =
∞
X
pi Yi
1
R
where Y1 , Y2 , . . . are i.i.d. Cauchy, and hence xPd(x) is Cauchy.
One does not need the GEM property of (p1 , p2 , . . . ) for this; it is
enough for it to be independent of (Y1 , Y2 , . . . ). Yamato (1984)
was the first to prove this.
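This is easy to check by simulation with truncated stick-breaking draws (a sketch; recall that the quartiles of a standard Cauchy are ±1):

    import numpy as np

    rng = np.random.default_rng(6)
    alpha, M, draws = 2.0, 500, 2000
    samples = []
    for _ in range(draws):
        v = rng.beta(1.0, alpha, size=M)
        p = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
        y = rng.standard_cauchy(size=M)        # Y_i i.i.d. standard Cauchy
        samples.append(np.dot(p, y))           # one draw of int x dP(x)
    print(np.quantile(samples, [0.25, 0.75]))  # approximately (-1, 1)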
Some properties of Dirichlet priors
Convergence of Dirichlet priors when their parameters converge is easy to establish using the constructive definition.
When α → ∞ and β is fixed, D(αβ) converges weakly to δβ.
When α → 0 and β is fixed, D(αβ) converges weakly to the distribution of δY, where Y ∼ β. Sethuraman and Tiwari (1982).
Some properties of Dirichlet priors
The constructive definition
P(·) = ∑_{i=1}^∞ pi δ_{Yi}(·)
leads to the inequality
||P − ∑_{i=1}^M pi δ_{Yi}|| ≤ ∏_{i=1}^M (1 − pi).
So one can allow for several kinds of random stopping to stay within chosen errors. One can also stop at nonrandom times and have probability bounds for errors. Muliere and Tardella (1998) have several results of this type.
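One such stopping rule, sketched below: keep breaking the stick until the leftover mass ∏_{i≤M}(1 − Vi), which bounds the truncation error, falls below a chosen ε (the function name stick_breaking_adaptive is ours):

    import numpy as np

    def stick_breaking_adaptive(alpha, eps, rng):
        p, leftover = [], 1.0
        while leftover >= eps:
            v = rng.beta(1.0, alpha)
            p.append(leftover * v)       # p_i = V_i * prod_{j<i} (1 - V_j)
            leftover *= 1.0 - v          # remaining stick length
        return np.array(p), leftover

    rng = np.random.default_rng(7)
    p, err = stick_breaking_adaptive(alpha=2.0, eps=1e-6, rng=rng)
    print(len(p), err)                   # random truncation level, error < eps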
Some properties of Dirichlet priors
Let P ∼ D(αβ) and let f be a function on X. Then Pf⁻¹, the distribution of f under P, has the D(αβf⁻¹) distribution, where βf⁻¹ is the distribution of f under β.
Some properties of Dirichlet priors
Regression problems are studied when the data are of the form (Y1, X1), (Y2, X2), ... and the model is that Yi = f(Xi, εi). If the distribution of ε is P, then the conditional distribution of Y given X = x is the distribution of f(x, ε) under P and is denoted by Mx.
Let P ∼ D(αβ) and, given P, let ε1, ε2, ... be i.i.d. P. The constructive definition gives
P = ∑_{i=1}^∞ pi δ_{Zi},
where Z1, Z2, ... are i.i.d. β.
Some properties of Dirichlet priors
The df Mx is the distribution of f(x, ·) under P; it thus has a Dirichlet distribution and can be represented by
Px = ∑_{i=1}^∞ pi δ_{Z_{i,x}},
where Z_{1,x}, Z_{2,x}, ... are i.i.d. from the distribution of f(x, ε) under β. MacEachern (1999) calls this the dependent Dirichlet prior (DDP).
Some properties of Dirichlet priors
Ishwaran and James (2011) allow the Vi ’s appearing in the
definition of the pi ’s to independent B(ai , bi ) random variables.
Rodriguez (2011) allows Vi = Φ(Ni ) where N1 , N2 , . . . are i.i.d.
N(µ, σ 2 ) and calls it “probit stick breaking”.
Bayes analysis is with such priors can be handled only by
computational methods.
Bayes hierarchical models
A popular Bayes hierarchical model is usually stated as follows.
The data X1, ..., Xn are independent with distributions K(·, θ1), ..., K(·, θn), where K(·, θ) is a nice continuous df for each θ.
Next it is assumed that there is an rpm P and that, given P, θ1, ..., θn are i.i.d. P.
Finally, it is assumed that P has the D(αβ) distribution.
This is also called the DP mixture model and it has lots of applications.
One wants the posterior distribution of P given the data X1, ..., Xn.
Bayes hierarchical models
One should state the Bayes hierarchical model more carefully.
Let the rpm P have the D(αβ) distribution.
Conditional on P, let θ1 , . . . , θn be i.i.d. P.
Conditional of (P, θ1 , . . . , θn ), let X1 , . . . , Xn be independent with
distributions K (·, θ1 ), . . . , K (·, θn ), where K (·, θ) is a df for each θ.
Then (X1 , . . . , Xn , θ1 , . . . , θn , P) will have a well defined joint
distribution and all the necessary conditional distributions are also
well defined.
Ghosh and Ramamoorthy (2003) are careful to state this way
throughout their book.
Bayes hierarchical models
Several computational methods have been proposed in the
literature.
West, Muller, and Escobar (1994), Escobar and West (1998), and
MacEachern (1998) have studied computational methods. They
integrate out the rpm P and deal with the joint distribution of
(X1 , . . . , Xn , θ1 , . . . , θn ).
The hierarchical model can also be expressed as follows,
suppressingR the latent variables θ1 , . . . , θn : For any pm P, let
K (·, P) = K (·, θ)dP(θ).
P ∼ D(αβ) and given P, let X1 , . . . , Xn be i.i.d. P.
Bayes hierarchical models
To perform an MCMC one can use the constructive definition and put
K(·, P) = ∑_{i=1}^∞ pi K(·, Yi),
where (p1, p2, ...) is the usual GEM(α) and Y1, Y2, ... are i.i.d. β.
To do the MCMC here one has to truncate this infinite series appropriately.
Doss (1994), Ishwaran and James (2011), and others.
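A sketch of forward simulation from the (truncated) DP mixture, here with normal kernels K(·, θ) = N(θ, 1); all numerical choices are illustrative:

    import numpy as np

    rng = np.random.default_rng(8)
    alpha, M, n = 2.0, 200, 500
    v = rng.beta(1.0, alpha, size=M)                  # stick-breaking proportions
    p = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    p /= p.sum()                                      # renormalize after truncation
    atoms = rng.normal(0.0, 3.0, size=M)              # Y_i i.i.d. beta = N(0, 9)
    theta = rng.choice(atoms, size=n, p=p)            # latent theta_i i.i.d. P
    x = rng.normal(theta, 1.0)                        # X_i ~ K(., theta_i)
    print(len(np.unique(theta)))                      # clusters among the theta_i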
More
More Bayes hierarchical models: Neal (2000), Rodriguez, Dunson and Gelfand (2008).
Clustering applications: Blei and Jordan (2006), Wang, Shan and Banerjee (2009).
Two parameter Dirichlet process: Perman, Pitman and Yor (1992), Pitman and Yor (1997).
Partition based priors, useful in reliability and repair models: Hollander and Sethuraman (2009).