
Bayesian Nonparametrics
Foundations and Applications
Notes to Stat 718 (Spring 2008)
Jayaram Sethuraman
E-mail: sethuram@math.sc.edu
Modified on April 22, 2008
January 15, 2008
1 Introduction to Bayes Inference
A statistician wishes to make inferences about the unknown state of nature.
Suppose that this unknown state of nature is summarized by a real variable
θ. We try to observe some data, represented as X, which again for simplicity,
will be assumed to be a real number or a finite dimensional vector of real
numbers. Its distribution will depend on θ. This fact allows us to infer
something about θ from the data X.
The Bayesian begins by saying that he has some information about θ
even before he has collected data. This is assumed to be summarized by a
probability distribution (the prior distribution) for θ defined by a pdf π(θ).
The distribution of the data given that the state of nature is θ is summarized
by a distribution with pdf p(x|θ).
Let Q denote the joint distribution of (X, θ). The pdf of this distribution
is
p(x|θ)π(θ).
The pdf of the conditional distribution of θ given the data X (the posterior
distribution) can be given as
π(θ|x) = p(x|θ)π(θ) / ∫ p(x|θ′)π(θ′) dθ′ ∝ p(x|θ)π(θ).
Example 1.1. Let the random variables X1, X2, . . . , Xn be i.i.d. N(θ, σ²) where
σ² is known. Let X stand for (X1, . . . , Xn). Then

p(x|θ) ∝ e^{−n(x̄−θ)²/(2σ²)} A(x),

where x̄ = (1/n) ∑_{i=1}^n xi and A(x) depends only on x.
Suppose that the prior information on θ is summarized by the normal
distribution with pdf π(θ) ∼ N(µ, τ²). Then

π(θ|x) ∝ exp( −n(x̄ − θ)²/(2σ²) − (θ − µ)²/(2τ²) )
      ∝ exp( −(1/2)(n/σ² + 1/τ²)θ² + θ(nx̄/σ² + µ/τ²) )
      ∝ exp( −(1/2)(n/σ² + 1/τ²)(θ − (nx̄/σ² + µ/τ²)/(n/σ² + 1/τ²))² ),

so that

π(θ|x) ∼ N( (nx̄/σ² + µ/τ²)/(n/σ² + 1/τ²), 1/(n/σ² + 1/τ²) ).
Under a squared error loss, the optimal estimate of θ is given by θ̂ =
E(θ|x) which is the expected value of θ under the posterior distribution
π(θ|x). This Bayes estimate is equal to
θ̂ = (nx̄/σ² + µ/τ²)/(n/σ² + 1/τ²).
This is usually interpreted by saying that the Bayes estimate is a convex
combination of the prior mean and the sample mean, with constants proportional to the prior precision (1/τ²) and the sample precision (n/σ²). It is
interesting to note that if τ → ∞, this estimate tends to x̄, the sample
mean. Thus τ → ∞ is interpreted as no prior information about θ and it
produces the usual frequentist estimate. This can also be formally done by
using the uniform distribution on (−∞, +∞) for θ. It should be emphasized
that this is not a rigorous Bayesian way of doing things.
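As a concrete illustration of this conjugate update, here is a minimal Python sketch (ours, not part of the notes); the simulated data, the function name, and the use of numpy are illustrative assumptions.

# Sketch (illustration only): the normal-normal posterior of Example 1.1.
# Prior: theta ~ N(mu, tau^2); data: X_1,...,X_n i.i.d. N(theta, sigma^2), sigma^2 known.
import numpy as np

def normal_posterior(x, sigma2, mu, tau2):
    """Return the posterior mean and variance of theta given data x."""
    n = len(x)
    xbar = np.mean(x)
    precision = n / sigma2 + 1.0 / tau2          # posterior precision
    post_mean = (n * xbar / sigma2 + mu / tau2) / precision
    post_var = 1.0 / precision
    return post_mean, post_var

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)       # sigma^2 = 1
print(normal_posterior(x, sigma2=1.0, mu=0.0, tau2=100.0))  # close to (xbar, 1/n) for large tau^2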
Example 1.2. Let N = (N1, N2, . . . , Nk) be a multivariate random vector
satisfying Ni ≥ 0, i = 1, . . . , k and ∑_{i=1}^k Ni = n, whose distribution depends
on a parameter θ = (θ1, θ2, . . . , θk) satisfying θi ≥ 0, ∑_{i=1}^k θi = 1. Suppose
that this distribution is given by the pmf

p(N|θ) = (n!/(N1! · · · Nk!)) ∏_{i=1}^k θi^{Ni}.
A popular prior distribution for θ is given by the pdf of the k−1 variables θ1, . . . , θk−1:

π(θ) ∝ ∏_{i=1}^k θi^{αi−1} = (Γ(α1 + · · · + αk)/(Γ(α1) · · · Γ(αk))) ∏_{i=1}^k θi^{αi−1}

with θk = 1 − θ1 − · · · − θk−1 and where α1 > 0, . . . , αk > 0 satisfying
α1 + · · · + αk > 0. This distribution is called the finite dimensional Dirichlet
distribution D(α1, . . . , αk) = Dα where α = (α1, . . . , αk).
An alternate description can be given as θ ∼ Dα if θ1 = Z1/Z, . . . , θk = Zk/Z,
where Z = Z1 + · · · + Zk and Z1, . . . , Zk are independent Gamma r.v.'s with
parameters (λ, α1), . . . , (λ, αk), respectively. A r.v. Z is said to have the
Gamma distribution Γ(λ, α) if its pdf is

p(z) = (λ^α/Γ(α)) e^{−λz} z^{α−1},  z ≥ 0,

if α > 0, and is δ0, the degenerate distribution at 0, if α = 0.
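The Gamma description above is easy to check numerically. The following Python sketch is ours (numpy and the particular α are illustrative assumptions): it samples θ by normalizing independent Gamma variables and compares with a direct Dirichlet sampler.

# Sketch (illustration only): the Gamma representation of the finite dimensional Dirichlet.
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([2.0, 3.0, 5.0])
lam = 1.0                                   # the rate lambda cancels in the ratio

Z = rng.gamma(shape=alpha, scale=1.0 / lam, size=(100000, 3))
theta = Z / Z.sum(axis=1, keepdims=True)    # theta ~ D(alpha) by the construction above

direct = rng.dirichlet(alpha, size=100000)  # numpy's direct Dirichlet sampler
print(theta.mean(axis=0), direct.mean(axis=0))   # both approx alpha / alpha.sum()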
Notice that

π(θ|N) ∝ ∏_{i=1}^k θi^{Ni+αi−1} ∼ D(α1 + N1, . . . , αk + Nk) = Dα+N

and hence the posterior distribution of θ is Dα+N .
We will examine this in some more detail below.
Lemma 1.1. Let U, V, A be independent random variables with U ∼ Dα, V ∼ Dβ
and A ∼ B(∑_{i=1}^k αi, ∑_{i=1}^k βi) (i.e. (A, 1 − A) ∼ D(∑_{i=1}^k αi, ∑_{i=1}^k βi)).
Let W = AU + (1 − A)V. Then W ∼ D(α1 + β1, . . . , αk + βk) = Dα+β .

Proof: Let U = (Z1/Z, . . . , Zk/Z), V = (Z1∗/Z∗, . . . , Zk∗/Z∗), A = Z/(Z + Z∗),
where Z = Z1 + · · · + Zk, Z∗ = Z1∗ + · · · + Zk∗ and Z1, . . . , Zk, Z1∗, . . . , Zk∗ are
independent Gamma random variables with parameters α1, . . . , αk, β1, . . . , βk, respectively.
By Basu's theorem U is independent of Z and V is independent of Z∗. Also
U ∼ Dα, V ∼ Dβ and A ∼ B(∑ αi, ∑ βi). Thus the random variables
(U, V, A) defined above have the same joint distribution as (U, V, A) in
Lemma 1.1. Now,

W = AU + (1 − A)V = ((Z1 + Z1∗)/(Z + Z∗), . . . , (Zk + Zk∗)/(Z + Z∗)).

This shows that the distribution of W is Dα+β .
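Lemma 1.1 can also be checked by simulation. The sketch below is ours (the particular α, β and sample size are arbitrary illustrative choices): it compares the first two moments of W = AU + (1 − A)V with those of a direct draw from Dα+β.

# Sketch (illustration only): Monte Carlo check of Lemma 1.1.
import numpy as np

rng = np.random.default_rng(2)
alpha = np.array([1.0, 2.0, 3.0])
beta = np.array([2.0, 2.0, 1.0])
m = 200000

U = rng.dirichlet(alpha, size=m)
V = rng.dirichlet(beta, size=m)
A = rng.beta(alpha.sum(), beta.sum(), size=(m, 1))
W = A * U + (1 - A) * V

direct = rng.dirichlet(alpha + beta, size=m)
print(W.mean(axis=0), direct.mean(axis=0))          # first moments agree
print((W**2).mean(axis=0), (direct**2).mean(axis=0))  # second moments agree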
Remark 1.1. The conditional probability of A given B, denoted by P(A|B),
is better viewed as a function of B, with A fixed. Thus we also have
P(A|B^c), P(A|∅) and P(A|Ω); in fact we have P(A|C) for C ∈ σ(B) =
{∅, B, B^c, Ω}. It will have the property P(A ∩ C) = P(A|C)P(C) for all
C ∈ σ(B). Similarly, the conditional expectation of X given Y, denoted
by E(X|Y) (which satisfies E(X) = E(E(X|Y))), should be viewed as a function of Y satisfying the
condition

∫ E(X|Y) I(Y ∈ C) dQ = ∫ X I(Y ∈ C) dQ

for all Borel sets C, where Q is the joint distribution of (X, Y). As a special
case, the conditional probability of X ∈ A given Y is a function Q(X ∈ A|Y)
of Y satisfying

∫ Q(X ∈ A|Y) I(Y ∈ C) dQ = ∫ I(X ∈ A) I(Y ∈ C) dQ

for all Borel sets C. When (X, Y) possesses a joint pdf p(x, y) under Q and
λ(y) is the pdf (w.r.t. ν) of Y, the conditional probability Q(X ∈ A|y) can
be written as

∫_{x∈A} p(x, y) dν / λ(y),

or just that p(x, y)/λ(y) is the pdf of the conditional distribution of X given Y = y.
Back to Example 1.2:
The random vector N = (N1, . . . , Nk) at the beginning of this example can
also be viewed in a different way. Let X be a random variable taking values
in {1, . . . , k} with a probability distribution p(·|θ) whose pmf is given by

Q(X = i|θ) = p(i|θ) = θi, i = 1, . . . , k.

Let X1, . . . , Xn be i.i.d. with distribution p(·|θ). Then the joint distribution
of (X1, . . . , Xn) has a pmf given by

∏_{i=1}^k θi^{Ni}

where Ni = ∑_{j=1}^n I(Xj = i), i = 1, . . . , k.
Recall that we assumed a finite dimensional Dirichlet prior for θ, namely
θ ∼ Dα .
The joint distribution of ((X1, . . . , Xn), θ) is proportional to

∏_{i=1}^k θi^{Ni} ∏_{i=1}^k θi^{αi−1}

and the conditional distribution of θ given (X1, . . . , Xn) has pdf proportional to

∏_{i=1}^k θi^{Ni+αi−1},

which corresponds to the finite dimensional Dirichlet distribution Dα+N .
Define
e1 = (1, 0, . . . 0), e2 = (0, 1, . . . 0), . . . ek = (0, 0, . . . 1).
Assume that n = 1. Then X1 can take k possible values and N = (N1 , . . . , Nk )
can correspondingly take the values e1 , . . . , ek . In fact, N = eX1 . From now
on write X for X1 .
The posterior distribution of θ given X is Dα+eX . From Remark 1.1 we
have

Dα(B) = Q(θ ∈ B) = ∑_{i=1}^k Q(θ ∈ B|X = i) Q(X = i)
      = ∑_{i=1}^k Dα+ei(B) Q(X = i)
      = ∑_{i=1}^k Dα+ei(B) αi/(∑_{j=1}^k αj).
Thus we have proved the following Lemma.
Lemma 1.2.

Dα = ∑_{i=1}^k Dα+ei · αi/(∑_{j=1}^k αj).
2 The nonparametric problem
Let X1 , X2 . . . Xn be i.i.d with distribution F . One can consider several
nonparametric hypotheses concerning the df F .
• The parameter F is completely unspecified.
• F (x) = G(x − θ) for all x, θ is unknown and G is an unspecified
symmetric df.
• F has IFR (increasing failure rate); etc.
January 17, 2008
We will now give some details on the use of Basu's theorem to establish
the independence of U = (U1, . . . , Uk) and Z in Lemma 1.1. Note that
U1 = Z1/Z, . . . , Uk = Zk/Z, where Z = Z1 + · · · + Zk and Z1, . . . , Zk are independent
Gamma(λ, α1), . . . , Gamma(λ, αk). Then Z is a complete sufficient statistic
for λ and the distribution of (U1 , . . . , Uk ) is free of λ. Basu’s theorem says
that any λ-free statistic is independent of a complete sufficient statistic. This
establishes the independence of (U1 , . . . , Uk ) and Z.
It is easy to extend Lemma 1.1 to more than two component random
vectors as follows.
Lemma 2.1. Let (A1, A2, A3) ∼ D(∑_{i=1}^k αi, ∑_{i=1}^k βi, ∑_{i=1}^k γi), U ∼ Dα, V ∼ Dβ,
W ∼ Dγ. Note that A1 + A2 + A3 = 1. Assume that U, V, W and (A1, A2, A3)
are independent. Then

A1 U + A2 V + A3 W ∼ Dα+β+γ .
The next Lemma is a consequence of Lemma 1.1.
Lemma 2.2. Recall ei = (0, . . . , 1, . . . , 0), where the ith co-ordinate is 1
and the rest are 0. Let U ∼ Dei = δei , that is U = ei with probability 1.
Let V ∼ Dβ and let A ∼ B(1, ∑_{i=1}^k βi). Furthermore, let U, V and A be
independent. (We really need only that V and A are independent since U is
degenerate.) Then
AU + (1 − A)V ∼ Dβ+ei ,
i.e.
Aδei + (1 − A)V ∼ Dβ+ei .
Here is another proof of the posterior distribution in Example 1.2 using
moments to identify the posterior distribution.
The pdf p(u) of the distribution of U ∼ D(α1,...,αk) is given by

p(u) = (Γ(α1 + · · · + αk)/(Γ(α1) · · · Γ(αk))) u1^{α1−1} · · · uk^{αk−1},

where uk = 1 − u1 − · · · − uk−1 and α1 > 0, . . . , αk > 0.
Let U have the Dirichlet distribution Dα. Then the moments are given by

E(U1^{r1} U2^{r2} · · · Uk^{rk}) = ∏_{i=1}^k αi^{[ri]} / (α1 + · · · + αk)^{[r1+···+rk]}        (2.1)

where ri ≥ 0 for i = 1, . . . , k. This can be checked from the pdf of U, ignoring
coordinates corresponding to {i : αi = 0}. Note that we have used the
notation for ascending factorials: a^{[r]} = a(a + 1) · · · (a + r − 1) if a ≥ 0, and
0^{[0]} = 1, 0^{[r]} = 0 if r > 0.
Conversely, the distribution of a random variable U on the simplex in
Rk whose moments are given by (2.1) is uniquely determined (since U is
bounded) and is the Dirichlet distribution Dα. Thus this can be considered
as the third definition of a Dirichlet distribution.
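The moment formula (2.1) can be verified by Monte Carlo. The following sketch is ours; the ascending-factorial helper and the chosen α and r are illustrative assumptions.

# Sketch (illustration only): checking the Dirichlet moment formula (2.1) by simulation.
import numpy as np

def ascending_factorial(a, r):
    """a^[r] = a (a+1) ... (a+r-1), with a^[0] = 1."""
    out = 1.0
    for j in range(r):
        out *= a + j
    return out

rng = np.random.default_rng(3)
alpha = np.array([1.5, 2.0, 3.5])
r = [2, 1, 3]

exact = np.prod([ascending_factorial(a, ri) for a, ri in zip(alpha, r)]) \
        / ascending_factorial(alpha.sum(), sum(r))
U = rng.dirichlet(alpha, size=500000)
mc = np.mean(np.prod(U ** np.array(r), axis=1))
print(exact, mc)          # the two numbers agree up to Monte Carlo error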
Let X be a discrete r.v. taking values in {1, . . . , k}, with a pmf given by
Q(X = i|p) = p(i), i = 1, . . . k where p = (p(1), ...p(k)).
Let p have (prior) distribution Dα. Then

Q(p ∈ B) = ∫_B dDα = Q((Z1/Z, . . . , Zk/Z) ∈ B).
We will now find the posterior distribution Q(p ∈ B|X1 = i) by the
moment-characterization of the Dirichlet distribution. This will be another
way to derive this posterior distribution.
The marginal distribution of X1 is given by

Q(X1 = i) = E(Q(X1 = i|p)) = E(p(i)) = αi/(∑_{j=1}^k αj)

by using the moments of a Dirichlet distribution. Assume that αi > 0. Then
Q(X1 = i) > 0. Thus

Q(p ∈ B|X1 = i) = Q(p ∈ B, X1 = i)/Q(X1 = i) = (∫_{p∈B} p(i) dDα) · (∑_{j=1}^k αj)/αi .

This means that

E(f(p(1), . . . , p(k))|X1 = i) = E(p(i) f(p(1), . . . , p(k))) · (∑_{j=1}^k αj)/αi .        (2.2)
For any vector of non-negative integers r = (r1, . . . , rk), let r∗ = r + ei =
(r1, . . . , ri + 1, . . . , rk). Let r = ∑_{j=1}^k rj and r∗ = ∑_{j=1}^k rj∗. Also let α∗ = α + ei =
(α1, . . . , αi + 1, . . . , αk). By choosing f(p(1), . . . , p(k)) = ∏_{j=1}^k p(j)^{rj} in
equation (2.2) we get

E(∏_{j=1}^k p(j)^{rj} | X = i) = E(∏_{j=1}^k p(j)^{rj∗}) · (∑_{j=1}^k αj)/αi
 = (∏_{j=1}^k αj^{[rj∗]} / (∑_{j=1}^k αj)^{[r∗]}) · (∑_{j=1}^k αj)/αi
 = (α1∗)^{[r1]} · · · (αk∗)^{[rk]} / (∑_j αj∗)^{[r]} .
This proves that the conditional distribution of p given X = i is Dα+ei . It
also yields another proof of Lemma 1.2 as follows.
Dα(p ∈ B) = Q(p ∈ B) = ∑_{i=1}^k Q(p ∈ B|X = i) Q(X = i)
          = ∑_{i=1}^k Dα+ei(p ∈ B) · αi/(∑_{j=1}^k αj).
We will examine this conclusion of Lemma 1.2 further. Let A ∼ B(1, ∑_{i=1}^k αi),
V ∼ Dα, and let A, V be independent. Let X be independent of (A, V) with
Q(X = i) = αi/(∑_{j=1}^k αj) for i = 1, . . . , k. Let δX = (δX(1), . . . , δX(k)) = ei if X = i, i.e.
let δX = eX . So what is the distribution of AδX + (1 − A)V?

Q(AδX + (1 − A)V ∈ B) = E(Q(AδX + (1 − A)V ∈ B|X))
 = ∑_{i=1}^k Q(AδX + (1 − A)V ∈ B|X = i) Q(X = i)
 = ∑_{i=1}^k Q(Aei + (1 − A)V ∈ B|X = i) Q(X = i)
 = ∑_{i=1}^k Dα+ei(B) · αi/(∑_{j=1}^k αj)
 = Dα(B).

This means that

V =d A1 δX1 + (1 − A1)V
  =d A1 δX1 + (1 − A1)[A2 δX2 + (1 − A2)V]
  =d A1 δX1 + A2(1 − A1) δX2 + A3(1 − A1)(1 − A2) δX3 + · · ·

where A1, A2, . . . , X1, X2, . . . , V are independent, A1, A2, . . . are i.i.d. B(1, ∑_{i=1}^k αi),
and X1, X2, . . . are i.i.d. with Q(X = i) = αi/(∑_{j=1}^k αj) for i = 1, . . . , k. Thus the right
hand side is a representation of the finite dimensional Dirichlet distribution Dα .
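This series representation can be simulated directly. The sketch below is ours; the truncation depth and the particular α are illustrative assumptions. It builds the right hand side term by term and checks that the coordinate means agree with those of Dα.

# Sketch (illustration only): the series representation of D_alpha, truncated at a finite depth.
import numpy as np

rng = np.random.default_rng(4)
alpha = np.array([1.0, 2.0, 3.0])
a0 = alpha.sum()
k = len(alpha)
m, depth = 100000, 200

V = np.zeros((m, k))
remaining = np.ones(m)                       # (1-A_1)...(1-A_{j-1})
for _ in range(depth):
    A = rng.beta(1.0, a0, size=m)            # A_j ~ B(1, sum alpha_i)
    X = rng.choice(k, size=m, p=alpha / a0)  # X_j with Q(X = i) = alpha_i / sum alpha_j
    V[np.arange(m), X] += remaining * A      # add weight A_j * prod_{l<j}(1-A_l) to coordinate X_j
    remaining *= 1.0 - A

print(V.sum(axis=1).min())                   # close to 1 after a deep enough truncation
print(V.mean(axis=0), alpha / a0)            # coordinate means approx alpha_i / sum alpha_j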
3 The nonparametric problem
Let X1 , X2 . . . be i.i.d F or P . Recall that a probability measure is a set
function
P = P (B), B ∈ B
satisfying
• P (A) ≥ 0 for all A ∈ B,
• P (R1 ) = 1 and P (φ) = 0, and
• if A1, A2, · · · ∈ B are pairwise disjoint, then P(∪_{1}^{∞} Ai) = ∑_{1}^{∞} P(Ai).
Question: How to verify if a set function P is a probability measure on
the real line? Do we know all the probability measures on the real line?
Let
P ((−∞, x]) = F (x)
The function F (x) is called the distribution function (df) associated with the
probability measure P . It satisfies the conditions:
• F is monotone non-decreasing,
• F (x) → 0 as x → −∞, F (x) → 1 as x → ∞, and
• F (x) is right continuous.
To show that F (x) satisfies the above conditions, we need to verify an
uncountable number of conditions. Consider the restriction F ∗ (x) of F on
{x ∈ R∗ }, where R∗ is the subset of rational numbers in R1 . It will satisfy
• F ∗ is monotone non-decreasing,
• F ∗ (x) → 0 as x → −∞, F ∗ (x) → 1 as x → ∞, and
• F ∗ (x) is right continuous
when viewed as a function on R∗ . The number of conditions that we have to
verify here is countable. Thus there is a 1 ↔ 1 correspondence among df’s
F ∗ , F and pm’s P . So, one can determine if P is a probability measure by
verifying just a countable number of conditions.
This result holds when we look at probability distributions in separable
complete metric spaces.
January 22, 2008
Let us recall some definitions and theorems:
Definition 3.1. A field is a non-empty class of subsets of Ω closed under
finite unions, finite intersections and complements, and containing Ω.
Definition 3.2. A σ-field is a non-empty class of subsets of Ω closed under
countable unions, countable intersections and complements, and containing
Ω.
Definition 3.3. P0 is countably additive on a field F0 if

P0(∪_{1}^{∞} An) = ∑_{1}^{∞} P0(An)

for every collection {An ∈ F0, n ≥ 1} of disjoint sets whose union ∪_{1}^{∞} An is
in F0.
Definition 3.4. P is a probability measure (pm) if it is defined on a σ-field
F, P(Ω) = 1 and P is countably additive on F, i.e. if

P(∪_{1}^{∞} An) = ∑_{1}^{∞} P(An)

for every collection {An ∈ F, n ≥ 1} of disjoint sets.
Definition 3.5. Let
F0 = finite unions of sets of the form {(a, b], −∞ ≤ a < b < ∞}
and
F = smallest sigma field containing F0 .
The σ-field F is called the Borel σ-field in R.
Theorem 3.1. Caratheodory Extension Theorem: Let F0 be a field of
subsets of Ω, and let P0 : F0 → [0, 1] be a countable additive set function
on F0 with P0 (Ω) = 1. Then there is a unique probability measure P on
F = σ(F0 ) that extends P0 .
Recall the field F0 of finite unions of half-open intervals in R and the
Borel σ-field F in R.
A function F on R is a distribution function if
I limx→−∞ F (x) = 0, limx→∞ F (x) = 1,
II F is non-decreasing, i.e. F (x) ≤ F (y) if x ≤ y, and
III F is right continuous, i.e limx↓y F (x) =F (y).
Let F be a distribution function and define P0 ((a, b]) = F (b)−F (a). Then
P0 is well-defined on F0 and P0 is countable additive on F0 .
By the Caratheodory Extension Theorem, there is a unique extension P
of P0 which is a probability measure on F .
Thus there is a 1-1 correspondence between distribution functions and
probability measures.
If we want to show that some function is a distribution function, there
are uncountably many conditions to verify. We can decrease that to countably
many conditions as follows. Let F ∗ (x), x ∈ R∗, be the restriction of the df F to
the rationals R∗ . Then F ∗ will satisfy conditions I, II and III.
Conversely, if F ∗ is a function on R∗ , it takes only a countable number
of conditions to verify that it satisfies conditions I, II and III.
We can extend such a function F ∗ to a unique df F on the real line and
to a unique pm P on the Borel sets.
Let P = {all probability measures P on (R, F )}. If we are going to place
a pm on P we will need a σ-field on this space.
Let NA,r = {P : P(A) < r} where A ∈ F, 0 ≤ r ≤ 1. Define σ(P)
to be the smallest σ-field containing NA,r for all A ∈ F and r ∈ [0, 1]. This
σ-field is also the smallest σ-field under which P(A) is measurable for all
A ∈ F. Furthermore, {P : ∫ f dP < r} will be in σ(P) if f is bounded and
measurable.
Let A = (A1 , . . . Ak ) be a finite measurable partition of R. Suppose that
we can assign ΠA as the probability distribution of P (A) = (P (A1 ), . . . ,
P (Ak )), for each finite measurable partition A. We will examine conditions
under which this defines a pm on P.
We will definitely need a consistency condition. Let A∗ be the partition
obtained by unions of some disjoint collections of subsets of A. Then we can
add the corresponding elements in P (A) to form P (A∗ ). Call this mapping
φA,A∗ . The consistency condition we need is
Condition A: ΠA φ_{A,A∗}^{−1} = ΠA∗ .
Assignment problem: Let h(θ1, . . . , θk) be the pdf of the prior distribution for
θ. Show that the posterior distribution of θ given X1 = i has pdf ∝ θi h(θ1, . . . , θk).
As a particular example which satisfies the consistency condition A, we
can postulate that (P (A1 ), . . . , P (Ak )) ∼ D(α(A1 ), . . . , α(Ak )) for each finite
partition A, where α(·) is a non-zero finite countably additive measure on
(R, F ). The consistency condition A holds because
(P (A1 ) + P (A2 ), P (A3), . . . P (Ak )) ∼ D(α(A1) + α(A2 ), α(A3), . . . , α(Ak ))
∼ D(α(A1 ∪ A2 ), α(A3), . . . , α(Ak ))
∼ (P (A1 ∪ A2 ), P (A3 ), . . . , P (Ak )).
January 24, 2008
4 Definition of distributions of (P, σ(P))
It is intuitive that a distribution for a probability measure which is but a
set function (P (A), A ∈ F ) could be defined by specifying the finite dimensional distributions of P (A) = (P (A1 ), . . . , P (Ak )) for all finite measurable
partitions A = (A1 , . . . , Ak ) of R. In this section we present this idea in a
rigorous fashion.
Let A∗ = (A∗1, . . . , A∗m) be a sub-partition of A, i.e. every set in A∗ is a
union of sets in A. Clearly, P(A∗i) = ∑_{j: Aj⊂A∗i} P(Aj), i = 1, . . . , m. This
defines a function φA,A∗ from P(A) to P(A∗). If we know the distribution
of P (A), then we know the distribution of P (A∗ ), and it should agree with
its assigned distribution. We call this the consistency condition among the
finite dimensional distributions of P (A).
We can now formally state the following theorem.
Theorem 4.1. Consider a pm P on (R, F), the real line with its Borel σ-field. Suppose that we can assign distributions πA for P(A) = (P(A1), . . . , P(Ak))
for each finite partition A of R which satisfy the following two conditions.
• Consistency: if A∗ is a sub-partition of A then πA∗ = πA φ_{A,A∗}^{−1}, and
• Continuity: if An ց ∅ then P(An) ց 0 in distribution. (Note
that this condition is equivalent to E(P(An)) ց 0 since {P(An)} is
bounded.)
Then there is a unique pm ν on (P, σ(P)) such that the distribution of P (A)
under ν is πA for all finite partitions A.
Proof: Some preliminaries:
Consider (R, F ), the Real line with its Borel σ-field. Let P be the class
of all probability measures P on (R, F ). Given a pm P we can consider its
df F and its restriction F ∗ to the rationals. Let H be the class of all df’s (i.e.
functions on R satisfying conditions I, II and III) and let H∗ be the class of
all functions F ∗ (i.e. functions on R∗ satisfying conditions I, II and III). We
have shown that there is a 1 ↔ 1 relationship between a pm P ∈ P, F ∈ H
and F ∗ ∈ H∗ .
We can write this as
P ↔ F ↔ F ∗.
Let us denote the mapping from F ∗ to P as φ : H∗ → P.
The space H∗ can also be viewed as a subspace of [0, 1]∞ , and we incorporate
the standard product σ-field into H∗ . The function φ is a measurable map
from H∗ to (P, σ(P)). Thus if one can define a pm ν ∗ on H∗ , then ν = ν ∗ φ−1
will be a probability measure on (P, σ(P)). This is one easy way to define
pm’s on (P, σ(P)). This is the end of the preliminaries.
Consider the space H∗ of df's F ∗ restricted to the rationals. By identifying F ∗ (x) = P ((−∞, x]), x ∈ R∗ , one can define the distribution πx1 ,...,xk
of (F ∗ (x1 ), . . . F ∗ (xk )) for finite collection of points (x1 , . . . xk ) in R∗ in a
consistent fashion because the collection πA is consistent.
From the Kolmogorov consistency theorem there exists a unique pm ν ∗
on [0, 1]∞ under which (F ∗ (x1 ), . . . F ∗ (xk )) has distribution πx1 ,...,xk .
Let xn ց x where xn , x are in R∗ . Then the distribution of F ∗ (xn ) − F ∗ (x)
is that of P ((x, xn ]) and tends to 0 in distribution and with probability 1 since
we have defined a joint distribution of {F ∗ (x), x ∈ R∗ }. To verify that F ∗
is right continuous on R∗ we need to verify only a countable number of such
cases. Thus we conclude that F ∗ is right continous in R∗ with ν ∗ -probability
1. It is also clear that F ∗ is non-decreasing in R∗ with ν ∗ -probability 1.
Hence the pm ν ∗ is supported on the subset H∗ of [0, 1]∞ .
We can now use the mapping φ : F ∗ → P from H∗ to P to obtain ν ∗ φ−1
which is now a pm on (P, σ(P)). Furthermore the distribution of P (A) under
ν is πA . This completes the proof of this theorem.
5 First definition of Dirichlet distributions (processes)
We will illustrate the method described in the previous section with an example. Let α(·) be a non-zero finite measure on (R, F). Postulate that the
distribution of P (A) = (P (A1 ), . . . P (Ak )) is the finite dimensional Dirichlet
distribution D(α(A1 ), . . . α(Ak )), for all finite partitions A of R. The properties of finite dimensional Dirichlet distributions show that the distributions
{πA } are consistent.
We know that the distribution of P(A) is the Beta distribution
B(α(A), α(A^c)); thus E(P(A)) = α(A)/α(R). Hence, if An ց ∅, then E(P(An)) = α(An)/α(R)
ց 0 and the distributions {πA} satisfy the continuity condition.
Hence there is a unique pm Dα on (P, σ(P)) such that the distribution
of P (A) = (P (A1 ), . . . , P (Ak )) is D(α(A1 ), . . . , α(Ak )), for all finite partitions
A of R. This is the Dirichlet measure (also called a Dirichlet process) of
Ferguson.
6 Posterior distribution under the Dirichlet prior
We will now show how to obtain the posterior distribution under a Dirichlet
prior.
Now let P ∈ P and P ∼ D α . Let X be a r.v. such that X|P ∼ P . The
distribution of P |X is the posterior distribution given X.
The posterior distribution is also a distribution on (P, σ(P)). We will
define such a distribution by employing the technique outlined in Section 4;
in other words we will find the posterior finite dimensional distributions of
P (A) for all finite partitions A.
Let Q be the joint distribution of (X, P ). We will find
Q(P (A) ∈ B|X)
for all finite partition A and appropriate sets B (namely measurable subsets
of Rk ).
This conditional probability Q(P (A)|X) is a function satisfying
E(Q(P (A) ∈ B|X)I(X ∈ C)) = E(I(P (A) ∈ B)I(X ∈ C))
for all C ∈ F .
Simplifying the right hand side (RHS) we get
RHS = Q(P(A) ∈ B, X ∈ C) = E[Q(X ∈ C|P) I(P(A) ∈ B)]        (6.3)
    = E[P(C) I(P(A) ∈ B)].
Let Ai1 = Ai ∩ C, Ai2 = Ai ∩ C c . Then C = ∪k1 Ai1 .
We can write the RHS in (6.3) as
E[P(C) I((P(A1), . . . , P(Ak)) ∈ B)] = ∑_{i=1}^k E[P(Ai1) I((P(A1), . . . , P(Ak)) ∈ B)].        (6.4)
Using the fact that the distribution of (P(A11), P(A12), P(A2), . . . , P(Ak)) is
D(α(A11), α(A12), α(A2), . . . , α(Ak)), we can write the first term
in the summation in (6.4) as

E(P(A11) I((P(A11) + P(A12), P(A2), . . . , P(Ak)) ∈ B))
 = ∫_{(y11+y12, y2, ..., yk)∈B} y11 dD(α(A11), α(A12), α(A2), . . . , α(Ak))
 = (α(A11)/α(R)) ∫_{(y11+y12, y2, ..., yk)∈B} dD(α(A11) + 1, α(A12), α(A2), . . . , α(Ak))
 = (α(A11)/α(R)) ∫_B dD(α(A1) + 1, α(A2), . . . , α(Ak))
 = E[D((α + δX)(A1), . . . , (α + δX)(Ak))(B) I(X ∈ A11)],

since ((α + δX)(A1), . . . , (α + δX)(Ak)) = (α(A1) + 1, α(A2), . . . , α(Ak)) when
X ∈ A11. Thus equation (6.3) becomes
RHS = ∑_{i=1}^k E[D((α + δX)(A))(B) I(X ∈ Ai1)] = E[D((α + δX)(A))(B) I(X ∈ C)],

and hence

Q(P(A) ∈ B|X) = D((α + δX)(A))(B).
Therefore the distribution of P (A) under the posterior distribution is the
finite dimensional Dirichlet distribution D((α + δX )(A)). From the technique in Section 4, these finite dimensional distributions uniquely define the
posterior distribution of P given X to be
Dα+δX .
January 29, 2008
7 Posterior distribution given n observations
Let the parameter in the nonparametric problem, namely the unknown probability measure P , have Dα as its prior distribution . Suppose that the data
consists of just one observation X (taking values in (X , F )), which is such
that X|P ∼ P . We saw in the last class that the posterior distribution,
the conditional distribution of P given X, can be written as P |X ∼ Dα+δx .
What if the data consisted of a sequence of observations X1, . . . , Xn?
To repeat, suppose that X1, X2, . . . , Xn|P are i.i.d. P. Then what is the distribution of P|X1, . . . , Xn? Let us first obtain the distribution of P given
X1, X2.
Note that given X1 , P has distribution Dα+δX1 and the distribution of
X2 , given (X1 , P ) is P . From this it follows that the distribution of P given
X1 , X2 is Dα+δX1 +δX2 . In the same way, it follows that the distribution of P
given X1, . . . , Xn is Dα+∑_{1}^{n} δXi .
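Seen through a fixed finite partition, this posterior is just a finite dimensional Dirichlet with updated parameters. The following sketch is ours: the base measure α = c·N(0, 1), the partition, the data, and the use of numpy/scipy are illustrative assumptions.

# Sketch (illustration only): the posterior D_{alpha + sum delta_{X_i}} over a finite partition.
import numpy as np
from scipy import stats

c = 2.0                                       # total mass alpha(R)
G0 = stats.norm()                             # base distribution, alpha = c * G0
edges = np.array([-np.inf, -1.0, 0.0, 1.0, np.inf])   # partition A_1,...,A_4 of R

rng = np.random.default_rng(5)
x = rng.normal(loc=0.5, size=25)              # observed sample X_1,...,X_n

alpha_cells = c * np.diff(G0.cdf(edges))      # alpha(A_j)
counts = np.array([np.sum((x > lo) & (x <= hi))     # sum_i delta_{X_i}(A_j)
                   for lo, hi in zip(edges[:-1], edges[1:])])

# Under the posterior, (P(A_1),...,P(A_4)) ~ Dirichlet(alpha(A_j) + counts_j).
post = rng.dirichlet(alpha_cells + counts, size=100000)
print((alpha_cells + counts) / (c + len(x)))  # exact posterior mean E(P(A_j) | data)
print(post.mean(axis=0))                      # Monte Carlo check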
8 A little discussion on conditional probability
The conditional probability of A given B, denoted by P(A|B), is given by
P(AB)/P(B), and similarly P(A|B^c) = P(AB^c)/P(B^c), when P(B) > 0 and P(B^c) > 0.
However, it is helpful to consider P(A|σ(B)), where σ(B) = {∅, B, B^c, Ω}
is the σ-field generated by B, as a function. Thus we will say that the
conditional probability P(A|σ(B)) is a function of ω which satisfies the condition

P(A|σ(B))(ω) = P(AB)/P(B) if ω ∈ B, and P(A|σ(B))(ω) = P(AB^c)/P(B^c) if ω ∈ B^c.
More generally, the conditional probability P (A|σ(B)) is a σ(B) measurable function satisfying
E(P (A|σ(B))(ω)I(ω ∈ C)) = E(I(ω ∈ A)I(ω ∈ C)).
for each C ∈ σ(B).
Let (Ω, F, P) be a probability measure space. Let B be a sub-σ-field of
F. Also let A ∈ F. Then the conditional probability of A given B is a function
measurable with respect to B such that

∫ P(A|B)(ω) I(ω ∈ C) dP = P(A ∩ C) for each C ∈ B.

Note that the function I(ω ∈ A) also satisfies this condition, except that
it need not be measurable with respect to B (in the interesting case when A ∉ B),
and so will not be a candidate for a conditional probability.
Definition 8.1. If there is a version P (A|B)(ω) of the conditional probability
such that it is a probability measure in A for each ω, then we will say that
it is a regular conditional probability.
If Ω is R1, or Rk, or a separable complete metric space, then a regular conditional probability exists.
9 Martingales
Example 9.1. Let X1 , X2 . . . be i.i.d with E(Xi ) = 0. Let Sn = X1 +. . . Xn .
Then
E(Sn |X1 , . . . Xn−1 ) = E(Sn−1 + Xn |X1 , . . . Xn−1 )
= Sn−1 + E(Xn |X1 , . . . Xn−1 )
= Sn−1 + E(Xn )
= Sn−1 .
The sequence {Sn } is an example of a martingale.
More formally:
Definition 9.1. Let Fn be an increasing sequence of σ-fields and Xn be Fn
measurable, n = 1, 2, . . . . Let E(|Xn |) < ∞, n = 1, 2, . . . . Then {Xn , Fn } is
said to be a martingale if
E(Xn |Fn−1 ) = Xn−1 , n = 2, 3, . . . .
Theorem 9.1. Let {(Xn , Fn )} be a martingale which is L1 bounded, i.e,
there is K < ∞ such that E(|Xn |) ≤ K for all n. Then there exists a
random variable X∞ , such that Xn −→ X∞ with probability 1.
In our earlier example, E(|Sn |) need not be bounded in n, i.e. {Sn } need not be
L1 bounded, and so this theorem does not apply.
The following is an example of another martingale which is L1 -bounded
and hence convergent. Furthermore, it is uniformly integrable.
Theorem 9.2. Let Fn be a sequence of σ-fields increasing to F∞ . Let X be
a r.v. with E(|X|) < ∞. Let Xn = E(X|Fn ). Then {Xn } is a uniformly
integrable martingale and Xn = E(X|Fn ) → E(X|F∞ ) with probability
1 and in L1 .
Definition 9.2. Let {Fn } be a decreasing sequence of σ-fields and let Xn be Fn -measurable
with E(|Xn |) < ∞, n = 1, 2, . . . . We say that {Xn } is a reverse martingale
if
E(Xn |Fn+1 ) = Xn+1
for n = 1, 2 . . . .
The following is a result for reverse martingales.
Theorem 9.3. Let {(Xn , Fn )} be a reverse martingale. Then there exists a
random variable X∞ , such that Xn −→ X∞ with probability 1 and in L1 (i.e.
E(|Xn − X∞ |) −→ 0.)
The following is an example of a reverse martingale. Let {Fn } be a sequence of σ-fields decreasing to F∞ . Let X1 be F1 -measurable and let E(|X1 |) < ∞.
Then the sequence {Xn = E(X1 |Fn )} is a reverse martingale converging to
E(X1 |F∞ ).
10 Product measure spaces
Suppose that (X , F ) is a probability measure space. The product σ-field F 2
in the product space X 2 is defined as follows. Consider the rectangle set
A × B = {(x1 , x2 ) : x1 ∈ A, x2 ∈ B}. The smallest σ-field containing such
rectangle sets is the σ-field F 2 .
Question: How to induce a product σ-field in X ∞ ?
Notice that a set C ∈ F² can be viewed as a subset C∗ of X∞ as follows:

C∗ = {(x1, x2, . . . ) : (x1, x2) ∈ C, x3 ∈ X, x4 ∈ X, . . . } ⊂ X∞.

Let F²∗ = {C∗ : C ∈ F²}. Thus we can define σ-fields F²∗ ⊂ F³∗ ⊂ · · · . Let
F0 = ∪_{1}^{∞} Fⁿ∗, which is not a σ-field. The product σ-field F∞ in X∞ is defined
to be σ(F0), the smallest σ-field containing F0.
Theorem 10.1. Let λn be a probability measure on (Xⁿ, Fⁿ), n = 1, 2, . . . ,
and let {λn} be a consistent sequence. Then there exists a unique probability
measure λ∞ on (X∞, F∞) which extends each λn, i.e. λn(C) = λ∞(C∗) for all
C ∈ Fⁿ, n = 1, 2, . . . .
Let us look at some other σ-fields on X ∞ .
Consider the set A = {X : x1 + x2 ≤ 10}. If (x1 , x2 , x3 , . . . ) ∈ A, then
(x2 , x1 , x3 , . . . ) ∈ A. We can say that the set A is invariant under τ , the
permutation of the first two coordinates of (x1 , x2 , . . . ), i.e. under τ defined
by τ(x1, x2, x3, . . . ) = (x2, x1, x3, . . . ). Define G² = {C ∈ F∞ : τ(C) = C}.
Then G² is a σ-field. Define the σ-field Gⁿ = {C ∈ F∞ : τC = C for every
permutation τ of the first n coordinates of (x1, x2, . . . )}. Then
F ∞ = G 1 ⊃ G 2 ⊃ · · · ⊃ G n −→ G ∞ .
The σ-field G ∞ is called the invariant σ-field in X ∞ .
There is also a σ-field T called the tail σ-field contained in the invariant
σ-field G ∞ .
Definition 10.1. A sequence of random variables (X1, X2, . . . ) is said to be
exchangeable if the distribution of (X1, . . . , Xn) is the same as that of (Xi1, . . . , Xin) for all
permutations (i1, . . . , in) of (1, . . . , n), n = 2, 3, . . . . If Q is the distribution
of (X1, X2, . . . ), then Q is said to be exchangeable.
An example of an exchangeable sequence of random variables is a sequence
of i.i.d. random variables X1 , X2 , . . . .
Let (X1, X2, . . . ) be exchangeable. Let f be a bounded measurable function on X. Since (X1, X2, X3, . . . ) ∼ (X2, X1, X3, . . . ) and (f(X1) + f(X2))/2 is G²-measurable,

E(f(X1)|G²) = E(f(X2)|G²) = E((f(X1) + f(X2))/2 | G²) = (f(X1) + f(X2))/2.

More generally,

E(f(X1)|Gⁿ) = (1/n) ∑_{i=1}^n f(Xi) =def An(f).
Therefore {An (f ), G n } is a reverse martingale with E(|A1 (f )|) < ∞. Hence
An (f ) → A∞ (f ) = E(f (X1 )|G ∞ )
with probability 1 and E(|An (f ) − A∞ (f )|) → 0.
January 31, 2008
11 Review
Let Xi ∈ X for i = 1, 2, . . . . The random variables (X1, X2, . . . ) with
joint distribution Q are exchangeable if (X1, . . . , Xn) ∼ (Xi1, . . . , Xin) for
all permutations (i1, . . . , in) of (1, 2, . . . , n), for n = 2, 3, . . . . In other words,
(X1, X2, . . . ) is exchangeable if Qτ⁻¹ = Q for all finite permutations τ defined
as

τ(X1, X2, . . . , Xn, Xn+1, . . . ) = (Xi1, . . . , Xin, Xn+1, . . . ),

for n = 1, 2, . . . .
A simple example of exchangeable random variables is a sequence
(X1 , X2 , . . . ) of i.i.d. random variables.
Another example of exchangeable random variables is given by:
Example 11.1. Let X1, X2, . . . be r.v.'s on (X, F). Let θ be a r.v., perhaps on another space, with distribution π. Suppose that

Q(X1 ∈ A1, . . . , Xn ∈ An|θ) = Pθ(A1) · · · Pθ(An)

for all A1 ∈ F, . . . , An ∈ F, n = 1, 2, . . . , where {Pθ} is a family of probability
measures indexed by θ. Then

Q(X1 ∈ A1, . . . , Xn ∈ An) = E(Q(X1 ∈ A1, . . . , Xn ∈ An|θ))
 = ∫ ∏_{i=1}^n Pθ(Ai) dπ(θ)
 = Q(X1 ∈ Ai1, . . . , Xn ∈ Ain)
 = Q(Xj1 ∈ A1, . . . , Xjn ∈ An)
where (j1 , . . . , jn ) is the anti-permutation of (i1 , . . . , in ). Hence (X1 , X2 , . . . )
is exchangeable.
Theorem 11.1. (De-Finetti's Theorem) Let (X1, X2, . . . ) ∼ Q be exchangeable. Then there is a random probability measure P = P(·, ω) =
P(A, ω), taking values in P, the space of probability measures on (X, F), such that

Q(X1 ∈ A1, . . . , Xn ∈ An|P) = ∏_{i=1}^n P(Ai)
with probability 1 for all (A1 , . . . , An ). In other words, X1 , X2 , . . . |P are i.i.d
P.
The distribution of P will depend on Q and can be denoted by νQ . One of
the consequences of this theorem is that every exchangeable distribution Q
of (X1 , X2 , . . . ) determines a probability measure νQ on P, which can serve
as a prior distribution for us in a nonparametric problem.
To prove the De-Finetti Theorem, we will use the theorem from our last
lecture. Recall that Gⁿ = {A ∈ F∞ : τA = A for all permutations τ
of (1, . . . , n)}, n = 1, 2, . . . , and F∞ = G¹ ⊃ G² ⊃ · · · ⊃ G∞. Let f be a
bounded measurable function; then An(f) = (1/n) ∑_{i=1}^n f(Xi) = E(f(X1)|Gⁿ).
We showed that {An (f ), G n } is a reverse martingale and An (f ) → A(f ) =
E(f (X1 )|G ∞ ) with probability 1 and in L1 .
This result can be used to prove Kolmogorov's strong law of large numbers
and also that E(|(1/n) ∑_{i=1}^n Xi − E(X1)|) → 0.
Proof of De-Finetti Theorem: Recall that (X, F, P) is a probability
space. Let A ∈ F and put f(X1) = I(X1 ∈ A). We also know that

Fn(A) = (1/n) ∑_{i=1}^n I(Xi ∈ A) = E(I(X1 ∈ A)|Gⁿ) = Q(X1 ∈ A|Gⁿ)

and that Fn(A) is a reverse martingale.
Hence, from the reverse martingale convergence theorem (Theorem 9.3),
Fn (A) → F (A) with probability 1 and in L1 , where F (A) = E(I(X1 ∈
A)|G ∞ ) = Q(X1 ∈ A|G ∞ ).
We also know that Q(X1 ∈ A|G ∞ ) exists as a regular conditional probability,
i.e. it is equal to P (A, ω), where P (A, ω) is measurable in ω for each A and
is a probability measure in A for each ω.
Thus we can say that Fn (A) → P (A, ·) with probability 1 and in L1 , for
each A. We can also say that the random probability measures Fn (·) converge
weakly to the random probability measure P (·) with probability 1, since weak
convergence can be determined by a countable number of conditions.
Let A1, A2 ∈ F. For n ≥ 2,

E(I(X1 ∈ A1, X2 ∈ A2)|Gⁿ) = (1/(n(n−1))) ∑_{i≠j} I(Xi ∈ A1, Xj ∈ A2)
 = (1/(n(n−1))) (∑_{i=1}^n I(Xi ∈ A1)) (∑_{j=1}^n I(Xj ∈ A2)) − (1/(n(n−1))) ∑_{i=1}^n I(Xi ∈ A1) I(Xi ∈ A2)
 → P(A1, ω) P(A2, ω)

with probability 1 and in L1.
Since

(1/n) ∑_{i=1}^n I(Xi ∈ A1) → P(A1, ω), (1/n) ∑_{j=1}^n I(Xj ∈ A2) → P(A2, ω), and
(1/n) ∑_{i=1}^n I(Xi ∈ A1) I(Xi ∈ A2) = (1/n) ∑_{i=1}^n I(Xi ∈ A1 ∩ A2) → P(A1 ∩ A2, ω)

with probability 1 and in L1, we obtain

E(I(X1 ∈ A1, X2 ∈ A2)|Gⁿ)(ω) → E(I(X1 ∈ A1) I(X2 ∈ A2)|G∞)(ω) = P(A1, ω) P(A2, ω)

with probability 1 and in L1, where P(A, ω) is a regular conditional probability and a version of Q(X1 ∈ A|G∞)(ω).
Thus X1 , X2 . . . |G ∞ are i.i.d P (·).
We can say a little more. Since the random probability measure P (·)
is equal to Q(·|G ∞ ), it is G ∞ -measurable. We can therefore conclude that
X1 , X2 , . . . |P are i.i.d. P . Here we have used the following fact - If E(X|S)
is T -measurable where T ⊂ S then E(X|S) = E(X|T ).
We want to push this discussion some more. For instance, since Fn (A) →
P(A, ·) in L1, we obtain E(P(A)) = lim E(Fn(A)) = lim E(Q(X1 ∈ A|Gⁿ)) =
Q(X1 ∈ A). In fact, we can find all moments of the random probability
measure P .
Let A1, . . . , Ak be subsets in F. We will evaluate
E(P^{r1}(A1) P^{r2}(A2) · · · P^{rk}(Ak)) where r1, . . . , rk are
non-negative integers. Let n ≥ r1 + · · · + rk. Consider

E(I(X1 ∈ A1, . . . , Xr1 ∈ A1, Xr1+1 ∈ A2, . . . , Xr1+···+rk ∈ Ak)|Gⁿ)
 → ∏_{i=1}^{r1} Q(Xi ∈ A1|G∞) ∏_{i=r1+1}^{r1+r2} Q(Xi ∈ A2|G∞) · · · ∏_{i=r1+···+rk−1+1}^{r1+···+rk} Q(Xi ∈ Ak|G∞)
 = ∏_{i=1}^k P^{ri}(Ai)

with probability 1 and in L1. Thus

Q(X1 ∈ A1, . . . , Xr1 ∈ A1, Xr1+1 ∈ A2, . . . , Xr1+r2 ∈ A2, . . . , Xr1+···+rk ∈ Ak) = E(∏_{i=1}^k P^{ri}(Ai)).
12 Pólya sequences
We will look at examples of exchangeable sequences called Pólya sequences.
Let X = (1, . . . , k) and let α = (α(1), . . . α(k)) where α(1), . . . α(k) are nonnegative numbers and α(1) + · · · + α(k) = α(X ) > 0.
The joint distribution Qα of X1 , X2 , . . . is defined as follows:
Qα(X1 = i) = α(i)/α(X),
Qα(Xn+1 = i|X1, . . . , Xn) = (α(i) + ∑_{j=1}^n δXj(i))/(α(X) + n),        (12.5)

for n = 1, 2, . . . .
We can understand the distribution Qα as a sampling from an urn with
balls of k colors with weights α(i), i = 1, . . . , k. The probability of drawing
a ball of color i is proportional to the weight of that ball. Let X1 be the
color of the first ball drawn. Thus Q(X1 = i) = α(i)/α(X). Every ball drawn is
replaced by a ball of the same color with its weight increased by 1. Then
the next ball is drawn. The distribution of X1, X2, . . . satisfies (12.5) and
this provides a model for a Pólya sequence on a finite space.
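The urn scheme translates directly into code. The sketch below is ours; the weights and the sample size are arbitrary illustrative choices. It draws a Pólya sequence on {1, . . . , k} according to (12.5).

# Sketch (illustration only): sampling a Polya sequence on {1,...,k} by the urn scheme.
import numpy as np

def polya_sequence(alpha, n, rng):
    """Draw X_1,...,X_n from the Polya urn with initial weights alpha(1),...,alpha(k)."""
    weights = np.array(alpha, dtype=float)
    draws = []
    for _ in range(n):
        i = rng.choice(len(weights), p=weights / weights.sum())
        draws.append(i + 1)            # colors are labelled 1,...,k
        weights[i] += 1.0              # replace the ball and add 1 to its weight
    return draws

rng = np.random.default_rng(6)
print(polya_sequence([1.0, 2.0, 3.0], n=20, rng=rng))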
We will now evaluate the joint distributions of this Pólya sequence.
Q((X1, . . . , Xn) = (1, . . . , 1, 2, . . . , 2, . . . , k, . . . , k))   [r1 ones, r2 twos, . . . , rk k's]
 = α(1)(α(1) + 1) · · · (α(1) + r1 − 1) · · · α(k)(α(k) + 1) · · · (α(k) + rk − 1) / [α(X)(α(X) + 1) · · · (α(X) + r1 + · · · + rk − 1)]
 = α^{[r1]}(1) · · · α^{[rk]}(k) / α^{[r1+···+rk]}(X),

where a^{[r]} = a(a + 1) · · · (a + r − 1). Since this probability remains unchanged if the values of the colors X1, . . . , Xn are permuted, the probability
distribution Qα is exchangeable.
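The product-of-predictives form of the joint probability and the closed formula above can be compared numerically; the sketch below is ours, with an arbitrary α and an arbitrary color sequence.

# Sketch (illustration only): the Polya joint probability, two ways.
import numpy as np

def ascending(a, r):
    """a^[r] = a (a+1) ... (a+r-1), with a^[0] = 1."""
    out = 1.0
    for j in range(r):
        out *= a + j
    return out

def joint_prob_sequential(alpha, seq):
    """Multiply Q(X1) Q(X2|X1) ... as in (12.5)."""
    alpha = np.array(alpha, dtype=float)
    counts = np.zeros_like(alpha)
    p = 1.0
    for i in seq:
        p *= (alpha[i] + counts[i]) / (alpha.sum() + counts.sum())
        counts[i] += 1
    return p

alpha = [1.0, 2.0, 3.0]
seq = [0, 2, 2, 1, 0]                                  # colors 1,3,3,2,1 (0-indexed here)
r = [seq.count(i) for i in range(3)]
formula = np.prod([ascending(a, ri) for a, ri in zip(alpha, r)]) / ascending(sum(alpha), len(seq))
print(joint_prob_sequential(alpha, seq), formula)      # equal, and invariant under permuting seq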
We can extend this example to general spaces X. Let α be a non-negative
finite measure on the measurable space (X, F). Define Qα, the distribution
of the Pólya sequence X1, X2, . . . , by

Q(X1 ∈ A) = α(A)/α(X),
Q(Xn+1 ∈ A|X1, X2, . . . , Xn) = (α(A) + ∑_{i=1}^n δXi(A))/(α(X) + n)        (12.6)

for n = 1, 2, . . . .
Let φ : X → Y, and let Y1 = φ(X1 ),Y2 = φ(X2 ), . . . . Then Y1 , Y2 , . . . is a
Pólya sequence with parameter β = αφ−1 .
In particular, let φ : X → {1, 2, . . . , k} with φ(x) = i ↔ x ∈ Ai , i =
1, . . . , k where (A1 , . . . , Ak ) is a finite partition of X . Then Y1 , Y2 , . . . has
the Pólya distribution Q(α(A1),...,α(Ak)). From our previous calculations,

Q(X1 ∈ A1, . . . , Xr1 ∈ A1, Xr1+1 ∈ A2, . . . , Xr1+···+rk ∈ Ak) = ∏_{i=1}^k α^{[ri]}(Ai) / α^{[r1+···+rk]}(X).
Feb. 5, 2008
13 Review
Let the random variables X1, X2, . . . , with values in (X∞, F∞) and with joint
distribution Q, be exchangeable. Let Fn(A) = (1/n) ∑_{i=1}^n I(Xi ∈ A), A ∈ F, be the
empirical distribution of X1 , . . . , Xn . We established that
Fn (A) → P (A) with probability 1 and in L1 ,
for each A ∈ F , where P (A) = Q(X1 ∈ A|G ∞ ) is a regular conditional
probability measure. Recall that G ∞ is the invariant σ-field in X ∞ . We also
established that
Q(X1 ∈ A1, . . . , Xn ∈ An|G∞) = ∏_{i=1}^n Q(Xi ∈ Ai|G∞) = ∏_{i=1}^n P(Ai)
with probability 1 for each A1 , . . . , An . Since both sides exist as regular
conditional probabilities, and two probability measures agree with each other
if they agree on a countable determining class of sets, we can also conclude
that
Q(X1 ∈ ·, . . . , Xn ∈ ·|G∞) = ∏_{i=1}^n Q(Xi ∈ ·|G∞) = ∏_{i=1}^n P(·)
with probability 1. And finally, since P(·) is G∞-measurable, we can further
state this result as

Q(X1 ∈ ·, . . . , Xn ∈ ·|P) = ∏_{i=1}^n Q(Xi ∈ ·|P) = ∏_{i=1}^n P(·).

In other words, X1, X2, . . . |P are i.i.d. P.
Can we also say that
Fn → P
with probability 1 in some sense?
Let us examine the case where X1 , X2 , . . . are i.i.d. F and also, hence,
exchangeable. The Glivenko-Cantelli theorem states that
sup_{x∈X} |Fn(x) − F(x)| → 0

with probability 1. But it is not true that

sup_{A∈F} |Fn(A) − F(A)| → 0.

For instance, let F be the standard normal distribution. Choose A =
{X1, . . . , Xn}. Then Fn(A) = 1, F(A) = 0 and sup_{A∈F} |Fn(A) − F(A)| ↛ 0.
However, we can say that Fn →w F with probability 1. In the same way, we can
also say that Fn →w P with probability 1 in the case of exchangeable random
variables, since µn →w µ if ∫ f dµn → ∫ f dµ for a countable collection of functions f.
14 Posterior distributions in Pólya distributions
Going back to the case of exchangeable random variables, let νQ be the distribution of P, which will depend on the joint distribution Q. What is the
posterior distribution of P , i.e. what is the distribution of P given X1 ?
We first look at the distribution of X2, X3, . . . given X1. This will also
be exchangeable; denote it by QX1. Note that, under this conditional joint
distribution, the sequence (1/n) ∑_{i=2}^n I(Xi ∈ A) also converges to P(A) with probability 1. Thus the distribution of P given X1 is νQX1. Thus we have
found the posterior distribution just by clever notation.
Definition 14.1. (Pólya sequence) Let X = {1, . . . , k} and α = (α(1), . . . , α(k)).
We say X1, X2, . . . is a Pólya sequence on X with parameter α and joint
distribution Qα on X∞ if

Qα(X1 ∈ A) = α(A)/α(X),
Qα(Xn+1 ∈ A|X1, . . . , Xn) = (α(A) + nFn(A))/(α(X) + n)

for n = 1, 2, . . . .
Let α = (α(1), . . . , α(k)) and denote this joint distribution as Q = Qα.
We can understand the joint distribution of X1 , X2 , . . . as the distribution
of the colors of balls chosen from an urn as follows. Initially the
urn contains balls of k colors with weights (α(1), . . . , α(k)). The probability
that a given ball is drawn from this urn is proportional to its weight. Each
time a ball is drawn, it is replaced and its weight is increased by
1 before the next ball is drawn. Let X1, X2, . . . be the colors of the balls that
are drawn. Then the distribution of this sequence is the same as the Pólya
sequence with parameter α = (α(1), . . . , α(k)). Hence
Q(X1 = i1, . . . , Xn = in) = ∏_{i=1}^k α^{[ri]}(i) / α^{[n]},

where ri = ∑_{j=1}^n I(Xj = i), n = ∑_{i=1}^k ri, α = ∑_{i=1}^k α(i). This probability depends
only on r1, . . . , rk and thus Qα is exchangeable.
We will now use this calculation to show that general Pólya sequences are
also exchangeable.
Let B = (B1, . . . , Bk) be a partition of X. What is Qα(X1 ∈ A1, . . . , Xn ∈
An) where A1, . . . , An ∈ B?
Let φ : X → Y = {1, 2, . . . , k}, i.e. φ(x) = i if x ∈ Bi. Then Y1, Y2, . . . is
a Pólya sequence on Y with parameter β = (αφ⁻¹({1}), . . . , αφ⁻¹({k})). Let
A1 = Bi1, . . . , An = Bin. Then

Qα(X1 ∈ A1, . . . , Xn ∈ An) = Q(Y1 = i1, . . . , Yn = in),

which is equal to Qα(X1 ∈ Aj1, . . . , Xn ∈ Ajn) for any permutation (j1, . . . , jn)
of (1, . . . , n). This is also true if the sets A1 , . . . , An are not sets from a
partition as seen below.
Let the sets A1 , . . . , An be arbitrary and let B be the partition generated
by A1 , . . . , An . We can write Qα (X1 ∈ A1 , . . . , Xn ∈ An ) as sums of the form
Qα (X1 ∈ B1 , . . . , Xn ∈ Bn ) where B1 , . . . , Bn come from B. Since each of these
is invariant under permutations, the joint distribution Qα is exchangeable.
We will now calculate moments from a Pólya sequence. Let Qα be the
distribution of the Pólya sequence with parameter α, where α is a non-negative
finite measure. Let (A1, . . . , Ak) be a partition of X. Then

EQα(P^{r1}(A1) · · · P^{rk}(Ak)) = Qα(X1 ∈ A1, . . . , Xr1 ∈ A1, . . . , Xr1+···+rk−1+1 ∈ Ak, . . . , Xn ∈ Ak)
 = Qα(Y1 = 1, . . . , Yr1 = 1, . . . , Yr1+···+rk−1+1 = k, . . . , Yn = k)
 = ∏_{i=1}^k α^{[ri]}(Ai) / α(X)^{[n]}.
From the characterization of a Dirichlet distribution by its moments, we get
(P (A1 ), . . . , P (Ak )) ∼ D(α(A1 ), . . . , α(Ak )).
Therefore we can identify νQα with the Dα that we defined earlier.
We will now find the posterior distribution of P . For this we will first
find the distribution of X2 , X3 , . . . given X1 . We know that
Qα(Xn+1 ∈ A|X1, . . . , Xn) = (α(A) + nFn(A))/(α(X) + n) = ((α(A) + δX1(A)) + ∑_{i=2}^n δXi(A))/((α(X) + 1) + (n − 1)).

Hence, the distribution of X2, X3, . . . given X1 is that of a Pólya sequence
with parameter α + δX1. Thus P|X1 ∼ νQX1 = Dα+δX1 .
15 The random pm P is discrete with probability 1
We will push these ideas some more. We know that if X1, X2, . . . is a Pólya
sequence with parameter α, then (1/n) ∑_{i=2}^n I(Xi ∈ A) → P(A) with probability 1 and
P(A) ∼ B(α(A), α(A^c)).
Hence

(1/n) ∑_{i=2}^n I(Xi ∈ {X1}) → P({X1})

and P({X1}) ∼ B(α({X1}) + 1, α(X − {X1})), and

E(P({X1})|X1) = (α({X1}) + 1)/(α(X) + 1) ≥ 1/(α(X) + 1).
Since this lower bound does not depend on X1, we also have

E(P({X1})) ≥ 1/(α(X) + 1).

In other words,

E(P(X − {X1})) ≤ α(X)/(α(X) + 1).

Similarly,

E(P(X − {X1, . . . , Xn})) ≤ α(X)/(α(X) + n) → 0

as n → ∞.
Therefore, P is a measure that puts all its mass on a countable number
of points or the random probability measure P is discrete with probability 1.
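The discreteness of P is visible in simulation: a Pólya sequence drawn according to (12.6) with a continuous α produces many exact ties. The sketch below is ours; taking α = c·N(0, 1) and the particular constants are illustrative assumptions.

# Sketch (illustration only): ties in a general Polya sequence (12.6) with alpha = c * N(0,1).
import numpy as np

rng = np.random.default_rng(7)
c, n = 2.0, 200
draws = []
for m in range(n):
    if rng.random() < c / (c + m):          # with prob alpha(X)/(alpha(X)+m): draw from the base
        draws.append(rng.normal())
    else:                                    # otherwise: repeat a uniformly chosen earlier draw
        draws.append(draws[rng.integers(m)])

print(len(draws), len(set(draws)))           # far fewer distinct values than draws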
We have used special properties of a Pólya sequence in this proof. Consider the example of exchangeable random variables X1 , X2 , . . . which are i.i.d
with distribution F where F is not discrete. This is not a Pólya sequence.
The limiting random measure P(A) = lim (1/n) ∑_{i=1}^n I(Xi ∈ A) is degenerate at
F and is not sitting on the class of discrete pm's.
February 7, 2008
16 Solution to a homework problem
Problem: Let X1 , X2 , . . . be a Pólya sequence on (X , F) with parameter
α. Denote its joint distribution by Qα. Let φ be a measurable function from
X to Y, i.e. let φ : (X, F) → (Y, G). Let Y1 = φ(X1), Y2 = φ(X2), . . . . Then
Y1 , Y2 , . . . is a Pólya sequence on (Y, G) with parameter β = αφ−1 .
Solution: We know that Q(Xn+1 ∈ A|X1, X2, . . . , Xn) = (α(A) + ∑_{i=1}^n δXi(A))/(α(X) + n). We
need to show that

Q(Yn+1 ∈ B|Y1, Y2, . . . , Yn) = (αφ⁻¹(B) + ∑_{i=1}^n δYi(B))/(αφ⁻¹(Y) + n).        (16.7)
Let B ∈ G. Then

Q(Yn+1 ∈ B|X1, X2, . . . , Xn) = Q(φ(Xn+1) ∈ B|X1, . . . , Xn)
 = Q(Xn+1 ∈ φ⁻¹(B)|X1, . . . , Xn)
 = (α(φ⁻¹(B)) + ∑_{i=1}^n δXi(φ⁻¹(B)))/(α(X) + n)
 = (αφ⁻¹(B) + ∑_{i=1}^n δYi(B))/(αφ⁻¹(Y) + n)

because x ∈ φ⁻¹(B) ↔ y = φ(x) ∈ B. Since this conditional probability is a
function of Y1, Y2, . . . , Yn, (16.7) is true.
In the above we have used the well known answer to the question of how
to obtain Q(W ∈ A|Y ) from Q(W ∈ A|X, Y ):
The conditional probability Q(W ∈ A|Y) is a function which is σ(Y)-measurable and satisfies

∫ Q(W ∈ A|Y) I(Y ∈ C) dQ = ∫ I(W ∈ A) I(Y ∈ C) dQ        (16.8)

for all C ∈ σ(Y). We know that Q(W ∈ A|X, Y) satisfies

∫ Q(W ∈ A|X, Y) I(X ∈ D) I(Y ∈ C) dQ = Q(W ∈ A, X ∈ D, Y ∈ C)        (16.9)

for all appropriate C, D, and in particular when D is the whole space.
This means that Q(W ∈ A|X, Y) satisfies (16.8), but it may not be σ(Y)-measurable. If it were σ(Y)-measurable (i.e. a measurable function of Y) then
Q(W ∈ A|Y) = Q(W ∈ A|X, Y).
If Q(W ∈ A|X, Y) is not σ(Y)-measurable, we put equations (16.8) and
(16.9) together (with D equal to the whole space) and get

Q(W ∈ A, Y ∈ C) = ∫ Q(W ∈ A|X, Y) I(Y ∈ C) dQ = ∫ Q(W ∈ A|Y) I(Y ∈ C) dQ

for all C ∈ σ(Y). Thus, in this case,

Q(W ∈ A|Y) = E(Q(W ∈ A|X, Y)|Y).
17 Support of Dirichlet measures
In the last class, we proved that in a Pólya sequence with parameter α, the
distribution of the random probability measure P , which is the limit of the
empirical measure Fn , is the Dirichlet distribution Dα . We also showed that
Dα ({P : P is discrete}) = 1. So, what is the support of Dα ? The support
of a probability measure is the smallest closed set with probability 1. The
set {P : P is discrete } is not a closed set. Its closure is P, the set of all
probability measures. So we must do some more work to find the support of
Dα . We will do this later.
18 Another way to introduce random probability measures
We will look for another way to introduce probability measures on P.
Consider X = (0, 1]. Every point in X can be written as x → z =
(z1, z2, . . . ) in a unique way, where

z1 = 1 if 1/2 < x ≤ 1, z1 = 0 if x ≤ 1/2, . . . , zn = ⌊2ⁿx⌋ mod 2, n = 1, 2, . . . .
As an example, when x = 7/8 you can check z1 (x) = 1, z2 (x) = 1.
When x = 1/2, one can take z1 (x) = 0, z2 (x) = 1, z3 (x) = 1, . . . or as
z1 (x) = 1, z2 (x) = 0, z3 (x) = 0, . . . . To make things unique, we will use only
the former representation, namely that we will allow a recurring 1 but not a
recurring 0.
This means that we have a transformation from (0, 1] into Z ∞ where
Z = {0, 1}. The transformation is 1 ↔ 1 if we remove sequences in Z ∞ with
recurring 0’s. Thus to define a pm on (0, 1] it is enough to define a pm on
Z ∞.
How to define a pm on Z ∞ ?
The singletons of Z n , which are zn = (z1 , . . . , zn ), zi = 0, 1, i = 1, . . . , n,
can also be viewed as rectangle sets in Z ∞ . We will therefore allow zn to
denote both the first n coordinates of z and also all sequences in Z ∞ whose
first n coordinates agree with zn . The product σ-field in Z ∞ is the smallest
σ-field containing all these rectangle sets.
A probability function p defined on rectangle sets satisfies the consistency
condition if
p(zn ) = p(zn 1) + p(zn 0)
for all zn ∈ Z n , n = 1, 2, . . . , where zn 1 means that the (n + 1)th coordinate
is 1. For instance, p(1) = p(10) + p(11). Such a consistent probability
function defines a unique probability measure on Z ∞ by the Kolmogorov
consistency theorem. One can also use the idea of compact fields to arrive
at this conclusion.
Here is a way to define a consistent probability function on rectangle sets.
Let ue = p(1), 1 − ue = p(0), u1 = p(11)/p(1), 1 − u1 = p(10)/p(1), u0 = p(01)/p(0),
1 − u0 = p(00)/p(0), . . . , uxn = p(xn 1)/p(xn) for xn ∈ Zⁿ, n = e, 1, 2, . . . . Conversely,

p(1) = ue,
p(0) = 1 − ue,
p(11) = u1 ue,
p(10) = (1 − u1) ue,
. . .
p(xn) = ∏_{r=1}^{n} (ux(r−1))^{xr}

for xn ∈ Zⁿ, where we define a¹ = a, a⁰ = 1 − a and x0 = e.
Let U ∞ = Ue × U1 × · · · × Un × . . . , where Ue = {ue }, Un = {uxn : xn ∈
Z n }, n = 1, 2, . . . . Then there is a one to one transformation between all
probability measures P = P(Z ∞ ) and U.
Thus a pm on U ∞ gives rise to a pm on P. Since U ∞ is a product space,
the first task will be easier.
Let λ be the distribution of u taking values in U ∞ . This gives rise to a
distribution ν of pm P taking values in P(Z ∞ ). Let z1 , z2 , . . . , zn be random
variables in Z ∞ , whose distribution given P is i.i.d P , and P has the distri-
bution ν. We want the conditional distribution ν ∗ of P given z1 , z2 , . . . , zn .
This ν ∗ will arise from the conditional distribution λ∗ of u given z1 , z2 , . . . , zn .
This can be obtained by looking at L(u|z1k, z2k, . . . , znk), where zik denotes the first
k coordinates of zi, i = 1, . . . , n. We will do this in the next class.
Feb 12, 2008
We begin with a quick review of what we did in the previous class. Let
X = (0, 1]. There is a one-to-one transformation x ↔ z from X to Z ∞ =
{0, 1}∞, the space of sequences of 0's and 1's (minus a few points like sequences with recurring 0's). The space of probability measures P ∗ = P(X ) on
X is in a one-to-one relation to the space of probability measures P = P(Z ∞ )
on Z ∞ . Thus our nonparametric Bayes problem is the problem of finding
probability measures ν on P. Let us explore the nature of an element P ∈ P.
Such a P is completely defined by the probabilities it gives to rectangle
sets in Z ∞ , i.e. by a specification of
{P (zn ), zn ∈ Z n , n = 1, 2, . . . }
(where zn = (z1 , . . . , zn ) stands for both a point in Z n and the corresponding
cylinder set in Z ∞ ) satisfying the consistency conditions
P (zn ) = P (zn 1) + P (zn 0), n = 1, 2, . . .
We can now define constants u = (ue , u1 , . . . , un , . . . ) with ue = ue , u1 =
(u1 , u0), . . . un = (uxn , xn ∈ Z n ), . . . (for convenience we will set x0 = e, Z 0 =
{e}, with e standing for “empty”) as follows:
ue = P(1),
u1 = P(11)/P(1) = P(z2 = 1|z1 = 1),
u0 = P(01)/P(0) = P(z2 = 1|z1 = 0),
. . .
uxn = P(zn+1 = 1|zn = xn) for xn ∈ Zⁿ, n = 1, 2, . . . ,
where one can choose to define the ratio 0/0 as equal to 1.
Conversely, given constants u = ({uxn , xn ∈ Z n }, n = 0, 1, . . . ) in [0, 1],
one can recover the probability measure P by its probability on cylinder sets
as follows:
P(xn) = ue^{x1} ux1^{x2} · · · uxn−1^{xn}, n = 1, 2, . . .

(where a¹ = a, a⁰ = 1 − a). Thus there is a one-to-one map from P to U =
Ue × U1 × U2 × · · · = [0, 1]∞. Note also that Un = [0, 1]^{2ⁿ}, n = 0, 1, . . . . It will
be easy to define a probability measure on U since it is just a product space.
Each such probability measure λ will correspond to a probability measure
ν on P. This will complete our program to introduce probability measures
on P for nonparametric Bayesian analysis. We will still have questions on
posterior distributions, Bayes estimates, etc.
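The passage from u back to P is mechanical. The following sketch is ours (the dictionary representation of u and the "fair coin" example are implementation choices): it computes cylinder probabilities P(xn) from a given u and checks the consistency condition.

# Sketch (illustration only): recovering cylinder probabilities P(x_n) from the coordinates u.
def cylinder_prob(u, bits):
    """P(x_n) = u_e^{x_1} u_{x_1}^{x_2} ... u_{x_{n-1}}^{x_n}; u maps a bit-string prefix to u_prefix."""
    p, prefix = 1.0, ""
    for b in bits:
        p *= u[prefix] if b == 1 else 1.0 - u[prefix]
        prefix += str(b)
    return p

# Fair coin tossing (Lebesgue measure on (0,1]): every u_x = 1/2.
u_fair = {"": 0.5, "0": 0.5, "1": 0.5, "00": 0.5, "01": 0.5, "10": 0.5, "11": 0.5}
print(cylinder_prob(u_fair, [1, 1, 0]))        # = 1/8
# Consistency: P(x_n) = P(x_n 1) + P(x_n 0) holds by construction.
print(cylinder_prob(u_fair, [1]), cylinder_prob(u_fair, [1, 1]) + cylinder_prob(u_fair, [1, 0]))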
Thus we have the following picture.

Basic Space X = (0, 1], x   ↔   Space of 0's & 1's Z∞ = {0, 1}∞, z = (z1, z2, . . . )

Space of pm's on X        Space of pm's on Z∞        Space of Cond. Prob's
P∗ = P(X)                 P = P(Z∞)                  U = Ue × U1 × · · · = [0, 1]∞
P∗                   ↔    P                     ↔    u = (ue, u1, . . . )
                          P(xn)                 ↔    ∏_{j=0}^{n−1} uxj^{xj+1}
ν∗                   ↔    ν                     ↔    λ
Dα                   ↔    Dα                    ↔    λα
Let λ be a pm on U corresponding to a pm ν on P. Let the prior
distribution of the unknown pm P be ν i.e. let P ∼ ν. This P corresponds
to a rv U on U which has distribution λ. Conditional on P (i.e. on U), let
the observations Y 1 , Y 2 , . . . , Y n be i.i.d P .
We will now find the conditional distribution of P given Y 1 , Y 2 , . . . , Y n .
We will do this by first finding the conditional distribution of (Ue , U1 , . . . , Um )
given Y 1 , Y 2 , . . . , Y n for all m. By the Kolmogorov’s consistency theorem, this leads us to the conditional distribution of the sequence U given
Y 1 , Y 2, . . . , Y n , which is the posterior distribution.
To find the conditional distribution of (Ue, U1, . . . , Um)
given (Y 1, Y 2, . . . , Y n), we will find the conditional distribution of
(Ue , U1 , . . . , Um ) given (Yki = (Y1i , . . . , Yki ), 1 ≤ i ≤ n) for each k. This
can be done with the joint distribution of just a finite number of random
variables. We then will first allow m → ∞ and then let k → ∞ to obtain
the posterior distribution.
We now proceed to do the calculations, where we will assume that
(Ue, U1, . . . , Um) has a joint pdf q(ue, u1, . . . , um). With Q standing for
the joint distribution of all the random variables, we have

Q(Yn = yn|P) = Q(Yn = yn|U) = ue^{y1} uy1^{y2} · · · uyn−1^{yn},

where, as before, we use the notation a¹ = a and a⁰ = 1 − a.
Let

N(e) = n,
N(1) = ∑_{i=1}^n I(Y1^i = 1),
N(0) = ∑_{i=1}^n I(Y1^i = 0),
N(11) = ∑_{i=1}^n I(Y2^i = (1, 1)), etc.,
N(yr) = ∑_{i=1}^n I(Yr^i = yr) for yr ∈ Zʳ, r = 0, 1, . . . .
The function N(yr ) is just the frequency of yr in the sample. With this
notation,
Q(Yk^i = yk^i, 1 ≤ i ≤ n|Ue, U1, . . . , Um) = ∏_{xr∈Zʳ, 0≤r≤k−1} uxr^{N(xr1)} (1 − uxr)^{N(xr0)},

where we have taken m to be larger than k.
The joint distribution of ((Yk^i, 1 ≤ i ≤ n), (Uj, 0 ≤ j ≤ m)) is

Q((Yk^i = yk^i, 1 ≤ i ≤ n), (Ur, 0 ≤ r ≤ m)) = ∏_{xr∈Zʳ, 0≤r≤k−1} uxr^{N(xr1)} (1 − uxr)^{N(xr0)} q((ur, 0 ≤ r ≤ m)).
The conditional distribution of (Ur, 0 ≤ r ≤ m) given (Yk^i, 1 ≤ i ≤ n) is proportional to

∏_{xr∈Zʳ, 0≤r≤k−1} uxr^{N(xr1)} (1 − uxr)^{N(xr0)} q((ur, 0 ≤ r ≤ m)).        (18.10)

Let us consider three classes of probability measures on U:
• Λ1 = {λ : Ue , U1 , U2 , U3 . . . are independent under λ}.
• Λ2 = {λ : Uxn , xn ∈ Z n , n = 0, 1, . . . are all independent under λ}.
• Λ3 = {λ : Uxn ≡ Vn , n = 0, 1, . . . , and Ve , V1 , . . . independent under λ}.
It is easy to see that Λ2 ⊂ Λ1 and Λ3 ⊂ Λ1 . We will show that Λ1 is
a conjugate family. The proofs that Λ2 and Λ3 are conjugate families are
similar.
The density of {Uxj, xj ∈ Zʲ, 0 ≤ j ≤ m} under a typical λ ∈ Λ1 is a
product of densities of the form given below:

q((uxj, xj ∈ Zʲ, 0 ≤ j ≤ m)) = ∏_{j=0}^{m} qj((uxj, xj ∈ Zʲ)).        (18.11)

The conditional density of (Ue, . . . , Um) given (Yk¹, . . . , Ykⁿ), the sample
restricted to the first k coordinates, is proportional to the joint density given
in (18.10):

∏_{xr∈Zʳ, 0≤r≤k−1} uxr^{N(xr1)} (1 − uxr)^{N(xr0)} ∏_{j=0}^{m} qj((uxj, xj ∈ Zʲ)).
This is a product of densities similar to the prior density given in (18.11).
Thus the posterior density is similar in form to the density of a pm in Λ1 .
Thus Λ1 is a conjugate family.
We will now look at a particular λ = λα ∈ Λ2, where α is a non-zero finite
measure on Z∞. The pm λα is defined as follows: under λα,

Uxj are independent        (18.12)

and

Uxj ∼ B(α(xj1), α(xj0))        (18.13)

as xj varies over Zʲ, j = 0, 1, . . .
As before, the conditional distribution of {Uxj, xj ∈ Zʲ, 0 ≤ j ≤ m}
given (Yk¹, . . . , Ykⁿ) has density proportional to

∏_{xr∈Zʳ, 0≤r≤k−1} uxr^{N(xr1)} (1 − uxr)^{N(xr0)} ∏_{xr∈Zʳ, 0≤r≤m} uxr^{α(xr1)−1} (1 − uxr)^{α(xr0)−1}.

Under this posterior distribution, the random variables {Uxj, xj ∈ Zʲ, 0 ≤ j ≤ k − 1} are independent and

Uxr ∼ B(α(xr1) + N(xr1), α(xr0) + N(xr0)),   xr ∈ Zʳ, 0 ≤ r ≤ k − 1.
This means that U has the distribution λα+N (·) .
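The conjugate update in Λ2 is easy to carry out on a finite tree. The sketch below is ours: the depth, the choice α(x1) = α(x0) = α(x)/2 (a "fair coin" base measure with total mass 2), and the toy data are illustrative assumptions.

# Sketch (illustration only): the conjugate update (18.12)-(18.13) on a binary tree of depth 3.
import numpy as np

depth = 3
# alpha(x) for every string x of length <= depth; here alpha splits mass evenly at each node.
alpha = {"": 2.0}
for d in range(depth):
    for s in [np.binary_repr(i, d) if d > 0 else "" for i in range(2**d)]:
        alpha[s + "1"] = alpha[s] / 2.0
        alpha[s + "0"] = alpha[s] / 2.0

# Observations: n binary strings of length `depth` (the first coordinates Y_k^i of the data).
data = ["110", "101", "100", "111", "001"]
N = {s: sum(y.startswith(s) for y in data) for s in alpha}   # N(s) = frequency of prefix s

# Posterior: U_x ~ Beta(alpha(x1) + N(x1), alpha(x0) + N(x0)) independently, for |x| < depth.
for x in sorted(a for a in alpha if len(a) < depth):
    a1, a0 = alpha[x + "1"] + N[x + "1"], alpha[x + "0"] + N[x + "0"]
    print(f"U_{x or 'e'} | data ~ Beta({a1}, {a0}); posterior mean {a1 / (a1 + a0):.3f}")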
We will now identify λα defined in (18.12) and (18.13) with the Dirichlet
pm Dα on P(Z ∞ ).
Note that P (1) = Ue . Thus P (1) ∼ B(α(1), α(0)) from the distribution
of Ue . One can use independent Gamma random variables to model the joint
distribution of (Ue , (U1 , U0 )) as follows. Let Z(11), Z(10), Z(01), Z(00) be independent Gamma random variables with parameters α(11), α(10), α(01), α(00),
respectively. Define Z(1) = Z(11) + Z(10), Z(0) = Z(01) + Z(00), Z(e) =
Z(1) + Z(0). Then Ue = Z(1)/Z(e), U1 = Z(11)/Z(1), U0 = Z(01)/Z(0)
are independent random variables with the Beta distributions specified in
(18.12) and (18.13) (proved by a repeated use Basu’s Theorem). Thus the
probabilities on the partition of Z ∞ defined by the two-dimensional cylinder
sets ((11), (10), (01), (00)) given by
(P (11), P (10), P (01), P (00)) = (Ue1 U11 , Ue1 U10 , Ue0 U01 , Ue0 U00 )
52
is equal to (Z(11), Z(10), Z(01), Z(00))/Z(e) which has a Dirichlet distribution with parameters (α(11), α(10), α(01), α(00)). Similarly, the probabilities
(P (xm ), xm ∈ Z m ) are distributed as D(α(xm ), xm ∈ Z m ) for m = 1, 2, . . .
This is the Dirichlet pm on P(Z ∞ ).
This is the third definition of the Dirichlet pm on P. Since the posterior
distribution expressed as the distribution of U was λα+N (·) , it follows that
the posterior distribution of P in Dα+N (·) .
53
February 14, 2008
19
Examples and properties of pm’s on Z ∞
Let δ be the fair coin tossing probability measure on Z ∞ = {0, 1}∞ , i.e.
Under δ, (z1 , z2 , . . . ) are i.i.d with δ(zi = 1) = 1/2, δ(zi = 0) = 1/2, i =
1, 2, . . . This corresponds to the Lebesgue measure δ ∗ on the space X = (0, 1],
since δ(z1 = 0) = 1/2 translates to δ ∗ ((0, 1/2]) = 1/2, δ(zi = 1) = 1/2
translates to δ ∗ ((1/2, 1]) = 1/2, etc. Under δ ∗ the measure of an interval
with diadic rational end points is the length of that interval. This means
that δ ∗ is the Lebesgue measure on (0, 1] or the uniform distribution on
(0, 1].
What the a point in U which corresponds to this δ?
Notice that ue = P (1) = 12 , P (11) = 14 , u1 =
P (11)
P (1)
= 21 , u0 = 12 , . . . . Thus
the point corresponding to δ is u = (1/2, (1/2, 1/2), . . . ).
20
The set of all discrete pm’s P
We will give a necessary and sufficient condition for a pm P ∈ P(Z ∞ ) to be
discrete.
Fix a P ∈ P(Z ∞ ). This can be considered as the distribution of random
variables (Y1 , Y2 , . . . ), the coordinates in Z ∞ . It can also be considered as
the vector u ∈ U which corresponds to P .
Theorem 20.1. The pm P is discrete if and only if P (E) = 1 where E is a
54
countable set defined by
∞
E = ∪∞
m=1 ∩n=m Dn where Dn = {yn = I(uyn−1 )}, n = 1, 2, . . .
(20.14)
A sufficient condition for P to be discrete is
yj
X
∈Z j ,1≤j<∞
n−1 yj
×j=1
uyj−1 uyn−1 (1 − uyn−1 ) < ∞.
(20.15)
A pm ν on P gives measure 1 to the collection of all discrete pm’s if
Eν
yj
X
∈Z j ,1≤j<∞
n−1 yj
×j=1
uyj−1 uyn−1 (1 − uyn−1 ) < ∞.
(20.16)
The random pm P is discrete with probability 1 under the Dirichlet pm να .
Proof: Let y be a point in Z ∞ for which P (y) > 0. This is equivalent to
P (y) = uye1 uyy21 uyy32 . . . > 0
which implies the following successively weaker conditions:
uyynn−1 → 1 as n → ∞
1
uyynn−1 ≥ for all large n
2
1
yn = I(uyn−1 ≥ ) for all large n
2
Dn occurs for all large n
y∈E
with Dn and E as defined in the statement of the theorem. Thus
{y : P (y) > 0} ⊂ {y ∈ E}.
55
m−1
The set ∩∞
points, since once ym−1 is fixed,
m=n Dm contains at most 2
the yk ’s for k ≥ m are uniquely determined. This implies that E is a countable set.
Note that P is discrete is equivalent to P (y : P (y) > 0}) = 1 and this
implies P (E) = 1. Conversely, since E is countable, P (E) = 1 implies that
P is discrete. This completes the proof of (20.14).
We will explore the condition P (E) = 1 further. Note that
E = {Dnc occurs only finitely often }
∞
X
I(Dnc ) < ∞}.
= {
1
P
c
If we want to show that P (E) = 1, it is enough to show that P ( ∞
1 I(Dn ) <
P
c
∞) = 1 or show that the stronger condition ∞
1 P (Dn ) < ∞ holds.
Now
P (Dnc ) = EP (P (Dnc |yn−1 ))
= EP (E(yn = I(uyn−1 ≤ 1/2)|yn−1))
= EP (min(uyn−1 , 1 − uyn−1 )).
≤ 2EP (uyn−1 (1 − uyn−1 ))
X
yj
×1n−1 uj−1
uyn−1 (1 − uyn−1 ).
= 2
yn−1 ∈Z n−1
since
u(1 − u) ≤ min(u, 1 − u) ≤ 2u(1 − u).
Hence P (E) = 1 if
X
yj ∈Z j ,1≤j<∞
y
j
uyn−1 (1 − uyn−1 ) < ∞.
×1n−1 uj−1
56
This establishes the assertion in (20.15).
Consider the Dirichlet pm να on P arising from the pm λα on U, under
which all the uyn independent with the Beta distributions B(α(yn 1, α(yn 0)).
We now verify that the series in (20.16) is finite with probability 1 under να :
X
Eνα
yn−1 ∈Z n−1 ,1≤n<∞
=
n−1
×j=1
X
α(yn 1)α(yn 0)
α(e)
yn−1 ∈Z n−1 ,1≤n<∞
≤ α(e)
yn−1
X
α(yj )
α(yn 1)α(yn 0)
α(yj−1 ) α(yn−1 )(α(yn−1) + 1)
X
yn−1 ∈Z n−1 ,1≤n<∞
=
n−1 yj
×j=1
uyj−1 uyn−1 (1 − uyn−1 )
ᾱ(yn 1)ᾱ(yn 0)
∈Z n−1 ,1≤n<∞
= α(e)ᾱ(Y 1 6= Y 2 )
< ∞,
where ᾱ(·) =
α(·)
α(e)
is the normalized pm of α and Y 1 , Y 2 are i.i.d. with pm ᾱ
on Z ∞ .
Thus condition (20.16) is verified and P is discrete with probability 1
under να . This completes the proof of the theorem.
21
Absolute continuous pm’s P wrt Lebesgue
measure
We already saw that the fair coin tossing measure δ corresponds to the
Lebesgue measure δ ∗ on (0, 1]. Thus we are interested in finding conditions
for P to be absolutely continuous wrt δ.
57
Define
fn (x) =
P (xn )
= 2n P (xn )
δ(xn )
be the Radon-Nikodym derivative of P wrt δ on Z n , n = 1, 2, . . . . From
standard results, we know that fn (x) → f (x), and that P is absolutely
R
continuous wrt δ if and only if f (x)dδ = 1. In this case, dP
= f (x).
dδ
Recall that we P can also be viewed as u = (ue , u1 , . . . ). Notice that
P (A) = P (1)P1(A) + P (0)P0 (A) where Px (·) is the conditional distribution
of P on Z2 × Z3 × · conditional on Y1 = x. We can also write this as
P = ue Pue (A) + u0e Pu0e (A)
where for any yn , Puyn (·) denotes the pm arising from the tail of the sequence
u whose initial segment is uyn .
It is easy to see that P is absolutely continuous wrt δ if and only if Pue
and Pu0e are absolutely continuous wrt δ. This depends only (U2 , U3 , . . . ),
and by extension, this is a tail event for the sequence (Ue , U1 , . . . ).
Let λ is a p.m. on P. The probabilities of such tail events under λ are 0
or 1. Hence
λ({P << δ}) = 0 or
We will develop this further in the next class.
58
1.
Feb 19, 2008
22
Is the random P absolutely continuous,
singular or discrete?
Review from the previous class:
Basic Space
Space of 0’s & 1’s
X = (0, 1]
↔
Z ∞ = {0, 1}∞
x
↔
z = (z1 , z2 , . . . )
Space of pm’s
P ∗ = P(X )
P∗
Space of pm’s
Space of Cond. Prob’s
↔
P = P(Z ∞ )
U = Ue × U1 × · · · = [0, 1]∞
↔
P
↔
u = (ue , u1 , . . . )
P (xn )
x
n−1
×j=0
uxjj−1
ν∗
↔
ν
↔
λ
Dα
↔
Dα
↔
λα
Conditional on P (i.e.
on U), let the observations Y 1 , Y 2 , . . . , Y n
be i.i.d P .
We have
P (Yn = yn ) = uye1 uyy21 · · · uyynn−1
where, as before, we use the notation a1 = a and a0 = 1 − a.
We also considered the following three classes of probability measures on
U:
• Λ1 = {λ : Ue , U1 , U2 , U3 . . . are independent under λ}.
59
• Λ2 = {λ : Uxn , xn ∈ Z n , n = 0, 1, . . . are all independent under λ}.
• Λ3 = {λ : Uxn ≡ Vn , n = 0, 1, . . . , and Ve , V1 , . . . are independent under λ}.
We also defined a pm λα in Λ2 under which uyn ∼ B(α(yn 1), α(yn 0)).
Under this pm λα ,
{P (yn) = uye1 uyy21 · · · uyynn−1 } ∼ D(α(yn ), yn ∈ Z n ).
Thus, λα corresponds to the Dirichlet pm να on P(Z ∞ ).
We also showed that να (P is discrete ) = 1.
We can view P as u = (ue , u1 , . . . ) and so we can write P = Pu . Again,
we can write
P (A) = P (1)P1 (A) + P (0)P0(A) = ue Pu1 + u0e Pu0
and more generally,
Pu =
X
uye1 Puy1
y=0,1
=
X
yn ∈Zn
uye1 uyy21 · · · uyynn−1 Puyn
where uyn = (uyn , (uyny1 , y1 ∈ Z), (uyn y1 y2 , (y1 , y2) ∈ Z 2 ), . . . ).
Let δ be a fair coin tossing measure, and δ(yn ) =
1
, yn
2n
∈ Z n, n =
1, 2, . . . . We know that correspondingδ ∗ in (0, 1] is Lebesgue measure and
also that the corresponding uyn = 21 , yn ∈ Z n , n = 1, 2, . . . .
We will examine what it means to say that P << δ, i.e.P is absolutely
continuous wrt δ.
One says that P is absolutely continuous wrt δ if δ(A) = 0 implies P (A) =
0. This is equivalent to saying that P1 and P0 are absolutely continuous
60
wrt δ. Note that P (A) = P (1)P1 (A) + P (0)P0(A) where the conditional
distributions P1 and P0 are intuitively clear. Thus P << δ implies that
P1 << δ and P0 << δ, since δ(A) = 0 ⇒ P (A) = 0 which implies P1 (A) = 0
and P0 (A) = 0. Trivially, conversely P1 << δ and P0 << δ imply P << δ.
Similarly, P << δ if and only if Puyn << δ for all yn ∈ Z n . Thus the event
P << δ expressed in term of u depends only (Un , Un+1 , . . . ), for n = 1, 2, . . . ,
and hence this event is a tail event for the sequence (Ue , U1 , . . . ).
Suppose that P ∼ λ, where λ ∈ Λ1 . Then by the Kolmogorov 0-1 law
λ(P << δ) = 0 or 1.
Note: λ(P singular wrt δ) = 0 or 1 is same with λ(P << δ)0 or 1.
Define
fn (y) =
P (yn )
= 2n P (yn ) = 2n uye1 uyy21 · · · uyynn−1 .
δ(yn )
From standard theory, we know that, under δ, fn (y) is a martingale and
converges to f (y) with probability 1. We also know that
Z
P << δ ⇔ {fn }u.i. ⇔ f (y)dδ(y) = 1.
p
p
Now we consider fn (y) which converges to f (y).
p
If limn Eδ ( fn (y)) = 0, then f (y) = 0 and P ⊥ δ, i.e. P and δ are
supported on disjoint sets.
p
If limn Eδ ( fn (y)) > 0, then P is not singular wrt δ. If we knew a priori
p
that P is such that P << δ or P is singular wrt δ, then limn Eδ ( fn (y)) > 0
implies that P << δ.
61
We compute
v
v
u n
u n
uY
uY yr
Eδ (t 2uyr−1 ) = Eδ (Eδ (t 2uyyrr−1 )|yn−1 )
r=1
r=1
v
un−1
q
uY y
r
t
= Eδ (
2uyr−1 Eδ ( 2uyynn−1 |yn−1 ))
r=1
v
un−1
q
uY y
1
1p
= Eδ (t
2uyrr−1 (
2uyn−1 +
2(1 − uyn−1 ))).
2
2
r=1
Define
Hn =
sup
yn−1 ∈Z n−1
1
2
1p
1
2uyn−1 +
2
2
q
2(1 − uyn−1 ).
+ θ. Then
1√
1p
2u +
2(1 − u)
2
2
1√
1√
1 + 2θ +
1 − 2θ
=
2
2
1
2θ 1
1
2θ 1
=
(1 +
− (2θ)2 + · · · ) + (1 −
− (2θ)2 + · · · )
2
2
8
2
2
8
2
∼ 1 − cθ if θ ∼ 0.
Let u =
Thus
Hn ≤ 1 − cθn2
where θn = max|uyn−1 − 21 |, and
v
u n
uY
Eδ (t 2uyyrr−1 ) ≤ H1 H2 · · · Hn
r=1
n
Y
(1 − cθr2 )
≤
r=1
→ 0 if
→
X
θr2 = ∞
a limit which is > 0 if
62
X
θr2 < ∞.
One can also look at a special case where uyn ≡ Vn for all yn ∈ Z n , n =
0, 1, . . . , in which case, we get the estimate
v
u n
n
Y
uY yr
∗ ∗
∗
∗ 2
t
Eδ (
2uyr−1 ) = H1 H2 · · · Hn =
(1 − c(θr−1
))
r=1
r=1
p
√
where Hr∗ = 2 2Vr + 2 2(1 − Vr ) and θr∗ = |Vr − 12 |, r = 1, 2, . . . .
Example 22.1. Let λ ∈ Λ3 . Then Un = {uyn , yn ∈ Zn } ≡ Vn and V0 , V1 , . . .
are independent. Suppose further that
Vn ∼ B(an + rn , an + sn )
with |rn | ≤ K < ∞, |sn | ≤ K < ∞ and an → ∞. Then
an + rn
2an + rn + sn
rn − sn
1
1
∼
E(Vn − ) =
2
2(2an + rn + sn )
an
1
(an + rn )(an + sn )
∼
V (Vn ) =
2
(2an + rn + sn ) (2an + rn + sn + 1)
an
1
1
E(Vn − )2 ∼ ,
2
an
E(Vn ) =
and
X
X 1
1
1
E(Vn − )2 < ∞ ⇔
< ∞.
(Vn − )2 < ∞ ⇔
2
2
an
P 1
Therefore, when an additional condition
< ∞ holds, we obtain λ(P <<
an
X
δ) = 1.
Example 22.2. Consider λ ∈ Λ3 with Vn , n = 0, 1, . . . i.i.d. B(1, 1) i.e. with
a has a uniform distribution.
Then
X
1
(Vn − )2 = ∞
2
63
with probability 1 under λ. Therefore,
λ(P ⊥ δ) = 1
P is singular. We can actually show that it is singular continuous.
23
Other mappings from R to Z ∞
Originally, we took X to be (0, 1]. How about X = (0, 1) or X is real line?
Let B0 = (B0 , B1 ) be a two-set partition of the real line. We partition B0 into
two sets B01 ,B00 and partition B1 into two sets B10 ,B11 . We then have the
partition B2 = (B11 , B10 , B01 , B00 ) which is a subpartition of B1 . Continuing
this way, we will get partitions B1 , B2 , . . . of the real line.
Byn 1 ∪ Byn 0 = Byn .
Then we can map the real line to Z ∞ as follows: x ∈ y with y1 (x) = 1 if
x ∈ B1 , y1 (x) = 0 if x ∈ B0 . y2 (x) = 1 if x ∈ B11 ∪ B01 , y2 (x) = 0 if
x ∈ B01 ∪ B00 , etc.
As an example, let F be a distribution function on real line. We can
choose the the partitions as follows:
1
B0 = (−∞, F −1( )]
2
1
B1 = (F −1( ), ∞]
2
1
B00 = (−∞, F −1 ( )]
4
1
1
B01 = (F −1 ( ), F −1 ( )]
4
2
etc.
64
In this case the fair coin tossing measure δ on Z ∞ corresponds to the
distribution F of the real line. From our earlier results, we can now choose
prior distributions λ which picks pm’s absolutely continuous wrt F .
65
Feb 21, 2008
24
Support of Dirichlet pm’s
Let us go back to our unknown parameter P in the nonparametric problem.
Let α be a non-zero finite measure on real line and let Dα be the prior
distribution for P . Let D∗ = {P : P is discrete }. We know that Dα (D ∗ ) =
1. Is D ∗ the support of Dα ?
Consider N(0, 1).
Let Iir be the set all irrationals on R.
Then N(0, 1)(Iir ) = 1. Can we say that the Normal distribution sits just
on the irrationals? We can remove a countable number of irrationals, and
the Normal distribution will give probability to the new set. Is there a smallest set which has probability 1. The answer is no. However, if we define the
support of the normal distribution to be the smallest closed set with probability 1, then the set of irrationals is not closed and its closure is the whole
real line R. Any closed set strictly included in R will have probability less
than 1, and thus the support of N(0, 1) is R.
Definition 24.1. Support: Let µ be a pm on R. A set K is called the
support of µ, if
1 K is closed.
2 µ(K) = 1.
3 K is the smallest such closed set, i.e. if K ∗ is closed and K ∗ ⊂ K then
µ(K ∗ ) < 1.
66
In the above definition one can replace 3 above by either of the conditions
below
3a If U is open and U ⊂ K, then µ(U) > 0.
3b If x ∈ K and N(x, ǫ) ⊂ K for some ǫ > 0, then µ(N(x, ǫ)) > 0. (Here
N(x, ǫ) = {y : |y − x| < ǫ}).
The support of the Poisson distribution with parameter λ > 0 is the set
of non-negative integers.
Coming back to the Dirichlet pm Dα , we know that Dα (D ∗ ) = 1. How-
ever, since D ∗ is not closed (under weak convergence), it is not the support
of Dα . So the question of the support of Dα is still unanswered.
Let us examine open balls in P. We know that a sequence Pn converges
to P if and only if
Z
f dPn →
Z
f dP
for all bounded continuous function f vanishing outside a compact set of R.
Consider the open set defined by f :
Z
Z
∗
N(P ; f, ǫ) = {P : | f dP − f dP ∗ | < ǫ}
Finite intersections of such sets form a base for open sets in the weak topology
in P.
Given δ > 0, we can find a0 < . . . < aT , such that f (x) = 0 outside
[a0 , aT ] and |f (x) − f (ai )| < δ for ai ≤ x ≤ ai+1 , i = 0, . . . , T − 1.
Let Ai = (ai , ai+1 ]. If |P (Ai) − P ∗ (Ai )| < θ, i = 0, . . . , T − 1, then
Z
Z
| f dP − f dP ∗| < T cθ + δ
67
where c = supx |f (x)|, and
∗
{P : |P (Ai ) − P (Ai )| < θ, i = 0, . . . , T − 1} ⊂ {P : |
Z
f dP −
Z
f dP ∗ | < ǫ}
if T cθ + δ < ǫ. Similarly, we we can find a set of the form {P : |P (Ai ) −
P ∗ (Ai )| < θ, i = 0, . . . , T − 1} contained in an open set defined by the finite
intersection of the form N(P ∗ ; f, ǫ).
Let ᾱ(A) =
α(A)
,
α(X)
then α(A) = ᾱ(A)α(X). Let M be the support of ᾱ.
Consider M ∗ defined by
M ∗ = {P : support of P ∈ M} = {P : P (M) = 1}.
Theorem 24.1. The support for Dα is M ∗ .
Proof. Let Pn ∈ M ∗ and let Pn → P . Then P (M) ≤ lim supn Pn (M) = 1,
which implies that P ∈ M ∗ . Thus M ∗ is closed.
We will show that Dα (M ∗ ) = 1. Under Dα , P (M) ∼ B(α(M), α(M c )).
We know that α(M) = α(R)ᾱ(M) = α(M) and thus α(M c ) = 0. Thus
P (M) ∼ B(α(M), 0), which means that P (M) ≡ 1 with probability 1 under
Dα , and Dα {P : P (M) = 1} = 1 or Dα (M ∗ ) = 1.
We will show that M ∗ is the smallest such closed set in P. Let P ∗ ∈ M.
Consider N(P ∗ ; f, ǫ) ⊂ M ∗ for some bounded continuous function f which
vanishes outside a bounded set in R. There exist θ > 0 and δ > 0, such that
{P : |P (Ai ) − P ∗ (Ai )| ≤ θ, i = 0, . . . , T − 1}
(24.17)
R
R
which is a subset of {P : | f dP − f dP ∗| < ǫ}. Since both P and P ∗ are
in M ∗ , we can write the set in (24.17) as
{P : |P (Ai ∩ M) − P ∗ (Ai ∩ M)| < θ, i = 0, . . . , T − 1}.
68
(24.18)
Let AT +1 = [a0 , aT ]c . Let I = {i : P (Ai ∩ M) > 0, 1 ≤ i ≤ T + 1}. The
joint distribution of (P (Ai ∩ M), i ∈ I) is finite dimensional Dirichlet with all
parameters positive. Such a distribution gives positive probability all open
sets in the simplex, and in particular to the set of the form (24.18). Hence
P (N(P ∗; f, ǫ)) > 0. This proves that M ∗ is the smallest such closed set.
25
Bayes estimates
Let θ be the unknown state of the nature. A standard statistical problem
is to estimate some desired function g(θ) of θ. Assume that θ has a prior
distribution π. If I claim that ĝ is an estimator of the g(θ), I should use
some loss (or cost) function L(g(θ), ĝ) or the average of this cost function to
evaluate the performance of the estimate. To do the latter, I should choose
R
ĝ to minimize L(g(θ), ĝ)dπ(θ).
R
If L(g(θ), ĝ) = (g(θ) − ĝ)2 and g(θ)2 dπ < ∞, the minimizer is ĝ =
R
g(θ)dπ, the expectation of g(θ) under π. This is the Bayes estimator of
g(θ) based just on the prior distribution before collecting any data.
If we use L(ĝ, g(θ)) = |ĝ − g(θ)|, then expected loss is minimized at ĝ is
the median of the distribution of g(θ) under π.
For more realistic and complicated loss function, we can use a computer
to do this minimization.
Now, back to our nonparametric Bayes problem where P ∼ Dα , with α a
non-zero finite measure, is our prior distribution of the unknown distribution
P.
Example 25.1. Consider the function g(P ) = P ((−∞, t]) = FP (t) (the
69
distribution function at the point t). The Bayes estimator is
F̂P (t) = EDα (P ((−∞, t]))
α((−∞, t])
=
α(X)
= ᾱ(t)
under square error loss function, since P ((−∞, t]) ∼ B(α((−∞, t]), α((t, ∞))).
Here we do not need to check the second moment condition since Fp (t) is a
bounded function.
Suppose that X1 , . . . , Xn |P are i.i.d P and P ∼ Dα . What is the Bayes
estimate of FP (t)? It is just the expectation of FP (t) under the distribution
of P given X1 . . . . , Xn , the posterior distribution. We already know that
P |X1, . . . , Xn ∼ Dα+Pn1 δXi = Dα+nFn . Thus
α((−∞, t]) + nFn (t)
α(X) + n
α(X)ᾱ((−∞, t]) + nFn (t)
=
α(X) + n
F̂P (t) =
R
Question: What is the Bayes estimator of g(P ) =
h(x)dP ?
R
R
If EDα (( h(x)dP )2 ) < ∞, it is the expectation EDα ( h(x)dP ). Do we
have a simple expression for this Bayes estimator? But before that, do we
know that g(P ) is well defined on the whole space?
70
Feb 26, 2008
Let us look at the definition of a random variable. Usually, we say that it
is a measurable mapping X from a probability space (Ω; A, Q) to the real line
R. Note that the function f (x) =
1
x
from (R, B, P ) to R is not well defined
at x = 0, but f (x) will be considered as a random variable if P ({x}) = 0.
Thus random variables need to be well defined on sets of probability 1 under
the underlying probability measure. We can then talk about expectations,
distributions, etc. of such random variables. We can also allow values −∞
and ∞ for a random variable X as long as the probability that X is finite is
1 under P .
How do we define the expectation of X, or more generally of g(X) where
g is a measurable function. Denote the distribution of X by P . This exR
pectation E(g(X)) is defined to be g(x)dP . How do we define the integral
of g? If g is an indicator function, more specifically if g(x) = I(x, A), then
we define E(g) = P (A). Consider a non-negative measurable function g(x).
Define
gn (x) =


i
,
2n
 0,
i
2n
≤ g(x) <
(i+1)
,i
2n
= 0, 1, . . . n2n
g(x) ≥ n
Then gn (x) ≤ g(x) for all n and gn (x) → g(x) as n → ∞, for all x ∈ {x :
g(x) < ∞}. Further more
n
n2
X
i
i
(i + 1)
E(gn (x)) =
P ({ n ≤ x <
})
n
2
2
2n
1
increases with n and its limit (finite or infinite) will be the definition of
E(g(X)). Thus for any non-negative measurable function g(x) we can find
a sequence of simple functions which increase to it and the integral of g is
71
the limit of the integrals of these simple functions. In this sense the integral
of a non-negative random variable always exists. However, we say that g is
integrable only when the integral is finite.
If g is not non-negative, let g = g + − g − where g + , g − are the positive
and negative parts of g. We say that g is integrable if both g + and g − are
integrable. If only one g + and g − is integrable, then we say that the integral
of g exists and it may be +∞ or −∞. Thus E(g) is integrable if and only if
both g − and g + have finite expectations which amounts to saying that |g| is
integrable. Thus we see the following statement in text books:
|E(g(X))| < ∞ ⇔ E(|g(X)|) < ∞.
Let P be the space of probability measures on R. Remember that we
endowed P with a σ-field. We want to consider rv’s taking values on this
space; they will be measurable functions from some probability space into P.
In particular, we would like to consider functions like
Z
Z
φg (P ) = φ(P ) = g(x)dP (x) = lim gn (x)dP (x)
where gn are simple functions converging to g. Clearly, the function φg (P )
is not defined for all P ’s. Thus we can consider φg (P ) as a random variable
under a pm ν if ν({P : φg (P ) is finite }) = 1.
R
Let m(P ) = xdP (x) and consider the Dirichlet pm Dα . Under what
conditions of α is m(P ) a well defined random variable? We will answer this
question later or make it into a project for this class. It can be shown that
Z
m(P ) is a genuine rv if and only if
log(1 + |x|)dᾱ(x) < ∞
72
where ᾱ is the normalized pm arising from α. Or, more generally, we can
state the following theorem:
Theorem 25.1. Let g be a measurable function on R. The random variable
R
φg (P ) = g(x)dP (x) well defined and finite under Dα , i.e. Dα (P : φg (P )
finite) = 1 if and only if
Z
log(1 + g(x)2 )dᾱ < ∞
R
⇔ log(1 ∧ |g(x)|)dᾱ < ∞
R
⇔ |g(X)|>100 log|g(x)|dᾱ < ∞
i.e. log|g(x)| is integrable in the tail under ᾱ.
If we also wish to require that m(P ) be integrable we will need a stronger
condition. In this connection, we can state the following theorem.
Theorem 25.2. A necessary and sufficient condition that EDα (φg (P )) is
R
finite is that |g(x)|dᾱ(x) < ∞.
Proof: We will establish this result when g(x) ≥ 0. The general case will
follow immediately.
We will use the sequence {gn (x)} which approximates g(x) and defined
as previously
gn (x) =


i
,
2n
 0,
i
2n
≤ g(x) <
(i+1)
,i
2n
= 0, 1, . . . n2n − 1
g(x) ≥ n
Suppose that φg (P ) is well defined and EDα (φg (P )) < ∞. Then
Z
Z
φgn (P ) = gn (x)dP (x) ր φg (P ) = g(x)P (x).
By the monotone convergence theorem
EDα (φgn (P )) ր EDα (φg (P ))
73
which, by our assumption, is finite. However,
EDα (φgn (P )) =
X
i=0n2n −1
=
ր
Hence
R
Z
Z
i
i
(i + 1)
ᾱ( n ≤ g(x) <
)
n
2
2
2n
gn (x)dᾱ(x)
g(x)dᾱ(x)
g(x)dᾱ(x) is finite and is equal to EDα (φg (P )).
To prove the converse assertion, assume that EDα (φg (P )) < ∞. This
automatically means that φg (P ) is well defined with Dα probability 1. Since
φgn (P ) ր φg (P ) we have EDα (φgn (P )) ր EDα (φg (P )). We also have
R
R
EDα (φgn (P )) = gn (x)dᾱ(x) ր g(x)dᾱ(x) and this is finite. This com-
pletes the proof of this theorem.
R
Let φg (P ) = gdP , P ∼ Dα . Then under square error loss function, the
R
Bayes estimator φ̂g for φg (P ) under the prior is EDα (φg (P )) = g(x)dᾱ(x),
R
provided that g(x)2 dᾱ)(x) < ∞.
We can derive this result in another way. Let P ∼ Dα and X1 |P ∼ P .
R
Notice that we may write φg (P ) = gdP = EQ (g(X1 )|P ) where Q is the
joint distribution of P and X1 . We can compute this as follows.
Z
Z
EDα ( gdP ) = EQ ( gdP )
= EQ (EQ (g(X1)|P ))
= EQ (g(X1)).
Since Q(X1 ∈ A|P ) = P (A) we have Q(X1 ∈ A) = EQ (P (A)) = ᾱ(A). Thus
Z
EQ (g(X1 )) = g(x)dᾱ
74
provided that
R
|g|dᾱ < ∞.
Thus for the case of the population mean m(P ) =
R
estimate is m̂ = xdᾱ.
R
xdP , the Bayes
Suppose that we have date X1 , . . . , Xn and X1 , X2 , . . . , Xn |P are i.i.d P ,
then the Bayes estimate of the population mean is
R
Z
α(X ) xdα + nX̄n
m̂X1 ,...,Xn = XdαX1 ¯,...,Xn =
α(X ) + n
Where ᾱX1 ,...,Xn =
α(X )ᾱ+nFn
α(X )+n
and Fn is the empirical df of X1 , . . . , Xn .
For the no-sample case, what is the Bayes risk of this estimator?
EDα (m̂ − m(P ))2 = EDα (m2 (P )) − m̂2 .
Let P ∼ Dα and X1 , X2 |P be i.i.d. P . Denote their joint distribution by Q.
Then
Z
g(x)dP
Z
h(x)dP = EQ (g(X1 )h(X2 )|P ).
Therefore
Z
Z
EQ ( gdP hdP ) = EQ (EQ (g(X1)h(X2 )|P )) = EQ (g(X1)h(X2 )).
Also since
Q(X2 ∈ A|X1 ) =
α(A) + δx1 (A)
,
α(X ) + 1
we get
EQ (
Z
gdP
Z
hdP ) = EQ (EQ (g(X1 )h(X2 )|P ))
= EQ (EQ (g(X1 )h(X2 )))
= EQ (EQ (g(X1 )h(X2 )|X1 ))
R
hdα + h(X1 )
= EQ (g(X1)
)
α(X ) + 1
R
R
R
gdα hdα + ghdα
.
=
α(X )(α(X ) + 1)
75
Thus the Bayes risk in estimating m(P ) is
EDα (m̂ − m(P ))2
= EDα (m2 (P )) − m̂2
R
R
Z
α(X )( xdα̂)2 + x2 dα̂
− ( xdα̂)2
=
α(X ) + 1
R 2
R
x dα̂ − ( xdα̂)2
=
.
α(X ) + 1
Now the posterior Bayes risk, EDα+P δX (m̂X1 ,...,Xn − m(P ))2 , after iid
i
observations X1 , . . . , Xn is
1
α(X )Vᾱ + ns2n + α(X )n(Eᾱ − X̄n )2
(
)
α(X ) + 1 + n
(α(X ) + n)2
R
R 2
P
where Eᾱ =
xdᾱ, Vᾱ =
x dᾱ − Eᾱ2 , X̄n = (1/n) n1 Xi , and
P
s2n = (1/n)( n1 Xi2 ) − X̄n2 .
76
Feb 28, 2008
In the last class, we found the Bayes estimator m̂X1 ,...,Xn for m(P ) =
R
xdP based on sample X1 , . . . , Xn and with a prior distribution Dα , i.e
when
• P ∼ Dα .
• X1 , X2 , . . . , Xn |P are i.i.d. P .
R
R
Let us now look at the Bayes estimator for V (P ), where V (P ) = (x −
xdP )2 dP . One can express V (P ) in another way. Let P ∼ Dα and let
X1 , X2 |P ∼ P and let Q be their joint distribution. Note that X1 ∼ ᾱ, and
X2 |X1 ∼
α+δX1
.
α(X )+1
The alternate expression is
Z
Z
(x − xdP )2 dP
Z
1
(x1 − x2 )2 dP (x1 )dP (x2 )
=
2
1
=
EQ ((X1 − X2 )2 |P )
2
V (P ) =
77
provided all the moments exist. We can calculate
=
=
=
=
=
=
=
1
V̂ = EDα (V (P )) = EQ (V (P )) = EQ ( EQ ((X1 − X2 )2 |P ))
2
1
1
EQ (X1 − X2 )2 = EQ [(EQ ((X1 − X2 )2 )|X1 )]
2
2
1
EQ [EQ (X22 |X1 ) − 2X1 EQ (X2 |X1 ) + X12 )]
2
R
R
α(X ) xdᾱ + X1
α(X ) x2 dᾱ + X12
1
EQ [
− 2X1
+ X12 ]
2
α(X ) + 1
α(X ) + 1
R 2
R 2
R
R
R
Z
α(X ) xdᾱ xdᾱ + x2 dᾱ
1 α(X ) x dᾱ + x dᾱ
[
−2
+ x2 dᾱ]
2
α(X ) + 1
α(X ) + 1
R
R
1 (α(X ) + 1 − 2 + α(X ) + 1) x2 dᾱ − 2α(X )( xdᾱ)2
2
(α(X ) + 1)
Z
Z
α(X )
( x2 dᾱ − ( xdᾱ)2 )
α(X ) + 1
α(X )
V (ᾱ).
α(X ) + 1
The Bayes estimator V̂X1 ,...,Xn based on a sample X1 , . . . , Xn , where P ∼
Dα and X1 , . . . , Xn |P are i.i.d. P is given by
α(X )ᾱ + nFn
α(X ) + n
V(
)
α(X ) + n + 1
α(X ) + n
R
α(X ) + n α(X )2V (ᾱ) + n2 s2n + 2α(X )n( xdᾱ − X̄n )2
=
α(X ) + n + 1
(α(X ) + n)2
V̂X1 ,...,Xn =
where X̄n and sn are the sample mean and variance.
If let α(X ) → 0, then
m̂X1 ,...,Xn
R
α(X ) xdᾱ + nX̄n α(X )→0
=
−→ X̄n .
α(X ) + n
Therefore, sample mean is the limit of the Bayes estimator as α(X ) → 0.
Again,
V̂X1 ,...,Xn
n
n 2
1 X
−→
(Xi − X̄)2
sn =
n+1
n+1 1
α(X )→0
78
which is the best equi-variant estimate for the variance.
Now let us look at the median, med(P ) = median of P . A value a is said
to be a median of P , if
P ((−∞, a]) ≥
1
1
and P ((a, ∞)) ≥ .
2
2
The collection of all medians of P forms an interval of the form [c, d].
Further more,
1
med(P ) ≥ g ⇔ P ((−∞, g]) ≤ ,
2
and
1
med(P ) ≤ g ⇔ P ((−∞, g]) ≥ .
2
ˆ of med(P ) under absolute error loss, based
Let the Bayes estimator med
on just the prior Dα is the median of med(P ). Then
{median of med(P ) under Dα ≥ g}
1
⇔ Dα (med(P ) ≤ g) ≤
2
1
1
⇔ Dα (P (−∞, g] ≥ ) ≤
2
2
1
1
⇔ Q(B(α(−∞, g], α(g, ∞)) ≥ ) ≤
2
2
1
α(−∞, g])
≤
⇔
α(X )
2
⇔ g ≥ med(ᾱ) and g ≤ med(ᾱ)
Thus the median of ᾱ is our Bayes estimator.
Let P ∼ Dα and X1 , . . . , Xn |P be i.i.d.P , then
Q(Xn |X1 , . . . , Xn−1 ) =
79
α + (n − 1)Fn−1
α(X ) + n − 1
where Fn−1 is the empirical df of X1 , . . . , Xn−1 . Thus
α({X1, . . . , Xn }) + (n − 1)
α(X ) + n − 1
n−1
.
≥
α(X ) + n − 1
Q(Xn ∈ {X1 , . . . , Xn−1 }|X1 , . . . , Xn−1 ) =
There will be an equality in the above if α is non-atomic. Thus in all cases,
there is positive probability that Xn will be a repeated observation.
We will assume that α is non-atomic and define Bernoulli random variables D1 , D2 , . . . to indicate that observations number 1, 2, . . . are new observations. More precisely,
Example 25.2. Let D1 = 1 and

 0 if X ∈ {X , . . . , X }
n
1
n−1
Dn =
n = 2, 3, . . .
 1 if X 6∈ {X , . . . , X },
n
1
n−1
We will also keep track of the observation number that corresponds to a
new observation and keep track of its magnitude as follows. Let τ1 = 1, Y1 =
Xτ1 = X1 and τn = max {k : k ≥ τn−1 and Xn 6∈ {X1 , . . . , Xn−1 }}, Yn =
Xτn , n = 2, 3, . . . .
A simple example will be:
{Xn } = {1, 2, 1, 6, 2, 1, 1, . . . }
{Dn } = {1, 1, 0, 1, 0, 0, 0, . . . }
{τn } = {1, 2, 4, . . . }
{Yn } = {1, 2, 6, . . . }
Notice that
Pn
1
Di is the number of distinct observations among {X1 , . . . , Xn }.
80
Theorem 25.3. Assume that the finite measure α is non-atomic. Then
D1 , D2 , . . . are independent, and
P (Dn = 1) =
α(X )
, n = 1, 2, . . .
α(X ) + n − 1
Proof: Notice that
Q(Dn = 1|X1 , . . . , Xn−1 ) = Q(Xn 6∈ {X1 , . . . , Xn−1 }|X1 , . . . , Xn−1 )
α(X − {X1 , . . . , Xn−1 })
=
α(X ) + n − 1
α(X )
.
=
α(X ) + n − 1
Since this conditional probability is a constant, it is also equal to
Q(Dn
=
1|D1 , . . . , Dn−1 ).
P (Dn = 1) =
Thus D1 , D2 , . . . are independent and
α(X )
.
α(X )+n−1
When α(X ) = 1, P (Dn = 1) =
1
,n
n
= 1, 2, . . . . An example of inde-
pendent Bernoulli random variables with such a distribution with arises in a
different situation. as described below.
Example 25.3. Let X1 , X2 , are i.i.d. with continuous distribution. Let
D1 = 1, and

 1
Dn =
 0
if Xn > max {X1 , . . . , Xn−1 }
n = 2, 3, . . .
otherwise
If X1 , X2 , . . . are the flood levels on a river in successive years, the Bernoulli
random variables Dn give the year numbers on which records occur. Assume
that X1 , X2 , . . . are i.i.d. and the common distribution is continuous. The
probability of a record on year n is P (Xn > max (X1 , . . . , Xn−1 )) is n1 , independent of previous history. This is since all possible vector of ranks of the
81
first n observations are equally likely. The event {Xn > max (X1 , . . . , Xn−1 )}
is just the event that the rank of Xn among the first n observations is equal
to n. This is therefore equal to
1
n
and does not depend the values of the
previous observations.
82
March 4, 2008
26
The sequence Y1, Y2, . . .
Recalling from our last class, let X1 , X2 , . . . |P be i.i.d P and P ∼ Dα where
α is non-atomic.
Let Dn = I{Xn 6∈ {X1 , . . . , Xn−1}}, i.e. Dn = 1 if Xn is a new observation, and Dn = 0, if it is equal to a previous observation.
Define τ1 = 1 and Y1 = Xτ1 ,
τn = inf{m, m > τn−1 , Xm 6∈ {X1 , . . . , Xm−1 }}
= inf{m, m > τn−1 , Xm 6∈ {Y1 , . . . , Ym−1 }},
and
Yn = Yτn .
We established that D1 , D2 , . . . are independent and
Q(Dn = 1) =
α(X )
α(X ) + n − 1
We will now establish the following.
Theorem 26.1. Under the assumption that α is non-atomic, the random
variables Y1 , Y2, . . . are i.i.d ᾱ.
83
Proof: Note that τn ≥ n and n − 1 ≤ τn−1 ≤ τn − 1. We calculate
Q(Yn ∈ A, τn = m|Y1 , . . . , Yn−1)
m−1
X
=
Q(Yn ∈ A, τn = m, τn−1 = r|Y1 , . . . , Yn−1)
r=n−1
=
m−1
X
r=n−1
Q(Yn ∈ A|τn = m, τn−1 = r, Y1 , . . . , Yn−1 )Q(τn = m, τn−1 = r|Y1, . . . , Yn−1 ).
∗
The event {τn = m, τn−1 = r} and also be written as {Xi ∈ Yn−1
,r+1 ≤ i ≤
∗
∗
m − 1, Xm 6∈ Yn−1
, τn−1 = r} where Yn−1
= {Y1, . . . , Yn−1 }. It follows that
Q(Yn ∈ A|τn = m, τn−1 = r, X1 , . . . , Xm−1 )
∗
α(A − Yn−1
)
=
α(X ) + m − 1
α(A)
, since α is non-atomic
=
α(X ) + m − 1
= Q(Yn ∈ A|τn = m, τn−1 = m, Y1 , . . . , Yn−1)
since the previous expression depends only on m. Substituting this in the
previous calculation, we get
Q(Yn ∈ A, τn = m|Y1 , . . . , Yn−1 )
m−1
X
α(A)
=
Q(τn = m, τn−1 = r|Y1 , . . . , Yn−1 )
α(X ) + m − 1
r=n−1
and
=
Q(Yn ∈ A|Y1 , . . . , Yn−1)
∞ m−1
X
X
α(A)
α(X ) + m − 1
m=n r=n−1
(26.19)
Q(τn = m, τn−1 = r|Y1 , . . . , Yn−1 ).
84
Since
∞
X
m=n
Q(Yn ∈ X |Y1, . . . , Yn−1 ) = 1,
we obtain, by letting A = X in (26.19),
∞ m=1
X
X
α(X )
Q(τn = m, τn−1 = r|Y1 , . . . , Yn−1 ) = 1.
α(X ) + m − 1
m=n r=n−1
Substituting this in (26.19) we get
Q(Yn ∈ A|Y1 , . . . , Yn−1) =
α(A))
= ᾱ(A).
α(X )
This completes the proof of theorem 26.1.
27
The sequence D1, D2, . . .
We will now study some properties of the sequence D1 , D2 , . . . which indicates
occurrences of a new observations in the sequence X1 , X2 , . . . . Recall that
we are assuming that α is non-atomic.
Notice that
∞
X
E(Di ) =
∞
X
1
i=1
α(X )
= ∞.
α(X ) + n − 1
This means that
∞
X
i=1
Q(Di = 1) = ∞ i. e. P ({Di = 1 infinitely often }) = 1.
We can strengthen this in the form of the following theorem:
Theorem 27.1.
Pn
Di
→ α(X ) with probability 1.
log n
1
85
Proof: Recall two standard results.
P
Theorem 27.2. If
Xn < ∞ and an ր ∞, then
Pn
ai Xi
an
1
→ 0.
Theorem 27.3. Let X1 , X2 , . . . iid and let E(Xi ) = 0, i = 1, 2, . . . . Assume
P
P
that
V (Xi ) < ∞. Then
Xi < ∞ with probability 1.
)
).
From last class, we know that Dn ∼Bin(1, α(Xα(X
)+n−1
Dn −E(Dn )
.
log n
Let Wn =
V (Wn ) =
and hence
P
n
Then
1
α(X )
α(X )
(1 −
)
2
(log n) α(X ) + n − 1
α(X ) + n − 1
α(X )
1
≤
2
(log n) α(X ) + n − 1
1
∼
n(log n)2
V (Wn ) < ∞. Thus
X
and
Wn =
X Dn − E(Dn )
log n
n
< ∞.
n
1 X
(Di − E(Di )) → 0
log n 1
with probability 1. Since
n
n
1 X
1 X
α(X )
E(Di ) =
→ α(X ).
log n 1
log n 1 α(X ) + i − 1
Therefore,
Pn
1 Di
log n
→ α(X ) with probability 1.
86
28
Estimation of the parameters of the Dirichlet distribution
Consider the standard Bayes nonparametric problem, where the unknown
distribution (parameter) P has distribution Dα and, given P , the observations X1 , X2 , . . . are i.i.d. P . In general, one will not be able to estimate the
parameter of the prior distribution from such data, but when α is non-atomic,
we will show that the parameter α of the prior distribution consistently from
X1 , X2 , . . . . We will later amplify on the reasons why this is possible.
We already saw in Theorem 27.1 that
n
1 X
Di → α(X ) with probability 1.
log(n) 1
Again, from Theorem 26.1, we also have, from the Glivenko-Cantelli theorem, that
n
1X
δXi (A) → α(A) with probability 1.
n 1
Combining these two results we get
n
n
1 X
1X
d
Di ·
δXi → α with probability 1
log(n) 1
n 1
d
(where → means converges weakly) which can also be written as
Dα P : P ∞
n
n
1X
1 X
d
Di ·
δXi → α = 1
log(n) 1
n 1
= 1.
(28.20)
Let us examine this result further. We have data X1 , . . . , Xn , which
given the unknown common distribution P , are i.i.d.P . Further, from our
87
prior knowledge, P has some prior distribution; in this case, a Dirichlet
distribution with parameter α. We have established that we can estimate
the parameter α of the prior distribution, consistently from the data, when
α is non-atomic.
This does not happen in standard problems. Here is an example where
this can happen
Example 28.1. Let the data X1 , X2 , . . . be i.i.d. P . Suppose that there
are two possible prior distributions for P , one giving all its mass to {P :
P = Unif(a, b), 0 ≤ a < b ≤ 1}, and the other assigning all its mass to
{P : P = Unif(a, b), 2 ≤ a < b ≤ 3}. We can tell which is the prior
distribution from just the first observation X1 . This phenomenon occurred
because the two prior distributions are singular with respect to one another.
In the same way, for two dirichlet priors, Dα and Dα′ , where α and α′ are
non-atomic and
α 6= α′ ,
the sample data can consistently tell which of the two priors is the correct
Pn
Pn
1
1
prior. In fact, the limit of log(n)
1 Di · n
1 δXi is correct α. Moreover, let
n
Aα = {P : P ∞
n
1X
1 X
d
Di ·
δXi → α = 1.
log(n) 1
n 1
Then the subsets Aα and Aα′ of P are disjoint, and Dα (Aα ) = 1, Dα′ (Aα′ ) =
1. This clearly shows that Dα and Dα′ are singular with respect to each other
in this case.
For later discussion, we will use the following result, which is being assigned as a homework problem.
88
Let A ∈ X . Then (A, Ac ) is a partition of X . Any probability measure
P on X can be expressed as:
P (B ∩ Ac )
P (B ∩ A)
+ P (Ac )
P (A)
P (Ac )
c
= P (A)PA (B) + P (A )PAc (B)
P (B) = P (A)
where PA and PAc are the restrictions of the pm P to A and Ac .
Theorem 28.1. If P ∼ Dα then (P (A), P (Ac )),PA and PAc are independent
and P (A) ∼ Beta(α(A), α(Ac )), PA ∼ DαA , PAc ∼ DαAc .
Here αA (B) = α(A ∩ B), whereas PA (B) =
P (A∩B)
;
P (A)
we define restricted
measure and restricted probability measures differently.
Let α be a finite measure. Let M be the set of points where α puts
positive mass. We always have the decomposition
α = α1 + α2
where α1 is non-atomic (continuous but may not have a density) and α2 is
discrete or singular with respect to Lebesgue measure. Clearly α1 (M) = 0.
Taking M to be A in Theorem (28.1) and writing
P = P (M)PM + P (M c )PM c
we have PM ∼ DαM = Dα2 and PM c ∼ DαM c = Dα1 .
Let X1 , X2 , . . . given P be iid P . Let X1M , X2M , . . . be the elements of
c
c
X1 , X2 , . . . which fall in M. Define X1M , X2M , . . . similarly. Then X1M , X2M , . . .
c
c
given P are i.i.d. PM and X1M , X2M , . . . given P are i.i.d. PM c .
If α′ is another finite measure, we can also look at M ′ the set of points of
positive mass under α′ and consider the decomposition α′ = α1′ + α2′ where
α1′ is the non-atomic part and α2′ is the discrete part.
89
We can state the following result.
Theorem 28.2.
1 If M 6= M ′ , then Dα and Dα′ are singular with respect to each other,
P
since for an x ∈ M ′ ∩ M c , n1 n1 I(Xi = x) → P ({x}), and P ({x}) = 0
under Dα and P ({x}) > 0 under Dα′ .
2 If M = M ′ and α1 6= α1′ , then Dα and Dα′ are singular with respect
to each other, because Dalpha and Dα′ are singular with respect to each
c
c
other arguing as before using the sequence X1M , X2M , . . . .
3 If M = M ′ , α1 = α1′ and α2 6= α2′ , one can take M to be {1, 2, . . . }.
This is the case of looking at two Dirichlet distributions on P = {(p1 , p2 , . . . ) :
P
0 ≤ pi ≤ 1, pi = 1}. These Dirichlet measures can be expressed in
terms of countable number of independent Gamma random variables.
From Kakutani’s theorem it will follow that Dα and Dα′ are either absolutely continuous with respect to each other or singular. Necessary
and sufficient conditions can be given for absolute continuity and singularity.
29
Other ways to introduce prior distributions on P
Consider a distribution function F (t). It is a right-continuous non-decreasing
function and we want a probability measure on the space of such functions,
i.e. we want a random function F (t). We will assume, for convenience, that
X = [0, ∞).
90
Consider a right continuous non-decreasing stochastic process
{X(t) : t ≥ 0 with X(t) ր X(∞) < ∞}.
Define
F (t) =
X(t)
.
X(∞)
Then F (t) is a random distribution function, and its distribution can be used
as a prior distribution in a nonparametric problem.
Here is an example of such a stochastic process.
Example 29.1. Let 0 < t1 < · · · < t∞ < ∞, and assume that the increments {X(t1 ), X(t2 ) − X(t1 ), . . . , X(tk ) − X(tk−1 )} are non-negative and
independent. In particular assume that these increments have independent
Gamma distributions with shape parameters {α(t1 ), α(t2 )−α(t1 ), . . . , α(tk )−
α(tk−1)},where α is a non-decreasing function with α(∞) < ∞. Such a
stochastic process exists and is called a Gamma process. The function α can
also be considered as a finite measure on [0, ∞).
It is clear that the distribution of (F (t1 ), . . . , F (tk ) − F (tk−1 )) is
D(α(t1 ), α(t2 ) − α(t1 ), . . . , α(tk ) − α(tk−1)),
in other words, the distribution of the pm P derived from F is our Dirichlet
distribution Dα . This is therefore another definition of Dirichlet Distribution.
Another property of the Gamma process is that it is a pure jump process.
This implies that F (and hence P ) is discrete with probability 1 under Dα .
We can then ask for the distribution of the jumps of P . We cannot talk
about the first jump of P . However, we can talk about the largest jump
91
p∗1 , the second largest jump p∗2 , etc. and also the locations of those jumps
X1∗ , X2∗ . . . . In that case,
P =
∞
X
p∗n δXn∗ .
1
The exact distribution of (p∗1 , p∗2 , . . . ) has been obtained in Klass and
Ferguson, and it is independent of X1∗ , X2∗ , . . . which are i.i.d. ᾱ. This
representation is different from our representation to be given in the next
class. However, when our probability masses (p1 , p2 , . . . ) are rearranged in
increasing order we get the distribution of (p∗1 , p∗2 , . . . ).
30
More on the estimation of the median m(P )
We already discussed the estimation of the median m(P ) = median of (P ).
The median is any number x satisfying P ((−∞, x]) ≥
1
, P ([x, ∞))
2
≥
1
.
2
Also, {P : m(P ) ≥ x} = {P : P ((−∞, x]) ≤ 21 }. We will now obtain the
distribution of m(P ) rather than the median of this distribution, which is
the Bayes estimate of m(P ) under the absolute deviation loss function.
Under the prior distribution Dα
1
Dα ({m(P ) ≥ x}) = Dα ({P (−∞, x]) ≤ })
2
Z 1 α(x)−1
α(X )−α(x)−1
2 u
(1 − u)
du.
=
B(α(x), α(X ) − α(x))
0
The following two graphs give the survival function and density function
of m(P ) under Dα for the two cases:
• α(x) is k times the uniform distribution on [0, 1], i.e. α(x) = kx.
• α(x) = kΦ(x), where Φ(x) is normal distribution function.
92
with k = 10.
2.5
2
pdf. of
m(P)
survival fn.
of m(P)
1.5
1
0.5
0
0
0.2
0.4
0.6
0.8
1
Figure 1: Survival function and pdf of m(P ) under Dirichlet with paramter
10 times uniform.
93
1.4
1.2
1
pdf.of
m(P)
0.8
0.6
0.4
survival
fn. of m(P)
0.2
0
−3
−2
−1
0
1
2
3
Figure 2: Survival function and pdf of m(P ) under Dirichlet with paramter
10 times the standard Normal.
March 18, 2008
31
Under the assumption that α is non-atomic
Let us recap some results from previous classes.
Let P ∼ Dα and X1 , X2 , . . . , given P , be i.i.d P . Let α be non-atomic and
let Y1 .Y2 , . . . be the distinct observations with Y1 = X1 . We also established
that
1 P is discrete distribution with probability 1
2 limn Fn = P with probability 1, where the limit is “in distribution”,
and Fn is the empirical distribution of X1 , X2 , . . . , Xn
94
We also saw that P ({X1 }) is well defined measurable random variable
and
P ({X1 }) = lim Fn ({X1 }) = lim F2n ({X1 })
where F2n is the empirical distribution of X2 , X3 , . . . , Xn . To obtain the
conditional distribution of F2n ({X1 }) given X1 , we should look at the joint
distribution of X2 , X3 , . . . given X1 , which is a Polyá sequence with parameter α + δX1 . Hence the conditional distribution of P (A) = lim F2n (A) given
X1 is Beta(α(A) + δX1 (A), α(Ac ) + δX1 (Ac )). Putting A = {X1 }, we find
that the conditional distribution of P ({X1 }) given X1 is B(1, α(X )) since
α is non-atomic. Again, since this does not depend on X1 , the marginal
distribution of P ({X1}) is B(1, α(X )) and P ({X1 }) is independent of X1 .
Let us now look at Y1 , Y2 , . . . the distinct observations.
Then Y1 = X1 and
P ({Y1}) is independent of Y1 , and P ({Y1}) ∼ Beta(1, α(X))
What is the distribution of Y2 ?
Let {X1′ , X2′ , . . . } be the sequence {X1 , X2 , . . . } after removing all mem-
′
} be the sequence
bers of this sequence equal to X1 . Let {X1′ , X2′ , . . . , Xm
n
X1 , X2 , . . . , Xn after removing all members equal to X1 ’s in it. It is easy to
see that mn ∞ as n → ∞. Notice that X1′ = Y2 . Let Gmn be the empirical
′
and nFn ({X1′ }) be the number of X1′ that
distribution of X1′ , X2′ , . . . , Xm
n
occur in X1 , X2 , . . . , Xn . Then
nFn ({X1′ })
n(1 − Fn ({X1′ }))
Fn ({X1′ })
=
(1 − Fn ({X1′ }))
Gmn ({X1′ }) =
95
Since X1′ , X2′ , . . . |P are i.i.d. P(X −{X1 }) and P(X −{X1 }) ∼ Dα(X −{X1 }) and
α(X −{X1 }) (X − {X1 }) = α(X ), it follows that
Gmn (Y2 ) →
P ({Y2})
∼ Beta(1, α(X ))
1 − P ({Y1})
and independent of Y2 from the same arguments as before and using the fact
that α is non-atomic.
Define pi = P ({Yi}), i = 1, 2, . . . and θ1 , θ2 , . . . as follows
θ1 = p1
p2
(1 − p1 )
p3
θ3 =
(1 − p1 − p2 )
..
.
θ2 =
Then, from the above results, θ1 , θ2 , . . . are i.i.d. with Beta(1, α(X )) distributions and independent of Y1 , Y2, . . . . Since Fn is concentrated on Y1 , Y2 , . . . ,
the distinct observations, and converges to P , we also have
P =
∞
X
i=1
P ({Yi})δYi =
∞
X
pi δYi
i=1
with (p1 , p2 , . . . ) independent of (Y1 , Y2 , . . . ). Further more the distribution of (p1 , p2 , . . . ) is given through the iid random variables (θ1 , θ2 , . . . ) and
Y1 , Y2 , . . . are iid with common distribution ᾱ.
All this was established under the curious assumption that α is nonatomic.
96
32
Sethuraman’s definition of the Dirichlet
distribution
We will now establish that the above results are true without any assumption
on α giving a new and direct definition of the Dirichlet distribution on the
space of probability measures on an arbitrary measurable space X .
Consider random variables, (on some probability space (Ω, A, Q)), θ =
(θ1 , θ2 , . . . ) taking values in [0, 1] which are iid Beta(1, α(X )) and random
variables Y = (Y1 , Y2 , . . . ) taking values in X which are iid ᾱ. Assume that
θ and Y are independent. Define
p1 = θ1
p2 = θ2 (1 − θ1 )
p3 = θ3 (1 − θ1 )(1 − θ2 )
..
.
Then
P
n
pn = 1 with probability 1 since 1 − p1 − · · · − pn =
Qn
1 (1 − θi )
→0
with probability 1. We can say that the discrete pm defined by p1 , p2 , . . . on
the integers has discrete failure rates θ1 , θ2 , . . . which are iid Beta(1, α(X )).
Define a random probability measure P on X by
P (B) = P ((θ, Y), B) =
∞
X
pi δYi (B).
(32.21)
i=1
Theorem 32.1. The random measure P defined in (32.21) has the Dirichlet
distribution Dα , i.e.
P ∼ Dα .
97
Proof:
1 Note that the random measure δX1 is a measurable map into P, since
{ω : δX1 (ω) (A) < r} =
{ω : X1 (ω) 6∈ A} if r < 1
{ω : X1 (ω) ∈ X } if r ≥ 1
and these are measurable sets. Hence P is a measurable map into P
and is a random probability measure.
2 It is also clearly a discrete pm for each ω.
3 To show that P has the Dirichlet distribution Dα we have to show that
(P (B1 ), . . . , P (Bk )) ∼ D(α(B1 ), . . . , α(Bk ))
for each partition (B1 , . . . , Bk ).
Define θ ∗ = (θ2 , θ3 , . . . ) and Y ∗ = (Y2 , Y3 , . . . ). Then (θ ∗ , Y ∗) has the
same distribution as (θ, Y) and is independent of θ1 and Y1 , which are
independent of each other.
We see that
P (B) = P ((θ, Y), B)
∞
X
=
pi δYi (B)
1
= p1 δY1 (B) +
∞
X
(pj δYj (B))
j=2
= θ1 δY1 (B) + (1 − θ1 )
∞
X
= θ1 δY1 (B) + (1 − θ1 )
∞
X
98
2
2
pj
δY (B)
1 − θ1 j
p′j δYj
where
pj
1 − θ1
Q
θj 1j−1 (1 − θr )
=
1 − θ1
j−1
Y
(1 − θr ), j ≥ 2
= θj
p′j =
2
and hence,
P = P ((θ, Y))
= θ1 δY1 + (1 − θ1 )P ((θ∗ , Y ∗))
d
= θ1 δY1 + (1 − θ1 )P
since P ((θ, Y)) and P ((θ ∗ , Y ∗ )) have the same distribution. Also,
notice in the last line above that θ1 ,Y1 and P are independent.
Hence



P (B1 )



 ..

 d
=
θ
 .

1



P (Bk )


δY1 (B1 )


..


+
(1
−
θ
)

.
1 


δY1 (Bk )

P (B1 )

..


.

P (Bk )
(32.22)
Where θ1 ∼ Beta(1, α) , Y1 ∼ ᾱ, and (θ1 , Y1 , (P (B1), P (B2 ), . . . , P (Bk )))
are independent.
We will now check that (P (B1 ), . . . , P (Bk )) ∼ D(α(B1 ), . . . , α(Bk ))
satisfies the distributional equation (32.22), by finding the distribution
of the RHS of (32.22). Conditional on Y1 ∈ Bi , the distribution of


δY1 (B1 )


 ..


 .


δY1 (Bk )
99
is degenerate at ei and is also therefore equal to Dei . The distribution
of


P (B1 )


 ..

 .



P (Bk )
is Dα∗ where

α(B1 )

 ..
α∗ =  .

α(Bk )



.

From a standard result on finite dimensional Dirichlet distributions,
the conditional distribution of the RHS, given that Y1 ∈ Bi , becomes
Dα∗ +ei . Since P (Y1 ∈ Bi ) = αi∗ /α(X ), 1 ≤ i ≤ k, it follows from an-
other standard fact on finite Dirichlet distributions that the distribution
of the RHS is Dα∗ . This checks this as a solution of the distributional
equation (32.22).
3 We will now show that this is the unique solution by proving the following theorem.
Theorem 32.2. Consider the distribution equation
d
V = U + WV
(32.23)
where U and V are in Rk (or in some linear space), W in R1 and
(U, V ) is independent of V . Assume that P (|W | < 1) = 1. Then if
there is a solution for the distribution equation of V , it is unique.
Proof of Theorem 32.2
100
Suppose that V ,V ′ are two random variables whose unequal distributions solve the distributional equation (32.23).
Let (U1 , W1 ),(U2 , W2 ) be i.i.d.with the distribution of (U.W ), and independent of V ,V ′ .
Define
V1 = V
V1′ = V ′
V2 = U1 + W1 V1
V2′ = U1 + W1 V1′
V3 = U2 + W2 V2
..
.
V3′ = U2 + W2 V2′
..
.
Then V2 ∼ V1 , V2′ ∼ V1′ ,V3 ∼ V1 ,V3′ ∼ V2′ ,etc.
d
|V1 − V1′ | = |Vn − Vn′ |
′
= |Wn ||Vn−1 − Vn−1
|
= |W1 ||W2 | · · · |Wn ||V1 − V1′ |
d
→ 0,
as n → ∞
since W are i.i.d. with |Wi | < 1.
d
Hence, |V1 − V1′ | = 0, which leads to a contradiction. This completes
the proof of Theorem 32.2.
Continuation of proof of Theorem (32.1)
Thus (P (B1 ), P (B2), . . . , P (Bk )) ∼ Dα∗ is only solution for (32.22).
This shows that the distribution of P defined in (32.21) is Dα .
101
March 25, 2008
33
Finiteness of integrals of random probability measures
We defined Dirichlet distributions to use them as priors in the standard
nonparametric problem, where the unknown distribution P is the parameter.
In such problems, we would also like to estimate functions of this parameter,
R
especially functions like g(x)dP (x) where g is some measurable function on
X . It will be useful to ask a question even before asking for a Bayes estimator
R
R
of g(x)dP (x). Is the function of interest, namely g(x)dP (x) finite under
the prior distribution Dα ? Only if we know that it is finite can we ask for an
estimator of that function.
We showed in a previous lecture that the Bayes estimator of this function,
with no data and under a prior which is Dα is
Z
Z
EDα ( g(x)dP (x)) = g(x)dᾱ(x)
under the assumption that |g| is integrable under ᾱ. This means that
R
R
g(x)dP (x) is not only finite but it is also integrable if g(x)dᾱ(x) < ∞.
Several researchers have discovered a necessary and sufficient condition
R
for the finiteness of g(x)dP (x) under a Dirichlet distribution. We will give
a simple and new proof of that result, and also obtain similar necessary and
sufficient conditions for a larger class of distributions that include Dirichlet
distributions.
102
Theorem 33.1. Let g be a measurable function on X . Let Ag = {P :
R
g(x)dP (x) is finite}. Then
Dα (Ag ) = 1 if and only if
Z
Proof: Since
R
log(1 + |g(x)|)dᾱ(x) < ∞ where ᾱ =
g(x)dP (x) is finite if and only if
R
α
.
α(X )
(33.24)
|g(x)|dP (x) is finite, we
can, without loss of generality, assume that g(x) is positive to prove this
theorem. Also, there is no loss of generality in assuming that g(x) = x, and
we will do so to complete the proof.
Recall our constructive definition of the Dirichlet distribution of a random
probability measure P :
P =
∞
X
pi δYi
1
where P = P (θ, Y), with p1 = θ1 , pn = θn
Qn−1
1
(1 − θi ) and θ = (θ1 , θ2 , . . . )
are i.i.d. random variables with a Beta(1, α) distribution and Y = (Y1 , Y2 , . . . )
are i.i.d. random variables with distribution ᾱ.
Aside: The random probability measure P above is measurable since the
random probability measure δY1 is measurable. This is so because

 {Y 6∈ A} if r < 1
1
{ω : δY1 (A) < r} =
 X
if r > 1
and both the sets on the right hand side are measurable. This makes δY1 a
measurable map into P. Since P is a convex combination of such measures
with measurable coefficients, it is a measurable map into P.
Notice that
Z
x dP (x) =
∞
X
1
103
pi Y i .
R
P
x dP is finite if and only if ∞
1 pi Yi converges, and Dα (Ag ) = 1 is
P
P
same as Dα ( Pi Yi converges) = 1 or Q( Pi Yi converges) = 1, where, as
Thus
usual, Q is the joint distribution of (θ, Y).
We will recall the famous Kolmogorov’s three-series theorem:
Theorem 33.2. Let X1 , X2 , . . . be independent random variables. Define

 X if |X| ≤ ǫ
ǫ
X =
= XI(|X| ≤ ǫ)
 0, if |X| > ǫ
Then
P
Xi converges with probability 1 if and only if
∞
X
1
P (|Xi| > ǫ) < ∞,
∞
X
E(Xiǫ ) < ∞,
(33.26)
∞
X
V (Xiǫ ) < ∞
(33.27)
1
and
(33.25)
1
Conditional on (p1 , p2 , . . . ), the random variables p1 Y1 , p2 Y2 , . . . are independent.
To continue the proof of Theorem 33.1, we will establish the following
theorem.
Theorem 33.3. Let X1 , X2 , . . . be i.i.d. random variables and let 0 < ρ < 1.
Then
∞
X
ρn Xn converges with probability 1
(33.28)
1
if and only if
E(log(1 + |X1 |)) < ∞.
104
(33.29)
Let (an , n = 1, 2, . . . ) be random variables independent of X1 , X2 , . . . and
suppose that
1
log(an ) → −K > 0 with probability 1 and K is a constant.
n
Then series
∞
X
(33.30)
an Xn converges with probability 1
1
if and only if (33.29) holds.
Proof: We can assume without loss of generality that the Xn ’s are positive
also that Xn ≥ 1, n = 1, 2, . . . .
Let ǫ > 0. Then
∞ >
∞
X
n=1
n
P (|ρ Xn | > ǫ) =
=
∞ X
∞
X
=
∞
X
n=1
i=1
≤ E(
and
∞
X
n=1
P (|X1| >
ǫ
)
ρn
∞
i
X
ǫ
ǫ
ǫ
ǫ X
P ( i < |X1 | < i+1 ) =
P ( i < |X1 | < i+1 )
1
ρ
ρ
ρ
ρ
i=n
i=1
n=1
iP (i <
log |X1 | − log(ǫ)
≤ i + 1)
− log ρ
log |X1 | − log(ǫ)
)
− log(ρ)
log |X1 | − log(ǫ)
)−1
− log(ρ)
⇔ E(log |X1 |) < ∞
≥ E(
This shows that condition (33.25) is equivalent to condition (33.29). Consequently, (33.28) implies condition (33.29).
Conversely let (33.29) hold. Then the condition analogous to (33.25)
P n
holds for the series
ρ Xn . We will now show that condition (33.29) also
105
implies the conditions analogous to (33.26) and (33.27). This will complete
the proof that (33.29) implies (33.28).
We show the details for just one of these. (Recall that, WLOG, we
assumed X1 > 0.)
∞
X
n
ǫ
E(ρ Xn ) =
n=1
∞
X
n=1
n
n
E(ρ Xn I(|ρ Xn | ≤ ǫ)) =
∞
X
E(ρn X1 , X1 <
n=1
ǫ
)
ρn
∞ X
n
X
ǫ
ǫ
=
[
E(ρn X1 , i−i < X1 ≤ i ) + E(ρn X1 , X1 ≤ ǫ)]
ρ
ρ
n=1 i=1
≤
∞
X
E(X1 ,
i=1
ǫ
ρi−1
∞
< X1 ≤
∞
ǫ X n
ρǫ
)
ρ
+
ρi n=i
(1 − ρ)
1 X i
X1
ǫ
ǫ
ρǫ
=
ρ E(
log X1 , i−1 < X1 ≤ i ) +
1 − ρ i=1
log X1
ρ
ρ
(1 − ρ)
∞
ρi ρǫi
ǫ
ǫ
ρǫ
1 X
E(log X1 , i−1 < X1 ≤ i ) +
≤
1 − ρ i=1 log(ǫ) − i log(ρ)
ρ
ρ
(1 − ρ)
∞
ǫ
ǫ
ρǫ
1 X ǫ
E(log X1 , i−1 < X1 ≤ i ) +
≤
1 − ρ i=1 log(ǫ)
ρ
ρ
(1 − ρ)
ǫ
ρǫ
=
E(log X) +
<∞
(1 − ρ) log(ǫ)
(1 − ρ)
This completes the proof of the first part of Theorem 33.3. The second
part is immediate.
Continuation of the proof of Theorem 33.1: From Theorem 33.3 it is
enough to show that
1
log(pn ) → −K
n
where K > 0 is a constant. Note that
n−1
log(θn ) 1 X
1
log(1 − θi )
log pn =
+
n
n
n i=1
→ E(log(1 − θ1 )) < 0
106
since E(log(1 − θ1 )) > −∞ when θ1 ∼ Beta(1, α(X )). We have also used the
fact
log(θn )
n
→ 0 with probability 1 since E(| log(θ1 )|) < ∞.
We can state the following easy generalization.
Theorem 33.4. Let P be a random probability measure defined by
P =
∞
X
pn δYn
1
where (p1 , p2 , . . . ) is independent of the i.i.d. random variables {Y1 , Y2 , . . . }
satisfying the following condition:
1
log(pn ) → −K with probability 1
(33.31)
n
R
where K > 0 is a constant. Then g(x)dP (x) is finite with probability 1 if
and only if
E(log(1 + |g(Y1)|)) < ∞
We can now ask “What is the distribution of
(33.24)
R
xdP (x)” under the DirichR
let distribution Dα . Yamato showed in a paper that xdP (x) has a Cauchy
distribution of if ᾱ = Cauchy. The result does not depend on the value of
α(X ).
For a quick proof of this result, notice that
EDα (eit
R
P
= E(e
XdP
)
pn itYn
P
)
= E(E(e
pn itYn
|p1 , p2 , . . . ))
= E(E(e−
P
))
pn |t|
= e−|t|
107
since
P
pn = 1. This result does not depend on the actual distributions of
(p1 , p2 , . . . ); it is enough if (p1 , p2 , . . . ) is independent of (Y1 , Y2 , . . . .
If we assume that α ∼ k ∗ N(0, 1), then exists Z, such that
Z
and Z ∼
P
XdP |Z ∼ N(0, Z)
p2n . The moments of
for instance the first moment is
P
p2n can be determined with determination;
1
.
α(X )+1
108
March 27, 2008
34
Convergence of random probability measures
We begin with describing the several objects that we have defined with this
sketch.
Basic space
pm’s on X
pm’s on P
(X , B)
(P, σ(P))
M
x
P
µ
X ∼P
P ∼µ
element
rv’s and dist’s
w
w
convergence
xn → x
Pn → P
µn → µ
We know the meaning of convergence in the first two columns. How do
we define the convergence in the last column?
Let us first look at the definition of convergence in P. This is often done
through random variables taking values on X even though it is a property of
only their distributions.
w
We say that Pn → P if
Fn (x) = P ((−∞, x]) → F (x) = P ((−∞, x])
for all x ∈ C(F ) where C(F ) is the set of continuity points of F . If Xn , X
w
are random variables on X with Xn ∼ P, X ∼ P and Pn → P , we will also
w
say that Xn → X or that Xn converges to Xin distribution. (Note that all
convergences here are n → ∞.)
109
Another equivalent definition is
Z
Z
w
Pn → P if and only if
g(x)dPn (x) → g(x)dP (x)
for all bounded continuous functions g on X .
We have other notions of convergence of random variables and probability
measures. For instance, we say that Pn → P in variation norm if
sup ||Pn (B) − P (B)|| → 0.
B
For random variables, we have the following notions of convergence. Let
Xn , X be measurable functions from a probability space (Ω, A, λ) into (X , B).
We say that Xn → X with probability 1, if
λ({ω : Xn (ω) → X(ω)}) = 1.
p
We say that Xn → X in probability (Xn → X) if
λ(|Xn − X| > ǫ) → 0 for each ǫ > 0.
If Xn ∼ Pn , X ∼ P and Xn → with probability 1 or in probability, we
w
can conclude that Pn → P . We cannot strengthen this conclusion and claim
that Pn → P in variation norm.
With these standard ideas in mind we will now define convergence of random probability measures, that is, of elements of M.

Let µn, n = 1, 2, . . . , and µ be random pm's, i.e. elements of M. In a way analogous to our previous definition, we say that µn →w µ if

∫_P H(P) dµn(P) → ∫_P H(P) dµ(P)

for all bounded functions H(P) on P which are continuous with respect to weak convergence in P.

This definition is not totally clear since we do not know all the continuous functions H(P) with respect to weak convergence in P.

However, the following results are true, and they will allow us to show weak convergence of random pm's.
Let Pn, n = 1, 2, . . . , and P be random mappings from some probability space (Ω, A, λ) into (P, σ(P)).

Let Pn ∼ µn, n = 1, 2, . . . , and P ∼ µ. Suppose that Pn →w P with λ-probability 1, i.e.

λ(ω : Pn(ω) →w P(ω)) = 1.

Then µn →w µ.

The stronger condition

||Pn − P|| → 0 with λ-probability 1

will also only imply µn →w µ.
One generally shows that a sequence converges by first showing that it is
pre-compact.
Let {Pn } be a sequence of probability measures. It is said to be precompact if every subsequence of {Pn } has a further subsequence that converges weakly to some probability measure.
The sequence {Pn} is said to be tight if for any δ > 0 there exists a compact set Kδ ⊂ X such that

Pn(Kδ) ≥ 1 − δ for all n,

which is the same as

Pn(Kδ^c) ≤ δ for all n.
The famous Prohorov theorem states
Theorem 34.1. {Pn } is pre-compact if and only if {Pn } is tight.
Similar definitions and results hold for random pm’s, i.e. for elements of
M.
Suppose that µ ∈ M and P ∼ µ. We can compute

µ̄(B) := Eµ(P(B)) = ∫_P P(B) dµ(P),

and it is easy to see that µ̄(·) is a probability measure on (X, B), i.e. an element of P. The pm µ̄ is called the mean measure of the random pm µ.

The following theorem gives a nice characterization of tightness for random pm's.

Theorem 34.2. Let {µn} be a sequence in M. Then {µn} is tight if and only if {µ̄n} is tight.
Proof:

We first describe some sets in P which are compact under weak convergence. Fix δ > 0. For k = 1, 2, . . . , find a compact set Kk such that

µ̄n(Kk^c) ≤ (6/π^2) δ/k^3 for all n.

Let

C = ∩_k {P : P(Kk^c) ≤ 1/k}.

From one of the equivalent conditions in the Portmanteau theorem for weak convergence, it is easy to see that C is closed.

We now claim that C is compact. Let {Pm} ⊂ C. Fix ǫ > 0 and find k such that 1/k < ǫ. Then {Pm} ⊂ {P : P(Kk^c) ≤ 1/k}, i.e. Pm(Kk^c) < ǫ for all m. Hence {Pm} is tight and a subsequence converges to some probability measure. Hence C is compact.

Let {µ̄n} be tight and let C be as above. Then, by Markov's inequality,

µn({P : P(Kk^c) > 1/k}) ≤ k Eµn(P(Kk^c)) = k µ̄n(Kk^c) ≤ (6/π^2) δ/k^2

and

µn(C^c) ≤ Σ_{k=1}^∞ µn(P : P(Kk^c) > 1/k) ≤ δ Σ_{k=1}^∞ (6/π^2)(1/k^2) = δ

for all n. This shows that {µn} is tight.

Conversely, suppose that {µn} is tight. Then there exists a subsequence {nr} such that µ_{nr} →w µ. Then

E_{µ_{nr}}(P(A)) → E_µ(P(A))

if P(A) is continuous with respect to µ. Hence

µ̄_{nr}(A) → µ̄(A)

for all A such that P(A) is continuous with respect to µ. The class of such sets A is sufficiently rich, and thus µ̄_{nr} →w µ̄. This shows that {µ̄n} is tight.
Now we explore the convergence of Dαn for sequences αn.

Example 34.1. Let αn = αn(X)β for n = 1, 2, . . . , where β = ᾱn is a pm and αn(X) → 0. Let Pn = Σ_{m=1}^∞ p_{n,m} δ_{Y_m}, where Y_1, Y_2, . . . are i.i.d. β, θ_{n,1}, θ_{n,2}, . . . are i.i.d. Beta(1, αn(X)) and p_{n,m} = θ_{n,m}(1 − θ_{n,1}) · · · (1 − θ_{n,m−1}), m = 1, 2, . . . .

From our constructive definition, Pn ∼ Dαn and

Pn = Σ_{m=1}^∞ p_{n,m} δ_{Y_m} = p_{n,1} δ_{Y_1} + (1 − p_{n,1}) P*_n

where P*_n ∼ Dαn. Let P = δ_{Y_1}. Then

||Pn − P|| = ||p_{n,1} δ_{Y_1} + (1 − p_{n,1}) P*_n − δ_{Y_1}|| = ||(1 − p_{n,1})(P*_n − δ_{Y_1})|| ≤ 2(1 − p_{n,1})

because P*_n and δ_{Y_1} are probability measures. Since p_{n,1} ∼ Beta(1, αn(X)) and αn(X) → 0,

E(1 − p_{n,1}) = αn(X)/(αn(X) + 1) → 0

and

Var(1 − p_{n,1}) ≤ αn(X)/((αn(X) + 1)^2 (αn(X) + 2)) → 0.

This means that (1 − p_{n,1}) → 0 in probability, so ||Pn − δ_{Y_1}|| → 0 in probability and Dαn →w δ_{Y_1}.
April 1, 2008
35
Convergence of sequences of Dirichlet distributions
Let X be the basic space, P be the space of all probability measures on X, and M be the space of probability measures on P. We will also refer to µ ∈ M as a random probability measure, since such a µ is the distribution of a random variable P taking values in P.
A random pm µ is well defined if µ({P : P ∈ C}) is defined for all Borel
sets C in P. This is not a convenient definition. We have already seen that
if the distribution of (P (B1 ), . . . , P (Bk )) under µ is defined for all partitions
(B1 , . . . , Bk ) (or partitions based on special classes of sets like intervals with
rational end points, etc.) of X , then it identifies µ uniquely.
A random pm µ is uniquely defined if the distribution of ∫ g(x) dP(x) under µ is defined for all bounded continuous functions g on X.
Let Y be a random variable on X with distribution ᾱ. Then δY is a mapping into P and its distribution is an element of M, i.e. a random pm. We will not give a separate name to this random pm and will simply use δY to denote it.
We restate a few results from the last class.

Theorem 35.1. Let {µn} ⊂ M. Then µn →w µ if

∫ g(P) dµn(P) → ∫ g(P) dµ(P)

for every bounded continuous function g on P.

Theorem 35.2. A sequence {µn} in M is tight if and only if the sequence of mean pm's {µ̄n} is tight. The mean pm µ̄n is defined by the relation

µ̄n(B) = ∫ P(B) dµn(P).
We will now consider the convergence of sequences of Dirichlet distributions Dαn under different conditions on αn. The next result was already established in the last class.

Theorem 35.3. Let µn = Dαn, with αn = αn(X)ᾱ, and let αn(X) → 0. Then

µn →w δY

where Y is a random variable on X with distribution ᾱ.

Proof: Note that the mean measures

µ̄n(B) = E_{Dαn}(P(B)) = ᾱ(B)

do not depend on n. Hence {µ̄n} is tight and, from Theorem 34.2, {µn} is tight.
However, we do not know if the sequence converges.
Using the constructive definition, define random probability measures Pn, n = 1, 2, . . . , as follows:

Pn = Σ_{j=1}^∞ p_{j,n} δ_{Y_j},

where Y_1, Y_2, . . . are i.i.d. ᾱ, θ_{1,n}, θ_{2,n}, . . . are i.i.d. Beta(1, αn(X)) and

p_{j,n} = θ_{j,n}(1 − θ_{j−1,n}) · · · (1 − θ_{1,n}), j ≥ 1, n ≥ 1.

We know that Pn ∼ Dαn. Writing Pn = p_{1,n} δ_{Y_1} + (1 − p_{1,n}) P*_n as before,

|Pn(B) − δ_{Y_1}(B)| = |p_{1,n} δ_{Y_1}(B) + (1 − p_{1,n}) P*_n(B) − δ_{Y_1}(B)| ≤ (1 − p_{1,n})(|P*_n(B)| + δ_{Y_1}(B)) ≤ 2(1 − p_{1,n}).

Therefore sup_B |Pn(B) − δ_{Y_1}(B)| ≤ 2(1 − p_{1,n}). Also, (1 − p_{1,n}) ∼ Beta(αn(X), 1) and, since αn(X) → 0,

E(1 − p_{1,n}) = αn(X)/(αn(X) + 1) → 0

and

Var(1 − p_{1,n}) = αn(X)/((αn(X) + 1)^2 (αn(X) + 2)) → 0.

This shows that ||Pn − δ_{Y_1}|| → 0 in probability and thus

µn →w δ_{Y_1}.

Notice that the limit µ = δ_{Y_1} is not a Dirichlet distribution. For if we look at a partition (B1, . . . , Bk) of X, then

(δ_{Y_1}(B1), . . . , δ_{Y_1}(Bk)) = e_i with probability P(Y_1 ∈ B_i), i = 1, 2, . . . , k.

Thus µ is a mixture of degenerate Dirichlet distributions.
We will now look at another example.

Theorem 35.4. Let µn = Dαn and let αn = αn(X)βn, with αn(X) → α(X), where 0 < α(X) < ∞, and βn →w β. Write α = α(X)β. Then Dαn →w Dα.

Proof: We already know that {Dαn} is tight since the mean measures are converging. We need to identify a unique limit.

From the constructive definition, we can write Pn = Σ_i p_{i,n} δ_{Y_{i,n}}, where (Y_{1,n}, Y_{2,n}, . . . ) are i.i.d. βn, p_{i,n} = θ_{i,n}(1 − θ_{i−1,n}) · · · (1 − θ_{1,n}), i = 1, 2, . . . , and (θ_{1,n}, θ_{2,n}, . . . ) are i.i.d. Beta(1, αn(X)), n = 1, 2, . . . , with all the random variables defined on a single space Ω.

We already know that Pn ∼ Dαn = µn.

Since αn(X) → α(X) and βn →w β, we can choose the θ_{i,n} and Y_{i,n} so that θ_{i,n} → θ_i, i = 1, 2, . . . , with probability 1 and Y_{i,n} → Y_i, i = 1, 2, . . . , with probability 1, where θ_1, θ_2, . . . are i.i.d. Beta(1, α(X)) and Y_1, Y_2, . . . are i.i.d. β. Thus Pn →w P with probability 1, where P ∼ Dα. This identifies Dα as the unique limit of all convergent subsequences, and hence the whole sequence Dαn →w Dα.
A third example is given by the following theorem.

Theorem 35.5. Let µn = Dαn, with αn = αn(X)βn, αn(X) → ∞ and βn →w β. Let µ be the degenerate random pm δβ. Then

µn →w µ.

Proof: Note that under µ, P = β with probability 1.

Since the mean measures {βn} converge, the sequence {µn} is tight. Once again, we need to identify the unique limit.

Under µn, the distribution of (P(B1), . . . , P(Bk)) is the Dirichlet distribution with parameter (αn(X)βn(B1), . . . , αn(X)βn(Bk)). Further,

Eµn(P(B1)) = βn(B1) → β(B1)

and

Varµn(P(B1)) = βn(B1)βn(B1^c)/(αn(X) + 1) → 0

since αn(X) → ∞. Hence the distribution of (P(B1), . . . , P(Bk)) under µn converges to the distribution degenerate at (β(B1), . . . , β(Bk)). This shows that µn →w µ.

Corollary 35.1. The conclusion of the previous theorem is also true if µn = Dαn, where αn = αn(X)βn is random with αn(X) → ∞ and βn →w β in probability.
Consider the standard nonparametric problem: P ∼ Dα and X_1, X_2, . . . | P i.i.d. P. We know that the posterior distribution of P given X_1, X_2, . . . , X_n is D_{α+Σ_1^n δ_{X_i}}. Suppose that the random variables X_1, X_2, . . . are actually i.i.d. P_0. From the Glivenko-Cantelli lemma, we know that (1/n) Σ_1^n δ_{X_i} →w P_0 with probability 1. Since α(X) + n → ∞, we have

(α(B) + Σ_1^n δ_{X_i}(B))/(α(X) + n) → P_0(B)

and thus

D_{α+Σ_1^n δ_{X_i}} →w δ_{P_0}

with probability 1 under P_0. This is called posterior consistency.
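A small numerical sketch (not from the notes) of this convergence of the posterior mean of P(B); the choices of P_0, ᾱ, α(X) and the set B below are arbitrary and made only for the illustration.

```python
# The posterior mean (alpha(B) + sum_i delta_{X_i}(B)) / (alpha(X) + n)
# approaches P0(B) as n grows.  Here P0 = Exponential(1) and alpha is 5 times
# a N(0,1) distribution; both are arbitrary choices for the sketch.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
M = 5.0                                   # alpha(X)
B = (0.5, 2.0)                            # the set B = (0.5, 2.0]
alpha_B = M * (stats.norm.cdf(B[1]) - stats.norm.cdf(B[0]))
P0_B = stats.expon.cdf(B[1]) - stats.expon.cdf(B[0])

for n in (10, 100, 1000, 10000):
    X = rng.exponential(1.0, size=n)      # data i.i.d. P0
    emp = np.sum((X > B[0]) & (X <= B[1]))
    print(n, round((alpha_B + emp) / (M + n), 4), "->", round(P0_B, 4))
```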
36
An example with censoring
Let us look at another application. Let P ∼ Dα and X|P ∼ P. Suppose that we cannot observe X in its entirety, but that we can observe Y defined by

Y = X if X ∈ A^c,   Y = θ if X ∈ A,

where A is a known subset of X. What is the distribution of P|Y?

If Y ∈ A^c, then Y = X, so

P|Y = P|X ∼ D_{α+δ_X}.

If Y = θ, then we need to find the distribution of P|(Y = θ). Even though we do not know this conditional distribution, we can write down the conditional distributions of (P, X) given Y = θ as follows:

P|(Y = θ, X) ∼ D_{α+δ_X}

and

X|(P, Y = θ) ∼ P_A,

where P_A(B) = P(AB)/P(A).
Now we can employ Gibbs sampling, which can be described as follows.
Consider the bivariate random variable (X, Y ). Suppose we know L(X|Y )
and L(Y |X) and can generate observations from them. Then we can start
with some initial value (X0 , Y0 ) and proceed as follows to generate (X1 , Y1 ),
(X2 , Y2 ), . . . . Let
X1 ∼ L(X|Y = Y0 )
Y1 ∼ L(Y |X = X1 )
..
.
This generates a Markov chain {(Xn, Yn)} starting from (X0, Y0), and (Xn, Yn) →w (X∞, Y∞) where (X∞, Y∞) ∼ (X, Y).
This is generally proved by showing that the distribution of (X, Y ) is an
invariant distribution for this Markov chain, and the chain is ergodic. For
more details see my paper on this topic.
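The scheme is easy to code. Here is a minimal sketch (not from the notes) for a toy case in which both conditionals are available in closed form, namely a bivariate normal with correlation ρ; the target, the value of ρ and the chain length are assumptions made only for this illustration.

```python
# Two-block Gibbs sampler for (X, Y) bivariate normal with correlation rho,
# where X|Y ~ N(rho*Y, 1-rho^2) and Y|X ~ N(rho*X, 1-rho^2).
import numpy as np

rng = np.random.default_rng(2)
rho, n_iter = 0.8, 20000
x, y = 0.0, 0.0                                     # initial value (X0, Y0)
chain = np.empty((n_iter, 2))
for t in range(n_iter):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))    # X_{t+1} ~ L(X | Y = Y_t)
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))    # Y_{t+1} ~ L(Y | X = X_{t+1})
    chain[t] = (x, y)

# After burn-in the draws behave like draws from the joint distribution;
# the sample correlation should be close to rho.
print(np.corrcoef(chain[5000:, 0], chain[5000:, 1])[0, 1])
```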
Going back to our problem, we can consider using Gibbs sampling to find the distribution of (P, X)|Y = θ and obtain the distribution of P|Y = θ by taking just the marginal distribution.

To generate an observation P from L(P|X, Y = θ), we can take P = Σ_{i=1}^∞ p_i δ_{Y_i}, where (Y_1, Y_2, . . . ) are i.i.d. (α(·) + δ_X(·))/(α(X) + 1) and independent of (p_1, p_2, . . . ), which is generated as usual from i.i.d. random variables with distribution Beta(1, α(X) + 1).

To generate an observation from L(X|P, Y = θ), choose J = i with probability p_i. This can be done by taking U uniform on [0, 1] and choosing J as

J = i if p_1 + · · · + p_{i−1} < U ≤ p_1 + · · · + p_i.

Generate Y_1, Y_2, . . . as i.i.d. (α(·) + δ_X(·))/(α(X) + 1). Then Y_J will have the distribution of X|P, Y = θ.

Thus, starting from an initial value (P_0, X_0), we can generate X_1, X_2, . . . , X_n without generating P_1, P_2, . . . , P_n. The conditional distribution of P_{n+1} given X_n is D_{α+δ_{X_n}}, and for large n it approximates L(P|Y = θ).
To repeat, the steps are:

1. Generate U from uniform [0, 1].

2. Generate p_1 ∼ Beta(1, α(X) + 1).

3. Find J: if U ≤ p_1, put J = 1; if U > p_1, generate p_2, and if p_1 < U ≤ p_1 + p_2, put J = 2; etc.

Put X_1 = Y_J. Generate X_2, X_3, . . . , X_{10,000} successively. Then D_{α+δ_{X_{10,000}}} is an approximation to L(P|Y = θ).
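A minimal sketch (not from the notes) of steps 1 to 3 above, i.e. selecting an index J with probability p_i while generating the stick-breaking weights only until the running sum first exceeds U; the value of α(X) used is an arbitrary choice for the sketch.

```python
# Select J with P(J = i) = p_i, where p_i = theta_i * prod_{j<i}(1 - theta_j)
# and theta_i are i.i.d. Beta(1, alpha(X) + 1), generating sticks lazily.
import numpy as np

def draw_index(total_mass, rng):
    U = rng.uniform()                              # step 1
    cum, remaining, j = 0.0, 1.0, 0
    while True:
        j += 1
        theta = rng.beta(1.0, total_mass + 1.0)    # step 2 (repeated as needed)
        cum += theta * remaining                   # add p_j to the running sum
        remaining *= (1.0 - theta)
        if U <= cum:                               # step 3
            return j

rng = np.random.default_rng(3)
print([draw_index(4.0, rng) for _ in range(10)])
```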
April 3, 2008
37
Variants of the constructive definition of
Dirichlet distributions
A random pm P with the Dirichlet distribution Dα was given as follows:

P = Σ_{i=1}^∞ p_i δ_{Y_i},    (37.32)

where (θ_1, Y_1), (θ_2, Y_2), . . . are i.i.d., with Y_1, Y_2, . . . i.i.d. ᾱ and θ_1, θ_2, . . . i.i.d. Beta(1, α(X)), and p_i = θ_i(1 − θ_1) · · · (1 − θ_{i−1}) for i = 1, 2, . . . .
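A minimal sketch (not from the notes) of a truncated draw from (37.32); the choices ᾱ = N(0, 1), α(X) = 3 and the truncation level are assumptions made only for the illustration.

```python
# Truncated stick-breaking draw from D_alpha: atoms Y_i i.i.d. alpha-bar and
# weights p_i = theta_i * prod_{j<i}(1 - theta_j), theta_i i.i.d. Beta(1, alpha(X)).
import numpy as np

def dirichlet_process_draw(total_mass, base_sampler, n_atoms, rng):
    theta = rng.beta(1.0, total_mass, size=n_atoms)
    p = theta * np.concatenate(([1.0], np.cumprod(1.0 - theta)[:-1]))
    Y = base_sampler(n_atoms)
    return Y, p

rng = np.random.default_rng(4)
Y, p = dirichlet_process_draw(3.0, lambda n: rng.normal(size=n), 500, rng)
# P((-inf, 0]) for this realisation; its prior mean is alpha-bar((-inf, 0]) = 0.5
print(p[Y <= 0.0].sum())
```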
This constructive definition has inspired others to make several variants, at least formally defining more random pm's. The distributions of such random pm's may not be well understood or described. For instance, in the above, one may change just the common distribution of θ_1, θ_2, . . . to be Beta(2, α(X)), and thus define a new random pm P through (37.32), but its distribution is not well understood and it is not described in terms of Dirichlet distributions. However, if the common distribution of θ_1, θ_2, . . . is assumed to be Beta(1, 2α(X)), then the random P defined in (37.32) is just D_{2α}.
Another generalization arises by allowing for the independent random
variables θ1 , θ2 , . . . to have non-identical distributions. For instance, when ᾱ
is non-atomic, and θk ∼Beta(1 − δ, γ + kδ), k = 1, 2, . . . with 0 ≤ δ < 1, γ >
−δ, the distribution of the random pm P defined by (37.32) is called the
“two-parameter” Dirichlet distribution, denoted by Dγ,δ,ᾱ . This distribution
has been studied in other contexts, but has not found applications in Bayesian
nonparametrics.
Another generalization comes by assuming that (Y1 , Y2, . . . ) are exchangeable with a Polya distribution Qα . As before it will be assumed that (Y1 , Y2 ,
. . . ) and (θ1 , θ2 , . . . ) are independent and that the latter sequence consists
of i.i.d. Beta(1, δ) random variables. The distribution of the random pm P
defined by (37.32) is not Dirichlet, but it can be described as follows. From
the properties of a Polya distribution Qα , there exists a random pm R with
distribution Dα , such that given R, the random variables Y1 , Y2 , . . . are i.i.d.
R. Thus from our constructive definition of a Dirichlet distribution, the distribution of P given R is DδR . Thus the distribution of P is the mixture of
Dirichlet distributions

∫ D_{δR}(·) Dα(dR).
This is an unusual mixture of Dirichlet distributions.
We will now study the usual mixtures of Dirichlet distributions and also
show that there is a constructive definition of a random pm with the mixture
Dirichlet distribution.
Let P ∼ Dα. Given P, let the random variables Y_1, Y_2, . . . be i.i.d. P. Given P, Y_1 = y_1, Y_2 = y_2, . . . , let the random variables X_1, X_2, . . . be independent random variables with distributions given by the df's K(x_1|y_1), K(x_2|y_2), . . . , where K(x|y) is a df.

Then, given P, the random variables X_1, X_2, . . . are i.i.d. with common distribution given by the df ∫ K(x|y) P(dy). Thus ∫ K(x|y) P(dy) is a random df and its distribution is usually referred to as a Dirichlet mixture distribution.

One may describe this model by saying that there is a random pm R = ∫ K(x|y) P(dy) whose distribution is a Dirichlet mixture distribution, and, given R, the random variables X_1, X_2, . . . are i.i.d. R.

In the special case when K(x|y) = I(x ≥ y), the df of the point mass at y, the random df R((−∞, x]) reduces to

∫ I(x ≥ y) P(dy) = P((−∞, x]),

which is the random df with the usual Dirichlet distribution Dα.

From the constructive definition (37.32), one can give the following constructive definition of the Dirichlet mixture random pm R:

R((−∞, x]) = Σ_n p_n K(x|Y_n)

where p_n, Y_n, n = 1, 2, . . . , are as defined in (37.32).
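A minimal sketch (not from the notes) of this constructive definition of the Dirichlet-mixture random df; the kernel K(x|y) is taken to be the N(y, 1) df, ᾱ = N(0, 1), α(X) = 3, and a fixed truncation is used, all of which are assumptions made only for the illustration.

```python
# One realisation of the Dirichlet-mixture random df
# R((-inf, x]) = sum_n p_n K(x | Y_n), with a normal kernel.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
M, N = 3.0, 500
theta = rng.beta(1.0, M, size=N)
p = theta * np.concatenate(([1.0], np.cumprod(1.0 - theta)[:-1]))
Y = rng.normal(size=N)                     # atoms Y_n i.i.d. alpha-bar

def R_cdf(x):
    """Evaluate this realisation of the random df R at x."""
    return np.sum(p * stats.norm.cdf(x - Y))

print([round(R_cdf(x), 3) for x in (-2.0, 0.0, 2.0)])
```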
The statistical problem is to determine the posterior distribution of P given the data X_1, . . . , X_n in a Dirichlet mixture model. Let X = (X_1, . . . , X_n), Y = (Y_1, . . . , Y_n) and let ν(Y|X) denote the conditional distribution of Y given X. Since the conditional distribution of P given X, Y = y is D_{α+Σ_1^n δ_{y_i}}, one can write the posterior distribution as

L(P|X) = ∫ D_{α+Σ_1^n δ_{y_i}} dν(y|X).

To obtain ν(y|x), we will first write down the joint distribution of (X, Y):

Q(Y_1 ∈ A_1, . . . , Y_n ∈ A_n, X_1 ∈ B_1, . . . , X_n ∈ B_n)
= E( ∫_{A_1×···×A_n} Π_{i=1}^n K(B_i|y_i) P(dy_1) · · · P(dy_n) )
= ∫_{A_1×···×A_n} Π_{i=1}^n K(B_i|y_i) · [Π_{i=1}^n (α(dy_i) + Σ_{j=1}^{i−1} δ_{y_j}(dy_i))] / [α(X)(α(X) + 1) · · · (α(X) + n − 1)].

Assuming that the df's K(x|y) have pdf's k(x|y), it follows that

ν(y|x) ∝ Π_{i=1}^n k(x_i|y_i) (α(dy_i) + Σ_{j=1}^{i−1} δ_{y_j}(dy_i)).

Some simplifications can be made to the above if one assumes that α is non-atomic.

To repeat, the posterior distribution of P given X = x is given by

Q(P ∈ A|x) = ∫ D_{α+Σ_1^n δ_{y_i}}(A) dν(y|x).

It should be noted that this posterior distribution is not a Dirichlet mixture distribution.
38
Approximations to Dirichlet distributions
One way to approximate Dirichlet distributions is to approximate the random pm given in the constructive definition (37.32) by another random pm. If these two random pm's are close to one another, then their distributions will also be close to one another.

One such choice is

P_N = Σ_{i=1}^N p_i δ_{Y_i}

where N is chosen to be large and p_N is adjusted to make Σ_{i=1}^N p_i = 1. This is the same as taking θ_N = 1.

Clearly, ||P_N − P|| ≤ Π_{i=1}^{N−1} (1 − θ_i), which goes to 0 with probability 1 as N → ∞. Thus the distribution of P_N converges weakly to Dα.

If we want more control on the actual error in this approximation, we can take a random N as follows. Given ǫ > 0, let

N_ǫ = inf{n : Π_{i=1}^{n−1} (1 − θ_i) < ǫ}.

As before, put θ_{N_ǫ} = 1 and look at the truncated P_{N_ǫ}. This random measure will satisfy

||P_{N_ǫ} − P|| < ǫ.

One can write down the distribution of N_ǫ and give bounds on the tails of its distribution.

One can use such approximations either with the prior distribution or with the posterior distribution D_{α+Σ_1^n δ_{Y_i}}, since they are both Dirichlet distributions.

What, then, is the distribution of (p_1, . . . , p_N) when we put θ_N = 1? We see that

p_1, p_2/(1 − p_1), . . . , p_{N−1}/(1 − p_1 − · · · − p_{N−2})

are i.i.d. Beta(1, α(X)). Thus the joint distribution of (p_1, . . . , p_N) is the Connor-Mosimann distribution, which was given as a homework in this class.
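The ǫ-truncation is easy to implement. Below is a minimal sketch (not from the notes); the values of α(X) and ǫ are arbitrary choices made for the illustration.

```python
# epsilon-truncation of the stick-breaking weights: generate sticks until the
# leftover mass prod_{i<N}(1 - theta_i) drops below eps, then set theta_N = 1.
import numpy as np

def truncated_weights(total_mass, eps, rng):
    """Return stick-breaking weights (p_1, ..., p_N_eps) summing to one."""
    p, leftover = [], 1.0
    while leftover >= eps:
        theta = rng.beta(1.0, total_mass)
        p.append(theta * leftover)
        leftover *= (1.0 - theta)
    p.append(leftover)            # this is the step theta_N = 1
    return np.array(p)

rng = np.random.default_rng(6)
p = truncated_weights(3.0, 1e-4, rng)
print(len(p), p.sum())            # random truncation level N_eps; weights sum to 1
```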
Another approximation to the Dirichlet distribution is given by

P*_N = Σ_{i=1}^N p*_i δ_{Y_i}

where (p*_1, . . . , p*_N) ∼ D(α(X)/N, . . . , α(X)/N) and is independent of (Y_1, . . . , Y_N), which are i.i.d. ᾱ.

Note that, conditional on (Y_1, . . . , Y_N),

Σ_{i=1}^N p*_i δ_{Y_i} ∼ D_{(α(X)/N) Σ_{i=1}^N δ_{Y_i}},

from the standard result on Dirichlet distributions that AU + (1 − A)V ∼ D_{α+β} if U ∼ Dα, V ∼ Dβ, A ∼ Beta(α(X), β(X)) and U, V, A are independent.

From the Glivenko-Cantelli theorem,

(1/N) Σ_{i=1}^N δ_{Y_i} →w ᾱ

with probability 1. From the theorems on convergence of Dirichlet distributions, it follows that, with probability 1, the conditional distribution of P*_N given (Y_1, . . . , Y_N) converges weakly to Dα, and hence the distribution of P*_N converges weakly to Dα.
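For completeness, here is a minimal sketch (not from the notes) of this second approximation; the choices ᾱ = N(0, 1), α(X) = 3 and N = 1000 are assumptions made only for the illustration.

```python
# Finite-dimensional approximation P*_N: weights (p*_1, ..., p*_N) from
# D(alpha(X)/N, ..., alpha(X)/N), independent of atoms Y_i i.i.d. alpha-bar.
import numpy as np

rng = np.random.default_rng(7)
M, N = 3.0, 1000
p_star = rng.dirichlet(np.full(N, M / N))   # finite Dirichlet weights
Y = rng.normal(size=N)                      # atoms i.i.d. alpha-bar
# P*_N((-inf, 0]) for this realisation; its mean over realisations is 0.5
print(p_star[Y <= 0.0].sum())
```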
April 8, 2008
39
Example of an inconsistent Bayes estimator
I will now present an example of an inconsistent Bayes estimator due to
Ferguson, Phadia and Tiwari.
Let I be a random variable taking the values 1, 2 with Q(I = 1) = Q(I = 2) = 1/2.
Let P1 = δβ where β is the distribution function of U, a uniform random
variable on [0, 1]. Let P2 = Dβ .
Let the data consist of X_1, . . . , X_n. Assume that, given P and I, the random variables X_1, . . . , X_n are i.i.d. P, where P ∼ P_I.

Another way to describe this model is the following. The data X_1, . . . , X_n given I = 1 are i.i.d. P with P ≡ β, and given I = 2 are i.i.d. P where P ∼ Dβ.
What is the posterior distribution of P given X1 , . . . , Xn ?
Clearly, if any two Xi's are equal, then I = 2 and hence the posterior distribution of P is D_{β+Σ_1^n δ_{X_i}}.

We will now compute the posterior distribution of P given that the Xi's are distinct. Note that

Q(I = 1, X_1, . . . , X_n, Xi's are distinct) = Q(X_1, . . . , X_n, Xi's are distinct | I = 1) Q(I = 1) = 1 · (1/2) = 1/2.

Here Q stands for the joint probability mass function of I and the pdf of X_1, . . . , X_n.
We know that, when I = 2, we have P ∼ Dβ, and X_1, . . . , X_n given P are i.i.d. P, so that the joint distribution of X_1, . . . , X_n is

Q(X_1 ∈ A_1, . . . , X_n ∈ A_n | I = 2) = (1/n!) Π_{i=1}^n [β(A_i) + Σ_{j=1}^{i−1} δ_{X_j}(A_i)].

On the event that the Xi's are distinct only the β terms contribute, and the pdf is 1/n!. Hence

Q(I = 2, X_1, . . . , X_n, Xi's are distinct) = (1/n!) · (1/2).
This leads to

Q(I = 1 | X_1, . . . , X_n, Xi's are distinct) = 1/(1 + 1/n!) = n!/(n! + 1),

Q(I = 2 | X_1, . . . , X_n, Xi's are distinct) = 1/(n! + 1).
Hence, the posterior distribution of P is given by

Q(P | X_1, . . . , X_n, Xi's are distinct)
= Q(P, I = 1 | X_1, . . . , X_n, Xi's are distinct) + Q(P, I = 2 | X_1, . . . , X_n, Xi's are distinct)
= Q(P | X_1, . . . , X_n, Xi's are distinct, I = 1) · Q(I = 1 | X_1, . . . , X_n, Xi's are distinct)
  + Q(P | X_1, . . . , X_n, Xi's are distinct, I = 2) · Q(I = 2 | X_1, . . . , X_n, Xi's are distinct)
= [n!/(n! + 1)] δβ + [1/(n! + 1)] D_{β+Σ_1^n δ_{X_i}}
→w δβ as n → ∞,

when the Xi's are distinct.
If our data X_1, X_2, . . . are i.i.d. from a continuous distribution G on [0, 1], then the above formula tells us that the posterior distribution of P goes to δβ, where β is the uniform distribution function. Thus the Bayes estimator converges to β irrespective of the common continuous distribution of X_1, . . . , X_n and is therefore not consistent.
April 10, 2008
40
Bayesian Analysis of Failure Models
Consider a system consisting of a single item that is maintained, after each failure, by an instantaneous repair or by a replacement with a new item. We will study several models for the inter-failure times X_1, X_2, . . . of such a system. Any failure model will have to specify the joint distribution of X_1, X_2, . . . .

Let F be the distribution function of the life of a new item. Such an F is a df on [0, ∞) with F(0) = 0. For a ≥ 0 define the residual life distribution function F_a as follows:

F_a(x) = [F(a + x) − F(a)] / [1 − F(a)], x > 0.
The following define several repair models which are standard in the literature. The distribution of X_n after the (n − 1)th failure describes the nature of the repair at that stage. Here are some examples.

• Minimal repair:

Q(X_n ≤ x | X_1, . . . , X_{n−1}) = F_{X_{n−1}}(x), x > 0.

• Perfect repair:

Q(X_n ≤ x | X_1, . . . , X_{n−1}) = F(x) ≡ F_0(x), x > 0.
• Brown-Proschan model of repair:

A number p in (0, 1) is fixed in advance. With probability p,

Q(X_n ≤ x | X_1, . . . , X_{n−1}) = F_0(x),

and with probability 1 − p,

Q(X_n ≤ x | X_1, . . . , X_{n−1}) = F_{X_{n−1}}(x), x > 0.

This can be restated as follows. Define i.i.d. uniform random variables U_1, U_2, . . . in advance. Then

Q(X_n ≤ x | X_1, U_1, . . . , X_{n−1}, U_{n−1}) = F_0(x) if U_{n−1} ≤ p, and = F_{X_{n−1}}(x) if U_{n−1} > p;

that is, it equals F_{ǫ_{n−1}}(x), where ǫ_{n−1} = 0 if U_{n−1} ≤ p and ǫ_{n−1} = X_{n−1} if U_{n−1} > p. (A small simulation sketch of this model is given after this list.)
• The Block-Borges-Savits model:

Fix a function p(t) : [0, ∞) → (0, 1). Define

Q(X_n ≤ x | X_1, U_1, . . . , X_{n−1}, U_{n−1}) = F_0(x) if U_{n−1} ≤ p(X_{n−1}), and = F_{X_{n−1}}(x) if U_{n−1} > p(X_{n−1});

that is, it equals F_{ǫ_{n−1}}(x), where ǫ_{n−1} = 0 if U_{n−1} ≤ p(X_{n−1}) and ǫ_{n−1} = X_{n−1} if U_{n−1} > p(X_{n−1}).
• The Kijima models:

Together with the inter-failure times X_1, X_2, . . . we also have random variables D_1, D_2, . . . which represent the degree of repair, and

Q(X_n ≤ x | X_1, D_1, . . . , X_{n−1}, D_{n−1}) = F_{ǫ_{n−1}}(x),

where ǫ_{n−1} is a function of X_1, . . . , X_{n−1}, D_1, . . . , D_{n−1}. The two Kijima models are

– Kijima Model I: ǫ_{n−1} = Σ_{i=1}^{n−1} D_i X_i, and

– Kijima Model II: ǫ_{n−1} = Σ_{i=1}^{n−1} (Π_{j=i}^{n−1} D_j) X_i.
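Here is the promised simulation sketch for the Brown-Proschan model (not from the notes); the life distribution F (a Weibull with shape 2), p = 0.3 and the number of failures are assumptions made only for the illustration.

```python
# Simulate inter-failure times X_1, X_2, ... from the Brown-Proschan model:
# X_n ~ F_{eps_{n-1}}, where eps_{n-1} = 0 (perfect repair) with prob. p and
# eps_{n-1} = X_{n-1} (minimal repair) otherwise.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
F = stats.weibull_min(2.0)          # life distribution of a new item
p, n_failures = 0.3, 10

def residual_draw(age, rng):
    """Draw from the residual life df F_age by inversion."""
    u = rng.uniform()
    return F.ppf(F.cdf(age) + u * (1.0 - F.cdf(age))) - age

X, eps = [], 0.0                    # eps plays the role of eps_{n-1}
for n in range(n_failures):
    X.append(residual_draw(eps, rng))
    eps = 0.0 if rng.uniform() <= p else X[-1]

print(np.round(X, 3))
```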
We will now show how all these repair models can be described in a single general repair model. Let (X, A) be our basic space. In the above examples with life distributions, the space X was equal to (0, ∞). Let P be a pm on (X, A). Define the joint distribution of the dependent random variables (inter-failure times) X_1, X_2, . . . together with some environmental variables (or covariates) U_1, U_2, . . . by

X_1 ∼ P,
L(X_2 | X_1) = P_{A_1},
. . .
L(X_n | X_1, . . . , X_{n−1}, U_1, . . . , U_{n−1}) = P_{A_{n−1}},

where A_{n−1} is a set in X that depends (measurably) on (X_1, . . . , X_{n−1}, U_1, . . . , U_{n−1}); i.e., Q(X_n ∈ A | X_1, . . . , X_{n−1}, U_1, . . . , U_{n−1}) should be well defined for A ∈ A. When there are no covariates U_1, U_2, . . . , this is accomplished if A_{n−1} is an (x_1, . . . , x_{n−1})-section of a set in A^n.

In the above we have used the definition of the restricted measure P_A, defined by

P_A(B) = P(AB)/P(A), B ∈ A.
Our goal is to estimate P , given X1 , . . . , Xn .
41
Bayesian methods with covariates
A Bayesian will consider the covariates to be constants if their distribution
is independent of the parameter of interest, namely P . To make this idea
more precise, let us examine the following two models.
Theorem 41.1. The data consist of (X, Y) and their distribution depends on two parameters (θ, δ).

• First Model. In this model, θ and δ are independent, and

θ ∼ π_1,  δ ∼ π_2,  Y | δ, θ ∼ g(y|δ),  X | Y, δ, θ ∼ k(x|y, θ).

• Second Model. In this model,

θ ∼ π_1,  Y is fixed,  X | Y, θ ∼ k(x|y, θ).

In both these models the posterior distribution of θ given the data (X, Y) is π_1(θ|x, y) ∝ π_1(θ) k(x|y, θ), which is what one gets if one assumes that Y is a constant.
Proof: In the first model, the joint density of (X, Y, θ, δ) is

q(x, y, δ, θ) = π_1(θ) π_2(δ) g(y|δ) k(x|y, θ),

and the posterior density of θ given (X, Y) = (x, y) is obtained by integrating out δ and retaining only the factors that depend on θ. If we do this we get

π_1(θ|x, y) ∝ π_1(θ) k(x|y, θ).

In the second model, the joint density is

q(x, θ) = π_1(θ) k(x|y, θ),

and the posterior density of θ given X is also

π_1(θ|x) ∝ π_1(θ) k(x|y, θ).

This completes the proof.
42
Partition based (PB) distributions
Let B = (B_1, . . . , B_k) be a measurable partition of X. Then we can write

P(·) = Σ_{i=1}^k P(B_i) P_{B_i}(·).    (42.33)

Define the vector P(B) = (P(B_1), . . . , P(B_k)), which varies in the simplex of R^k. There is a one-to-one correspondence

P ↔ (P(B), P_{B_1}, . . . , P_{B_k})

in view of (42.33).

We can denote the random vector P(B) as (Y_1, . . . , Y_k) and assume that it has a pdf c·h(y_1, . . . , y_k), where c is a constant, with respect to the Lebesgue measure on the simplex of R^k. Assume that P(B), P_{B_1}, . . . , P_{B_k} are independent. Further assume that the random pm's P_{B_1}, . . . , P_{B_k} have distributions G_1, . . . , G_k, respectively. This gives a distribution for P which we define to be a Partition Based (PB) distribution and denote by

H(B, h, G_1 × · · · × G_k).

Theorem 42.1. The Dirichlet measure Dα is a partition based (PB) distribution with respect to every partition B. In fact,

Dα = H(B, h, G_1 × · · · × G_k)

with h(y_1, . . . , y_k) = Π_{i=1}^k y_i^{α(B_i)−1} and G_i = D_{α_{B_i}}, provided α(B_i) > 0 for i = 1, . . . , k.

We have already proved this theorem as a homework.
Recall: If P ∼ Dα, then (P(A), P(A^c)), P_A and P_{A^c} are independent.

We define the Partition Based Dirichlet (PBD) distribution D(B, h, α) by requiring the distribution of (P(B), P_{B_1}, . . . , P_{B_k}) to be the PB distribution H(B, h, G_1 × · · · × G_k) with

G_i = D_{α_{B_i}}, i = 1, . . . , k,

where α_B(A) = α(A ∩ B). The choice h(y_1, . . . , y_k) = Π_{i=1}^k y_i^{α(B_i)−1} recovers Dα itself. We will show in the next class that if P has the PBD distribution D(B, h, α), then it is also a PBD distribution on any sub-partition of B.
April 15, 2008
43
Partition based distributions
In the last class we defined a large class of probability measures for P, called Partition Based (PB) priors and denoted by H(B, h, G) where G = G_1 × · · · × G_m, as follows.

Definition 43.1. Under H(B, h, G),

(a) the real vector P(B) and the m restricted pm's P_{B_1}, P_{B_2}, . . . , P_{B_m} are independently distributed,

(b) P(B) has the pdf c · h(y_1, . . . , y_m) for some normalizing constant c, and

(c) P_{B_r} has distribution G_r, r = 1, 2, . . . , m.

Here and later, we will not worry about the non-uniqueness of h arising from a multiplicative constant.

The Dirichlet distribution Dα is the PB distribution H(B, h, G) with h(y) = Π_{i=1}^m y_i^{α(B_i)−1} and G = D_{α_{B_1}} × · · · × D_{α_{B_m}}.

A PB distribution H(B, h, G) is called the Partition Based Dirichlet (PBD) distribution D(B, h, α) if G = D_{α_{B_1}} × · · · × D_{α_{B_m}}. These are more general than Dirichlet distributions.
Let B∗ be a sub-partition of B. A PBD prior D(B, h, α) can also be
represented as the PBD prior D(B∗ , h∗ , α) with an explicit expression of h∗ ,
from properties of the Dirichlet measure, as can be seen from the following
theorem.
Theorem 43.1. Let B = (B_1, . . . , B_m) be a partition and consider the PBD distribution D(B, h, α), where h is a pdf on the simplex of R^m and α is a measure on (X, A) with α(B_r) > 0 for r = 1, . . . , m. Split the set B_r as B_{r1} ∪ B_{r2} with α(B_{r1}) > 0, α(B_{r2}) > 0. Denote the partition (B_1, . . . , B_{r−1}, B_{r1}, B_{r2}, B_{r+1}, . . . , B_m) by B∗. Let

h∗(y_1, . . . , y_{r−1}, y_{r1}, y_{r2}, y_{r+1}, . . . , y_m) = c∗ · h(y_1, . . . , y_{r−1}, y_r, y_{r+1}, . . . , y_m) · [y_{r1}^{α(B_{r1})−1} y_{r2}^{α(B_{r2})−1} / y_r^{α(B_r)−1}]    (43.34)

where y_r = y_{r1} + y_{r2} and c∗ is a normalizing constant. Then

D(B, h, α) = D(B∗, h∗, α).    (43.35)

Proof: We can write

P = Σ_{1≤i≤m} P(B_i) P_{B_i} = Σ_{1≤i≤m, i≠r} P(B_i) P_{B_i} + P(B_{r1}) P_{B_{r1}} + P(B_{r2}) P_{B_{r2}}.

Since P ∼ D(B, h, α), the vector P(B) has pdf proportional to h(y_1, . . . , y_m) and is independent of P_{B_1}, . . . , P_{B_m}, which are themselves independently distributed as D_{α_{B_1}}, . . . , D_{α_{B_m}}. Note that (P_{B_r})_{B_{rj}} = P_{B_{rj}} and (α_{B_r})_{B_{rj}} = α_{B_{rj}}, j = 1, 2.

Since P_{B_r} ∼ D_{α_{B_r}}, it follows that (P(B_{r1})/P(B_r), P(B_{r2})/P(B_r)) has pdf proportional to

y_{r1}^{α(B_{r1})−1} y_{r2}^{α(B_{r2})−1} / y_r^{α(B_r)−1}

with y_r = y_{r1} + y_{r2}. This vector is also independent of P_{B_{r1}}, P_{B_{r2}}, which are themselves independent with distributions D_{α_{B_{r1}}}, D_{α_{B_{r2}}}, respectively. This means that P(B∗) and (P_{B_j}, B_j ∈ B∗) have the requisite distributions to say that P ∼ D(B∗, h∗, α), where h∗ is as defined in (43.34).
44
Posterior distributions under PB priors
when the data is from the general repair
model
The theorem below shows that when the prior is a PB distribution, the
posterior distribution is also PB, in the general repair model.
Theorem 44.1. Let B = (B_1, . . . , B_m) be a partition and let G = G_1 × · · · × G_m, where G_r is the distribution of a random pm such that G_r({P : P(B_r) = 1}) = 1, r = 1, . . . , m. Let the random pm P have distribution H(B, h, G). Let A be a union of some sets in the partition B; in particular, let A = ∪_{s∈E} B_s where E is a subset of indices in {1, 2, . . . , m}. Let the random variable Y be such that its distribution given P is P_A, i.e. Y|P ∼ P_A. Let r = r(Y) be the random index in E such that Y ∈ B_r. Then the posterior distribution of P given Y can be described as follows:

1. The random vector P(B) and the random pm's P_{B_1}, . . . , P_{B_m} are independently distributed.

2. The density of P(B) is c · h^Y(y_1, . . . , y_m), where h^Y(y_1, . . . , y_m) = h(y_1, . . . , y_m) · y_r / (Σ_{s∈E} y_s) and c is a normalizing constant.

3. For t ≠ r the distribution of P_{B_t} is G_t, which is unchanged from the prior distribution.

4. The distribution of P_{B_r} is G_r^Y (the posterior distribution in a standard nonparametrics problem).

This result may be succinctly stated as:

The posterior distribution is H^Y(B, h, G) = H(B, h^Y, G^Y),    (44.36)

where G^Y := G_1 × · · · × G_{r−1} × G_r^Y × G_{r+1} × · · · × G_m.

Proof: Let Q be the joint distribution of (P(B), (P_{B_t}, t = 1, . . . , m), Y, r(Y)). Notice that

Q(r(Y) = j | P) = P(B_j) / Σ_{t∈E} P(B_t)   and   Q(Y ∈ C | P, r(Y) = j) = P_{B_j}(C).

This immediately shows that, given Y and r(Y) = j, the random vector P(B) is independent of P_{B_1}, . . . , P_{B_m} and has pdf proportional to h(y) · y_j / (Σ_{t∈E} y_t). Also, since the distribution of Y given P and r(Y) = j is P_{B_j}, the distribution of P_{B_j} given Y and r(Y) = j is G_j^Y, and the distribution of P_{B_t}, t ≠ j, given Y and r(Y) = j is unchanged. Thus the distribution of P given Y is H(B, h^Y, G^Y), as stated in (44.36).
In Theorem 44.2 below, we show that when the prior is a PBD distribution the posterior is also a PBD distribution, whether or not the restriction set A is a union of sets in the partition B.

Theorem 44.2. Let P have a PBD distribution D(B, h, α). Let the random variable Y be such that its distribution given P is P_A, i.e. Y|P ∼ P_A. Let B∗ = {B∗_1, . . . , B∗_{m∗}} be the partition generated by B and A.

The distribution of P can be written as D(B∗, h∗, α) for some pdf h∗ on the simplex in R^{m∗}, in view of Theorem 43.1.

Now A is a union of some sets B∗_s in the partition B∗. Thus there is a subset E∗ ⊂ {1, 2, . . . , m∗} such that A = ∪_{s∈E∗} B∗_s.

Let r = r(Y) denote the random index in E∗ such that Y ∈ B∗_r. Then the posterior distribution of P given Y can be written as D(B∗, h∗^Y, α + δ_Y), where

h∗^Y(y) = c∗ · h∗(y) · y_r / (Σ_{s∈E∗} y_s).    (44.37)

Proof: This follows immediately from Theorem 44.1. We have implicitly assumed that α(B∗_t) > 0 for all t. Otherwise, we just have to remove the sets B∗_t with α(B∗_t) = 0 from the partition B∗.
Theorem 44.1 obtains, under a PB prior, the posterior distribution in the general repair model when the restriction set A is a union of sets in the partition B. When A is not necessarily a union of sets in the partition B, Theorem 44.2 shows how to obtain, under a PBD prior, the posterior distribution by enlarging the partition B with the restriction set A to the larger partition B∗, which ensures that A is a union of sets in B∗. This step is valid because the restriction set A is non-random. The same step can be justified, under PBD priors, even if we enlarge the partition with still another set C that depends on the observation Y; this is described in Theorem 44.3.

Theorem 44.3. Let P ∼ D(B, h, α). Let the conditional distribution of Y given P be P_A, where we can assume that A = ∪_{i=1}^r B_i, without loss of generality, in view of Theorem 43.1. Let r(Y) = j if Y ∈ B_j. The posterior distribution of P given Y, r(Y) = j is

D(B, h^Y, α + δ_Y)    (44.38)

where

h^Y(y) = [y_j / Σ_{i=1}^r y_i] h(y),    (44.39)

from Theorem 44.2.

Suppose that C is a set that depends on Y in a measurable way, i.e. {(x, y) : x ∈ C(y)} ∈ A^2. Consider the enlarged partition B∗ = (B_{i1} = B_i ∩ C, B_{i2} = B_i ∩ C^c, 1 ≤ i ≤ m). After Y has been observed, C is a non-random set, and hence we can use Theorem 43.1 to rewrite the posterior distribution in (44.38) and (44.39) as

D(B∗, h∗^Y, α + δ_Y)    (44.40)

where

h∗^Y(y∗) = [y_j / Σ_{i=1}^r y_i] · h(y) · Π_{i=1}^m [y_{i1}^{α(B_{i1})−1} y_{i2}^{α(B_{i2})−1} / y_i^{α(B_i)−1}]    (44.41)

and y_{i1} + y_{i2} = y_i, 1 ≤ i ≤ m.

This same answer is also obtained if we incorrectly express the prior D(B, h, α) (by an invalid application of Theorem 43.1) as

D(B∗, h∗, α)    (44.42)

with

h∗(y∗) = Π_{i=1}^m [y_{i1}^{α(B_{i1})−1} y_{i2}^{α(B_{i2})−1} / y_i^{α(B_i)−1}] · h(y)    (44.43)

and formally calculate the posterior distribution using Theorem 44.2.

Proof: Note that {r(Y) = j} = {Y ∈ B_j} = {Y ∈ B_{j1}} ∪ {Y ∈ B_{j2}}, 1 ≤ j ≤ r. Using the (unjustified) distribution in (44.42) and (44.43) as the prior, a blind application of Theorem 44.2 leads us to the following as the posterior distribution:

D(B∗, h∗∗, α + δ_Y)    (44.44)

with

h∗∗(y∗) = [(y_{j1} + y_{j2}) / Σ_{i=1}^r (y_{i1} + y_{i2})] · Π_{i=1}^m [y_{i1}^{α(B_{i1})−1} y_{i2}^{α(B_{i2})−1} / y_i^{α(B_i)−1}] · h(y),    (44.45)

which is the same as (44.40) and (44.41).
April 17, 2008
45
Example illustrating Theorem 44.3
We will illustrate Theorem 44.3 with a simple example.
Let P ∼ Dα and X|P ∼ P. Let α have a positive non-atomic part and let A(x) = (x − ǫ, x + ǫ), so that α(A(x)) > 0 for each x.

We know that the distribution of P given X is D_{α+δ_X}. We will obtain this answer by using the steps outlined in Theorem 44.3.

Consider the random set A = A(X) and the partition B = (A, A^c). By a blind application of Theorem 43.1 one can write the prior as a PBD distribution on the partition B as follows:

Dα = D(B, h(y_1, y_2), α)    (45.46)

where h(y_1, y_2) = y_1^{α(A)−1} y_2^{α(A^c)−1}.

Now X|P ∼ P = P_X, and r(X) = 1 since X ∈ A. From another blind application of Theorem 44.2, the posterior distribution of P given X is

D(B, h^X, α + δ_X)

where

h^X = h(y_1, y_2) · y_1/(y_1 + y_2) = y_1^{α(A)} y_2^{α(A^c)−1} = y_1^{α(A)+δ_X(A)−1} y_2^{α(A^c)+δ_X(A^c)−1}.

From the standard properties of the Dirichlet distribution, the above PBD distribution is equal to D_{α+δ_X}, which is the correct answer.
46
Bayesian methods with censored observations
Let P have some prior distribution. Let X be a sample from P, i.e. X|P ∼ P. Let A be a measurable subset, which can depend on previous observations and covariates, if any. Then, when X is censored by the set A, we just observe the random variable Y defined by

Y = X if X ∈ A^c,   Y = θ if X ∈ A.    (46.47)

The question is: what is the posterior distribution of P given the censored value Y?

Theorem 46.1. Let the prior distribution of P be the PB distribution H(B, h, G), where B = (B_1, . . . , B_k), h is the pdf of P(B) and G is the joint distribution of (P_{B_1}, . . . , P_{B_k}). Suppose that the censoring set A satisfies

A = ∪_{i=1}^r B_i.

Let X|P ∼ P and let Y, as defined in (46.47), be the censored value. Then the posterior distribution of P given Y = θ is

H(B, h^Y, G)    (46.48)

where h^Y(y) = y_A h(y) with y_A = Σ_{i=1}^r y_i.

Proof: If Y ∈ A^c, then Y = X and P|Y ∼ D_{α+δ_Y}. Suppose now that Y = θ, i.e. X ∈ A. We have

Q((P(B_1), . . . , P(B_k)) ∈ C, (P_{B_1}, . . . , P_{B_k}) ∈ D, X ∈ A) = ∫_{y∈C} y_A h(y) dy · ∫_D G(dP_{B_1}, . . . , dP_{B_k}).

Therefore

L((P_{B_1}, . . . , P_{B_k}) | X ∈ A) = G

and

L((P(B_1), . . . , P(B_k)) | X ∈ A) has pdf proportional to y_A h(y),

and the two remain independent. This completes the proof of this theorem.
The following theorem, stated without proof, is immediate.

Theorem 46.2. Let P ∼ H(B, h, G). Let A_1 ∈ A. For n = 2, 3, . . . , let A_n depend measurably on (X_1, . . . , X_{n−1}), i.e.

{(x_1, . . . , x_n) : x_n ∈ A_n(x_1, . . . , x_{n−1})} ∈ A^n.

Let X_1, X_2, . . . , X_n be such that

X_1|P ∼ P,   Y_1 = θ_1 if X_1 ∈ A_1,
X_2|P, X_1 ∼ P,   Y_2 = θ_2 if X_2 ∈ A_2,
X_3|P, X_2, X_1 ∼ P,   Y_3 = θ_3 if X_3 ∈ A_3,
. . .

For each i assume that A_i is a union of sets in the partition B. Then

P|(Y_1 = θ_1, . . . , Y_n = θ_n) ∼ H(B, h∗, G)

where h∗ = h · y_{A_1} · · · y_{A_n}.

Note that, in the above two theorems, we have slightly extended the definition of PB priors from the previous lecture, by allowing P_{B_1}, . . . , P_{B_k} to be dependent.
47
Application
We will now apply the above results to the estimation of the distribution function in the standard censoring problem. The Kaplan-Meier estimate is the frequentist estimate in this problem. Susarla and Van Ryzin gave the Bayes estimate under a Dirichlet prior.

Let P ∼ Dα and X_1, . . . , X_n|P be i.i.d. P. Suppose that X_1, . . . , X_r are uncensored and therefore observed, and let the rest be censored as follows: X_{r+1} ∈ A_{r+1}, . . . , X_n ∈ A_n, where A_{r+1}, . . . , A_n can depend on previous observations or independent covariates.

We will now obtain the posterior distribution of P given such data.

Consider the partition B defined based on A_{r+1}, . . . , A_n and view Dα as the PBD distribution D(B, h, α). An extension of Theorem 46.1 on the lines of Theorem 44.3 is valid, and we can use the above PBD distribution (though it is based on sets depending on the sample) as a prior and blindly apply Theorem 46.1 to obtain the posterior distribution as

D(B, h∗, α + Σ_{i=1}^r δ_{X_i})

where h∗ = h · y_{A_{r+1}} · · · y_{A_n}.

The Bayes estimate of P(A) is just the expectation of P(A) under the above PBD posterior distribution. Since P(A) = Σ_i P(B_i) P_{B_i}(A), this is calculated as

P̂(A) = E(P(A)) = Σ_i E(y_i) α(A ∩ B_i)/α(B_i)

where E(y_i) is computed under the posterior density of P(B), i.e.

E(y_i) = ∫ y_i h(y) y_{A_{r+1}} · · · y_{A_n} dy / ∫ h(y) y_{A_{r+1}} · · · y_{A_n} dy = E(Z_i Z_{A_{r+1}} · · · Z_{A_n}) / E(Z_{A_{r+1}} · · · Z_{A_n}),

where

(Z_1, . . . , Z_k) ∼ D(α(B_1), . . . , α(B_k)),

which can be generated as Z_i = w_i / Σ_j w_j with w_1, . . . , w_k independent Gamma random variables with shape parameters α(B_1), . . . , α(B_k), respectively.

Otherwise, one can generate independent vectors (Z_1^{(j)}, . . . , Z_k^{(j)}), j = 1, . . . , N, as above and use the Law of Large Numbers (LLN) to approximate E(y_i) by

Σ_j Z_i^{(j)} Z_{A_{r+1}}^{(j)} · · · Z_{A_n}^{(j)} / Σ_j Z_{A_{r+1}}^{(j)} · · · Z_{A_n}^{(j)}.

In the right-censoring example of Susarla and Van Ryzin, the sets A_j, j = r + 1, . . . , n, are intervals of the form (x, ∞). Then the pdf h∗ of the posterior distribution is the pdf of a Connor-Mosimann distribution, and quantities like E(y_i) can be evaluated in closed form. The Bayes estimate of P((x, ∞)) is the same as the Susarla-Van Ryzin estimate.
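The LLN approximation above is straightforward to implement. Below is a minimal sketch (not the authors' code, which is not given in the notes); the Dirichlet parameters and the censoring sets used are hypothetical placeholders, not the Susarla-Van Ryzin numbers.

```python
# Approximate E(y_i) under the density proportional to h(y) * prod_j y_{A_j},
# where h is the finite-dimensional Dirichlet D(alpha(B_1), ..., alpha(B_k))
# density, by the ratio of Monte Carlo averages with Gamma-generated Dirichlet
# vectors (Z_1, ..., Z_k).
import numpy as np

def lln_posterior_weights(alpha_cells, censor_cells, n_draws, rng):
    """censor_cells[j] lists the partition-cell indices making up the set A_j."""
    k = len(alpha_cells)
    w = rng.gamma(shape=np.asarray(alpha_cells), size=(n_draws, k))
    Z = w / w.sum(axis=1, keepdims=True)          # (Z_1, ..., Z_k) ~ Dirichlet
    weight = np.ones(n_draws)
    for cells in censor_cells:                    # prod_j Z_{A_j}
        weight *= Z[:, cells].sum(axis=1)
    return (Z * weight[:, None]).sum(axis=0) / weight.sum()

rng = np.random.default_rng(9)
alpha_cells = [0.7, 1.3, 2.1, 0.9, 3.0]           # placeholder alpha(B_i)
censor_cells = [[1, 2, 3, 4], [3, 4]]             # placeholder censoring sets
print(np.round(lln_posterior_weights(alpha_cells, censor_cells, 200000, rng), 4))
```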
48
Reexamining the Susarla-Van Ryzin example
Susarla and Van Ryzin considered the data set

0.8, 1.0+, 2.7+, 3.1, 5.4, 7.0+, 9.2, 12.1+,

where a+ denotes that the censoring set was [a, ∞) and that the potential observation was censored.

They used a Dirichlet prior for P with parameter α, which was 8 times the exponential distribution with failure rate 0.12.

The posterior distribution given the uncensored observations is Dirichlet with parameter α∗ = α + δ_{0.8} + δ_{3.1} + δ_{5.4} + δ_{9.2}. We can take this to be the prior distribution and say that the remaining data 1.0+, 2.7+, 7.0+, 12.1+ are all censored.

The partition formed by the censoring sets is

B = (B_1, . . . , B_5) = ([0, a_1), [a_1, a_2), . . . , [a_4, ∞)) = ([0, 1.0), [1.0, 2.7), [2.7, 7.0), [7.0, 12.1), [12.1, ∞)).

The posterior distribution given all the data is therefore

D(B, h(y) Y_2 Y_3 Y_4 Y_5, α∗)

where Y_j = y_j + · · · + y_5, j = 2, . . . , 5, are the tail sums of the y's and

h(y) ∝ Π_{j=1}^5 y_j^{α∗(B_j)−1}.
We use our method to recalculate the estimates of F(t) based on the Susarla-Van Ryzin data. When the transformation to independent Beta variables is used, we get the same results as Susarla and Van Ryzin. However, this depends on the fact that all the censoring was right censoring.

We also use the LLN method to compute estimates of F(t). One can also employ the MCMC method outlined earlier to obtain these Bayes estimates. The table below compares these methods and also gives the Kaplan-Meier estimator (KME).

Estimates F̂(t):

t        0.80    1.00+   2.70+   3.10    5.40    7.00+   9.20    12.10+
Exact    0.1083  0.1190  0.2071  0.3006  0.4719  0.5256  0.6823  0.7501
LLN      0.1082  0.1189  0.2068  0.3000  0.4708  0.5244  0.6810  0.7487
MCMC     0.1083  0.1190  0.2084  0.3011  0.4706  0.5261  0.6802  0.7476
KME      0.1250  0.1450  0.1250  0.3000  0.4750  0.4750  0.7375  0.7375
49
Bayes estimates with left and right censoring
Suppose that we write the data from the example of Susarla-Van Ryzin as

0.8−, 1.0+, 2.7+, 3.1−, 5.4−, 7.0+, 9.2−, 12.1+,

where a− indicates that a potential observation was censored to the interval [0, a) (i.e. left censored).

The data generate a partition B = (B_1, . . . , B_9) of 9 intervals.

If we use a Dirichlet prior Dα, it can also be viewed as a PB Dirichlet distribution on this partition. The prior distribution of P(B) = (P(B_1), . . . , P(B_9)) is the finite dimensional Dirichlet with pdf proportional to

h(y) = Π_{i=1}^9 y_i^{α(B_i)−1}.

The posterior distribution of P is the PB Dirichlet distribution

D(B, h(y) Z_1 Y_3 Y_4 Z_4 Z_5 Y_7 Z_7 Y_9, α)

where

Z_j = Σ_{i=1}^j y_i,   Y_j = Σ_{i=j}^9 y_i.

This also illustrates how to handle any kind of censoring in an arbitrary space.

A closed form expression for E(y_j) under the posterior distribution is not available. However, it is just

E_h(y_j Z_1 Y_3 Y_4 Z_4 Z_5 Y_7 Z_7 Y_9) / E_h(Z_1 Y_3 Y_4 Z_4 Z_5 Y_7 Z_7 Y_9),

where E_h denotes expectation under the finite dimensional Dirichlet distribution D(α(B_1), . . . , α(B_9)).

We can generate samples (y_1^r, . . . , y_9^r), r = 1, . . . , N, from h(y) by using independent Gamma random variables and approximate the above by

Σ_{r=1}^N y_j^r Z_1^r Y_3^r Y_4^r Z_4^r Z_5^r Y_7^r Z_7^r Y_9^r / Σ_{r=1}^N Z_1^r Y_3^r Y_4^r Z_4^r Z_5^r Y_7^r Z_7^r Y_9^r.

This method is justified by the Law of Large Numbers, and good rates of convergence are well known.

It does not use imputation or MCMC methods, which add another level of randomness to the final answer.

This method will handle any kind of censoring and is applicable to distributions in multidimensional spaces.

Of course, we can estimate many features other than just the distribution function, e.g. the median, quantiles, etc.

We used the same Dirichlet prior for P, namely the Dirichlet prior with parameter α, which was 8 times the exponential distribution with failure rate 0.12. Here is the Bayes estimate of F(t) under this two-way censoring.

Estimates F̂(t):

t     0.80−   1.00+   2.70+   3.10−   5.40−   7.00+   9.20−   12.10+
LLN   0.1737  0.1937  0.3625  0.3968  0.5274  0.5990  0.6796  0.7721
The figure below gives the Bayes estimates of the distribution function
under right censoring, under both kinds of censoring and the prior mean of
the distribution function.
[Figure: Estimates of the df from censored (all types) data, based on PB priors. The plot shows F(t) versus t for the estimate under two-way censoring, the estimate under right censoring, and the prior df.]

50
Revisiting Bayes estimation for repair models
We will now give an example of Bayes estimation of the distribution (survival) function when the data come from a repair model. The following data consist of inter-failure times, with observations from a χ²-distribution with 5 degrees of freedom and some maintenance schedule. When repairs are made it is minimal repair, indicated by a 0 in the column on the left, and at random times replacements (perfect repair) are made, indicated by a 1.
new/repair   age
1            5.0093
0            8.9197
1            9.2638
0            12.7893
0            14.6230
0            16.9651
0            19.4155
1            2.4406
0            7.7570
0            8.2593
0            9.8226
0            11.7806
0            13.2864
Based on the order statistics of these inter-failure times (ages), one can partition (0, ∞) into 15 intervals, and rearrange them as follows.

new/repair   age        partition no.   start partition no. (or rank)
1            5.0093     2               1
0            8.9197     5               3
1            9.2638     6               1
0            12.7893    9               7
0            14.6230    11              10
0            16.9651    12              12
0            19.4155    13              13
1            2.4406     1               1
0            7.7570     3               2
0            8.2593     4               4
0            9.8226     7               5
0            11.7806    8               8
0            13.2864    10              9
The last two columns indicate the partition number into which the age falls, and the partition number of the first partition into which it could have fallen. Thus, if it was after a perfect repair, the age could have fallen in any partition starting from the first partition. If it was a minimal repair, the age can fall in any partition starting from the one containing the age at the previous observation.

The following figure displays the same information.
We used a Dirichlet prior with parameter equal to 2 times a χ²-distribution with 2 degrees of freedom and calculated the Bayes estimate of the survival function. The Bayes estimate, the frequentist estimate of Whitaker and Samaniego, and the true survival function are given in the following figure.
April 22, 2008
51
Students and the papers presented - I
Student        Paper title                                                                       Authors
Laura Taylor   Bayesian nonparametric estimation in a series system or a competing risks model   Salinas-Torres, Pereira and Tiwari
Alex McLain    On choosing the centering distribution in Dirichlet process mixture models        Hanson, Sethuraman and Xu
April 24, 2008
52
Students and the papers presented - II
Student     Paper title                                          Authors
Na Yang     Gibbs sampling methods for stick-breaking priors     James and Ishwaran
Shuang Li   Gibbs sampling methods for stick-breaking priors     James and Ishwaran
Peng Chen   Variational methods for the Dirichlet process        Blei and Jordan