Bayesian Nonparametric Modelling with the Dirichlet
Process Regression Smoother - Supplementary Material
J.E. Griffin and M. F. J. Steel
University of Kent and University of Warwick
Appendix A: Proofs
Proof of Proposition 1
Let $Z \sim H$. Then
$$E\left[\mu_x^{(k)}\right] = E\left[\sum_{i=1}^{\infty} p_i(x)\theta_i^k\right] = E[Z^k]\sum_{i=1}^{\infty} E[p_i(x)] = E[Z^k].$$
Similarly, $E\left[\mu_y^{(k)}\right] = E[Z^k]$.
$$\begin{aligned}
E\left[\mu_x^{(k)}\mu_y^{(k)}\right] &= E\left[\left(\sum_{i=1}^{\infty} p_i(x)\theta_i^k\right)\left(\sum_{i=1}^{\infty} p_i(y)\theta_i^k\right)\right]\\
&= \sum_{i=1}^{\infty} E[p_i(x)p_i(y)]\,E\left[\theta_i^{2k}\right] + \sum_{i=1}^{\infty}\sum_{j=1;\, j\neq i}^{\infty} E[p_i(x)p_j(y)]\,E\left[\theta_i^k\right]E\left[\theta_j^k\right]\\
&= E\left[Z^{2k}\right]\sum_{i=1}^{\infty} E[p_i(x)p_i(y)] + E\left[Z^k\right]^2\left(1 - \sum_{i=1}^{\infty} E[p_i(x)p_i(y)]\right)\\
&= E\left[Z^k\right]^2 + \operatorname{Var}\left[Z^k\right]\sum_{i=1}^{\infty} E[p_i(x)p_i(y)],
\end{aligned}$$
so that
$$\operatorname{Cov}\left(\mu_x^{(k)}, \mu_y^{(k)}\right) = \operatorname{Var}\left[Z^k\right]\sum_{i=1}^{\infty} E[p_i(x)p_i(y)]$$
and so
$$\operatorname{Corr}\left(\mu_x^{(k)}, \mu_y^{(k)}\right) = \frac{\sum_{i=1}^{\infty} E[p_i(x)p_i(y)]}{\sum_{i=1}^{\infty} E\left[p_i(x)^2\right]},$$
which follows from the stationarity of $p_1(x), p_2(x), p_3(x), \dots$
Proof of Theorem 1
It is easy to show (see GS) that for any measurable set $B$
$$\operatorname{Corr}(F_x(B), F_y(B)) = (M+1)\sum_{i=1}^{\infty} E[p_i(x)p_i(y)].$$
In this case, if $x \notin S(\phi_i)$ or $y \notin S(\phi_i)$ then $E[p_i(x)p_i(y)] = 0$. Otherwise, let $R_i^{(1)} = \{j < i \mid x \in S(\phi_j) \text{ and } y \in S(\phi_j)\}$, $R_i^{(2)} = \{j < i \mid x \in S(\phi_j) \text{ and } y \notin S(\phi_j)\}$ and $R_i^{(3)} = \{j < i \mid x \notin S(\phi_j) \text{ and } y \in S(\phi_j)\}$. Then
$$\begin{aligned}
E[p_i(x)p_i(y)] &= E\left[V_i^2 \prod_{j \in R_i^{(1)}} (1-V_j)^2 \prod_{j \in R_i^{(2)}} (1-V_j) \prod_{j \in R_i^{(3)}} (1-V_j)\right]\\
&= \frac{2}{(M+1)(M+2)}\, E\left[\left(\frac{M}{M+2}\right)^{\#R_i^{(1)}} \left(\frac{M}{M+1}\right)^{\#R_i^{(2)}} \left(\frac{M}{M+1}\right)^{\#R_i^{(3)}}\right]
\end{aligned}$$
and so
$$\operatorname{Corr}(F_x(B), F_y(B)) = \frac{2}{M+2}\, E\left[\sum_{i=1}^{\infty} B_i \left(\frac{M}{M+1}\right)^{\sum_{j=1}^{i-1} A_j} \left(\frac{M+1}{M+2}\right)^{\sum_{j=1}^{i-1} B_j}\right],$$
where $A_j = I(x \in S(\phi_j) \text{ or } y \in S(\phi_j))$ and $B_j = I(x \in S(\phi_j) \text{ and } y \in S(\phi_j))$, using $\frac{M}{M+2} = \frac{M}{M+1} \cdot \frac{M+1}{M+2}$.
Proof of Theorem 2
$$E\left[\sum_{i=1}^{\infty} B_i \left(\frac{M}{M+1}\right)^{\sum_{j=1}^{i-1} A_j} \left(\frac{M+1}{M+2}\right)^{\sum_{j=1}^{i-1} B_j}\right] = E\left[\sum_{\{i \mid y \in S(\phi_i) \text{ or } x \in S(\phi_i)\}} B_i \left(\frac{M}{M+1}\right)^{\sum_{j=1}^{i-1} A_j} \left(\frac{M+1}{M+2}\right)^{\sum_{j=1}^{i-1} B_j}\right].$$
The set $\{i \mid y \in S(\phi_i) \text{ or } x \in S(\phi_i)\}$ must have infinite size since it contains the set $\{i \mid x \in S(\phi_i)\}$, which has infinite size. Let $\phi'_1, \phi'_2, \phi'_3, \dots$ be the subsequence of $\phi_1, \phi_2, \phi_3, \dots$ for which $y \in S(\phi_i)$ or $x \in S(\phi_i)$, and define $B'_i = I(y \in S(\phi'_i) \text{ and } x \in S(\phi'_i))$. Then
$$\begin{aligned}
\operatorname{Corr}(F_s, F_v) &= \frac{2}{M+2}\, E\left[\sum_{i=1}^{\infty} B'_i \left(\frac{M}{M+1}\right)^{i} \left(\frac{M+1}{M+2}\right)^{\sum_{j=1}^{i-1} B'_j}\right]\\
&= \frac{2}{M+2} \sum_{i=1}^{\infty} \left(\frac{M}{M+1}\right)^{i} E[B'_i]\, E\left[\prod_{j=1}^{i-1} \left(\frac{M+1}{M+2}\right)^{B'_j}\right]\\
&= \frac{2}{M+2} \sum_{i=1}^{\infty} \left(\frac{M}{M+1}\right)^{i} p_{s,v} \left[\frac{M+1}{M+2}\, p_{s,v} + (1 - p_{s,v})\right]^{i-1}\\
&= \frac{2}{M+2}\, \frac{M}{M+1}\, p_{s,v} \sum_{i=0}^{\infty} \left(\frac{M}{M+1}\right)^{i} \left[\frac{M+1}{M+2}\, p_{s,v} + (1 - p_{s,v})\right]^{i}\\
&= \frac{2}{M+2}\, \frac{M}{M+1}\, p_{s,v}\, \frac{1}{1 - \left[\frac{M}{M+2}\, p_{s,v} + \frac{M}{M+1}(1 - p_{s,v})\right]}\\
&= \frac{2\,\frac{M}{M+2}\, p_{s,v}}{1 + \frac{M}{M+2}\, p_{s,v}}.
\end{aligned}$$
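As a quick sanity check on the algebra above, the geometric series can be compared numerically with the closed form. The short Python sketch below is an illustration, not code from the paper.

```python
import numpy as np

# Sanity check (not from the paper): the geometric series above should
# agree with the closed form 2(M/(M+2))p / (1 + (M/(M+2))p).
def corr_series(M, p, n_terms=10000):
    i = np.arange(1, n_terms + 1)
    q = (M + 1) / (M + 2) * p + (1 - p)   # E[((M+1)/(M+2))^{B'_j}]
    return 2 / (M + 2) * np.sum((M / (M + 1)) ** i * p * q ** (i - 1))

def corr_closed(M, p):
    c = M / (M + 2)
    return 2 * c * p / (1 + c * p)

M, p = 3.0, 0.4
assert np.isclose(corr_series(M, p), corr_closed(M, p))
```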
Proof of Theorem 3
Since $(C, r, t)$ follows a Poisson process on $\mathbb{R}^p \times \mathbb{R}^2_+$ with intensity $f(r)$, $p(C_k \mid s \in S(\phi_k) \text{ or } v \in S(\phi_k), r_k)$ is uniformly distributed on $B_{r_k}(s) \cup B_{r_k}(v)$ and
$$p(r_k \mid s \in S(\phi_k) \text{ or } v \in S(\phi_k)) = \frac{\nu\left(B_{r_k}(s) \cup B_{r_k}(v)\right) f(r_k)}{\int \nu\left(B_{r_k}(s) \cup B_{r_k}(v)\right) f(r_k)\, dr_k},$$
where $\nu(\cdot)$ is Lebesgue measure. Then
$$\begin{aligned}
p_{s,v} &= P(s, v \in S(\phi_k) \mid s \in S(\phi_k) \text{ or } v \in S(\phi_k))\\
&= \int\!\!\int_{B_{r_k}(s) \cap B_{r_k}(v)} p\left(C_k, r_k \mid s \in S(\phi_k) \text{ or } v \in S(\phi_k)\right) dC_k\, dr_k\\
&= \frac{\int \nu\left(B_{r_k}(s) \cap B_{r_k}(v)\right) f(r_k)\, dr_k}{\int \nu\left(B_{r_k}(s) \cup B_{r_k}(v)\right) f(r_k)\, dr_k}.
\end{aligned}$$
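For intuition, this ratio can be evaluated numerically in one dimension, where $\nu(B_r(s) \cap B_r(v)) = \max(2r - |s-v|, 0)$ and $\nu(B_r(s) \cup B_r(v)) = \min(2r + |s-v|, 4r)$. The sketch below assumes Ga$(\alpha, \beta)$ radii; it is an illustration, not code from the paper.

```python
import numpy as np
from scipy import stats, integrate

# Illustrative one-dimensional evaluation of Theorem 3 with Ga(alpha, beta)
# radii. For intervals of radius r centred at s and v with d = |s - v|,
# nu(intersection) = max(2r - d, 0) and nu(union) = min(2r + d, 4r).
def p_sv(s, v, alpha, beta):
    d = abs(s - v)
    f = stats.gamma(alpha, scale=1.0 / beta).pdf
    num, _ = integrate.quad(lambda r: max(2 * r - d, 0.0) * f(r), 0, np.inf)
    den, _ = integrate.quad(lambda r: min(2 * r + d, 4 * r) * f(r), 0, np.inf)
    return num / den

print(p_sv(0.0, 0.5, alpha=2.0, beta=1.0))
```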
Proof of Theorem 4
The autocorrelation function can be expressed as $f(p_{s,s+u})$ where $f(x) = 2\left(\frac{M}{M+2}\right)x \big/ \left(1 + \frac{M}{M+2}\, x\right)$. Then by Faà di Bruno's formula
$$\frac{d^n}{du^n}\, f(p_{s,s+u}) = \sum \frac{n!}{m_1!\, m_2!\, m_3! \cdots}\, \frac{d^{m_1+\cdots+m_n} f}{dp_{s,s+u}^{m_1+\cdots+m_n}} \prod_{\{j \mid m_j \neq 0\}} \left(\frac{d^j p_{s,s+u}}{du^j}\, \frac{1}{j!}\right)^{m_j},$$
where $m_1 + 2m_2 + 3m_3 + \cdots + nm_n = n$ with $m_j \geq 0$, $j = 1, \dots, n$, and so
$$\lim_{u \to 0} \frac{d^n}{du^n}\, f(p_{s,s+u}) = \sum \frac{n!}{m_1!\, m_2!\, m_3! \cdots}\, \lim_{u \to 0} \frac{d^{m_1+\cdots+m_n} f}{dp_{s,s+u}^{m_1+\cdots+m_n}} \prod_{\{j \mid m_j \neq 0\}} \left(\frac{d^j p_{s,s+u}}{du^j}\, \frac{1}{j!}\right)^{m_j}.$$
Since $\lim_{u \to 0} \frac{d^{m_1+\cdots+m_n} f}{dp_{s,s+u}^{m_1+\cdots+m_n}} = \lim_{p_{s,s+u} \to 1} \frac{d^{m_1+\cdots+m_n} f}{dp_{s,s+u}^{m_1+\cdots+m_n}}$ is finite and non-zero for all values of $n$, the degree of differentiability of the autocorrelation function is equal to the degree of differentiability
of $p_{s,s+u}$. We can write $p_{s,s+u} = \left(\frac{4\mu}{a} - 1\right)^{-1}$ with $a = 2\mu - uI$. Now $\frac{d^k p_{s,s+u}}{da^k} = (k-1)!\,(4\mu - a)^{-k}$ and $\lim_{u \to 0} \frac{d^k p_{s,s+u}}{da^k} = (k-1)!\,(2\mu)^{-k}$, which is finite and non-zero. By application of Faà di Bruno's
formula
$$\frac{d^n}{du^n}\, p_{s,s+u} = \sum \frac{n!}{m_1!\, m_2!\, m_3! \cdots}\, \frac{d^{m_1+\cdots+m_n} p_{s,s+u}}{da^{m_1+\cdots+m_n}} \prod_{\{j \mid m_j \neq 0\}} \left(\frac{d^j a}{du^j}\, \frac{1}{j!}\right)^{m_j}$$
and the degree of differentiability is determined by the degree of differentiability of $a$. If $p(r) \sim \mathrm{Ga}(\alpha, \beta)$ then $\frac{d\mu}{du} = -\frac{1}{2}\left(\frac{u}{2}\right)^{\alpha} \exp\{-u/2\}$ and $\frac{dI}{du} = -\frac{1}{2}\left(\frac{u}{2}\right)^{\alpha-1} \exp\{-u/2\}$, and it is easy to show that $\frac{d^n a}{du^n} = C_n u^{\alpha-n+1} \exp\{-u/2\} + \zeta$, where $\zeta$ contains terms with power of $u$ greater than $\alpha - n + 1$. If $\lim_{u \to 0} u^{\alpha} \exp\{-u/2\}$ is finite then so is $\lim_{u \to 0} u^{\alpha+k} \exp\{-u/2\}$ for $k > 0$, and so the limit will be finite iff $\alpha - n + 1 \geq 0$, i.e. $\alpha \geq n - 1$.
Appendix B: Computational details
As we conduct inference on the basis of the Poisson process restricted to the set R, all quantities
(C, r, t, V, θ) should have a superscript R. To keep notation manageable, these superscripts are not
explicitly used in this Appendix.
Updating the centres
We update each centre $C_1, \dots, C_K$ from its full conditional distribution using a Metropolis-Hastings random walk step. A new value $C'_i$ for the $i$-th centre is proposed from $\mathrm{N}(C_i, \sigma_C^2)$, where $\sigma_C^2$ is chosen so that the acceptance rate is approximately 0.25. If there is no $x_j$ such that $x_j \in (C'_i - r_i, C'_i + r_i)$, or if there is a value of $j$ with $s_j = i$ for which $x_j \notin (C'_i - r_i, C'_i + r_i)$, then $\alpha(C_i, C'_i) = 0$. Otherwise, the acceptance probability has the form
$$\alpha(C_i, C'_i) = \frac{\prod_{j=1}^{n} \prod_{\{h < s_j \,:\, C'_h - r_h < x_j < C'_h + r_h\}} (1 - V_h)}{\prod_{j=1}^{n} \prod_{\{h < s_j \,:\, C_h - r_h < x_j < C_h + r_h\}} (1 - V_h)}.$$
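A minimal sketch of this update is given below, assuming one-dimensional covariates and 0-based allocation labels; the names (`centres`, `radii`, `V`, `s`) are hypothetical and the code is illustrative rather than the authors' implementation.

```python
import numpy as np

# Illustrative sketch (not the authors' code) of the random-walk
# Metropolis-Hastings update for centre C_i, assuming one-dimensional
# covariates and 0-based allocation labels s_j.
def update_centre(i, centres, radii, V, s, x, sigma_C=0.1, rng=np.random):
    C_new = centres[i] + rng.normal(0.0, sigma_C)

    in_interval = np.abs(x - C_new) < radii[i]
    # Reject outright if no observation lies in the proposed interval, or
    # if an observation allocated to atom i falls outside it.
    if not in_interval.any() or np.any((s == i) & ~in_interval):
        return centres[i]                     # alpha(C_i, C'_i) = 0

    def log_prod(C_i):
        # log of prod_j prod_{h < s_j : |x_j - C_h| < r_h} (1 - V_h)
        C = centres.copy()
        C[i] = C_i
        total = 0.0
        for j, xj in enumerate(x):
            h = np.arange(s[j])               # atoms h < s_j
            covered = np.abs(xj - C[h]) < radii[h]
            total += np.sum(np.log1p(-V[h][covered]))
        return total

    if np.log(rng.uniform()) < log_prod(C_new) - log_prod(centres[i]):
        centres[i] = C_new
    return centres[i]
```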
Updating the distances
The distances can be updated using a Gibbs step since the full conditional distribution of $r_k$ has a simple piecewise form. Recall that $d_{ik} = |x_i - C_k|$ and let $S_k = \{j \mid s_j \geq k\}$. We define $S_k^{\mathrm{ord}}$ to be a version of $S_k$ whose elements have been ordered to be increasing in $d_{ik}$, i.e. if $i > j$ and $i, j \in S_k^{\mathrm{ord}}$ then $d_{ik} > d_{jk}$. Finally, we define $d^\star_k = \max\left[\{x_{\min} - C_k, C_k - x_{\max}\} \cup \{d_{ik} \mid s_i = k\}\right]$ and let $m^\star$ be such that $d_{S^{\mathrm{ord}}_{m^\star} k} > d^\star_k$ and $d_{S^{\mathrm{ord}}_{m^\star-1} k} < d^\star_k$. Let $l$ be the length of $S_k^{\mathrm{ord}}$. The full conditional distribution has density
$$f(z) \propto \begin{cases} f(z) & \text{if } d^\star_k < z \leq d_{S^{\mathrm{ord}}_{m^\star} k} \\ f(z)\,(1 - V_k)^{i - m^\star + 1} & \text{if } d_{S^{\mathrm{ord}}_i k} < z \leq d_{S^{\mathrm{ord}}_{i+1} k}, \quad i = m^\star, \dots, l-1 \\ f(z)\,(1 - V_k)^{l - m^\star + 1} & \text{if } z > d_{S^{\mathrm{ord}}_l k} \end{cases}$$
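The piecewise form makes exact sampling straightforward: pick a piece with probability proportional to its mass under $f$, then invert the CDF of $f$ within that piece. A hedged sketch, assuming Ga$(\alpha, \beta)$ distances and precomputed breakpoints and exponents, follows.

```python
import numpy as np
from scipy import stats

# Hedged sketch of the Gibbs update for a distance r_k, assuming a
# Ga(alpha, beta) density f. `breaks` holds the ordered endpoints
# d*_k < d_{S^ord_{m*} k} < ... < d_{S^ord_l k} and `powers` the exponent
# of (1 - V_k) attached to each piece (0, 1, ..., l - m* + 1).
def sample_distance(breaks, powers, V_k, alpha, beta, rng=np.random):
    dist = stats.gamma(alpha, scale=1.0 / beta)
    lo = np.asarray(breaks, dtype=float)
    hi = np.append(lo[1:], np.inf)

    # Unnormalised mass of each piece under the piecewise density.
    mass = (1.0 - V_k) ** np.asarray(powers) * (dist.cdf(hi) - dist.cdf(lo))
    piece = rng.choice(len(mass), p=mass / mass.sum())

    # Inverse-CDF draw from f restricted to the chosen piece.
    u = rng.uniform(dist.cdf(lo[piece]), dist.cdf(hi[piece]))
    return dist.ppf(u)
```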
Swapping the positions of atoms
The ordering of the atoms should also be updated in the sampler. One of the $K$ included atoms, say $(V_i, \theta_i, C_i, r_i)$, is chosen at random to be swapped with the subsequent atom $(V_{i+1}, \theta_{i+1}, C_{i+1}, r_{i+1})$. If $i < K$, the acceptance probability of this move is $\min\left\{1, (1 - V_{i+1})^{n_i} / (1 - V_i)^{n_{i+1}}\right\}$. If $i = K$, then a new point $(V_{K+1}, \theta_{K+1}, C_{K+1}, r_{K+1})$ is proposed from the prior and the swap is accepted with probability $\min\left\{1, (1 - V_{K+1})^{n_i}\right\}$.
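A sketch of this move, operating on the stick-breaking weights only (in an implementation the corresponding $\theta$, $C$ and $r$ would be swapped together with $V$), might look as follows; the array names are assumptions.

```python
import numpy as np

# Sketch of the swap move on the stick-breaking weights V (0-based index);
# `n` holds the allocation counts n_i. In practice theta, C and r are
# swapped together with V.
def swap_atoms(V, n, M, rng=np.random):
    K = len(V)
    i = rng.randint(K)                        # atom chosen at random
    if i < K - 1:
        log_alpha = n[i] * np.log1p(-V[i + 1]) - n[i + 1] * np.log1p(-V[i])
        if np.log(rng.uniform()) < min(0.0, log_alpha):
            V[i], V[i + 1] = V[i + 1], V[i]
    else:                                     # i = K: extend with a prior draw
        V_new = rng.beta(1.0, M)
        if np.log(rng.uniform()) < n[i] * np.log1p(-V_new):
            V = np.append(V, V_new)
            V[i], V[-1] = V[-1], V[i]
    return V
```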
Updating θ and V
The full conditional distribution of $\theta_i$ is proportional to $h(\theta_i) \prod_{\{j \mid s_j = i\}} k(y_j \mid \theta_i)$, where $h$ is the density function of $H$. We update $V_i$ from a Beta distribution with parameters $1 + \sum_{j=1}^{n} I(s_j = i)$ and $M + \sum_{j=1}^{n} I(s_j > i, |x_j - C_i| < r_i)$.
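The $V_i$ update is a direct Beta draw; a minimal sketch (variable names assumed) is:

```python
import numpy as np

# Minimal sketch of the conjugate V_i draw from its Beta full conditional;
# `s`, `x`, `C` and `r` are the allocation, covariate, centre and radius
# arrays (names assumed).
def update_V(i, s, x, C, r, M, rng=np.random):
    a = 1.0 + np.sum(s == i)
    b = M + np.sum((s > i) & (np.abs(x - C[i]) < r[i]))
    return rng.beta(a, b)
```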
Updating M
This parameter can be updated by a random walk on the log scale. Propose $M' = M \exp(\epsilon)$ where $\epsilon \sim \mathrm{N}(0, \sigma_M^2)$, with $\sigma_M^2$ a tuning parameter chosen to maintain an acceptance rate close to 0.25. The proposed value should be accepted with probability
$$\frac{\beta(M')^{\alpha K} \exp\left\{-\beta(M') \sum_{i=1}^{K} r_i\right\} p(M')\, (M')^{K+1} \left[\prod_{i=1}^{K} (1 - V_i)\right]^{M'}}{\beta(M)^{\alpha K} \exp\left\{-\beta(M) \sum_{i=1}^{K} r_i\right\} p(M)\, M^{K+1} \left[\prod_{i=1}^{K} (1 - V_i)\right]^{M}},$$
where $\beta(M)$ is $\beta$ expressed as a function of $M$, as in our suggested form
$$\beta = \frac{2}{x^\star} \log\left(\frac{1 + M + \varepsilon}{\varepsilon(M + 2)}\right).$$
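The sketch below evaluates the acceptance ratio above on the log scale; `log_prior`, `alpha`, `x_star` and `eps` are assumed inputs and the code is illustrative only.

```python
import numpy as np

# Sketch of the log-scale random-walk update for M; the acceptance ratio
# above is evaluated on the log scale. `log_prior`, `alpha`, `x_star`
# and `eps` are assumed inputs.
def beta_of_M(M, x_star, eps):
    return 2.0 / x_star * np.log((1.0 + M + eps) / (eps * (M + 2.0)))

def update_M(M, V, r, alpha, x_star, eps, log_prior, sigma_M=0.5,
             rng=np.random):
    K = len(V)
    M_new = M * np.exp(rng.normal(0.0, sigma_M))

    def log_target(m):
        b = beta_of_M(m, x_star, eps)
        return (alpha * K * np.log(b) - b * np.sum(r) + log_prior(m)
                + (K + 1) * np.log(m) + m * np.sum(np.log1p(-V)))

    if np.log(rng.uniform()) < log_target(M_new) - log_target(M):
        M = M_new
    return M
```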
Posterior inferences on Fx̃
We are often interested in inference at some point $\tilde{x} \in \mathcal{X}$ about the distribution $F_{\tilde{x}}$. We define $(\tilde{V}_1, \tilde\theta_1), (\tilde{V}_2, \tilde\theta_2), \dots, (\tilde{V}_J, \tilde\theta_J)$ to be the subset of $(V_1, \theta_1), (V_2, \theta_2), \dots, (V_K, \theta_K)$ for which $|\tilde{x} - C_i| < r_i$. Then
$$\begin{aligned}
F_{\tilde{x}} = {} & \sum_{i=1}^{J} \tilde{V}_i \prod_{j<i} \left(1 - \tilde{V}_j\right) \prod_{j \leq i} \prod_{l=1}^{n_j} \left(1 - V_l^{(j)}\right) \delta_{\tilde\theta_i}\\
& + \sum_{i=1}^{N} \sum_{j=1}^{n_i} V_j^{(i)} \prod_{l<j} \left(1 - V_l^{(i)}\right) \prod_{j'<i} \prod_{l=1}^{n_{j'}} \left(1 - V_l^{(j')}\right) \prod_{m<i} \left(1 - \tilde{V}_m\right) \delta_{\theta_j^{(i)}}\\
& + \sum_{l=J+1}^{\infty} \tilde{V}_l \prod_{m<l} \left(1 - \tilde{V}_m\right) \prod_{i \leq J} \prod_{m=1}^{n_i} \left(1 - V_m^{(i)}\right) \delta_{\tilde\theta_l},
\end{aligned}$$
where $n_j$ is a geometric random variable with success probability $1 - \tilde{p}$, $\theta_j^{(i)} \sim H$, $V_j^{(i)} \sim \mathrm{Be}(1, M)$, $\tilde\theta_m \sim H$ and $\tilde{V}_m \sim \mathrm{Be}(1, M)$ for $m > N$. We calculate $\tilde{p}$ in the following way. If $x_{\min} < \tilde{x} < x_{\max}$, define $i$ so that $x_{(i)} < \tilde{x} < x_{(i+1)}$, where $x_{(1)}, \dots, x_{(n)}$ is an ordered version of $x_1, \dots, x_n$; then $\tilde{p} = \frac{\beta}{2\alpha}\, \tilde{q}$ where
$$\begin{aligned}
\tilde{q} = {} & (x_{(i+1)} - x_{(i)})\, I\left(\frac{x_{(i+1)} - x_{(i)}}{2}\right) + (x_{(i)} - \tilde{x})\, I\left(\frac{\tilde{x} - x_{(i)}}{2}\right) - (x_{(i+1)} - \tilde{x})\, I\left(\frac{x_{(i+1)} - \tilde{x}}{2}\right)\\
& - 2\mu^\star\left(\frac{x_{(i+1)} - x_{(i)}}{2}\right) + 2\mu^\star\left(\frac{\tilde{x} - x_{(i)}}{2}\right) + 2\mu^\star\left(\frac{x_{(i+1)} - \tilde{x}}{2}\right)
\end{aligned}$$
with $I(y) = \int_0^y f(r)\, dr$ and $\mu^\star(y) = \int_0^y r f(r)\, dr$. Otherwise, if $\tilde{x} < x_{\min}$,
$$\tilde{q} = 2\mu^\star\left(\frac{x_{\min} - \tilde{x}}{2}\right) + (x_{\min} - \tilde{x}) \left(1 - I\left(\frac{x_{\min} - \tilde{x}}{2}\right)\right)$$
and if $\tilde{x} > x_{\max}$
$$\tilde{q} = 2\mu^\star\left(\frac{\tilde{x} - x_{\max}}{2}\right) + (\tilde{x} - x_{\max}) \left(1 - I\left(\frac{\tilde{x} - x_{\max}}{2}\right)\right).$$
We use a truncated version of $F_{\tilde{x}}$ with $h$ elements, which are chosen so that $\sum_{i=1}^{h} p_i = 1 - \varepsilon$, where $\varepsilon$ is usually taken to be 0.001.
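For Ga$(\alpha, \beta)$ radii the quantities $I(y)$ and $\mu^\star(y)$ have closed forms in terms of the regularised incomplete gamma function, so $\tilde{q}$ and $\tilde{p}$ are cheap to evaluate. The sketch below (an illustration, not the authors' code) implements the interior case of the formula above.

```python
import numpy as np
from scipy.special import gammainc

# For Ga(alpha, beta) radii, I(y) and mu_star(y) have closed forms via the
# regularised lower incomplete gamma function (scipy's gammainc).
def I(y, alpha, beta):
    return gammainc(alpha, beta * y)

def mu_star(y, alpha, beta):
    return (alpha / beta) * gammainc(alpha + 1.0, beta * y)

# q_tilde for an interior point x_lo < x_tilde < x_hi, mirroring the first
# case above; the boundary cases are analogous. Then
# p_tilde = beta / (2 * alpha) * q_tilde.
def q_tilde_interior(x_tilde, x_lo, x_hi, alpha, beta):
    g, a, b = x_hi - x_lo, x_tilde - x_lo, x_hi - x_tilde
    return (g * I(g / 2, alpha, beta)
            + (x_lo - x_tilde) * I(a / 2, alpha, beta)
            - b * I(b / 2, alpha, beta)
            - 2 * mu_star(g / 2, alpha, beta)
            + 2 * mu_star(a / 2, alpha, beta)
            + 2 * mu_star(b / 2, alpha, beta))
```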
Model 2
This section is restricted to discussing the implementation when $m(x)$ follows a Gaussian process prior, where we define $P_{ij} = \rho(x_i, x_j)$. We also reparametrise from $u_i$ to $\phi_i = \sigma^2 \psi_i$.
Updating φi | s
The full conditional distribution has the density
$$p(\phi_i) \propto \phi_i^{0.5\left(1 - \sum I(s_j = i,\, 1 \leq j \leq n)\right)} \exp\{-0.5\,\phi_i/\sigma^2\}, \qquad \phi_i > \phi_{\min},$$
where $\phi_{\min} = \max\left\{(y_j - m(x_j))^2 \mid s_j = i,\ 1 \leq j \leq n\right\}$. A rejection sampler for this full conditional distribution can be constructed using the envelope
$$h^\star(\phi_i) \propto \begin{cases} \phi_i^{0.5\left(1 - \sum I(s_j = i,\, 1 \leq j \leq n)\right)} & \phi_{\min} < \phi_i < z \\ z^{0.5\left(1 - \sum I(s_j = i,\, 1 \leq j \leq n)\right)} \exp\{-0.5(\phi_i - z)/\sigma^2\} & \phi_i > z \end{cases}$$
which can be sampled using inversion sampling. Writing $k = \sum I(s_j = i,\, 1 \leq j \leq n)$, the acceptance probability is
$$\alpha(\phi_i) = \begin{cases} \exp\{-0.5(\phi_i - \phi_{\min})/\sigma^2\} & \phi_{\min} < \phi_i < z \\ \left(\frac{\phi_i}{z}\right)^{0.5 - 0.5k} \exp\{-0.5(z - \phi_{\min})/\sigma^2\} & \phi_i > z \end{cases}$$
and the choice $z = \sigma^2 \sum I(s_j = i,\, 1 \leq j \leq n)$ maximizes the acceptance rate.
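A sketch of the resulting rejection sampler is given below; it is illustrative rather than the authors' implementation, and the $k = 3$ case (where the first envelope piece integrates to a logarithm) is omitted for brevity.

```python
import numpy as np

# Hedged sketch of the rejection sampler above; k is the number of
# observations allocated to atom i. The k = 3 case (where the first
# envelope piece integrates to a logarithm) is omitted for brevity.
def sample_phi(phi_min, k, sigma2, rng=np.random):
    z = sigma2 * k                       # choice maximising acceptance
    p = 0.5 * (1.0 - k)                  # power of phi_i in the density
    while True:
        # Masses of the two envelope pieces.
        m1 = (z ** (p + 1) - phi_min ** (p + 1)) / (p + 1)
        m2 = z ** p * 2.0 * sigma2       # exponential tail beyond z
        if rng.uniform() < m1 / (m1 + m2):
            # Inversion sampling of phi^p on (phi_min, z).
            u = rng.uniform()
            phi = (phi_min ** (p + 1)
                   + u * (z ** (p + 1) - phi_min ** (p + 1))) ** (1 / (p + 1))
            accept = np.exp(-0.5 * (phi - phi_min) / sigma2)
        else:
            phi = z + rng.exponential(2.0 * sigma2)
            accept = ((phi / z) ** (0.5 - 0.5 * k)
                      * np.exp(-0.5 * (z - phi_min) / sigma2))
        if rng.uniform() < accept:
            return phi
```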
Updating σ^{-2}
Using the prior Gamma$(\nu_1, \nu_2)$, the full conditional distribution of $\sigma^{-2}$ is again a Gamma distribution, where we define $P = (P_{ij})$:
$$\sigma^{-2} \sim \mathrm{Ga}\left(\nu_1 + \frac{3K}{2} + \frac{n}{2},\ \nu_2 + \frac{1}{2}\sum_{i=1}^{K} \phi_i + \frac{1}{2\omega^2}\, m(x)^T P^{-1} m(x)\right).$$
Updating m(x_1), . . . , m(x_n)
It is possible to update $m(x_i)$ using its full conditional distribution. However, this tends to lead to slowly mixing algorithms. A more useful approach uses the transformation $m(x) = C^\star z$, where $C^\star$ is the Cholesky factor of $\sigma_0^2 P$ and $z \sim \mathrm{N}(0, I)$. We then update each $z_j$ from its full conditional distribution, which is a standard normal distribution truncated to the region $\bigcap_{i=1}^{n} \left(y_i - \sum_{k \neq j} C^\star_{ik} z_k - \sqrt{\phi_i},\ y_i - \sum_{k \neq j} C^\star_{ik} z_k + \sqrt{\phi_i}\right)$.
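One $z_j$ update might be sketched as follows, intersecting the interval constraints (dividing through by the Cholesky entries, whose signs determine which side binds) and drawing from the truncated standard normal by inversion; all names are assumptions.

```python
import numpy as np
from scipy import stats

# Sketch of one z_j update. Each observation i with C*_{ij} != 0 gives an
# interval constraint on z_j after dividing through by C*_{ij}; the full
# conditional is a standard normal truncated to their intersection.
def update_z(j, z, Cstar, y, phi, rng=np.random):
    resid = y - Cstar @ z + Cstar[:, j] * z[j]   # y_i - sum_{k != j} C_ik z_k
    c = Cstar[:, j]
    lo, hi = -np.inf, np.inf
    for i in np.nonzero(c)[0]:
        a = (resid[i] - np.sqrt(phi[i])) / c[i]
        b = (resid[i] + np.sqrt(phi[i])) / c[i]
        lo, hi = max(lo, min(a, b)), min(hi, max(a, b))
    # Inverse-CDF draw from N(0, 1) truncated to (lo, hi).
    u = rng.uniform(stats.norm.cdf(lo), stats.norm.cdf(hi))
    z[j] = stats.norm.ppf(u)
    return z
```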
Updating ω
We define $\omega^2 = \sigma^2/\sigma_0^2$. If $\omega^2$ follows a Gamma distribution with parameters $a_0$ and $b_0$ then the full conditional of $\sigma_0^{-2}$ follows a Gamma distribution with parameters $a_0 + n/2$ and $b_0 + \sigma^{-2} m(x)^T P^{-1} m(x)/2$. A similar update applies under the Generalized inverse Gaussian prior used here.
Updating the Matérn parameters
We update any parameters of the Matérn correlation structure by a Metropolis-Hastings random walk. The full conditional distribution of the parameters $(\zeta, \tau)$ is proportional to
$$|P|^{-1/2} \exp\left\{-\frac{1}{2}\,\sigma^{-2}\omega^{-2}\, m(x)^T P^{-1} m(x)\right\} p(\zeta, \tau).$$
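A hedged sketch of this Metropolis-Hastings step is given below; `matern_corr` (building $P$ from the parameters) and `log_prior` are assumed, and the proposal is a random walk on the log scale with the corresponding Jacobian term.

```python
import numpy as np

# Illustrative random-walk MH step for the Matern parameters (zeta, tau);
# `matern_corr` (building P) and `log_prior` are assumed. The walk is on
# the log scale, so a Jacobian term enters the acceptance ratio.
def update_matern(zeta, tau, x, m, sigma2, omega2, matern_corr, log_prior,
                  step=0.1, rng=np.random):
    def log_target(z, t):
        L = np.linalg.cholesky(matern_corr(x, z, t))   # P = L L^T
        a = np.linalg.solve(L, m)                      # m^T P^{-1} m = a.a
        return (-np.sum(np.log(np.diag(L)))            # -0.5 log|P|
                - 0.5 * (a @ a) / (sigma2 * omega2)
                + log_prior(z, t))

    z_new, t_new = np.exp(np.log([zeta, tau]) + rng.normal(0.0, step, 2))
    log_alpha = (log_target(z_new, t_new) - log_target(zeta, tau)
                 + np.log(z_new * t_new / (zeta * tau)))
    if np.log(rng.uniform()) < log_alpha:
        zeta, tau = z_new, t_new
    return zeta, tau
```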