Bayesian Nonparametric Modelling with the Dirichlet
Process Regression Smoother - Supplementary Material
J.E. Griffin and M. F. J. Steel
University of Kent and University of Warwick
Appendix A: Proofs
Proof of Proposition 1
Let $Z \sim H$. Then
$$E\left[\mu_x^{(k)}\right] = E\left[\sum_{i=1}^{\infty} p_i(x)\theta_i^k\right] = E[Z^k]\sum_{i=1}^{\infty} E[p_i(x)] = E[Z^k].$$
Similarly, $E\left[\mu_y^{(k)}\right] = E[Z^k]$.
$$\begin{aligned}
E\left[\mu_x^{(k)}\mu_y^{(k)}\right] &= E\left[\left(\sum_{i=1}^{\infty} p_i(x)\theta_i^k\right)\left(\sum_{i=1}^{\infty} p_i(y)\theta_i^k\right)\right]\\
&= \sum_{i=1}^{\infty} E[p_i(x)p_i(y)]\,E\left[\theta_i^{2k}\right] + \sum_{i=1}^{\infty}\sum_{j=1;\, j\neq i}^{\infty} E[p_i(x)p_j(y)]\,E\left[\theta_i^k\right]E\left[\theta_j^k\right]\\
&= E\left[Z^{2k}\right]\sum_{i=1}^{\infty} E[p_i(x)p_i(y)] + E\left[Z^k\right]^2\left(1 - \sum_{i=1}^{\infty} E[p_i(x)p_i(y)]\right)\\
&= E\left[Z^k\right]^2 + \operatorname{Var}\left[Z^k\right]\sum_{i=1}^{\infty} E[p_i(x)p_i(y)],
\end{aligned}$$
so that
$$\operatorname{Cov}\left(\mu_x^{(k)}, \mu_y^{(k)}\right) = \operatorname{Var}\left[Z^k\right]\sum_{i=1}^{\infty} E[p_i(x)p_i(y)]$$
and so
$$\operatorname{Corr}\left(\mu_x^{(k)}, \mu_y^{(k)}\right) = \frac{\sum_{i=1}^{\infty} E[p_i(x)p_i(y)]}{\sum_{i=1}^{\infty} E\left[p_i(x)^2\right]},$$
which follows from the stationarity of $p_1(x), p_2(x), p_3(x), \dots$
Proof of Theorem 1
It is easy to show (see GS) that for any measurable set $B$
$$\operatorname{Corr}(F_x(B), F_y(B)) = (M+1)\sum_{i=1}^{\infty} E[p_i(x)p_i(y)].$$
In this case, if $x \notin S(\phi_i)$ or $y \notin S(\phi_i)$ then $E[p_i(x)p_i(y)] = 0$. Otherwise, let $R_i^{(1)} = \{j < i \mid x \in S(\phi_j) \text{ and } y \in S(\phi_j)\}$, $R_i^{(2)} = \{j < i \mid x \in S(\phi_j) \text{ and } y \notin S(\phi_j)\}$ and $R_i^{(3)} = \{j < i \mid x \notin S(\phi_j) \text{ and } y \in S(\phi_j)\}$. Then
$$\begin{aligned}
E[p_i(x)p_i(y)] &= E\left[V_i^2 \prod_{j \in R_i^{(1)}} (1-V_j)^2 \prod_{j \in R_i^{(2)}} (1-V_j) \prod_{j \in R_i^{(3)}} (1-V_j)\right]\\
&= \frac{2}{(M+1)(M+2)}\, E\left[\left(\frac{M}{M+2}\right)^{\#R_i^{(1)}} \left(\frac{M}{M+1}\right)^{\#R_i^{(2)}} \left(\frac{M}{M+1}\right)^{\#R_i^{(3)}}\right]
\end{aligned}$$
and so
$$\operatorname{Corr}(F_x(B), F_y(B)) = \frac{2}{M+2}\, E\left[\sum_{i=1}^{\infty} B_i \left(\frac{M}{M+1}\right)^{\sum_{j=1}^{i-1} A_j} \left(\frac{M+1}{M+2}\right)^{\sum_{j=1}^{i-1} B_j}\right],$$
where $A_j = I(x \in S(\phi_j) \text{ or } y \in S(\phi_j))$ and $B_j = I(x \in S(\phi_j) \text{ and } y \in S(\phi_j))$, using $\frac{M}{M+2} = \frac{M}{M+1} \cdot \frac{M+1}{M+2}$.
Proof of Theorem 2
$$E\left[\sum_{i=1}^{\infty} B_i \left(\frac{M}{M+1}\right)^{\sum_{j=1}^{i-1} A_j} \left(\frac{M+1}{M+2}\right)^{\sum_{j=1}^{i-1} B_j}\right] = E\left[\sum_{\{i \mid y \in S(\phi_i) \text{ or } x \in S(\phi_i)\}} B_i \left(\frac{M}{M+1}\right)^{\sum_{j=1}^{i-1} A_j} \left(\frac{M+1}{M+2}\right)^{\sum_{j=1}^{i-1} B_j}\right].$$
The set $\{i \mid y \in S(\phi_i) \text{ or } x \in S(\phi_i)\}$ must have infinite size since it contains the set $\{i \mid x \in S(\phi_i)\}$, which has infinite size. Let $\phi'_1, \phi'_2, \phi'_3, \dots$ be the subsequence of $\phi_1, \phi_2, \phi_3, \dots$ for which $y \in S(\phi_i)$ or $x \in S(\phi_i)$, and define $B'_i = I(y \in S(\phi'_i) \text{ and } x \in S(\phi'_i))$. Then
$$\begin{aligned}
\operatorname{Corr}(F_s, F_v) &= \frac{2}{M+2}\, E\left[\sum_{i=1}^{\infty} B'_i \left(\frac{M}{M+1}\right)^{i} \left(\frac{M+1}{M+2}\right)^{\sum_{j=1}^{i-1} B'_j}\right]\\
&= \frac{2}{M+2} \sum_{i=1}^{\infty} \left(\frac{M}{M+1}\right)^{i} E[B'_i]\, E\left[\prod_{j=1}^{i-1} \left(\frac{M+1}{M+2}\right)^{B'_j}\right]\\
&= \frac{2}{M+2} \sum_{i=1}^{\infty} \left(\frac{M}{M+1}\right)^{i} p_{s,v} \left[\frac{M+1}{M+2}\, p_{s,v} + (1 - p_{s,v})\right]^{i-1}\\
&= \frac{2}{M+2}\, \frac{M}{M+1}\, p_{s,v} \sum_{i=0}^{\infty} \left(\frac{M}{M+1}\right)^{i} \left[\frac{M+1}{M+2}\, p_{s,v} + (1 - p_{s,v})\right]^{i}\\
&= \frac{2}{M+2}\, \frac{M}{M+1}\, p_{s,v}\, \frac{1}{1 - \left[\frac{M}{M+2}\, p_{s,v} + \frac{M}{M+1}(1 - p_{s,v})\right]}\\
&= \frac{2\,\frac{M}{M+2}\, p_{s,v}}{1 + \frac{M}{M+2}\, p_{s,v}}.
\end{aligned}$$
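As a quick sanity check on the algebra above, the geometric series can be compared numerically with the closed form. The short Python sketch below is an illustration, not code from the paper.

```python
import numpy as np

# Sanity check (not from the paper): the geometric series above should
# agree with the closed form 2(M/(M+2))p / (1 + (M/(M+2))p).
def corr_series(M, p, n_terms=10000):
    i = np.arange(1, n_terms + 1)
    q = (M + 1) / (M + 2) * p + (1 - p)   # E[((M+1)/(M+2))^{B'_j}]
    return 2 / (M + 2) * np.sum((M / (M + 1)) ** i * p * q ** (i - 1))

def corr_closed(M, p):
    c = M / (M + 2)
    return 2 * c * p / (1 + c * p)

M, p = 3.0, 0.4
assert np.isclose(corr_series(M, p), corr_closed(M, p))
```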
Proof of Theorem 3
Since $(C, r, t)$ follows a Poisson process on $\mathbb{R}^p \times \mathbb{R}^2_+$ with intensity $f(r)$, $p(C_k \mid s \in S(\phi_k) \text{ or } v \in S(\phi_k), r_k)$ is uniformly distributed on $B_{r_k}(s) \cup B_{r_k}(v)$ and
$$p(r_k \mid s \in S(\phi_k) \text{ or } v \in S(\phi_k)) = \frac{\nu\left(B_{r_k}(s) \cup B_{r_k}(v)\right) f(r_k)}{\int \nu\left(B_{r_k}(s) \cup B_{r_k}(v)\right) f(r_k)\, dr_k},$$
where $\nu(\cdot)$ is Lebesgue measure. Then
$$\begin{aligned}
p_{s,v} &= P(s, v \in S(\phi_k) \mid s \in S(\phi_k) \text{ or } v \in S(\phi_k))\\
&= \int\!\!\int_{B_{r_k}(s) \cap B_{r_k}(v)} p\left(C_k, r_k \mid s \in S(\phi_k) \text{ or } v \in S(\phi_k)\right) dC_k\, dr_k\\
&= \frac{\int \nu\left(B_{r_k}(s) \cap B_{r_k}(v)\right) f(r_k)\, dr_k}{\int \nu\left(B_{r_k}(s) \cup B_{r_k}(v)\right) f(r_k)\, dr_k}.
\end{aligned}$$
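For intuition, this ratio can be evaluated numerically in one dimension, where $\nu(B_r(s) \cap B_r(v)) = \max(2r - |s-v|, 0)$ and $\nu(B_r(s) \cup B_r(v)) = \min(2r + |s-v|, 4r)$. The sketch below assumes Ga$(\alpha, \beta)$ radii; it is an illustration, not code from the paper.

```python
import numpy as np
from scipy import stats, integrate

# Illustrative one-dimensional evaluation of Theorem 3 with Ga(alpha, beta)
# radii. For intervals of radius r centred at s and v with d = |s - v|,
# nu(intersection) = max(2r - d, 0) and nu(union) = min(2r + d, 4r).
def p_sv(s, v, alpha, beta):
    d = abs(s - v)
    f = stats.gamma(alpha, scale=1.0 / beta).pdf
    num, _ = integrate.quad(lambda r: max(2 * r - d, 0.0) * f(r), 0, np.inf)
    den, _ = integrate.quad(lambda r: min(2 * r + d, 4 * r) * f(r), 0, np.inf)
    return num / den

print(p_sv(0.0, 0.5, alpha=2.0, beta=1.0))
```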
Proof of Theorem 4
The autocorrelation function can be expressed as $f(p_{s,s+u})$ where $f(x) = 2\left(\frac{M}{M+2}\right)x \big/ \left(1 + \frac{M}{M+2}\, x\right)$. Then by Faà di Bruno's formula
$$\frac{d^n}{du^n}\, f(p_{s,s+u}) = \sum \frac{n!}{m_1!\, m_2!\, m_3! \cdots}\, \frac{d^{m_1+\cdots+m_n} f}{dp_{s,s+u}^{m_1+\cdots+m_n}} \prod_{\{j \mid m_j \neq 0\}} \left(\frac{d^j p_{s,s+u}}{du^j}\, \frac{1}{j!}\right)^{m_j},$$
where $m_1 + 2m_2 + 3m_3 + \cdots + nm_n = n$ with $m_j \geq 0$, $j = 1, \dots, n$, and so
$$\lim_{u \to 0} \frac{d^n}{du^n}\, f(p_{s,s+u}) = \sum \frac{n!}{m_1!\, m_2!\, m_3! \cdots}\, \lim_{u \to 0} \frac{d^{m_1+\cdots+m_n} f}{dp_{s,s+u}^{m_1+\cdots+m_n}} \prod_{\{j \mid m_j \neq 0\}} \left(\frac{d^j p_{s,s+u}}{du^j}\, \frac{1}{j!}\right)^{m_j}.$$
Since $\lim_{u \to 0} \frac{d^{m_1+\cdots+m_n} f}{dp_{s,s+u}^{m_1+\cdots+m_n}} = \lim_{p_{s,s+u} \to 1} \frac{d^{m_1+\cdots+m_n} f}{dp_{s,s+u}^{m_1+\cdots+m_n}}$ is finite and non-zero for all values of $n$, the degree of differentiability of the autocorrelation function is equal to the degree of differentiability
of $p_{s,s+u}$. We can write $p_{s,s+u} = \left(\frac{4\mu}{a} - 1\right)^{-1}$ with $a = 2\mu - uI$. Now $\frac{d^k p_{s,s+u}}{da^k} = (k-1)!\,(4\mu - a)^{-k}$ and $\lim_{u \to 0} \frac{d^k p_{s,s+u}}{da^k} = (k-1)!\,(2\mu)^{-k}$, which is finite and non-zero. By application of Faà di Bruno's
formula
$$\frac{d^n}{du^n}\, p_{s,s+u} = \sum \frac{n!}{m_1!\, m_2!\, m_3! \cdots}\, \frac{d^{m_1+\cdots+m_n} p_{s,s+u}}{da^{m_1+\cdots+m_n}} \prod_{\{j \mid m_j \neq 0\}} \left(\frac{d^j a}{du^j}\, \frac{1}{j!}\right)^{m_j}$$
and the degree of differentiability is determined by the degree of differentiability of $a$. If $p(r) \sim \mathrm{Ga}(\alpha, \beta)$ then $\frac{d\mu}{du} = -\frac{1}{2}\left(\frac{u}{2}\right)^{\alpha} \exp\{-u/2\}$ and $\frac{dI}{du} = -\frac{1}{2}\left(\frac{u}{2}\right)^{\alpha-1} \exp\{-u/2\}$, and it is easy to show that $\frac{d^n a}{du^n} = C_n u^{\alpha-n+1} \exp\{-u/2\} + \zeta$, where $\zeta$ contains terms with power of $u$ greater than $\alpha - n + 1$. If $\lim_{u \to 0} u^{\alpha} \exp\{-u/2\}$ is finite then so is $\lim_{u \to 0} u^{\alpha+k} \exp\{-u/2\}$ for $k > 0$, and so the limit will be finite iff $\alpha - n + 1 \geq 0$, i.e. $\alpha \geq n - 1$.
Appendix B: Computational details
As we conduct inference on the basis of the Poisson process restricted to the set R, all quantities
(C, r, t, V, θ) should have a superscript R. To keep notation manageable, these superscripts are not
explicitly used in this Appendix.
Updating the centres
We update each centre $C_1, \dots, C_K$ from its full conditional distribution using a Metropolis-Hastings random walk step. A new value $C'_i$ for the $i$-th centre is proposed from $\mathrm{N}(C_i, \sigma_C^2)$, where $\sigma_C^2$ is chosen so that the acceptance rate is approximately 0.25. If there is no $x_j$ such that $x_j \in (C'_i - r_i, C'_i + r_i)$, or if there is a value of $j$ with $s_j = i$ for which $x_j \notin (C'_i - r_i, C'_i + r_i)$, then $\alpha(C_i, C'_i) = 0$. Otherwise, the acceptance probability has the form
$$\alpha(C_i, C'_i) = \frac{\prod_{j=1}^{n} \prod_{\{h < s_j \,:\, C'_h - r_h < x_j < C'_h + r_h\}} (1 - V_h)}{\prod_{j=1}^{n} \prod_{\{h < s_j \,:\, C_h - r_h < x_j < C_h + r_h\}} (1 - V_h)}.$$
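A minimal sketch of this update is given below, assuming one-dimensional covariates and 0-based allocation labels; the names (`centres`, `radii`, `V`, `s`) are hypothetical and the code is illustrative rather than the authors' implementation.

```python
import numpy as np

# Illustrative sketch (not the authors' code) of the random-walk
# Metropolis-Hastings update for centre C_i, assuming one-dimensional
# covariates and 0-based allocation labels s_j.
def update_centre(i, centres, radii, V, s, x, sigma_C=0.1, rng=np.random):
    C_new = centres[i] + rng.normal(0.0, sigma_C)

    in_interval = np.abs(x - C_new) < radii[i]
    # Reject outright if no observation lies in the proposed interval, or
    # if an observation allocated to atom i falls outside it.
    if not in_interval.any() or np.any((s == i) & ~in_interval):
        return centres[i]                     # alpha(C_i, C'_i) = 0

    def log_prod(C_i):
        # log of prod_j prod_{h < s_j : |x_j - C_h| < r_h} (1 - V_h)
        C = centres.copy()
        C[i] = C_i
        total = 0.0
        for j, xj in enumerate(x):
            h = np.arange(s[j])               # atoms h < s_j
            covered = np.abs(xj - C[h]) < radii[h]
            total += np.sum(np.log1p(-V[h][covered]))
        return total

    if np.log(rng.uniform()) < log_prod(C_new) - log_prod(centres[i]):
        centres[i] = C_new
    return centres[i]
```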
Updating the distances
The distances can be updated using a Gibbs step since the full conditional distribution of $r_k$ has a simple piecewise form. Recall that $d_{ik} = |x_i - C_k|$ and let $S_k = \{j \mid s_j \geq k\}$. We define $S_k^{\mathrm{ord}}$ to be a version of $S_k$ whose elements have been ordered to be increasing in $d_{ik}$, i.e. if $i > j$ and $i, j \in S_k^{\mathrm{ord}}$ then $d_{ik} > d_{jk}$. Finally, we define $d^\star_k = \max\left[\{x_{\min} - C_k, C_k - x_{\max}\} \cup \{d_{ik} \mid s_i = k\}\right]$ and let $m^\star$ be such that $d_{S^{\mathrm{ord}}_{m^\star} k} > d^\star_k$ and $d_{S^{\mathrm{ord}}_{m^\star-1} k} < d^\star_k$. Let $l$ be the length of $S_k^{\mathrm{ord}}$. The full conditional distribution has density
$$f(z) \propto \begin{cases} f(z) & \text{if } d^\star_k < z \leq d_{S^{\mathrm{ord}}_{m^\star} k} \\ f(z)\,(1 - V_k)^{i - m^\star + 1} & \text{if } d_{S^{\mathrm{ord}}_i k} < z \leq d_{S^{\mathrm{ord}}_{i+1} k}, \quad i = m^\star, \dots, l-1 \\ f(z)\,(1 - V_k)^{l - m^\star + 1} & \text{if } z > d_{S^{\mathrm{ord}}_l k} \end{cases}$$
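The piecewise form makes exact sampling straightforward: pick a piece with probability proportional to its mass under $f$, then invert the CDF of $f$ within that piece. A hedged sketch, assuming Ga$(\alpha, \beta)$ distances and precomputed breakpoints and exponents, follows.

```python
import numpy as np
from scipy import stats

# Hedged sketch of the Gibbs update for a distance r_k, assuming a
# Ga(alpha, beta) density f. `breaks` holds the ordered endpoints
# d*_k < d_{S^ord_{m*} k} < ... < d_{S^ord_l k} and `powers` the exponent
# of (1 - V_k) attached to each piece (0, 1, ..., l - m* + 1).
def sample_distance(breaks, powers, V_k, alpha, beta, rng=np.random):
    dist = stats.gamma(alpha, scale=1.0 / beta)
    lo = np.asarray(breaks, dtype=float)
    hi = np.append(lo[1:], np.inf)

    # Unnormalised mass of each piece under the piecewise density.
    mass = (1.0 - V_k) ** np.asarray(powers) * (dist.cdf(hi) - dist.cdf(lo))
    piece = rng.choice(len(mass), p=mass / mass.sum())

    # Inverse-CDF draw from f restricted to the chosen piece.
    u = rng.uniform(dist.cdf(lo[piece]), dist.cdf(hi[piece]))
    return dist.ppf(u)
```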
Swapping the positions of atoms
The ordering of the atoms should also be updated in the sampler. One of the $K$ included atoms, say $(V_i, \theta_i, C_i, r_i)$, is chosen at random to be swapped with the subsequent atom $(V_{i+1}, \theta_{i+1}, C_{i+1}, r_{i+1})$. If $i < K$, the acceptance probability of this move is $\min\left\{1, (1 - V_{i+1})^{n_i} / (1 - V_i)^{n_{i+1}}\right\}$. If $i = K$, then a new point $(V_{K+1}, \theta_{K+1}, C_{K+1}, r_{K+1})$ is proposed from the prior and the swap is accepted with probability $\min\left\{1, (1 - V_{K+1})^{n_i}\right\}$.
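A sketch of this move, operating on the stick-breaking weights only (in an implementation the corresponding $\theta$, $C$ and $r$ would be swapped together with $V$), might look as follows; the array names are assumptions.

```python
import numpy as np

# Sketch of the swap move on the stick-breaking weights V (0-based index);
# `n` holds the allocation counts n_i. In practice theta, C and r are
# swapped together with V.
def swap_atoms(V, n, M, rng=np.random):
    K = len(V)
    i = rng.randint(K)                        # atom chosen at random
    if i < K - 1:
        log_alpha = n[i] * np.log1p(-V[i + 1]) - n[i + 1] * np.log1p(-V[i])
        if np.log(rng.uniform()) < min(0.0, log_alpha):
            V[i], V[i + 1] = V[i + 1], V[i]
    else:                                     # i = K: extend with a prior draw
        V_new = rng.beta(1.0, M)
        if np.log(rng.uniform()) < n[i] * np.log1p(-V_new):
            V = np.append(V, V_new)
            V[i], V[-1] = V[-1], V[i]
    return V
```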
Updating θ and V
The full conditional distribution of $\theta_i$ is proportional to $h(\theta_i) \prod_{\{j \mid s_j = i\}} k(y_j \mid \theta_i)$, where $h$ is the density function of $H$. We update $V_i$ from a Beta distribution with parameters $1 + \sum_{j=1}^{n} I(s_j = i)$ and $M + \sum_{j=1}^{n} I(s_j > i, |x_j - C_i| < r_i)$.
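The $V_i$ update is a direct Beta draw; a minimal sketch (variable names assumed) is:

```python
import numpy as np

# Minimal sketch of the conjugate V_i draw from its Beta full conditional;
# `s`, `x`, `C` and `r` are the allocation, covariate, centre and radius
# arrays (names assumed).
def update_V(i, s, x, C, r, M, rng=np.random):
    a = 1.0 + np.sum(s == i)
    b = M + np.sum((s > i) & (np.abs(x - C[i]) < r[i]))
    return rng.beta(a, b)
```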
Updating M
This parameter can be updated by a random walk on the log scale. Propose $M' = M \exp(\epsilon)$ where $\epsilon \sim \mathrm{N}(0, \sigma_M^2)$, with $\sigma_M^2$ a tuning parameter chosen to maintain an acceptance rate close to 0.25. The proposed value should be accepted with probability
$$\frac{\beta(M')^{\alpha K} \exp\left\{-\beta(M') \sum_{i=1}^{K} r_i\right\} p(M')\, (M')^{K+1} \left[\prod_{i=1}^{K} (1 - V_i)\right]^{M'}}{\beta(M)^{\alpha K} \exp\left\{-\beta(M) \sum_{i=1}^{K} r_i\right\} p(M)\, M^{K+1} \left[\prod_{i=1}^{K} (1 - V_i)\right]^{M}},$$
where $\beta(M)$ is $\beta$ expressed as a function of $M$, as in our suggested form
$$\beta = \frac{2}{x^\star} \log\left(\frac{1 + M + \varepsilon}{\varepsilon(M + 2)}\right).$$
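The sketch below evaluates the acceptance ratio above on the log scale; `log_prior`, `alpha`, `x_star` and `eps` are assumed inputs and the code is illustrative only.

```python
import numpy as np

# Sketch of the log-scale random-walk update for M; the acceptance ratio
# above is evaluated on the log scale. `log_prior`, `alpha`, `x_star`
# and `eps` are assumed inputs.
def beta_of_M(M, x_star, eps):
    return 2.0 / x_star * np.log((1.0 + M + eps) / (eps * (M + 2.0)))

def update_M(M, V, r, alpha, x_star, eps, log_prior, sigma_M=0.5,
             rng=np.random):
    K = len(V)
    M_new = M * np.exp(rng.normal(0.0, sigma_M))

    def log_target(m):
        b = beta_of_M(m, x_star, eps)
        return (alpha * K * np.log(b) - b * np.sum(r) + log_prior(m)
                + (K + 1) * np.log(m) + m * np.sum(np.log1p(-V)))

    if np.log(rng.uniform()) < log_target(M_new) - log_target(M):
        M = M_new
    return M
```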
Posterior inferences on Fx̃
We are often interested in inference at some point $\tilde{x} \in \mathcal{X}$ about the distribution $F_{\tilde{x}}$. We define $(\tilde{V}_1, \tilde\theta_1), (\tilde{V}_2, \tilde\theta_2), \dots, (\tilde{V}_J, \tilde\theta_J)$ to be the subset of $(V_1, \theta_1), (V_2, \theta_2), \dots, (V_K, \theta_K)$ for which $|\tilde{x} - C_i| < r_i$. Then
$$\begin{aligned}
F_{\tilde{x}} = {} & \sum_{i=1}^{J} \tilde{V}_i \prod_{j<i} \left(1 - \tilde{V}_j\right) \prod_{j \leq i} \prod_{l=1}^{n_j} \left(1 - V_l^{(j)}\right) \delta_{\tilde\theta_i}\\
& + \sum_{i=1}^{N} \sum_{j=1}^{n_i} V_j^{(i)} \prod_{l<j} \left(1 - V_l^{(i)}\right) \prod_{j'<i} \prod_{l=1}^{n_{j'}} \left(1 - V_l^{(j')}\right) \prod_{m<i} \left(1 - \tilde{V}_m\right) \delta_{\theta_j^{(i)}}\\
& + \sum_{l=J+1}^{\infty} \tilde{V}_l \prod_{m<l} \left(1 - \tilde{V}_m\right) \prod_{i \leq J} \prod_{m=1}^{n_i} \left(1 - V_m^{(i)}\right) \delta_{\tilde\theta_l},
\end{aligned}$$
where $n_j$ is a geometric random variable with success probability $1 - \tilde{p}$, $\theta_j^{(i)} \sim H$, $V_j^{(i)} \sim \mathrm{Be}(1, M)$, $\tilde\theta_m \sim H$ and $\tilde{V}_m \sim \mathrm{Be}(1, M)$ for $m > N$. We calculate $\tilde{p}$ in the following way. If $x_{\min} < \tilde{x} < x_{\max}$, define $i$ so that $x_{(i)} < \tilde{x} < x_{(i+1)}$, where $x_{(1)}, \dots, x_{(n)}$ is an ordered version of $x_1, \dots, x_n$; then $\tilde{p} = \frac{\beta}{2\alpha}\, \tilde{q}$ where
$$\begin{aligned}
\tilde{q} = {} & (x_{(i+1)} - x_{(i)})\, I\left(\frac{x_{(i+1)} - x_{(i)}}{2}\right) + (x_{(i)} - \tilde{x})\, I\left(\frac{\tilde{x} - x_{(i)}}{2}\right) - (x_{(i+1)} - \tilde{x})\, I\left(\frac{x_{(i+1)} - \tilde{x}}{2}\right)\\
& - 2\mu^\star\left(\frac{x_{(i+1)} - x_{(i)}}{2}\right) + 2\mu^\star\left(\frac{\tilde{x} - x_{(i)}}{2}\right) + 2\mu^\star\left(\frac{x_{(i+1)} - \tilde{x}}{2}\right)
\end{aligned}$$
with $I(y) = \int_0^y f(r)\, dr$ and $\mu^\star(y) = \int_0^y r f(r)\, dr$. Otherwise, if $\tilde{x} < x_{\min}$,
$$\tilde{q} = 2\mu^\star\left(\frac{x_{\min} - \tilde{x}}{2}\right) + (x_{\min} - \tilde{x}) \left(1 - I\left(\frac{x_{\min} - \tilde{x}}{2}\right)\right)$$
and if $\tilde{x} > x_{\max}$
$$\tilde{q} = 2\mu^\star\left(\frac{\tilde{x} - x_{\max}}{2}\right) + (\tilde{x} - x_{\max}) \left(1 - I\left(\frac{\tilde{x} - x_{\max}}{2}\right)\right).$$
We use a truncated version of $F_{\tilde{x}}$ with $h$ elements, which are chosen so that $\sum_{i=1}^{h} p_i = 1 - \varepsilon$, where $\varepsilon$ is usually taken to be 0.001.
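For Ga$(\alpha, \beta)$ radii the quantities $I(y)$ and $\mu^\star(y)$ have closed forms in terms of the regularised incomplete gamma function, so $\tilde{q}$ and $\tilde{p}$ are cheap to evaluate. The sketch below (an illustration, not the authors' code) implements the interior case of the formula above.

```python
import numpy as np
from scipy.special import gammainc

# For Ga(alpha, beta) radii, I(y) and mu_star(y) have closed forms via the
# regularised lower incomplete gamma function (scipy's gammainc).
def I(y, alpha, beta):
    return gammainc(alpha, beta * y)

def mu_star(y, alpha, beta):
    return (alpha / beta) * gammainc(alpha + 1.0, beta * y)

# q_tilde for an interior point x_lo < x_tilde < x_hi, mirroring the first
# case above; the boundary cases are analogous. Then
# p_tilde = beta / (2 * alpha) * q_tilde.
def q_tilde_interior(x_tilde, x_lo, x_hi, alpha, beta):
    g, a, b = x_hi - x_lo, x_tilde - x_lo, x_hi - x_tilde
    return (g * I(g / 2, alpha, beta)
            + (x_lo - x_tilde) * I(a / 2, alpha, beta)
            - b * I(b / 2, alpha, beta)
            - 2 * mu_star(g / 2, alpha, beta)
            + 2 * mu_star(a / 2, alpha, beta)
            + 2 * mu_star(b / 2, alpha, beta))
```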
Model 2
This section is restricted to discussing the implementation when $m(x)$ follows a Gaussian process prior, where we define $P_{ij} = \rho(x_i, x_j)$. We also reparametrise from $u_i$ to $\phi_i = \sigma^2 \psi_i$.
Updating φi | s
The full conditional distribution has the density
$$p(\phi_i) \propto \phi_i^{0.5\left(1 - \sum I(s_j = i,\, 1 \leq j \leq n)\right)} \exp\{-0.5\,\phi_i/\sigma^2\}, \qquad \phi_i > \phi_{\min},$$
where $\phi_{\min} = \max\left\{(y_j - m(x_j))^2 \mid s_j = i,\ 1 \leq j \leq n\right\}$. A rejection sampler for this full conditional distribution can be constructed using the envelope
$$h^\star(\phi_i) \propto \begin{cases} \phi_i^{0.5\left(1 - \sum I(s_j = i,\, 1 \leq j \leq n)\right)} & \phi_{\min} < \phi_i < z \\ z^{0.5\left(1 - \sum I(s_j = i,\, 1 \leq j \leq n)\right)} \exp\{-0.5(\phi_i - z)/\sigma^2\} & \phi_i > z \end{cases}$$
which can be sampled using inversion sampling. Writing $k = \sum I(s_j = i,\, 1 \leq j \leq n)$, the acceptance probability is
$$\alpha(\phi_i) = \begin{cases} \exp\{-0.5(\phi_i - \phi_{\min})/\sigma^2\} & \phi_{\min} < \phi_i < z \\ \left(\frac{\phi_i}{z}\right)^{0.5 - 0.5k} \exp\{-0.5(z - \phi_{\min})/\sigma^2\} & \phi_i > z \end{cases}$$
and the choice $z = \sigma^2 \sum I(s_j = i,\, 1 \leq j \leq n)$ maximizes the acceptance rate.
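A sketch of the resulting rejection sampler is given below; it is illustrative rather than the authors' implementation, and the $k = 3$ case (where the first envelope piece integrates to a logarithm) is omitted for brevity.

```python
import numpy as np

# Hedged sketch of the rejection sampler above; k is the number of
# observations allocated to atom i. The k = 3 case (where the first
# envelope piece integrates to a logarithm) is omitted for brevity.
def sample_phi(phi_min, k, sigma2, rng=np.random):
    z = sigma2 * k                       # choice maximising acceptance
    p = 0.5 * (1.0 - k)                  # power of phi_i in the density
    while True:
        # Masses of the two envelope pieces.
        m1 = (z ** (p + 1) - phi_min ** (p + 1)) / (p + 1)
        m2 = z ** p * 2.0 * sigma2       # exponential tail beyond z
        if rng.uniform() < m1 / (m1 + m2):
            # Inversion sampling of phi^p on (phi_min, z).
            u = rng.uniform()
            phi = (phi_min ** (p + 1)
                   + u * (z ** (p + 1) - phi_min ** (p + 1))) ** (1 / (p + 1))
            accept = np.exp(-0.5 * (phi - phi_min) / sigma2)
        else:
            phi = z + rng.exponential(2.0 * sigma2)
            accept = ((phi / z) ** (0.5 - 0.5 * k)
                      * np.exp(-0.5 * (z - phi_min) / sigma2))
        if rng.uniform() < accept:
            return phi
```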
Updating σ^{-2}
Using the prior Gamma$(\nu_1, \nu_2)$, the full conditional distribution of $\sigma^{-2}$ is again a Gamma distribution, where we define $P = (P_{ij})$:
$$\sigma^{-2} \sim \mathrm{Ga}\left(\nu_1 + \frac{3K}{2} + \frac{n}{2},\ \nu_2 + \frac{1}{2}\sum_{i=1}^{K} \phi_i + \frac{1}{2\omega^2}\, m(x)^T P^{-1} m(x)\right).$$
Updating m(x_1), . . . , m(x_n)
It is possible to update $m(x_i)$ using its full conditional distribution. However, this tends to lead to slowly mixing algorithms. A more useful approach uses the transformation $m(x) = C^\star z$, where $C^\star$ is the Cholesky factor of $\sigma_0^2 P$ and $z \sim \mathrm{N}(0, I)$. We then update each $z_j$ from its full conditional distribution, which is a standard normal distribution truncated to the region $\bigcap_{i=1}^{n} \left(y_i - \sum_{k \neq j} C^\star_{ik} z_k - \sqrt{\phi_i},\ y_i - \sum_{k \neq j} C^\star_{ik} z_k + \sqrt{\phi_i}\right)$.
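One $z_j$ update might be sketched as follows, intersecting the interval constraints (dividing through by the Cholesky entries, whose signs determine which side binds) and drawing from the truncated standard normal by inversion; all names are assumptions.

```python
import numpy as np
from scipy import stats

# Sketch of one z_j update. Each observation i with C*_{ij} != 0 gives an
# interval constraint on z_j after dividing through by C*_{ij}; the full
# conditional is a standard normal truncated to their intersection.
def update_z(j, z, Cstar, y, phi, rng=np.random):
    resid = y - Cstar @ z + Cstar[:, j] * z[j]   # y_i - sum_{k != j} C_ik z_k
    c = Cstar[:, j]
    lo, hi = -np.inf, np.inf
    for i in np.nonzero(c)[0]:
        a = (resid[i] - np.sqrt(phi[i])) / c[i]
        b = (resid[i] + np.sqrt(phi[i])) / c[i]
        lo, hi = max(lo, min(a, b)), min(hi, max(a, b))
    # Inverse-CDF draw from N(0, 1) truncated to (lo, hi).
    u = rng.uniform(stats.norm.cdf(lo), stats.norm.cdf(hi))
    z[j] = stats.norm.ppf(u)
    return z
```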
Updating ω
We define $\omega^2 = \sigma^2/\sigma_0^2$. If $\omega^2$ follows a Gamma distribution with parameters $a_0$ and $b_0$ then the full conditional of $\sigma_0^{-2}$ follows a Gamma distribution with parameters $a_0 + n/2$ and $b_0 + \sigma^{-2} m(x)^T P^{-1} m(x)/2$. A similar update applies under the Generalized inverse Gaussian prior used here.
Updating the Matérn parameters
We update any parameters of the Matérn correlation structure by a Metropolis-Hastings random walk. The full conditional distribution of the parameters $(\zeta, \tau)$ is proportional to
$$|P|^{-1/2} \exp\left\{-\frac{1}{2}\,\sigma^{-2}\omega^{-2}\, m(x)^T P^{-1} m(x)\right\} p(\zeta, \tau).$$
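A hedged sketch of this Metropolis-Hastings step is given below; `matern_corr` (building $P$ from the parameters) and `log_prior` are assumed, and the proposal is a random walk on the log scale with the corresponding Jacobian term.

```python
import numpy as np

# Illustrative random-walk MH step for the Matern parameters (zeta, tau);
# `matern_corr` (building P) and `log_prior` are assumed. The walk is on
# the log scale, so a Jacobian term enters the acceptance ratio.
def update_matern(zeta, tau, x, m, sigma2, omega2, matern_corr, log_prior,
                  step=0.1, rng=np.random):
    def log_target(z, t):
        L = np.linalg.cholesky(matern_corr(x, z, t))   # P = L L^T
        a = np.linalg.solve(L, m)                      # m^T P^{-1} m = a.a
        return (-np.sum(np.log(np.diag(L)))            # -0.5 log|P|
                - 0.5 * (a @ a) / (sigma2 * omega2)
                + log_prior(z, t))

    z_new, t_new = np.exp(np.log([zeta, tau]) + rng.normal(0.0, step, 2))
    log_alpha = (log_target(z_new, t_new) - log_target(zeta, tau)
                 + np.log(z_new * t_new / (zeta * tau)))
    if np.log(rng.uniform()) < log_alpha:
        zeta, tau = z_new, t_new
    return zeta, tau
```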