STAT 544
Homework 3 Keys
2008-02-19

1.
(a) The contour plot for the log-likelihood, log L(µ, σ|X1 , ..., X10 ).
[Figure: contour plot of the log-likelihood, with µ on the horizontal axis (3 to 7) and σ on the vertical axis (0.5 to 2.0).]
(b)
For an independent prior, log g(µ, σ²) = log g1(µ) + log g2(σ²). Hence, when µ is fixed, log g(µ, σ) = log g2(σ²) + constant if µ and σ are independent. The following figure shows the difference: the curves in the left-hand panel are parallel, while the curves in the right-hand panel are not.
[Figure: log-g at several fixed µ, plotted against σ (0.5 to 2.0). Left panel: 1(b)(3), same width (parallel curves). Right panel: 1(b)(5), different width (non-parallel curves).]
1. The contour plots of log g(µ, σ) and log L(µ, σ) + log g(µ, σ) with the Jeffreys improper prior g(µ, σ) = 1/σ².
[Figures: contour plots of 1(b)(1) log-g and 1(b)(1) log-g + log-L; µ on the horizontal axis (3 to 7), σ on the vertical axis (0.5 to 2.0).]
More detailed contour plots of (b)(2)-(5) are given in the 1(b) Complement section (2012) below.
2. The contour plots of log g(µ, σ) and log L(µ, σ) + log g(µ, σ) with the prior g(µ, σ) given by µ ∼ N(0, 10²) independent of σ² ∼ Inv-χ²(1, 10²).
[Figures: contour plots of 1(b)(2) log-g and 1(b)(2) log-g + log-L; µ on the horizontal axis (3 to 7), σ on the vertical axis (0.5 to 2.0).]
3. The contour plots of log g(µ, σ) and log L(µ, σ) + log g(µ, σ) with the prior g(µ, σ) given by µ ∼ N(0, 2²) independent of σ² ∼ Inv-χ²(3, 2²).
[Figures: contour plots of 1(b)(3) log-g and 1(b)(3) log-g + log-L; µ on the horizontal axis (0 to 15), σ on the vertical axis (0 to 10).]
4. The contour plots of log g(µ, σ) and log L(µ, σ) + log g(µ, σ) with the prior g(µ, σ) given by µ ∼ N(0, σ²), σ² ∼ Inv-χ²(1, 2²).
[Figures: contour plots of 1(b)(4) log-g and 1(b)(4) log-g + log-L; µ on the horizontal axis (0 to 15), σ on the vertical axis (0 to 10).]
In the 1st to 3rd plots, the shape of the marginal slices is fixed and only the location changes. In the 4th and 5th plots, since the shape of the marginal slices changes, the contours fan out at the right end.
5. The contour plots of log g(µ, σ) and log L(µ, σ) + log g(µ, σ) with the prior g(µ, σ) given by µ ∼ N(0, σ²/10), σ² ∼ Inv-χ²(1, 2²).
[Figures: contour plots of 1(b)(5) log-g and 1(b)(5) log-g + log-L; µ on the horizontal axis (0 to 15), σ on the vertical axis (0 to 10).]
Several ways to find the pdf of a scaled Inv-χ² distribution:
X ∼ Inv-χ²(ν, s²) with pdf f(x) = [(νs²/2)^(ν/2) / Γ(ν/2)] x^(−ν/2−1) exp(−νs²/(2x)), ν > 0, s > 0, x > 0
⇔ X ∼ Inv-Γ(ν/2, νs²/2)
⇔ X = νs²/Y with Y ∼ χ²(ν); get f(x) by the transformation from f(y).
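A numerical sketch of these equivalences (the values of ν and s² below are arbitrary examples, and only base R functions are used):

nu <- 3; s2 <- 4                                               # arbitrary example values of nu and s^2
xx <- seq(.1, 20, by = .1)
f.pdf   <- ((nu*s2/2)^(nu/2) / gamma(nu/2)) * xx^(-nu/2 - 1) * exp(-nu*s2/(2*xx))
f.gamma <- dgamma(1/xx, shape = nu/2, rate = nu*s2/2) / xx^2   # X ~ Inv-Gamma(nu/2, nu*s2/2)
f.chisq <- dchisq(nu*s2/xx, df = nu) * nu*s2/xx^2              # X = nu*s2/Y with Y ~ chi^2(nu)
max(abs(f.pdf - f.gamma), abs(f.pdf - f.chisq))                # ~ 0: all three representations agree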
R code:
gsize <- 50
x <- c(4.90791,4.83425,5.19801,6.85308,4.07085,4.66076,4.16201,5.49752,4.15438,3.72703)
mu    <- 3 + seq(-3, 14, length = gsize)
sigma <- .1 + seq(0, 10, length = gsize)
theta <- expand.grid(mu, sigma)
logL <- function(theta) { sum(dnorm(x, theta[1], theta[2], log = TRUE)) }
z <- apply(theta, 1, logL)
z <- matrix(z, gsize, gsize)
library(geoR)   # provides dinvchisq()
logG1 <- function(theta) { -2*log(theta[2]) }
logG2 <- function(theta) { dnorm(theta[1], mean=0, sd=10, log=TRUE) + dinvchisq(theta[2]^2, 1, 100, log=TRUE) }
logG3 <- function(theta) { dnorm(theta[1], mean=0, sd=2, log=TRUE) + dinvchisq(theta[2]^2, 3, 4, log=TRUE) }
logG4 <- function(theta) { dnorm(theta[1], mean=0, sd=theta[2], log=TRUE) + dinvchisq(theta[2]^2, 1, 4, log=TRUE) }
logG5 <- function(theta) { dnorm(theta[1], mean=0, sd=theta[2]/sqrt(10), log=TRUE) + dinvchisq(theta[2]^2, 1, 4, log=TRUE) }
w <- matrix(apply(theta, 1, logG1), gsize, gsize)
contour(mu, sigma, w,     main = "1(b)(1) log-g",         xlab = "mu", ylab = "sigma")
contour(mu, sigma, w + z, main = "1(b)(1) log-g + log-L", xlab = "mu", ylab = "sigma")
# These give a less detailed contour plot first and provide information for finding better values of "levels".
# (Replace logG1 by logG2, ..., logG5 and rerun to produce the plots for 1(b)(2)-(5).)
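For instance, after inspecting the rough plot one might redraw it with explicit levels (a sketch reusing mu, sigma, w, and z from the code above; the offsets are illustrative, not the values used for the figures):

post <- w + z
contour(mu, sigma, post, levels = max(post) - c(.5, 1, 2, 4, 8, 16),
        main = "1(b)(1) log-g + log-L", xlab = "mu", ylab = "sigma")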
2
(a)
The WinBUGS code specifies a set of bivariate Normal data Yi = (Y1i, Y2i)^T ∼ MVN(µ, Σ = D = R^(−1)) for i = 1, ..., 10. The prior for µ is µ ∼ MVN(Alpha = (0, 0)^T, Σ0 = (100 0; 0 100)), i.e. tau = Σ0^(−1) = (0.01 0; 0 0.01). The prior for Σ is Σ = D = R^(−1) ∼ Inv-Wishart_(ν=4) with Lambda^(−1) = ∆ = (4 0; 0 4)^(−1). Therefore, Σ has prior mean (ν − k − 1)^(−1) Lambda = (4 0; 0 4).
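As a quick arithmetic check (a small R sketch; the helper name iw_prior_mean is illustrative only), this prior-mean formula can be evaluated for the Lambda and ν used here and in part (b) below:

k <- 2
iw_prior_mean <- function(Lambda, nu) Lambda / (nu - k - 1)   # Inv-Wishart prior mean
iw_prior_mean(diag(c(4, 4)),   nu = 4)    # part (a): diag(4, 4)
iw_prior_mean(diag(c(68, 68)), nu = 20)   # part (b): 68/17 = 4, the same prior mean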
The following are summary statistics from WinBUGS for the 5 parameters, using 4 chains with default-generated initial values, keeping iterations 90000 to 100000 with thin 10 to get 4000 samples.
(b)
The following are summary statistics from WinBUGS for the 5 parameters, setting ν = 20 and Lambda = (68 0; 0 68) so as to keep the same prior mean for Σ (68/(20 − 2 − 1) = 4), using 4 chains with default-generated initial values, keeping iterations 90000 to 100000 with thin 10 to get 4000 samples.
Since this prior suggests larger σ, and the degrees of freedom ν = 20 were increased to hold the prior mean of Σ fixed when changing Lambda, the posterior shows larger means for σ and a smaller ρ, together with wider credible intervals for µ, narrower credible intervals for σ, and a narrower interval for ρ.
(c)
The following are summary statistics from WinBUGS for the 5 parameters, setting Σ0 = τ0 = I_(2×2), using 4 chains with default-generated initial values, keeping iterations 90000 to 100000 with thin 10 to get 4000 samples.
Since the prior for µ suggests a more concentrated distribution, the posterior means are pulled closer to the prior means of zero. That also increases the posterior standard deviations and induces wider credible intervals.
(d)
Summary statistics from WinBUGS for the 5 parameters, setting Lambda = (4 −3; −3 4), using 4 chains with default-generated initial values, keeping iterations 90000 to 100000 with thin 10 to get 4000 samples.
Compared to (b), this prior suggests a negative correlation between the 2 variables. The posterior distribution reflects this change.
(e)
Summary statistics from WinBUGS for the 5 parameters, setting Lambda = (40 0; 0 40), using 4 chains with default-generated initial values, keeping iterations 90000 to 100000 with thin 10 to get 4000 samples.
Compared to (b), this prior for σ suggests a larger expectation. The posterior distribution reflects this change.
The following table shows the estimates of the means and the covariance structure, with the corresponding credible intervals and the lengths of these intervals.
(a)
Note     Mean       2.5%      97.5%     Length
mu[1]    4.892      3.865     6.009     2.144
mu[2]    4.254      3.04      5.384     2.344
rho      0.5822     -0.171    0.9181    1.0891
sig1     1.43       0.8715    2.425     1.5535
sig2     1.626      1.048     2.63      1.582

(b)
Note     Mean       2.5%      97.5%     Length
mu[1]    4.664      3.327     6.077     2.75
mu[2]    4.456      3.111     5.759     2.648
rho      0.06784 ↓  -0.3219   0.4437    0.7656
sig1     1.823      1.38      2.43      1.05
sig2     1.871      1.423     2.485     1.062

(c)
Note     Mean       2.5%      97.5%     Length
mu[1]    2.257 ↓    0.1471    3.971     3.8239
mu[2]    1.414 ↓    -0.4374   3.205     3.6424
rho      0.8887 ↑   0.5471    0.9855    0.4384
sig1     3.061 ↑    1.276     6.134     4.858
sig2     3.071 ↑    1.573     5.451     3.878

(d)
Note     Mean       2.5%      97.5%     Length
mu[1]    4.748      3.703     5.837     2.134
mu[2]    4.383      3.153     5.548     2.395
rho      0.2609 ↓   -0.5033   0.8277    1.331
sig1     1.379      0.8584    2.378     1.5196
sig2     1.612      1.038     2.621     1.583

(e)
Note     Mean       2.5%      97.5%     Length
mu[1]    4.638      2.644     6.634     3.99
mu[2]    4.432      2.527     6.297     3.77
rho      0.09748 ↓  -0.4965   0.6533    1.1498
sig1     2.55       1.622     4.342     2.72
sig2     2.584      1.656     4.15      2.494
STAT 544
Homework 3 Keys
2012-03-07

1(b) Complement (the supports of µ and σ from the original keys are not wide enough)
[Figures: more detailed contour plots of 1(b)(2)-(5) over wider supports of µ and σ.]
3. With k = 3 and D = ∆ = diag(.1, 1, 10), the following OpenBUGS/WinBUGS code can sample from Inverse-Wishart(ν = k + 1 = 4, Lambda = ∆^(−1) = diag(10, 1, .1)).
#########################################################################
model {
  R[1:3, 1:3] ~ dwish(Lambda[,], nu)
  Sigma[1:3, 1:3] <- inverse(R[1:3, 1:3])
  diag1 <- Sigma[1,1]
  diag2 <- Sigma[2,2]
  diag3 <- Sigma[3,3]
  rho12 <- Sigma[1,2] / sqrt(diag1 * diag2)
  rho13 <- Sigma[1,3] / sqrt(diag1 * diag3)
  rho23 <- Sigma[2,3] / sqrt(diag2 * diag3)
}
list(nu = 4,
     Lambda = structure(.Data = c(10,0,0,0,1,0,0,0,.1), .Dim = c(3, 3))
)
#########################################################################
The following are the statistical results from OpenBUGS, with initial values generated by default, beginning at iteration 10000 with thin 10 to get 50000 samples. Note that the medians of the diagonal elements are 7.165, .7209, and .07149, which are very close to .72/δ11 = 7.2, .72/δ22 = .72, and .72/δ33 = .072. The densities of the correlations are flat on (-1, 1), which agrees with the claim that all correlations are uniformly distributed on (-1, 1).
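The same check can be done with a quick Monte Carlo in R (a sketch; it assumes the usual BUGS convention that dwish(Lambda, nu) corresponds to a Wishart with scale Lambda^(−1), so base R's rWishart is called with Sigma = Lambda^(−1) = ∆):
#########################################################################
# Sketch: draw precision matrices R ~ Wishart(nu = 4, scale = Delta), invert to get
# Sigma, and check the medians of the diagonals and the flatness of the correlations.
set.seed(1)
Delta <- diag(c(.1, 1, 10))                  # so Lambda = Delta^{-1} = diag(10, 1, .1)
W   <- rWishart(50000, df = 4, Sigma = Delta)
Sig <- apply(W, 3, solve)                    # each column is a vectorized 3x3 Sigma
apply(Sig[c(1, 5, 9), ], 1, median)          # medians of Sigma[1,1], [2,2], [3,3]: about 7.2, .72, .072
rho12 <- Sig[2, ] / sqrt(Sig[1, ] * Sig[5, ])
hist(rho12, breaks = 40)                     # roughly flat on (-1, 1)
#########################################################################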
4. An error message shows up when running the given code: the updater cannot sample the node. This is not a probability model for (Y, X), because
f(x|y) = (1/√(2π)) e^(−(x−y)²/2),  f(y) = c with c > 0 (so f(y) is not a density), and f(x, y) ∝ e^(−(x−y)²/2);
∫∫ f(x, y) dx dy = ∫∫ f(x|y) f(y) dx dy = ∫ [∫ f(x|y) dx] f(y) dy = ∫ f(y) dy = c ∫ dy = ∞.
The Gibbs sampling algorithm for f(x, y) ∝ e^(−(x−y)²/2):
(1) Start with y^(0) = 0, x^(0) = 0.
(2) For j = 1, 2, ..., N, generate y^(j) ∼ N(x^(j−1), 1); generate x^(j) ∼ N(y^(j), 1).
Similar to question 2(c) in the 2008 midterm exam, the sequence y^(1), x^(1), y^(2), x^(2), ... is a normal random walk. Its variance increases without bound (note that Var(y^(n)) = 2(n − 1) → ∞), so the empirical distribution will not converge; both marginals spread out as n → ∞.
##################################
Y = X = rep(NA, 1001)
Y[1] = X[1] = 0
for (j in 2:1001) {
Y[j] <- rnorm(1,X[j-1],1)
X[j] <- rnorm(1,Y[j],1)
}
##################################
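To see the non-convergence numerically, one can also replicate the chain many times and track the spread of y^(j) across replicates (a sketch; the number of replicates and the chain length below are arbitrary choices):
##################################
# Sketch: the variance of y^(j) across replicate chains grows roughly linearly in j.
set.seed(1)
nrep <- 2000; nstep <- 200
Y <- matrix(0, nrep, nstep)          # Y[, j] stores y^(j) for each replicate chain
Xcur <- rep(0, nrep)                 # current x for each replicate, started at 0
for (j in 1:nstep) {
  Y[, j] <- rnorm(nrep, Xcur, 1)     # y^(j) ~ N(x^(j-1), 1)
  Xcur   <- rnorm(nrep, Y[, j], 1)   # x^(j) ~ N(y^(j), 1)
}
plot(1:nstep, apply(Y, 2, var), xlab = "j", ylab = "Var(y^(j))")
##################################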
5. Metropolis algorithm with proposal J(x|x′) = (1/(√(2π)σ)) e^(−(x−x′)²/(2σ²)):
(1) Start at the initial point x^(0) = 0.
(2) For j = 1, 2, ..., N,
    Generate the candidate x* from N(x^(j−1), σ²).
    Calculate r^(j) = [f(x*)/J(x*|x^(j−1))] / [f(x^(j−1))/J(x^(j−1)|x*)] = f(x*)/f(x^(j−1)) = [.9φ(x*) + .1φ(x* − 10)] / [.9φ(x^(j−1)) + .1φ(x^(j−1) − 10)].
    Generate y^(j) ∼ Bernoulli(min(r^(j), 1)).
    Take x^(j) = y^(j) x* + (1 − y^(j)) x^(j−1).
With the following R code we can get the history plots for σ = .01, .1, 1, 10, 100.
######################################################################################
sigma = c(.01,.1,1,10,100)
x = matrix(0,nrow=1001,ncol=5)
for (i in 1:5){
for (j in 2:1001) {
cand = rnorm(1, x[j-1,i], sigma[i])
r = (.9*dnorm(cand)+.1*dnorm(cand-10))/(.9*dnorm(x[j-1,i])+.1*dnorm(x[j-1,i]-10))
u = runif(1)
x[j,i] = ifelse(u<r, cand, x[j-1,i])
}
}
######################################################################################
Discussion of tuning σ: From the history plots above, we can see that as σ increases, fewer candidates are accepted, i.e., the rejection rate increases. In that case it takes too long to figure out the transfer probabilities between states. However, if we unify the range of x across panels, as shown below, it is easy to see that when σ is small the chain moves so slowly that it cannot reach the other 'island' within 1000 iterations. To sum up, we want σ to be neither too big nor too small; the jumping distribution of a Metropolis algorithm should be tuned carefully.
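A quick way to quantify this is to compute the empirical acceptance rate for each σ (a sketch that assumes the matrix x and the vector sigma from the R code above are still in the workspace):
######################################################################################
# Sketch: empirical acceptance rate per sigma. A rejected proposal repeats the previous
# value, so counting changed values approximates the acceptance rate.
acc <- apply(x, 2, function(chain) mean(diff(chain) != 0))
names(acc) <- paste0("sigma=", sigma)
round(acc, 3)     # the rate drops as sigma grows
######################################################################################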
[Figures: history plots of the chains for σ = .01, .1, 1, 10, 100, with and without a unified range of x.]
6. (a) Gibbs algorithm with the conditional distributions X1|X2 = x2 ∼ N(.99x2, .0199) and X2|X1 = x1 ∼ N(.99x1, .0199):
(1) Start at the initial point x1^(0) = 0, x2^(0) = 0.
(2) For j = 1, 2, ..., N,
    sample x1^(j) from N(.99 x2^(j−1), .0199);
    sample x2^(j) from N(.99 x1^(j), .0199).
########################################
N = 1000
sigma = sqrt(.0199)
x = matrix(0,nrow=N+1,ncol=2)
for (j in 2:(N+1)) {
x[j,1] = rnorm(1,.99*x[j-1,2],sigma)
x[j,2] = rnorm(1,.99*x[j,1],sigma)
}
########################################
To compare the empirical distributions across iteration counts N = 1000, 5000, 10000, I plot the samples in two dimensions, with contour lines from the theoretical bivariate normal with µ and Σ defined in the question. We can see that when N = 1000 the points are mostly scattered within the inner contour line; the shape of the distribution looks like a linear band instead of an ellipse, and the tails are unlike a normal distribution. When N increases to 5000 the shape looks more like an ellipse, while at N = 10000 the sample fits the theoretical contour lines well. Checking the marginal distributions of X1 and X2 with histograms or QQ plots (see the sketch after the figures below), N = 5000 is better than 1000 and seems enough to see the Normal.
[Figures: scatter plots with theoretical contours, and marginal plots of X1 and X2, for N = 1000, 5000, 10000.]
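As mentioned above, the marginals can be checked with a histogram or a QQ plot; a minimal sketch (reusing the matrix x from the Gibbs code in part (a), whose stationary marginals are N(0, 1)):
########################################
# Sketch: marginal checks for X1 from the Gibbs output above (stationary marginal N(0, 1)).
hist(x[, 1], breaks = 30, freq = FALSE, main = "X1", xlab = "x1")
curve(dnorm(z), add = TRUE, xname = "z")     # overlay the N(0, 1) density
qqnorm(x[, 1]); qqline(x[, 1])
########################################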
(b) Gibbs algorithm with the conditional distributions Z1|Z2 ∼ N(0, 1) and Z2|Z1 ∼ N(0, 1):
(1) Start at the initial point z1^(0) = 0, z2^(0) = 0.
(2) For j = 1, 2, ..., N, sample z1^(j) and z2^(j) from N(0, 1).
Actually Z1 and Z2 are independent here. Using code similar to part (a) and taking N = 100, 500, 1000, I get the following plots. 100 iterations seem not enough to see the standard bivariate normal. N = 500 is probably an appropriate iteration number, because in the scatter plot the points are dense in the center and sparse outside with a circular shape, and the marginal distributions look Normal. With 1000 iterations the bivariate normal shape looks even clearer.
[Figures: scatter plots and marginal plots of Z1 and Z2 for N = 100, 500, 1000.]
For 1000 pairs of (z1, z2)^T, calculate
(y1, y2)^T = ( √1.99/√2   √.01/√2 ; √1.99/√2   −√.01/√2 ) (z1, z2)^T.
The empirical joint distribution and marginal distributions are shown below.
[Figure: joint and marginal distributions of (Y1, Y2) for N = 1000.]
Note that the theoretical distribution of (Y1, Y2) is the same as that of (X1, X2) in part (a). To prove this, use the fact that if Y = AX, where A is a known k×k matrix and X ∼ MVN_k(µ, Σ), then Y ∼ MVN_k(Aµ, AΣA^T). Hence it is easy to show
(y1, y2)^T ∼ MVN_2( (0, 0)^T, (1 .99; .99 1) ).
Compare the empirical distribution of (Y1, Y2) with that of (X1, X2) for N = 1000. The following plots show that although the two methods sample from the same distribution, with the same number of iterations and both using the Gibbs algorithm, the empirical distribution of the transformed standard Normal looks better than that obtained by sampling directly from the original bivariate Normal. This is an issue we discussed in class: sometimes it is slow to sample from the original distribution, so we could consider different parameterizations.
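For the specific A above, a two-line R sketch confirms that A A^T recovers the target covariance:
########################################
# Sketch: A A^T equals (1, .99; .99, 1), so Y = A Z has the distribution of part (a).
A <- matrix(c(sqrt(1.99), sqrt(1.99), sqrt(.01), -sqrt(.01)), nrow = 2) / sqrt(2)
A %*% t(A)
########################################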
(c) Metropolis algorithm using the proposal
J((x1, x2)|(x1′, x2′)) = (1/(2πc)) exp(−[(x1 − x1′)² + (x2 − x2′)²]/(2c)).
Denote by f(x1, x2) the density of MVN_2( (0, 0)^T, (1 .99; .99 1) ).
(1) Start at the initial point (x1^(0), x2^(0))^T = (0, 0)^T.
(2) For j = 1, 2, ..., N,
    Generate the candidate (x1*, x2*)^T from MVN_2( (x1^(j−1), x2^(j−1))^T, cI ).
    Calculate r^(j) = [f(x1*, x2*) / J((x1*, x2*)|(x1^(j−1), x2^(j−1)))] / [f(x1^(j−1), x2^(j−1)) / J((x1^(j−1), x2^(j−1))|(x1*, x2*))] = f(x1*, x2*) / f(x1^(j−1), x2^(j−1)).
    Generate y^(j) ∼ Bernoulli(min(r^(j), 1)).
    Take (x1^(j), x2^(j))^T = y^(j) (x1*, x2*)^T + (1 − y^(j)) (x1^(j−1), x2^(j−1))^T.
###########################################################
library(LearnBayes)
N = 1000
ct = c(.01,.1,1,10)
mu = c(0,0)
sigma = matrix(c(1,.99,.99,1),nrow=2)
x = array(0,dim=c(N+1,2,length(ct)))
cand = c(NA,NA)
for (i in 1:length(ct)) {
  for (j in 2:(N+1)) {
    x[j,,i] = x[j-1,,i]
    cand[1] = rnorm(1,x[j-1,1,i],sqrt(ct[i]))
    cand[2] = rnorm(1,x[j-1,2,i],sqrt(ct[i]))
    r = dmnorm(cand,mu,sigma)/dmnorm(x[j-1,,i],mu,sigma)
    u = runif(1)
    if (u<r) {
      x[j,,i] = cand
    }
  }
}
###########################################################
Discussion of tuning c: From the history plots and the scatter plots for different c, we see that when c is small (e.g., 0.01), the acceptance rate is high, but each step is so tiny that during the 1000 iterations both X1 and X2 only move within a relatively small area. However, if c is large (e.g., 10), the steps are big but the acceptance rate is so low that we obtain only a few distinct updates in 1000 iterations. Considering both the step size and the acceptance rate, c = .1 is the best choice of the four.