STAT544 Homework 3 2008-02-19 Keys

1 (a) The contour plot of the log-likelihood log L(µ, σ | X1, ..., X10).

[Figure: contour plot of LogLikelihood; µ on the horizontal axis (3 to 7), σ on the vertical axis (0.5 to 2.0).]

(b) For an independent prior, log g(µ, σ²) = log g1(µ) + log g2(σ²). Hence, when µ is held fixed, log g(µ, σ) = log g2(σ²) + constant if µ and σ are independent, so the slices of log g at different fixed values of µ differ only by a vertical shift. The following figure shows the contrast (a numerical check appears after plot 5 below): the curves in the left panel are parallel, but the curves in the right panel are not.

[Figure: log g plotted against σ for several fixed values of µ. Left panel, "1(b)(3) log G": curves of the same width (parallel). Right panel, "1(b)(5) log G": curves of different widths.]

1. The contour plots of log g(µ, σ) and of log L(µ, σ) + log g(µ, σ) under the Jeffreys improper prior g(µ, σ) = 1/σ².

[Figure: contour plots "1(b)(1) log-g" and "1(b)(1) log-g + log L"; mu on the horizontal axis, sigma on the vertical axis.]

Please see the 2012 complement at the end of these keys for more detailed contour plots of (b)(2)-(5).

2. The contour plots of log g(µ, σ) and log L(µ, σ) + log g(µ, σ) under the prior µ ∼ N(0, 10²) independent of σ² ∼ Inv-χ²(1, 10²).

[Figure: contour plots "1(b)(2) log-g" and "1(b)(2) log-g + log L".]

3. The contour plots of log g(µ, σ) and log L(µ, σ) + log g(µ, σ) under the prior µ ∼ N(0, 2²) independent of σ² ∼ Inv-χ²(3, 2²).

[Figure: contour plots "1(b)(3) log-g" and "1(b)(3) log-g + log L".]

4. The contour plots of log g(µ, σ) and log L(µ, σ) + log g(µ, σ) under the prior µ ∼ N(0, σ²), σ² ∼ Inv-χ²(1, 2²).

[Figure: contour plots "1(b)(4) log-g" and "1(b)(4) log-g + log L".]

In the 1st to 3rd plots, the shape of the marginal slices is fixed and only their location changes. In the 4th and 5th plots, since the shape of the marginal slices changes with µ, the contour plots fan out at the right end.

5. The contour plots of log g(µ, σ) and log L(µ, σ) + log g(µ, σ) under the prior µ ∼ N(0, σ²/10), σ² ∼ Inv-χ²(1, 2²).

[Figure: contour plots "1(b)(5) log-g" and "1(b)(5) log-g + log L".]
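As a numerical check of the parallel-slices diagnostic in (b) (not part of the original keys): slice two of the log-priors at two fixed values of µ and take the difference across σ. For the independent prior (b)(3) the difference is constant in σ, while for the dependent prior (b)(5) it varies. A minimal R sketch, assuming geoR's dinvchisq behaves as in the contour code below:

######################################################################################
# Difference of log-prior slices at mu = 3 and mu = 6, across a grid of sigma.
library(geoR)   # for dinvchisq (scaled inverse chi-square density)
sigma <- seq(.5, 2, length = 5)
logG3 <- function(mu, s) dnorm(mu, 0, 2, log = TRUE) + dinvchisq(s^2, 3, 4, log = TRUE)
logG5 <- function(mu, s) dnorm(mu, 0, s/sqrt(10), log = TRUE) + dinvchisq(s^2, 1, 4, log = TRUE)
logG3(3, sigma) - logG3(6, sigma)   # constant vector: parallel slices (independent prior)
logG5(3, sigma) - logG5(6, sigma)   # varies with sigma: non-parallel (dependent prior)
######################################################################################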
Several ways to find the pdf of a scaled Inv-χ² distribution:

X ∼ Inv-χ²(ν, s²) with pdf f(x) = [(νs²/2)^(ν/2) / Γ(ν/2)] x^(−ν/2−1) exp(−νs²/(2x)), ν > 0, s² > 0, x > 0
⇔ X ∼ Inv-Γ(ν/2, νs²/2)
⇔ X = νs²/Y with Y ∼ χ²(ν); get f(x) from f(y) by a change of variables.

R code:

######################################################################################
gsize <- 50
x <- c(4.90791, 4.83425, 5.19801, 6.85308, 4.07085,
       4.66076, 4.16201, 5.49752, 4.15438, 3.72703)
mu    <- 3 + seq(-3, 14, length = gsize)
sigma <- .1 + seq(0, 10, length = gsize)
theta <- expand.grid(mu, sigma)

library(geoR)   # for dinvchisq
logL  <- function(theta) sum(dnorm(x, theta[1], theta[2], log = TRUE))
logG1 <- function(theta) -2*log(theta[2])   # Jeffreys prior g = 1/sigma^2
logG2 <- function(theta) dnorm(theta[1], mean = 0, sd = 10, log = TRUE) +
                         dinvchisq(theta[2]^2, 1, 100, log = TRUE)
logG3 <- function(theta) dnorm(theta[1], mean = 0, sd = 2, log = TRUE) +
                         dinvchisq(theta[2]^2, 3, 4, log = TRUE)
logG4 <- function(theta) dnorm(theta[1], mean = 0, sd = theta[2], log = TRUE) +
                         dinvchisq(theta[2]^2, 1, 4, log = TRUE)
logG5 <- function(theta) dnorm(theta[1], mean = 0, sd = theta[2]/sqrt(10), log = TRUE) +
                         dinvchisq(theta[2]^2, 1, 4, log = TRUE)

z <- matrix(apply(theta, 1, logL),  gsize, gsize)   # log-likelihood surface
w <- matrix(apply(theta, 1, logG1), gsize, gsize)   # log-prior surface (prior 1; swap in logG2-logG5 as needed)
contour(mu, sigma, w,     main = "1(b)(1) log-g",         xlab = "mu", ylab = "sigma")
contour(mu, sigma, w + z, main = "1(b)(1) log-g + log L", xlab = "mu", ylab = "sigma")
# Without an explicit "levels" argument these give coarser contour plots,
# which help in choosing better values of "levels" for the detailed plots.
######################################################################################

2 (a) The WinBUGS code specifies a set of bivariate Normal data Yi = (Y1i, Y2i)ᵀ ∼ MVN(µ, Σ = R⁻¹ = D) for i = 1, ..., 10. The prior on µ is µ ∼ MVN(α0 = (0, 0)ᵀ, Σ0 = diag(100, 100)); i.e., in the code the prior precision is tau = Σ0⁻¹ = diag(0.01, 0.01). The prior on Σ is Σ = D = R⁻¹ ∼ Inv-Wishart(ν = 4, Lambda), with Lambda⁻¹ = ∆ = diag(4, 4)⁻¹, i.e., Lambda = diag(4, 4). Therefore Σ has prior mean (ν − k − 1)⁻¹ Lambda = diag(4, 4), since k = 2 here.

The following are the summary statistics from WinBUGS for the 5 parameters (mu[1], mu[2], rho, sig1, sig2), with 4 chains, initial values generated by default, iterations 90000 to 100000 with thinning 10, giving 4000 retained samples. [The results for (a)-(e) are collected in the table at the end of this problem.]

(b) The following are the summary statistics from WinBUGS for the 5 parameters with ν = 20 and Lambda = diag(68, 68), chosen to hold the same prior mean for Σ (a quick check appears before the summary table below); again 4 chains, default initial values, iterations 90000 to 100000 with thinning 10, giving 4000 samples.

Since the prior is more informative about σ, with ν increased to 20 and Lambda rescaled to hold the prior mean of Σ fixed, it induces higher posterior means for sig1 and sig2 and a smaller ρ; it also induces wider credible intervals for mu[1] and mu[2], and narrower intervals for sig1, sig2, and ρ.

(c) The following are the summary statistics from WinBUGS for the 5 parameters with Σ0 = τ0 = I2×2; again 4 chains, default initial values, iterations 90000 to 100000 with thinning 10, giving 4000 samples.

Since the prior for µ is much more concentrated, the posterior means are pulled toward the prior mean of zero. That also increases the posterior standard deviations of sig1 and sig2 and widens their credible intervals.

(d) Summary statistics from WinBUGS for the 5 parameters with Lambda = (4 −3; −3 4); again 4 chains, default initial values, iterations 90000 to 100000 with thinning 10, giving 4000 samples.

Compared to (b), this prior suggests a negative correlation between the two variables. The posterior distribution reflects this change.

(e) Summary statistics from WinBUGS for the 5 parameters with Lambda = diag(40, 40); again 4 chains, default initial values, iterations 90000 to 100000 with thinning 10, giving 4000 samples.

Compared to (b), this prior suggests a higher expectation for σ. The posterior distribution reflects this change.
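Before the summary table, a quick arithmetic check (not from the original keys) of the claim in (b) that ν = 20 with Lambda = diag(68, 68) preserves the prior mean of Σ. For an Inv-Wishart(ν, Lambda) of dimension k the mean is Lambda/(ν − k − 1), and 68/(20 − 2 − 1) = 68/17 = 4:

######################################################################################
# Prior mean of Sigma under Inv-Wishart(nu, Lambda), dimension k:
#   E[Sigma] = Lambda / (nu - k - 1)
iw_mean <- function(nu, Lambda, k = nrow(Lambda)) Lambda / (nu - k - 1)
iw_mean(4,  diag(c(4, 4)))    # part (a): diag(4, 4)
iw_mean(20, diag(c(68, 68)))  # part (b): also diag(4, 4)
######################################################################################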
The following table shows the estimated means and covariance structure, with the corresponding 95% credible intervals and their lengths. The arrows apparently mark notable shifts relative to (a).

             mu[1]     mu[2]     rho        sig1      sig2
(a) Mean     4.892     4.254     0.5822     1.43      1.626
    2.5%     3.865     3.04     -0.171      0.8715    1.048
    97.5%    6.009     5.384     0.9181     2.425     2.63
    Length   2.144     2.344     1.0891     1.5535    1.582

(b) Mean     4.664     4.456     0.06784↓   1.823     1.871
    2.5%     3.327     3.111    -0.3219     1.38      1.423
    97.5%    6.077     5.759     0.4437     2.43      2.485
    Length   2.75      2.648     0.7656     1.05      1.062

(c) Mean     2.257↓    1.414↓    0.8887↑    3.061↑    3.071↑
    2.5%     0.1471   -0.4374    0.5471     1.276     1.573
    97.5%    3.971     3.205     0.9855     6.134     5.451
    Length   3.8239    3.6424    0.4384     4.858     3.878

(d) Mean     4.748     4.383     0.2609↓    1.379     1.612
    2.5%     3.703     3.153    -0.5033     0.8584    1.038
    97.5%    5.837     5.548     0.8277     2.378     2.621
    Length   2.134     2.395     1.331      1.5196    1.583

(e) Mean     4.638     4.432     0.09748↓   2.55      2.584
    2.5%     2.644     2.527    -0.4965     1.622     1.656
    97.5%    6.634     6.297     0.6533     4.342     4.15
    Length   3.99      3.77      1.1498     2.72      2.494

STAT 544 2012-03-07 Homework 3 Keys

1(b) Complement: more detailed contour plots of 1(b)(2)-(5), redrawn on a wider grid (the supports of µ and σ in the original keys are not wide enough).

[Figure: redrawn contour plots for priors 1(b)(2)-(5) over wider ranges of µ and σ.]

3. With k = 3 and D = ∆ = diag(.1, 1, 10), the following OpenBUGS/WinBUGS code samples from the Inverse-Wishart with ν = k + 1 = 4 and Lambda = ∆⁻¹ = diag(10, 1, .1).

#########################################################################
model {
  R[1:3, 1:3] ~ dwish(Lambda[ , ], nu)
  Sigma[1:3, 1:3] <- inverse(R[1:3, 1:3])
  diag1 <- Sigma[1,1]
  diag2 <- Sigma[2,2]
  diag3 <- Sigma[3,3]
  rho12 <- Sigma[1,2] / sqrt(diag1 * diag2)
  rho13 <- Sigma[1,3] / sqrt(diag1 * diag3)
  rho23 <- Sigma[2,3] / sqrt(diag2 * diag3)
}
list(nu = 4, Lambda = structure(.Data = c(10,0,0, 0,1,0, 0,0,.1), .Dim = c(3, 3)))
#########################################################################

The following are the statistical results from OpenBUGS, with initial values generated by default, a burn-in of 10000 iterations, and thinning of 10, to get 50000 samples.

[Figure: OpenBUGS summary statistics and density plots for diag1-diag3 and rho12, rho13, rho23.]

Note that the medians of the diagonal elements are 7.165, .7209, and .07149, which are very close to .72/δ11 = 7.2, .72/δ22 = .72, and .72/δ33 = .072. The densities of the correlations are flat on (−1, 1), agreeing with the claim that all correlations are uniformly distributed on (−1, 1).

4. An error message shows up when running the given code: the updater cannot sample the node. This is not a probability model for (Y, X), because

f(x|y) = (1/√(2π)) e^(−(x−y)²/2),   f(y) = c > 0 (so f(y) is not a density);

∫∫ f(x, y) dx dy = ∫∫ f(x|y) f(y) dx dy = ∫ [∫ f(x|y) dx] f(y) dy = ∫ f(y) dy = ∫ c dy = ∞.

The Gibbs sampling algorithm for f(x, y) ∝ e^(−(x−y)²/2):
(1) Start with y^(0) = 0, x^(0) = 0.
(2) For j = 1, 2, ..., N, generate y^(j) ∼ N(x^(j−1), 1), then x^(j) ∼ N(y^(j), 1).

As in question 2(c) of the 2008 midterm exam, the sequence y^(1), x^(1), y^(2), x^(2), ... is a normal random walk: each draw adds an independent N(0, 1) increment, so the variance grows without bound, Var(y^(n)) = 2n − 1 → ∞ (a simulation check follows the code below). The empirical distribution will not converge, and both marginals spread out as n → ∞.

##################################
# The "Gibbs sampler" for the improper model is a pure normal random walk.
Y = X = rep(NA, 1001)
Y[1] = X[1] = 0
for (j in 2:1001) {
  Y[j] <- rnorm(1, X[j-1], 1)
  X[j] <- rnorm(1, Y[j], 1)
}
##################################
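A quick simulation check of the variance claim (a sketch, not part of the original keys): replicate the chain many times and compute the sample variance of y^(n) after n sweeps, which should be close to 2n − 1.

##################################
# Monte Carlo check: Var(y^(n)) for the random walk above is 2n - 1.
set.seed(1)
M <- 5000; n <- 200
yn <- replicate(M, {
  y <- 0; x <- 0
  for (j in 1:n) { y <- rnorm(1, x, 1); x <- rnorm(1, y, 1) }
  y
})
var(yn)   # close to 2*n - 1 = 399
##################################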
5. Metropolis algorithm with proposal J(x|x′) = (1/(√(2π)σ)) e^(−(x−x′)²/(2σ²)) for the target f(x) ∝ .9φ(x) + .1φ(x − 10):

(1) Start at the initial point x^(0) = 0.
(2) For j = 1, 2, ..., N:
    Generate the candidate x* from N(x^(j−1), σ²).
    Calculate
      r^(j) = [f(x*)/J(x*|x^(j−1))] / [f(x^(j−1))/J(x^(j−1)|x*)] = f(x*)/f(x^(j−1))
            = [.9φ(x*) + .1φ(x* − 10)] / [.9φ(x^(j−1)) + .1φ(x^(j−1) − 10)]
    (the proposal is symmetric, so the J terms cancel).
    Generate y^(j) ∼ Bernoulli(min(r^(j), 1)).
    Take x^(j) = y^(j) x* + (1 − y^(j)) x^(j−1).

With the following R code we can get the history plots for σ = .01, .1, 1, 10, 100.

######################################################################################
# Random-walk Metropolis for the two-component normal mixture, one chain per sigma.
sigma = c(.01, .1, 1, 10, 100)
x = matrix(0, nrow = 1001, ncol = 5)
for (i in 1:5) {
  for (j in 2:1001) {
    cand = rnorm(1, x[j-1, i], sigma[i])
    r = (.9*dnorm(cand) + .1*dnorm(cand - 10)) /
        (.9*dnorm(x[j-1, i]) + .1*dnorm(x[j-1, i] - 10))
    u = runif(1)
    x[j, i] = ifelse(u < r, cand, x[j-1, i])
  }
}
######################################################################################

[Figure: history plots of the five chains, one per value of σ.]

Discussion of tuning σ: From the history plots above we can see that as σ increases, fewer candidates are accepted, i.e., the rejection rate increases, and the chain needs many iterations to move between states. However, if we put the plots on a common range of x, as shown below, it is easy to see that when σ is small the chain moves so slowly that it cannot reach the other 'island' of the mixture (around 10) in 1000 iterations. To sum up, we want σ neither too big nor too small; the jumping distribution of a Metropolis sampler must be tuned carefully.

[Figure: the same history plots drawn on a common vertical range of x.]

6. (a) Gibbs algorithm with the conditional distributions X1|X2 = x2 ∼ N(.99x2, .0199) and X2|X1 = x1 ∼ N(.99x1, .0199), where .0199 = 1 − .99²:

(1) Start at the initial point x1^(0) = 0, x2^(0) = 0.
(2) For j = 1, 2, ..., N:
    sample x1^(j) from N(.99 x2^(j−1), .0199);
    sample x2^(j) from N(.99 x1^(j), .0199).

########################################
# Gibbs sampler for the bivariate normal with rho = .99.
N = 1000
sigma = sqrt(.0199)   # conditional standard deviation
x = matrix(0, nrow = N+1, ncol = 2)
for (j in 2:(N+1)) {
  x[j, 1] = rnorm(1, .99*x[j-1, 2], sigma)
  x[j, 2] = rnorm(1, .99*x[j, 1],   sigma)
}
########################################

To compare the empirical distributions for N = 1000, 5000, 10000 iterations, I draw the samples on a two-dimensional plot; the contour lines come from the theoretical bivariate normal with the µ and Σ defined in the question (a plotting sketch follows below). We can see that with N = 1000 the points mostly scatter within the inner contour line: the cloud looks like a linear band instead of an ellipse, and the tails are unlike a normal distribution. When N increases to 5000 the shape looks more like an ellipse, and at N = 10000 the sample fits the theoretical contour lines well. Checking the marginal distributions of X1 and X2 with histograms or QQ plots, N = 5000 is better than 1000 and seems enough to see the Normal.

[Figure: scatter plots against theoretical contours, and marginal plots of X1 and X2, for N = 1000, 5000, 10000.]
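A sketch of how the comparison plots in (a) might be produced (the original plotting code is not included in the keys). It reuses the matrix x from the part (a) code above, and LearnBayes's dmnorm, the same density function used in part (c) below:

########################################
# Overlay the Gibbs draws on contours of the target N2(0, Sigma), rho = .99,
# and check one marginal with a QQ plot.
library(LearnBayes)   # for dmnorm
Sigma <- matrix(c(1, .99, .99, 1), nrow = 2)
g <- seq(-4, 4, length = 101)
dens <- outer(g, g, function(a, b) dmnorm(cbind(a, b), mean = c(0, 0), varcov = Sigma))
plot(x[-1, ], pch = ".", xlab = "X1", ylab = "X2")   # drop the (0, 0) starting point
contour(g, g, dens, add = TRUE)
qqnorm(x[-1, 1]); qqline(x[-1, 1])                   # marginal check for X1
########################################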
(b) Gibbs algorithm with the conditional distributions Z1|Z2 ∼ N(0, 1) and Z2|Z1 ∼ N(0, 1):

(1) Start at the initial point z1^(0) = 0, z2^(0) = 0.
(2) For j = 1, 2, ..., N, sample z1^(j), z2^(j) from N(0, 1).

Actually Z1 and Z2 are independent here. Using code similar to part (a) with N = 100, 500, 1000, I get the following plots. 100 iterations seem not enough to see the standard bivariate normal. N = 500 is probably an appropriate number of iterations: in the scatter plot the points are dense in the center and sparse outside, in a circular shape, and the marginal distributions look Normal. With 1000 iterations the bivariate normal shape is even clearer.

[Figure: scatter plots and marginal plots of Z1 and Z2 for N = 100, 500, 1000.]

For 1000 pairs (z1, z2)ᵀ, calculate

  ( y1 )   ( √1.99/√2    √.01/√2 ) ( z1 )
  ( y2 ) = ( √1.99/√2   −√.01/√2 ) ( z2 ).

The empirical joint distribution and marginal distributions are shown below.

[Figure: scatter plot and marginal plots of (Y1, Y2) for N = 1000.]

Note that the theoretical distribution of (Y1, Y2) is the same as that of (X1, X2) in part (a). To prove this, use the fact that if X ∼ MVN_k(µ, Σ) and Y = AX, where A is a known k × k matrix, then Y ∼ MVN_k(Aµ, AΣAᵀ). Here Aµ = (0, 0)ᵀ, and AAᵀ has diagonal entries (1.99 + .01)/2 = 1 and off-diagonal entries (1.99 − .01)/2 = .99; hence

  (y1, y2)ᵀ ∼ MVN2( (0, 0)ᵀ, ( 1  .99 ; .99  1 ) ).

Compare the empirical distribution of (Y1, Y2) with that of (X1, X2) for N = 1000. The following plots show that, although the two methods sample from the same distribution with the same number of iterations and both use the Gibbs algorithm, the empirical distribution of the transformed standard Normal looks better than that of the samples drawn directly from the original bivariate Normal. This is an issue we discussed in class: sometimes it is slow to sample from the original distribution, so we may consider different parameterizations.

[Figure: side-by-side comparison of the empirical distributions of (Y1, Y2) and (X1, X2) for N = 1000.]

(c) Metropolis algorithm using the proposal J((x1′, x2′)|(x1, x2)) equal to the density of MVN2((x1, x2)ᵀ, cI), i.e.

  J((x1′, x2′)|(x1, x2)) = (1/(2πc)) exp( −[(x1′ − x1)² + (x2′ − x2)²] / (2c) ).

Denote by f(x1, x2) the density of MVN2( (0, 0)ᵀ, ( 1  .99 ; .99  1 ) ).

(1) Start at the initial point (x1^(0), x2^(0))ᵀ = (0, 0)ᵀ.
(2) For j = 1, 2, ..., N:
    Generate the candidate (x1*, x2*)ᵀ from MVN2( (x1^(j−1), x2^(j−1))ᵀ, cI ).
    Calculate
      r^(j) = [f(x1*, x2*)/J((x1*, x2*)|(x1^(j−1), x2^(j−1)))] / [f(x1^(j−1), x2^(j−1))/J((x1^(j−1), x2^(j−1))|(x1*, x2*))]
            = f(x1*, x2*) / f(x1^(j−1), x2^(j−1))
    (the proposal is symmetric, so the J terms cancel).
    Generate y^(j) ∼ Bernoulli(min(r^(j), 1)).
    Take (x1^(j), x2^(j))ᵀ = y^(j) (x1*, x2*)ᵀ + (1 − y^(j)) (x1^(j−1), x2^(j−1))ᵀ.

###########################################################
# Random-walk Metropolis for the bivariate normal, one chain per value of c.
library(LearnBayes)   # for dmnorm
N = 1000
ct = c(.01, .1, 1, 10)
mu = c(0, 0)
sigma = matrix(c(1, .99, .99, 1), nrow = 2)
x = array(0, dim = c(N+1, 2, length(ct)))
cand = c(NA, NA)
for (i in 1:length(ct)) {
  for (j in 2:(N+1)) {
    x[j, , i] = x[j-1, , i]
    cand[1] = rnorm(1, x[j-1, 1, i], sqrt(ct[i]))
    cand[2] = rnorm(1, x[j-1, 2, i], sqrt(ct[i]))
    r = dmnorm(cand, mu, sigma) / dmnorm(x[j-1, , i], mu, sigma)
    u = runif(1)
    if (u < r) { x[j, , i] = cand }
  }
}
###########################################################

[Figure: history plots and scatter plots of the four chains, one per value of c.]

Discussion of tuning c: From the history plots and the scatter plots for the different values of c, we see that when c is small (e.g., 0.01) the acceptance rate is high, but each step is so tiny that over the 1000 iterations both X1 and X2 move only within a relatively small area. However, if c is large (e.g., 10), the steps are big, but the acceptance rate is so low that we obtain only a few distinct updates in 1000 iterations. Considering both the step size and the acceptance rate, c = .1 is the best choice of the four (an acceptance-rate check follows below).
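To put numbers behind the tuning discussion, a small check (not part of the original keys) of the empirical acceptance rate of each chain, reusing the array x and the vector ct from the code above. A repeated state is counted as a rejection, since a continuous proposal repeats the current point with probability zero:

###########################################################
# Empirical acceptance rate per chain: the fraction of transitions
# where the first coordinate actually moved.
acc <- sapply(seq_along(ct), function(i) mean(diff(x[, 1, i]) != 0))
data.frame(c = ct, acceptance = round(acc, 3))
###########################################################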