Markov Chain Monte Carlo: Metropolis-Hastings Algorithm

I. Markov Chain Monte Carlo (MCMC)

The objective of MCMC is to simulate values from a distribution using a sequence of dependent draws rather than an independent random sample. A Markov chain is a stochastic process in which future states do not depend on past states given the present state. Consider a draw $\theta_t$ to be the state at iteration $t$. The next draw $\theta_{t+1}$ depends only on the current draw $\theta_t$ and not on any earlier draws:

$$P(\theta_{t+1} \mid \theta_1, \theta_2, \ldots, \theta_t) = P(\theta_{t+1} \mid \theta_t)$$

This conditional probability distribution is called the transition kernel; it represents the probability of moving from $\theta_t$ to $\theta_{t+1}$. The objective of Markov chain theory is to find conditions under which an invariant distribution exists, and conditions under which iterations of the transition kernel converge to that invariant distribution.

In MCMC, the invariant distribution is known (up to a multiplicative constant). It is called the target distribution, denoted by $\pi$, and it is the distribution from which we would like to generate representative sample values. The difficulty in MCMC lies in constructing a transition kernel $P(\theta_{t+1} \mid \theta_t)$ whose invariant distribution is $\pi$.

A function $p(\theta_{t+1} \mid \theta_t)$ satisfies the reversibility condition if

$$\pi(\theta_t)\, p(\theta_{t+1} \mid \theta_t) = \pi(\theta_{t+1})\, p(\theta_t \mid \theta_{t+1})$$

where $\pi$ is the invariant density of $P(\cdot \mid \cdot)$.

The left-hand side of this equation is the unconditional probability of moving from $\theta_t$ to $\theta_{t+1}$, whereas the right-hand side is the unconditional probability of moving from $\theta_{t+1}$ to $\theta_t$. The reversibility condition tells us that, during transitions back and forth between adjacent points, the relative probability of the transitions exactly matches the relative value of the target distribution:

$$\frac{p(\theta_{t+1} \mid \theta_t)}{p(\theta_t \mid \theta_{t+1})} = \frac{\pi(\theta_{t+1})}{\pi(\theta_t)}$$

This implies that adjacent positions are visited in proportion to their relative values in the target distribution.

As we said earlier, the difficult part of MCMC is constructing the transition kernel $P(\theta_{t+1} \mid \theta_t)$. However, there exist universal methods for deriving such kernels.
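To make the reversibility condition concrete, here is a minimal sketch in R on a three-state chain. The target probabilities and the rule used to build the transition matrix are assumptions chosen only for illustration; they do not come from the presentation. The sketch checks reversibility, invariance, and convergence of the iterated kernel.

```r
# Hypothetical three-state target distribution (illustrative values only).
target <- c(0.2, 0.3, 0.5)
K <- length(target)

# Build a transition matrix P: from state i, propose each other state with
# probability 1/2 and accept with probability min(1, target[j] / target[i]).
P <- matrix(0, K, K)
for (i in 1:K) {
  for (j in 1:K) {
    if (i != j) P[i, j] <- 0.5 * min(1, target[j] / target[i])
  }
  P[i, i] <- 1 - sum(P[i, ])  # remaining probability: stay in state i
}

# Reversibility: target[i] * P[i, j] equals target[j] * P[j, i] for all i, j.
flow <- target * P            # flow[i, j] = target[i] * P[i, j]
reversible <- isTRUE(all.equal(flow, t(flow)))

# Invariance: one step of the kernel leaves the target unchanged.
invariant <- isTRUE(all.equal(as.numeric(target %*% P), target))

# Convergence: iterate the kernel starting from a point mass on state 1.
state_dist <- c(1, 0, 0)
for (iter in 1:200) state_dist <- as.numeric(state_dist %*% P)

print(reversible)
print(invariant)
print(round(state_dist, 4))
```

The chain here is small enough that the kernel can be written as a matrix; in continuous problems the same checks are done analytically rather than numerically.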
The Metropolis-Hastings (M-H) algorithm is one such universal method for constructing a transition kernel; Gibbs sampling is another. Our objective in this presentation is to elaborate on M-H.

II. Metropolis Algorithm

Let's assume that we have a proposal distribution $q(\theta_{t+1} \mid \theta_t)$ from which we can draw random values. That is, given a draw $\theta_t$ as the state at iteration $t$, the next draw $\theta_{t+1}$ is generated from $q(\theta_{t+1} \mid \theta_t)$.

If $q(\theta_{t+1} \mid \theta_t)$ satisfies the reversibility condition, then draws from the proposal distribution can be used as draws from the target distribution. Most often, however, we will have

$$\pi(\theta_t)\, q(\theta_{t+1} \mid \theta_t) > \pi(\theta_{t+1})\, q(\theta_t \mid \theta_{t+1})$$

This indicates that the chain moves from $\theta_t$ to $\theta_{t+1}$ too often and from $\theta_{t+1}$ to $\theta_t$ too rarely. To fix this problem we introduce the probability of moving, defined as follows:

$$p_{move} = \min\left( \frac{\pi(\theta_{t+1})\, q(\theta_t \mid \theta_{t+1})}{\pi(\theta_t)\, q(\theta_{t+1} \mid \theta_t)},\ 1 \right)$$

Now let's describe how the simulation is done:

1. Simulate a candidate value $\theta_{t+1}$ from $q(\theta_{t+1} \mid \theta_t)$.
2. Compute the probability of moving to the proposed position:

$$p_{move} = \min\left( \frac{\pi(\theta_{t+1})\, q(\theta_t \mid \theta_{t+1})}{\pi(\theta_t)\, q(\theta_{t+1} \mid \theta_t)},\ 1 \right)$$

3. Decide whether to move to the proposed position by generating a uniform random number between 0 and 1.
4. Accept the proposed value if the uniform random number is less than $p_{move}$; otherwise, stay at the current position.
5. Repeat the steps above until it is judged that a sufficiently representative set of values has been sampled.

III. How do we select the proposal distribution

The M-H algorithm is easy to implement. However, to make sure that our representative values have converged to the target distribution, we need a well-chosen proposal distribution. One practical approach to constructing the proposal distribution is to use the previously generated draw to simulate the next draw, so that the chain explores the neighborhood of the current draw. This approach is called a random walk and is implemented as follows:

$$\theta_{t+1} = \theta_t + \varepsilon, \quad \text{where } \varepsilon \sim q$$
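The five simulation steps of Section II, combined with a normal random-walk proposal of the kind just described, can be sketched in R as follows. This is a minimal sketch under stated assumptions: the function name `mh_sample`, its arguments, the standard normal target, and the proposal standard deviation of 2.4 are all hypothetical choices made for illustration. The ratio is computed on the log scale for numerical stability.

```r
# Sketch of the M-H steps with a general proposal q (all names hypothetical).
#   log_target(x)       : log of the target density, up to an additive constant
#   rprop(cur)          : draws a candidate from q(. | cur)
#   dprop_log(to, from) : log q(to | from)
mh_sample <- function(log_target, rprop, dprop_log, start, n_iter) {
  draws <- numeric(n_iter + 1)
  draws[1] <- start
  n_accept <- 0
  for (i in seq_len(n_iter)) {
    cur <- draws[i]
    cand <- rprop(cur)                           # step 1: simulate a candidate
    # step 2: log of pi(cand) q(cur | cand) / (pi(cur) q(cand | cur))
    log_ratio <- log_target(cand) + dprop_log(cur, cand) -
      log_target(cur) - dprop_log(cand, cur)
    p_move <- min(1, exp(log_ratio))
    if (runif(1) < p_move) {                     # steps 3-4: accept or stay
      draws[i + 1] <- cand
      n_accept <- n_accept + 1
    } else {
      draws[i + 1] <- cur
    }
  }                                              # step 5: repeat
  list(draws = draws, accept_rate = n_accept / n_iter)
}

# Illustrative run: standard normal target, normal random-walk proposal.
set.seed(1)
sd_prop <- 2.4  # assumed proposal scale
out <- mh_sample(
  log_target = function(x) dnorm(x, log = TRUE),
  rprop      = function(cur) rnorm(1, mean = cur, sd = sd_prop),
  dprop_log  = function(to, from) dnorm(to, mean = from, sd = sd_prop, log = TRUE),
  start = 0, n_iter = 20000
)
print(out$accept_rate)
print(c(mean = mean(out$draws), sd = sd(out$draws)))
```

Because the normal random-walk proposal is symmetric, the two `dprop_log` terms cancel in this run; they are kept in the function so that the same code also covers non-symmetric proposals.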
If $q$ is symmetric, that is $q(\varepsilon) = q(-\varepsilon)$, the proposal terms cancel and

$$p_{move} = \min\left( \frac{\pi(\theta_{t+1})}{\pi(\theta_t)},\ 1 \right)$$

If the candidate is drawn independently of the current position in the chain, then $q(\theta_{t+1} \mid \theta_t) = q(\theta_{t+1})$ and the probability of moving is

$$p_{move} = \min\left( \frac{\pi(\theta_{t+1})\, q(\theta_t)}{\pi(\theta_t)\, q(\theta_{t+1})},\ 1 \right)$$

Most often, the requirements that the proposal distribution has to meet are:

1. The proposal distribution must have enough dispersion to explore the entire domain of the target distribution. The proposal distribution should dominate the target in the tails.
2. Roberts, Gelman, and Gilks (1994) showed that, for a random-walk proposal where both the target and the proposal distributions are normal, the scale of the proposal should be chosen so that the acceptance rate is approximately 0.45 in one-dimensional problems and around 0.25 in multidimensional problems. The acceptance rate is the probability that a proposed draw is retained.

IV. Application of the M-H algorithm: Bayes

For any statistical analysis, we have to define the statistical model that we would like to fit. The statistical model is often given in the form of a probability distribution $f(y \mid \theta)$. When viewed as a function of $\theta$ instead of $y$, this distribution is called the likelihood and is written as $L(\theta; y) = f(y \mid \theta)$.

In the Bayesian view, we assume that $\theta$ is random. That is, based on our prior knowledge about $\theta$, we assume a probability distribution for $\theta$, written here as $\pi(\theta)$, that summarizes any information we have about it that is not contained in the data. This distribution is called the prior distribution, or just the prior. Our knowledge about $\theta$ is updated after we take the data into account. The distribution of $\theta$ given the data is called the posterior and is the basis of all inferences about $\theta$. In this setting, the posterior is the target distribution of the sampler. The posterior distribution is written as follows:

$$f(\theta \mid y) = \frac{f(y \mid \theta)\, \pi(\theta)}{h(y)} \propto f(y \mid \theta)\, \pi(\theta)$$

since

$$h(y) = \int f(y \mid \theta)\, \pi(\theta)\, d\theta$$

is a normalizing constant. The posterior distribution is a true probability distribution and must integrate to 1. One of the problems with Bayesian analysis is deriving $h(y)$. Most often we cannot obtain a closed form for $h(y)$.
Furthermore, the posterior distribution often does not belong to a known family of distributions. Therefore, to obtain representative values from the posterior distribution, we use sampling techniques such as Metropolis-Hastings and Gibbs sampling. Here, we will implement the Metropolis-Hastings algorithm.

V. Bayesian Analysis

To complete any Bayesian analysis, the workload can be divided into four steps:

1. Specify the probability distribution of your data given the parameters in your model (the likelihood).
2. Based on your beliefs about the parameters in your model, specify a prior distribution for those parameters.
3. Derive the posterior distribution as the product of the likelihood and the prior (up to the normalizing constant).
4. Make inferences (mean, SD, median, highest density interval (HDI)) about the parameters of your model through the posterior distribution. This can be done through simulation or numerical derivation.

The HDI is another way to summarize a distribution. The HDI indicates which points of a distribution are most credible. It summarizes the distribution by an interval that spans most of the distribution, say 95% of it, such that every point inside the interval has higher credibility than any point outside it.

VI. Example:

In this example, we apply the Metropolis-Hastings algorithm to a Bernoulli trial. From August 30, 2014 to December 17, 2014, the Huskers football team played 13 games. Among those 13 games, they won 9 and lost 4. Let's assume that each game is a Bernoulli trial. That is, $y_i = 1$ if the Huskers win game $i$ and $y_i = 0$ otherwise. The goal is to estimate the probability $p$ that the Huskers win a game.

Let's assume $y = (y_1, y_2, \ldots, y_n)$, $y_i \sim \text{Bernoulli}(p)$. Since $p$ is between 0 and 1, we use $\text{beta}(\alpha, \beta)$, where $\alpha$ and $\beta$ are known constants, as the prior distribution for $p$. The derivation of the posterior is given in the appendix.

The posterior distribution is

$$f(p \mid y) \propto p^{\sum_{i=1}^{n} y_i + \alpha - 1} (1 - p)^{n - \sum_{i=1}^{n} y_i + \beta - 1}$$

We will use a random walk to generate our sample values.
The proposal distribution is a normal distribution centered at zero with a known standard deviation (SD) denoted $\sigma$. The Metropolis-Hastings algorithm then proceeds as follows. Start at an arbitrary initial value of $p$ (in the valid range). This is the current value, denoted $p_{cur}$. Then:

1) Randomly generate a proposed jump $\Delta p \sim \text{normal}(0, \sigma^2)$ and denote the proposed value of the parameter as $p_{pro} = p_{cur} + \Delta p$.
2) Compute the probability of moving to the proposed position as follows:

$$p_{move} = \min\left( 1,\ \frac{p_{pro}^{\,z + a - 1} (1 - p_{pro})^{N - z + b - 1}}{p_{cur}^{\,z + a - 1} (1 - p_{cur})^{N - z + b - 1}} \right)$$

Here $z = \sum_{i=1}^{n} y_i$ is the number of wins, $N = n$ is the number of games, and $a$ and $b$ are the beta prior parameters $\alpha$ and $\beta$. If the proposed value happens to fall outside the range of $p$, then the prior and the likelihood are set to zero, so the move is rejected.
3) Decide whether to move to the proposed position by generating a uniform random number between 0 and 1.
4) Accept the proposed value if the uniform random number is less than $p_{move}$; otherwise, stay at the current position.
5) Repeat the above steps until it is judged that a sufficiently representative sample has been generated.

The R session below implements these steps with $\sigma = 0.2$ and $a = b = 1$.

> # Specify the data, to be used in the likelihood function.
> myData = c(rep(0,4), rep(1,9))
> # Define the target function, p(D|theta)*p(theta). For our application, this
> # target distribution is the unnormalized posterior distribution.
> # The argument theta could be a vector, not just a scalar.
> targetRelProb = function( theta , data , a , b ) {
+   z = sum( data )
+   N = length( data )
+   pDataGivenTheta = theta^(z + a - 1) * (1 - theta)^(N - z + b - 1)
+   # The theta values passed into this function are generated at random,
+   # and therefore might be inadvertently greater than 1 or less than 0.
+   # The target density for theta > 1 or for theta < 0 is zero:
+   pDataGivenTheta[ theta > 1 | theta < 0 ] = 0
+   return( pDataGivenTheta )
+ }
> run_metropolis_MCMC <- function(startvalue, iterations){
+   nAccepted = 0
+   nRejected = 0
+   burnIn = ceiling( 0.0 * iterations )
+   chain = matrix(NA, nrow = iterations + 1, ncol = 1)
+   chain[1,] = startvalue
+   for (i in 1:iterations){
+     # Proposed jump from a normal random walk with SD 0.2
+     proposal = rnorm(1, mean = 0, sd = 0.2)
+     pmove = min(1, targetRelProb(chain[i,] + proposal, myData, a = 1, b = 1) /
+                    targetRelProb(chain[i,], myData, a = 1, b = 1))
+     if (runif(1) < pmove){
+       chain[i+1,] = proposal + chain[i,]
+       if ( i > burnIn ) { nAccepted = nAccepted + 1 }
+     }else{
+       chain[i+1,] = chain[i,]
+       if ( i > burnIn ) { nRejected = nRejected + 1 }
+     }
+   }
+   return(list(chain, nAccepted, nRejected, burnIn))
+ }
> startvalue = c(0.2)
> set.seed(12345)
> chain = run_metropolis_MCMC(startvalue, 100000)
> nAccepted <- chain[[2]]
> trajectory <- chain[[1]]
> head(trajectory)
          [,1]
[1,] 0.2000000
[2,] 0.3171058
[3,] 0.5583403
[4,] 0.5583403
[5,] 0.5583403
[6,] 0.6843600
> # The burn-in length is the fourth list element (the third is nRejected).
> burnIn <- chain[[4]]
> # Extract the post-burnIn portion of the trajectory.
> acceptedTraj = trajectory[ (burnIn+1) : dim(trajectory)[1], ]
> head(acceptedTraj)
[1] 0.2000000 0.3171058 0.5583403 0.5583403 0.5583403 0.6843600
> # The summary values printed below are from the original run;
> # a re-run may differ slightly.
> trajLength = length(acceptedTraj)
> Mean = mean(acceptedTraj)
> Median <- median(acceptedTraj)
> SD <- sd(acceptedTraj)
> densCurve = density( acceptedTraj , adjust=2 )
> Mode = densCurve$x[which.max(densCurve$y)]
> names <- c("Mean", "Median", "Mode", "SD")
> Sum.Stat <- c(Mean, Median, Mode, SD)
> Summary <- data.frame(names, Sum.Stat)
> Summary
   names  Sum.Stat
1   Mean 0.6685017
2 Median 0.6759711
3   Mode 0.6963898
4     SD 0.1171666
> HDIofMCMC = function( sampleVec , credMass=0.95 ) {
+   # Computes highest density interval from a sample of representative values,
+   # estimated as shortest credible interval.
+   # Arguments:
+   #   sampleVec
+   #     is a vector of representative values from a probability distribution.
+   #   credMass
+   #     is a scalar between 0 and 1, indicating the mass within the credible
+   #     interval that is to be estimated.
+   # Value:
+   #   HDIlim is a vector containing the limits of the HDI
+   sortedPts = sort( sampleVec )
+   ciIdxInc = ceiling( credMass * length( sortedPts ) )
+   nCIs = length( sortedPts ) - ciIdxInc
+   ciWidth = rep( 0 , nCIs )
+   for ( i in 1:nCIs ) {
+     ciWidth[ i ] = sortedPts[ i + ciIdxInc ] - sortedPts[ i ]
+   }
+   HDImin = sortedPts[ which.min( ciWidth ) ]
+   HDImax = sortedPts[ which.min( ciWidth ) + ciIdxInc ]
+   HDIlim = c( HDImin , HDImax )
+   return( HDIlim )
+ }
> HDI <- matrix(NA, ncol = 2, nrow = 1)
> colnames(HDI) <- c("HDI_Lower", "HDI_Upper")
> Parameters <- "Theta"
> HDI <- data.frame(Parameters, HDI)
> HDI[1,2:3] <- HDIofMCMC(sampleVec = acceptedTraj, credMass=0.95)
> HDI
  Parameters HDI_Lower HDI_Upper
1      Theta 0.4394185 0.8891786
> layout( matrix(1:3,ncol=3) )
> par(mar=c(3,4,2,1),mgp=c(2,0.7,0))
> library(coda)
> border <- "skyblue"
> col <- "skyblue"
> histinfo = hist( acceptedTraj , freq=F, border=border , xlab = Parameters,
+   main=bquote( list( "SD" == .(round(sd(acceptedTraj),3)) ,
+                      "Median" == .(round(median(acceptedTraj),3)) ,
+                      "Mean" == .(round(mean(acceptedTraj),3)) ) ))
> lines( densCurve$x , densCurve$y , type="l" , lwd=2, col = "red" )
> cenTendHt = 0.9*max(histinfo$density)
> cvHt = 0.7*max(histinfo$density)
> ROPEtextHt = 0.55*max(histinfo$density)
> # Display central tendency:
> mn = Mean
> med = Median
> mo = Mode
> text( mo , cenTendHt , bquote(mode ==.(signif(mo,3))) , adj=c(.5,0) , cex=1.5 )
> # Display the HDI.
> credMass <- 0.95
> cex <- 1.5
> HDItextPlace = 0.7
> hdiLim <- as.numeric(HDI[1,2:3])   # HDI limits as a plain numeric vector
> lines( hdiLim , c(0,0) , lwd=4 )
> text( mean(hdiLim) , 0 , bquote(.(100*credMass) * "% HDI" ) ,
+   adj=c(.5,-1.7) , cex=cex )
> text( hdiLim[1] , 0 , bquote(.(signif(hdiLim[1],3))) ,
+   adj=c(HDItextPlace,-0.5) , cex=cex )
> text( hdiLim[2] , 0 , bquote(.(signif(hdiLim[2],3))) ,
+   adj=c(1.0-HDItextPlace,-0.5) , cex=cex )
> #------------------------------------------------------------
> # Comparing our simulated draws to the exact distribution.
> #------------------------------------------------------------
> shape1 <- sum(myData) + 1
> shape2 <- length(myData) - sum(myData) + 1
> plot.ecdf(x = acceptedTraj, verticals = TRUE, do.p = FALSE, lwd = 2,
+   panel.first = grid(), ylab = "Probability",
+   xlab = "Theta", col = "red", main = "EDF")
> abline(h = c(0,1))
> curve(expr = pbeta(q = x, shape1 = shape1, shape2 = shape2), add = TRUE,
+   col = "blue", lwd = 2)
> legend(x = 0.01, y = 1, legend = c("Metropolis", "Beta"), bty = "n",
+   lty = 1, col = c("red", "blue"))
> # PDF
> hist(x = acceptedTraj, freq = FALSE, main = "", col = "red", ylim = c(0,4))
> curve(expr = dbeta(x = x, shape1 = shape1, shape2 = shape2), add = TRUE,
+   col = "blue", lwd = 2)
> legend(x = 0.2, y = 4, legend = c("Metropolis", "Beta"), bty = "n",
+   lty = 1, col = c("red", "blue"))

Appendix

Derivation of the posterior distribution:

Let's assume $y = (y_1, y_2, \ldots, y_n)$, $y_i \sim \text{Bernoulli}(p)$. Since $p$ is between 0 and 1, we use $\text{beta}(\alpha, \beta)$, where $\alpha$ and $\beta$ are known constants, as the prior distribution $\pi(p)$ for $p$. The posterior distribution is derived as follows:

$$f(p \mid y) = \frac{\prod_{i=1}^{n} f(y_i \mid p)\, \pi(p)}{h(y)} \propto \prod_{i=1}^{n} f(y_i \mid p)\, \pi(p)$$

$$= \prod_{i=1}^{n} p^{y_i} (1 - p)^{1 - y_i} \cdot \frac{1}{B(\alpha, \beta)}\, p^{\alpha - 1} (1 - p)^{\beta - 1}$$

$$f(p \mid y) \propto p^{\sum_{i=1}^{n} y_i + \alpha - 1} (1 - p)^{n - \sum_{i=1}^{n} y_i + \beta - 1}$$

This is the kernel of a beta distribution. So the posterior distribution is a $\text{beta}\left( \sum_{i=1}^{n} y_i + \alpha,\ n - \sum_{i=1}^{n} y_i + \beta \right)$ distribution.
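The conjugate result can be checked numerically. The sketch below is an illustration under assumptions: it uses the example's data (9 wins, 4 losses) with $\alpha = \beta = 1$, the values passed to the sampler code, and the variable names are hypothetical. It normalizes the kernel by numerical integration and compares the result with R's built-in beta density.

```r
# Data from the example: 4 losses (0) and 9 wins (1); assumed prior beta(1, 1).
y <- c(rep(0, 4), rep(1, 9))
a <- 1
b <- 1
z <- sum(y)      # number of wins
n <- length(y)   # number of games

# Unnormalized posterior kernel from the derivation.
post_kernel <- function(p) p^(z + a - 1) * (1 - p)^(n - z + b - 1)

# Normalizing constant h(y), up to factors that do not depend on p,
# obtained by numerical integration over [0, 1].
h <- integrate(post_kernel, lower = 0, upper = 1, rel.tol = 1e-10)$value

# The constant should match the beta function B(z + a, n - z + b).
const_matches <- isTRUE(all.equal(h, beta(z + a, n - z + b)))

# The normalized kernel should match the beta(z + a, n - z + b) density.
grid <- seq(0.05, 0.95, by = 0.05)
max_diff <- max(abs(post_kernel(grid) / h - dbeta(grid, z + a, n - z + b)))

print(const_matches)
print(max_diff)
```

Numerical integration is practical here only because the parameter is one-dimensional; that limitation is what motivates sampling methods in larger problems.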
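Since the posterior distribution is a known distribution, we can make inferences through exact formulas for the statistics of interest. For instance,

$$E(p \mid y) = \frac{\sum_{i=1}^{n} y_i + \alpha}{n + \alpha + \beta}, \quad \text{mode} = \frac{\sum_{i=1}^{n} y_i + \alpha - 1}{n + \alpha + \beta - 2}, \quad \text{var} = \frac{\left( \sum_{i=1}^{n} y_i + \alpha \right) \left( n - \sum_{i=1}^{n} y_i + \beta \right)}{(n + \alpha + \beta)^2 (n + \alpha + \beta + 1)}$$

For this example, we see that the posterior and the prior distributions come from the same family of distributions (beta). We say that the beta distribution is a conjugate prior for the Bernoulli likelihood (and for the binomial as well).

Now let's assume that $y = (y_1, y_2, \ldots, y_n)$, $y_i \sim \text{Poisson}(\lambda)$, $\lambda \sim \text{Normal}(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are known hyperparameters. We calculate the posterior distribution as follows:

$$f(\lambda \mid y) = \frac{\prod_{i=1}^{n} f(y_i \mid \lambda)\, \pi(\lambda)}{h(y)} \propto \prod_{i=1}^{n} f(y_i \mid \lambda)\, \pi(\lambda)$$

$$= \prod_{i=1}^{n} \frac{e^{-\lambda} \lambda^{y_i}}{y_i!} \cdot \frac{1}{\sqrt{2 \pi \sigma^2}}\, e^{-\frac{(\lambda - \mu)^2}{2 \sigma^2}}$$

$$f(\lambda \mid y) \propto \lambda^{\sum_{i=1}^{n} y_i}\, e^{-n \lambda - \frac{(\lambda - \mu)^2}{2 \sigma^2}}$$

This is not the kernel of a known distribution.

References:

Books:
- Jim Albert: Bayesian Computation with R, second edition
- John K. Kruschke: Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan, second edition
- Bradley P. Carlin, Thomas A. Louis: Bayesian Methods for Data Analysis

Paper:
- Siddhartha Chib and Edward Greenberg: Understanding the Metropolis-Hastings Algorithm. The American Statistician, November 1995, Vol. 49, No. 4.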