Markov Chain Monte Carlo:
Metropolis-Hastings
Algorithm
I. Markov Chain Monte Carlo (MCMC)
 The objective of MCMC is to simulate draws from
a distribution through a dependent (non-independent) sample.
 A Markov chain is a stochastic process in which
future states do not depend on past states
given the present state.
 Consider a draw $\theta^{(t)}$ to be a state at iteration $t$.
The next draw $\theta^{(t+1)}$ is dependent only on the
current draw $\theta^{(t)}$ and not on any past draws:
$$P(\theta^{(t+1)} \mid \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(t)}) = P(\theta^{(t+1)} \mid \theta^{(t)})$$

 This conditional probability distribution is
called the transition kernel; it represents the
probability of moving from $\theta^{(t)}$ to $\theta^{(t+1)}$.
 The objective of Markov chain theory is to find
conditions under which there exists an
invariant distribution, and conditions under
which iterations of the transition kernel
converge to the invariant distribution.
 In MCMC, the invariant distribution is known
(up to a multiplicative constant) and is called the
target distribution, denoted by $\pi(\cdot)$; it is
the distribution from which we would like to
generate representative sample values.
 The difficulty in MCMC lies in the
construction of the transition kernel $P(\theta^{(t+1)} \mid \theta^{(t)})$
that is associated with $\pi(\cdot)$.
 A function $p(\cdot \mid \cdot)$ satisfies the reversibility
condition if
$$\pi(\theta^{(t)})\, p(\theta^{(t+1)} \mid \theta^{(t)}) = \pi(\theta^{(t+1)})\, p(\theta^{(t)} \mid \theta^{(t+1)}),$$
where $\pi(\cdot)$ is the invariant density of $P(\cdot \mid \cdot)$.
 The left-hand side of the previous equation is
the unconditional probability of moving from $\theta^{(t)}$
to $\theta^{(t+1)}$, whereas the right-hand side is the
unconditional probability of moving from $\theta^{(t+1)}$ to
$\theta^{(t)}$.
 The reversibility condition tells us that during
transitions back and forth between adjacent
points, the relative transition probabilities
exactly match the relative values of the
target distribution.
 This implies that adjacent positions will be
visited proportionally to their relative values in
the target distribution.
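
The reversibility condition and the convergence to the invariant distribution can be checked numerically on a small discrete chain. The sketch below is only an illustration: the three-state target `c(1, 2, 3)` and the uniform proposal over the other states are assumptions made for the example, not part of the presentation.

```r
# Target on three states, known only up to a multiplicative constant (hypothetical values).
target <- c(1, 2, 3)
pi_target <- target / sum(target)
K <- length(target)

# Build a transition kernel: propose one of the other states uniformly,
# then accept the move with probability min(target ratio, 1).
P <- matrix(0, K, K)
for (i in 1:K) {
  for (j in 1:K) {
    if (i != j) P[i, j] <- (1 / (K - 1)) * min(target[j] / target[i], 1)
  }
  P[i, i] <- 1 - sum(P[i, ])   # probability of staying put
}

# Reversibility: pi_i * P[i, j] should equal pi_j * P[j, i] for all i, j.
flow <- pi_target * P          # row i of P scaled by pi_i
max(abs(flow - t(flow)))       # numerically zero when the condition holds

# Iterating the kernel from an arbitrary starting distribution approaches the target.
dist <- c(1, 0, 0)
for (step in 1:50) dist <- dist %*% P
rbind(iterated = as.numeric(dist), target = pi_target)
```

The two rows of the final matrix should agree closely, showing that the kernel leaves the target distribution invariant and that repeated transitions converge to it.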
 As we said earlier, the difficult part in MCMC
resides in the construction of the transition
kernel $P(\theta^{(t+1)} \mid \theta^{(t)})$. However, there exist some
methods of deriving such kernels that are
universal. The Metropolis-Hastings (M-H)
algorithm is one example of those methods;
Gibbs sampling is another. Our objective in
this presentation is to elaborate on M-H.
II. Metropolis Algorithm
 Let’s assume that we have a proposal
distribution $q(\cdot \mid \cdot)$ that we can draw random
values from.
 That is, given a draw $\theta^{(t)}$ as the state at iteration
$t$, the next draw $\theta^{(t+1)}$ is generated from $q(\cdot \mid \theta^{(t)})$.
 If $q(\cdot \mid \cdot)$ satisfies the reversibility condition,
then draws from the proposal distribution can
be used as draws from the target distribution.
 But, most often, we will have, for example,
$$\pi(\theta^{(t)})\, q(\theta^{(t+1)} \mid \theta^{(t)}) > \pi(\theta^{(t+1)})\, q(\theta^{(t)} \mid \theta^{(t+1)}).$$
 This indicates that we move from $\theta^{(t)}$ to
$\theta^{(t+1)}$ too often, and from $\theta^{(t+1)}$
to $\theta^{(t)}$ too rarely.
 To fix this problem, we introduce the probability
of moving, which is defined as follows:
$$p_{move} = \min\left\{ \frac{\pi(\theta^{(t+1)})\, q(\theta^{(t)} \mid \theta^{(t+1)})}{\pi(\theta^{(t)})\, q(\theta^{(t+1)} \mid \theta^{(t)})},\ 1 \right\}$$
 Now let’s describe how the simulation will be
done:
1. Simulate a candidate value $\theta^{(t+1)}$ from $q(\cdot \mid \theta^{(t)})$.
2. Compute the probability of moving to the
proposed position as follows:
$$p_{move} = \min\left\{ \frac{\pi(\theta^{(t+1)})\, q(\theta^{(t)} \mid \theta^{(t+1)})}{\pi(\theta^{(t)})\, q(\theta^{(t+1)} \mid \theta^{(t)})},\ 1 \right\}$$
3. After finding the probability of moving, we
decide whether to move to the proposed
position by generating a random uniform
number between 0 and 1.
4. We accept the proposed value if the
random uniform number is less than $p_{move}$;
otherwise, we stay at the current position.
5. Repeat the steps above until it is judged that a
sufficiently representative set of values has been
sampled.
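
The five steps above can be sketched as a small generic sampler. This is a minimal illustration, not the presentation's own code: the function name `mh_sample`, its arguments, the Gamma(3, 1) target, and the log-normal proposal are all assumptions chosen to show a proposal that is not symmetric, so that the full ratio with $q$ is needed.

```r
# Generic Metropolis-Hastings sampler (illustrative sketch, hypothetical names).
#   target(x)       : target density pi(.), known up to a constant
#   rprop(cur)      : draws one candidate given the current value
#   dprop(to, from) : proposal density q(to | from)
mh_sample <- function(target, rprop, dprop, start, n_iter) {
  draws <- numeric(n_iter + 1)
  draws[1] <- start
  n_accept <- 0
  for (t in seq_len(n_iter)) {
    cur  <- draws[t]
    cand <- rprop(cur)                                   # step 1: candidate
    ratio <- (target(cand) * dprop(cur, cand)) /
             (target(cur)  * dprop(cand, cur))
    p_move <- min(ratio, 1)                              # step 2: probability of moving
    if (runif(1) < p_move) {                             # steps 3 and 4: accept or stay
      draws[t + 1] <- cand
      n_accept <- n_accept + 1
    } else {
      draws[t + 1] <- cur
    }
  }                                                      # step 5: repeat
  list(draws = draws, accept_rate = n_accept / n_iter)
}

# Assumed example: kernel of a Gamma(shape = 3, rate = 1) density as the target,
# with a log-normal proposal centered (on the log scale) at the current value.
set.seed(1)
target <- function(x) x^2 * exp(-x)
rprop  <- function(cur) rlnorm(1, meanlog = log(cur), sdlog = 0.8)
dprop  <- function(to, from) dlnorm(to, meanlog = log(from), sdlog = 0.8)

out <- mh_sample(target, rprop, dprop, start = 1, n_iter = 20000)
mean(out$draws)      # should be close to 3, the mean of a Gamma(3, 1)
out$accept_rate
```

Because the log-normal proposal only generates positive candidates, the target never has to be evaluated outside its support in this example.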
III. How do we select the proposal
distribution
 The M-H algorithm is fairly easy to
implement. However, to make sure that our
representative values have converged to the
target distribution, we need a well-chosen proposal distribution.
 One practical approach to constructing the
proposal distribution is to use the
previously generated draw to simulate the
next draw. This way you explore the
neighborhood of the current draw. This
approach is called a random walk and is
implemented as follows: $\theta^{(t+1)} = \theta^{(t)} + \varepsilon$, where
$\varepsilon \sim q(\varepsilon)$.
 If $q$ is symmetric, $p_{move} = \min\left\{ \frac{\pi(\theta^{(t+1)})}{\pi(\theta^{(t)})},\ 1 \right\}$.
 If the candidate is drawn independently of the
current position in the chain, then $q(\theta^{(t+1)} \mid \theta^{(t)}) = q(\theta^{(t+1)})$
and the probability of moving is
$$p_{move} = \min\left\{ \frac{\pi(\theta^{(t+1)})\, q(\theta^{(t)})}{\pi(\theta^{(t)})\, q(\theta^{(t+1)})},\ 1 \right\}.$$
 Most often, the requirements that the proposal
distribution has to meet are:
1. The proposal distribution has to have
enough dispersion to lead to an exploration
of the entire domain of the target
distribution. The proposal distribution
should dominate the target in the tails.
2. Roberts, Gelman, and Gilks (1994) showed
that in the case of a random walk proposal, if
the target and the proposal distributions
are normal, then the scale of the proposal
should be chosen so that the acceptance
rate is approximately 0.45 in one-dimensional
problems and around 0.25 in multidimensional
problems. The acceptance rate is the
proportion of proposed draws that are
accepted.
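
The effect of the proposal scale on the acceptance rate can be explored with a short experiment. The sketch below assumes a standard normal target and a normal random-walk proposal; the function name `rw_accept_rate` and the particular scales tried are illustrative choices only.

```r
# Acceptance rate of a random-walk Metropolis sampler on a standard normal target,
# with proposal increments drawn from Normal(0, scale^2). Illustrative sketch.
rw_accept_rate <- function(scale, n_iter = 20000, start = 0) {
  cur <- start
  n_accept <- 0
  for (i in seq_len(n_iter)) {
    cand <- cur + rnorm(1, mean = 0, sd = scale)
    # Symmetric proposal: only the ratio of target densities is needed.
    p_move <- min(dnorm(cand) / dnorm(cur), 1)
    if (runif(1) < p_move) {
      cur <- cand
      n_accept <- n_accept + 1
    }
  }
  n_accept / n_iter
}

set.seed(2)
scales <- c(0.1, 1, 2.4, 10)
rates  <- sapply(scales, rw_accept_rate)
data.frame(scale = scales, accept_rate = round(rates, 2))
```

Very small scales accept almost every proposal but move slowly, while very large scales reject most proposals; an intermediate scale gives an acceptance rate near the one-dimensional guideline quoted above.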
IV. Application of the M-H algorithm:
Bayes
 For any statistical analysis, we first have to
define the statistical model for the data.
 The statistical model is often given in the form
of the probability distribution $f(y \mid \theta)$. When
viewed as a function of $\theta$ instead of $y$, this
distribution is called the likelihood and is written
as $L(\theta; y) = f(y \mid \theta)$.
In the Bayesian view, we assume that $\theta$ is
random. That is, based on our prior knowledge
of $\theta$, we assume a probability distribution
for $\theta$ that summarizes any information that
we have about it that is not contained in the
data.
This distribution is called the prior distribution or
just the prior, written $\pi(\theta)$. Our knowledge about $\theta$ is updated
after we take the data into account.
The distribution of $\theta$ given the data is called
the posterior and is the basis of all inferences
about $\theta$.
The posterior distribution is written as follows:
$$f(\theta \mid y) = \frac{f(y \mid \theta)\, \pi(\theta)}{h(y)} \propto f(y \mid \theta)\, \pi(\theta),$$
since $h(y) = \int f(y \mid \theta)\, \pi(\theta)\, d\theta$ is a
normalizing constant. The posterior distribution
is a true probability distribution that must integrate
to 1.
 One of the problems with Bayesian analysis is
deriving $h(y)$. Most often we cannot obtain a closed
form for $h(y)$. Furthermore, the posterior
distribution often does not belong to a known family
of distributions. Therefore, to obtain
representative values from the posterior distribution,
we use sampling techniques such as Metropolis-Hastings and Gibbs sampling. Here, we will
implement the Metropolis-Hastings algorithm.
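
To make the role of $h(y)$ concrete, the sketch below evaluates it numerically for a case where the answer is known in closed form. It assumes the data of the example in Section VI (9 wins in 13 games) and a uniform Beta(1, 1) prior; the variable names are illustrative.

```r
# Unnormalized posterior f(y | p) * pi(p) for z successes in N Bernoulli trials,
# under a uniform Beta(1, 1) prior (so pi(p) = 1 on [0, 1]).
z <- 9
N <- 13
kernel <- function(p) p^z * (1 - p)^(N - z)

# h(y) is the integral of the kernel over the parameter space.
h_numeric <- integrate(kernel, lower = 0, upper = 1)$value
h_exact   <- beta(z + 1, N - z + 1)   # closed form, available only in special cases
c(numeric = h_numeric, exact = h_exact)

# The M-H probability of moving uses only ratios of the kernel,
# so h(y) cancels and never has to be computed:
kernel(0.7) / kernel(0.5)
```

In one dimension the integral is easy to approximate, but with many parameters numerical integration becomes impractical, which is why a method that needs only kernel ratios is attractive.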
V. Bayesian Analysis
 To complete any Bayesian analysis, the
workload can be divided into four steps:
 Specify the probability distribution of your data
given the parameters in your model (likelihood).
 Based on your beliefs about the parameters in your
model, specify a prior distribution for
those parameters.
 Derive the posterior distribution, up to a normalizing
constant, as the product of the likelihood and the prior.
 Make inferences (mean, SD, median, highest
density interval (HDI)) about the parameters of
your model through the posterior distribution.
This can be done through simulation or
numerical derivation.
 The HDI is another way to summarize a
distribution.
 The HDI indicates which points of a distribution
are most credible.
 The HDI summarizes the distribution by
indicating an interval that spans most of the
distribution, say 95% of it, such that every
point inside it has higher credibility than any
point outside.
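
For a unimodal distribution with a known quantile function, the HDI is the shortest interval containing the required mass, so it can be found by a one-dimensional search. The sketch below is an illustration under that unimodality assumption; the helper name `hdi_from_quantile` and the Beta(10, 5) distribution are choices made for the example.

```r
# HDI from a quantile function (illustrative sketch, valid for unimodal densities).
# The lower tail probability is chosen to minimize the interval width.
hdi_from_quantile <- function(qfun, credMass = 0.95) {
  width <- function(lowTail) qfun(lowTail + credMass) - qfun(lowTail)
  best  <- optimize(width, interval = c(0, 1 - credMass))$minimum
  c(lower = qfun(best), upper = qfun(best + credMass))
}

# Assumed example distribution: Beta(10, 5).
qf  <- function(prob) qbeta(prob, shape1 = 10, shape2 = 5)
hdi <- hdi_from_quantile(qf)
eti <- qf(c(0.025, 0.975))          # equal-tailed 95% interval, for comparison

hdi
unname(diff(hdi)) <= diff(eti)      # the HDI is never wider than the equal-tailed interval
```

For a skewed distribution such as this one, the HDI is shifted toward the mode relative to the equal-tailed interval.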
VI. Example:
In this example, we will apply the Metropolis-Hastings algorithm to a Bernoulli trial. From
August 30, 2014 to December 17, 2014, the Huskers
football team played 13 games. Among those 13
games, they won 9 and lost 4. Let’s
assume that each game is a Bernoulli trial.
The goal is to estimate the probability
that the Huskers win a game.
 Let’s assume $y = (y_1, y_2, \ldots, y_n)$, $y_i \sim \text{Bernoulli}(1, p)$. Since $p$ is
in $[0,1]$, we will use $\text{Beta}(\alpha, \beta)$, where $\alpha, \beta$ are known
constants, as the prior distribution for $p$.
 The derivation of the posterior is given in the
appendix.
 The posterior distribution is
$$f(p \mid y) \propto p^{\sum_{i=1}^{n} y_i + \alpha - 1} (1 - p)^{n - \sum_{i=1}^{n} y_i + \beta - 1}$$
 We will use a random walk to generate our
sample values. The proposal distribution is a
normal distribution centered at zero with
standard deviation (SD) denoted $\sigma$ (known).
 The Metropolis-Hastings algorithm then
proceeds as follows.
 Start at an arbitrary initial value of $p$ (in the
valid range). This is the current value, denoted
$p_{cur}$. Then:
1) Randomly generate a proposed jump,
$\Delta p \sim \text{Normal}(0, \sigma^2)$, and denote
the proposed value of the parameter as
$p_{prop} = p_{cur} + \Delta p$.
2) Compute the probability of moving to the
proposed position as follows:
$$p_{move} = \min\left\{ 1,\ \frac{p_{prop}^{\,z+a-1} (1 - p_{prop})^{N-z+b-1}}{p_{cur}^{\,z+a-1} (1 - p_{cur})^{N-z+b-1}} \right\},$$
where $z = \sum_{i=1}^{n} y_i$ is the number of wins, $N = n$ is the
number of games, and $a$, $b$ stand for the prior parameters $\alpha$, $\beta$.
If the proposed value happens to fall
outside the range of $p$, then the prior and
the likelihood are set to zero.
3) After finding the probability of moving, we
decide whether to move to the proposed
position by generating a random uniform
number between 0 and 1.
4) We accept the proposed value if the
random uniform number is less than $p_{move}$;
otherwise, we stay at the current position.
5) Repeat the above steps until it is judged
that a sufficiently representative sample
has been generated.
> # Specify the data, to be used in the likelihood function.
> myData = c(rep(0,4),rep(1,9))
> # Define the target function, p(D|theta)*p(theta). For our application, this
> # target distribution is the unnormalized posterior distribution.
> # The argument theta could be a vector, not just a scalar.
> # target distribution or posterior distribution
> targetRelProb = function( theta , data, a, b ) {
    z = sum( data )
    N = length( data )
    pDataGivenTheta = theta^(z + a - 1) * (1-theta)^(N - z + b - 1)
    # The theta values passed into this function are generated at random,
    # and therefore might be inadvertently greater than 1 or less than 0.
    # The likelihood for theta > 1 or for theta < 0 is zero:
    pDataGivenTheta[ theta > 1 | theta < 0 ] = 0
    return( pDataGivenTheta )
  }
> run_metropolis_MCMC <- function(startvalue, iterations){
    nAccepted = 0
    nRejected = 0
    burnIn = ceiling( 0.0 * iterations )
    chain = matrix(NA, nrow = iterations + 1, ncol = 1)
    chain[1,] = startvalue
    for (i in 1:iterations){
      proposal = rnorm(1, mean = 0, sd = c(0.2))
      pmove = min(1, targetRelProb(chain[i,] + proposal, myData, a = 1, b = 1) /
                     targetRelProb(chain[i,], myData, a = 1, b = 1))
      if (runif(1) < pmove){
        chain[i+1,] = proposal + chain[i,]
        if ( i > burnIn ) { nAccepted = nAccepted + 1 }
      }else{
        chain[i+1,] = chain[i,]
        if ( i > burnIn ) { nRejected = nRejected + 1 }
      }
    }
    return(list(chain, nAccepted, nRejected, burnIn))
  }
> startvalue = c(0.2)
> set.seed(12345)
> chain = run_metropolis_MCMC(startvalue, 100000)
> nAccepted <- chain[[2]]
> trajectory <- chain[[1]]
> head(trajectory)
          [,1]
[1,] 0.2000000
[2,] 0.3171058
[3,] 0.5583403
[4,] 0.5583403
[5,] 0.5583403
[6,] 0.6843600
> # burnIn is the fourth element of the returned list (the third is nRejected).
> burnIn <- chain[[4]]
> # Extract the post-burnIn portion of the trajectory.
> acceptedTraj = trajectory[ (burnIn+1) : dim(trajectory)[1], ]
> head(acceptedTraj)
[1] 0.2000000 0.3171058 0.5583403 0.5583403 0.5583403 0.6843600
> trajLength = length(acceptedTraj)
> Mean = mean(acceptedTraj)
> Median <- median(acceptedTraj)
> SD <- sd(acceptedTraj)
> densCurve = density( acceptedTraj , adjust=2 )
> Mode = densCurve$x[which.max(densCurve$y)]
> names <- c("Mean", "Median", "Mode", "SD")
> Sum.Stat <- c(Mean, Median, Mode, SD)
> Summary <- data.frame(names, Sum.Stat)
> Summary
   names  Sum.Stat
1   Mean 0.6685017
2 Median 0.6759711
3   Mode 0.6963898
4     SD 0.1171666
> HDIofMCMC = function( sampleVec , credMass=0.95 ) {
    # Computes highest density interval from a sample of representative values,
    #   estimated as shortest credible interval.
    # Arguments:
    #   sampleVec
    #     is a vector of representative values from a probability distribution.
    #   credMass
    #     is a scalar between 0 and 1, indicating the mass within the credible
    #     interval that is to be estimated.
    # Value:
    #   HDIlim is a vector containing the limits of the HDI
    sortedPts = sort( sampleVec )
    ciIdxInc = ceiling( credMass * length( sortedPts ) )
    nCIs = length( sortedPts ) - ciIdxInc
    ciWidth = rep( 0 , nCIs )
    for ( i in 1:nCIs ) {
      ciWidth[ i ] = sortedPts[ i + ciIdxInc ] - sortedPts[ i ]
    }
    HDImin = sortedPts[ which.min( ciWidth ) ]
    HDImax = sortedPts[ which.min( ciWidth ) + ciIdxInc ]
    HDIlim = c( HDImin , HDImax )
    return( HDIlim )
  }
> HDI <- matrix(NA, ncol = 2, nrow = 1)
> colnames(HDI) <- c("HDI_Lower", "HDI_Upper")
> Parameters <- "Theta"
> HDI <- data.frame(Parameters, HDI)
> HDI[1,2:3] <- HDIofMCMC(sampleVec = acceptedTraj, credMass=0.95)
> HDI
  Parameters HDI_Lower HDI_Upper
1      Theta 0.4394185 0.8891786
> layout( matrix(1:3,ncol=3) )
> par(mar=c(3,4,2,1),mgp=c(2,0.7,0))
> library(coda)
Loading required package: lattice
Warning messages:
1: package ‘coda’ was built under R version 3.0.3
2: package ‘lattice’ was built under R version 3.0.3
> border <- "skyblue"
> col <- "skyblue"
> histinfo = hist( acceptedTraj , freq=F, border=border ,
xlab = Parameters, main=bquote( list( "SD" == .(round(sd(acceptedTraj),3)) ,
"Median" == .(round(median(acceptedTraj),3)),
"Mean" == .(round(mean(acceptedTraj),1)) ) ))
> lines( densCurve$x , densCurve$y , type="l" , lwd=2, col = "red" )
> cenTendHt = 0.9*max(histinfo$density)
> cvHt = 0.7*max(histinfo$density)
> ROPEtextHt = 0.55*max(histinfo$density)
> # Display central tendency:
> mn = Mean
> med = Median
> mo = Mode
> text( mo , cenTendHt ,
bquote(mode ==.(signif(mo,3))) , adj=c(.5,0) , cex=1.5 )
> # Display the HDI.
> credMass <- 0.95
> cex <- 1.5
> HDItextPlace=0.7
> lines( HDI[1,2:3] , c(0,0) , lwd=4 )
> text( mean(as.numeric(HDI[1,2:3])) , 0 , bquote(.(100*credMass) * "% HDI" )
, adj=c(.5,-1.7) , cex=cex )
> text( HDI[1,2:3][1] , 0 , bquote(.(signif(HDI[1,2:3][1],3))) ,
adj=c(HDItextPlace,-0.5) , cex=cex )
> text( HDI[1,2:3][2] , 0 , bquote(.(signif(HDI[1,2:3][2],3))) ,
adj=c(1.0-HDItextPlace,-0.5) , cex=cex )
> #------------------------------------------------------------------------------
> # Comparing our simulated draws to the exact distribution.
> #------------------------------------------------------------------------------
> shape1 <- sum(myData) + 1
> shape2 <- length(myData) - sum(myData) + 1
> plot.ecdf(x = acceptedTraj, verticals = TRUE, do.p = FALSE,
    lwd = 2, panel.first = grid(), ylab = "Probability",
    xlab = "Theta", col = "red", main = "EDF")
> abline(h = c(0,1))
> curve(expr = pbeta(q = x, shape1 = shape1, shape2 = shape2), add = T,
col = "blue", lwd = 2)
> legend(x = 0.01, y = 1, legend = c("Metropolis", "Beta"), bty = "n", lty = 1,
    col = c("red", "blue"))
> # PDF
> hist(x = acceptedTraj, freq = FALSE, main = "", col = "red", ylim = c(0,4))
> curve(expr = dbeta(x = x, shape1 = shape1, shape2 = shape2), add = T,
    col = "blue", lwd = 2)
> legend(x = 0.2, y = 4, legend = c("Metropolis", "Beta"), bty = "n", lty = 1,
    col = c("red", "blue"))
Appendix
Derivation of the posterior distribution: Let’s
assume $y = (y_1, y_2, \ldots, y_n)$, $y_i \sim \text{Bernoulli}(1, p)$. Since $p$ is in
$[0,1]$, we will use $\text{Beta}(\alpha, \beta)$, where $\alpha, \beta$ are known constants, as
the prior distribution for $p$. The posterior
distribution is derived as follows:
$$f(p \mid y) = \frac{\left[\prod_{i=1}^{n} f(y_i \mid p)\right] \pi(p)}{h(y)} \propto \left[\prod_{i=1}^{n} f(y_i \mid p)\right] \pi(p) = \left[\prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i}\right] \frac{p^{\alpha-1} (1-p)^{\beta-1}}{\mathrm{B}(\alpha, \beta)}$$
$$f(p \mid y) \propto p^{\sum_{i=1}^{n} y_i + \alpha - 1} (1-p)^{n - \sum_{i=1}^{n} y_i + \beta - 1}$$
This is the kernel of a beta distribution. So the
posterior distribution is a $\text{Beta}\left(\sum_{i=1}^{n} y_i + \alpha,\ n - \sum_{i=1}^{n} y_i + \beta\right)$
distribution. Since the posterior distribution is a
distribution that we know, we can make inferences
through exact formulas for the statistics of interest. For
instance,
$$E(p \mid y) = \frac{\sum_{i=1}^{n} y_i + \alpha}{n + \alpha + \beta}, \quad \text{mode} = \frac{\sum_{i=1}^{n} y_i + \alpha - 1}{n + \alpha + \beta - 2}, \quad \text{var} = \frac{\left(\sum_{i=1}^{n} y_i + \alpha\right)\left(n - \sum_{i=1}^{n} y_i + \beta\right)}{(n + \alpha + \beta)^2 (n + \alpha + \beta + 1)}.$$
For this example,
we see that the posterior and the prior distributions
come from the same family of distributions (beta).
We say that the beta distribution is a conjugate
prior for the Bernoulli (and binomial) likelihood.
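
As a check on these formulas, the exact posterior summaries for the Huskers data can be computed directly. The sketch below assumes the uniform prior used in the example code (`a = 1`, `b = 1`); the variable names are illustrative.

```r
# Exact posterior summaries for the Bernoulli example with a Beta(alpha, beta) prior.
y <- c(rep(0, 4), rep(1, 9))     # 4 losses, 9 wins
alpha    <- 1                    # prior parameters assumed as in the example code
beta_par <- 1
n <- length(y)
s <- sum(y)

shape1 <- s + alpha              # first posterior shape parameter
shape2 <- n - s + beta_par       # second posterior shape parameter

post_mean <- shape1 / (shape1 + shape2)
post_mode <- (shape1 - 1) / (shape1 + shape2 - 2)
post_var  <- shape1 * shape2 / ((shape1 + shape2)^2 * (shape1 + shape2 + 1))

round(c(mean = post_mean, mode = post_mode, sd = sqrt(post_var)), 4)
#   mean   mode     sd
# 0.6667 0.6923 0.1179
```

These exact values are close to the simulated mean, mode, and SD reported in the example (0.6685, 0.6964, and 0.1172).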
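Now let’s assume that
$y = (y_1, y_2, \ldots, y_n)$, $y_i \sim \text{Poisson}(\theta)$, $\theta \sim \text{Normal}(\mu, \sigma^2)$, where $\mu, \sigma^2$ are known hyperparameters.
We calculate the posterior distribution as follows:
$$f(\theta \mid y) = \frac{\left[\prod_{i=1}^{n} f(y_i \mid \theta)\right] \pi(\theta)}{h(y)} \propto \left[\prod_{i=1}^{n} f(y_i \mid \theta)\right] \pi(\theta) = \left[\prod_{i=1}^{n} \frac{e^{-\theta}\, \theta^{y_i}}{y_i!}\right] \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(\theta - \mu)^2}{2\sigma^2}}$$
$$f(\theta \mid y) \propto e^{-\left(n\theta + \frac{(\theta - \mu)^2}{2\sigma^2}\right)}\, \theta^{\sum_{i=1}^{n} y_i}$$
This is not the kernel of a known distribution.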
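
Because this kernel is not a known distribution, it is a natural candidate for the random-walk sampler described earlier. The sketch below is only an illustration: the data `y`, the hyperparameters `mu` and `sigma`, and the proposal SD of 0.8 are hypothetical values, and the kernel is evaluated on the log scale as a numerical convenience.

```r
# Random-walk Metropolis for the Poisson likelihood with a normal prior (sketch).
y     <- c(3, 5, 2, 4, 6, 3)     # hypothetical counts
mu    <- 3                       # hypothetical prior mean
sigma <- 2                       # hypothetical prior SD

# Log of the posterior kernel: -n*theta + sum(y)*log(theta) - (theta - mu)^2 / (2 sigma^2).
# A Poisson rate must be positive, so the kernel is zero (log = -Inf) otherwise.
log_kernel <- function(theta) {
  if (theta <= 0) return(-Inf)
  -length(y) * theta + sum(y) * log(theta) - (theta - mu)^2 / (2 * sigma^2)
}

set.seed(3)
n_iter <- 20000
draws <- numeric(n_iter + 1)
draws[1] <- mean(y)              # start in a region of high posterior density
for (t in seq_len(n_iter)) {
  cand <- draws[t] + rnorm(1, mean = 0, sd = 0.8)
  log_ratio <- log_kernel(cand) - log_kernel(draws[t])
  # Accept when a uniform draw falls below the kernel ratio (compared on the log scale).
  draws[t + 1] <- if (log(runif(1)) < log_ratio) cand else draws[t]
}

c(mean = mean(draws), sd = sd(draws))
```

Non-positive candidates are always rejected, which plays the same role as setting the target to zero outside $[0,1]$ in the Bernoulli example.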
References:
Books:
- Jim Albert: Bayesian Computation with R,
second edition;
- John K. Kruschke: Doing Bayesian Data Analysis:
A tutorial with R, JAGS, and Stan, second
edition
- Bradley P. Carlin, Thomas A. Louis: Bayesian
Methods for Data Analysis
Paper:
- Siddhartha Chib and Edward Greenberg:
Understanding the Metropolis-Hastings Algorithm.
The American Statistician, November 1995, Vol. 49,
No. 4.