Introduction to Markov Chain
Monte Carlo
Fall 2013
By Yaohang Li, Ph.D.
Review
• Last Class
– Linear Operator Equations
– Monte Carlo
• This Class
– Markov Chain
– Metropolis Method
• Next Class
– Presentations
Markov Chain
• Markov Chain
– Consider chains whose transitions occur at discrete times
• States S_1, S_2, …
• X_t is the state that the chain is in at time t
• Conditional probability
– P(X_t = S_j | X_{t_1} = S_{i_1}, X_{t_2} = S_{i_2}, …, X_{t_n} = S_{i_n})
– The system is a Markov Chain if the distribution of X_t is
independent of all previous states except its immediate
predecessor X_{t-1}
– P(X_t = S_j | X_1 = S_{i_1}, X_2 = S_{i_2}, …, X_{t-1} = S_{i_{t-1}}) = P(X_t = S_j | X_{t-1} = S_{i_{t-1}})
Characteristics of Markov Chain
• Irreducible Chain
• Aperiodic Chain
• Stationary Distribution
– the Markov Chain gradually forgets its initial state
– and eventually converges to a unique stationary distribution
• also called the invariant distribution
• Ergodic average
\bar{f} = \frac{1}{n - m} \sum_{t = m+1}^{n} f(X_t)
Target Distribution
• Target Distribution Function
– π(x) = c·e^{−h(x)}
• h(x)
– in physics, the potential function
– in other systems, the score function
• c
– normalization constant
» makes sure the integral of π(x) is 1
• Presumably, all pdfs can be written in this form
Metropolis Method
• Basic Idea
– Evolving a Markov process to achieve the sampling of π
• Metropolis Algorithm
– Start with any configuration x_0 and iterate the following two
steps
– Step 1: Propose a random “perturbation” of the current state,
i.e., x_t -> x’, where x’ can be seen as generated from a
symmetric probability transition function Q(x_t -> x’)
• i.e., Q(x -> x’) = Q(x’ -> x)
• Calculate the change Δh = h(x’) − h(x)
– Step 2: Generate a random number u ~ U[0, 1). Let x_{t+1} = x’ if
u ≤ e^{−Δh}; otherwise x_{t+1} = x_t
Simple Example for Hard-Shell Balls
• Simulation
– Uniformly distributed positions of K hard-shell balls in a box
– These balls are assumed to have equal diameter d
• (X, Y) ={(xi, yi), i=1, …, K}
– denote the positions of the balls
– Target Distribution π(X, Y)
• = constant if the balls are all in the box and have no overlaps
• = 0 otherwise
• Metropolis algorithm
– (a) pick a ball at random, say, its position is (xi, yi)
– (b) move it to a tentative position (x_i’, y_i’) = (x_i + ε_1, y_i + ε_2), where ε_1, ε_2 are
normally distributed
– (c) accept the proposal if it doesn’t violate the constraints; otherwise keep
the current configuration
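A sketch of this Metropolis move in Python; the box side L, diameter d, ball count K, displacement scale, and the `valid` helper are illustrative choices of mine, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
L, d, K = 10.0, 1.0, 8       # illustrative box side, ball diameter, ball count
sigma = 0.3                  # std of the normal displacements (ε1, ε2); illustrative

def valid(pos):
    """Illustrative helper: all balls inside the box and no pair overlapping."""
    if np.any(pos < d / 2) or np.any(pos > L - d / 2):
        return False
    for i in range(K):
        for j in range(i + 1, K):
            if np.linalg.norm(pos[i] - pos[j]) < d:
                return False
    return True

# Start from a valid configuration: balls on a coarse grid.
pos = np.array([[1.5 + 2.5 * (i % 3), 1.5 + 2.5 * (i // 3)] for i in range(K)], float)
assert valid(pos)

for step in range(10_000):
    i = rng.integers(K)                        # (a) pick a ball at random
    trial = pos.copy()
    trial[i] += sigma * rng.normal(size=2)     # (b) tentative move (x_i+ε1, y_i+ε2)
    if valid(trial):                           # (c) accept only if constraints hold;
        pos = trial                            #     otherwise keep the current state
```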
Simulation of K-Hardballs
Hastings’ Generalization
• Metropolis Method
– A symmetric transition rule
• Q(x->x’)=Q(x’->x)
• Hastings’ Generalization
– An arbitrary transition rule
• Q(x->x’)
– Q() is called a proposal function
Metropolis-Hastings Method
• Metropolis-Hastings Method
– Given current state x_t
– Draw a candidate y ~ Q(x_t -> y), draw U ~ U[0, 1), and update
x_{t+1} = \begin{cases} y, & \text{if } U \le \alpha(x_t, y) \\ x_t, & \text{otherwise} \end{cases}
• α(x, y) = min{1, π(y)Q(y -> x) / (π(x)Q(x -> y))}
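A runnable sketch of one Metropolis-Hastings chain in Python, assuming an illustrative Exp(1) target (π(x) ∝ e^{−x}, x > 0) and an asymmetric multiplicative proposal, so the correction ratio Q(y -> x)/Q(x -> y) actually matters; the helper names are mine:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.5                       # proposal scale; illustrative

def log_pi(x):
    # Illustrative target: pi(x) ∝ exp(-x) on x > 0, i.e., an Exp(1) density.
    return -x

def log_q(x, y):
    # Log-density (up to constants that cancel) of the asymmetric proposal
    # y = x * exp(sigma * Z), Z ~ N(0, 1): a lognormal centered on x.
    z = np.log(y / x)
    return -np.log(y) - 0.5 * (z / sigma) ** 2

x, total, n = 1.0, 0.0, 50_000
for t in range(n):
    y = x * np.exp(sigma * rng.normal())       # draw candidate from Q(x -> y)
    # alpha(x, y) = min{1, pi(y) Q(y -> x) / (pi(x) Q(x -> y))}, in logs:
    log_alpha = (log_pi(y) + log_q(y, x)) - (log_pi(x) + log_q(x, y))
    if log_alpha >= 0 or rng.uniform() < np.exp(log_alpha):
        x = y
    total += x

print("sample mean ≈ 1 for Exp(1):", total / n)
```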
Detailed Balance Condition
• Transition Kernel for the Metropolis-Hastings Algorithm
P(x_{t+1} \mid x_t) = Q(x_t \to x_{t+1})\,\alpha(x_t, x_{t+1}) + I(x_t = x_{t+1})\left[1 - \int Q(x_t \to y)\,\alpha(x_t, y)\,dy\right]
– I(.) denotes the indicator function
• Taking the value 1 when its argument is true, and 0 otherwise
– First term arises from acceptance of a candidate y=xt+1
– Second term arises from rejection, for all possible candidates y
• Detailed Balance Equation
– \pi(x_t)\,P(x_{t+1} \mid x_t) = \pi(x_{t+1})\,P(x_t \mid x_{t+1})
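A short verification, standard though not shown on the slide, that the Metropolis-Hastings kernel above satisfies this equation:

```latex
% For x_t \ne x_{t+1} only the acceptance term of the kernel contributes:
\begin{align*}
\pi(x_t)\,P(x_{t+1}\mid x_t)
 &= \pi(x_t)\,Q(x_t \to x_{t+1})\,
    \min\!\left\{1,\;
      \frac{\pi(x_{t+1})\,Q(x_{t+1}\to x_t)}{\pi(x_t)\,Q(x_t\to x_{t+1})}\right\}\\
 &= \min\!\left\{\pi(x_t)\,Q(x_t\to x_{t+1}),\;
                 \pi(x_{t+1})\,Q(x_{t+1}\to x_t)\right\}\\
 &= \pi(x_{t+1})\,P(x_t \mid x_{t+1}),
\end{align*}
% since the middle line is symmetric in x_t and x_{t+1};
% the case x_t = x_{t+1} holds trivially.
```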
Gibbs Sampling
• Gibbs Sampler
– A special MCMC scheme
– The underlying Markov Chain is constructed by using a
sequence of conditional distributions, chosen so that
π is invariant with respect to each of these “conditional” moves
• Gibbs Sampling
– Definition
• X_{-i} = {X_1, …, X_{i-1}, X_{i+1}, …, X_n}
– Proposal Distribution
• Updating the i-th component of X
• Q_i(X_i -> Y_i, X_{-i}) = π(Y_i | X_{-i})
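A minimal Gibbs sampler sketch in Python, assuming an illustrative bivariate normal target with correlation ρ = 0.8, whose full conditionals π(Y_i | X_{-i}) are the known normals given in the comments:

```python
import numpy as np

rng = np.random.default_rng(4)
rho = 0.8                  # illustrative correlation of the bivariate normal target
n = 50_000

x1, x2 = 0.0, 0.0
samples = np.empty((n, 2))
for t in range(n):
    # Full conditionals of a standard bivariate normal with correlation rho:
    #   X1 | X2 = x2  ~  N(rho * x2, 1 - rho^2), and symmetrically for X2.
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.normal()
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal()
    samples[t] = (x1, x2)

print("empirical correlation ≈ 0.8:", np.corrcoef(samples.T)[0, 1])
```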
MCMC Simulations
• Two Phases
– Equilibration
• Try to reach equilibrium distribution
• Measurements of steps before equilibrium are discarded
– Production
• After reaching equilibrium
• Measurements become meaningful
• Question
– How fast can the simulation reach equilibrium?
Autocorrelations
• Given a time series of N measurements A_1, …, A_N
from a Markov process
• Estimator of the expectation value is
\bar{A} = \frac{1}{N} \sum_{t=1}^{N} A_t
• Autocorrelation function
– Definition
C(t) = \frac{\langle A_s A_{s+t} \rangle - \langle A \rangle^2}{\langle A^2 \rangle - \langle A \rangle^2}
• where ⟨·⟩ denotes the average over the time series
Behavior of Autocorrelation Function
• Autocorrelation function
– Asymptotic behavior for large t:
C(t) \sim e^{-t/\tau_{\exp}}
– \tau_{\exp} is called the (exponential) autocorrelation time
– \tau_{\exp} is related to the second largest eigenvalue of the transition
matrix
– Special case of the autocorrelations
Integrated Autocorrelation Time
• Variance of the estimator
\mathrm{var}(\bar{A}) \approx \frac{2\tau_{\mathrm{int}}}{N}\,\left(\langle A^2 \rangle - \langle A \rangle^2\right)
• Integrated autocorrelation time
\tau_{\mathrm{int}} = \frac{1}{2} + \sum_{t=1}^{\infty} C(t)
• When \tau_{\mathrm{int}} \ll N, the N correlated measurements are worth roughly N / (2\tau_{\mathrm{int}}) independent ones
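A Python sketch of these estimators, using a finite cutoff t_max in place of the infinite sum (an illustrative truncation; the helper names are mine). It is checked on an AR(1) series whose autocorrelation C(t) = φ^t is known exactly:

```python
import numpy as np

def autocorrelation(a, t_max):
    """Illustrative estimator of C(t); t_max truncates the lag range."""
    a = np.asarray(a, float)
    mean, var = a.mean(), a.var()
    return np.array([1.0 if t == 0 else
                     np.mean((a[:-t] - mean) * (a[t:] - mean)) / var
                     for t in range(t_max)])

def tau_int(a, t_max=200):
    """tau_int = 1/2 + sum_{t>=1} C(t), truncated at t_max (illustrative cutoff)."""
    c = autocorrelation(a, t_max)
    return 0.5 + c[1:].sum()

# Check on an AR(1) series, where C(t) = phi^t exactly.
rng = np.random.default_rng(5)
phi, n = 0.9, 200_000
a = np.empty(n)
a[0] = 0.0
for t in range(1, n):
    a[t] = phi * a[t - 1] + rng.normal()

print("estimated tau_int:", tau_int(a))
print("exact tau_int    :", 0.5 + phi / (1 - phi))   # = 9.5 for phi = 0.9
```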
Reaching Equilibrium
• How many steps to reach equilibrium?
Example of Autocorrelation
• Target Distribution
Small Step Size (d = 1)
Medium Step Size (d = 4)
Large Step Size (d = 8)
Microstate and Macrostate
• Macrostate
– Characterized by the following fixed values
• N: number of particles
• V: volume
• E: total energy
• Microstate
– A configuration through which the specified macrostate (E, V, N) can be
realized
– accessible
• a microstate is accessible if its properties are consistent with
the specified macrostate
Ensemble
• System of N particles characterized by macro variables:
N, V, E
– macro state refers to a set of these variables
• There are many micro states which give the same values
of {N,V,E} or macro state.
– micro states refer to points in phase space
• All these micro states constitute an ensemble
Microcanonical Ensemble
• Isolated System
– N particles in volume V
– Total energy is conserved
– External influences can be ignored
• Microcanonical Ensemble
– The set of all micro states corresponding to the macro state
with value N,V,E is called the Microcanonical ensemble
• Generate Microcanonical Ensemble
– Start with an initial micro state
– Use the demon algorithm to produce the other micro states
The Demon Algorithm
• An extra degree of freedom, the demon, goes to every particle and
exchanges energy with it
• Demon Algorithm
– For each Monte Carlo step (for j=1 to mcs)
– for i=1 to N
• Choose a particle at random and make a trial change in its position
• Compute E, the change in the energy of the system due to the change
• If E<=0, the system gives the amount |E| to the demon, i.e., Ed=EdE, and the trial configuration is accepted
• If E>0 and the demon has sufficient energy for this change (Ed>= E),
the demon gives the necessary energy to the system, i.e., Ed=Ed- E,
and the trial configuration is accepted. Otherwise, the trial configuration
is rejected and the configuration is not changed
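A sketch of the demon algorithm in Python for the 1-D ideal gas of the later slides; trial changes are made to velocities, since the ideal-gas energy is purely kinetic, and N, the sweep count, and the step size dv are illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(6)
N, mcs, dv = 100, 5_000, 0.5      # illustrative particle count, sweeps, max velocity change
v = np.ones(N)                    # initial micro state: all velocities equal (m = 1)
E_demon = 0.0
demon_trace = []

for sweep in range(mcs):          # one Monte Carlo step per particle per sweep
    for _ in range(N):
        i = rng.integers(N)                      # choose a particle at random
        v_new = v[i] + dv * rng.uniform(-1, 1)   # trial change (in velocity, since
        dE = 0.5 * (v_new**2 - v[i]**2)          # the ideal-gas energy is kinetic)
        if dE <= 0 or E_demon >= dE:             # demon absorbs or supplies the energy
            v[i] = v_new
            E_demon -= dE                        # E_d = E_d - ΔE; total E is conserved
    demon_trace.append(E_demon)

# For the 1-D ideal gas, <E_d> = kT while the mean KE per particle is kT/2.
print("mean demon energy <E_d>      :", np.mean(demon_trace[mcs // 2:]))
print("2 x mean KE per particle (kT):", np.mean(v**2))
```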
Monte Carlo Step
• In the process of changing the micro state, the program
attempts to change the state of each particle. This is
called a sweep, or a Monte Carlo Step per particle (mcs)
• In each mcs the demon tries to change the energy of each
particle once.
• mcs provides a useful unit of ‘time’
MD and MC
• Molecular Dynamics
– simulates a system of many particles with E, V, and N fixed by integrating Newton’s
equations of motion for each particle
– time-averaged value of the physical quantities of interest
• Monte Carlo
– Sampling the ensemble
– ergodic-averaged value of the physical quantities of interest
• How can we know that the Monte Carlo simulation of the
microcanonical ensemble yields results equivalent to the time-averaged results of molecular dynamics?
– Ergodic hypothesis
• have not been shown identical in general
• have been found to yield equivalent results in all cases of interest
One-Dimensional Classical Ideal Gas
• Ideal Gas
– The energy of a configuration is independent of the positions
of the particles
– The total energy is the sum of the kinetic energies of the
individual particles
• Interesting physical quantity
– velocity
• A Java Program of the 1-D Classical Ideal Gas
– http://electron.physics.buffalo.edu/gonsalves/phy411506_spring01/Chapter16/feb23.html
– Using the demon algorithm
Physical Interpretation of Demon
• Demon may be thought of as a thermometer
• Simple MC simulation of the ideal gas shows
– Mean demon energy is twice the mean kinetic energy per gas particle
• The ideal gas and demon may be thought of
– as a heat bath (the gas) characterized by temperature T
• related to its average kinetic energy
– and a subsystem, or laboratory system (the demon)
– the temperature of the demon is that of the bath
• The demon is in macro state T
– the canonical ensemble of its microstates is given by the values of E_d
Canonical ensemble
• Normally a system is not isolated.
– surrounded by a much bigger system
– exchanges energy with it.
• The composite system of laboratory system and
surroundings may be considered isolated.
• Analogy:
– lab system <=> demon
– surroundings <=> ideal gas
• The surroundings have temperature T, which also
characterizes the macro state of the lab system
Boltzmann distribution
• In the canonical ensemble the demon’s energy fluctuates about a mean
energy <E_d>
• Probability that the demon has energy E_d is given by the
– Boltzmann distribution
• Proved in statistical mechanics
• Shown by the output of the demon energy
P(E_d) = \frac{1}{Z}\, e^{-E_d / kT}
• Mean demon energy
\langle E_d \rangle = \frac{\int_0^\infty E\, e^{-E/kT}\, dE}{\int_0^\infty e^{-E/kT}\, dE} = kT
Phase transitions
• Examples:
– Gas - liquid, liquid - solid
– magnets, pyroelectrics
– superconductors, superfluids
• Below a certain temperature Tc the state of the
system changes structure
• Characterized by order parameter
– zero above Tc and non zero below Tc
– e.g. magnetisation M in magnets, gap in superconductors
Ising model
Consider a set of spins localized on the sites of a lattice.
Each spin s_i is allowed to point either up (s_i = +1) or down (s_i = −1).
Each spin i interacts with an external field B_i and with one
another via an exchange interaction J_ij. The simplest form of the
Hamiltonian is:
H = -J \sum_{\langle ij \rangle} s_i s_j - B \sum_{i=1}^{N} s_i
We have assumed isotropic J and uniform B.
⟨ij⟩ denotes that only nearest-neighbour pairs are to be counted.
For J > 0, the lowest energy or ground state is where all spins
point up. Thus positive J gives a ferromagnetic coupling.
For J < 0, an antiferromagnetic ground state is expected.
Why Ising model?
• Simplest model which exhibits a phase transition in two
or more dimensions
• Can be mapped to models of lattice gas and binary alloy.
• Exactly solvable in one and two dimensions
• No kinetic energy to complicate things
• Theoretically and computationally tractable
– can make dedicated ‘Ising machine’
2-D Ising Model
• Aligned neighbouring spins: E = −J
• Anti-aligned neighbouring spins: E = +J
Physical Quantity
• Energy
– Average Energy <E>
– Mean Square Energy Fluctuation <(ΔE)^2> = <E^2> − <E>^2
• Magnetization M
– Given by
M = \sum_{i=1}^{N} s_i
– Mean square magnetization fluctuation <(ΔM)^2> = <M^2> − <M>^2
• Temperature (Microcanonical)
kT/J = \frac{4}{\ln(1 + 4J / \langle E_d \rangle)}
Simulation of Ising Model
• Microcanonical Ensemble of Ising Model
– Demon Algorithm
• Canonical Ensemble of Ising Model
– Metropolis Algorithm
Metropolis for Ising Model
1. Establish an initial microstate.
2. Make a random trial change in the microstate.
3. Compute ΔE = E_trial − E_old, the change in energy due to the trial change.
4. If ΔE ≤ 0, accept the new microstate and go to step 8.
5. If ΔE > 0, compute w = exp(−ΔE / kT).
6. Generate a random number r in the unit interval.
7. If r ≤ w, accept the new microstate; otherwise retain the old state.
8. Determine the values of the desired physical quantities.
9. Repeat steps (2) through (8).
10. Periodically compute averages over microstates.
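A sketch of these steps in Python for the 2-D Ising model with B = 0, using the periodic boundaries of the next slide; the lattice size and temperature are illustrative choices (k = 1):

```python
import numpy as np

rng = np.random.default_rng(7)
L, T, J = 16, 2.0, 1.0            # illustrative lattice size and temperature (k = 1)
spins = rng.choice([-1, 1], size=(L, L))

def delta_E(s, i, j):
    """Energy change from flipping spin (i, j); periodic boundaries, B = 0."""
    nn = s[(i + 1) % L, j] + s[(i - 1) % L, j] + s[i, (j + 1) % L] + s[i, (j - 1) % L]
    return 2 * J * s[i, j] * nn

for sweep in range(2_000):
    for _ in range(L * L):
        i, j = rng.integers(L), rng.integers(L)   # step 2: random trial flip
        dE = delta_E(spins, i, j)                 # step 3
        # steps 4-7: accept if dE <= 0, otherwise with probability w = exp(-dE/kT)
        if dE <= 0 or rng.uniform() < np.exp(-dE / T):
            spins[i, j] *= -1

# Below Tc ≈ 2.269, |M|/N should be large (about 0.91 at T = 2.0 for a big lattice).
print("magnetization per spin:", spins.sum() / L**2)
```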
Simulating Infinite System
• Simulating Infinite System
– Periodic Boundaries
Energy Landscape
• Energy Landscape
– Complex biological or physical systems have
complicated and rough energy landscapes
Local Minima and Global Minima
• Local Minima
– The physical system may transiently reside in a local minimum
• Global Minimum
– The most stable physical/biological configuration
– Usually the native configuration
• Goal of Global Optimization
– Escape the traps of local minima
– Converge to the global minimum
Local Optimization
• Local Optimization
– Quickly reach a local minimum
– Approaches
• Gradient-based Method
• Quasi-Newton Method
• Powell Method
• Hill-climbing Method
• Simplex
Rosenbrock Function
[Figure: log-scale contour plot of the Rosenbrock function f(x, y) for x, y ∈ [−1.5, 1.5], marking the global minimum]
• Global Optimization
– Find the global minimum
– More difficult than local optimization
Consequences of the Occasional Ascents
• Desired effect
– Helps escape the local optima
• Adverse effect
– Might pass the global optimum after reaching it
(easy to avoid by keeping track of the best-ever state)
Temperature is the Key
• The probability of accepting energy-increasing moves changes as the temperature is raised
Boltzmann distribution
• At thermal equilibrium at temperature T, the
Boltzmann distribution gives the relative
probability that the system will occupy state A
vs. state B as:
\frac{P(A)}{P(B)} = \exp\!\left(-\frac{E(A) - E(B)}{T}\right) = \frac{\exp(-E(A)/T)}{\exp(-E(B)/T)}
• where E(A) and E(B) are the energies
associated with states A and B.
Real annealing:
Sword
• The swordsmith heats the metal, then
slowly cools it as he hammers
the blade into shape.
– If he cools the blade too
quickly the metal will form
patches of different
composition;
– If the metal is cooled slowly
while it is shaped, the
constituent metals will form a
uniform alloy.
Simulated Annealing Algorithm
• Idea: Escape local extrema by allowing “bad moves,” but
gradually decrease their size and frequency.
– proposed by Kirkpatrick et al. 1983
• Simulated Annealing Algorithm
– A modification of the Metropolis Algorithm
• Start with any configuration x_0 and iterate the following steps
• Initialize T to a high temperature
• Step 1: Propose a random “perturbation” of the current state, i.e., x_t ->
x’, where x’ can be seen as generated from a symmetric probability
transition function Q(x_t -> x’)
– i.e., Q(x -> x’) = Q(x’ -> x)
– Calculate the change Δh = h(x’) − h(x)
• Step 2: Generate a random number u ~ U[0, 1). Let x_{t+1} = x’ if u ≤ e^{−Δh/T};
otherwise x_{t+1} = x_t
• Step 3: Slightly reduce T and go to Step 1, until T reaches your target
temperature
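A sketch of this algorithm in Python; the rough potential h, initial temperature, target temperature, and geometric cooling factor are all illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(8)

def h(x):
    # Hypothetical rough potential with several local minima (illustration only).
    return 0.1 * x**2 + np.sin(3 * x)

x, best = 4.0, 4.0
T, T_target, cooling = 10.0, 1e-3, 0.999     # illustrative schedule
while T > T_target:
    x_prop = x + rng.normal()                # Step 1: symmetric perturbation
    dh = h(x_prop) - h(x)
    if dh <= 0 or rng.uniform() <= np.exp(-dh / T):   # Step 2: accept if u <= e^{-Δh/T}
        x = x_prop
    if h(x) < h(best):                       # keep track of the best-ever state
        best = x
    T *= cooling                             # Step 3: slightly reduce T

print("best x found:", best, "  h(best):", h(best))
```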
Note on simulated annealing: limit cases
• Boltzmann distribution: accept a “bad move” with ΔE < 0 (goal is to
maximize E) with probability P(ΔE) = exp(ΔE/T)
• If T is large:
– ΔE < 0, so ΔE/T < 0 and very small in magnitude
– exp(ΔE/T) close to 1
– accept bad moves with high probability -> random walk
• If T is near 0:
– ΔE < 0, so ΔE/T < 0 and very large in magnitude
– exp(ΔE/T) close to 0
– accept bad moves with low probability -> deterministic down-hill
Simulated Tempering
• Basic Idea
– Allow the temperature to go up and down randomly
– Proposed by Geyer, 1993
– Can be applied to a very rough energy landscape
• Simulated Tempering Algorithm
– At the current sampler i, update x using a Metropolis-Hastings update for
π_i(x).
– Set j = i ± 1 according to probabilities p_{i,j}, where p_{1,2} = p_{m,m−1} = 1.0
and p_{i,i+1} = p_{i,i−1} = 0.5 if 1 < i < m.
– Calculate the equivalent of the Metropolis-Hastings ratio for the ST
method,
r_{ST} = \frac{c(j)\,\pi_j(x)\,p_{j,i}}{c(i)\,\pi_i(x)\,p_{i,j}}
– where the c(i) are tunable constants; accept the transition from i to j with
probability min(r_{ST}, 1). The set {c(i)} is called the pseudo-prior
because the function c(i)π_i(x) formally resembles the product of
likelihood and prior.
Parallel Tempering
• Basic Idea
– Allow multiple walks at different temperature levels
– Switch configuration between temperature levels
• Based on the Metropolis-Hastings Ratio
– Also called the replica exchange method
– Very powerful in complicated energy landscapes
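A sketch of parallel tempering in Python, reusing the hypothetical rough potential from the annealing sketch; the temperature ladder is an arbitrary example, and adjacent-level swaps are accepted with the standard replica-exchange ratio:

```python
import numpy as np

rng = np.random.default_rng(9)

def h(x):
    # Same hypothetical rough potential as in the annealing sketch.
    return 0.1 * x**2 + np.sin(3 * x)

temps = np.array([0.1, 0.5, 2.0, 8.0])   # illustrative temperature ladder
x = rng.normal(size=len(temps))          # one walker per temperature level

for step in range(20_000):
    # Local Metropolis move at every temperature level.
    for i, T in enumerate(temps):
        prop = x[i] + rng.normal()
        dh = h(prop) - h(x[i])
        if dh <= 0 or rng.uniform() < np.exp(-dh / T):
            x[i] = prop
    # Attempt to swap configurations between a random adjacent pair of levels,
    # accepted with the standard replica-exchange (Metropolis-Hastings) ratio.
    i = rng.integers(len(temps) - 1)
    log_r = (1 / temps[i] - 1 / temps[i + 1]) * (h(x[i]) - h(x[i + 1]))
    if log_r >= 0 or rng.uniform() < np.exp(log_r):
        x[i], x[i + 1] = x[i + 1], x[i]

print("coldest-level sample:", x[0], "  h:", h(x[0]))
```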
Multi-Transition Metropolis
• Multi-Transition Metropolis
– Allow steps of different size
– Proposed by J. Liu
• Basic Idea
– Large step sizes can escape deep local minima traps more
easily
– However, their acceptance rate is much lower
Accelerated Simulated Tempering
• Accelerated Simulated Tempering [Li, Protopopescu, Gorin, 2004]
– A modified scheme of simulated tempering
• Allow larger transition steps at high temperature level
– Large transition steps at high temperature level
– Small transition steps at low temperature level
[Figure: rugged energy function landscape]
• Lean the ladder
– Bias the transition probability between temperature levels
– Favor the low temperature levels
Hybrid PT/SA Method
• Hybrid PT/SA Method [Li, Protopopescu, Nikita, Gorin, 2009]
– System Setup
• Subsystem
– x_i at different temperature levels
• Composite System
– Macrostate X = { x_1, x_2, …, x_N }
– Transition Types
• Parallel Local Transitions
– Local Monte Carlo moves at a specific temperature level
w_{Local}(x_i^t \to x_i^{t+1}) = e^{-\beta_i \Delta E} = e^{-\beta_i \left(E(x_i^{t+1}) - E(x_i^t)\right)}
– Acceptance Probability
\min\!\left(1,\ w_{Local}(x_i^t \to x_i^{t+1})\right)
• Replica Exchange
– Replica exchange between two randomly chosen temperature levels
x_i^{t+1} \leftrightarrow x_{i+1}^{t+1}
– Acceptance Probability
\min\!\left(1,\ e^{-\beta_i E(x_{i+1}^{t+1}) - \beta_{i+1} E(x_i^{t+1}) + \beta_{i+1} E(x_{i+1}^{t+1}) + \beta_i E(x_i^{t+1})}\right)
– Cooling Scheme
• Examine the Macrostate for Equilibrium
– Boltzmann Distribution of the Subsystems
P(E(x_i)) = \frac{\exp(-E(x_i)/T_i)}{\sum_j \exp(-E(x_j)/T_j)}
Application of Global Optimization
VLSI Floorplan Design
• VLSI Design
– Goal: placing circuit modules on the chip floor without overlaps
Summary
• Markov Chain
• Markov Chain Monte Carlo
• Metropolis Method
• Hastings’ Generalization
• Detailed Balance Condition
• Gibbs Sampler
• Autocorrelation Function and Integrated Autocorrelation
Time
What I Want You to Do
• Review the slides
• Review basic probability/statistics concepts
• Prepare for your presentation topic and term paper
• Work on your Assignment 5