Sampling - Translational Neuromodeling Unit

BAYESIAN INFERENCE
Sampling techniques
Andreas Steingötter
Motivation & Background
Exact inference is intractable, so we have to resort
to some form of approximation
Motivation & Background
• Variational Bayes
– a deterministic approximation, not exact in principle
Alternative approximation:
• Perform inference by numerical sampling, also
known as Monte Carlo techniques.
Motivation & Background
The posterior distribution p(z) is required (primarily) for the
purpose of evaluating expectations E(f):

$$E(f) = \int f(z)\, p(z)\, dz$$

• f(z) are predictions made by a model with parameters z
• If p(z) is the parameter prior and f(z) = p(y|z) the likelihood, then E(f) is the marginal likelihood (evidence) of the model
Motivation & Background
$$E(f) = \int f(z)\, p(z)\, dz$$

Classical Monte Carlo approximation:

$$\hat{f} = \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)})$$

The z^(l) are random (not necessarily independent) draws from p(z); the estimate converges to the right answer in the limit of a large number of samples L (see the sketch below).
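A minimal sketch of this estimator in R (matching the "Implementation in R" examples later in these slides); the choice p(z) = N(0,1) and f(z) = z², with true expectation E(f) = 1, is an illustrative assumption:

```r
# Classical Monte Carlo estimate of E(f) = integral of f(z) p(z) dz.
# Illustrative choice: p(z) = N(0, 1) and f(z) = z^2, so E(f) = 1.
set.seed(1)
L <- 10000
z <- rnorm(L)        # L independent draws z^(l) from p(z)
f_hat <- mean(z^2)   # (1/L) * sum over l of f(z^(l))
f_hat                # approaches 1 as L grows
```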
Motivation & Background
$$\hat{f} = \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)})$$

If the z^(l) are independent draws from p(z), then low
numbers of samples suffice to estimate the expectation.
Problems:
1. How to obtain independent samples from p(z)?
2. The expectation may be dominated by regions of small
probability → large sample sizes will be required to
achieve sufficient accuracy
3. Monte Carlo ignores the values of z^(l) when forming
the estimate
How to do sampling?
1. Basic sampling algorithms
– Restricted mainly to 1-/2-dimensional problems
2. Markov chain Monte Carlo
– Very general and powerful framework
Basic sampling
Special cases
1. Model with directed graph
– Ancestral sampling:

$$p(z) = \prod_{i=1}^{M} p(z_i \mid \mathrm{pa}_i)$$

• Easy sampling of the joint distribution: draw each node in topological order, conditioned on its already-sampled parents (see the sketch after this list)
– Logic sampling:
• Compare the sampled value for z_i with the observed value at
node i. If they do NOT agree, discard all previous samples
and start again with the first node
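A minimal ancestral-sampling sketch in R; the two-node graph z1 → z2 and its distributions are illustrative assumptions, not a model from the slides:

```r
# Ancestral sampling for a toy directed graph z1 -> z2:
# p(z) = p(z1) * p(z2 | z1), with nodes sampled in topological order.
set.seed(1)
L  <- 5000
z1 <- rbinom(L, size = 1, prob = 0.3)         # root node: p(z1)
z2 <- rnorm(L, mean = ifelse(z1 == 1, 2, 0))  # child node: p(z2 | z1)
# Each pair (z1[l], z2[l]) is one joint sample from p(z1, z2).
```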
Random sampling
• Computers can generate only pseudorandom
numbers
– Correlation of successive values
– Lack of uniformity of the distribution
– Poor dimensional distribution of the output sequence
– Distances between where certain values occur are
distributed differently from those in a truly random
sequence
Random sampling from the
Uniform Distribution
• Assumption:
– a good pseudo-random generator for uniformly
distributed data is implemented
• Alternative: http://www.random.org
– “true” random numbers with randomness coming
from atmospheric noise
Random sampling from a standard
non-uniform distribution
Goal: Sample from a non-uniform distribution p(y)
which is a standard distribution, i.e.
given in analytical form
Suppose: we have uniformly distributed random
numbers from (0,1)
Solution: Transform random numbers z over (0,1)
using a function which is the inverse of the
indefinite integral of the desired distribution
Random sampling from a standard non-uniform distribution
• Step 1: Calculate the cumulative distribution function

$$z = h(y) = \int_{-\infty}^{y} p(x)\, dx$$

• Step 2: Transform samples z ~ U(0,1) by

$$y = h^{-1}(z)$$

(see the sketch below)
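A sketch in R for the exponential distribution, where both the cumulative distribution function h and its inverse are available in closed form (the rate parameter is an illustrative assumption):

```r
# Inverse-transform sampling for the exponential distribution.
# CDF: z = h(y) = 1 - exp(-lambda * y);  inverse: y = -log(1 - z) / lambda.
set.seed(1)
lambda <- 2
z <- runif(10000)            # uniform samples on (0, 1)
y <- -log(1 - z) / lambda    # y = h^{-1}(z) follows the desired distribution
hist(y, breaks = 50, freq = FALSE)                      # sampled histogram
curve(dexp(x, rate = lambda), add = TRUE, col = "red")  # true density
```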
Rejection sampling
Suppose:
– direct sampling from p(z) is difficult, but
– p(z) can be evaluated for any given value of z up to
some normalization constant Z_p:

$$p(z) = \frac{1}{Z_p}\, \tilde{p}(z)$$

– Z_p is unknown, p̃(z) can be evaluated
Approach:
– Define a simple proposal distribution q(z) such that
kq(z) ≥ p̃(z) for all z.
Rejection sampling
• Simple visual example:
[Figure: envelope kq(z) lying everywhere above the unnormalized distribution p̃(z)]
• The constant k should be as small as possible.
• The fraction of rejected points depends on the ratio of the
area under the unnormalized distribution p̃(z) to the
area under the curve kq(z).
Rejection sampling
• Rejection sampler
– Generate two random numbers:
1. a number z_0 from the proposal distribution q(z)
2. a number u_0 from the uniform distribution
over [0, kq(z_0)]
– If u_0 > p̃(z_0), reject!
– The remaining pairs are uniformly distributed under
p̃(z) (see the sketch below)
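A minimal rejection sampler in R; the Beta(2,5)-shaped unnormalized target and the uniform proposal are illustrative assumptions:

```r
# Rejection sampling: unnormalized target p_tilde(z) = z * (1 - z)^4
# (a Beta(2, 5) kernel), proposal q(z) = Uniform(0, 1), envelope k * q(z).
set.seed(1)
p_tilde <- function(z) z * (1 - z)^4
k <- optimize(p_tilde, c(0, 1), maximum = TRUE)$objective  # k*q(z) >= p_tilde(z)
draw_one <- function() {
  repeat {
    z0 <- runif(1)        # 1. draw z0 from the proposal q(z)
    u0 <- runif(1, 0, k)  # 2. draw u0 uniformly over [0, k * q(z0)]
    if (u0 <= p_tilde(z0)) return(z0)  # accept; otherwise reject and retry
  }
}
samples <- replicate(10000, draw_one())
```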
Adaptive rejection sampling
Suppose: it is difficult to determine a suitable analytic form
for the proposal distribution q(z)
Approach: construct the envelope function "on the fly",
based on observed values of the distribution p(z)
– if p(z) is log concave (ln p(z) has non-increasing
derivatives), use the derivatives to construct the envelope
Adaptive rejection sampling
• Step 1: at an initial set of grid points z_1, …, z_M evaluate
the function ln p(z_i) and its gradient, and calculate the
tangents at p(z_i), i = 1, …, M.
• Step 2: sample from the envelope distribution; if accepted,
use the sample to calculate p(z), otherwise refine the grid.
– The envelope distribution is a piecewise exponential distribution

$$q(z) = k_i \lambda_i \exp\{-\lambda_i (z - z_{i-1})\}, \qquad z_{i-1} < z < z_i$$

with slope λ_i and offset k_i.
Adaptive rejection sampling
Problems of rejection sampling:
• Finding a proposal distribution q(z) that is close to the
required distribution, so that the rejection rate is small, is difficult.
• Therefore restricted mainly to univariate
distributions → curse of dimensionality
• However: a potential subroutine within more general methods
Importance sampling
• Framework for approximating expectations
E_p(f(z)) directly with respect to p(z)
• Does NOT provide samples from p(z)

$$E(f) = \int f(z)\, p(z)\, dz \qquad\qquad \hat{f} = \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)})$$

Suppose (again):
– direct sampling from p(z) is difficult, but
– p(z) can be evaluated for any given value of z up
to some normalization constant Z_p.
Importance sampling
• As for rejection sampling, apply a proposal distribution
q(z) from which it is easy to draw samples
Importance sampling
• Expectation formula for unnormalized distributions, with
importance weights r_l:

$$E(f) \simeq \sum_{l=1}^{L} w_l\, f(z^{(l)}), \qquad w_l = \frac{r_l}{\sum_{m=1}^{L} r_m} = \frac{\tilde{p}(z^{(l)})/q(z^{(l)})}{\sum_{m=1}^{L} \tilde{p}(z^{(m)})/q(z^{(m)})}$$
Key points:
– Importance weights correct the bias introduced by sampling from
the proposal distribution
– Dependence on how well q(z) approximates p(z) (similar to
rejection sampling)
• Choose sample points in regions of input space where f(z)p(z) is large (or at
least where p(z) is large)
• Wherever p(z) > 0, q(z) > 0 is necessary
(see the sketch below)
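A self-normalized importance-sampling sketch in R; the target p = N(0,1) evaluated only up to Z_p, the wider Gaussian proposal q, and f(z) = z² are illustrative assumptions:

```r
# Importance sampling: estimate E_p(f) using draws from the proposal q.
# Illustrative choice: p_tilde(z) = exp(-z^2 / 2) (N(0,1) up to Z_p),
# q = N(0, 2^2), f(z) = z^2, so the true expectation is 1.
set.seed(1)
L <- 10000
z <- rnorm(L, sd = 2)                  # draws z^(l) from q(z)
r <- exp(-z^2 / 2) / dnorm(z, sd = 2)  # r_l = p_tilde(z^(l)) / q(z^(l))
w <- r / sum(r)                        # normalized importance weights w_l
sum(w * z^2)                           # weighted estimate of E(f), about 1
```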
Importance sampling
Attention:
• Consider the case where none of the samples falls in the regions where
f(z)p(z) is large.
• In that case, the apparent variances of r_l and r_l f(z^(l)) may
be small even though the estimate of the expectation is
severely wrong.
• Hence a major drawback of the importance sampling method
is the potential to produce results that are arbitrarily in error,
with no diagnostic indication.
➢ q(z) should NOT be small where p(z) may be significant!!!
Markov Chain Monte Carlo (MCMC)
sampling
• MCMC is a general framework for sampling from a large
class of distributions that scales well with the dimensionality
of the sample space
Goal: Generate samples from a distribution p(z)
Idea: Build a machine which uses the current sample
to decide which sample to produce next, in such a way
that the overall distribution of the samples is p(z).
Markov Chain Monte Carlo (MCMC)
sampling
Approach:
1. Generate a candidate sample z* from a proposal distribution
q(z|z^(τ)) that depends on the current state z^(τ) and is
sufficiently simple to draw samples from directly.
2. The current sample z^(τ) is known (i.e. maintain a record of the
current state).
3. The samples z^(1), z^(2), z^(3), … form a Markov chain.
4. Accept or reject the candidate sample z* according to an
appropriate criterion:

$$z^{(\tau+1)} = \begin{cases} z^{*} & \text{if accepted} \\ z^{(\tau)} & \text{if rejected} \end{cases}$$
MCMC - Metropolis algorithm
Suppose:
– p(z) can be evaluated for any given value of z up
to some normalization constant Z_p:

$$p(z) = \frac{1}{Z_p}\, \tilde{p}(z)$$

Algorithm:
• Step 1: Choose a symmetric proposal distribution, q(z_A|z_B) =
q(z_B|z_A)
• Step 2: The candidate sample z* is accepted with probability

$$A(z^{*}, z^{(\tau)}) = \min\!\left(1, \frac{\tilde{p}(z^{*})}{\tilde{p}(z^{(\tau)})}\right)$$
MCMC - Metropolis algorithm
Algorithm (cont.):
• Step 2.1: Choose a random number u with uniform
distribution in (0,1)
• Step 2.2: Acceptance test: accept if u < p̃(z*)/p̃(z^(τ))
• Step 3:

$$z^{(\tau+1)} = \begin{cases} z^{*} & \text{if accepted} \\ z^{(\tau)} & \text{if rejected} \end{cases}$$
Metropolis algorithm
Notes:
• Rejection of a point leads to the previous sample
being included again (different from rejection sampling)
• If q(z_A|z_B) > 0 for all values z_A, z_B, then the distribution of z^(τ)
tends to p(z) for τ → ∞
• z^(1), z^(2), z^(3), … are not independent
samples from p(z) (serial correlation). Instead,
retain only every Mth sample (see the sketch below).
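A minimal random-walk Metropolis sketch in R; the correlated bivariate Gaussian target (a stand-in for the elliptical example on the next slides), the initialization range and the step size are illustrative assumptions:

```r
# Random-walk Metropolis for a correlated bivariate Gaussian target,
# evaluated only up to its normalization constant Z_p.
set.seed(1)
rho <- 0.9
log_p_tilde <- function(z) {  # log of the unnormalized target
  -(z[1]^2 - 2 * rho * z[1] * z[2] + z[2]^2) / (2 * (1 - rho^2))
}
n <- 15000; step <- 0.5
z <- matrix(NA, n, 2)
z[1, ] <- runif(2, -2, 2)     # initialization in [-2, 2]
accepted <- 0
for (tau in 1:(n - 1)) {
  z_star <- z[tau, ] + rnorm(2, sd = step)  # symmetric Gaussian proposal
  # accept with probability min(1, p_tilde(z*) / p_tilde(z^(tau)))
  if (log(runif(1)) < log_p_tilde(z_star) - log_p_tilde(z[tau, ])) {
    z[tau + 1, ] <- z_star
    accepted <- accepted + 1
  } else {
    z[tau + 1, ] <- z[tau, ]  # rejection keeps the previous sample
  }
}
accepted / (n - 1)            # acceptance rate, tuned via `step`
```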
Examples: Metropolis algorithm
Implementation in R:
• Elliptical distribution
[Figure: accept/reject flow: if u < p̃(z*)/p̃(z^(τ)), update the state, z^(τ+1) = z*; otherwise keep the old state, z^(τ+1) = z^(τ)]
Examples: Metropolis algorithm
Implementation in R:
• Initialization [-2,2], step size = 0.3
[Figure: sampled chains after n = 1500 and n = 15000 iterations]
Examples: Metropolis algorithm
Implementation in R:
• Initialization [-2,2], step size = 0.5
[Figure: sampled chains after n = 1500 and n = 15000 iterations]
Examples: Metropolis algorithm
Implementation in R:
• Initialization [-2,2], step size = 1
[Figure: sampled chains after n = 1500 and n = 15000 iterations]
Validation of MCMC
Properties of Markov chains:

$$z^{(1)} \rightarrow z^{(2)} \rightarrow \cdots \rightarrow z^{(m)} \rightarrow z^{(m+1)}$$

First-order Markov property:

$$p(z^{(m+1)} \mid z^{(1)}, \ldots, z^{(m)}) = p(z^{(m+1)} \mid z^{(m)}) = T_m(z^{(m)}, z^{(m+1)})$$

Transition probabilities T_m(z', z):
If T_m is the same for all m → homogeneous

Invariant (stationary) distribution:

$$p^{*}(z) = \sum_{z'} T(z', z)\, p^{*}(z')$$
Validation of MCMC
Properties of Markov chains: the invariant distribution p*(z):
If T_m is the same for all m → homogeneous

$$p^{*}(z) = \sum_{z'} T(z', z)\, p^{*}(z')$$

Sufficient condition: detailed balance. If the T_m satisfy

$$p^{*}(z)\, T(z, z') = p^{*}(z')\, T(z', z)$$

then the chain is reversible, and p*(z) is invariant: summing both sides over z' gives
$\sum_{z'} p^{*}(z')\, T(z', z) = p^{*}(z) \sum_{z'} T(z, z') = p^{*}(z)$.
Validation of MCMC
Goal: an invariant Markov chain that converges to the
desired distribution p*(z),

$$p^{*}(z) = \lim_{m \to \infty} p(z^{(m)}) \quad \text{for any } p(z^{(0)}),$$

a property called ergodicity.
• An ergodic Markov chain has only one equilibrium
distribution
Properties and validation of MCMC
Approach: Construct appropriate transition
probabilities T(z', z):
• Build T(z', z) from a set of base transitions B_k
– Mixture form, with mixing coefficients α_k:

$$T(z', z) = \sum_{k=1}^{K} \alpha_k B_k(z', z)$$

– Successive application of the base transitions
Metropolis-Hastings algorithm
• Generalization of the Metropolis algorithm
– No symmetric proposal distribution q(z) is required; the candidate z* is accepted with probability

$$A(z^{*}, z^{(\tau)}) = \min\!\left(1, \frac{\tilde{p}(z^{*})\, q(z^{(\tau)} \mid z^{*})}{\tilde{p}(z^{(\tau)})\, q(z^{*} \mid z^{(\tau)})}\right)$$

– If the proposal is symmetric, this reduces to the Metropolis algorithm
– The choice of proposal distribution is critical
Metropolis-Hastings algorithm
• Gaussian proposal centered on the current state:
• Small variance → high acceptance rate, but a slow walk and
dependent samples
• Large variance → high rejection rate
(see the sketch below)
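A Metropolis-Hastings sketch in R with an asymmetric log-normal random-walk proposal, where the Hastings correction q(z^(τ)|z*)/q(z*|z^(τ)) matters; the Gamma-shaped target and the proposal width are illustrative assumptions:

```r
# Metropolis-Hastings with an asymmetric log-normal proposal, for a
# positive-support target; illustrative: p_tilde(z) = z^2 * exp(-z),
# the unnormalized Gamma(3, 1) density (mean 3).
set.seed(1)
p_tilde <- function(z) z^2 * exp(-z)
q_dens  <- function(to, from) dlnorm(to, meanlog = log(from), sdlog = 0.5)
n <- 10000
z <- numeric(n); z[1] <- 1
for (tau in 1:(n - 1)) {
  z_star <- rlnorm(1, meanlog = log(z[tau]), sdlog = 0.5)  # propose z*
  # A = min(1, p_tilde(z*) q(z^tau | z*) / (p_tilde(z^tau) q(z* | z^tau)))
  A <- (p_tilde(z_star) * q_dens(z[tau], z_star)) /
       (p_tilde(z[tau]) * q_dens(z_star, z[tau]))
  z[tau + 1] <- if (runif(1) < A) z_star else z[tau]
}
mean(z)  # approaches the Gamma(3, 1) mean of 3
```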
Gibbs sampling
• Special case of the Metropolis-Hastings algorithm
– the random value is always accepted
Suppose: p(z_1, z_2, z_3)
• Step 1: initial samples z_1^(τ), z_2^(τ), z_3^(τ)
• Step 2: (repeated)
– z_1^(τ+1) ~ p(z_1 | z_2^(τ), z_3^(τ))
– z_2^(τ+1) ~ p(z_2 | z_1^(τ+1), z_3^(τ))
– z_3^(τ+1) ~ p(z_3 | z_1^(τ+1), z_2^(τ+1))
• repeat by cycling through the variables, or
• randomly choose the variable to be updated
(see the sketch below)
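A Gibbs-sampling sketch in R for a bivariate Gaussian with correlation ρ, whose full conditionals are univariate Gaussians; the target and the value of ρ are illustrative assumptions:

```r
# Gibbs sampling for a bivariate standard Gaussian with correlation rho:
# p(z1 | z2) = N(rho * z2, 1 - rho^2), and symmetrically for z2 | z1.
set.seed(1)
rho <- 0.9
n <- 10000
z <- matrix(0, n, 2)
for (tau in 1:(n - 1)) {
  z[tau + 1, 1] <- rnorm(1, rho * z[tau, 2],     sqrt(1 - rho^2))  # z1 | z2
  z[tau + 1, 2] <- rnorm(1, rho * z[tau + 1, 1], sqrt(1 - rho^2))  # z2 | z1
}
cor(z[, 1], z[, 2])  # approaches rho
```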
Gibbs sampling
• The marginal p(z\i) is invariant (unchanged), because z\i is fixed at each step
• The univariate conditional distribution p(z_i|z\i) is
invariant (by definition)
➢ The joint distribution p(z) is invariant
Gibbs sampling
➢ Sufficient condition for ergodicity:
– None of the conditional distributions may be
anywhere zero, i.e. any point in z space can be
reached from any other point in a finite number of
steps
Gibbs sampling
Obtain m independent samples:
1. Sample the MCMC chain during a «burn-in» period to
remove the dependence on initial values
2. Then, sample at set time points (e.g. every Mth
sample)
• The Gibbs sequence converges to a stationary
(equilibrium) distribution that is independent of
the starting values.
• By construction this stationary distribution is the
target distribution we are trying to simulate.
Gibbs sampling
• Practicability depends on the feasibility of drawing
samples from the conditional distributions p(z_i|z\i).
• Directed graphs will lead to conditional
distributions for Gibbs sampling that are log
concave.
➢ Adaptive rejection sampling methods can be used as a subroutine