BAYESIAN INFERENCE
Sampling techniques
Andreas Steingötter

Motivation & Background
Exact inference is intractable, so we have to resort to some form of approximation.

Motivation & Background
• Variational Bayes: a deterministic approximation, not exact even in principle.
Alternative approximation:
• Perform inference by numerical sampling, also known as Monte Carlo techniques.

Motivation & Background
The posterior distribution p(z) is required (primarily) for the purpose of evaluating expectations E(f):

    E(f) = ∫ f(z) p(z) dz

• f(z) are predictions made by the model with parameters z.
• Special case: if p(z) is the parameter prior and f(z) = p(y|z) is the likelihood, then E(f) evaluates the marginal likelihood (evidence) for a model.

Motivation & Background

    E(f) = ∫ f(z) p(z) dz  ≈  f̂ = (1/L) ∑_{l=1}^{L} f(z^{(l)})     (classical Monte Carlo approximation)

The z^{(l)} are random (not necessarily independent) draws from p(z); the estimate converges to the right answer in the limit of a large number of samples L.

Motivation & Background
If the z^{(l)} are independent draws from p(z), then low numbers of samples suffice to estimate the expectation.
Problems:
1. How to obtain independent samples from p(z)?
2. The expectation may be dominated by regions of small probability -> large sample sizes will be required to achieve sufficient accuracy.
3. Plain Monte Carlo weights all draws equally: it ignores the locations of the values z^{(l)} when forming the estimate.

How to do sampling?
1. Basic sampling algorithms
   – restricted mainly to 1-/2-dimensional problems
2. Markov chain Monte Carlo
   – a very general and powerful framework

Basic sampling
Special case: models with a directed graph.
– Ancestral sampling: easy sampling of the joint distribution via

    p(z) = ∏_{i=1}^{M} p(z_i | pa_i)

– Logic sampling (with observed nodes): compare the sampled value for z_i with the observed value at node i. If they do NOT agree, discard the whole sample and restart with the first node.

Random sampling
• Computers can generate only pseudorandom numbers:
  – correlation of successive values
  – lack of uniformity of the distribution
  – poor dimensional distribution of the output sequence
  – distances between occurrences of particular values are distributed differently from those in a truly random sequence

Random sampling from the uniform distribution
• Assumption: a good pseudorandom generator for uniformly distributed data is implemented.
• Alternative: http://www.random.org – "true" random numbers, with the randomness coming from atmospheric noise.

Random sampling from a standard non-uniform distribution
Goal: sample from a non-uniform distribution p(y) which is a standard distribution, i.e. given in analytical form.
Suppose we have uniformly distributed random numbers from (0,1).
Solution: transform the random numbers z over (0,1) using the function which is the inverse of the indefinite integral of the desired distribution.

Random sampling from a standard non-uniform distribution
• Step 1: Calculate the cumulative distribution function

    z = h(y) = ∫_{−∞}^{y} p(x) dx

• Step 2: Transform samples z ~ U(0,1) by y = h^{−1}(z).
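To make the two steps concrete, here is a minimal R sketch of inverse-transform sampling (illustrative only; the exponential target with rate lambda is an assumed example, not from the slides):

    set.seed(1)
    lambda <- 2                  # rate of the assumed exponential target
    z <- runif(10000)            # uniform draws z on (0, 1)
    # CDF of Exp(lambda): h(y) = 1 - exp(-lambda * y)
    y <- -log(1 - z) / lambda    # Step 2: y = h^{-1}(z) follows Exp(lambda)
    mean(y)                      # sanity check against the known mean 1/lambda

The same pattern works for any standard distribution whose cumulative distribution function can be inverted in closed form.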
Rejection sampling
Suppose:
– direct sampling from p(z) is difficult, but
– p(z) can be evaluated for any given value of z up to some normalization constant Z_p:

    p(z) = (1/Z_p) p̃(z)

– Z_p is unknown, but p̃(z) can be evaluated.
Approach:
– Define a simple proposal distribution q(z) and a constant k such that k q(z) ≥ p̃(z) for all z.

Rejection sampling
• Simple visual example: the scaled proposal k q(z) forms an envelope over the unnormalized target p̃(z).
• The constant k should be as small as possible.
• The fraction of rejected points depends on the ratio of the area under the unnormalized distribution p̃(z) to the area under the curve k q(z).

Rejection sampling
• Rejection sampler – generate two random numbers:
  1. a number z_0 from the proposal distribution q(z)
  2. a number u_0 from the uniform distribution over [0, k q(z_0)]
– If u_0 > p̃(z_0), reject!
– The remaining pairs are distributed uniformly under the curve p̃(z), so the accepted z_0 are draws from p(z). (A code sketch follows after the importance sampling discussion below.)

Adaptive rejection sampling
Suppose it is difficult to determine a suitable analytic form for the proposal distribution q(z).
Approach: construct the envelope function "on the fly", based on observed values of the distribution p(z).
– If p(z) is log concave (ln p(z) has non-increasing derivatives), use the derivatives to construct the envelope.

Adaptive rejection sampling
• Step 1: At an initial set of grid points z_1, ..., z_M, evaluate the function ln p(z_i) and its gradient, and calculate the tangents at p(z_i), i = 1, ..., M.
• Step 2: Sample from the envelope distribution; if the sample is accepted, use it to calculate p(z), otherwise refine the grid.
– The envelope distribution is a piecewise exponential distribution,

    q(z) = k_i λ_i exp{−λ_i (z − z_{i−1})},   z_{i−1} < z ≤ z_i,

  with slope λ_i and offset k_i.

Adaptive rejection sampling
Problem of rejection sampling:
• One must find a proposal distribution q(z) which is close to the required distribution, to minimize the rejection rate.
• Rejection sampling is therefore restricted mainly to univariate distributions -> curse of dimensionality.
• However, it remains a potential subroutine within more general methods.

Importance sampling
• A framework for approximating expectations E_p(f(z)) directly with respect to p(z).
• It does NOT provide samples from p(z) itself.

    E(f) = ∫ f(z) p(z) dz  ≈  f̂ = (1/L) ∑_{l=1}^{L} f(z^{(l)})

Suppose (again):
– direct sampling from p(z) is difficult, but
– p(z) can be evaluated for any given value of z up to some normalization constant Z_p.

Importance sampling
• As for rejection sampling, apply a proposal distribution q(z) from which it is easy to draw samples.

Importance sampling
• Expectation formula for unnormalized distributions with importance weights w_l:

    E(f) ≈ ∑_{l=1}^{L} w_l f(z^{(l)}),   w_l = r_l / ∑_{m=1}^{L} r_m,   r_l = p̃(z^{(l)}) / q(z^{(l)})

Key points:
– The importance weights correct the bias introduced by sampling from the proposal distribution.
– Accuracy depends on how well q(z) approximates p(z) (similar to rejection sampling).
• Choose sample points in input space where f(z) p(z) is large, or at least where p(z) is large.
• Wherever p(z) > 0, q(z) > 0 is also necessary.

Importance sampling
Attention:
• Consider the case where none of the samples falls in the regions where f(z) p(z) is large.
• In that case, the apparent variances of the w_l and of the products w_l f(z^{(l)}) may be small even though the estimate of the expectation is severely wrong.
• Hence a major drawback of the importance sampling method is the potential to produce results that are arbitrarily in error, with no diagnostic indication.
⇒ q(z) should NOT be small where p(z) may be significant!!!
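As an illustration of the rejection sampler above, a minimal R sketch (the unnormalized target p̃, the Gaussian proposal and the constant k are assumed examples, not from the slides):

    p_tilde <- function(z) exp(-z^2 / 2) * (2 + sin(5 * z))  # unnormalized target
    q_dens  <- function(z) dnorm(z)                          # proposal density q(z)
    k <- 8   # p_tilde(z) <= 3*exp(-z^2/2) = 3*sqrt(2*pi)*q(z) ~ 7.52*q(z), so k*q >= p_tilde
    rejection_sample <- function(n) {
      out <- numeric(0)
      while (length(out) < n) {
        z0 <- rnorm(1)                           # 1. draw z0 from q(z)
        u0 <- runif(1, 0, k * q_dens(z0))        # 2. draw u0 uniformly on [0, k q(z0)]
        if (u0 <= p_tilde(z0)) out <- c(out, z0) # keep z0 unless u0 > p_tilde(z0)
      }
      out
    }
    x <- rejection_sample(5000)

And a matching sketch of self-normalized importance sampling for the same assumed target, estimating E(f) with f(z) = z^2 as an example:

    f  <- function(z) z^2                  # function whose expectation we want
    Ln <- 10000
    zl <- rnorm(Ln, 0, 2)                  # draws z^(l) from proposal q = N(0, 2^2)
    r  <- p_tilde(zl) / dnorm(zl, 0, 2)    # ratios r_l = p_tilde / q
    w  <- r / sum(r)                       # normalized importance weights w_l
    f_hat <- sum(w * f(zl))                # E(f) ~ sum_l w_l f(z^(l))

Note that only the ratios p̃/q enter, so the normalization constant Z_p is never needed in either sketch.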
Markov chain Monte Carlo (MCMC) sampling
• MCMC is a general framework for sampling from a large class of distributions, and it scales well with the dimensionality of the sample space.
Goal: generate samples from a distribution p(z).
Idea: build a machine which uses the current sample to decide which sample to produce next, in such a way that the overall distribution of the samples will be p(z).

Markov chain Monte Carlo (MCMC) sampling
Approach:
1. Generate a candidate sample z* from a proposal distribution q(z | z^{(τ)}) that depends on the current state z^{(τ)} and is sufficiently simple to draw samples from directly.
2. The current sample z^{(τ)} is known (i.e. maintain a record of the current state).
3. The samples z^{(1)}, z^{(2)}, z^{(3)}, ... form a Markov chain.
4. Accept or reject the candidate sample z* according to an appropriate criterion:

    z^{(τ+1)} = z*        if accepted
    z^{(τ+1)} = z^{(τ)}   if rejected

MCMC – Metropolis algorithm
Suppose p(z) can be evaluated for any given value of z up to some normalization constant Z_p:

    p(z) = (1/Z_p) p̃(z)

Algorithm:
• Step 1: Choose a symmetric proposal distribution, q(z_A | z_B) = q(z_B | z_A).
• Step 2: The candidate sample z* is accepted with probability

    A(z*, z^{(τ)}) = min(1, p̃(z*) / p̃(z^{(τ)}))

MCMC – Metropolis algorithm
Algorithm (cont.):
• Step 2.1: Choose a random number u with uniform distribution in (0,1).
• Step 2.2: Acceptance test: accept if u < p̃(z*) / p̃(z^{(τ)}).
• Step 3:

    z^{(τ+1)} = z*        if accepted
    z^{(τ+1)} = z^{(τ)}   if rejected

Metropolis algorithm
Notes:
• Rejection of a point leads to the previous sample being included again (different from rejection sampling).
• If q(z_A | z_B) > 0 for any values z_A, z_B, then the distribution of z^{(τ)} tends to p(z) as τ -> ∞.
• z^{(1)}, z^{(2)}, z^{(3)}, ... are not independent samples from p(z) – serial correlation. Instead, retain only every Mth sample.

Examples: Metropolis algorithm
Implementation in R:
• Elliptical (correlated Gaussian) target distribution; if u < p̃(z*) / p̃(z^{(τ)}), update the state (z^{(τ+1)} = z*), otherwise keep the old state (z^{(τ+1)} = z^{(τ)}).
• [Plots: initialization in [-2,2] with step sizes 0.3, 0.5 and 1, each shown for n = 1500 and n = 15000 samples.]
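The slides' own R implementation is not reproduced here; the following is a minimal sketch consistent with the setup described above (the correlation 0.9 of the elliptical target and the uniform box proposal are assumptions):

    rho <- 0.9                                   # assumed correlation of the target
    p_tilde_2d <- function(z)                    # unnormalized elliptical Gaussian
      exp(-(z[1]^2 - 2 * rho * z[1] * z[2] + z[2]^2) / (2 * (1 - rho^2)))
    metropolis <- function(n, step = 0.3, init = c(-2, 2)) {
      z <- matrix(NA_real_, n, 2)
      z[1, ] <- init
      for (t in 2:n) {
        z_star <- z[t - 1, ] + runif(2, -step, step)     # symmetric proposal
        if (runif(1) < p_tilde_2d(z_star) / p_tilde_2d(z[t - 1, ]))
          z[t, ] <- z_star                               # accept: move to z*
        else
          z[t, ] <- z[t - 1, ]                           # reject: keep old state
      }
      z
    }
    chain <- metropolis(15000, step = 0.3)               # settings as in the slides
    plot(chain, pch = ".", asp = 1)                      # should trace the ellipse

Re-running with step = 0.5 or step = 1 illustrates the proposal-width trade-off discussed below for Metropolis-Hastings.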
Validation of MCMC
Properties of Markov chains:
• A first-order Markov chain z^{(1)}, z^{(2)}, ..., z^{(m)}, ... satisfies

    p(z^{(m+1)} | z^{(1)}, ..., z^{(m)}) = p(z^{(m+1)} | z^{(m)}) = T_m(z^{(m)}, z^{(m+1)})

  with transition probabilities T_m(z', z).
• If T_m is the same for all m, the chain is homogeneous.
• A distribution p*(z) is invariant (stationary) if

    p*(z) = ∑_{z'} T(z', z) p*(z')

Validation of MCMC
• A sufficient condition is that the transition probabilities satisfy detailed balance,

    p*(z) T(z, z') = p*(z') T(z', z),

  in which case the chain is called reversible.

Validation of MCMC
Goal: an invariant Markov chain that converges to the desired distribution p*(z),

    p*(z) = lim_{m→∞} p(z^{(m)})   for any initial distribution p(z^{(0)})

This property is called ergodicity.
• An ergodic Markov chain has only one equilibrium distribution.

Properties and validation of MCMC
Approach: construct appropriate transition probabilities T(z', z) from a set of base transitions B_k:
– mixture form

    T(z', z) = ∑_{k=1}^{K} α_k B_k(z', z),   with mixing coefficients α_k,

– or successive application of the base transitions.

Metropolis-Hastings algorithm
• Generalization of the Metropolis algorithm:
  – no symmetric proposal distribution q(z) is required; with a symmetric proposal it reduces to the Metropolis algorithm.
  – The choice of proposal distribution is critical.

Metropolis-Hastings algorithm
• A common choice is a Gaussian centered on the current state:
  – small variance -> high acceptance, but a slow walk and dependent samples
  – large variance -> high rejection rate

Gibbs sampling
• A special case of the Metropolis-Hastings algorithm in which the candidate value is always accepted.
Suppose p(z) = p(z_1, z_2, z_3).
• Step 1: choose initial samples z_1^{(τ)}, z_2^{(τ)}, z_3^{(τ)}.
• Step 2 (repeated):
  – z_1^{(τ+1)} ~ p(z_1 | z_2^{(τ)}, z_3^{(τ)})
  – z_2^{(τ+1)} ~ p(z_2 | z_1^{(τ+1)}, z_3^{(τ)})
  – z_3^{(τ+1)} ~ p(z_3 | z_1^{(τ+1)}, z_2^{(τ+1)})
• The updates are repeated by cycling through the variables, or by randomly choosing the variable to be updated. (A code sketch follows at the end of this section.)

Gibbs sampling
• The marginal p(z_{\i}) is invariant (unchanged), because the remaining variables are fixed at each step.
• The univariate conditional distribution p(z_i | z_{\i}) is invariant by definition.
⇒ The joint distribution p(z) is invariant.

Gibbs sampling
⇒ Sufficient condition for ergodicity:
– none of the conditional distributions may be anywhere zero, i.e. any point in z-space can be reached from any other point in a finite number of steps.

Gibbs sampling
To obtain m independent samples:
1. Run the chain through a "burn-in" period and discard these samples, to remove the dependence on the initial values.
2. Then sample at set time points (e.g. keep every Mth sample).
• The Gibbs sequence converges to a stationary (equilibrium) distribution that is independent of the starting values.
• By construction, this stationary distribution is the target distribution we are trying to simulate.

Gibbs sampling
• Practicability depends on the feasibility of drawing samples from the conditional distributions p(z_i | z_{\i}).
• Directed graphs will lead to conditional distributions for Gibbs sampling that are log concave.
⇒ Adaptive rejection sampling methods can be used for the component-wise draws.
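As an illustration of the Gibbs updates and the burn-in/thinning procedure, a minimal R sketch (the bivariate Gaussian target with correlation rho and the burn-in/thinning settings are assumed examples, not from the slides):

    gibbs <- function(n, rho = 0.9, init = c(-2, 2)) {
      z <- matrix(NA_real_, n, 2)
      z[1, ] <- init
      for (t in 2:n) {
        z1 <- rnorm(1, rho * z[t - 1, 2], sqrt(1 - rho^2))  # z1 ~ p(z1 | z2)
        z2 <- rnorm(1, rho * z1,          sqrt(1 - rho^2))  # z2 ~ p(z2 | new z1)
        z[t, ] <- c(z1, z2)
      }
      z
    }
    draws   <- gibbs(15000)
    burned  <- draws[-(1:1000), ]                       # 1. discard burn-in samples
    thinned <- burned[seq(1, nrow(burned), by = 10), ]  # 2. keep every 10th sample

For a standard bivariate Gaussian, both full conditionals are univariate Gaussians, which is what makes each Gibbs update a direct draw with no rejection.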