MCMC Methods in Harmonic Models Simon Godsill Signal Processing Laboratory Cambridge University Engineering Department sjg@eng.cam.ac.uk www-sigproc.eng.cam.ac.uk/~sjg Overview MCMC Methods Metropolis-Hastings and Gibbs Samplers Design Considerations Case Study: Gabor Regression Models MCMC Methods MCMC methods are sophisticated and general methods for simulation from a complex probability distribution, say p(x) – x may be high dimensional, p() highly non-Gaussian, multimodal: Given a set of samples from p(x) we can compute Monte Carlo expectations for any quantities of interest by ergodic averages: MCMC Contd. In a Bayesian setting p(x) will typically be the posterior distribution: • Underlying concept is to construct an irreducible, aperiodic Markov chain having p(x) as its stationary distribution and transition kernel K(dx’;x) • Initialise chain at arbitrary state x(0) (say, random) and simulate repeatedly from K(dx’;x) until convergence achieved • Convergence in distribution is guaranteed under mild conditions, easily verified for most models MCMC, contd. Rates of convergence are hard to compute – lots of theory, but not typically applicable in practice. However, many models, e.g. many harmonic modelling cases, can be proven to have geometric convergence rates. MCMC Algorithms • MCMC schemes are constructed to satisfy the detailed balance condition •The most basic scheme satisfying detailed balance is the MetropolisHastings (M-H) method • At each iteration of M-H, propose to move from the current state x with a proposal density q(x’|x). This proposal is accepted randomly with probability • Otherwise remain at x and go on to next iteration Componentwise M-H In most cases this won’t be feasible as x is high dimensional -> low acceptance rates, poor convergence Instead, split x into components: Then perform M-H on each component k=1,…,N: Propose Accept with probability Gibbs Sampler Possibly the simplest form of MCMC – choose (the `full conditional’ distribution of xk) Acceptance probability is 1 – i.e. all moves accepted. Other types of MCMC Reversible Jump MCMC – extension of M-H to cases where x can have varying dimension (e.g. in sparsity estimation) – see Green (1995) – Biometrika Perfect simulation – special MCMC schemes that achieve exact samples from p(x) – highly desirable, but slow and not yet practical for many cases Design Issues and Recommendations A basic understanding of MCMC is relatively easy, but it is not so easy to construct effective and efficient samplers Some of the main considerations are: How to partition x into components (need not be same size, and usually aren’t) What algorithms to use – M-H, Gibbs, something else? In general Gibbs should only be used if the full conditionals are straightforward to sample from, e.g. Gaussian, gamma, etc., otherwise use M-H. (Blocking) – it’s nearly always best to group large numbers of components of x into single partitions xk, provided efficient M-H or Gibbs steps can be constructed for the partitions (Rao-Blackwellisation) – a related issue is marginalisation – it is better (in terms of estimator variance) to integrate out parameters analytically – again, subject to being able to construct efficient samplers on the remaining space: References for MCMC MCMC in Practice – Gilks et al – Chapman and Hall (1996) Monte Carlo Statistical Methods – Robert and Casella – Springer (1999) MCMC Case study – Gabor Regression models Now consider design of a sampler for harmonic models. Full details forthcoming as Wolfe, Godsill and Ng (2004) - Bayesian variable selection and regularisation for time-frequency surface estimation – Journal of Royal Statistical Society (Series B – methodological) (See also Wolfe and Godsill (NIPS 2002)) See http://www.eecs.harvard.edu/~patrick/research/ Gabor Regression Models Consider models of the form G is a matrix of Gabor atoms – here we chose an overcomplete dictionary with 2* redundancy We will seek sparse representations with time-frequency structure – encoded through prior distributions on ck’s For the moment consider case of fixed, known se and sck Gabor regression models Likelihood function is Posterior probability density for c is… Posterior for c: [Conditioning on se and sc implicit] So, in fact no MC is required for this case, since we have the full mean and Covariance matrix for c Gibbs Sampler – blocking structures However, for large Gabor models, the matrix inversion will be very slow, and here we could look at reduceddimension blocking structures Then Gibbs sampler would proceed as follows, for k=1,…,K: It’s instructive to look at the form of this conditional pdf: Full conditional for ck [Gk contains columns of G corresponding to partition k, and G-k the remaining columns.] This term is the residual error when ck=0 Note relationship to Basis Pursuit residual terms This form of Gibbs sampler can be very cheap computationally The interest in this work is to extend the modelling capabilities provided by other algorithms – giving new forms of sparsity and structure. The extra steps are added in modular fashion, retaining the conditionally Gaussian structure of the coefficients and the efficient implementation Sampling se First, we allow estimation of the noise floor by sampling se, assuming an inverted-gamma (IG) prior p(se2): Under this prior (conjugate) the full conditional takes the same form, which is easily sampled by standard methods (e.g. MATLAB) : Sampling coefficient parameters Next, place a structured prior distribution on the Gabor coefficients. First make them heavy-tailed to match real audio signals. This is done using Scale Mixtures of Normals (see Godsill and Rayner (IEEE Tr. Sp. And Audio –1998) for an audio restoration example). Simply assign a prior to the variance of each ck: Implies a non-Gaussian heavy-tailed distribution for ck Priors for sck Choice of p(sck2) determines the implied heavy-tailed distribution p(ck) In simplest case adopt the IG prior as this is conjugate. Then implied p(ck) is Student’s t – distributed: IG prior has Jeffreys and exponential limiting cases, so the family can encompass many of the sparseness-inducing cases. Again, the IG prior is conjugate and Leads to a simple Gibbs sampler step: Direct Sparsity Modelling Other choices of p(sck2) lead to other heavy-tailed distributions, e.g. it is possible to get a-stable or Generalised Gaussian coefficients with other choices. In these cases M-H would be used to do the sampling, see e.g. Godsill and Kuruoglu (1999 – CUED Tech. Rep.). A further addition that is easily encorporated into the MCMC is direct estimation of sparsity. This is an important addition to the models and does not compromise the guaranteed convergence properties of the methods. We can achieve this by allowing finite probability mass at zero in p(sck2): Direct Sparsity Modelling Prior with point mass at zero: Where gk 2{0,1} is a binary indicator variable specifying whether coefficient ck is active or inactive. Structure is introduced at this point, through priors on the timefrequency indicator field {gk} We use Markov chain or Markov random field priors to encourage continuity across time (tones), frequency (transients), or both: The indicator field is also sampled using Gibbs sampling – details not given here – no time left… Final Details We also sample the parameters of requiring one Gibbs and one M-H step. Interpreting the MCMC output Assume that the MCMC has converged and initial `burn-in’ deleted: Coefficient estimation: Noise reduction: Estimating the sparsity coefficients: How many coefficients are active? Results Results, contd. Typical output from the program Final iteration Noisy data MMSE Estimate Convergence of parameters See http://www.eecs.harvard.edu/~patrick/research/ for examples and Matlab code Conclusion Why use MCMC methods in harmonic models? Extend the range of models computable Guaranteed convergence (in the limit) Computations can be quite cheap Code would contain same building blocks as EM, IRLS or basis pursuit for similar models – easy to modify to MCMC for baseline comparison It’s really not as complicated or slow as people think!! Why not use MCMC methods? Can be computationally expensive Convergence diagnostics unreliable You may not want to explore new models References C.P. Robert and G. Casella, Monte Carlo StatisticalMethods, New York: Springer Verlag, 1999 W. R. Gilks and S. Richardson and D. J. Spiegelhalter, Markov chain Monte Carlo in practice, London: Chapman and Hall, 1996 P. J. Green, Reversible Jump Markov-chain Monte Carlo computation and Bayesian model determination, Biometrika, 82(4), pp. 711-732, 1995 Harmonic models and MCMC – SJG references – see www-sigproc.eng.cam.ac.uk/~sjg P. J. Wolfe, S. J. Godsill, and W.J. Ng. Bayesian variable selection and regularisation for time-frequency surface estimation Journal of the Royal Statistical Society, Series B, 2004. Read paper (with discussion). To Appear. M.Davy and S. J. Godsill. Bayesian harmonic models for musical signal analysis (with discussion). In J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith, editors, Bayesian Statistics VII. Oxford University Press, 2003. P. J. Wolfe and S. J. Godsill. Bayesian modelling of time-frequency coefficients for audio signal enhancement. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, Cambridge, MA. MIT Press, 2002. S. J. Godsill and P. J. W. Rayner. Digital Audio Restoration: A Statistical ModelBased Approach. Berlin: Springer, ISBN 3 540 76222 1, September 1998. S. J. Godsill and P. J. W. Rayner. Robust reconstruction and analysis of autoregressive signals in impulsive noise using the Gibbs sampler. IEEE Trans. on Speech and Audio Processing, 6(4):352-372, July 1998. S. J. Godsill and E. E. Kuruoglu. Bayesian inference for time series with heavytailed symmetric alpha -stable noise processes. In Proc. Applications of heavy tailed distributions in economics, engineering and statistics, June 1999. Washington DC, USA. CUED Tech. Rep.