MAD-Bayes: MAP-based Asymptotic Derivations from Bayes
Michael I. Jordan
INRIA and University of California, Berkeley
May 11, 2013
Acknowledgments: Brian Kulis, Tamara Broderick

Statistical Inference and Big Data
• Two major needs: models with open-ended complexity and scalable algorithms that allow those models to be fit to data
• In Bayesian inference the focus is on the models
  – burgeoning literature on Bayesian nonparametrics provides stochastic processes for representing flexible data structures
  – but the algorithmic choices are limited
• So Big Data research hasn’t made much use of Bayes, and is instead optimization-based
  – but the model choices tend to be limited

Bayesian Nonparametric Modeling
• Examples of stochastic processes used in Bayesian nonparametrics include distributions on:
  – directed trees of unbounded depth and unbounded fan-out
  – partitions
  – grammars
  – Markov processes with unbounded state spaces
  – infinite-dimensional matrices
  – functions (smooth and non-smooth)
  – copulae
  – distributions
• Power laws arise naturally in these distributions
• Hierarchical modeling uses these stochastic processes as building blocks

The Optimization Perspective
• Write down a loss function and a regularizer
• Find scalable algorithms that minimize the sum of these terms
• Prove something about these algorithms
• The connection to model-based inference is often in the analysis, but is sometimes in the design
  – e.g., Bayesian ideas are sometimes used to inspire the design of the regularizer
• Where does the loss function come from?
  – Gauss, Huber, Fisher, …
• It’s all very parametric, and the transition to nonparametrics is a separate step
This Talk
• Bayesian nonparametrics meets optimization
  – flexible, scalable modeling framework
  – gives rise to new loss functions and regularizers that are naturally nonparametric
  – no recourse to MCMC, SMC, etc.
• Inspiration: the venerable, scalable K-means algorithm can be derived as the limit of an EM algorithm for fitting a mixture model
• We do something similar in spirit, taking limits of various Bayesian nonparametric models:
  – Dirichlet process mixtures
  – hierarchical Dirichlet process mixtures
  – beta processes and hierarchical beta processes

K-means Clustering
• Represent the data set in terms of K clusters, each of which is summarized by a prototype μ_k
• Each data point is assigned to one of the K clusters
  – represented by allocation variables z_i such that for all data indices i we have z_i ∈ {1, …, K}
• Example: 4 data points and 3 clusters, with z_1, …, z_4 recording which of the three clusters each point belongs to

K-means Clustering
• Cost function: the sum-of-squared distances from each data point to its assigned prototype:
    J(z, μ) = Σ_i ‖x_i − μ_{z_i}‖²
• The K-means algorithm is coordinate descent on this cost function

Coordinate Descent
• Step 1: Fix values for the prototypes μ and minimize w.r.t. the allocations z
  – assign each data point to the nearest prototype
• Step 2: Fix values for the allocations z and minimize w.r.t. the prototypes μ
  – this gives μ_k = the mean of the data points currently assigned to cluster k
• Iterate these two steps
• Convergence guaranteed since there are a finite number of possible settings for the allocations
• It can only find local minima, so we should start the algorithm with many different initial settings
  (a short code sketch of these two steps appears below)

From Gaussian Mixtures to K-means
• A Gaussian mixture model:
    p(x) = Σ_{k=1}^K π_k N(x | μ_k, σ² I)
• Set the mixing proportions to uniform values, π_k = 1/K
• Write down the EM algorithm for fitting this model
• Take σ² to zero and recover the K-means algorithm
  – the E step of EM is Step 1 of K-means
  – the M step of EM is Step 2 of K-means

The K in K-means
• What if K is not known?
  – a challenging model selection problem
  – the algorithm itself is silent on the problem
• The Gaussian mixture model perspective brings the tools of Bayesian model selection to bear in principle, but not in the limit
• How about starting with Dirichlet process mixtures and Chinese restaurant processes instead of finite mixture models?
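As a concrete reference point for the asymptotic derivations that follow, here is a minimal NumPy sketch of the two K-means coordinate descent steps described above. The function name kmeans and the variable names (X, K, mu, z) are illustrative choices, not notation from the slides.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Coordinate descent on the cost sum_i ||x_i - mu_{z_i}||^2."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # initialize prototypes at K data points
    for _ in range(n_iter):
        # Step 1: fix the prototypes, assign each point to the nearest one.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # N x K squared distances
        z = dists.argmin(axis=1)
        # Step 2: fix the allocations, set each prototype to the mean of its points.
        for k in range(K):
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)
    return z, mu
```

Because there are only finitely many possible allocations and neither step can increase the cost, the loop converges; since it only finds local minima, one would normally rerun it from several random initializations and keep the lowest-cost solution.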
Chinese Restaurant Process (CRP)
• A random process in which customers sit down in a Chinese restaurant with an infinite number of tables
  – first customer sits at the first table
  – the n-th subsequent customer sits at a table drawn from the following distribution:
      P(previously occupied table k | F_{n−1}) ∝ m_k
      P(the next unoccupied table | F_{n−1}) ∝ α
  – where m_k is the number of customers currently at table k and where F_{n−1} denotes the state of the restaurant after n − 1 customers have been seated

The CRP and Clustering
• Data points are customers; tables are mixture components
  – the CRP defines a prior distribution on the partitioning of the data and on the number of tables
• This prior can be completed with:
  – a likelihood, e.g., associate a parameterized probability distribution with each table
  – a prior for the parameters: the first customer to sit at table k chooses the parameter vector φ_k for that table from a prior G_0
• We want to write out all of these probabilities and then take a scale parameter to zero

CRP Prior, Gaussian Likelihood, Conjugate Prior (graphical-model figure)

The CRP Prior
• Let π denote a partition of the integers 1 through N, let K denote its number of blocks, and let n_1, …, n_K denote the block sizes
• Then, under the CRP, we have:
    p(π) = α^K ∏_{k=1}^K (n_k − 1)! / [α (α + 1) ⋯ (α + N − 1)]
• This function (the EPPF) is a function only of the cardinalities of the partition; this implies exchangeability

The Joint Probability
• Encode the partition with allocation variables z_1, …, z_N
• The joint probability of the allocations and the data is the product of the EPPF and the usual mixture model likelihood, where K is now random
• I.e., we obtain a joint probability p(z, x)
• And we can find the MAP estimate of a clustering via:
    argmax_z p(z | x) = argmax_z p(z, x)

Small Variance Asymptotics
• Now let the likelihood be Gaussian and take the variance σ² to zero
• We do this analytically by picking a rate constant λ² and letting the concentration parameter scale as
    α ∝ exp(−λ² / (2σ²))
• And letting σ² go to zero we get:
    argmin_{z, μ} Σ_i ‖x_i − μ_{z_i}‖² + λ² K
• I.e., a penalized form of the K-means objective

Coordinate Descent: DP Means
• Reassign a point to the cluster corresponding to the closest mean, unless the closest cluster has squared Euclidean distance greater than λ². In this case, start a new cluster.
• Given the cluster assignments, update each mean using all observations in its cluster; in the small-variance limit the posterior over a cluster mean concentrates at the average of those observations
  (a code sketch of these two steps appears below)
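The following is a minimal sketch of that DP-means-style coordinate descent, assuming the penalized objective Σ_i ‖x_i − μ_{z_i}‖² + λ² K written above; the function name dp_means and the initialization with a single cluster are illustrative choices rather than details from the slides.

```python
import numpy as np

def dp_means(X, lam_sq, n_iter=100):
    """Coordinate descent on sum_i ||x_i - mu_{z_i}||^2 + lam_sq * K."""
    X = np.asarray(X, dtype=float)
    mu = X[:1].copy()                        # start with a single cluster at the first point
    z = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        for i, x in enumerate(X):
            d = ((mu - x) ** 2).sum(axis=1)  # squared distance to every current mean
            if d.min() > lam_sq:             # too far from all clusters: open a new one
                mu = np.vstack([mu, x])
                z[i] = len(mu) - 1
            else:
                z[i] = d.argmin()
        # Given the assignments, drop empty clusters and average the rest.
        labels = np.unique(z)
        mu = np.vstack([X[z == k].mean(axis=0) for k in labels])
        z = np.searchsorted(labels, z)       # relabel clusters as 0..K-1
    return z, mu
```

Each sweep either lowers the penalized cost or leaves it unchanged, and the number of clusters never exceeds the number of data points, so the procedure terminates at a local minimum; the single parameter λ² replaces the choice of K, trading off fit against the number of clusters.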
The CRP and Exchangeability
• The CRP is a distribution on partitions; it is an exchangeable distribution on partitions
• By De Finetti’s theorem, there must exist an underlying random measure such that the CRP is obtained by integrating out that random measure
• That random measure turns out to be the Dirichlet process (e.g., Blackwell & MacQueen, 1973)

The De Finetti Theorem
• An infinite sequence of random variables (X_1, X_2, …) is called infinitely exchangeable if the distribution of any finite subsequence is invariant to permutation
• Theorem: infinite exchangeability holds if and only if, for every N,
    p(x_1, …, x_N) = ∫ ∏_{i=1}^N G(x_i) P(dG)
  for some random measure G (with distribution P)

Random Measures and Their Marginals
• The De Finetti random measure is known for a number of interesting combinatorial stochastic processes:
  – Dirichlet process => Chinese restaurant process (Polya urn)
  – Beta process => Indian buffet process
  – Hierarchical Dirichlet process => Chinese restaurant franchise
  – HDP-HMM => infinite HMM
  – Nested Dirichlet process => nested Chinese restaurant process

Completely Random Measures (Kingman, 1967)
• Completely random measures are measures on a set Ω that assign independent mass to nonintersecting subsets of Ω
  – e.g., Poisson processes, gamma processes, beta processes, compound Poisson processes and limits thereof
• The Dirichlet process is not a completely random measure
  – but it’s a normalized gamma process
• Completely random processes are discrete wp1 (up to a possible deterministic continuous component)
• Completely random measures are random measures, not necessarily random probability measures

Completely Random Measures (Kingman, 1967)
• Construction: consider a Poisson random measure on Ω × ℝ⁺ with rate function specified as a product measure
• Sample from this Poisson process and connect the samples vertically to their coordinates in Ω (figure)

Gamma Process
• The gamma process is a CRM for which the rate function is given as follows (on Ω × ℝ⁺):
    ν(dω, dp) = c p^{−1} e^{−c p} dp G_0(dω)
• Draw a sample {(ω_k, p_k)} from a Poisson random measure with this rate measure
• And the resulting random measure can be written simply as:
    G = Σ_k p_k δ_{ω_k}

Dirichlet Process
• The Dirichlet process is a normalized gamma process, i.e., the gamma process divided by its total mass (figure)

Dirichlet Process Marginals
• Consider the following hierarchy:
    G ~ DP(α, G_0)
    θ_i | G ~ G, independently for i = 1, …, N
• The variables θ_1, …, θ_N are clearly exchangeable
• The partition structure that they induce is exactly the Chinese restaurant process (illustrated in the sketch below)
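To make that marginalization concrete, here is an illustrative NumPy sketch (not taken from the slides) that draws G via a truncated stick-breaking construction, samples θ_1, …, θ_N from it, and reports the sizes of the induced clusters; averaged over many such draws, these partitions follow CRP statistics.

```python
import numpy as np

def dp_stick_breaking(alpha, n_atoms=2000, rng=None):
    """Truncated stick-breaking draw of G ~ DP(alpha, G0): weights pi_k and atoms phi_k."""
    rng = np.random.default_rng() if rng is None else rng
    betas = rng.beta(1.0, alpha, size=n_atoms)
    weights = betas * np.cumprod(np.concatenate(([1.0], 1.0 - betas[:-1])))
    atoms = rng.normal(size=n_atoms)          # here G0 is taken to be a standard normal
    return weights / weights.sum(), atoms

def induced_partition_sizes(alpha, N=50, rng=None):
    """Draw theta_1..theta_N i.i.d. from a single draw of G and report cluster sizes."""
    rng = np.random.default_rng() if rng is None else rng
    weights, atoms = dp_stick_breaking(alpha, rng=rng)
    picks = rng.choice(len(weights), size=N, p=weights)   # which atom each theta_i equals
    _, counts = np.unique(picks, return_counts=True)
    return sorted(counts.tolist(), reverse=True)

print(induced_partition_sizes(alpha=1.0))
```

Increasing the truncation level makes the approximation to the infinite measure tighter; for instance, the average number of distinct clusters grows roughly like α log N, as it does under the CRP.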
Dirichlet Process Mixture Models (graphical-model figure)

Multiple Estimation Problems
• We often face multiple, related estimation problems
• E.g., multiple Gaussian means:
    X_{ij} ~ N(θ_i, σ²), j = 1, …, n_i, for groups i = 1, …, m
• Maximum likelihood: θ̂_i = x̄_i, the sample mean within group i
• Maximum likelihood often doesn’t work very well
  – want to “share statistical strength”

Hierarchical Bayesian Approach
• The Bayesian or empirical Bayesian solution is to view the parameters θ_i as random variables, related via an underlying variable (a shared hyperparameter)
• Given this overall model, posterior inference yields shrinkage: the posterior mean for each θ_i combines data from all of the groups

Hierarchical Modeling
• The plate notation for the hierarchical model, and the equivalent unrolled graphical model (figures)

Hierarchical Dirichlet Process Mixtures (graphical-model figure)

Application: Protein Modeling
• A protein is a folded chain of amino acids
• The backbone of the chain has two degrees of freedom per amino acid (phi and psi angles)
• Empirical plots of phi and psi angles are called Ramachandran diagrams
• We want to model the density in the Ramachandran diagram to provide an energy term for protein folding algorithms
• We actually have a linked set of Ramachandran diagrams, one for each amino acid neighborhood
• We thus have a linked set of density estimation problems

Protein Folding (cont.)
• We have a linked set of Ramachandran diagrams, one for each amino acid neighborhood (figure)

Chinese Restaurant Franchise (CRF)
• Figure: a global menu of dishes shared across Restaurant 1, Restaurant 2, and Restaurant 3; each restaurant seats its own customers at its own tables, and each table serves a dish from the global menu

Small Variance Asymptotics
• Define group-specific allocation variables z_{ji} for groups j = 1, …, J
• As before, let the variance in the likelihood go to zero; the resulting MAP problem is asymptotically a K-means-like objective with separate penalty terms for the number of clusters used within each group and the number of clusters shared globally
• Leads to a simple coordinate descent algorithm with two levels of clustering decisions

Dirichlet Processes vs. Beta Processes
• The Dirichlet process yields a classification of each data point into a single class (a “cluster”)
• The beta process allows each data point to belong to multiple classes (a “feature vector”)
• Essentially, the Dirichlet process yields a single coin with an infinite number of sides
• Essentially, the beta process yields an infinite collection of coins with mostly small probabilities, and the Bernoulli process tosses those coins to yield a binary feature vector

Beta Processes
• For the beta process, the rate function is given as follows (on the space Ω × [0, 1]):
    ν(dω, dp) = c p^{−1} (1 − p)^{c−1} dp B_0(dω)
  – here c p^{−1} (1 − p)^{c−1} dp is a degenerate Beta(0, c) distribution and B_0 is the base measure
• And the resulting random measure can be written simply as:
    B = Σ_k p_k δ_{ω_k}

Beta Process and Bernoulli Process (figure)

Indian Buffet Process (IBP) (Griffiths & Ghahramani, 2006)
• Indian restaurant with infinitely many dishes in a buffet line
• Customers 1 through N enter the restaurant
  – the first customer samples Poisson(α) dishes
  – the n-th customer samples a previously sampled dish k with probability m_k / n (where m_k is the number of previous customers who sampled dish k), then samples Poisson(α / n) new dishes

Beta Process Marginals (Thibaux & Jordan, 2007)
• Theorem: The beta process is the De Finetti mixing measure underlying the Indian buffet process (IBP)

Hierarchical Beta Processes
• A hierarchical beta process is a beta process whose base measure is itself random and drawn from a beta process

Towards the BP-Means Algorithm
• Let Z denote the Indian buffet matrix (a binary matrix whose rows are the data points’ feature vectors)
• We again need to compute the joint probability p(Z, X) = p(Z) p(X | Z)
• And perform small-variance asymptotics on this joint probability
• To obtain a cost function to which we can apply coordinate descent

The Exchangeable Feature Probability Function (EFPF)
• How do we compute p(Z)?
• Broderick, Jordan and Pitman (2013) develop a general approach to computing such exchangeable feature probability functions (EFPFs)
• Applied to the Indian buffet process, it yields a closed-form expression for p(Z)

The Likelihood
• What about p(X | Z)?
• Many possible choices; a popular one is the linear-Gaussian model
    X = Z A + E,
  where A is a matrix of feature means (with a Gaussian prior) and E is Gaussian noise with variance σ²

Small-Variance Asymptotics
• Now take σ² to zero; this yields the penalized objective
    min_{Z, A} ‖X − Z A‖_F² + λ² K,
  where K is the number of features in use
• This yields a coordinate descent algorithm that we refer to as “BP-Means” (a rough sketch appears below)
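Below is a rough, illustrative sketch of a BP-means-style coordinate descent under the limiting objective ‖X − Z A‖_F² + λ² K given above. The function name bp_means, the initialization, and the new-feature proposal rule are assumptions made for the sketch, not details taken from the slides.

```python
import numpy as np

def bp_means(X, lam_sq, n_iter=50):
    """Coordinate descent on ||X - Z A||_F^2 + lam_sq * K (a BP-means-style objective)."""
    X = np.asarray(X, dtype=float)
    N, D = X.shape
    Z = np.ones((N, 1))                              # one initial feature, on for every point
    A = np.linalg.lstsq(Z, X, rcond=None)[0]         # K x D matrix of feature means

    for _ in range(n_iter):
        for n in range(N):
            # Toggle each existing feature for point n if that lowers its residual.
            for k in range(A.shape[0]):
                for val in (0.0, 1.0):
                    trial = Z[n].copy()
                    trial[k] = val
                    if ((X[n] - trial @ A) ** 2).sum() < ((X[n] - Z[n] @ A) ** 2).sum():
                        Z[n, k] = val
            # Propose a brand-new feature used only by point n, paying lam_sq for it.
            resid = X[n] - Z[n] @ A
            if (resid ** 2).sum() > lam_sq:
                A = np.vstack([A, resid])
                Z = np.hstack([Z, np.zeros((N, 1))])
                Z[n, -1] = 1.0
        # Drop features no point uses, then refit the feature means by least squares.
        used = Z.sum(axis=0) > 0
        if used.any():
            Z, A = Z[:, used], A[used]
        A = np.linalg.lstsq(Z, X, rcond=None)[0]
    return Z, A
```

Toggling an entry of Z only changes the residual of its own row, so the greedy updates, the new-feature proposals (accepted only when a point’s residual exceeds λ²), and the least-squares refit of A all leave the objective no larger than before; the sweep settles on a finite set of binary features whose number is controlled by λ².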