MAD-Bayes - Big data: theoretical and practical challenges

MAD-Bayes: MAP-based Asymptotic
Derivations from Bayes
Michael I. Jordan
INRIA
University of California, Berkeley
May 11, 2013
Acknowledgments: Brian Kulis, Tamara Broderick
Statistical Inference and Big Data
• Two major needs: models with open-ended
complexity and scalable algorithms that allow
those models to be fit to data
• In Bayesian inference the focus is on the models
– burgeoning literature on Bayesian nonparametrics
provides stochastic processes for representing flexible
data structures
– but the algorithmic choices are limited
• So Big Data research hasn’t made much use of
Bayes, and is instead optimization-based
– but the model choices tend to be limited
Bayesian Nonparametric Modeling
• Examples of stochastic processes used in Bayesian
nonparametrics include distributions on:
– directed trees of unbounded depth and unbounded fan-out
– partitions
– grammars
– Markov processes with unbounded state spaces
– infinite-dimensional matrices
– functions (smooth and non-smooth)
– copulae
– distributions
• Power laws arise naturally in these distributions
• Hierarchical modeling uses these stochastic processes as
building blocks
The Optimization Perspective
• Write down a loss function and a regularizer
• Find scalable algorithms that minimize the sum of
these terms
• Prove something about these algorithms
• The connection to model-based inference is often in
the analysis, but is sometimes in the design
– e.g., Bayesian ideas are sometimes used to inspire the
design of the regularizer
• Where does the loss function come from?
– Gauss, Huber, Fisher, …
• It’s all very parametric, and the transition to
nonparametrics is a separate step
This Talk
• Bayesian nonparametrics meets optimization
– flexible, scalable modeling framework
– gives rise to new loss functions and regularizers that are
naturally nonparametric
– no recourse to MCMC, SMC, etc
• Inspiration: the venerable, scalable K-means
algorithm can be derived as the limit of an EM
algorithm for fitting a mixture model
• We do something similar in spirit, taking limits of
various Bayesian nonparametric models:
– Dirichlet process mixtures
– hierarchical Dirichlet process mixtures
– beta processes and hierarchical beta processes
K-means Clustering
• Represent the data set in terms of K clusters, each
of which is summarized by a prototype
• Each data point is assigned to one of the K clusters
– represented by allocation variables z_ik ∈ {0, 1} such that,
for all data indices i, we have Σ_k z_ik = 1
• Example: 4 data points and 3 clusters give a 4 × 3 binary
allocation matrix with exactly one 1 in each row
K-means Clustering
• Cost function: the sum of squared distances
from each data point to its assigned prototype:
J = Σ_i Σ_k z_ik ||x_i − μ_k||²
• The K-means algorithm is coordinate descent on
this cost function
Coordinate Descent
• Step 1: Fix the prototypes {μ_k} and minimize w.r.t. the
allocations {z_ik}
– assign each data point to the nearest prototype
• Step 2: Fix the allocations {z_ik} and minimize w.r.t. the
prototypes {μ_k}
– this gives μ_k = Σ_i z_ik x_i / Σ_i z_ik, the mean of the data
points assigned to cluster k
• Iterate these two steps
• Convergence guaranteed since there are a finite
number of possible settings for the allocations
• It can only find local minima, so we should start the
algorithm with many different initial settings
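As a concrete illustration of this coordinate descent, here is a minimal NumPy sketch (not from the talk; the random initialization at K data points and the variable names are illustrative choices):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: alternate nearest-prototype assignment
    and prototype (mean) updates on data X of shape (N, D)."""
    rng = np.random.default_rng(seed)
    # Initialize prototypes at K randomly chosen data points.
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest prototype.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # (N, K)
        z = d2.argmin(axis=1)
        # Step 2: move each prototype to the mean of its assigned points.
        for k in range(K):
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)
    return z, mu
```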
From Gaussian Mixtures to K-means
• A Gaussian mixture model:
p(x) = Σ_k π_k N(x | μ_k, σ² I)
• Set the mixing proportions π_k to the uniform values 1/K
• Write down the EM algorithm for fitting this model
• Take the variance σ² to zero and recover the K-means
algorithm
– the E step of EM is Step 1 of K-means
– the M step of EM is Step 2 of K-means
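A quick numerical sketch (illustrative, not from the talk) of why the E step hardens into Step 1 of K-means: with uniform mixing proportions and a shared variance σ², the responsibilities are a softmax of −||x − μ_k||² / (2σ²), and they concentrate on the nearest prototype as σ² → 0.

```python
import numpy as np

def responsibilities(x, mus, sigma2):
    """E-step responsibilities for a uniform-weight, shared-variance
    Gaussian mixture: softmax of -||x - mu_k||^2 / (2 sigma^2)."""
    d2 = ((mus - x) ** 2).sum(axis=1)
    logits = -d2 / (2.0 * sigma2)
    logits -= logits.max()          # for numerical stability
    r = np.exp(logits)
    return r / r.sum()

x = np.array([0.9, 0.0])
mus = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
for sigma2 in [1.0, 0.1, 0.01]:
    print(sigma2, responsibilities(x, mus, sigma2).round(3))
# As sigma2 shrinks, the responsibility vector approaches the
# one-hot assignment to the nearest mean (here, the second one).
```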
The K in K-means
• What if K is not known?
– a challenging model selection problem
– the algorithm itself is silent on the problem
• The Gaussian mixture model perspective brings the
tools of Bayesian model selection to bear in
principle, but not in the σ² → 0 limit
• How about starting with Dirichlet process mixtures
and Chinese restaurant processes instead of finite
mixture models?
Chinese Restaurant Process (CRP)
• A random process in which customers sit
down in a Chinese restaurant with an infinite
number of tables
– the first customer sits at the first table
– the (n+1)st customer sits at a table drawn from
the following distribution:
P(sit at occupied table k | F_n) ∝ m_k
P(sit at a new table | F_n) ∝ α
– where m_k is the number of customers currently at table k
and where F_n denotes the state of the restaurant after n
customers have been seated
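A minimal sampler for this seating process (illustrative sketch; the concentration parameter is written as alpha):

```python
import numpy as np

def sample_crp(n_customers, alpha, seed=0):
    """Sample table assignments from a Chinese restaurant process.
    Customer n+1 joins occupied table k with prob m_k / (alpha + n)
    and starts a new table with prob alpha / (alpha + n)."""
    rng = np.random.default_rng(seed)
    counts = []                 # m_k: number of customers at each table
    assignments = []
    for n in range(n_customers):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= alpha + n
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)    # open a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments

print(sample_crp(10, alpha=1.0))
```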
The CRP and Clustering
• Data points are customers; tables are mixture
components
– the CRP defines a prior distribution on the partitioning of the
data and on the number of tables
• This prior can be completed with:
– a likelihood---e.g., associate a parameterized probability
distribution with each table
– a prior for the parameters---the first customer to sit at table k
chooses the parameter vector, φ_k, for that table from a prior G_0
• We want to write out all of these probabilities and then
take a scale parameter to zero
CRP Prior, Gaussian Likelihood,
Conjugate Prior
The CRP Prior
• Let (S_1, ..., S_K) denote a partition of the integers 1
through N, and let n_k = |S_k| denote the block sizes
• Then, under the CRP, we have:
P(S_1, ..., S_K) = α^K (Γ(α) / Γ(α + N)) Π_k (n_k − 1)!
• This function (the EPPF) is a function only of the
cardinalities of the partition; this implies
exchangeability
The Joint Probability
• Encode the partition with allocation variables z = (z_1, ..., z_N)
• The joint probability of the allocations and the
data is the product of the EPPF and the usual
mixture model likelihood, where the number of clusters K
is now random
• I.e., we obtain a joint probability p(x, z, μ)
• And we can find the MAP estimate of a clustering via:
argmax over z and μ of p(x, z, μ)
Small Variance Asymptotics
• Now let the likelihood p(x_i | μ_{z_i}) be Gaussian and
take the variance σ² to zero
• We do this analytically by picking a rate constant λ
and reparameterizing the concentration parameter:
α = exp(−λ / (2σ²))
• And letting σ² go to zero we get:
min over z, μ of Σ_k Σ_{i ∈ S_k} ||x_i − μ_k||² + λK
• I.e., a penalized form of the K-means objective
Coordinate Descent: DP Means
• Reassign a point to the cluster corresponding to
the closest mean, unless the closest cluster has
squared Euclidean distance greater than λ. In
that case, start a new cluster.
• Given the cluster assignments, perform Gibbs
moves on all the means, which amounts to
sampling from the posterior based on the prior and all
observations in a cluster; in the small-variance limit each
mean is simply placed at the average of its assigned points.
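A minimal sketch of the resulting DP-means coordinate descent (illustrative; the mean update is written in its deterministic, small-variance form, and the stopping rule is simply a fixed number of sweeps):

```python
import numpy as np

def dp_means(X, lam, n_iters=50):
    """DP-means: K-means-style coordinate descent on the penalized
    objective sum_k sum_{i in k} ||x_i - mu_k||^2 + lam * K."""
    mus = [X[0].copy()]                     # start with one cluster
    z = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment step: nearest mean, unless it is farther than lam.
        for i, x in enumerate(X):
            d2 = np.array([((x - mu) ** 2).sum() for mu in mus])
            if d2.min() > lam:
                mus.append(x.copy())        # open a new cluster at x
                z[i] = len(mus) - 1
            else:
                z[i] = d2.argmin()
        # Mean step: each mean moves to the average of its points.
        for k in range(len(mus)):
            if np.any(z == k):
                mus[k] = X[z == k].mean(axis=0)
    return z, np.array(mus)
```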
The CRP and Exchangeability
• The CRP is a distribution on partitions; it is an
exchangeable distribution on partitions
• By De Finetti’s theorem, there must exist an
underlying random measure such that the CRP is
obtained by integrating out that random measure
• That random measure turns out to be the
Dirichlet process (e.g., Blackwell & MacQueen,
1973)
The De Finetti Theorem
• An infinite sequence of random variables (X_1, X_2, ...)
is called infinitely exchangeable if
the distribution of any finite subsequence is
invariant to permutation
• Theorem: infinite exchangeability holds if and only if
p(x_1, ..., x_N) = ∫ Π_{i=1}^N G(x_i) P(dG) for all N,
for some random measure G
Random Measures and Their Marginals
• The De Finetti random measure is known for a
number of interesting combinatorial stochastic
processes:
– Dirichlet process => Chinese restaurant process (Polya
urn)
– Beta process => Indian buffet process
– Hierarchical Dirichlet process => Chinese restaurant
franchise
– HDP-HMM => infinite HMM
– Nested Dirichlet process => nested Chinese restaurant
process
Completely Random Measures
(Kingman, 1967)
• Completely random measures are measures on a set Ω
that assign independent mass to nonintersecting subsets
of Ω
– e.g., Poisson processes, gamma processes, beta processes,
compound Poisson processes and limits thereof
• The Dirichlet process is not a completely random measure
– but it's a normalized gamma process
• Completely random measures are discrete w.p. 1 (up to a
possible deterministic continuous component)
• Completely random measures are random measures, not
necessarily random probability measures
Completely Random Measures
(Kingman, 1967)
[Figure: a draw from a completely random measure, depicted as a
collection of weighted atoms]
• Assigns independent mass to nonintersecting subsets of Ω
Completely Random Measures
(Kingman, 1967)
• Consider a Poisson random measure on Ω × R₊ with rate
function specified as a product measure
• Sample from this Poisson process and connect the samples
vertically to their coordinates in Ω
[Figure: sampled points in the product space, each connected by a
vertical stick to its location in Ω]
Gamma Process
• The gamma process is a CRM for which the rate
function is given as follows (on Ω × (0, ∞)):
ν(dω, dπ) = c π⁻¹ e^{−cπ} dπ B_0(dω)
• Draw a sample {(ω_i, π_i)} from a Poisson
random measure with this rate measure
• And the resulting random measure can be written
simply as:
G = Σ_i π_i δ_{ω_i}
Dirichlet Process
• The Dirichlet process is a normalized gamma
process
[Figure: the atoms of a gamma process draw, whose weights are then
normalized to sum to one]
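A small finite-dimensional illustration (a sketch with assumed notation, not from the talk): on any finite partition (A_1, ..., A_J) of the space, the gamma process assigns independent masses G(A_j) ~ Gamma(shape = c·B_0(A_j), rate = c); normalizing them yields a Dirichlet vector, which is exactly the finite-dimensional distribution of the Dirichlet process.

```python
import numpy as np

# Independent gamma masses on a 3-cell partition, then normalize:
# the normalized vector is a draw from Dirichlet(c*B0(A_1), ..., c*B0(A_3)).
rng = np.random.default_rng(0)
c = 2.0
B0_masses = np.array([0.5, 0.3, 0.2])          # B0(A_j) for the partition
G = rng.gamma(shape=c * B0_masses, scale=1.0 / c)
print("gamma masses:", G.round(3))
print("normalized (a Dirichlet draw):", (G / G.sum()).round(3))
```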
Dirichlet Process Marginals
• Consider the following hierarchy:
G ~ DP(α, G_0),   θ_i | G ~ G i.i.d., for i = 1, ..., N
• The variables θ_1, ..., θ_N are clearly
exchangeable
• The partition structure that they induce is exactly
the Chinese restaurant process
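As a quick illustration (not from the talk), one can draw an approximate G via the standard stick-breaking construction of the DP, sample the θ_i from it, and observe the induced clustering; the truncation level and the standard-normal base measure are assumptions of the sketch.

```python
import numpy as np

def sample_dp_stick_breaking(alpha, n_atoms=1000, seed=0):
    """Approximate draw G = sum_k beta_k delta_{phi_k} from a DP via
    stick-breaking: beta_k = v_k * prod_{j<k} (1 - v_j), v_k ~ Beta(1, alpha).
    Atom locations phi_k are drawn from a standard-normal base measure."""
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=n_atoms)
    betas = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    phis = rng.normal(size=n_atoms)          # base measure G_0 = N(0, 1)
    return betas, phis

betas, phis = sample_dp_stick_breaking(alpha=1.0)
rng = np.random.default_rng(1)
# Draw theta_1, ..., theta_N i.i.d. from G: repeated atoms = shared clusters.
idx = rng.choice(len(betas), size=20, p=betas / betas.sum())
print("number of distinct clusters among 20 draws:", len(set(idx)))
```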
Dirichlet Process Mixture Models
Multiple Estimation Problems
• We often face multiple, related estimation problems
• E.g., multiple Gaussian means:
x_{ij} ~ N(θ_i, σ²),   j = 1, ..., n_i,   i = 1, ..., m
• Maximum likelihood: θ̂_i = x̄_i, the per-group sample mean
• Maximum likelihood often doesn't work very well
– want to “share statistical strength”
Hierarchical Bayesian Approach
• The Bayesian or empirical Bayesian solution is to view the
parameters θ_i as random variables, related via an
underlying shared variable (a hyperparameter)
• Given this overall model, posterior inference yields
shrinkage---the posterior mean for each θ_i combines data
from all of the groups
Hierarchical Modeling
• The plate notation:
• Equivalent to:
Hierarchical Dirichlet Process Mixtures
Application: Protein Modeling
• A protein is a folded chain of amino acids
• The backbone of the chain has two degrees of
freedom per amino acid (phi and psi angles)
• Empirical plots of phi and psi angles are called
Ramachandran diagrams
Application: Protein Modeling
• We want to model the density in the
Ramachandran diagram to provide an energy
term for protein folding algorithms
• We actually have a linked set of Ramachandran
diagrams, one for each amino acid neighborhood
• We thus have a linked set of density estimation
problems
Protein Folding (cont.)
• We have a linked set of Ramachandran
diagrams, one for each amino acid neighborhood
Protein Folding (cont.)
Chinese Restaurant Franchise (CRF)
[Figure: the Chinese restaurant franchise---a global menu of dishes
shared across Restaurants 1, 2, and 3, each with its own tables]
Small Variance Asymptotics
• Define group-specific allocation variables z_{ji} for
groups j = 1, ..., J
• As before, let the variance in the likelihood go to
zero; the resulting posterior asymptotically yields a K-means-style
objective with one penalty per group-level (local) cluster and
another per global cluster
• Leads to a simple coordinate descent algorithm with
two levels of clustering decisions
Dirichlet Processes vs. Beta Processes
• The Dirichlet process yields a classification of each data
point into a single class (a “cluster”)
• The beta process allows each data point to belong to
multiple classes (a “feature vector”)
• Essentially, the Dirichlet process yields a single coin with
an infinite number of sides
• Essentially, the beta process yields an infinite collection
of coins with mostly small probabilities, and the Bernoulli
process tosses those coins to yield a binary feature
vector
Beta Processes
• For the beta process, the rate function is given as
follows (on the space Ω × [0, 1]):
ν(dω, dp) = c p⁻¹ (1 − p)^{c−1} dp B_0(dω)
– the p-component is a degenerate Beta(0, c) distribution
– B_0 is the base measure
• And the resulting random measure can be written
simply as:
B = Σ_i p_i δ_{ω_i}
Beta Processes
Beta Process and Bernoulli Process
Indian Buffet Process (IBP)
(Griffiths & Ghahramani, 2006)
• Indian restaurant with infinitely many dishes in a
buffet line
• Customers 1 through N enter the restaurant
– the first customer samples Poisson(α) dishes
– the nth customer samples a previously sampled dish k
with probability m_k / n, where m_k is the number of previous
customers who sampled dish k, and then samples Poisson(α/n)
new dishes
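A minimal sampler for this buffet process (illustrative sketch):

```python
import numpy as np

def sample_ibp(n_customers, alpha, seed=0):
    """Sample a binary customer-by-dish matrix from the Indian buffet
    process: customer n takes previously sampled dish k with probability
    m_k / n and then tries Poisson(alpha / n) brand-new dishes."""
    rng = np.random.default_rng(seed)
    rows = []          # one list of sampled dish indices per customer
    dish_counts = []   # m_k for each dish seen so far
    for n in range(1, n_customers + 1):
        dishes = [k for k, m_k in enumerate(dish_counts)
                  if rng.random() < m_k / n]
        n_new = rng.poisson(alpha / n)
        dishes += list(range(len(dish_counts), len(dish_counts) + n_new))
        for k in dishes:
            if k < len(dish_counts):
                dish_counts[k] += 1
            else:
                dish_counts.append(1)
        rows.append(dishes)
    # Assemble the binary feature matrix Z.
    Z = np.zeros((n_customers, len(dish_counts)), dtype=int)
    for i, dishes in enumerate(rows):
        Z[i, dishes] = 1
    return Z

print(sample_ibp(8, alpha=2.0))
```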
Beta Process Marginals
(Thibaux & Jordan, 2007)
• Theorem: The beta process is the De Finetti
mixing measure underlying the Indian buffet
process (IBP)
Hierarchical Beta Processes
• A hierarchical beta process is a beta process whose base measure
is itself random and drawn from a beta process
Towards the BP-Means Algorithm
• Let Z denote the (binary) Indian buffet matrix
• We again need to compute the joint probability p(X, Z)
• And perform small-variance asymptotics on −log p(X, Z)
• To obtain a cost function to which we can apply
coordinate descent
The Exchangeable Feature Probability
Function (EFPF)
• How do we compute the prior p(Z)?
• Broderick, Jordan and Pitman (2013) develop a
general approach to computing such exchangeable
feature probability functions (EFPFs)
• Applied to the Indian buffet process, it yields a closed-form
probability for the binary feature matrix
The Likelihood
• What about the likelihood p(X | Z)?
• Many possible choices; a popular one is the linear-Gaussian
model X = ZA + E, with Gaussian noise of variance σ² and a
Gaussian prior on the feature parameters A
Small-Variance Asymptotics
• Now take the variance σ² to zero; this yields the objective
min over Z and A of ||X − ZA||²_F + λ² K₊,
where K₊ is the number of instantiated features
• This yields a coordinate descent algorithm that we
refer to as “BP-Means”
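For concreteness, a rough sketch of one coordinate-descent sweep for a penalized objective of this form (an illustrative approximation, not the exact BP-Means updates from the paper; the greedy feature flips and the residual-based new-feature proposal are assumptions of the sketch):

```python
import numpy as np

def bp_means_cost(X, Z, A, lam):
    """Penalized objective: squared reconstruction error plus
    lam^2 per instantiated feature (column of Z)."""
    return ((X - Z @ A) ** 2).sum() + lam ** 2 * Z.shape[1]

def bp_means_pass(X, Z, A, lam):
    """One coordinate-descent sweep over the binary feature matrix Z:
    flip each z_ik if that lowers the cost, consider opening a new
    feature for each point, then refit A by least squares."""
    N, _ = X.shape
    for i in range(N):
        for k in range(Z.shape[1]):
            Z_flip = Z.copy()
            Z_flip[i, k] = 1 - Z_flip[i, k]
            if bp_means_cost(X, Z_flip, A, lam) < bp_means_cost(X, Z, A, lam):
                Z = Z_flip
        # Try a new feature whose row in A is point i's current residual.
        new_col = np.zeros((N, 1))
        new_col[i, 0] = 1.0
        Z_new = np.hstack([Z, new_col])
        A_new = np.vstack([A, X[i] - Z[i] @ A])
        if bp_means_cost(X, Z_new, A_new, lam) < bp_means_cost(X, Z, A, lam):
            Z, A = Z_new, A_new
    # Given Z, the optimal A is the least-squares solution.
    A, *_ = np.linalg.lstsq(Z, X, rcond=None)
    return Z, A

# Usage: start from a single all-ones feature and its least-squares A.
X = np.random.default_rng(0).normal(size=(30, 5))
Z0 = np.ones((30, 1))
A0, *_ = np.linalg.lstsq(Z0, X, rcond=None)
Z, A = bp_means_pass(X, Z0, A0, lam=1.0)
print(Z.shape, A.shape)
```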