
Advanced Statistical Inference Course Notes

ASI - Advanced Statistical Inference
Index
1 - Bayesian Linear Regression _________________________________________________6
Recap - Probabilities ______________________________________________________________6
Recap - Expectations ______________________________________________________________6
The Gaussian Distribution __________________________________________________________8
The Multivariate Gaussian Distribution _______________________________________________8
Expectations Gaussians ____________________________________________________________9
Working example _________________________________________________________________9
Definitions _____________________________________________________________________10
Linear Regression Definitions ______________________________________________________10
Linear Models for Regression ______________________________________________________10
Linear Regression as Loss Minimization ______________________________________________10
Probabilistic Interpretation of Loss Minimization______________________________________11
Probabilistic Interpretation of Loss Minimization______________________________________11
Properties of the Maximum-Likelihood Estimator _____________________________________11
Properties of the Maximum-Likelihood Estimator _____________________________________12
Model Selection _________________________________________________________________12
Validation on "unseen" data _______________________________________________________13
How should we choose which data to hold back (as unseen data)? _______________________13
Cross-validation _________________________________________________________________13
Leave-one-out Cross-validation ____________________________________________________13
Computational issues_____________________________________________________________13
Bayesian Inference ______________________________________________________________14
Bayesian Linear Regression ________________________________________________________14
When can we compute the posterior? _______________________________________________15
Why is this important? ____________________________________________________________15
Bayesian Linear Regression Finding posterior parameters ______________________________15
Bayesian Linear Regression Example ________________________________________________16
Predictive Distribution ___________________________________________________________17
Introducing basis functions ________________________________________________________17
Predictions _____________________________________________________________________18
Computing posterior: recipe_______________________________________________________18
Marginal likelihood ______________________________________________________________19
Choosing a prior _________________________________________________________________19
Summary _______________________________________________________________________19
2 - Gaussian Process _________________________________________________________20
Gaussian Process ________________________________________________________________20
Bayesian Linear Regression as a Kernel Machine ______________________________________20
Bayesian Linear Regression as a Kernel Machine ______________________________________20
Kernels ________________________________________________________________________20
Gaussian Processes ______________________________________________________________21
Gaussian Processes Prior over Functions _____________________________________________21
Kernel _________________________________________________________________________21
Gaussian Processes Prior over Functions _____________________________________________21
Gaussian Processes Regression example _____________________________________________23
Optimization of Gaussian Process parameters ________________________________________24
Summary _______________________________________________________________________25
3 - Bayesian Logistic Regression and the Bayesian Classifier ________________________26
Classification ___________________________________________________________________26
Probabilistic v non-probabilistic classifiers ___________________________________________26
Classification syllabus ____________________________________________________________26
Some data ______________________________________________________________________26
Logistic regression _______________________________________________________________27
Bayesian logistic regression _______________________________________________________27
Defining a prior _________________________________________________________________27
Defining a likelihood _____________________________________________________________28
Posterior _______________________________________________________________________28
What can we compute? ___________________________________________________________28
MAP estimate (Maximum A Posteriori) _______________________________________________28
Decision boundary _______________________________________________________________29
Predictive probabilities ___________________________________________________________29
Roadmap _______________________________________________________________________30
Laplace approximation ___________________________________________________________30
Laplace approximation 1D example_________________________________________________30
Laplace approximation for logistic regression ________________________________________31
Predictions with the Laplace approximation _________________________________________31
Summary roadmap _______________________________________________________________32
MCMC sampling _________________________________________________________________33
Back to the script: Metropolis-Hastings _____________________________________________33
MH proposal ____________________________________________________________________33
MH acceptance __________________________________________________________________33
MH flowchart ___________________________________________________________________34
MH walkthrough _________________________________________________________________34
What do the samples look like? ____________________________________________________35
Predictions with MH ______________________________________________________________35
Summary _______________________________________________________________________35
3.1 - Bayesian Classifier ______________________________________________________36
Bayes classifier__________________________________________________________________36
Bayes classifier likelihood _________________________________________________________36
Bayes classifier prior _____________________________________________________________36
Naive-Bayes ____________________________________________________________________36
Bayes classifier, example 1 ________________________________________________________36
Step 1: fitting the class-conditional densities ________________________________________37
Compute predictions _____________________________________________________________38
Bayes classifier, example 2 ________________________________________________________38
Fit the class conditionals… ________________________________________________________39
Compute predictions _____________________________________________________________39
Bayes classifier summary _________________________________________________________39
3.2 Performance Evaluation __________________________________________________40
Performance evaluation __________________________________________________________40
0/1 Loss _______________________________________________________________________40
ROC Analysis ____________________________________________________________________41
ROC curve ______________________________________________________________________41
AUC ___________________________________________________________________________42
Confusion matrices ______________________________________________________________42
Confusion matrices example ______________________________________________________42
Summary _______________________________________________________________________43
4 - Variational Inference _____________________________________________________44
Where are we? __________________________________________________________________44
Refresh: Kullback-Leibler divergence _______________________________________________44
Logistic Regression as a working example____________________________________________45
Inference ______________________________________________________________________45
Variational Inference _____________________________________________________________45
Visual illustration of Variational Inference ___________________________________________46
Variational Inference Form of the approximation _____________________________________46
Variational Inference Objective ____________________________________________________47
Variational Inference: Reparameterization trick ______________________________________49
Variational Inference Reparameterization trick (Derivation) ____________________________49
Variational Inference Reparameterization trick (Properties) ____________________________50
Variational Inference with Stochastic Optimization ___________________________________50
Stochastic Gradient Optimization __________________________________________________50
Results on Classification __________________________________________________________51
Extensions ______________________________________________________________________51
Mini-batching ___________________________________________________________________51
Better approximation with Normalizing Flows ________________________________________52
5 - K-means, Kernel K-means, and Mixture models _______________________________53
Unsupervised learning ____________________________________________________________53
Aims ___________________________________________________________________________53
Clustering ______________________________________________________________________53
K-means _______________________________________________________________________53
How do we find ? ________________________________________________________________54
When does K-means break? _______________________________________________________54
Kernelizing K-means _____________________________________________________________55
Kernel K-means _________________________________________________________________55
K-means summary _______________________________________________________________56
Mixture models thinking generatively _______________________________________________56
A generative model ______________________________________________________________57
Mixture model likelihood _________________________________________________________57
Jensen's inequality_______________________________________________________________58
Optimizing lower bound __________________________________________________________59
Gaussian mixture model __________________________________________________________59
Update for qnk __________________________________________________________________60
Updates for and ________________________________________________________________60
Mixture model optimization algorithm ______________________________________________60
Mixture model clustering _________________________________________________________61
Mixture model issues _____________________________________________________________61
What can we do? (A: cross validation…) _____________________________________________61
Mixture models other distributions _________________________________________________61
Binary example _________________________________________________________________62
Summary _______________________________________________________________________62
6 - Feature Selection, PCA and probabilistic PCA _________________________________63
1 - Bayesian Linear Regression
Recap - Probabilities
Consider two continuous random variables x and y:
• Sum rule:
p(x) = ∫ p(x, y) dy
• Product rule:
p(x, y) = p(x ∣ y)p(y) = p(y ∣ x)p(x)
In the product rule, writing "∣ y" means that y is no longer treated as a random variable: it is fixed to a particular value. Which value? That depends on the context.
The interesting thing is that the order of the factorization doesn't matter.
• Bayes' rule:
p(y ∣ x) = p(x ∣ y) p(y) / p(x)
It's a direct consequence of the product rule.
Recap - Expectations
Consider a random variable with density p(x). Imagine we want to know the average value of x, call it x̃. One way is simply to generate S samples and average them:

x̃ ≈ (1/S) Σ_{s=1}^{S} x_s

This is called the empirical estimate.
Our sample-based approximation to x̃ will get better as we take more samples.
We can also (sometimes) compute it exactly using expectations.
• Discrete:
x̃ = Ep(x)[x] = Σ_x x p(x)
• Continuous:
x̃ = Ep(x)[x] = ∫ x p(x) dx
Example:
• X is the outcome of rolling a fair die: P(X = x) = 1/6 for x = 1, …, 6.
All we need to do is go through these values, multiply each by its probability and sum, so eventually we get 3.5:
x̃ = Σ_x x P(X = x) = 3.5
• X is a uniformly distributed RV between a and b:
x̃ = ∫_{x=a}^{x=b} x p(x) dx = (a + b)/2
In general:
Ep(x)[f(x)] = ∫ f(x) p(x) dx
Some important properties:
Ep(x)[ f (x)] ≠ f (Ep(x)[x])
Ep(x)[k f (x)] = k Ep(x)[ f (x)]
The expectation of a random variable is commonly called “Mean”.
Mean and variance:
μ = Ep(x)[x]
σ 2 = Ep(x) [(x − μ)2 ] = Ep(x) [x 2 ] − μ 2
The expected squared distance between the random variable x and the mean is a measure of the dispersion of the random variable: in expectation, how far are we from the mean? We use the square because we don't care on which side of the mean we are, only about the distance. If in expectation we are very far from the mean, the variance is large.
NB: the expectation of the square is always at least as large as the square of the expectation.
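A minimal sketch of this sample-based estimate, for the fair-die example above and for E[x²] under a Gaussian; the sample sizes are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical estimate of E[X] for a fair die: the exact value is 3.5.
die = rng.integers(1, 7, size=100_000)
print(die.mean())

# Empirical estimate of E[x^2] for x ~ N(mu, sigma^2): the exact value is sigma^2 + mu^2.
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=100_000)
print((x ** 2).mean(), sigma ** 2 + mu ** 2)
```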
What we said also applies to vectors of random variables:
Ep(x)[f(x)] = ∫ f(x) p(x) dx
We want to make inference about high dimensional spaces and functions.
Mean and covariance:
μ = Ep(x)[x]
cov(x) = Ep(x) [(x − μ)(x − μ)⊤]
= Ep(x) [xx⊤] − μ μ ⊤
Usually, when we talk about a vector we treat it as an N×1 matrix, where N is the size of the vector.
7
Bayesian Linear Regression
11 / 03 / 2022
In the covariance we get an N×N matrix. The variance tells us how far a random variable is from its mean. On the diagonal of this matrix we have the variance of each component of the vector, and each off-diagonal element says "when this particular component varies, how much does the other component vary with it". It captures the correlations between random variables.
The Gaussian Distribution
Consider a continuous random variable V. The Gaussian probability density function is:
p(v ∣ μ, σ²) = (1 / (σ√(2π))) exp{ −(v − μ)² / (2σ²) }
μ is the mean and it controls the location (more to the left or more to the right) of the bell curve; σ² is the variance and controls its "flatness".
The first term is a scaling factor: it ensures that the area under the curve equals 1, and it depends on the variance.
The Multivariate Gaussian Distribution
It’s the distribution of a random vector. It is the most useful one, the most flexible one to use in
Machine learning.
Consider v = (v1, ..., vD)⊤ with joint Gaussian distribution p(v ∣ μ, Σ) = 𝒩(v ∣ μ, Σ):

𝒩(v ∣ μ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp{ −(1/2)(v − μ)⊤Σ⁻¹(v − μ) }

The inverse covariance Σ⁻¹ (the object that controls the variances on the diagonal and the interactions off the diagonal) acts like a scaling factor: where before we had 1/(2σ²), now we have Σ⁻¹. It normalizes the quadratic form according to the covariance, and this tells us the "orientation" of the Gaussian.
The normalization of the integral is now harder to compute than before, but if we do it we get:

1 / ((2π)^{D/2} |Σ|^{1/2})
Numerically this may be challenging. Here we have a view from the top:
Using 9 on the diagonal scales each dimension by a factor of 1/9; this gives the same result as taking two univariate Gaussians and multiplying them together. The second and third cases are more interesting. The value -4 tells us what the fact that one variable is on average larger than its mean does to the average value of the second variable compared to its mean. So here, having -4 means that whenever v − μ is positive for the first component, on average we will see a negative effect on the second one.
The interesting thing is that if we take the eigenvectors of the covariance, we get a basis in which the covariance is actually diagonal. If we rotate the axes v1 and v2 so that they are aligned with the major axes of the Gaussian, this rotation lets us see a covariance which is perfectly diagonal.
Expectations Gaussians
For the Gaussian, the parameters we introduced to locate and shape the distribution are exactly its mean and variance (and covariance in the multivariate case).
Univariate:
p (x ∣ μ, σ 2) = 𝒩 (x ∣ μ, σ 2)
Mean: Ep(x)[x] = μ
Variance: Ep(x) [(x − μ)2 ] = σ 2
Multivariate:
p (x ∣ μ, Σ) = 𝒩(x ∣ μ, Σ)
Mean: Ep(x)[x] = μ
Variance: Ep(x) [(x − μ)(x − μ)⊤] = Σ
Working example
We need to reverse the way this data was obtained and try to figure out which function generated it.
Applying Bayesian principles allows us to obtain a distribution over functions that can model this data. This is powerful because the picture tells us that there is not just one single function that interpolates the data: according to the knowledge we have, the Bayesian approach lets us encode our assumptions and keep every function compatible with them.
Definitions
Features, inputs, covariates, or attributes x:
x ∈ ℝD
X = (x1, . . . , xN )T
Labels, outputs, or responses:
y ∈ ℝO
Y = (y1, . . . , yN )T
Linear Regression Definitions
Data is a set of N pairs of feature vectors and labels:
𝒟 = {(xi, yi)}_{i=1,...,N}
GOAL: estimate a function f(x) : ℝ^D → ℝ^O.
For simplicity, we will assume O = 1 (univariate labels), y = (y1, ..., yN)⊤, so we aim to estimate f(x) : ℝ^D → ℝ.
Linear Models for Regression
Implement a linear combination of basis functions
f(x) = Σ_{i=1}^{D} wi φi(x) = w⊤φ(x)
with
φ(x) = (φ1(x), …, φD(x))⊤.
For simplicity we will start with linear functions
f(x) = Σ_{i=1}^{D} wi xi = w⊤x
Where w controls the “angle”.
Linear Regression as Loss Minimization
Definition of the quadratic loss function:
ℒ = Σ_{i=1}^{N} [yi − w⊤xi]² = ∥y − Xw∥²
We want the difference between each yi and the evaluation of our function to be as small as possible. This is what we call the loss: for all the data we have, we want our function to be very close to the observations.
One way to put this computation in matrix form: we can think of the sum of squares as the squared norm of a vector whose components are yi − w⊤xi.
Solution to the regression problem is:
∇w ℒ = 0  ⟹  ŵ = (X⊤X)⁻¹ X⊤y
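A minimal NumPy sketch of this closed-form solution on synthetic data (the data and the true weights are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Closed-form solution w_hat = (X^T X)^{-1} X^T y.
# Solving the linear system is preferred over forming the inverse explicitly.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # close to w_true
```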
Probabilistic Interpretation of Loss Minimization
Consider a simple transformation of the loss function
Instead of minimizing the first function we can take the function on the right (which looks like a
gaussian) and try to maximize that one. I want to find w such that my model is most likely to
explain this data. So instead of minimizing a loss we maximize a likelihood.
Minimizing the quadratic loss is equivalent to maximizing a Gaussian likelihood function:

exp(−γℒ) = exp(−γ∥y − Xw∥²) ∝ 𝒩(y ∣ Xw, (1/(2γ)) I)
Probabilistic Interpretation of Loss Minimization
If we analyze the likelihood a little more we assume that our y is a distribution around a mean
which is Xw with a certain variance which is 1/2γ. So somehow we are assuming that there is a
model which puts some sort of gaussian distribution over the observation around the mean.
So the likelihood 𝒩(y ∣ Xw, (1/(2γ)) I) hints at the fact that we are assuming:
yi = w ⊤ xi + εi
The epsilon is going to be some sort of noise with variance 1/(2γ). In vector form:
y = Xw + ϵ
With ϵ ∼ 𝒩 (ε ∣ 0,σ 2 I).
Remark: the likelihood is not a probability! It’s a probability density function over y and this is
controlled by w.
Given the ML solution ŵ = (X⊤X)⁻¹X⊤y, we can also maximize the log-likelihood to obtain the optimal σ²:

∂ log[p(y ∣ X, w, σ²)] / ∂σ² = 0

yielding

σ̂² = (1/N) (y − Xŵ)⊤(y − Xŵ)
Properties of the Maximum-Likelihood Estimator
Are there any useful properties for the estimator ŵ ?
The estimator ŵ is unbiased! That is:

Ep(y∣X,w)[ŵ] = ∫ ŵ p(y ∣ X, w) dy = w
An estimator is unbiased when the expectation of the estimator under the distribution of
p(y | X, w) is actually w.
But what is this expectation Ep(y∣X,w)[ŵ]? Imagine the process of computing it. An expectation can be approximated as an average: we sample from the distribution and take an average. So imagine we generate many datasets, i.e. many ys. For each of these datasets we estimate ŵ, so we get a family of ŵs. Then we take the average of these ŵs. This average is going to be exactly w in the limit of infinitely many datasets.
So if the data really are generated from the model we are assuming, we are doing the best we can.
This is an important property.
Properties of the Maximum-Likelihood Estimator
Unfortunately, the estimate of the optimal σ 2 is biased!
Ep(y∣X,w)[σ̂²] = (1/N) Ep(y∣X,w)[(y − Xŵ)⊤(y − Xŵ)] = σ²(1 − D/N)
Model Selection
How can we prefer one model over another? Lowest loss / highest likelihood? NO! Higher model complexity yields lower loss and higher likelihood, but it usually does not generalize well on test data. So we have to avoid overfitting, and to do so we have to define another way to select our model.
Model Selection Effect of increasing model complexity
Consider polynomial functions:
f(x) = Σ_{i=0}^{k} wi x^i
The training loss decreases with k, but the test loss eventually increases: on the test set the error first decreases and then goes up again. So we could use a portion of our data as "unseen data", as our validation set. This is the idea behind validation on "unseen" data.
Validation on "unseen" data
Cross-validation is a safe way to do model selection
Predictions evaluated using validation loss:
ℒv = (1/N_test) Σ_{i∈ℐ_test} (yi − w⊤xi)²

We take this loss as a measure of how good our model is. We could also use the validation log-likelihood; in that case we would like it to be large.
log[p(y_test ∣ X_test, ŵ, σ²)] = −(1/(2σ²)) Σ_{i∈ℐ_test} (yi − ŵ⊤xi)²
How should we choose which data to hold back (as unseen data)?
In some applications it will be clear, but in many cases we pick it randomly. The best thing to do is to do it more than once and then average the results.
So cross-validation splits the data into C equal sets; train on C − 1 of them, test on the remaining one.
Cross-validation
If we do this C times, we have to learn our model C times just to do the validation, and this can be problematic!
Leave-one-out Cross-validation
The extreme case is C = N, so each fold includes exactly one input-label pair. We call it "Leave-one-out" (LOO) CV.
Computational issues
CV and LOOCV let us choose from a set of models based on predictive performance.
This comes at a computational cost:
- For C-fold CV, we need to train our model C times.
- For LOO-CV, we need to train our model N times.
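A minimal sketch of C-fold cross-validation for choosing the polynomial order k discussed above; the toy data, the candidate orders, and C = 5 are assumptions for illustration:

```python
import numpy as np

def cv_loss(x, y, k, C=5, seed=0):
    """Average validation loss of a degree-k polynomial fit over C folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, C)
    losses = []
    for c in range(C):
        val = folds[c]
        train = np.concatenate([folds[j] for j in range(C) if j != c])
        w = np.polyfit(x[train], y[train], deg=k)        # least-squares polynomial fit
        pred = np.polyval(w, x[val])
        losses.append(np.mean((y[val] - pred) ** 2))
    return np.mean(losses)

# Toy data: noisy sine curve; pick the k with the lowest validation loss.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 60)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.shape)
print({k: round(cv_loss(x, y, k), 4) for k in range(1, 8)})
```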
Bayesian Inference
Inputs: X = (x1, …, xN)⊤
Labels: y = (y1, …, yN)⊤
Weights: w = (w1, …, wD)⊤
The essence of how we are going to apply this to machine learning is as follows: we are going to turn the conditional p(y ∣ w) into p(w ∣ y). We are going to use p(y ∣ X, w), treating X as fixed.
Why is this important? Because we are going to turn something we know, our likelihood function, into something powerful: a distribution over w given data. This is going to tell us which values of w are compatible with the observations we have. And it is not going to be just a value: there is no argmax, no optimization. In Bayes' theorem there is no maximization; the result is simply a distribution. The way we turn p(y ∣ w) into p(w ∣ y) is:
p(w ∣ y, X) = p(y ∣ X, w) p(w) / ∫ p(y ∣ X, w) p(w) dw
But what is p(w)? It is some distribution we have over the parameters that is not conditioned on data: it is what we know about the parameters before we look at any data. In our case it can be any distribution. The process of multiplying by the likelihood gives us a distribution over w that is constrained by the fact that we have observed data, we have evidence now. Not all values of w are good for modelling our data; some of them correspond to very bad functions, and those are going to get a very low value in p(w ∣ y, X) because the likelihood is going to be very small.
Now we can focus on the denominator, which is a normalization constant. This is going to be the
problem. Bayesian inference is nice but we have to solve this integral. It's quite a difficult problem because it's an integral in D dimensions. Integrals are messy, especially here where we have a product of functions. In the Gaussian case this is easy because a product of Gaussians is the exponential of a quadratic form. But in general this is not the case and we have to approximate the integral.
Bayesian Linear Regression
Here we are formalizing what we said. This is the likelihood function we talked about before: p(y ∣ X, w) is going to be a sort of Gaussian, centered on Xw with variance σ²:
p (y ∣ w, X, σ 2) = 𝒩 (y ∣ Xw, σ 2 I)
Now we are going to put a Gaussian distribution over the parameters before looking at any data. We call it the Gaussian prior over model parameters because it is specified prior to observing any data:
p(w) = 𝒩(w ∣ 0, S)
We can expand the integral into something which is equal to p(y ∣ X):

p(w ∣ X, y) = p(y ∣ X, w) p(w) / ∫ p(y ∣ X, w) p(w) dw = p(y ∣ X, w) p(w) / p(y ∣ X)
Now, thanks to the multiplication by the likelihood we can turn the prior, p(w), into a posterior,
which is the distribution over parameters after observing data.
So these are the actors of the Bayes rule:
Posterior density: p(w|X,y), the distribution over parameters after observing data;
Likelihood: p(y|X,w), the measure of “fitness”;
Prior density: p(w), anything we know about parameters before we see any data.
Marginal likelihood: p(y|X), a normalization constant that ensures ∫ p(w ∣ X, y) dw = 1.
We are integrating the likelihood with respect to the prior: if we sample from the prior, how often do we get a good likelihood? If the model has few parameters, say just one, p(w) is a simple one-dimensional Gaussian. The likelihood is never very good because the data are more complex, but at least the support of p(w) gives some (small) likelihood everywhere.
Very complex models spread p(w) across a huge-dimensional space, and there are some values of w for which p(y|X,w) is really, really good because the model is very complex, so the likelihood gets very high there.
Of course we need a trade-off: a model that covers the space in a reasonable way with p(w), in such a way that the likelihood is also good. This is the meaning of the marginal likelihood.
When can we compute the posterior?
Conjugacy (definition): A prior p(w) is said to be conjugate to a likelihood if it results in a posterior of the same type of density as the prior.
Example:
Prior: Gaussian;
Likelihood: Gaussian;
Posterior: Gaussian
Prior: Beta;
Likelihood: Binomial;
Posterior: Beta
Many others…
If we know that the posterior is going to have a certain form we are just going to match the
parameters directly.
Why is this important?
Bayes rule:
p(w ∣ X, y) = p(y ∣ X, w) p(w) / p(y ∣ X)
If prior and likelihood are conjugate, we know the form of p(w|X, y); Therefore, we know the
form of the normalizing constant; Therefore, we don't need to compute p(y|X)! So in the case of
the mean and the variance we just have to understand what is the mean and what is the
variance of the product. And a way to do this is…
Bayesian Linear Regression Finding posterior parameters
Back to our model… The posterior must be Gaussian, ignoring normalizing constants, the
posterior is:
p(w ∣ X, y, σ²) ∝ exp{ −(1/2)(w − μ)⊤Σ⁻¹(w − μ) }
= exp{ −(1/2)(w⊤Σ⁻¹w − 2w⊤Σ⁻¹μ + μ⊤Σ⁻¹μ) }
∝ exp{ −(1/2)(w⊤Σ⁻¹w − 2w⊤Σ⁻¹μ) }

Ignoring non-w terms, the prior multiplied by the likelihood is:

p(y ∣ w, X, σ²) p(w) ∝ exp{ −(1/(2σ²))(y − Xw)⊤(y − Xw) } exp{ −(1/2) w⊤S⁻¹w }
∝ exp{ −(1/2)( w⊤[(1/σ²) X⊤X + S⁻¹] w − (2/σ²) w⊤X⊤y ) }

Posterior (from previously):

∝ exp{ −(1/2)(w⊤Σ⁻¹w − 2w⊤Σ⁻¹μ) }

Equate individual terms on each side. Covariance:

w⊤Σ⁻¹w = w⊤[(1/σ²) X⊤X + S⁻¹] w  ⟹  Σ = ((1/σ²) X⊤X + S⁻¹)⁻¹

Mean:

2w⊤Σ⁻¹μ = (2/σ²) w⊤X⊤y  ⟹  μ = (1/σ²) Σ X⊤y
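A minimal NumPy sketch of these posterior updates; the toy data, noise variance, and prior covariance S are assumptions for illustration:

```python
import numpy as np

def blr_posterior(X, y, sigma2, S):
    """Posterior N(mu, Sigma) over w for likelihood N(y | Xw, sigma2*I) and prior N(0, S)."""
    Sigma = np.linalg.inv(X.T @ X / sigma2 + np.linalg.inv(S))
    mu = Sigma @ X.T @ y / sigma2
    return mu, Sigma

# Toy example.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=50)
mu, Sigma = blr_posterior(X, y, sigma2=0.1**2, S=np.eye(2))
print(mu)        # posterior mean, close to [1.0, -0.5]
print(Sigma)     # posterior covariance
```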
Bayesian Linear Regression Example
Imagine we have some data. As a modelling assumption, take a linear model with two parameters:
f(x) = w0 + w1 x
This is the family of functions that are compatible with the data, under the modelling assumption:
This illustration explains the fact that we get a concentration around certain values of
parameters:
Predictive Distribution
We have a family of functions that fit our data, so if we take a new input x* and want to predict the value of y, we are not going to get just one value of y but a family of functions passing through x*, hence a family of values for y*. This is what we call the predictive distribution. To obtain it we apply the sum rule and the product rule, which gives a pretty powerful expression:

p(y* ∣ X, y, x*, σ²) = ∫ p(y* ∣ x*, w, σ²) p(w ∣ X, y, σ²) dw
Consider this p(y*|x*, w, σ 2). This is the likelihood function that we impose on the model. So this
is giving me what is p(y*) given x*, w, σ 2.
So if you give me w I know how to predict on x*, I just evaluate my function at x*. And this is
going to be my mean of the distribution over y*.
But now we have the posterior over w, the distribution over w given data. So what we can do is weight each prediction according to how good its parameters are, where "good" means how large the posterior p(w ∣ X, y) is. And if we do that, something magical happens: the prediction does not contain parameters anymore. The prediction on y* is a distribution in which w has disappeared, and this is very elegant. We are doing parameter-free prediction! We don't need to optimize parameters. The way this is done is through the predictive distribution, this simple expression:

p(y* ∣ X, y, x*, σ²) = ∫ p(y* ∣ x*, w, σ²) p(w ∣ X, y, σ²) dw
Same tedious exercise as before yields:
p(y* ∣ X, y, x*, σ²) = 𝒩(y* ∣ x*⊤μ, σ² + x*⊤Σx*)
Here we can see that the mean of the prediction is centered on the mean of w, while the variance contains the covariance of w, which is Σ, through a quadratic form with x*.
Introducing basis functions
Now we can transform our input through basis functions and apply the same machinery. Instead
of working with x we will work with ϕ(x)
x → φ(x) = (φ1(x), …, φD(x))⊤
If we think of the matrix Φ instead of the matrix X, it is N × D: for each xi we evaluate the functions φ1 to φD, and we do this for all N inputs:
Applying Bayesian Linear Regression on the transformed features gives:

p(w ∣ X, y, σ²) = 𝒩(w ∣ μ, Σ)

Covariance: Σ = ((1/σ²) Φ⊤Φ + S⁻¹)⁻¹
Mean: μ = (1/σ²) Σ Φ⊤y
Predictions: p(y* ∣ X, y, x*, σ²) = 𝒩(y* ∣ φ(x*)⊤μ, σ² + φ(x*)⊤Σφ(x*))
The important thing is that we have now increased the flexibility of our model tremendously: we no longer need to work with hyperplanes, we can work with something more involved, non-linear functions. The beautiful thing is that we still have a model which is linear in the parameters, because we still have a linear combination of these functions. The parameters still enter linearly in the equations, and that is what makes this a linear model even when the basis functions are non-linear. We can use polynomials, sine, cosine, log, exp, whatever we want: provided we combine them linearly we can still apply this machinery, we can still be Bayesian, we can still use the posterior distribution over parameters, we can still do prediction, and everything stays Gaussian: the posterior is Gaussian and the predictive distribution is Gaussian.
So Bayesian linear regression is a solved problem! The only problem is how to choose these basis
functions.
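A minimal sketch putting the pieces together with a polynomial basis (an assumed choice); the toy data, noise level, and prior are illustrative:

```python
import numpy as np

# Polynomial basis functions: phi(x) = (1, x, x^2, ..., x^degree).
def poly_features(x, degree):
    return np.vander(x, N=degree + 1, increasing=True)

def blr_fit(Phi, y, sigma2, S):
    """Posterior N(mu, Sigma) over weights for prior N(0, S) and noise variance sigma2."""
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.linalg.inv(S))
    mu = Sigma @ Phi.T @ y / sigma2
    return mu, Sigma

def blr_predict(phi_star, mu, Sigma, sigma2):
    """Predictive mean and variance at transformed test inputs (rows of phi_star)."""
    mean = phi_star @ mu
    var = sigma2 + np.sum(phi_star @ Sigma * phi_star, axis=1)
    return mean, var

# Toy data: noisy quadratic.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 0.5 - 2.0 * x + 1.5 * x**2 + 0.1 * rng.normal(size=x.shape)

sigma2, degree = 0.1**2, 2
Phi = poly_features(x, degree)
mu, Sigma = blr_fit(Phi, y, sigma2, S=np.eye(degree + 1))

x_star = np.linspace(-1.2, 1.2, 5)
mean, var = blr_predict(poly_features(x_star, degree), mu, Sigma, sigma2)
print(np.c_[x_star, mean, np.sqrt(var)])   # predictive mean and standard deviation
```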
Predictions
Here we can see polynomials of order 2: a family of functions generated by sampling from the posterior distribution of the parameters.
Computing posterior: recipe
(Assuming prior conjugate to likelihood)
I. Write down prior times likelihood (ignoring any constant terms)
II. Write down posterior (ignoring any constant terms)
III. Re-arrange them so they look like one another
IV. Equate terms on both sides to read off parameter values.
Marginal likelihood
So far, we've ignored p(y|X, σ 2), the normalizing constant in Bayes rule.
We stated that it was equal to:
p (y ∣ X, σ 2) = ∫ p (y ∣ X, w, σ 2) p(w)d w
We're averaging over all values of w to get a value for how good the model is, how likely is y
given X and the model.
We can use this to compare models and to optimize σ 2!
When the prior is 𝒩(μ0, Σ0) and the likelihood is 𝒩(Xw, σ²I), the marginal likelihood is:

p(y ∣ X, σ², μ0, Σ0) = 𝒩(y ∣ Xμ0, σ²I + XΣ0X⊤)

i.e. an N-dimensional Gaussian evaluated at y.
If we use the marginal likelihood as a criterion for model selection, we can see that the polynomial of order 2 is the best choice:
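A minimal sketch of using the log marginal likelihood to compare polynomial orders, assuming a zero-mean prior 𝒩(0, Σ0); the toy data and noise level are assumptions:

```python
import numpy as np

def log_marginal_likelihood(Phi, y, sigma2, Sigma0):
    """log N(y | 0, sigma2*I + Phi Sigma0 Phi^T), assuming a zero-mean prior."""
    N = len(y)
    C = sigma2 * np.eye(N) + Phi @ Sigma0 @ Phi.T
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

# Compare polynomial orders on toy data generated from an order-2 polynomial;
# that order should typically get the highest marginal likelihood.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = 1.0 - x + 2.0 * x**2 + 0.1 * rng.normal(size=x.shape)
for k in range(1, 6):
    Phi = np.vander(x, N=k + 1, increasing=True)
    print(k, round(log_marginal_likelihood(Phi, y, 0.1**2, np.eye(k + 1)), 2))
```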
Choosing a prior
How should we choose the prior?
I. Prior effect will diminish as more data arrive;
II. When we don't have much data, prior is very important.
Some influencing factors:
I. Data type: real, integer, string, etc.
II. Expert knowledge: 'the coin is fair', 'the model should be simple’;
III. Computational considerations (not as important as it used to be!)
IV. If we know nothing, can use a broad prior e.g. uniform density.
Summary
I. Moved away from a single parameter value.
II. Saw how predictions could be made by averaging over all possible parameter values (the Bayesian way).
III. Saw how Bayes rule allows us to get a density for w conditioned on the data (and other stuff).
IV. Computing the posterior is hard except in some cases....
V. ....we can do it when things are conjugate.
VI. Can also (sometimes) compute the marginal likelihood....
VII. ...and use it for comparing models, with no need for costly cross-validation.
2 - Gaussian Process
Gaussian Process
Linear models require specifying a set of basis functions: polynomials, trigonometric functions, …
Can we use Bayesian inference to let the data tell us?
Gaussian Processes kind of do that: they work implicitly with an infinite set of basis functions and learn a probabilistic combination of these. Also, the model capacity grows with the number of data, so the more data we use, the more complex the model becomes automatically. If we have an infinite set of basis functions, then as we add data the posterior keeps changing, and because there are infinitely many basis functions the posterior can capture all these data without any problem. With a linear model that only has the identity function as its basis function, the capacity cannot improve: the basis function is fixed.
Bayesian Linear Regression as a Kernel Machine
We are going to show that predictions can be expressed exclusively in terms of scalar products as
follows:
k (x, x′) = ψ (x)⊤ψ (x′)
This allows us to work with either k( ⋅ , ⋅ ) or ψ ( ⋅ )
Why is this useful? Because a scalar product is just a function that takes two inputs and gives a scalar. What are the properties of this number? If it is a scalar product, we know that the function must be positive definite. If we choose the basis functions, we end up with a formulation of this kind: k(x, x′) = ψ(x)⊤ψ(x′). But if we think of it as a function that takes two arguments and spits out a scalar, we can also go the other way: whenever we have this scalar product we can replace it with the function k. As long as we choose a function k(x, x′) that is positive definite, we may do the reverse: instead of specifying the basis functions we could specify k, and that k may induce a ψ of any kind. Moreover, if we choose a certain k we can ensure that ψ is infinite dimensional, and we don't need to know what ψ is; all we need to know is the scalar product between the two vectors ψ. This is really what people refer to as the kernel trick in SVMs, for example.
Bayesian Linear Regression as a Kernel Machine
This is one reason why k(x, x′) = ψ(x)⊤ψ(x′) is useful. The other one is the following: if we work with basis functions, we need to store the inverted covariance matrix for the posterior over the parameters.
Remember that this matrix is D×D, so we have to compute a quadratic form, and factorizing the matrix costs O(D³) time.
So, working with ψ( ⋅ ) costs O(D²) storage and O(D³) time.
Things are different with k: working with k( ⋅ , ⋅ ) costs O(N²) storage and O(N³) time.
So Gaussian processes are nice, but these constraints are pretty bad.
If we have a lot of features and few data, it makes sense to work with the kernel formulation, because we only deal with N×N objects instead of D×D. This is one reason to use Gaussian processes: when we have more features than observations.
But what if we could pick k( ⋅ , ⋅ ) so that ψ ( ⋅ ) is infinite dimensional?
Kernels
It is possible to show that for

k(x, x′) = exp(−∥x − x′∥² / 2)
there exists a corresponding ψ ( ⋅ ) that is infinite dimensional! Of course there are other kernels
satisfying this property.
To show that Bayesian Linear Regression can be formulated through scalar products only, we
need Woodbury identity:
(A + UCV)⁻¹ = A⁻¹ − A⁻¹U(C⁻¹ + VA⁻¹U)⁻¹VA⁻¹
Do not memorize this!
Gaussian Processes
Gaussian Processes can be explained in two ways
1. Weight Space View (Bayesian linear regression with infinite basis functions);
2. Function Space View (Defined as priors over functions)
Gaussian Processes Prior over Functions
Consider an infinite number of Gaussian random variables: think of them as indexed by the real line and as independent, and denote them by f(x).
If we look at the covariance of all these random variables, we get something which is infinite dimensional and diagonal.
Kernel
Consider the Gaussian kernel again: k(x, x′) = α exp(−β∥x − x′∥²). We introduced some parameters for added flexibility. Multiplying by α and β effectively changes the overall shape of the kernel: α > 0 stretches the basis function vertically.
NB: if we multiply x and x′ by different coefficients we will have trouble, because it becomes difficult to prove that it is a kernel.
Gaussian Processes Prior over Functions
Now imagine that we use this function to say something of this kind: the function is a Gaussian bump centred at x − x′ = 0. Imagine using it to impose a covariance on the random variables around 0, around the middle point of the plot.
If we do that, fixing x = 0, we are saying that all the random variables around 0 should covary with f(0) according to the function we decided.
This can be used as a prior over functions, instead of a prior over parameters that multiply basis functions and combinations of basis functions and so on.
Now we can play around with alpha and beta. These are Infinite Gaussian random variables with
parameterized and input-dependent covariance:
We can also use all this for model selection and if we can have access to the marginal likelihood
of the model we can optimize the marginal likelihood w.r.t. alpha and beta to get something
nice.
But how can we deal with infinity? We are still talking about an infinite number of random variables, all of them Gaussian. How can we handle them?
If N random variables are jointly Gaussian and we only care about a few of them, we don't really need to know what the others do. Looking at the covariance, we have an infinite number of variables, all Gaussian, so we have this infinite-by-infinite covariance! But if we only care about what happens at certain points, at certain random variables, then all we need to do is take this big covariance, select the rows and the columns corresponding to the random variables we are looking at, and throw away everything else. If we do this we obtain a matrix which is N×N.
Let's look formally at what we have just done. We have the vector f, the vector of realizations of these random variables, f(x1), ..., f(xN). Based on the construction we have just done, the distribution over them has zero mean and covariance K.
So, the marginal distribution of f = (f(x1), ..., f(xN))⊤ is p(f ∣ X) = 𝒩(0, K). Conditioning on f, the value f* of the function at a new input x* is Gaussian with:

m̄ = k*⊤ K⁻¹ f
s̄² = k** − k*⊤ K⁻¹ k*
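A minimal sketch of this construction: evaluate the kernel on a grid of 1-D inputs and draw joint samples from 𝒩(0, K). The α, β values and the jitter term are illustrative assumptions:

```python
import numpy as np

def gauss_kernel(X1, X2, alpha=1.0, beta=10.0):
    """k(x, x') = alpha * exp(-beta * (x - x')^2), evaluated on all pairs of 1-D inputs."""
    return alpha * np.exp(-beta * (X1[:, None] - X2[None, :]) ** 2)

# Draw a few functions from the GP prior at a fine grid of inputs.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
K = gauss_kernel(x, x) + 1e-8 * np.eye(len(x))   # small jitter for numerical stability
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
print(samples.shape)     # (3, 100): three functions evaluated on the grid
```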
The definitions of k* and K are as before: k* is the vector obtained by evaluating the kernel between x* and all the training points; K is the matrix obtained by evaluating the kernel across all pairs of training points; k** is the kernel evaluated between x* and itself.
So now we have a prior. What do we do? Let's introduce a likelihood, so that we may find a way to get a posterior over these functions.
Remember that when we modelled the labels y in the linear model, we assumed noise with variance σ² around w⊤x. Let's do the same thing with the Gaussian process, but now we centre the likelihood on f. We put a prior on f, we have a likelihood, and then we will be able to find a posterior over f. As before, we say that the likelihood is independent across observations (which doesn't mean that it is completely random):

p(y ∣ f) = ∏_{i=1}^{N} p(yi ∣ fi)

with

p(yi ∣ fi) = 𝒩(yi ∣ fi, σ²)

It's just as before, except that now fi is not computed as w⊤xi: it is simply the fi that we put a Gaussian process prior over.
Likelihood and prior are both Gaussian, hence conjugate: we have a Gaussian prior over f and the likelihood is Gaussian.
And so we can compute the marginal likelihood and the posterior.
We can integrate out the Gaussian process prior over f:

p(y ∣ X) = ∫ p(y ∣ f) p(f ∣ X) df

The marginal likelihood p(y ∣ X) turns out to be pretty simple, because with an integral like that we can just sum the covariances:

p(y ∣ X) = 𝒩(0, K + σ²I)
We can derive the predictive distribution as follows:

p(f* ∣ y, x*, X) = ∫ p(f* ∣ f, x*, X) p(f ∣ y, X) df = 𝒩(m, s²)

with

m = k*⊤ (K + σ²I)⁻¹ y
s² = k** − k*⊤ (K + σ²I)⁻¹ k*
Same expression as in the "Weight-Space View" section.
Let’s use this posterior with the gaussian condition we can get the probability of f* given data.
Again f disappears because we remove the dependence of w because we sum over all possibile
values of w. Here we do the same summing over all possibile value of f weight by how good they
are given data.
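A minimal sketch of these predictive equations, reusing the Gaussian kernel above; the toy data, noise level, and kernel parameters are illustrative assumptions:

```python
import numpy as np

def gauss_kernel(X1, X2, alpha=1.0, beta=10.0):
    return alpha * np.exp(-beta * (X1[:, None] - X2[None, :]) ** 2)

def gp_predict(x_train, y_train, x_star, sigma2, alpha=1.0, beta=10.0):
    """GP regression predictive mean and variance at test inputs x_star."""
    K = gauss_kernel(x_train, x_train, alpha, beta) + sigma2 * np.eye(len(x_train))
    k_star = gauss_kernel(x_train, x_star, alpha, beta)        # N x M
    k_ss = gauss_kernel(x_star, x_star, alpha, beta)           # M x M
    mean = k_star.T @ np.linalg.solve(K, y_train)
    cov = k_ss - k_star.T @ np.linalg.solve(K, k_star)
    return mean, np.diag(cov)

# Toy data: noisy sine.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.shape)
x_star = np.linspace(0, 1, 5)
mean, var = gp_predict(x, y, x_star, sigma2=0.1**2)
print(np.c_[x_star, mean, np.sqrt(var)])
```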
Gaussian Processes Regression example
Some data generated as a noisy version of some function
Draws from the posterior distribution over f on the real line.
Optimization of Gaussian Process parameters
The kernel has parameters that have to be tuned. α and β control the family of functions that can be used:

k(x, x′) = α exp(−β∥x − x′∥²)

and there is also the noise parameter σ²: we can think of it as controlling the variance of the likelihood, and also as a modelling choice. So let's put them all into a parameter vector called θ:

θ = (α, β, σ²)
For simplicity, let us define C = K + σ²I. Maximize the logarithm of the likelihood p(y ∣ X, θ) = 𝒩(0, C), that is:

log p(y ∣ X, θ) = −(1/2) log|C| − (1/2) y⊤C⁻¹y + const
Derivatives can be useful for gradient-based optimization:

∂ log[p(y ∣ X, θ)] / ∂θi = −(1/2) Tr(C⁻¹ ∂C/∂θi) + (1/2) y⊤C⁻¹ (∂C/∂θi) C⁻¹ y
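A minimal sketch of optimizing θ = (α, β, σ²) by maximizing the log marginal likelihood; here a coarse grid search stands in for gradient-based optimization, and the toy data and grid values are assumptions:

```python
import numpy as np

def gp_log_marginal_likelihood(x, y, alpha, beta, sigma2):
    """log N(y | 0, C) with C = alpha*exp(-beta*(x - x')^2) + sigma2*I."""
    C = alpha * np.exp(-beta * (x[:, None] - x[None, :]) ** 2) + sigma2 * np.eye(len(x))
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (logdet + y @ np.linalg.solve(C, y) + len(y) * np.log(2 * np.pi))

# Coarse grid search over theta = (alpha, beta, sigma2).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 25)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.shape)
grid = [(a, b, s) for a in (0.5, 1, 2) for b in (1, 10, 50) for s in (0.001, 0.01, 0.1)]
best = max(grid, key=lambda t: gp_log_marginal_likelihood(x, y, *t))
print(best)
```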
Summary
Introduced Gaussian Processes
- Weight space view
- Function space view
Gaussian processes for regression
Optimization of kernel parameters
To think about:
- Gaussian processes for classification?
- Scalability?
3 - Bayesian Logistic Regression and the Bayesian Classifier
Classification
A set of N objects with attributes (usually a vector) xn.
Each object has an associated response (or label) yn.
Binary classification: yn ∈ {0,1} or yn ∈ {-1,1} (depends on the algorithm).
Multi-class classification: yn ∈ {1,2,... K}.
Probabilistic v non-probabilistic classifiers
Classifier is trained on X = (x1, …, xN)⊤ and y = (y1, …, yN)⊤ and then used to classify x*.
In the same vein as what we saw in regression, what we really want is to end up with an expression of this kind:

P(y* = k ∣ x*, X, y)

So we want the probability that the class membership y* equals k given x* (the new feature vector whose label we want to know) and, obviously, the previous data. We want to use the information from the training data to say something about the label of a new test point.
So we want to estimate the distribution over the class label y* given the information from the training data.
For example, for binary classification, P(y* = 1 ∣ x*, X, y) and P(y* = 0 ∣ x*, X, y).
For non-probabilistic classifiers, instead, we would have something that does not give us a full
probability distribution over the class labels but just something that says whether y is equal to 1
or 0.
But probabilities provide us with more information P(y* = 1) = 0.6 is more useful than y* = 1. It
tells us how sure the algorithm is. Particularly important where cost of misclassification is high
and imbalanced. e.g. Diagnosis: telling a diseased person they are healthy is much worse than
telling a healthy person they are diseased. Extra information (probability) often comes at a cost.
Classification syllabus
We will study two probabilistic classifiers:
• Bayes classifier;
• Logistic regression.
Some data
Squares and circles are the two classes. We want to find a boundary separating the two classes, so that when we get a new data point we know with what probability it is going to be classified.
Logistic regression
Similarly to regression, we could think about modelling P(y* = k ∣ x*, w) through some f(x*; w) with parameters w.
Before, we saw w⊤x, a linear combination of the features. Can we use it here? No: its output is unbounded and so can't be a probability. Also, the probabilities need to sum to 1, and each P(y* = k) has to be between 0 and 1; we cannot have anything greater than 1. So w⊤x really doesn't work. Unless…
We think of something that really squashes these values in order for them to lie in the interval
0-1. So if we apply a transformation h to this linear function then we can get something that is
always between 0 and 1:
P(y* = k ∣ x*, w) = h(f(x*; w)), where h( ⋅ ) squashes f(x*; w) to lie between 0 and 1.
For logistic regression (binary), we use the sigmoid function:

P(y* = 1 ∣ x*, w) = h(w⊤x*) = 1 / (1 + exp(−w⊤x*))

When w⊤x is large and negative, exp(−w⊤x) is huge, so 1/(1 + huge) is close to 0. Moving left, things become close to 0; moving right, they become closer to 1. When w⊤x is 0 (centre), we end up with 1/2.
So for any x* that we give to this function we know that the output is going to be between 0 and
1. This could really be something that models the probability of for example class 1.
Bayesian logistic regression
Recall the Bayesian ideas from two weeks ago…. In theory, if we place a prior on w and define a
likelihood we can obtain a posterior:
p(w ∣ X, y) = p(y ∣ X, w) p(w) / p(y ∣ X)
Then what we can do with this posterior is make prediction so we can take the expectation
under the posterior of the predicted distribution:
P (y* = 1 ∣ x*, X, y) = Ep(w∣X,y) [P (y* = 1 ∣ x*, w)]
Defining a prior
We can define a prior, for example a Gaussian prior:

p(w ∣ s) = ∏_{d=1}^{D} 𝒩(wd ∣ 0, s)
Prior choice is always important from a data analysis point of view. Previously, it was also important 'for the maths'. This isn't the case today: we could choose any prior, since no prior makes the maths easier!
Defining a likelihood
First assume independence:
p(y ∣ X, w) = ∏_{n=1}^{N} p(yn ∣ xn, w)
So knowing what happens for one x doesn't tell us anything about the outcome for another x. Similarly, in the regression case the noise terms on y were all independent.
We have already defined this! If yn = 1:

P(yn = 1 ∣ xn, w) = 1 / (1 + exp(−w⊤xn))

and if yn = 0:

P(yn = 0 ∣ xn, w) = 1 − P(yn = 1 ∣ xn, w)
Posterior
p(w ∣ y, X, s) = p(y ∣ X, w) p(w ∣ s) / p(y ∣ X, s)
Now we have a likelihood, p(y ∣ X, w), and a prior, p(w ∣ s), and so the posterior is also conditioned on s; later we will drop this s because we don't really need it. We could, however, do model selection with s if we could compute the marginal likelihood p(y ∣ X, s). Remember that the denominator gives us a way to do model selection: we can think of s as controlling a continuum of infinitely many models that are all similar but differ in how this variance is tuned.
We can't compute p(w|y, X, s) analytically. Prior is not conjugate to likelihood. No prior is! This
means we don't know the form of p(w|y, X,s), and we can't compute the marginal likelihood:
p(y ∣ X, s) = ∫ p(y ∣ X, w) p(w ∣ s) dw
Because the integral is not analytically available.
What can we compute?
For simplicity, let's drop the dependence on s. We can compute p(y|X, w)p(w); the product is not a problem at all. What we cannot compute is the normalization in the denominator.
Define g(w) = p(y|X, w)p(w) for notation. Armed with this, we have three options:
1. If we can evaluate g(w) we can also optimize it, and so find the most likely value of w, a point estimate;
2. Approximate p(w | y, X) with something easier. The Gaussian is the main character here;
3. Sample from p(w | y, X), even though we cannot normalize it!
We'll cover examples of each of these in turn… These are not the only ways of approximating/sampling! They are also general, not unique to logistic regression.
MAP estimate (Maximum A Posteriori)
Our first method is to find the value of w that maximizes p(w|y, X) (call it ŵ). Since g(w) is proportional to p(w|y, X), ŵ therefore also maximizes g(w). This is similar to maximum likelihood but with the additional effect of the prior.
Once we have ŵ we make predictions with:

P(y* = 1 ∣ x*, ŵ) = 1 / (1 + exp(−ŵ⊤x*))
When we met maximum likelihood, we could find ŵ exactly with some algebra.
We can't do that here (can't solve ∇w log g(w) = 0) but we can resort to numerical
optimization:
1. Guess ŵ ;
2. Change it a bit in a way that increases g(w);
3. Repeat until no further increase is possible.
Many algorithms exist that differ in how they do step 2. e.g. Newton-Raphson. [Not covered in
this course. You just need to know that sometimes we can't do things analytically and there are
methods to help us!].
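A minimal sketch of such a numerical optimization, using plain gradient ascent on log g(w) rather than Newton-Raphson; the prior variance s, step size, and toy data are assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def map_estimate(X, y, s=1.0, lr=0.1, n_steps=2000):
    """MAP estimate for Bayesian logistic regression with prior N(0, s*I),
    found by gradient ascent on log g(w) = log p(y|X,w) + log p(w)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (y - sigmoid(X @ w)) - w / s   # gradient of log-likelihood + log-prior
        w += lr * grad / len(y)
    return w

# Toy binary data with a roughly linear decision boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=200) > 0).astype(float)
w_map = map_estimate(X, y)
print(w_map)
```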
Decision boundary
Once we have ŵ , we can classify new examples. Decision boundary is a useful visualization:
We can now start playing around with this classification rule. What we see here corresponds to P(y* = 1 ∣ x*, ŵ) = 0.5.
Predictive probabilities
But the most interesting thing is looking at the predictive distribution as a whole. What is the set
of points for which my predictive distribution is a certain value between 0 and 1?
Do these boundaries look sensible? Not really! The classifier is quite poor. But we are going to see that by being Bayesian we can bend the contours.
Roadmap
• Find the most likely value of w (a point estimate).
• Approximate p(w|y, X) with something easier.
• Sample from p(w|y, X).
Laplace approximation
Approximating p(w|y, X) with another distribution. i.e. Find a distribution q(w|y, X) which is
similar. What is ‘similar’?
• Mode (highest point) in same place.
• Similar shape?
• Might as well choose something that is easy to manipulate!
Approximate p(w|y, X, s) with a Gaussian:
q(w ∣ y, X) = 𝒩(μ, Σ)

where

μ = ŵ,  Σ⁻¹ = −∇w∇w log[g(w)] |_{w=ŵ}

and

ŵ = argmax_w log[g(w)]
We already know ŵ because it is the maximum a posteriori.
What is the justification for this obscure expression for the covariance of q? It is based on a Taylor expansion of log[g(w)] around the mode ŵ. This means the approximation will be best at the mode. The expansion up to 2nd-order terms 'looks' like a Gaussian.
Laplace approximation 1D example
Laplace approximation of the Gamma density function:

p(y ∣ α, β) ∝ y^{α−1} exp(−βy)

Setting the derivative of the log density to zero gives the mode:

ŷ = (α − 1)/β

The second derivative of the log density is

∂² log p(y ∣ α, β) / ∂y² = −(α − 1)/y²

and evaluated at the mode:

∂² log p(y ∣ α, β) / ∂y² |_{ŷ} = −(α − 1)/ŷ²

so the approximation is

q(y ∣ α, β) = 𝒩((α − 1)/β, ŷ²/(α − 1))
Solid: true density. Dashed: approximation.
Left: α = 20, β = 0.5.
Right: α = 2, β = 100.
Approximation is best when density looks like a Gaussian (left).
Approximation deteriorates as we move away from the mode (both).
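A minimal sketch reproducing this comparison numerically (SciPy assumed for the Gamma and Gaussian densities; the evaluation points are arbitrary):

```python
import numpy as np
from scipy.stats import gamma, norm

def laplace_approx_gamma(alpha, beta):
    """Mode and variance of the Laplace (Gaussian) approximation to Gamma(alpha, beta)."""
    y_hat = (alpha - 1) / beta                 # mode of the Gamma density
    var = y_hat**2 / (alpha - 1)               # -1 / (second derivative of log density at the mode)
    return y_hat, var

for a, b in [(20, 0.5), (2, 100)]:
    y_hat, var = laplace_approx_gamma(a, b)
    ys = np.linspace(0.5 * y_hat, 1.5 * y_hat, 5)
    true = gamma.pdf(ys, a, scale=1 / b)       # true Gamma density
    approx = norm.pdf(ys, y_hat, np.sqrt(var)) # Laplace approximation
    print(f"alpha={a}, beta={b}:", np.round(true, 3), np.round(approx, 3))
```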
Laplace approximation for logistic regression
• Not going into the details here;
• p(w ∣ y, X ) ≈ q(w ∣ y, X ) = 𝒩(w ∣ μ, Σ);
• Find μ = ŵ (that maximizes g(w)) by Newton-Raphson (already done it MAP).
• Find:
Σ⁻¹ = −∇w∇w log[g(w)] |_{ŵ}
• How good an approximation is it?
Black - approximation. Grey - p(w|y,X). Approximation is OK. As expected, it gets worse as we
move away from the mode.
Predictions with the Laplace approximation
We have 𝒩(μ, Σ) as an approximation to p(w|y, X).
Can we use it to make predictions?
Need to evaluate:

P(y* = 1 ∣ x*, X, y) = E_{𝒩(μ,Σ)}[P(y* = 1 ∣ x*, w)] = ∫ (1 / (1 + exp(−w⊤x*))) 𝒩(μ, Σ) dw

Cannot do this! So, what was the point? Sampling from 𝒩(μ, Σ) is easy, and we can approximate an expectation with samples!
Draw S samples w1, …, wS from 𝒩(μ, Σ):

E_{𝒩(μ,Σ)}[P(y* = 1 ∣ x*, w)] ≈ (1/S) Σ_{s=1}^{S} 1 / (1 + exp(−ws⊤x*))
Contours of P(y* = 1|x*, y, X).
Are these better than those from the point prediction? Yes, because in that case we used a single decision boundary, a single value of w. It's still not perfect.
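A minimal sketch of this sample-based prediction; here μ and Σ are made-up values standing in for the output of a Laplace approximation:

```python
import numpy as np

def predict_laplace(x_star, mu, Sigma, S=5000, seed=0):
    """Approximate E_{N(mu,Sigma)}[sigmoid(w^T x_star)] with S posterior samples."""
    rng = np.random.default_rng(seed)
    W = rng.multivariate_normal(mu, Sigma, size=S)          # S x D samples of w
    return np.mean(1.0 / (1.0 + np.exp(-W @ x_star)))

# Illustrative (made-up) Laplace parameters for a 2-D problem.
mu = np.array([1.2, -0.7])
Sigma = np.array([[0.30, 0.05], [0.05, 0.20]])
print(predict_laplace(np.array([0.5, 1.0]), mu, Sigma))
```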
Summary roadmap
• Defined a squashing function that meant we could model P(y* = 1 ∣ x*, w) = h(w⊤x*).
• Wanted to make 'Bayesian predictions': average over all posterior values of w.
• Couldn't do it exactly.
• Tried a point estimate (MAP) and an approximate distribution (via Laplace).
• Laplace probability contours looked more sensible (to me at least!)
• Next:
• Find the most likely value of w (a point estimate).
• Approximate p(w|y, X) with something easier.
• Sample from p(w|y, X).
MCMC sampling
Laplace approximation still didn't let us exactly evaluate the expectation we need for
predictions. Good news! If we're happy to sample, we can sample directly from p(w|y, X) even
though we can't compute it! i.e. don't need to use an approximation like Laplace.
Various algorithms exist; we'll use Metropolis-Hastings.
Back to the script: Metropolis-Hastings
Produces a sequence of samples w1, w2, …, ws, …
Imagine we've just produced ws−1.
MH first proposes a possible ws (call it w̃s) based on ws−1.
MH then decides whether or not to accept w̃s:
• If accepted, ws = w̃s
• If not, ws = ws-1.
Two distinct steps: proposal and acceptance.
MH proposal
Treat w̃s as a random variable conditioned on ws−1, i.e. we need to define p(w̃s | ws−1).
Note that this does not necessarily have to be similar to the posterior we're trying to sample from.
We can choose whatever we like, e.g. a Gaussian centered on ws−1 with some covariance:
p(w̃s ∣ ws−1, Σp) = 𝒩(w̃s ∣ ws−1, Σp)
MH acceptance
The choice of acceptance is based on the following ratio: the posterior at the point we are proposing
to move to, over the posterior at the point where we currently are. So the acceptance rate will be
higher when we move to a point with higher posterior density than when we move to a point where the
posterior density is lower.
r = [p(w̃s ∣ y, X) p(ws−1 ∣ w̃s, Σp)] / [p(ws−1 ∣ y, X) p(w̃s ∣ ws−1, Σp)]
So r is greater than 1 when we move to a point where the posterior density improves, and less than 1
when the posterior density decreases.
Which simplifies to (all of which we can compute):
r = [g(w̃s; y, X) p(ws−1 ∣ w̃s, Σp)] / [g(ws−1; y, X) p(w̃s ∣ ws−1, Σp)]
What does this ratio tell us? It tells us with what probability we should accept the move.
Whenever we move to a point where the density improves we always accept; but if we rejected every
downward move, this would just be optimization, and the algorithm would end up at a local optimum.
What we want instead is samples from the posterior, and for that to happen we have to allow moves to
values of the posterior which are lower. That is the reason for the probabilistic acceptance of the
move: when we go downwards we still accept with the probability given by the ratio. This is the key of
the algorithm; it is what makes MH converge and give us, overall, samples from the true posterior
distribution.
There is also a second term, which measures the "opposite" move: going from w̃s back to ws−1, compared
with going from ws−1 to w̃s. The reason is that under the proposal it could be more likely to go from B
to A than from A to B, and we have to account for that. That's why the second term is necessary.
We now use the following rules:
If r ≥ 1, accept: ws =w̃s .
If r < 1, accept with probability r.
If we do this enough, we'll eventually be sampling from p(w|y, X), no matter where we started!
i.e. for any w1.
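A minimal sketch of a random-walk Metropolis-Hastings sampler for this posterior. Because the Gaussian proposal is symmetric, the proposal terms cancel in r; the Gaussian prior with variance sigma2, the step size, and the number of samples are assumptions, not values given in the notes.

import numpy as np

def log_g(w, X, y, sigma2=1.0):
    """log of g(w) = p(y | X, w) p(w), up to an additive constant (logistic likelihood)."""
    z = X @ w
    return np.sum(y * z - np.logaddexp(0.0, z)) - 0.5 * np.dot(w, w) / sigma2

def metropolis_hastings(X, y, n_samples=5000, step=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    D = X.shape[1]
    w = np.zeros(D)                                   # arbitrary starting point w1
    samples = np.empty((n_samples, D))
    for s in range(n_samples):
        w_tilde = w + step * rng.standard_normal(D)   # symmetric Gaussian proposal
        log_r = log_g(w_tilde, X, y) - log_g(w, X, y) # proposal terms cancel for symmetric proposals
        if np.log(rng.random()) < log_r:              # accept with probability min(1, r)
            w = w_tilde
        samples[s] = w                                # if rejected, repeat the current w
    return samples

No burn-in or thinning is applied here; a real run would usually discard the earliest samples.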
MH flowchart
MH walkthrough
What do the samples look like?
Predictions with MH
MH provides us with a set of samples w1, …, wS.
These can be used like the samples from the Laplace approximation.
Summary
• Introduced logistic regression: a probabilistic binary classifier.
• Saw that we couldn't compute the posterior.
• Introduced examples of three alternatives:
• Point estimate: the MAP solution.
• Approximate the density: Laplace.
• Sample: Metropolis-Hastings.
• Each is better than the last (in terms of predictions)…
• …but each has greater complexity!
• To think about:
• What if posterior is multi-modal?
05 - Bayesian Classifier
08/04/2022
3.1 - Bayesian Classifier
Bayes classifier
The Bayesian classifier uses Bayes rule as follows:
P(y* = k ∣ X, y, x*) = p(x* ∣ y* = k, X, y) P(y* = k) / ∑j p(x* ∣ y* = j, X, y) P(y* = j)
We need to define a likelihood and a prior and we're done! What is this p(x*|y*)? It is the
"opposite" of before: instead of modeling y we are modeling x! So we are looking at the
distribution of the inputs for a given class.
Bayes classifier likelihood
p (x* ∣ y* = k, X, y) = p (x* ∣ y* = k, θ(X, y))
How likely is x* if it is in class k? (not necessarily a probability…)
Free to define this class-conditional distribution as we like. Will depend on data type:
• D-dimensional vectors of real values: Gaussian likelihood.
• Number of heads in N coin tosses: Binomial likelihood.
Training data with y = k used to determine parameters of likelihood for class k (e.g. Gaussian
mean and covariance).
The parameters θ encode information from data X and y.
Bayes classifier prior
P(y* = k)
Used to specify prior probabilities for different classes. e.g:
• There are far fewer instances of class 0 than class 1: P(y* = 1) > P(y* = 0).
• No prior preference: P(y* = 0) = P(y*=1).
• Class 0 is very rare: P(y*= 0) << P(y* = 1).
Naive-Bayes
Naive-Bayes makes the following additional likelihood assumption:
The components of x* are independent for a particular class:
p(x* ∣ y* = k, θ) = ∏_{d=1}^{D} p((x*)d ∣ y* = k, θ)
Where D is the number of dimensions and (x*)d is the value of the dth component.
Often used when D is high:
Fitting D uni-variate distributions is easier than fitting one D-dimensional one.
Bayes classifier, example 1
Each object has two attributes: x = [x1, x2]⊤.
K = 3 classes;
We'll use Gaussian class-conditional distributions (with Naive-Bayes assumption).
P(y* = k) = 1/K (uniform prior).
Step 1: fitting the class-conditional densities
p(x ∣ y = k, X, y) = p(x ∣ y = k, θ) = ∏_{d=1}^{2} 𝒩(xd ∣ μkd, σkd²)
μkd = (1/Nk) ∑_{n: yn = k} xnd
σkd² = (1/Nk) ∑_{n: yn = k} (xnd − μkd)²
What we need to do here is to estimate p(x|y = k), starting with k = the red class. The information
from the dataset X and y is encapsulated in the parameter θ, which contains the mean and the variance
for each of the two components. The mean and the variance are estimated by maximum likelihood, if we
want.
Keeping things simple, we call this a Bayesian classifier only in the sense that we use Bayes' theorem
to express p(y|x) in terms of p(x|y).
Once we have estimated one mean vector and one diagonal covariance for each class (green, red, blue),
we are ready to make our prediction:
It's going to be a product of Gaussians evaluated for the various dimensions, so we can calculate
this density for each of the three classes:
p(x* ∣ y* = k, X, y) = ∏_{d=1}^{D} 𝒩((x*)d ∣ μkd, σkd²)
Compute predictions
Once we have done this, we normalize the densities calculated for all the classes, so that the
probability that the point belongs to a class is just the ratio of that class's density to the total.
Remember that we assumed P(y* = k) = 1/K.
P(y* = k ∣ x*, θ) = p(x* ∣ y* = k, θ) P(y* = k) / ∑j p(x* ∣ y* = j, θ) P(y* = j)
Remember that here we are not really completely Bayesian.
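A minimal NumPy sketch of this Gaussian naive-Bayes classifier with maximum-likelihood fitting and a uniform prior over the K classes; the array shapes and function names are assumptions.

import numpy as np

def fit_gaussian_nb(X, y, K):
    """Per-class, per-dimension means and variances (maximum likelihood)."""
    mu = np.stack([X[y == k].mean(axis=0) for k in range(K)])   # shape (K, D)
    var = np.stack([X[y == k].var(axis=0) for k in range(K)])   # shape (K, D)
    return mu, var

def predict_gaussian_nb(x_star, mu, var):
    """P(y* = k | x*) with a uniform prior P(y* = k) = 1/K."""
    log_dens = -0.5 * np.sum(np.log(2 * np.pi * var)
                             + (x_star - mu) ** 2 / var, axis=1)   # shape (K,)
    dens = np.exp(log_dens - log_dens.max())                       # stable normalization
    return dens / dens.sum()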
Bayes classifier, example 2
Data are number of heads in 20 tosses (repeated 50 times for each) from one of two coins:
Coin 1 (yn = 0): xn = 4, 7, 7, 7, 4, …
Coin 2 (yn = 1): xn = 18, 16, 18, 14, 17, …
Use binomial class-conditional densities:
p(xn ∣ yn = k, rk) = (20 choose xn) rk^(xn) (1 − rk)^(20 − xn)
where θ = {rk} (one per coin) and rk is the probability that coin k lands heads on any particular toss.
Problem: predict the coin, y* given a new count, x*.
(Again assume P(y* = k) = 1/K)
Fit the class conditionals…
Fitting is just finding rk:
rk = (1/(20 Nk)) ∑_{n: yn = k} xn
r0 = 0.287, r1 = 0.706.
Compute predictions
P(y* = k ∣ x*, θ) = p(x* ∣ y* = k, θ) P(y* = k) / ∑j p(x* ∣ y* = j, θ) P(y* = j)
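A minimal numerical sketch of the prediction step for this coin example, taking the fitted values r0 = 0.287 and r1 = 0.706 from the notes as given; the example count x* = 9 is only an illustration.

import numpy as np
from scipy.stats import binom

r = np.array([0.287, 0.706])          # fitted head probabilities for the two coins
def predict_coin(x_star, n_tosses=20):
    """P(y* = k | x*) with a uniform prior over the two coins."""
    dens = binom.pmf(x_star, n_tosses, r)   # class-conditional likelihoods
    return dens / dens.sum()

print(predict_coin(9))   # e.g. probability of each coin given 9 heads in 20 tosses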
Bayes classifier summary
• Decision rule based on Bayes rule.
• Choose and fit class conditional densities.
• Decide on prior.
• Compute predictive probabilities.
• Naive-Bayes:
• Assume that the dimensions of x are independent within a particular class.
• Our Gaussian used the Naive Bayes assumption (could have written p(x|y = k,...) as product
of two independent Gaussians).
3.2 Performance Evaluation
Performance evaluation
How do we choose a classifier? Which algorithm? Which parameters?
We need performance indicators.
We'll cover:
• 0/1 loss.
• ROC analysis (sensitivity and specificity)
• Confusion matrices
0/1 Loss
0/1 loss: proportion of times classifier is wrong. Consider a set of predictions y1,..., yN and a set
of true labels y*1,…,y*N. Mean loss is defined as:
(1/N) ∑_{n=1}^{N} δ(yn ≠ y*n)
(δ(a) is 1 if a is true and 0 otherwise)
Advantages:
• Can do binary or multiclass classification.
• Simple to compute.
• Single value.
Disadvantage: Doesn't take into account class imbalance:
• We're building a classifier to detect a rare disease.
• Assume only 1% of population is diseased.
• Diseased: y = 1
• Healthy: y = 0
• What if we always predict healthy? (y = 0)
• Accuracy 99% but classifier is rubbish!
Sensitivity and specificity:
We'll stick with our disease example. Need to define 4 quantities. The numbers of:
1. True positives (TP): the number of objects with y*n = 1 that are classified as yn = 1 (diseased
people diagnosed as diseased).
2. True negatives (TN): the number of objects with y*n = 0 that are classified as yn = 0 (healthy
people diagnosed as healthy).
3. False positives (FP): the number of objects with y*n = 0 that are classified as yn = 1 (healthy
people diagnosed as diseased).
4. False negatives (FN): the number of objects with y*n = 1 that are classified as yn = 0 (diseased
people diagnosed as healthy).
We can now define the sensitivity:
Se = TP / (TP + FN)
The proportion of diseased people that we classify as diseased. The higher the better. In our
example, Se = 0.
But there is also another actor, the specificity:
Sp = TN / (TN + FP)
The proportion of healthy people that we classify as healthy. The higher the better. In our
example, Sp = 1.
We would like both to be as high as possible. Often increasing one will decrease the other.
Balance will depend on application:
e.g. diagnosis:
We can probably tolerate a decrease in specificity (healthy people diagnosed as diseased)....
if it gives us an increase in sensitivity (getting diseased people right).
How can we find the right spot between sensitivity and specificity?
ROC Analysis
One way to do this is to choose the point where the classifier is maximally uncertain and use it as
a threshold, for example 0.5. So classification rules involve setting a threshold, and for a
probabilistic classifier the natural choice is the point where
p(y* = 1 ∣ x*, y, X) = 0.5
However, we could use any threshold we like…. The Receiver Operating Characteristic (ROC)
curve shows how Se and 1 − Sp vary as the threshold changes.
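A minimal NumPy sketch of this threshold sweep, returning (1 − Sp, Se) pairs and a trapezoidal AUC; the function names and the exact handling of ties are assumptions.

import numpy as np

def roc_curve(y_true, p_pred):
    """Sensitivity and 1 - specificity for every threshold taken from p_pred."""
    thresholds = np.sort(np.unique(p_pred))[::-1]
    se, one_minus_sp = [], []
    for t in thresholds:
        y_hat = (p_pred >= t).astype(int)
        tp = np.sum((y_hat == 1) & (y_true == 1))
        fn = np.sum((y_hat == 0) & (y_true == 1))
        fp = np.sum((y_hat == 1) & (y_true == 0))
        tn = np.sum((y_hat == 0) & (y_true == 0))
        se.append(tp / (tp + fn))
        one_minus_sp.append(fp / (fp + tn))
    return np.array(one_minus_sp), np.array(se)

def auc(y_true, p_pred):
    x, y = roc_curve(y_true, p_pred)
    # prepend (0, 0) and append (1, 1) so the trapezoids cover the whole curve
    x = np.concatenate(([0.0], x, [1.0]))
    y = np.concatenate(([0.0], y, [1.0]))
    return np.trapz(y, x)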
ROC curve
As we move the threshold, our sensitivity and specificity change. In the bottom-left corner everything
is classified as 0 (sensitivity = 0 and specificity = 1); in the top-right corner everything is
classified as 1.
Goal: get the curve to the top-left corner, which corresponds to perfect classification (Se = 1,
Sp = 1). So we want the curve to be as close as possible to the top-left corner.
So a better classifier could be this one:
It reaches a sensitivity of 1 quite early, at a specificity above 0.8.
An even better one is this one:
AUC
We can quantify performance by computing the Area Under the ROC Curve (AUC). The higher this
value, the better. For the three classifiers we saw before we have:
First: AUC=0.8348
Second: AUC= 0.9551
Third: AUC=0.9936
AUC is generally a safer measure than 0/1 loss.
Confusion matrices
The quantities we used to compute Se and Sp can be neatly summarized in a table:
We want the values on the diagonal to be as large as possible, and ideally we would like 0 false
positives and 0 false negatives.
This is known as a confusion matrix. It is particularly useful for multi-class classification: it
tells us where the mistakes are being made. Note that normalizing the columns gives us Se and Sp.
Confusion matrices example
• 20 newsgroups data.
• Thousands of documents from 20 classes (newsgroups)
• Use a Naive Bayes classifier (~ 50000 dimensions (words)!)
• Details in book Chapter.
• ~ 7000 independent test documents.
• Summarise results in 20 x 20 confusion matrix:
Here we can see that we have large numbers on the diagonal, which is good, so our classifier works.
But there are also some mixed results: for example, we sometimes predict class 17 when it is actually
class 19. So the algorithm is getting 'confused' between classes 20 and 16, and between 19 and 17.
• 17: talk.politics.guns
• 19: talk.politics.misc
• 16: talk.religion.misc
• 20: soc.religion.christian
Maybe these should be just one class? Maybe we need more data in these classes?
Summary
Introduced two different performance measures:
• 0/1 loss
• ROC/AUC
Introduced confusion matrices: a way of assessing the performance of a multi-class classifier.
4 - Variational Inference
15/04/2022
Where are we?
We did regression with a Gaussian prior and a linear model (Gaussian likelihood), and we were able to
get an analytical solution for the posterior.
Then we moved to classification. We still took a Gaussian prior on the parameters, but now we have a
non-linear model, because we squash the output of the linear function with a sigmoid function or a
cumulative Gaussian. So the posterior is no longer tractable and we need an approximation.
The solutions we saw were:
• Sample from the intractable posterior: Markov-Chain Monte-Carlo (random walk with
Metropolis-Hastings). Remember that the random walk consists of randomly walking through the parameter
space, with a mechanism to either accept or reject each move. If we only accepted moves that improve
the posterior it would be optimization; the randomness means that we also accept lower values of the
posterior.
• Approximate the intractable posterior: collapse the posterior onto the most likely value
(Maximum-a-Posteriori, or MAP). This is not really being Bayesian; it is more like maximum likelihood
with some regularization from the prior. We also saw the Laplace approximation (2nd-order Taylor
expansion around the MAP): the best Gaussian approximation we can find by looking at the mode and
matching the curvature there.
Now we are going to see how to use variational inference for this approximation.
Refresh: Kullback-Leibler divergence
The main ingredient of variational inference is the Kullback-Leibler (KL) divergence. The KL
divergence is a measure of "similarity" between probability distributions.
So if we take q and p, where p is our posterior we want to approximate and q is our
approximating distribution, the way KL is computed is:
KL[q(z) ∥ p(z)] = ∫ q(z) log(q(z)/p(z)) dz = E_{q(z)}[log(q(z)/p(z))]
It’s some sort of expectation under q of the log of the ratio.
If q is equal to p then the ratio is 1, the log of 1 is 0, so the expectation is 0: the KL is 0 when
q = p. When they are different we get something different from 0.
The Kullback-Leibler divergence satisfies a positivity constraint: it is never negative, which is why
we talk about dissimilarity.
Also, it's not symmetric, so we get different values for KL(q‖p) and KL(p‖q).
KL divergence between two Gaussians:
KL[𝒩(μ1, σ12)
∥
𝒩(μ2, σ22)]
σ22
σ12 (μ1 − μ2)
1
=
log
−1+ 2 +
2
σ2
σ22
( σ12 )
2
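A minimal sketch that checks this closed form against a Monte Carlo estimate of the same KL divergence; the particular parameter values and sample size are arbitrary.

import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """KL[N(mu1, s1^2) || N(mu2, s2^2)] in closed form."""
    return 0.5 * (np.log(s2**2 / s1**2) - 1 + s1**2 / s2**2 + (mu1 - mu2)**2 / s2**2)

rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
z = rng.normal(mu1, s1, size=200_000)                     # samples from q
log_q = -0.5 * np.log(2 * np.pi * s1**2) - (z - mu1)**2 / (2 * s1**2)
log_p = -0.5 * np.log(2 * np.pi * s2**2) - (z - mu2)**2 / (2 * s2**2)
print(kl_gauss(mu1, s1, mu2, s2), np.mean(log_q - log_p))  # the two values should agree closely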
We talk about a divergence because it's not symmetric and doesn't even satisfy the triangle
inequality, so it's not a distance. But for many purposes it's good enough: it's a measure of how
dissimilar two distributions are, and that's enough to use it as an objective we want to reduce.
Ideally, if we could optimize q, we could bring this divergence to 0 and so solve the problem of
Bayesian inference.
Logistic Regression as a working example
We will use logistic regression as example.
Likelihood:
p(y ∣ X, w) = ∏_{n=1}^{N} p(yn ∣ xn, w)
If yn = 1:
P(yn = 1 ∣ xn, w) = 1 / (1 + exp(−w⊤xn))
And if yn = 0:
P(yn = 0 ∣ xn, w) = 1 − P(yn = 1 ∣ xn, w)
Inference
Sometimes the likelihood is also defined as:
p(yn ∣ xn, w) = Ber(yn ∣ pn),  with  pn = 1 / (1 + exp(−w⊤xn))
Using Bayes' theorem:
p(w ∣ y, X) = p(y ∣ X, w) p(w) / p(y ∣ X)
There is no prior which is conjugate to the likelihood. This means we don't know the form of
p(w|y,X).
So we can't compute the marginal likelihood because it’s an integral that we can’t compute
analytically:
p(y ∣ X) = ∫ p(y ∣ X, w) p(w) dw
and we can't compute the predictive distribution:
p(y* ∣ x*, y, X) = ∫ p(y* ∣ x*, w) p(w ∣ y, X) dw
Variational Inference
Main idea: instead of trying to solve intractable integrals, let's solve an optimization problem.
We try to optimize the position and the shape of our approximation so that it matches the posterior as
closely as possible: we move the mean and change the covariance to get as close as possible to the
posterior.
A very general recipe:
1. Introduce a set Q of distributions q(w);
2. Define an objective which measures the "distance" between an arbitrary distribution q(w) ∈
Q and p(w | y, X). We talked about KL and that’s what we are going to use;
3. In the set of possible solutions Q, find the best q(w) that minimizes the "distance" to p(w|y,
X). Q can be anything and depending on the family of Q we use we have different parameters
to tune;
4. Interpret q(w) as a distribution that approximates the intractable p(w|y,X).
Visual illustration of Variational Inference
We have some posterior p(w|y,X) that we want to approximate. We choose a set of approximating
distributions Q, which is a subset of the set of all possible distributions, and we try to minimize
the distance between q(w) and the posterior. The optimization moves our distribution within the space
of possible distributions, getting as close as possible to p(w|y,X). If the posterior does not have
the same form that we chose for q, we can get close but will not be able to drive this distance to 0.
Variational Inference Form of the approximation
What form should q(w) have? We will work a lot with this mean-field approach, that imposes
independent distributions for each component of w:
q(w) = ∏_{i=0}^{D−1} q(wi)
i=0
This is a fancy way of saying something simple: we factorize the distribution across the various
dimensions of our parameter space, which means that we will not be able to capture covariances in our
posterior.
For simplicity, we are going to work with Gaussian distributions.
q(w) = ∏_{i=0}^{D−1} q(wi) = ∏_{i=0}^{D−1} 𝒩(wi ∣ μi, σi²)
μi and σi² are parameters we optimize in order to change the position and the shape of our q. They are
called variational parameters (for notation, they are collected into ν). We are going to have 2D of
them, because we have a mean and a variance for each parameter. So we double the number of things we
need to optimize: before, if we wanted to optimize w directly (so not be Bayesian), we had D parameters
to optimize; now, because we want to be Bayesian with this simple approximation, we have to optimize
twice as many parameters, the means and the variances.
Now, in order to find the best distribution of q(w) we are going to find the best variational
parameters ν.
Variational Inference Objective
How do we define a "distance" between q(w;ν) and the posterior p(w|y, X)?
We will use the KL divergence to measure the "distance" between the two distributions.
KL[q(w; ν) ∥ p(w ∣ y, X)] = ∫ q(w; ν) log( q(w; ν) / p(w ∣ y, X) ) dw = E_{q(w;ν)}[ log( q(w; ν) / p(w ∣ y, X) ) ]
This is problematic because in this expression there is something we are not able to deal with: the
posterior, which we don't know how to write analytically. So the difficulty is that p(w|y, X) is
intractable! But we can do something clever.
Start with rewriting things:
KL[q(w; ν) ∥ p(w ∣ y, X)] = E_{q(w;ν)}[ log( q(w; ν) / p(w ∣ y, X) ) ]
A tractable objective to optimize q(w; ν) is obtained by manipulating the KL divergence.
We can rewrite:
KL[q(w; ν) ∥ p(w ∣ y, X)] = E_{q(w;ν)}[log q(w; ν)] − E_{q(w;ν)}[log p(w ∣ y, X)]
We can split the KL into an entropy and a cross-entropy term (the two terms separated by the minus).
The second term is the problematic one, because the posterior is intractable. But what we are going to
do is the following.
Focusing on the intractable term
E_{q(w;ν)}[log p(w ∣ y, X)] = ∫ q(w; ν) log p(w ∣ y, X) dw
We can expand the posterior:
∫ q(w; ν) log p(w ∣ y, X) dw = ∫ q(w; ν) log[ p(y ∣ X, w) p(w) / p(y ∣ X) ] dw
Obtaining:
E_{q(w;ν)}[log p(y ∣ X, w)] + E_{q(w;ν)}[log p(w)] − log p(y ∣ X)
Putting everything together in the original KL objective:
KL[q(w; ν) ∥ p(w ∣ y, X)] = E_{q(w;ν)}[log q(w; ν)] − E_{q(w;ν)}[log p(w ∣ y, X)]
= E_{q(w;ν)}[log q(w; ν)] − E_{q(w;ν)}[log p(y ∣ X, w)] − E_{q(w;ν)}[log p(w)] + log p(y ∣ X)
Rearranging:
KL[q(w; ν) ∥ p(w ∣ y, X)] = −E_{q(w;ν)}[log p(y ∣ X, w)] + KL[q(w; ν) ∥ p(w)] + log p(y ∣ X)
This is an important equation for variational inference. Because now we can somehow rearrange
again. Manipulating the previous expression:
log p(y ∣ X) − KL[q(w; ν) ∥ p(w ∣ y, X)] = E_{q(w;ν)}[log p(y ∣ X, w)] − KL[q(w; ν) ∥ p(w)]
The right-hand side is computable, so we can use it as an objective! We call it ℒELBO.
What I'm doing here is saying that something I can't compute is related to something I can compute,
and the gap between the two is the KL. If q = p then this gap is 0, and it is always ≥ 0. So there is
a gap between what I can compute and what I cannot compute, which is due to this KL. But when the KL
is 0, i.e. when I manage to get q = p, the thing I can compute is exactly the marginal likelihood, or
the "evidence". This is why we call this expression ℒELBO (Evidence Lower BOund).
So if I push this KL to 0, what I'm doing is increasing this lower bound as much as possible.
Minimizing the KL is equivalent to maximizing ℒELBO. This is very important, because it says that if I
optimize and can make q = p then I solve the problem: I get exactly the value of the marginal
likelihood (the evidence). This is the key: rearranging the expression of the KL between q and p gives
us a criterion which is computable. The more I maximize it with respect to q, the closer I get to
log p(y|X). Eventually, if q = p, the KL is 0 and I recover log p(y|X) exactly.
So now the problem becomes: optimize this expression with respect to q.
ℒELBO = Eq(w;ν)[logp(y ∣ X, w)] − KL[q(w; ν) ∥ p(w)]
First term is a model fitting term:
E_{q(w;ν)}[log p(y ∣ X, w)]
The higher it is, the better the parameters drawn from q(w;ν) are at modeling the labels. When q is
actually close to the posterior, our samples will look good and so our average log likelihood will be
good. This is really a model-fitting term.
The second term is a regularization term:
−KL[q(w; ν) ∥ p(w)]
which penalizes q(w;ν) for deviating too much from the prior p(w). If we want to fit our data nicely
we should maximize the first term with respect to q, but at the same time we can't go too far from the
prior p, otherwise we'll be penalized. In that way we have model fitting plus regularization, and
everywhere in machine learning where we have these two together we are protected from overfitting.
What would be the best q for maximizing the expected log likelihood alone? A Gaussian with a very
narrow variance (actually zero variance), a big spike on the maximum-likelihood value; we would fall
back to maximum likelihood. The regularization term is nice precisely because it prevents this from
happening: if we did that, we would become so different from the prior that we would be penalized.
It turns out that if we try this on a neural network it doesn't work, because we sum over millions of
parameters. If we have many more parameters than data, the KL term dominates and we get q = p (the
prior), which is not very useful.
But going back to the main problem: How to compute the objective? How to optimize it?
ℒELBO = Eq(w;ν)[logp(y ∣ X, w)] − KL[q(w; ν) ∥ p(w)]
Recall the assumption on q(w;ν)
q(w; ν) = ∏_i q(wi) = ∏_i 𝒩(wi ∣ μi, σi²)
Optimizing with respect to q(w;ν) means optimizing with respect to μi and σi².
The second term, −KL[q(w; ν)∥p(w)] , can be expressed analytically by using the expression
of the KL divergence between Gaussians.
In our case, if p(w) = ∏_i 𝒩(wi ∣ 0, s²) we obtain:
KL[q(w; ν) ∥ p(w)] = (1/2) ∑_i [ log(s²/σi²) − 1 + σi²/s² + μi²/s² ]
So in the computation of the variational objective the KL is the easy part. Now the difficult part
is the expectation. How do we optimize the expectation wrt q? This is an integral:
E_{q(w;ν)}[log p(y ∣ X, w)] = ∫ log p(y ∣ X, w) q(w; ν) dw
Computing this analytically is possible only if, for example, q is Gaussian and log p is quadratic
(i.e. Gaussian). But that's not why we use variational inference! We use variational inference in
cases where log p is something messy and q is Gaussian, so we can't do this integral analytically.
But… we can do Monte Carlo! We approximate the expectation with samples:
E_{q(w;ν)}[log p(y ∣ X, w)] ≈ (1/N_MC) ∑_{h=1}^{N_MC} log p(y ∣ X, w̃(h)),   w̃(h) ∼ q(w; ν)
where w̃(h) is sampled from q.
Remember: this estimator is unbiased and its variance shrinks as 1/N_MC (N_MC = number of Monte Carlo
samples), independently of the dimensionality.
Written like this, ν seems to disappear: with q(w; ν) fixed, every time we resample w from q(w;ν) we
obtain a different value! So what do we do now? How can we make gradient updates to the μi, σi²
parameters of q(w;ν)?
Answer: freeze the randomness within Monte Carlo (the Reparameterization Trick)!
Variational Inference: Reparameterization trick
Idea: Samples of w can be obtained by a deterministic transformation f of a random variable
ϵ ∼ p(ϵ), such that p(ϵ) has no tunable parameters.
The variational parameters ν are parameters of the function f. If we take samples from p(ϵ) and pass
them through f, we get samples from q, and we can use these to compute log p(y|X,w). We then use the
chain rule of differentiation to push the gradient through this function f.
For example, with q (wi) = 𝒩 (wi ∣ μi, σi2) we have:
ε ∼ 𝒩(0, 1)
wi = f(ε; ν) = μi + εσi
We can prove that by building wi in this way, we recover the original q(wi).
A key observation: we can turn the gradient of an expectation into an expectation of a gradient:
∇ν E_{q(w;ν)}[log p(y ∣ X, w)] = ∇ν E_{p(ε)}[ log p(y ∣ X, w) |_{w=f(ε;ν)} ]
Variational Inference Reparameterization trick (Derivation)
∇ν E_{q(w;ν)}[log p(y ∣ X, w)] = ∇ν E_{p(ε)}[ log p(y ∣ X, w) |_{w=f(ε;ν)} ]
= E_{p(ε)}[ ∇ν log p(y ∣ X, w) |_{w=f(ε;ν)} ]
= E_{p(ε)}[ ∇w log p(y ∣ X, w) |_{w=f(ε;ν)} ∇ν f(ε; ν) ]
≈ (1/N_MC) ∑_{h=1}^{N_MC} ∇w log p(y ∣ X, w) |_{w=f(ε̃(h);ν)} ∇ν f(ε̃(h); ν),   ε̃(h) ∼ p(ε)
This looks awful!
Good news: if we use any auto-diff tool (PyTorch, TensorFlow, JAX, NumPyro, etc.), we will never
compute these gradients manually. All we need to do is compute the objective; all we need to know is
that the samples w are constructed in a deterministic way from the mean and variance of q.
Variational Inference Reparameterization trick (Properties)
∇ν E_{q(w;ν)}[log p(y ∣ X, w)] ≈ (1/N_MC) ∑_{h=1}^{N_MC} ∇w log p(y ∣ X, w) |_{w=f(ε̃(h);ν)} ∇ν f(ε̃(h); ν),   ε̃(h) ∼ p(ε)
• Estimation of the gradients is unbiased
• The likelihood p(y|X, w) must be differentiable
• Transformation f must be differentiable
• Need to be able to sample from p(ϵ), but not from q(w;ν)
• Often has low variance; even a single-sample estimation is OK (NMC = 1).
Variational Inference with Stochastic Optimization
So, we can use stochastic gradient optimization of our approximate variational objective.
ℒ̂ELBO = (1/N_MC) ∑_{h=1}^{N_MC} log p(y ∣ X, w̃(h)) − KL[q(w; ν) ∥ p(w)]
with w̃(h) = f(ε̃(h); ν) and ε̃(h) ∼ p(ε)
We have guarantees of convergence to the original variational objective!
ℒELBO = Eq(w;ν)[logp(y ∣ X, w)] − KL[q(w; ν) ∥ p(w)]
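A minimal PyTorch-style sketch of this procedure for the logistic-regression example: a single-sample (N_MC = 1) reparameterized estimate of ℒELBO, optimized with Adam. The prior variance s² = 1, the parameterization of σi through its log, and the optimizer settings are assumptions, not part of the notes.

import torch

def elbo_estimate(mu, log_sigma, X, y, s2=1.0):
    """Single-sample reparameterized ELBO for mean-field Gaussian q and a logistic likelihood."""
    sigma = torch.exp(log_sigma)
    eps = torch.randn_like(mu)                  # eps ~ N(0, I): no tunable parameters
    w = mu + eps * sigma                        # w = f(eps; nu), differentiable in mu and sigma
    z = X @ w
    log_lik = torch.sum(y * z - torch.nn.functional.softplus(z))
    # KL[q || p] with p(w) = N(0, s2 I), summed over dimensions (closed form from above)
    kl = 0.5 * torch.sum(torch.log(s2 / sigma**2) - 1 + sigma**2 / s2 + mu**2 / s2)
    return log_lik - kl

def fit_vi(X, y, n_steps=2000, lr=0.05):
    """Stochastic optimization of the variational parameters (X, y assumed to be torch tensors)."""
    D = X.shape[1]
    mu = torch.zeros(D, requires_grad=True)
    log_sigma = torch.zeros(D, requires_grad=True)
    opt = torch.optim.Adam([mu, log_sigma], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = -elbo_estimate(mu, log_sigma, X, y)   # maximize ELBO = minimize -ELBO
        loss.backward()
        opt.step()
    return mu.detach(), torch.exp(log_sigma).detach()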
Stochastic Gradient Optimization
When we are stochastic, every time we query our routine to compute the gradient we end up in a
slightly different place. To converge, our steps have to become small.
Stochastic gradient-based optimization has good theory about convergence. Optimizing using
stochastic updates reaches a local optimum if step-size αi goes to zero with a certain rate:
∑_i αi² < ∞   and   ∑_i αi = ∞
Price to pay: convergence in O(1/√t) rather than O(1/t) for gradient-based optimization (t is #
iterations), so we are going to be slower.
Results on Classification
Results with fully factorized Gaussian posterior:
Results with full covariance Gaussian posterior
In the Gaussian likelihood case, the optimization makes q(w;ν) converge to the true posterior!
Extensions
We can also extend this to a mini-batch-based formulation, because we have an expectation of a log
likelihood: the log of a likelihood is the log of a product, which becomes a sum of logs.
We can also use more general forms of q(w;ν), implicit q(w;ν), any likelihood and any prior.
Mini-batching
Once we have the objective
objective = (1/N_MC) ∑_{h=1}^{N_MC} log p(y ∣ X, w̃(h)) − KL[q(w; ν) ∥ p(w)]
the only term depending on data is the first term
When the likelihood factorizes
log p(y ∣ X, w̃(h)) = ∑_{i=1}^{N} log p(yi ∣ xi, w̃(h))
We can get an unbiased estimate by selecting M out of the N data points:
log p(y ∣ X, w̃(h)) ≈ (N/M) ∑_{i ∈ minibatch} log p(yi ∣ xi, w̃(h))
Double source of stochasticity: Monte Carlo and mini-batch.
Better approximation with Normalizing Flows
Key idea
Build complex distributions from simple distributions via a flow of successive (invertible)
transformations:
5 - K-means, Kernel K-means, and Mixture models
13/05/2022
Unsupervised learning
Everything we've seen so far has been supervised. We were given a set of xn and associated yn.
What if we just have xn?
Aims
• Understand what clustering is.
• Understand the K-means algorithm.
• Understand the idea of mixture models.
• Be able to derive the update expression for mixture model parameters.
Clustering
Example: each sample has two attributes, xn = [xn1, xn2]⊤, and we have N of these. It's a completely
different setup: before, we knew the "color", we had the labels, and we had to figure out how to
separate the classes. In unsupervised learning we just have the black data, the inputs, and our
algorithms have to determine some grouping on their own. This is a very arbitrary problem: we may see
3 clusters here while someone else sees just 2. Some people have formalized the problem by requiring
certain properties of the clusters; for example, if we scale the data by a factor α we would like our
algorithm to give the same result, so we want the algorithm to be consistent across different
scalings. We may want other properties too, but we will never be able to satisfy all of them.
Left: data.
Right: data after clustering (points coloured according to cluster membership).
K-means
Assume that there are K clusters, so we know the number of clusters. How do we know how many clusters
there are? We will see strategies to address this problem.
Each cluster is defined by a position in the input space:
μk = [μk1, μk2]T
Each xn is assigned to its closest cluster, to the closest mean (mu):
This is a mechanism of “compression” if we want.
Distance is normally Euclidean distance: dnk = (xn − μk )T (xn − μk ) but obviously we are free to
use another distance.
How do we find μk?
What we would like to do is to place our mean in a strategic way so that somehow we can
compress our data in a nice way. But there is no analytical solution, we can't write down μk as a
function of X. We have to use an iterative heuristic algorithm that for sure is going to converge:
1. Guess μ1, μ2, . . . , μK ;
2. Assign each xn to its closest μk ;
3. Then we use this indicator, znk, which is a big matrix of 0 except for ones when the point is
associated to cluster k: znk = 1 if xn assigned to μk (0 otherwise) ;
4. Update μk to the average of the xn assigned to μk:
μk = (∑_{n=1}^{N} znk xn) / (∑_{n=1}^{N} znk)
5. Return to 2 and repeat until the assignments do not change.
The algorithm will converge: it will reach a point where the assignments don't change. Each iteration
costs N·K distance computations, but that's linear in N, so not too bad (Gaussians are quadratic).
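A minimal NumPy sketch of this iterative scheme; K, the random initialization from the data points, and the iteration cap are assumptions, and empty clusters are not handled.

import numpy as np

def kmeans(X, K, n_iters=100, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    mu = X[rng.choice(len(X), size=K, replace=False)]         # 1. guess the means
    for _ in range(n_iters):
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances, shape (N, K)
        z = d.argmin(axis=1)                                   # 2./3. assign each x_n to its closest mu_k
        new_mu = np.array([X[z == k].mean(axis=0) for k in range(K)])  # 4. update means (empty clusters not handled)
        if np.allclose(new_mu, mu):                            # 5. stop when the means stop moving
            break
        mu = new_mu
    return mu, z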
When does K-means break?
Outer cluster cannot be represented as a single point. The reason is that k-means would do
something like this:
Why? Because k-means uses the Euclidean distance, and with it there is no way to achieve separations
that are not linear, unless we use, say, 100 clusters; that is not very elegant or interpretable,
because we would use 100 clusters to represent 2.
So what if we could make these boundaries non-linear? One way would be to change the distance between
points and clusters, but an easier way is to think the same way we did with Gaussian processes. We
said: we are good at linear regression, so instead of doing linear regression with a linear function,
what if we transform the problem by introducing many new basis functions and do linear regression
using these non-linear basis functions?
For this particular problem it is very easy: imagine taking each xi and constructing a new variable
hi = ||xi||². What do we get? All the points become separated, and this leads to an easy separation of
the clusters by k-means.
What if we could use an infinite number of basis functions? We would have infinitely many ways to
separate the points. So the idea is to do a transformation that brings us into another space, do
k-means there, and then go back to the original space. Obviously we don't like working with infinite
things explicitly.
Kernelizing K-means
Maybe we can kernelize K-means? Let’s start with the distances: dnk = (xn − μk )T (xn − μk ).
This tells us whether a point belongs to a cluster: we compute it for all the clusters, find the
minimum, and associate the point with the cluster for which the distance is minimum. The means are
calculated as the "barycenter" of each cluster:
μk = (∑_{m=1}^{N} zmk xm) / (∑_{m=1}^{N} zmk)
And now if we just expand this expression of the distances by plugging the mean inside we get:
(xn − μk)⊤(xn − μk) = (xn − Nk⁻¹ ∑_{m=1}^{N} zmk xm)⊤ (xn − Nk⁻¹ ∑_{m=1}^{N} zmk xm)
Multiplying out:
xn⊤xn − 2Nk⁻¹ ∑_{m=1}^{N} zmk xm⊤xn + Nk⁻² ∑_{m,l=1}^{N} zmk zlk xm⊤xl
In linear regression we saw that we can express our predictions purely in terms of scalar products
between the points. To determine whether a point belongs to a cluster we have to compute these
distances to the means; but because the means are themselves averages of points, when we multiply out,
every term that appears contains a scalar product between inputs. So we can say: what if we change the
representation so that the x's are mapped into some other space by some function φ, and this product
is replaced with a kernel function:
k(xn, xn) − 2Nk⁻¹ ∑_{m=1}^{N} zmk k(xn, xm) + Nk⁻² ∑_{m,l=1}^{N} zmk zlk k(xm, xl)
So what we are doing is again to map this problem into another space, where the scalar product is just
some kernel. If we do this we work in a new space induced by the kernel k. Choosing k is equivalent to
choosing some mapping of our points.
Kernel K-means
Algorithm:
1. Choose a kernel and any necessary parameters.
2. Start with random assignments znk.
3. For each xn, assign it to the nearest 'center', where the distance is defined as:
k(xn, xn) − 2Nk⁻¹ ∑_{m=1}^{N} zmk k(xn, xm) + Nk⁻² ∑_{m,l=1}^{N} zmk zlk k(xm, xl)
4. If assignments have changed, return to 3.
It's very important to notice that we never compute the means: the kernel k induces some mapping of
the points, but we don't want to work with that mapping explicitly. That's why we work with this
implicit representation, where we can compute the distances in the mapped, high-dimensional space
without ever computing the mapping, and without computing the means in that space; the only thing we
need is the distance between the points and the means.
This is just what we did in linear regression: we didn't compute the weights in the
infinite-dimensional representation, because we can't, and we don't need to, because all predictions
in that space can be expressed simply in terms of scalar products of the feature function in the
high-dimensional space, and that scalar product is just k. Here it is the same: we want to use the
mapping without computing it, and all we need is to be able to compute the distances between the
points and the means in the infinite-dimensional space.
So Kernel K-means:
• Makes simple K-means algorithm more flexible.
• But, have to now set additional parameters.
• Very sensitive to initial conditions because of lots of local optima.
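A minimal sketch of kernel K-means using a precomputed kernel matrix; the RBF kernel, its lengthscale, and the iteration cap are assumptions.

import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * lengthscale**2))

def kernel_kmeans(K_mat, K_clusters, n_iters=50, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    N = K_mat.shape[0]
    z = rng.integers(K_clusters, size=N)                 # 2. random initial assignments
    for _ in range(n_iters):
        d = np.zeros((N, K_clusters))
        for k in range(K_clusters):
            members = (z == k)
            Nk = members.sum()
            if Nk == 0:
                d[:, k] = np.inf                         # empty cluster: never the nearest (sketch choice)
                continue
            # distance in feature space: k(xn,xn) - 2/Nk sum_m k(xn,xm) + 1/Nk^2 sum_{m,l} k(xm,xl)
            d[:, k] = (np.diag(K_mat)
                       - 2.0 / Nk * K_mat[:, members].sum(axis=1)
                       + K_mat[np.ix_(members, members)].sum() / Nk**2)
        new_z = d.argmin(axis=1)                         # 3. assign to the nearest 'center'
        if np.all(new_z == z):                           # 4. stop when assignments stop changing
            break
        z = new_z
    return z

Usage would be something like z = kernel_kmeans(rbf_kernel(X), K_clusters=2).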
K-means summary
• Simple (and effective) clustering strategy.
• Converges to (local) minima of:
∑_n ∑_k znk (xn − μk)⊤(xn − μk)
• Sensitive to initialization.
• How do we choose K?
• Tricky: Quantity above always decreases as K increases.
• Can use CV (cross validation) if we have a measure of 'goodness'.
• For clustering these will be application specific.
Mixture models thinking generatively
When we looked at the Bayesian way of doing linear regression, we were thinking about how to generate
data, i.e. in a generative fashion: "take some parameters, construct a function, add some noise, and
that's how the data are generated". Thanks to that we could design a likelihood, design a prior for
the parameters, and then do posterior inference to get a posterior over the parameters. Can we do the
same here?
The idea is to start with a hypothesis on how we would generate data like this:
A generative model
Assumption: Each xn comes from one of different K distributions.
To generate xn, for each n:
1. Pick one of the K components.
2. Sample xn from that component's distribution.
We repeat this N times and we have our data. In practice, though, we have the data and would like to
work out how they were generated; this is exactly the logic we used for linear regression. So we
collect the parameters of all these distributions into Δ and we'd like to reverse-engineer this
process: learn Δ, which we can then use to find which component each point came from.
This is what we saw at the beginning, and we are going to try to do the same here: maximize the
likelihood.
Mixture model likelihood
For the likelihood, what we need is some density of our data given the parameters, p(X | Δ), which we
can then maximize. So let the kth distribution have pdf p(xn | znk = 1, Δk): this determines the
probability of each xn when xn belongs to cluster k, and cluster k has parameters Δk.
What we want is the likelihood:
p(X | Δ)
The first assumption we are going to make is that the likelihood factorizes:
p(X ∣ Δ) = ∏_{n=1}^{N} p(xn ∣ Δ) ¹
Then, un-marginalize k:
p(X ∣ Δ) = ∏_{n=1}^{N} ∑_{k=1}^{K} p(xn, znk = 1 ∣ Δ)
= ∏_{n=1}^{N} ∑_{k=1}^{K} p(xn ∣ znk = 1, Δk) p(znk = 1 ∣ Δ)
Why is this useful? Because these terms are, for example, Gaussians. So this is basically our
likelihood, and we want to find Δ:
argmax_Δ ∏_{n=1}^{N} ∑_{k=1}^{K} p(xn ∣ znk = 1, Δk) p(znk = 1 ∣ Δ)
But this is going to look very bad already, because we have a product of N terms, each of which is a
sum of K terms! Taking logs instead we obtain:
argmax_Δ ∑_{n=1}^{N} log ∑_{k=1}^{K} p(xn ∣ znk = 1, Δk) p(znk = 1 ∣ Δ)
But now we have the log of a sum… so we need to do something else.
¹ There is some noise in the generation that we can model; we try to capture this uncertainty with the
parameters Δ.
Jensen's inequality
log Ep(x)[ f (x)] ≥ Ep(x)[log f (x)]
How does this help us?
Our log likelihood:
L = ∑_{n=1}^{N} log ∑_{k=1}^{K} p(xn ∣ znk = 1, Δk) p(znk = 1 ∣ Δ)
Add an (arbitrary-looking) distribution q(znk = 1), subject to ∑_k q(znk = 1) = 1:
L = ∑_{n=1}^{N} log ∑_{k=1}^{K} q(znk = 1) [ p(xn ∣ znk = 1, Δk) p(znk = 1 ∣ Δ) / q(znk = 1) ]
q is going to act as our posterior over the cluster memberships of the points, so it will be an
approximation of the posterior. Let's view the inner sum as an expectation under q(znk = 1):
L = ∑_{n=1}^{N} log E_{q(znk=1)}[ p(xn ∣ znk = 1, Δk) p(znk = 1 ∣ Δ) / q(znk = 1) ]
So, using Jensen’s:
L ≥ ∑_{n=1}^{N} E_{q(znk=1)}[ log( p(xn ∣ znk = 1, Δk) p(znk = 1 ∣ Δ) / q(znk = 1) ) ]
= ∑_{n=1}^{N} ∑_{k=1}^{K} q(znk = 1) log{ p(xn ∣ znk = 1, Δk) p(znk = 1 ∣ Δ) / q(znk = 1) }
There are only sums! What are we going to do with this?
L ≥ ∑_{n=1}^{N} ∑_{k=1}^{K} q(znk = 1) log{ p(xn ∣ znk = 1, Δk) p(znk = 1 ∣ Δ) / q(znk = 1) }
= ∑_{n=1}^{N} ∑_{k=1}^{K} q(znk = 1) log p(znk = 1 ∣ Δ)
+ ∑_{n=1}^{N} ∑_{k=1}^{K} q(znk = 1) log p(xn ∣ znk = 1, Δk)
− ∑_{n=1}^{N} ∑_{k=1}^{K} q(znk = 1) log q(znk = 1)
Grouping these terms, we get a loss (the expected log likelihood, i.e. the model-fitting part) and a
regularization, measuring how the posterior q deviates from the prior. So we get the same structure we
had in variational inference: some sort of log-likelihood plus a regularization.
Now we have to optimize this with respect to Δ, and maybe other parameters, like q! So let's take the
derivative with respect to that as well. Another thing we could also try to optimize is p(znk = 1).
So eventually we define qnk = q(znk = 1) and πk = p(znk = 1 ∣ Δ) (both just scalars).
What we do now is differentiate the lower bound with respect to each quantity we want to optimize
(qnk, πk and Δk), set the derivatives to zero, and obtain iterative updates.
Optimizing lower bound
In particular, the updates for Δk and πk will depend on qnk: update qnk, then use these values to
update Δk and πk, and so on. This is a form of the Expectation-Maximization (EM) algorithm, but we've
derived it differently.
Let’s take an example…
Gaussian mixture model
Assume each component distribution is a Gaussian with its own mean and variance, with diagonal
covariance:
p(xn ∣ znk = 1, μk, σk²) = 𝒩(xn ∣ μk, σk² I)
Update for πk. The only relevant bit of the bound is:
∑_{n,k} qnk log(πk)
Now, we have a constraint: ∑_k πk = 1. A technique we can use when we have a constraint is a
Lagrangian, which just adds a term with a Lagrange multiplier λ:
∑_{n,k} qnk log πk − λ (∑_k πk − 1)
Now if we take the derivative and set it to zero we end up with:
(1/πk) ∑_n qnk − λ = 0
This allows us to obtain the Lagrange multiplier and then substitute it back. If we rearrange we end
up with:
∑_n qnk = λ πk
Then we sum both sides over k to find λ:
∑_{n,k} qnk = λ · 1
Finally we substitute and rearrange again:
πk = ∑_n qnk / ∑_{n,j} qnj = (1/N) ∑_n qnk
In the end πk is just the average of the posteriors. This is not surprising: if we take a prior that
is as flexible as the posterior and we allow the prior to be optimized, the prior only enters the KL
regularization term, so the KL can be made 0 by setting the prior equal to the posterior. That looks
like nonsense, because it tells us the prior has to equal the posterior. Here it is the same kind of
thing, except that the prior is now cluster-specific, not a prior for each point individually
(otherwise we would get πk = qnk). Since πk is the prior for a given cluster, πk has to be the average
of the posterior memberships.
Update for qnk
Now for qnk. The whole bound is relevant, because all of it depends on qnk.
So take the derivative, set it to 0, and again add a Lagrange-multiplier term −λ(∑_k qnk − 1), because
the qnk have to sum to 1 over k. After some rearrangement we end up with the following (no need to
remember the derivation, just take the result):
qnk = πk p(xn ∣ znk = 1, Δk) / ∑_{j=1}^{K} πj p(xn ∣ znj = 1, Δj)
And this is basically the same expression we had in the Bayes classifier.
Updates for μk and σk²
These are easier: no constraints. Differentiate the following and set to zero (D is the dimension of
xn):
∑_{n,k} qnk log[ (2πσk²)^(−D/2) exp( −(xn − μk)⊤(xn − μk) / (2σk²) ) ]
Result:
μk = ∑_n qnk xn / ∑_n qnk
σk² = ∑_n qnk (xn − μk)⊤(xn − μk) / (D ∑_n qnk)
This is just like what we had with k-means, but now there is an extra computation for the variances,
because we are assuming that each cluster has its own variance. For the variance we get something very
similar to the sample variance we would use if we wanted to estimate the variance of some data.
Mixture model optimization algorithm
The optimization algorithm is as follows (a minimal sketch follows the list):
1. Guess μk, σk², πk.
2. Compute qnk.
3. Update μk, σk².
4. Update πk.
5. Return to 2 unless the parameters are unchanged.
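A minimal NumPy sketch of these EM-style updates for a Gaussian mixture with the per-cluster isotropic variance σk² used above; the initialization and the fixed number of iterations are assumptions.

import numpy as np

def log_gauss_iso(X, mu, s2):
    """log N(x | mu, s2 * I) for every row of X."""
    D = X.shape[1]
    return -0.5 * (D * np.log(2 * np.pi * s2) + ((X - mu) ** 2).sum(axis=1) / s2)

def fit_gmm(X, K, n_iters=100, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    N, D = X.shape
    mu = X[rng.choice(N, size=K, replace=False)]       # 1. guess the means,
    s2 = np.full(K, X.var())                            #    the variances,
    pi = np.full(K, 1.0 / K)                            #    and the mixing proportions
    for _ in range(n_iters):
        # 2. responsibilities q_nk proportional to pi_k * N(x_n | mu_k, s2_k I)
        log_q = np.stack([np.log(pi[k]) + log_gauss_iso(X, mu[k], s2[k]) for k in range(K)], axis=1)
        log_q -= log_q.max(axis=1, keepdims=True)
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)
        # 3./4. updates for mu_k, s2_k and pi_k
        Nk = q.sum(axis=0)
        mu = (q.T @ X) / Nk[:, None]
        s2 = np.array([(q[:, k] * ((X - mu[k]) ** 2).sum(axis=1)).sum() / (D * Nk[k]) for k in range(K)])
        pi = Nk / N
    return mu, s2, pi, q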
Guaranteed to converge to a local maximum of the lower bound. Note the similarity with k-means, except
that now the computation is a bit more elaborate: not only the means, but also the variances, and the
πk determining the prior mass we associate with each cluster. In the literature it has been proven
that this converges. Basically, k-means is a special case of what we got, in which the σk are fixed to
be the same and the πk are fixed to 1/K; and because k-means is a special case of EM
(Expectation-Maximization), we know that k-means is also guaranteed to converge.
Mixture model clustering
Imagine now we want to know which points came from which distribution. This qnk is our posterior, the
probability that xn came from component k:
qnk = P(znk = 1 ∣ xn, X)
So this is conditioned on the data we have. We can now either stick with the probabilities or assign
each xn to its most likely component.
Mixture model issues
How do we choose K?
What happens when we increase it? The likelihood is going to improve. In mixture models, if we have
one cluster for each data point, we also have the extra flexibility that the variance is learned
(optimized). What the model likes to do is then put one mean on each point and shrink the variance to
zero; by doing that the likelihood becomes infinite. So we get this degeneracy if we start increasing
the number of clusters: one component can be attracted to a single data point, take no other data
points, and make its variance smaller and smaller, eventually giving infinite likelihood. So increasing
the number of clusters is not a good idea in general: the likelihood always increases as σ² decreases.
What can we do? (Answer: cross-validation…)
We leave out some data and check the log-likelihood on the unseen data. If we do that we end up with
something like the usual plot we see when we do model selection, with a sweet spot, in this case
around 5 components.
Mixture models other distributions
We've seen Gaussian distributions.
We can actually use anything… as long as we can define p(xn ∣ znk = 1, Δk), e.g. for binary data.
Binary example
xn = [0, 1, 0, 1, 1,.. . ,0,1]T (D dimensions)
p(xn ∣ znk = 1, Δk) = ∏_{d=1}^{D} pkd^(xnd) (1 − pkd)^(1 − xnd)
Updates for pkd are:
pkd = ∑_n qnk xnd / ∑_n qnk
qnk and πk are the same as before…
Initialize with random pkd (0 ≤ pkd ≤ 1)
K = 5 clusters.
Clear structure present.
Summary
• Introduced two clustering methods.
• K-means
• Very simple.
• Iterative scheme.
• Can be kernelized.
• Need to choose K.
• Mixture models
• Create a model of each class (similar to the Bayes classifier).
• Iterative scheme (EM).
• Can use any distribution for the components.
• Can set K by cross-validation (held-out likelihood).
• State-of-the-art: don't need to set K; treat it as a variable in a Bayesian sampling scheme.