ASI - Advanced Statistical Inference (11/03/2022)

Index

1 - Bayesian Linear Regression: Recap - Probabilities; Recap - Expectations; The Gaussian Distribution; The Multivariate Gaussian Distribution; Expectations of Gaussians; Working example; Definitions; Linear Regression Definitions; Linear Models for Regression; Linear Regression as Loss Minimization; Probabilistic Interpretation of Loss Minimization; Properties of the Maximum-Likelihood Estimator; Model Selection; Validation on "unseen" data; How should we choose which data to hold back (as unseen data)?; Cross-validation; Leave-one-out Cross-validation; Computational issues; Bayesian Inference; Bayesian Linear Regression; When can we compute the posterior?; Why is this important?; Bayesian Linear Regression: Finding posterior parameters; Bayesian Linear Regression: Example; Predictive Distribution; Introducing basis functions; Predictions; Computing posterior: recipe; Marginal likelihood; Choosing a prior; Summary

2 - Gaussian Process: Gaussian Process; Bayesian Linear Regression as a Kernel Machine; Kernels; Gaussian Processes; Gaussian Processes Prior over Functions; Kernel; Gaussian Processes Regression example; Optimization of Gaussian Process parameters; Summary

3 - Bayesian Logistic Regression and the Bayesian Classifier: Classification; Probabilistic vs non-probabilistic classifiers; Classification syllabus; Some data; Logistic regression; Bayesian logistic regression; Defining a prior; Defining a likelihood; Posterior; What can we compute?; MAP estimate (Maximum A Posteriori); Decision boundary; Predictive probabilities; Roadmap; Laplace approximation; Laplace approximation 1D example; Laplace approximation for logistic regression; Predictions with the Laplace approximation; Summary roadmap; MCMC sampling; Back to the script: Metropolis-Hastings; MH proposal; MH acceptance; MH flowchart; MH walkthrough; What do the samples look like?; Predictions with MH; Summary

3.1 - Bayesian Classifier: Bayes classifier; Bayes classifier likelihood; Bayes classifier prior; Naive-Bayes; Bayes classifier, example 1; Step 1: fitting the class-conditional densities; Compute predictions; Bayes classifier, example 2; Fit the class conditionals; Compute predictions; Bayes classifier summary

3.2 - Performance Evaluation: Performance evaluation; 0/1 Loss; ROC Analysis; ROC curve; AUC; Confusion matrices; Confusion matrices example; Summary

4 - Variational Inference: Where are we?; Refresh: Kullback-Leibler divergence; Logistic Regression as a working example; Inference; Variational Inference; Visual illustration of Variational Inference; Variational Inference: Form of the approximation; Variational Inference: Objective; Variational Inference: Reparameterization trick; Reparameterization trick (Derivation); Reparameterization trick (Properties); Variational Inference with Stochastic Optimization; Stochastic Gradient Optimization; Results on Classification; Extensions; Mini-batching; Better approximation with Normalizing Flows

5 - K-means, Kernel K-means, and Mixture models: Unsupervised learning; Aims; Clustering; K-means; How do we find them?; When does K-means break?; Kernelizing K-means; Kernel K-means; K-means summary; Mixture models: thinking generatively; A generative model; Mixture model likelihood; Jensen's inequality; Optimizing the lower bound; Gaussian mixture model; Update for qnk; Updates for the remaining parameters; Mixture model optimization algorithm; Mixture model clustering; Mixture model issues; What can we do? (A: cross-validation); Mixture models: other distributions; Binary example; Summary

6 - Feature Selection, PCA and probabilistic PCA
1 - Bayesian Linear Regression (11/03/2022)

Recap - Probabilities

Consider two continuous random variables x and y:
• Sum rule: p(x) = ∫ p(x, y) dy
• Product rule: p(x, y) = p(x | y) p(y) = p(y | x) p(x)
In the product rule, writing "| y" means that y is no longer treated as a random variable: it is fixed to a particular value. Which value? It depends on the context. The interesting point is that the order of the factorization does not matter.
• Bayes' rule: p(y | x) = p(x | y) p(y) / p(x)
It is a direct consequence of the product rule.

Recap - Expectations

Consider a random variable x with density p(x), and imagine wanting to know its average value x̃. One option is simply to generate S samples and take their average:
x̃ ≈ (1/S) Σ_{s=1}^{S} x_s
This is called an empirical estimate; the sample-based approximation of x̃ gets better as we take more samples. We can also (sometimes) compute it exactly using expectations:
• Discrete: x̃ = E_{p(x)}[x] = Σ_x x p(x)
• Continuous: x̃ = E_{p(x)}[x] = ∫ x p(x) dx

Examples:
• x is the outcome of rolling a fair die, so P(X = x) = 1/6 for x = 1, ..., 6. Going through these values, multiplying each by its probability and summing, we get x̃ = Σ_x x P(X = x) = 3.5.
• x is uniformly distributed between a and b: x̃ = ∫_a^b x p(x) dx = (a + b)/2.

In general:
E_{p(x)}[f(x)] = ∫ f(x) p(x) dx

Some important properties:
E_{p(x)}[f(x)] ≠ f(E_{p(x)}[x]) in general
E_{p(x)}[k f(x)] = k E_{p(x)}[f(x)]

The expectation of a random variable is commonly called the mean. Mean and variance:
μ = E_{p(x)}[x]
σ² = E_{p(x)}[(x − μ)²] = E_{p(x)}[x²] − μ²
The expected squared distance between the random variable x and its mean measures the dispersion of the random variable: how far, on average, we are from the mean. We use the square because we do not care on which side of the mean we fall, only about the distance; if in expectation we are very far from the mean, the variance is large. NB: the expectation of the square is never smaller than the square of the expectation.

Everything above also applies to vectors of random variables:
E_{p(x)}[f(x)] = ∫ f(x) p(x) dx
We want to make inference about high-dimensional spaces and functions. Mean and covariance:
μ = E_{p(x)}[x]
cov(x) = E_{p(x)}[(x − μ)(x − μ)^⊤] = E_{p(x)}[x x^⊤] − μ μ^⊤
When we work with a vector we usually treat it as an N×1 matrix, where N is the size of the vector, so the outer product in the covariance gives an N×N matrix. The diagonal of this matrix contains the variance of each component of the vector; the off-diagonal elements say, for each pair of components, how much one varies with the other. The covariance therefore captures the correlations between the random variables.
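To make the empirical estimate above concrete, here is a minimal sketch (assumptions: Python with NumPy; the die and the interval [a, b] are the ones from the examples) comparing sample averages with the exact expectations.

```python
import numpy as np

rng = np.random.default_rng(0)
S = 100_000  # number of samples; the estimate improves as S grows

# Die roll: exact expectation is sum_x x * 1/6 = 3.5
die_samples = rng.integers(1, 7, size=S)
print("die:", die_samples.mean(), "(exact 3.5)")

# Uniform on [a, b]: exact expectation is (a + b) / 2
a, b = -1.0, 3.0
u_samples = rng.uniform(a, b, size=S)
print("uniform mean:", u_samples.mean(), "(exact", (a + b) / 2, ")")

# Variance as E[x^2] - mu^2, estimated from the same samples
mu_hat = u_samples.mean()
var_hat = (u_samples ** 2).mean() - mu_hat ** 2
print("uniform variance:", var_hat, "(exact", (b - a) ** 2 / 12, ")")
```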
The Gaussian Distribution

Consider a continuous random variable v. The Gaussian probability density function is:
p(v | μ, σ²) = (1 / (σ √(2π))) exp{ −(v − μ)² / (2σ²) }
μ is the mean and controls the location of the bump (further to the left or to the right); σ² is the variance and controls its "flatness". The first factor acts as a scaling constant: it ensures that the area under the curve equals 1, and it depends on the variance.

The Multivariate Gaussian Distribution

This is the distribution of a random vector. It is the most useful and most flexible distribution to use in machine learning. Consider v = (v1, ..., vD)^⊤ with joint Gaussian distribution p(v | μ, Σ) = 𝒩(v | μ, Σ):
𝒩(v | μ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp{ −(1/2) (v − μ)^⊤ Σ^{−1} (v − μ) }
The inverse covariance Σ^{−1} (the object whose diagonal controls the variances and whose off-diagonal entries control the interactions) acts like a scaling factor: where before we had 1/(2σ²), we now have Σ^{−1}. It normalizes the quadratic form according to the covariance, and this determines the "orientation" of the Gaussian. The normalization constant of the integral is harder to compute than before, but working it out gives 1 / ((2π)^{D/2} |Σ|^{1/2}). Numerically this may be challenging.

Seen from the top (as contour plots of example covariances): using 9 on the diagonal scales each dimension by a factor 1/9, which gives the same result as taking two univariate Gaussians and multiplying them together. The second and third cases, with non-zero off-diagonal entries, are more interesting. An off-diagonal entry of −4 tells us what one variable being on average larger than its mean does to the average value of the second variable relative to its mean: with −4, whenever v − μ is positive for the first component, on average there is a negative effect on the second one. The interesting thing is that if we take the eigenvectors of the covariance we get a basis in which the covariance is actually diagonal: rotating the axes v1 and v2 so that they align with the major axes of the Gaussian reveals a perfectly diagonal covariance.

Expectations of Gaussians

For the Gaussian, the parameters we introduced to locate and shape the distribution are exactly the mean and the variance (and the covariance in the multivariate case).
Univariate: p(x | μ, σ²) = 𝒩(x | μ, σ²)
Mean: E_{p(x)}[x] = μ
Variance: E_{p(x)}[(x − μ)²] = σ²
Multivariate: p(x | μ, Σ) = 𝒩(x | μ, Σ)
Mean: E_{p(x)}[x] = μ
Covariance: E_{p(x)}[(x − μ)(x − μ)^⊤] = Σ

Working example

We are given some observed data and need to reverse the way it was obtained, figuring out which function generated it. Applying Bayesian principles yields a distribution over functions that can model the data. This is powerful: it says there is not just one single function that interpolates the data; rather, according to the knowledge we have, the Bayesian approach lets us state our assumptions and keep every function compatible with them.

Definitions

Features, inputs, covariates, or attributes x: x ∈ ℝ^D, X = (x1, ..., xN)^⊤
Labels, outputs, or responses: y ∈ ℝ^O, Y = (y1, ..., yN)^⊤

Linear Regression Definitions

The data is a set of N pairs of feature vectors and labels:
𝒟 = {(xi, yi)}_{i=1,...,N}
GOAL: estimate a function f(x): ℝ^D → ℝ^O.
For simplicity we will assume O = 1 (univariate labels), so y = (y1, ..., yN)^⊤ and we aim to estimate f(x): ℝ^D → ℝ.
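Before introducing the models, here is a minimal sketch of this setup (assumptions: NumPy, purely synthetic data with a made-up "true" weight vector). It generates a dataset 𝒟 = {(xi, yi)} from a noisy linear function and recovers the weights with the closed-form least-squares solution ŵ = (X^⊤X)^{−1}X^⊤y derived below.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 3                            # number of points and feature dimension
w_true = np.array([0.5, -1.0, 2.0])     # hypothetical generating weights
sigma = 0.1                             # noise standard deviation

X = rng.normal(size=(N, D))             # design matrix, one row per input x_i
y = X @ w_true + sigma * rng.normal(size=N)   # y_i = w^T x_i + eps_i

# Least-squares / maximum-likelihood estimate: w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("w_hat:", w_hat)                  # close to w_true for reasonable N and noise
```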
, yN )T Linear Models for Regression Implement a linear combination of basis functions f (x) = D ∑ wi φi(x) i=1 ⊤ = w φ(x) with φ(x) = (φ1(x), …, φD(x)) . ⊤ For simplicity we will start with linear functions f (x) = D ∑ wi xi i=1 ⊤ =w x Where w controls the “angle”. Linear Regression as Loss Minimization Definition of the quadratic loss function: ℒ= N y − w ⊤ xi] ∑[ i 2 i=1 = ∥y − Xw∥2 We want that the difference between this y and the evaluation of my function to be as small as possible. This is what we call the loss: for all the data I have I want my function to be very close to the observation I have. One way to put this computation in a matrix form is: we can think at the sum of squared, we can think of a sum of a norm, a vector, and this vector has components yi and w T xi. Solution to the regression problem is: ∇w ℒ = 0 ⟹ w ̂ = ( X⊤ X) −1 X⊤ y 10 Bayesian Linear Regression 11 / 03 / 2022 Probabilistic Interpretation of Loss Minimization Consider a simple transformation of the loss function Instead of minimizing the first function we can take the function on the right (which looks like a gaussian) and try to maximize that one. I want to find w such that my model is most likely to explain this data. So instead of minimizing a loss we maximize a likelihood. Minimizing the quadratic loss equivalent to maximizing the Gaussian likelihood function. exp(−γℒ) = exp (−γ∥y − Xw∥2) 1 ∝ 𝒩 y ∣ Xw, ( 2γ ) Probabilistic Interpretation of Loss Minimization If we analyze the likelihood a little more we assume that our y is a distribution around a mean which is Xw with a certain variance which is 1/2γ. So somehow we are assuming that there is a model which puts some sort of gaussian distribution over the observation around the mean. So the likelihood 1 hints to the fact that we are assuming: 𝒩 y ∣ Xw, ( 2γ ) yi = w ⊤ xi + εi The epsilon is goin to be some sort of noise which is going to have variance 1/2γ. In vectorial form: y = Xw + ϵ With ϵ ∼ 𝒩 (ε ∣ 0,σ 2 I). Remark: the likelihood is not a probability! It’s a probability density function over y and this is controlled by w. If the ML solution is the optimal σ 2: w ̂ = (X ⊤ X ) −1 X ⊤y now we can also maximize the log-likelihood to obtain ∂ log [p (y ∣ x, w, σ 2)] ∂σ 2 yielding σ2̂ = =0 1 (y − X w )̂ ⊤(y − X w )̂ N Properties of the Maximum-Likelihood Estimator Are there any useful properties for the estimator ŵ ? 11 Bayesian Linear Regression 11 / 03 / 2022 The estimator w is unbiased! That is: Ep(y∣X,w)[ w ]̂ = ∫ w p(y ̂ ∣ X, w)dy = w An estimator is unbiased when the expectation of the estimator under the distribution of p(y | X, w) is actually w. But what is this expectation Ep(y∣X,w)[ w ]̂ ? Imagine the process of how to compute the expectation. An expectation can be approximated as an average, we sample from the distribution we have and we take an average. So imagine we generate many datasets, so we generate many ys. For each of this dataset we estimate ŵ . So we are going to have a family of w hat. Then we take the average of these ws. This average is going to be exactly w in the limit of infinte estimations. So if the data are generated from the model that I am assuming we are doing the best we can. This is an important property. Properties of the Maximum-Likelihood Estimator Unfortunately, the estimate of the optimal σ 2 is biased! Ep(y∣X,w) ( σ2̂ ) = 1 Ep(y∣X,w) [(y − X w )̂ ⊤(y − X w )̂ ] N D = σ2 1 − ( N) Model Selection How can we prefer one model over another? Lowest loss N highest likelihood? NO! 
Higher model complexity yields lower loss M higher likelihood but it usually does not generalize well on test data. So we have to avoid overfitting and to do so we have to define another way to select our model. Model Selection Effect of increasing model complexity Consider polynomial functions: f (x) = k ∑ wi x i i=0 The training loss decreases with k but test loss increases: But on the test the error decreases and then eventually goes up again. So we could use a portion of our data as “unseen data”, as our validation set. And this is the 12 Bayesian Linear Regression 11 / 03 / 2022 Validation on "unseen" data Cross-validation is a safe way to do model selection Predictions evaluated using validation loss: ℒv = 1 Ntest ∑ i∈ℐtest ⊤ (yi − w xi) 2 We take this loss as a measure of how well our model is. We could also take the validation loglikelihood, in that case we would like it to be big. log [p (ytest ∣ Xtest , w ,̂ σ 2)] = − 1 2σ 2 ∑ i∈ℐtest ⊤ (yi − w xi) 2 How should we choose which data to hold back (as unseen data)? In some applications it will be clear but in many cases pick it randomly.The best thing to do is to do it more than once and then average the results. So the cross-validation is made by splitting the data into C equal sets; train on C-1, test on remaining. Cross-validation If we do this C times we have to learn C time our model in order just to do our validation and this could be problematic! Leave-one-outCross-validation Extreme example is when C = N SO each fold includes one input-label pair. We call it “Leaveone-out” (LOO) CV. Computational issues CV and LOOCV let us choose from a set of models based on predictive performance. This comes at a computational cost: - For C-fold CV. need to train our model C times. - For LOO-CV, need to train out model N times. 13 Bayesian Linear Regression 11 / 03 / 2022 Bayesian Inference ⊤ Inputs : X = (x1, …, xN) Labels : Weights y = (y1, …, yN) ⊤ : w = (w1, …, wD) ⊤ The essence of how we are going to apply this to machine learning is as follow: we are going to turn the condition p(y | w) into p(w | y) . We are going to use p(y | X, w) , imagining X is fixed. Why is this important? Because we are going to turn something we know, our likelihood function, into something powerful: a distribution over w given data. This is going to tell us what are the values of w which are compatible with the observation we have. And it’s not going to be just a value, there is no argmax, no optimization. In Bayesian theorem there is no maximization. So this problem is just an identity distribution. The way we can turn p(y | w) into p(w | y) is: p(w ∣ y, X ) = p(y ∣ X, w)p(w) ∫ p(y ∣ X, w)p(w)d w But what is p(w)? Is some sort of distribution we have over the parameter and it’s not conditional data, it’s something that we know about parameter before we look at any data. In our case it can be any distribution. Now, the process of multiplying by the likelihood is going to give us something which is a distribution over w which is constraint by the fact that we have observed data, we have evidence now. Not all the values of w are good to model our data, some of them are very bad functions to model our data and so those are going to get a very low value in the distribution of p(w|y,X) because the likelihood is going to be very small. Now we can focus on the denominator, which is a normalization constant. This is going to be the problem. Bayesian inference is nice but we have to solve this integral. 
It’s a quite difficult problem because it’s an integral in D dimension. Integrals are messy and especially here where we have a product of functions. In the gaussian case this is going to be easy because product of gaussian is going to be an exponential of a quadratic form. But in general this is not the case and we have to approximate this integral. Bayesian Linear Regression So here we are formalizing what we said. This is the likelihood function we talked about before and this is what we call likelihood. p(y|Xw) is going to me a sort of gaussian, centered on Xw and variance σ 2: p (y ∣ w, X, σ 2) = 𝒩 (y ∣ Xw, σ 2 I) 14 Bayesian Linear Regression 11 / 03 / 2022 Now we are going to have a gaussian distribution over the parameters before looking at any data. We call it Gaussian proper over model parameters because this is prior to observing any data: p(w) = 𝒩(w ∣ 0, S) We can expand the integral into something which is equal to p(y|X): p(w ∣ X, y) = p(y ∣ X, w)p(w) ∫ p(y ∣ X, w)p(w)d w = p(y ∣ X, w)p(w) p(y ∣ X) Now, thanks to the multiplication by the likelihood we can turn the prior, p(w), into a posterior, which is the distribution over parameters after observing data. So these are the actors of the Bayes rule: Posterior density: p(w|X,y), the distribution over parameters after observing data; Likelihood: p(y|X,w), the measure of “fitness”; Prior density: p(w), anything we know about parameters before we see any data. Marginal likelihood: p(y|X), a normalization constant that ensures ∫ p(w ∣ X, y)d w = 1. We are integrating the likelihood wrt the prior, so if we sample from the prior how many times we get a good likelihood? If p(w) has few parameters, for example just one parameter, we are going to have a simple one gaussian. But the likelihood is not very good because data it’s more complex but at least I have some sort of support of p(w) with a small likelihood, which is not very good. Very complex models are going to have spread p(w) across huge dimensional space and there are some p(y|X,w) which are really, really good because the model is very complex and the value of the likelihood is very high. Of course we will need a trade off, a model that cover the space in a reasonable way with p(w) in a way that the likelihood is also good. This is the meaning of this marginal likelihood. When can we compute the posterior? Conjugacy (definition): A prior p(w) is said to be conjugate to a likelihood if results in a posterior of the same type of density as the prior. Example: Prior: Gaussian; Likelihood: Gaussian; Posterior: Gaussian Prior: Beta; Likelihood: Binomial; Posterior: Beta Many others… If we know that the posterior is going to have a certain form we are just going to match the parameters directly. Why is this important? Bayes rule: p(w ∣ X, y) = p(y ∣ X, w)p(w) p(y ∣ X ) If prior and likelihood are conjugate, we know the form of p(w|X, y); Therefore, we know the form of the normalizing constant; Therefore, we don't need to compute p(y|X)! So in the case of the mean and the variance we just have to understand what is the mean and what is the variance of the product. 
And a way to do this is… Bayesian Linear Regression Finding posterior parameters Back to our model… The posterior must be Gaussian, ignoring normalizing constants, the posterior is: 15 Bayesian Linear Regression 11 / 03 / 2022 p (w ∣ X, y, σ 2) ∝ exp 1 − (w − μ)⊤Σ−1(w − μ) { 2 } = exp 1 − (w ⊤Σ−1w − 2w ⊤Σ−1 μ + μ ⊤Σ−1 μ) { 2 } ∝ exp 1 − (w⊤Σ−1w − 2w⊤Σ−1 μ} } { 2 Ignoring non-w terms, the prior multiplied by the likelihood is: p (y ∣ w, X, σ 2) ∝ exp 1 1 − (y − Xw)⊤(y − Xw) exp − w ⊤S−1w { 2σ 2 } { 2 } ∝ exp 1 1 2 − w ⊤ 2 x ⊤ X + S−1 w − 2 w ⊤ x ⊤ y [ ] σ σ { 2( )} Posterior (from previously): ∝ exp 1 − (w ⊤Σ−1w − 2w ⊤Σ−1 μ) { 2 } Equate individual terms on each side. Covariance: w⊤Σ−1w = w⊤ Σ= 1 ⊤ X X + S−1 w [ σ2 ] 1 ⊤ X X + S−1 ( σ2 ) Mean: −1 2 ⊤ ⊤ w X y σ2 1 μ = 2 Σ⊤ y σ 2w ⊤Σ−1 μ = Bayesian Linear Regression Example Imagine we have some data. If we take as model assumption a linear model with two parameters f(x): W0 + W1x This is the kind of family of function that are compatible with the data, under the modeling assumption: 16 Bayesian Linear Regression 11 / 03 / 2022 This illustration explains the fact that we get a concentration around certain values of parameters: Predictive Distribution We have a family of functions that fit our data so if we take a new input called x* and want to predict the value of y we are not going to get just one value from y but a family of function that goes through x* so a family of value for y*. This is what we call Predictive Distribution. In order to do this we apply the sum rule and the product rule and this allows us to derive a pretty powerful expression: p (y* ∣ x, y, x*, σ 2) = p (y* ∣ x*, w, σ 2) p (w ∣ x, y, σ 2) d w ∫ Consider this p(y*|x*, w, σ 2). This is the likelihood function that we impose on the model. So this is giving me what is p(y*) given x*, w, σ 2. So if you give me w I know how to predict on x*, I just evaluate my function at x*. And this is going to be my mean of the distribution over y*. But now I have the posterior over w, the distribution over w given data. So what I can do is to assign a weight to each of my prediction according to how good my parameters are, and for good I mean something tells me how large is the posterior of p(w), of w given data. And if I do that something magical happens! My prediction does not contain parameters anymore. My prediction on y* is going to be a distribution where there is no w anymore so w disappears and this is very elegant. We are doing prediction parameter free! We don’t need to optimize parameter. And the way it is done is thanks to this predict distribution, this simple expression: p (y* ∣ x, y, x*, σ 2) = p (y* ∣ x*, w, σ 2) p (w ∣ x, y, σ 2) d w ∫ Same tedious exercise as before yields: p (y* ∣ X, y, x*, σ 2) = 𝒩 (y* ∣ x⊤* μ, σ 2 + x⊤* Σx*) Here we can see that the mean of the prediction is going to be centered on the mean of w while the covariance is going to have the covariance of w, which is sigma, and we are going to have this quadratic form with x*. Introducing basis functions Now we can transform our input through basis functions and apply the same machinery. 
Instead of working with x we will work with ϕ(x) x → φ(x) = (φ1(x), …, φD(x)) ⊤ 17 Bayesian Linear Regression 11 / 03 / 2022 Somehow if we think of the matrix phi instead of the matrix x we are going to have N x D where we evaluate for each xi we evaluate the function from ϕ1 to ϕD and do it for all the N: Applying Bayesian Linear Regression on the transformed features gives: Covariance: Σ= p (w ∣ X, y, σ 2) = 𝒩(w ∣ μ, Σ) 1 ⊤ Φ Φ + S−1 ( σ2 ) −1 1 ΣΦ⊤ y σ2 ⊤ ⊤ Predictions: p (y* ∣ X, y, x*, σ 2) = 𝒩 (y* ∣ φ (x*) μ, σ 2 + φ (x*) Σφ (x*)) Mean: μ= The important thing about this is that now we increase the flexibility of our model tremendously because now we don’t need to work with hyperplanes, we can work with something which is more involved, non-linear functions, but the beautiful thing about this is that we still have a model which is linear in the parameters because we still have a combination of this functions. The parameters still enter linearly in the equations and that’s what makes this a linear model even when we have non-linear functions. We can do polynomial, sine, cosine, log, exp, whatever we want provided that we combine them linearly we can still apply this machinery, we can still be Bayesian, we can still use this posterior distribution over parameters, we can still do prediction and everything is going to be gaussian, the posterior is going to be Gaussian, the predicted distribution is going to be gaussian. So Bayesian linear regression is a solved problem! The only problem is how to choose these basis functions. Predictions Here we can see polynomial of order 2: Which is a family of functions generated by sampling from the posterior distribution of the parameters. Computing posterior: recipe (Assuming prior conjugate to likelihood) I. Write down prior times likelihood (ignoring any constant terms) II. Write down posterior (ignoring any constant terms) III. Re-arrange them so the look like one another IV. Equate terms on both sides to read off parameter values. 18 Bayesian Linear Regression 11 / 03 / 2022 Marginal likelihood So far, we've ignored p(y|X, σ 2), the normalizing constant in Bayes rule. We stated that it was equal to: p (y ∣ X, σ 2) = ∫ p (y ∣ X, w, σ 2) p(w)d w We're averaging over all values of w to get a value for how good the model is, how likely is y given X and the model. We can use this to compare models and to optimize σ 2! When prior is 𝒩(μ0, Σ0) and likelihood is𝒩(Xw, σ 2 I), marginal likelihood is: p (y ∣ X, y, σ 2, μ0, Σ0) = 𝒩 (y ∣ Xμ0, σ 2 I + XΣ0 X⊤) i.e. an-dimensional Gaussian evaluated at y. If we use the marginal likelihood as a criterion of model selection we can see that 2 is the best choice: Choosing a prior How should we choose the prior? I. Prior effect will diminish as more data arrive; II. When we don't have much data, prior is very important. Some influencing factors: I. Data type: real, integer, string, etc. II. Expert knowledge: 'the coin is fair', 'the model should be simple’; III. Computational considerations (not as important as it used to be!) IV. If we know nothing, can use a broad prior e.g. uniform density. Summary I. Moved away from a single parameter value. II. Saw how predictions could be made by averaging over all possible parameter values Bayesian. III. Saw how Bayes rule allows us to get a density for W conditioned on the data (and other stuff). IV. Computing the posterior is hard except in some cases.... V. ....we can do it when things are conjugate. VI. Can also (sometimes) compute the marginal likelihood.... VII. 
...and use it for comparing models, with no need for costly cross-validation. 19 Gaussian Process 18/03/2022 2 - Gaussian Process Gaussian Process Linear models requires specifying a set of basis functions: Polynomials, Trigonometric, … Can we use Bayesian inference to let data tell us? Gaussian Processes kinda do that, they work implicitly with an infinite set of basis functions and learn a probabilistic combination of these. Also, as we increase the number of data, the model capacity grows with the number of data, so the more data we use the more complex the model become automatically. If we have an infine set of basis functions as we increase the number of data our posterior is going to keep changing because we add data to the problem. And somehow because there is an infinite number of this basis function what happen is that if we really use a lot of data the posterior really can capture all these data without any problem because there is an infinite number of basis functions. When we have a linear model in which we just have the identity function as basis function, the capacity of this model can’t improve, we just have a basis function which is fixed. Bayesian Linear Regression as a Kernel Machine We are going to show that predictions can be expressed exclusively in terms of scalar products as follows: k (x, x′) = ψ (x)⊤ψ (x′) This allows us to work with either k( ⋅ , ⋅ ) or ψ ( ⋅ ) Why is this useful? Because a scalar product it’s just a function that takes two input and gives a scalar. What are the properties of this number? If it’s a scalar product we know that the function must be positive definite. If we now choose the basis functions we will end up with a formulation of this kind: k (x, x′) = ψ (x)⊤ ψ (x′) . But if we think about it as a function that takes two argument and spits out a scaler the other way we can see this is that whenever we have this scalar product we can replace it with the function k. Now as long as we choose function k(x,x’) that is positive definite I may also do the other way around: instead of specifying the basis function I could specify k and that k may induce a psi of any kind. Also, if I choose a certain k I can assure that psi is infinite dimensional and I don’t need to know what psi is, all I need to know is what the scalar product is. We can choose k such that psi is infinte dimensional and all I need to know is the scala product between the two vector psi. This is really what people refers to the kernel trick in SVM for example. Bayesian Linear Regression as a Kernel Machine This is one reason why k (x, x′) = ψ (x)⊤ ψ (x′) is useful. The other one is the following: if we work with basis functions what we need to do is to be able to store the covariance matrix inverted for the posterior over the parameters. We have to remember that matrix is DxD so we have to compute a quadratic function and if we want to factorize the matrix it’s going to cost D3 time. So, working with ψ ( ⋅ ) costs O(D2) storage, O(D3) time. Things are different with k: working with k( ⋅ , ⋅ ) costs O(N2) storage, O(N3) time. So gaussian processes are nice but this constraints are pretty bad. One way to do that, if we have a lot of features and few data, it makes sense to work with this formulation of kernel because we only evaluate things which are NxN instead of DxD. So this is one reason why we would use gaussian processes, when we have more features than observation. But what if we could pick k( ⋅ , ⋅ ) so that ψ ( ⋅ ) is infinite dimensional?
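A quick numerical check of this equivalence (a sketch assuming NumPy, a prior w ~ 𝒩(0, I), and random basis-function expansions, all made up for illustration): the weight-space posterior predictive and the purely kernel-based expressions give the same mean and variance.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, sigma2 = 20, 5, 0.1

Phi = rng.normal(size=(N, D))       # basis-function expansions of the training inputs
phi_star = rng.normal(size=D)       # expansion of a test input
y = rng.normal(size=N)              # some training targets

# --- Weight-space view: posterior over w with prior N(0, I) ---
Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(D))
mu = Sigma @ Phi.T @ y / sigma2
mean_w = phi_star @ mu
var_w = sigma2 + phi_star @ Sigma @ phi_star

# --- Kernel (function-space) view: only scalar products are needed ---
K = Phi @ Phi.T                     # k(x_i, x_j) = psi(x_i)^T psi(x_j)
k_star = Phi @ phi_star
k_ss = phi_star @ phi_star
A = np.linalg.inv(K + sigma2 * np.eye(N))
mean_k = k_star @ A @ y
var_k = sigma2 + k_ss - k_star @ A @ k_star

print(np.allclose(mean_w, mean_k), np.allclose(var_w, var_k))   # both True
```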
Kernels It is possible to show that for k (x, x′) = exp − ( x − x′ 2 2 ) 20 Gaussian Process 18/03/2022 there exists a corresponding ψ ( ⋅ ) that is infinite dimensional! Of course there are other kernels satisfying this property. To show that Bayesian Linear Regression can be formulated through scalar products only, we need Woodbury identity: (A + UCV)-1 = A-1 A-1 U(C-1 + VA-1 U)-1 VA-1 Do not memorize this! Gaussian Processes Gaussian Processes can be explained in two ways 1. Weight Space View (Bayesian linear regression with infinite basis functions); 2. Function Space View (Defined as priors over functions) Gaussian Processes Prior over Functions Consider an infinite number of Gaussian random variables: think of them as indexed by the real line and as independent and denote them as f(x). If we look at the covariance of all these random variables we are going to get something which is infinte dimensional and diagonal. Kernel 2 Consider the Gaussian kernel again: k (x, x′) = α exp(−β | | x − x′| | ) . We introduced some parameters for added flexibility. We can see how multiplying for alpha and beta effetely we multiply the overall shape of the polynomial where alpha > 0 -> stretch vertically the basis function. NB: if we multiply x and x’ by different coefficient we will have trouble because it’s difficult to prove it’s a kernel. Gaussian Processes Prior over Functions Now imagine that we use this function here so say something of this kind: this function is going to be a gaussian (it’s going to have a bump in x-x’ = 0). Imagine now to use this function to impose covariance on the random variables that are around 0, around the middle point of the plot. If we do that, if we fix x=0, and we say that all the random variables around 0 should behave in a way that they covary with x=0 according to the function that we decided. This can be used as a prior over functions instead of parameters, who multiply parameters by basis functions, and combination of basis function and so on. 21 Gaussian Process 18/03/2022 Now we can play around with alpha and beta. These are Infinite Gaussian random variables with parameterized and input-dependent covariance: We can also use all this for model selection and if we can have access to the marginal likelihood of the model we can optimize the marginal likelihood w.r.t. alpha and beta to get something nice. But how can we deal with infinity? We are still talking about an infinite number of random variables, all of them are Gaussian. How can I handle them? If I think of N random variables joined into a gaussian, if I only care about a few of them I don’t really need to know what the others do. If I look at the covariance I have an infinite number of variable and they are all gaussian so I have this infinite by infinite covariance! But if I only care about what happens at a certain point, at certain random variables then the only thing I need to do is to take this big covariance, select the rows and the columns corresponding to the random variable I am looking at and throw away anything else. If I do this I will obtain a matrix which is going to be NxN. Let’s have a look formally at what we have just done. We have the vector f, which is the vector of realization of these random variable f(x1), … ,f(xN). Then the distribution over these guys based on the construction we have just done tells us that they have 0 mean and K covariance. 
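As a minimal illustration (assumptions: NumPy; the values of α, β and the input grid are arbitrary choices), we can draw functions from this prior by evaluating the kernel on a finite grid of inputs and sampling from 𝒩(0, K):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 1.0, 10.0
x = np.linspace(0, 1, 100)                       # finite set of input locations

# Kernel matrix K[i, j] = alpha * exp(-beta * (x_i - x_j)^2)
K = alpha * np.exp(-beta * (x[:, None] - x[None, :]) ** 2)

# Draw a few functions from the prior N(0, K); a small jitter keeps the Cholesky stable
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))
f_draws = L @ rng.normal(size=(len(x), 5))       # each column is one sampled function
print(f_draws.shape)                             # (100, 5)
```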
So, The marginal distribution of f=(f(x1), … ,f(xN))T is p(f I X) = N(0, K), with: m̄ = k⊤* K−1f s̄2 = k** − k⊤* K−1k* 22 Gaussian Process 18/03/2022 The definition of k* and K are the same as before: k* is the vector of evaluating the kernel between x* and all the other data points in the training; K is going to be the matrix obtained by evaluating the covariance across all pairs of training points; k** is the evaluation of K between x* and itself. So now we have a prior. What do we do? Let’s introduce a likelihood. So that we may will find a way to get posterior over this functions. Remember that when we modeled labels y in the linear model we assumed noise with variance sigma2 around wTx. Let’s do the same thing in Gaussian process but now the likelihood maybe we want to center it across f. We are going to put a prior in f we have a likelihood and then we are going to be able to find perhaps a posterior over f. The way we do it is as before: we say that the likelihood is independent across observations (which doesn’t mean that it’s completely random) and that p(y ∣ f) = ∏i=1 p (yi ∣ fi) N with p (yi ∣ fi) = 𝒩 (yi ∣ fi, σ 2) It’s just as before excepts that now fi is not computed as wTxi but its’ just our fi that we put a gaussian process prior over. Likelihood and prior are both Gaussian conjugate! Because we have a gaussian prior over f and the likelihood is gaussian. And so we can compute the marginal likelihood and the posterior. We can integrate out the Gaussian process prior over f: p(y ∣ X) = p(y ∣ f)p(f ∣ X)d f ∫ The marginal likelihood is going to be something which is p(y|X) and it’s going to give us something pretty simple because when we have the integral like that we can just sum the variances: p(y ∣ X) = 𝒩 (0, K + σ 2 ∣ ) We can derive the predictive distribution as follows: p ( f* ∣ y, x*X) = p ( f* ∣ f, x*, X) p(f ∣ y, X)d fd f* = 𝒩 (m, s 2) ∫ With m = k⊤* (K + σ 2 I) s 2 = k** − k⊤* (K + σ 2 I) −1 −1 y k* Same expression as in the "Weight-Space View" section. Let’s use this posterior with the gaussian condition we can get the probability of f* given data. Again f disappears because we remove the dependence of w because we sum over all possibile values of w. Here we do the same summing over all possibile value of f weight by how good they are given data. Gaussian Processes Regression example Some data generated as a noisy version of some function 23 Gaussian Process 18/03/2022 Draws from the posterior distribution over f. on the real line Optimization of Gaussian Process parameters The kernel has parameters that have to be tuned. Alpha and beta control the kind of family of function that can be used: k (x, x′) = α exp (−β x − x′ 2 ) 2 and there is also the noise parameter σ in the sense that we can thing of this sigma square controlling the variance of the likelihood and also as a modeling choice. So let’s put them all into a parameter vector called theta: θ = (α, β, σ 2) For simplicity let define C=K+σ 2I. Maximize the logarithm of the likelihood: p(y ∣ X, θ ) = 𝒩(0, C) that is − 1 1 log | C | − y⊤C−1y + const 2 2 Derivatives can be useful for gradient-based optimization ∂ log[ p(y ∣ x, θ )] 1 ∂C 1 ∂C −1 = − Tr C−1 + y⊤C−1 C y ∂θi 2 ( ∂θi ) 2 ∂θi 24 Gaussian Process 18/03/2022 Summary Introduced Gaussian Processes - Weight space view - Function space view Gaussian processes for regression Optimization of kernel parameters To think about: - Gaussian processes for classification? - Scalability? 
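To close the chapter, here is a minimal Gaussian-process regression sketch (assumptions: NumPy, the Gaussian kernel above with hand-picked α, β and σ², and synthetic data from a noisy sine) implementing the predictive mean and variance formulas given in this chapter.

```python
import numpy as np

def kernel(a, b, alpha=1.0, beta=10.0):
    """Gaussian kernel k(x, x') = alpha * exp(-beta * (x - x')^2) on scalar inputs."""
    return alpha * np.exp(-beta * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(4)
N, sigma2 = 30, 0.01
x = rng.uniform(0, 1, size=N)
y = np.sin(2 * np.pi * x) + np.sqrt(sigma2) * rng.normal(size=N)   # noisy function values

x_star = np.linspace(0, 1, 200)            # test inputs
K = kernel(x, x)                           # N x N training covariance
K_star = kernel(x, x_star)                 # N x N* cross-covariance
k_ss = kernel(x_star, x_star).diagonal()   # prior variance at each test input

A = np.linalg.solve(K + sigma2 * np.eye(N), np.eye(N))     # (K + sigma^2 I)^{-1}
mean = K_star.T @ A @ y                                    # predictive mean m
var = k_ss - np.einsum('ij,jk,ki->i', K_star.T, A, K_star) # predictive variance s^2 of f*

print(mean.shape, var.shape)   # (200,), (200,)
```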
25 04 - Bayesian Logistic Regression and the Bayesian Classifier 08/04/2022 3 - Bayesian Logistic Regression and the Bayesian Classifier Classification A set of N objects with attributes (usually vector) xn: Each object has an associated response (or label) yn. Binary classification: yn ∈ {0,1} or y, ∈ {-1,1} (depends on algorithm). Multi-class classification: yn ∈ {1,2,... K}. Probabilistic v non-probabilistic classifiers ⊤ ⊤ Classifier is trained on X = (x1, …, xN) and y = (y1, …, yn) and then used to classify x*. In the same kind of vain of what we saw in regression in the end what we really want is to end up with an expression of this kind: P (y* = k ∣ x*, x, y) So we want to know the probabilty of a class membership y* equal to k given x* (obviously because it’s the new feature vector we want to know the label) and obviously previous data. So we want to use information from training data to say something about the label of a new test point. So we wanto to estimate the distribution over the class label y* given the information from training data. So for example for binary calssification, P (y* = 1 ∣ x*, X, y) and P (y* = 1 ∣ x*, X, y). For non-probabilistic classifiers, instead, we would have something that does not give us a full probability distribution over the class labels but just something that says whether y is equal to 1 or 0. But probabilities provide us with more information P(y* = 1) = 0.6 is more useful than y* = 1. It tells us how sure the algorithm is. Particularly important where cost of misclassification is high and imbalanced. e.g. Diagnosis: telling a diseased person they are healthy is much worse than telling a healthy person they are diseased. Extra information (probability) often comes at a cost. Classification syllabus We will study two probabilistic classifiers: • Bayes classifier; • Logistic regression. Some data Squares and circles are the two classes. We want to find a boundary to separate these two classes so that when we have a new data point we know with what probability this is going to be classified. 26 04 - Bayesian Logistic Regression and the Bayesian Classifier 08/04/2022 Logistic regression Similarly to regression, we could think about modeling P(y*= k|x*, w) through some f(×*; w) with parameters w. Before we saw this w T x, which is a linear combination of the features. Can we use this here? No output is unbounded and so can't be a probability. Also, the sum of probability needs to be 1 and each of the P of y* equal to any class has to be something between 0 and 1. We cannot have anything grater than 1. So really w T x doesn’t works. Unless… We think of something that really squashes these values in order for them to lie in the interval 0-1. So if we apply a transformation h to this linear function then we can get something that is always between 0 and 1: P (y* = k ∣ x*, w) = h ( f (x*; w)) where h(.) squashes f(x*;w) to lie between 0 and 1. h( ⋅ ) For logistic regression (binary), we use the sigmoid function: P (y* = 1 ∣ x*, w) = h (w ⊤ x*) = 1 1 + exp (−w ⊤ x*) When wTx is negative and large we get something which is exp( something large ), which is something huge. So 1/something huge is close to 0. When I go left things become close to 0, on the right become closer to 1. When wTx is 0 (center) we end up with 1/2. So for any x* that we give to this function we know that the output is going to be between 0 and 1. This could really be something that models the probability of for example class 1. 
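As a small concrete sketch (assumptions: NumPy; the weights and data are synthetic and made up), here are the squashing function and the log-likelihood Σ_n log P(y_n | x_n, w) that the Bayesian treatment below builds on.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    """Sum of log P(y_n | x_n, w) for binary labels y_n in {0, 1}."""
    p = sigmoid(X @ w)                       # P(y_n = 1 | x_n, w)
    eps = 1e-12                              # avoid log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

rng = np.random.default_rng(5)
N, D = 100, 2
X = rng.normal(size=(N, D))
w_true = np.array([2.0, -1.0])               # hypothetical generating weights
y = (rng.uniform(size=N) < sigmoid(X @ w_true)).astype(float)

print(log_likelihood(w_true, X, y), log_likelihood(np.zeros(D), X, y))
# the generating weights should score (much) higher than w = 0
```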
Bayesian logistic regression Recall the Bayesian ideas from two weeks ago…. In theory, if we place a prior on w and define a likelihood we can obtain a posterior: p(w ∣ X, y) = p(y ∣ X, w)p(w) p(y ∣ X ) Then what we can do with this posterior is make prediction so we can take the expectation under the posterior of the predicted distribution: P (y* = 1 ∣ x*, X, y) = Ep(w∣X,y) [P (y* = 1 ∣ x*, w)] Defining a prior We can define a prior, so for example choose a Gaussian prior: 27 04 - Bayesian Logistic Regression and the Bayesian Classifier p(w ∣ s) = 08/04/2022 D ∏ 𝒩(0,s) d=1 Prior choice is always important from a data analysis point of view. Previously, it was also important 'for the maths’. This isn't the case today could choose any prior no prior makes the maths easier! Defining a likelihood First assume independence: p(y ∣ X, w) = N p y ∣ x ,w ∏ ( n n ) n=1 So knowing what happens for one x doesn’t tell me anything about the outcome for another x. In the regression case the noise terms, y, they are all independent. We have already defined this! If y n = 1: P (yn = 1 ∣ xn, w) = and if y n = 0: 1 1 + exp (−w ⊤ xn) P(yn = 0 | xn, w) = 1 − P(yn = 1 | xn w) Posterior p(w ∣ y, X, s) = p(y ∣ X, w)p(w ∣ s) p(y ∣ X, s) Now we have a likelihood — p(y|X,w) — a prior — p(w|s) — and so the posterior is also given s somehow but then we will drop this s because we don’t really need it. Although we can do model selection with this s if we could compute the marginal likelihood of y|X,s. Remember that the denominator gives a way to do the model selection, we can think of s as controlling a continuum of an infinite number of models that are all similar but vary the way these variance is tuned. We can't compute p(w|y, X, s) analytically. Prior is not conjugate to likelihood. No prior is! This means we don't know the form of p(w|y, X,s), and we can't compute the marginal likelihood: p(y ∣ X, s) = p(y ∣ X, w)p(w ∣ s)d w ∫ Because the integral is not analytically available. What can we compute? For simplicity, let's drop the dependence on s. We can compute p(y|X, w)p(w). The product it’s not a problem at all. What we cannot compute is the normalization in the denominator. Define g(w) = p(y|X, w)p(w) for notation. Armed with this, we have three options: 1. If I can find g(x) I can also optimize it. And if so, I can find the most likely value of w, a point estimate; 2. Approximate p(w | y, X) with something easier. The gaussian here is one of the main character; 3. Sample from p(w|y, X). Even though we cannot normalize it! We'll cover examples of each of these in turn… These are not the only ways of approximating/ sampling! They are also general not unique to logistic regression. MAP estimate (Maximum A Posteriori) Out first method is to find the value of w that maximizes p(w|y, x) (call it w). Since g(w) is proportional to p(w|y, X), ŵ therefore also maximizes g(w). This is similar to maximum likelihood but has an additional effect of prior. 28 04 - Bayesian Logistic Regression and the Bayesian Classifier 08/04/2022 Once we have ŵ we make predictions with: P (y* = 1 ∣ x*, ŵ ) = 1 1 + exp (− ŵ ⊤ x*) When we met maximum likelihood, we could find ŵ exactly with some algebra. We can't do that here (can't solve ∇w log g(w) = 0) but we can resort to numerical optimization: 1. Guess ŵ ; 2. Change it a bit in a way that increases g(w); 3. Repeat until no further increase is possible. Many algorithms exist that differ in how they do step 2. e.g. Newton-Raphson. [Not covered in this course. 
You just need to know that sometimes we can't do things analytically and there are methods to help us!]. Decision boundary Once we have ŵ , we can classify new examples. Decision boundary is a useful visualization: We can now start playing around with this classification rule. What we see here is what correspond to P(y*=1|x*,ŵ )=0.5. Predictive probabilities But the most interesting thing is looking at the predictive distribution as a whole. What is the set of points for which my predictive distribution is a certain value between 0 and 1? Do these boundaries look sensible? Not really! The classifier it’s quite poor. But we are going to see that being Bayesian we can bend the contours. 29 04 - Bayesian Logistic Regression and the Bayesian Classifier 08/04/2022 Roadmap Find the most likely value of w a point estimate. Approximate p(w|y, X) with something easier. Sample from p(w|y, X). Laplace approximation Approximating p(w|y, X) with another distribution. i.e. Find a distribution q(w|y, X) which is similar. What is ‘similar’? • Mode (highest point) in same place. • Similar shape? • Might as well choose something that is easy to manipulate! Approximate p(w|y, X, s) with a Gaussian: q(w ∣ y, X ) = 𝒩(μ, Σ) Where μ = w,̂ Σ−1 = − ∇w ∇w log[g(w)] ŵ And ŵ = argmax log[g(w)] w We already know ŵ because it is the maximum a posteriori. What is the justification for this obscure expression for the covariance of q? It is based on Taylor expansion of log[g(w)] around mode (ŵ ). Means approximation will be best at mode. Expansion up to 2nd order terms 'looks' like a Gaussian. Laplace approximation 1D example Laplace approximation of the Gamma density function: p(y ∣ α, β ) ∝ y α−1 exp(−β y) α−1 ŷ= β ∂ log y α−1 = − ∂y 2 y2 ∂ log y ∂y 2 ŷ =− q(y ∣ α, β ) = 𝒩 α−1 y 2̂ α − 1 y 2̂ , α − 1) ( β 30 04 - Bayesian Logistic Regression and the Bayesian Classifier 08/04/2022 Solid: true density. Dashed: approximation. Left: a = 20, B = 0.5 Right: a = 2, B = 100 Approximation is best when density looks like a Gaussian (left). Approximation deteriorates as we move away from the mode (both). Laplace approximation for logistic regression • Not going into the details here; • p(w ∣ y, X ) ≈ q(w ∣ y, X ) = 𝒩(w ∣ μ, Σ); • Find μ = ŵ (that maximizes g(w)) by Newton-Raphson (already done it MAP). • Find: −1 • Σ = − ∇w ∇w log[g(w)] ŵ • How good an approximation is it? Black - approximation. Grey - p(w|y,X). Approximation is OK. As expected, it gets worse as we move away from the mode. Predictions with the Laplace approximation We have 𝒩(μ, Σ) as an approximation to p(w|y, X). Can we use it to make predictions? Need to evaluate: P (y* = 1 ∣ x*, x, y) = E𝒩( μ,Σ) [P (y* = 1 ∣ x*, w)] 1 = 𝒩( μ, Σ) dw ∫ 1 + exp (−w⊤ x*) Cannot do this! So, what was the point? Sampling from an expectation with samples! 𝒩(μ, Σ) is easy. And we can approximate 31 04 - Bayesian Logistic Regression and the Bayesian Classifier Draw S samples 08/04/2022 w1, …, wS from z𝒩( μ, Σ) 1 S 1 E𝒩( μ,Σ) [P (y* = 1 ∣ x*, w)] ≈ S∑ 1 + exp (−w⊤s x*) s=1 Contours of P(y* = 1|x*, y, X). Better than those from the point prediction? Because in one case we use 1 decision boundary, 1 value of w. It’s still not perfect. Summary roadmap • Defined a squashing function that meant we could model P (y* = 1 ∣ x*, w) = • Wanted to make 'Bayesian predictions': average over all posterior values of w. • Couldn't do it exactly. • Tried a point estimate (MAP) and an approximate distribution (via Laplace). • Laplace probability contours looked more sensible (to me at least!) 
• Next: • Find the most likely value of w a point estimate. • Approximate p(w|y, X) with something easier. • Sample from p(w|y, X). h (w ⊤ x*) 32 04 - Bayesian Logistic Regression and the Bayesian Classifier 08/04/2022 MCMC sampling Laplace approximation still didn't let us exactly evaluate the expectation we need for predictions. Good news! If we're happy to sample, we can sample directly from p(w|y, X) even though we can't compute it! i.e. don't need to use an approximation like Laplace. Various algorithms exist we'll use Metropolis-Hastings. Back to the script: Metropolis-Hastings Produces a sequence of samples - w1, w2, . . . , ws; . . . Imagine we've just produced ws-1. MH firsts proposes a possible ws (call it w̃s) based on ws-1. MH then decides whether or not to accept ws • If accepted, ws = w̃s • If not, ws = ws-1. Two distinct steps: proposal and acceptance. MH proposal Treat ws as a random variable conditioned on ws-1. i.e. need to define p(w̃s|ws-1) Note that this does not necessarily have to be similar to posterior we're trying to sample from. Can choose whatever we like! e.g. use a Gaussian centered on ws-1 with some covariance: p ( w̃ s ∣ ws−1, Σp) = 𝒩 (ws−1, Σp) MH acceptance Choice of acceptance based on the following ratio: the posterior in the point where we are going over the posterior where we are currently. So the acceptance rate is going to be higher if we move to a point whit higher posterior density compared to when we move to a point where the posterior density is worst. r = p ( w̃ s ∣ y, X) p (ws−1 ∣ w̃ s, Σp) p (ws−1 ∣ y, X) p w̃ ∣ w , Σ s−1 p) ( s So this r is going to be greater than 1 when we move to a point where density is improved and is going to be less than 1 when decrease the posterior density. Which simplifies to (all of which we can compute): r = g ( w̃s ; y, X) p (ws−1 ∣ w̃ s, Σp) g (ws−1; y, X) p w̃ ∣ w , Σ s−1 p) ( s 33 04 - Bayesian Logistic Regression and the Bayesian Classifier 08/04/2022 What does this ratio tell us? It tells us with what probability we should accept the move. Whenever we go to a point where the density improves we always accept, but if we did reject every move whenever we go downward this is just optimization in the end, so the algorithm will just end up with a local optimum. But what we want is samples for the posterior. For this to happen we have to allow to also move to values of the posterior which are lower. So that’s the reason why we have this probabilistic acceptance of the move. When we go downwards we still accept with the probability given by the ratio. This is the key of this algorithm, this is what makes MH algorithm converge and give us overall samples from the true posterior distribution. But we can see that there is another term which tells us the “opposite”: from w̃s to here, to ws−1 compared going from ws−1 to w̃s. The reason for that is that it could be more likely to go from B to A instead of going from A to B and somehow we have to account that. That’s why the second term is necessary. We now use the following rules: If r ≥ 1, accept: ws =w̃s . If r < 1, accept with probability r. If we do this enough, we'll eventually be sampling from p(w|y, X), no matter where we started! i.e. for any w1. MH flowchart MH walkthrough 34 04 - Bayesian Logistic Regression and the Bayesian Classifier 08/04/2022 What do the samples look like? Predictions with MH MH provides us with a set of samples W1, ... , Ws These can be used like the samples from the Laplace approximation. 
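As a concrete illustration of the whole loop, here is a minimal random-walk Metropolis-Hastings sketch for Bayesian logistic regression (assumptions: NumPy, a 𝒩(0, sI) prior, a symmetric Gaussian proposal, and synthetic data; all settings are made up). It works with log g(w) = log p(y|X, w) + log p(w), so the acceptance ratio is handled in log space for numerical stability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_g(w, X, y, s=10.0):
    """log p(y | X, w) + log p(w): the unnormalized log-posterior."""
    p = sigmoid(X @ w)
    log_lik = np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    log_prior = -0.5 * np.sum(w ** 2) / s
    return log_lik + log_prior

rng = np.random.default_rng(6)
N, D = 200, 2
X = rng.normal(size=(N, D))
w_true = np.array([1.5, -2.0])                     # hypothetical generating weights
y = (rng.uniform(size=N) < sigmoid(X @ w_true)).astype(float)

S, step = 5000, 0.2                                # chain length and proposal scale
w = np.zeros(D)                                    # arbitrary starting point w_1
samples = np.empty((S, D))
for s_idx in range(S):
    w_prop = w + step * rng.normal(size=D)         # symmetric proposal: the q terms cancel
    log_r = log_g(w_prop, X, y) - log_g(w, X, y)   # log of the acceptance ratio r
    if np.log(rng.uniform()) < log_r:              # accept with probability min(1, r)
        w = w_prop
    samples[s_idx] = w

# Monte Carlo prediction for a test point, averaging over posterior samples
x_star = np.array([1.0, 1.0])
burn = 1000                                        # discard early samples
p_star = sigmoid(samples[burn:] @ x_star).mean()
print("P(y* = 1 | x*, data) approx", p_star)
```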
Summary
• Introduced logistic regression: a probabilistic binary classifier.
• Saw that we couldn't compute the posterior.
• Introduced examples of three alternatives:
• Point estimate: MAP solution.
• Approximate the density: Laplace.
• Sample: Metropolis-Hastings.
• Each is better than the last (in terms of predictions)...
• ...but each has greater complexity!
• To think about: what if the posterior is multi-modal?

3.1 - Bayesian Classifier

Bayes classifier
The Bayesian classifier uses Bayes rule as follows:

$$P(y_* = k \mid X, y, x_*) = \frac{p(x_* \mid y_* = k, X, y)\, P(y_* = k)}{\sum_j p(x_* \mid y_* = j, X, y)\, P(y_* = j)}$$

We need to define a likelihood and a prior and we're done! What is this p(x*|y*)? This is the "opposite" of before: instead of modeling y we are modeling x! So we are looking at the distribution of the inputs for a given class.

Bayes classifier likelihood

$$p(x_* \mid y_* = k, X, y) = p\big(x_* \mid y_* = k, \theta(X, y)\big)$$

How likely is x* if it is in class k? (Not necessarily a probability...) We are free to define this class-conditional distribution as we like. It will depend on the data type:
• D-dimensional vectors of real values: Gaussian likelihood.
• Number of heads in N coin tosses: Binomial likelihood.
Training data with y = k are used to determine the parameters of the likelihood for class k (e.g. Gaussian mean and covariance). The parameters θ encode information from the data X and y.

Bayes classifier prior
P(y* = k) is used to specify prior probabilities for the different classes, e.g.:
• There are far fewer instances of class 0 than class 1: P(y* = 1) > P(y* = 0).
• No prior preference: P(y* = 0) = P(y* = 1).
• Class 0 is very rare: P(y* = 0) << P(y* = 1).

Naive-Bayes
Naive-Bayes makes the following additional likelihood assumption: the components of x* are independent for a particular class:

$$p(x_* \mid y_* = k, \theta) = \prod_{d=1}^{D} p\big((x_*)_d \mid y_* = k, \theta\big)$$

where D is the number of dimensions and (x*)_d is the value of the d-th component. Often used when D is high: fitting D univariate distributions is easier than fitting one D-dimensional one.

Bayes classifier, example 1
Each object has two attributes: x = [x1, x2]^T; K = 3 classes. We'll use Gaussian class-conditional distributions (with the Naive-Bayes assumption) and a uniform prior P(y* = k) = 1/K.

Step 1: fitting the class-conditional densities

$$p(x \mid y = k, X, y) = p(x \mid y = k, \theta) = \prod_{d=1}^{2} \mathcal{N}\big(\mu_{kd}, \sigma^2_{kd}\big), \qquad \mu_{kd} = \frac{1}{N_k}\sum_{n:\, y_n = k} x_{nd}, \qquad \sigma^2_{kd} = \frac{1}{N_k}\sum_{n:\, y_n = k} (x_{nd} - \mu_{kd})^2$$

What we need to do here is to estimate p(x|y = k), and we start with k = the red class. Information from the dataset X and y is encapsulated in the parameter θ, which is the mean and the variance for the two components. The mean and the variance are estimated by maximum likelihood, if we want. Keeping things simple, we call this a Bayesian classifier simply because we are using Bayes' theorem to express p(y|x) in terms of p(x|y). (A small sketch of this fitting step is given below.)
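As a rough illustration of Step 1 (a sketch of mine, not the lecture's code), here is how the per-class, per-dimension means and variances could be estimated with maximum likelihood under the Naive-Bayes assumption:

```python
import numpy as np

def fit_class_conditionals(X, y, K):
    """Fit diagonal (naive-Bayes) Gaussian class-conditionals.
    Returns per-class means mu[k, d] and variances var[k, d]."""
    N, D = X.shape
    mu = np.zeros((K, D))
    var = np.zeros((K, D))
    for k in range(K):
        Xk = X[y == k]                  # points labelled with class k
        mu[k] = Xk.mean(axis=0)         # mu_kd = (1/N_k) sum_{n: y_n=k} x_nd
        var[k] = Xk.var(axis=0)         # sigma^2_kd, the maximum-likelihood variance
    return mu, var
```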
Once we have estimated one mean vector for each class (green, red, blue) and one diagonal covariance for each, we are ready to make our predictions. The density is a product of Gaussians evaluated over the dimensions, and we can calculate it for each of the three classes:

$$p(x_* \mid y_* = k, X, y) = \prod_{d=1}^{D} \mathcal{N}\big(\mu_{kd}, \sigma^2_{kd}\big)$$

Compute predictions
Once we have done this, we normalise the densities calculated for all the classes, so that how strongly each component supports the point coming from that class is just a ratio of densities. Remember that we assumed P(y* = k) = 1/K.

$$P(y_* = k \mid x_*, \theta) = \frac{p(x_* \mid y_* = k, \theta)\, P(y_* = k)}{\sum_j p(x_* \mid y_* = j, \theta)\, P(y_* = j)}$$

Remember that here we are not really completely Bayesian.

Bayes classifier, example 2
Data are the number of heads in 20 tosses (repeated 50 times for each) from one of two coins:
Coin 1 (y_n = 0): x_n = 4, 7, 7, 7, 4, ...
Coin 2 (y_n = 1): x_n = 18, 16, 18, 14, 17, ...
Use binomial class-conditional densities,

$$p(x_n \mid y_n = k, r_k) = \binom{20}{x_n}\, r_k^{x_n} (1 - r_k)^{20 - x_n}$$

where θ = {r_k}, k = 1, 2, and r_k is the probability that coin k lands heads on any particular toss. Problem: predict the coin, y*, given a new count, x*. (Again assume P(y* = k) = 1/K.)

Fit the class conditionals...
Fitting is just finding r_k:

$$r_k = \frac{1}{20\, N_k}\sum_{n:\, y_n = k} x_n, \qquad r_0 = 0.287, \quad r_1 = 0.706$$

Compute predictions

$$P(y_* = k \mid x_*, \theta) = \frac{p(x_* \mid y_* = k, \theta)\, P(y_* = k)}{\sum_j p(x_* \mid y_* = j, \theta)\, P(y_* = j)}$$

Bayes classifier summary
• Decision rule based on Bayes rule.
• Choose and fit class-conditional densities.
• Decide on a prior.
• Compute predictive probabilities.
• Naive-Bayes:
• Assume that the dimensions of x are independent within a particular class.
• Our Gaussian used the Naive-Bayes assumption (we could have written p(x|y = k, ...) as a product of two independent Gaussians).

3.2 Performance Evaluation

Performance evaluation
How do we choose a classifier? Which algorithm? Which parameters? We need performance indicators. We'll cover:
• 0/1 loss.
• ROC analysis (sensitivity and specificity).
• Confusion matrices.

0/1 Loss
0/1 loss: the proportion of times the classifier is wrong. Consider a set of predictions y_1, ..., y_N and a set of true labels y*_1, ..., y*_N. The mean loss is defined as:

$$\frac{1}{N}\sum_{n=1}^{N} \delta(y_n \neq y^*_n)$$

(δ(a) is 1 if a is true and 0 otherwise.)
Advantages:
• Can do binary or multiclass classification.
• Simple to compute.
• Single value.
Disadvantage: it doesn't take into account class imbalance:
• We're building a classifier to detect a rare disease.
• Assume only 1% of the population is diseased.
• Diseased: y = 1. Healthy: y = 0.
• What if we always predict healthy (y = 0)?
• Accuracy is 99%, but the classifier is rubbish!

Sensitivity and specificity
We'll stick with our disease example. We need to define 4 quantities, the numbers of:
1. True positives (TP): the number of objects with y*_n = 1 that are classified as y_n = 1 (diseased people diagnosed as diseased).
2. True negatives (TN): the number of objects with y*_n = 0 that are classified as y_n = 0 (healthy people diagnosed as healthy).
3. False positives (FP): the number of objects with y*_n = 0 that are classified as y_n = 1 (healthy people diagnosed as diseased).
4. False negatives (FN): the number of objects with y*_n = 1 that are classified as y_n = 0 (diseased people diagnosed as healthy).
We can now define the sensitivity:

$$Se = \frac{TP}{TP + FN}$$

The proportion of diseased people that we classify as diseased. The higher the better.
In our example, Se = 0. But there is also another actor, the specificity:

$$Sp = \frac{TN}{TN + FP}$$

The proportion of healthy people that we classify as healthy. The higher the better. In our example, Sp = 1.

We would like both to be as high as possible, but often increasing one will decrease the other. The balance depends on the application. E.g. diagnosis: we can probably tolerate a decrease in specificity (healthy people diagnosed as diseased)... if it gives us an increase in sensitivity (getting diseased people right). How can we find the right spot between sensitivity and specificity?

ROC Analysis
One way is to choose a point where the classifier is maximally uncertain and use it as a threshold, for example 0.5. So classification rules involve setting a threshold, and for a probabilistic classifier we can say:

$$p(y_* \mid x_*, y, X) = 0.5$$

However, we could use any threshold we like... The Receiver Operating Characteristic (ROC) curve shows how Se and 1 − Sp vary as the threshold changes.

ROC curve
As we move the threshold, the sensitivity and specificity change. In the bottom-left corner everything is classified as 0 (sensitivity = 0 and specificity = 1); the top-right corner is where everything is classified as 1. Goal: get the curve to the top-left corner, which is perfect classification (Se = 1, Sp = 1). So we want the curve as close as possible to the top-left corner. A better classifier reaches a sensitivity of 1 while the specificity is still above 0.8; an even better one gets closer still to the corner.

AUC
We can quantify performance by computing the Area Under the ROC Curve (AUC). The higher this value, the better. For the three classifiers above: first, AUC = 0.8348; second, AUC = 0.9551; third, AUC = 0.9936. AUC is generally a safer measure than 0/1 loss.

Confusion matrices
The quantities we used to compute Se and Sp can be neatly summarised in a table. We want the values on the diagonal to be as large as possible, and ideally we would have 0 false positives and 0 false negatives. This is known as a confusion matrix. It is particularly useful for multi-class classification: it tells us where the mistakes are being made. Note that normalising the columns gives us Se and Sp.

Confusion matrices example
• 20 newsgroups data.
• Thousands of documents from 20 classes (newsgroups).
• Use a Naive Bayes classifier (~50000 dimensions (words)!)
• Details in the book chapter.
• ~7000 independent test documents.
• Summarise results in a 20 x 20 confusion matrix.
Here we see large numbers on the diagonal, which is good: our classifier works. But there are also some mixed results, e.g. we predict 17 when it is actually 19. So the algorithm is getting 'confused' between classes 20 and 16, and between 19 and 17:
• 17: talk.politics.guns
• 19: talk.politics.misc
• 16: talk.religion.misc
• 20: soc.religion.christian
Maybe these should be just one class? Maybe we need more data in these classes?

Summary
Introduced two different performance measures:
• 0/1 loss
• ROC/AUC
Introduced confusion matrices, a way of assessing the performance of a multi-class classifier.
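To tie these measures together, here is a small sketch (my own, not from the notes) that computes the 0/1 loss, the confusion-matrix counts, sensitivity and specificity from binary predictions, and sweeps the threshold of a probabilistic classifier to trace ROC points. It assumes both classes are present in the data.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """0/1 loss, confusion counts, sensitivity and specificity for labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    loss01 = np.mean(y_true != y_pred)
    se = tp / (tp + fn)          # proportion of diseased classified as diseased
    sp = tn / (tn + fp)          # proportion of healthy classified as healthy
    return loss01, (tp, tn, fp, fn), se, sp

def roc_points(y_true, p_scores, thresholds=np.linspace(0.0, 1.0, 101)):
    """Sweep the threshold on predictive probabilities to trace (1 - Sp, Se) pairs."""
    pts = []
    for t in thresholds:
        _, _, se, sp = binary_metrics(y_true, (np.asarray(p_scores) >= t).astype(int))
        pts.append((1.0 - sp, se))
    return np.array(pts)
```

The area under the resulting curve could then be approximated numerically (e.g. by the trapezoidal rule) to get the AUC.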
4 - Variational Inference

Where are we?
We did regression: a Gaussian prior plus a linear model (Gaussian likelihood), and we were able to get an analytical solution for the posterior. Then we worked on classification. We took a Gaussian prior on the parameters, but now we have a non-linear model, because we squash the output of the linear function with a sigmoid (or a cumulative Gaussian). So the posterior is not tractable anymore and we need an approximation. The solutions we saw were:
• Sample from the intractable posterior: Markov Chain Monte Carlo (a random walk with Metropolis-Hastings). Remember that the random walk consists in randomly walking through the parameter space, with a mechanism to either accept or reject each move. If we only accepted moves that improve the posterior it would be optimization; the randomness means we also accept moves to lower values of the posterior.
• Approximate the intractable posterior: collapse the posterior onto the most likely value (maximum a posteriori, MAP). This is not really Bayesian; it is closer to maximum likelihood with some regularization from the prior. We also saw the Laplace approximation (2nd-order Taylor expansion around the MAP): the best Gaussian we can find by looking at the mode and matching the curvature there.
Now we are going to see how to use variational inference for this approximation.

Refresh: Kullback-Leibler divergence
The main ingredient of variational inference is the Kullback-Leibler divergence. The KL divergence is a measure of "similarity" between probability distributions. If we take q and p, where p is the posterior we want to approximate and q is our approximating distribution, the KL is computed as:

$$\mathrm{KL}[q(z) \,\|\, p(z)] = \int q(z)\log\frac{q(z)}{p(z)}\, dz = \mathbb{E}_{q(z)}\!\left[\log\frac{q(z)}{p(z)}\right]$$

It is an expectation under q of the log of the ratio. If q is equal to p then the ratio is 1, the log of 1 is 0, so the expectation is 0: the KL is 0 when q = p. When they are different we get something different from 0. The KL satisfies a positivity constraint (it is never negative), which is why we talk about dissimilarity. Also, it is not symmetric, so KL(q||p) and KL(p||q) give different values.

KL divergence between two Gaussians:

$$\mathrm{KL}\big[\mathcal{N}(\mu_1, \sigma_1^2)\,\|\,\mathcal{N}(\mu_2, \sigma_2^2)\big] = \frac{1}{2}\left[\log\frac{\sigma_2^2}{\sigma_1^2} - 1 + \frac{\sigma_1^2}{\sigma_2^2} + \frac{(\mu_1 - \mu_2)^2}{\sigma_2^2}\right]$$

We talk about a divergence because it is not symmetric and does not even satisfy the triangle inequality, so it is not a distance. But for many purposes it is good enough: it measures how dissimilar two distributions are, and that is enough to use it as an objective we want to reduce. Ideally, if we could optimize q, we could bring this divergence to 0 and so solve the problem of Bayesian inference.

Logistic Regression as a working example
We will use logistic regression as the example. Likelihood:

$$p(y \mid X, w) = \prod_{n=1}^{N} p(y_n \mid x_n, w)$$

If y_n = 1:

$$P(y_n = 1 \mid x_n, w) = \frac{1}{1 + \exp(-w^\top x_n)}$$

and if y_n = 0: P(y_n = 0 | x_n, w) = 1 − P(y_n = 1 | x_n, w). Sometimes the likelihood is also written as:

$$p(y_n \mid x_n, w) = \mathrm{Ber}(y_n \mid p_n), \qquad p_n = \frac{1}{1 + \exp(-w^\top x_n)}$$

Inference
Using Bayes' theorem:

$$p(w \mid y, X) = \frac{p(y \mid X, w)\, p(w)}{p(y \mid X)}$$

There is no prior which is conjugate to the likelihood. This means we don't know the form of p(w|y, X). So we can't compute the marginal likelihood, because it is an integral we can't do analytically:

$$p(y \mid X) = \int p(y \mid X, w)\, p(w)\, dw$$

and we can't compute the predictive distribution:

$$p(y_* \mid x_*, y, X) = \int p(y_* \mid x_*, w)\, p(w \mid y, X)\, dw$$
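What we can always evaluate is the un-normalised log posterior, log p(y|X, w) + log p(w): the quantity the approximation below will work with. A minimal sketch of it (mine, assuming an isotropic Gaussian prior 𝒩(0, s²I)):

```python
import numpy as np

def log_unnormalised_posterior(w, X, y, s2=1.0):
    """log p(y | X, w) + log p(w) for Bayesian logistic regression,
    with a Bernoulli likelihood and an isotropic Gaussian prior N(0, s2 * I).
    (This is log g(w), i.e. the log posterior up to the constant log p(y | X).)"""
    a = X @ w                                      # linear scores w^T x_n
    # Bernoulli log-likelihood, sum_n [ y_n * a_n - log(1 + exp(a_n)) ], stably:
    log_lik = np.sum(y * a - np.logaddexp(0.0, a))
    log_prior = -0.5 * np.sum(w ** 2) / s2         # Gaussian prior, dropping constants
    return log_lik + log_prior
```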
Variational Inference
Main idea: instead of trying to solve intractable integrals, let's solve an optimization problem. We try to optimize the position and the shape of our approximation so that it matches the posterior as closely as possible: we move the mean and change the covariance to get as close as possible to the posterior. A very general recipe:
1. Introduce a set Q of distributions q(w);
2. Define an objective which measures the "distance" between an arbitrary distribution q(w) ∈ Q and p(w|y, X). We talked about the KL, and that's what we are going to use;
3. Within the set of possible solutions Q, find the best q(w), the one that minimizes the "distance" to p(w|y, X). Q can be anything, and depending on the family Q we use, we have different parameters to tune;
4. Interpret q(w) as a distribution that approximates the intractable p(w|y, X).

Visual illustration of Variational Inference
We have some posterior p(w|y, X) that we want to approximate. We choose a set of approximating distributions Q, which is a subset of the set of all possible distributions, and we try to minimize the distance between q(w) and the posterior. The optimization moves our distribution within this space to get as close as possible to p(w|y, X). If the posterior does not have the same form we chose for q, we will get close, but we won't be able to drive the distance to 0.

Variational Inference: form of the approximation
What form should q(w) have? We will work a lot with the mean-field approach, which imposes independent distributions for each component of w:

$$q(w) = \prod_{i=0}^{D-1} q(w_i)$$

This is a fancy way of saying something simple: we factorize the distribution across the dimensions of our parameter space, which means we will not be able to capture covariances in the posterior. For simplicity, we work with Gaussian distributions:

$$q(w) = \prod_{i=0}^{D-1} q(w_i) = \prod_{i=0}^{D-1} \mathcal{N}\big(w_i \mid \mu_i, \sigma_i^2\big)$$

μ_i and σ_i² are the parameters we optimize in order to change the position and the shape of q. They are called variational parameters (collected into ν in the notation). There are 2D of them, a mean and a variance for each weight. So we double the number of things to optimize: if we just optimized w (not being Bayesian) we would have D parameters; to be Bayesian with this simple approximation we optimize the means and the variances. To find the best distribution q(w) we find the best variational parameters ν.
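A minimal sketch (mine) of what the mean-field Gaussian q(w; ν) looks like in code: ν is just a vector of means and a vector of log standard deviations.

```python
import numpy as np

class MeanFieldGaussian:
    """Fully factorized Gaussian q(w; nu) = prod_i N(w_i | mu_i, sigma_i^2).
    The variational parameters nu are (mu, log_sigma): 2*D numbers in total."""
    def __init__(self, D):
        self.mu = np.zeros(D)
        self.log_sigma = np.zeros(D)      # parameterize sigma_i > 0 through its log

    def sample(self, n_samples, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        eps = rng.standard_normal((n_samples, self.mu.size))
        return self.mu + np.exp(self.log_sigma) * eps     # draw w ~ q(w; nu)

    def log_prob(self, W):
        sigma2 = np.exp(2.0 * self.log_sigma)
        return np.sum(-0.5 * np.log(2.0 * np.pi * sigma2)
                      - 0.5 * (W - self.mu) ** 2 / sigma2, axis=-1)
```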
Variational Inference: objective
How do we define a "distance" between q(w; ν) and the posterior p(w|y, X)? We use the KL divergence to measure the "distance" between the two distributions:

$$\mathrm{KL}[q(w;\nu)\,\|\,p(w \mid y, X)] = \int q(w;\nu)\log\frac{q(w;\nu)}{p(w \mid y, X)}\,dw = \mathbb{E}_{q(w;\nu)}\!\left[\log\frac{q(w;\nu)}{p(w \mid y, X)}\right]$$

This is problematic, because in this expression there is something we cannot deal with: the posterior, which we can't write analytically. So the difficulty is that p(w|y, X) is intractable! But we can do something clever. A tractable objective for optimizing q(w; ν) is obtained by manipulating the KL divergence. We can rewrite:

$$\mathrm{KL}[q(w;\nu)\,\|\,p(w \mid y, X)] = \mathbb{E}_{q(w;\nu)}[\log q(w;\nu)] - \mathbb{E}_{q(w;\nu)}[\log p(w \mid y, X)]$$

We can split the KL into an entropy and a cross-entropy term (the two terms separated by the minus). The second term is the problematic one, because the posterior is intractable. Focusing on the intractable term,

$$\mathbb{E}_{q(w;\nu)}[\log p(w \mid y, X)] = \int q(w;\nu)\,\log p(w \mid y, X)\,dw$$

we can expand the posterior:

$$\int q(w;\nu)\,\log\!\left[\frac{p(y \mid X, w)\,p(w)}{p(y \mid X)}\right] dw$$

obtaining:

$$\mathbb{E}_{q(w;\nu)}[\log p(y \mid X, w)] + \mathbb{E}_{q(w;\nu)}[\log p(w)] - \log p(y \mid X)$$

Putting everything together in the original KL objective:

$$\mathrm{KL}[q(w;\nu)\,\|\,p(w \mid y, X)] = \mathbb{E}_{q(w;\nu)}[\log q(w;\nu)] - \mathbb{E}_{q(w;\nu)}[\log p(y \mid X, w)] - \mathbb{E}_{q(w;\nu)}[\log p(w)] + \log p(y \mid X)$$

Rearranging:

$$\mathrm{KL}[q(w;\nu)\,\|\,p(w \mid y, X)] = -\mathbb{E}_{q(w;\nu)}[\log p(y \mid X, w)] + \mathrm{KL}[q(w;\nu)\,\|\,p(w)] + \log p(y \mid X)$$

This is an important equation for variational inference, because now we can rearrange once more. Manipulating the previous expression:

$$\log p(y \mid X) - \mathrm{KL}[q(w;\nu)\,\|\,p(w \mid y, X)] = \mathbb{E}_{q(w;\nu)}[\log p(y \mid X, w)] - \mathrm{KL}[q(w;\nu)\,\|\,p(w)]$$

The right-hand side is computable, so we can use it as an objective! We call it $\mathcal{L}_{\mathrm{ELBO}}$.

What we are saying here is that something we cannot compute is related to something we can compute, and the gap between the two is the KL. If q = p then this gap is 0, and it is always ≥ 0. So there is a gap between what we can compute and what we cannot compute, due to this KL; but when the KL is 0, i.e. when we manage to get q = p, the thing we can compute is exactly the marginal likelihood, or the "evidence". This is why we call this expression $\mathcal{L}_{\mathrm{ELBO}}$ (Evidence Lower BOund). If we push the KL towards 0, we increase the lower bound as much as possible: minimizing the KL is equivalent to maximizing $\mathcal{L}_{\mathrm{ELBO}}$. And this is very important, because it says that if we optimize and manage to make q = p, we solve the problem: we get exactly the value of the marginal likelihood (the evidence). This is the key: rearranging the KL between q and p gives a criterion that is computable. The more we maximize it w.r.t. q, the closer we get to log p(y|X); eventually, if q = p, the KL is 0 and we recover log p(y|X) exactly. So the problem now becomes: optimize this expression w.r.t. q.

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q(w;\nu)}[\log p(y \mid X, w)] - \mathrm{KL}[q(w;\nu)\,\|\,p(w)]$$

The first term is a model-fitting term, $\mathbb{E}_{q(w;\nu)}[\log p(y \mid X, w)]$: the higher it is, the better the parameters drawn from q(w; ν) are at modeling the labels. When q is close to the posterior, our samples look good and so the average log-likelihood is good. The second term is a regularization term, $-\mathrm{KL}[q(w;\nu)\,\|\,p(w)]$, which penalizes q(w; ν) deviating too much from the prior p(w).
If we want to fit our data nicely we should maximize the first term w.r.t. q, but at the same time we cannot move too far from the prior p, otherwise we are penalized. In that way we have model fitting plus regularization, and whenever we have these two in machine learning we are protected from overfitting. What would be the best q to maximize the log-likelihood alone? A Gaussian with a very narrow (actually zero) variance, a big spike on the best value, and we fall back to maximum likelihood. The regularization term is nice precisely because it prevents this: if we do that, we become so different from the prior that we are penalized. (It turns out that if we try this on a neural network it can fail, because we sum over millions of parameters; if we have many more parameters than data, the KL term dominates and we get q = p, which is not very useful.)

But going back to the main problem: how do we compute the objective, and how do we optimize it?

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q(w;\nu)}[\log p(y \mid X, w)] - \mathrm{KL}[q(w;\nu)\,\|\,p(w)]$$

Recall the assumption on q(w; ν):

$$q(w;\nu) = \prod_i q(w_i) = \prod_i \mathcal{N}\big(w_i \mid \mu_i, \sigma_i^2\big)$$

Optimizing w.r.t. q(w; ν) means optimizing w.r.t. μ_i and σ_i². The second term, −KL[q(w; ν) || p(w)], can be expressed analytically using the expression of the KL divergence between Gaussians. In our case, if $p(w) = \prod_i \mathcal{N}(w_i \mid 0, s^2)$, we obtain:

$$\mathrm{KL}[q(w;\nu)\,\|\,p(w)] = \frac{1}{2}\sum_i \left[\log\frac{s^2}{\sigma_i^2} - 1 + \frac{\sigma_i^2}{s^2} + \frac{\mu_i^2}{s^2}\right]$$

So in the computation of the variational objective, the KL is the easy part. The difficult part is the expectation. How do we optimize the expectation w.r.t. q? It is an integral:

$$\mathbb{E}_{q(w;\nu)}[\log p(y \mid X, w)] = \int \log p(y \mid X, w)\, q(w;\nu)\, dw$$

and this is possible analytically only if, for example, q is Gaussian and log p is quadratic (or Gaussian). But that's not why we use variational inference! We use it in cases where log p is something messy and q is Gaussian, so we can't do this integral analytically. But... we can do Monte Carlo! We can write it as an expectation and estimate it with samples:

$$\mathbb{E}_{q(w;\nu)}[\log p(y \mid X, w)] \approx \frac{1}{N_{MC}}\sum_{h=1}^{N_{MC}} \log p\big(y \mid X, \tilde{w}^{(h)}\big), \qquad \tilde{w}^{(h)} \sim q(w;\nu)$$

where w̃^(h) is sampled from q. Remember: this estimator is unbiased and its variance shrinks with 1/N_MC (N_MC = number of Monte Carlo samples), independently of the dimensionality. But written like this, ν seems to disappear: with q(w; ν) fixed, every time we resample w from q(w; ν) we obtain a different value! So how do we make gradient updates to the parameters μ_i, σ_i² of q(w; ν)? Answer: freeze the randomness within Monte Carlo (the reparameterization trick)!

Variational Inference: Reparameterization trick
Idea: samples of w can be obtained by a deterministic transformation f of a random variable ε ∼ p(ε), such that p(ε) has no tunable parameters. The variational parameters ν are parameters of the function f. If we take samples from p(ε) and pass them through f, we get samples from q, which we can use to compute log p(y|X, w); and we can use the chain rule of differentiation to push the gradient through the function f. For example, with $q(w_i) = \mathcal{N}(w_i \mid \mu_i, \sigma_i^2)$ we have:

$$\varepsilon \sim \mathcal{N}(0, 1), \qquad w_i = f(\varepsilon;\nu) = \mu_i + \varepsilon\,\sigma_i$$

We can prove that by building w_i in this way, we recover the original q(w_i). The key observation: the gradient of an expectation can be turned into an expectation of a gradient:

$$\nabla_\nu\, \mathbb{E}_{q(w;\nu)}[\log p(y \mid X, w)] = \nabla_\nu\, \mathbb{E}_{p(\varepsilon)}\Big[\log p(y \mid X, w)\big|_{w = f(\varepsilon;\nu)}\Big]$$
Variational Inference: Reparameterization trick (Derivation)

$$\nabla_\nu\, \mathbb{E}_{q(w;\nu)}[\log p(y \mid X, w)] = \nabla_\nu\, \mathbb{E}_{p(\varepsilon)}\Big[\log p(y \mid X, w)\big|_{w=f(\varepsilon;\nu)}\Big] = \mathbb{E}_{p(\varepsilon)}\Big[\nabla_\nu \log p(y \mid X, w)\big|_{w=f(\varepsilon;\nu)}\Big]$$

$$= \mathbb{E}_{p(\varepsilon)}\Big[\nabla_w \log p(y \mid X, w)\big|_{w=f(\varepsilon;\nu)}\; \nabla_\nu f(\varepsilon;\nu)\Big] \approx \frac{1}{N_{MC}}\sum_{h=1}^{N_{MC}} \nabla_w \log p(y \mid X, w)\big|_{w=f(\tilde\varepsilon^{(h)};\nu)}\; \nabla_\nu f\big(\tilde\varepsilon^{(h)};\nu\big), \qquad \tilde\varepsilon^{(h)} \sim p(\varepsilon)$$

This looks awful! Good news: if we use any auto-diff tool (PyTorch, TensorFlow, JAX, NumPyro, etc.), we never compute these gradients manually. All we need to do is compute the objective, knowing that the samples w are constructed in a deterministic way from the mean and variance of q.

Variational Inference: Reparameterization trick (Properties)
• Estimation of the gradients is unbiased.
• The likelihood p(y|X, w) must be differentiable.
• The transformation f must be differentiable.
• We need to be able to sample from p(ε), but not from q(w; ν).
• It often has low variance; even a single-sample estimate is OK (N_MC = 1).

Variational Inference with Stochastic Optimization
So, we can use stochastic gradient optimization of our approximate variational objective:

$$\hat{\mathcal{L}}_{\mathrm{ELBO}} = \frac{1}{N_{MC}}\sum_{h=1}^{N_{MC}} \log p\big(y \mid X, \tilde w^{(h)}\big) - \mathrm{KL}[q(w;\nu)\,\|\,p(w)], \qquad \tilde w^{(h)} = f\big(\tilde\varepsilon^{(h)};\nu\big),\ \ \tilde\varepsilon^{(h)} \sim p(\varepsilon)$$

We have guarantees of convergence to the original variational objective:

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q(w;\nu)}[\log p(y \mid X, w)] - \mathrm{KL}[q(w;\nu)\,\|\,p(w)]$$

Stochastic Gradient Optimization
When we are stochastic, each time we query our routine for a gradient we end up in a different place; to converge, the steps have to become small. Stochastic gradient-based optimization has good theory about convergence: optimizing with stochastic updates reaches a local optimum if the step size α_i goes to zero at a certain rate:

$$\sum_i \alpha_i^2 < \infty, \qquad \sum_i \alpha_i = \infty$$

The price to pay: convergence in O(1/√t) rather than O(1/t) for deterministic gradient-based optimization (t is the number of iterations), so we are going to be slower.

Results on Classification
Results with a fully factorized Gaussian posterior, and results with a full-covariance Gaussian posterior. In the Gaussian-likelihood case, the optimization makes q(w; ν) converge to the true posterior!

Extensions
We can also extend to a mini-batch-based formulation, because we have an expectation of a log-likelihood: the log of the likelihood is the log of a product, which becomes a sum of logs. We can also use more general forms of q(w; ν), implicit q(w; ν), any likelihood and any prior.

Mini-batching
Once we have the objective

$$\widetilde{\mathrm{objective}} = \frac{1}{N_{MC}}\sum_{h=1}^{N_{MC}} \log p\big(y \mid X, \tilde w^{(h)}\big) - \mathrm{KL}[q(w;\nu)\,\|\,p(w)]$$

the only term depending on the data is the first one. When the likelihood factorizes,

$$\log p\big(y \mid X, \tilde w^{(h)}\big) = \sum_{i=1}^{N} \log p\big(y_i \mid x_i, \tilde w^{(h)}\big)$$

we can get an unbiased estimate by selecting M out of N data points:

$$\log p\big(y \mid X, \tilde w^{(h)}\big) \approx \frac{N}{M} \sum_{i \in \text{minibatch}} \log p\big(y_i \mid x_i, \tilde w^{(h)}\big)$$

Double source of stochasticity: Monte Carlo and mini-batch.

Better approximation with Normalizing Flows
Key idea: build complex distributions from simple distributions via a flow of successive (invertible) transformations.
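To close the chapter, here is a sketch (mine, plain NumPy, no auto-diff) that puts the pieces together: a few-sample, reparameterized estimate of the ELBO for Bayesian logistic regression with the mean-field Gaussian q and an isotropic prior 𝒩(0, s²I). In practice one would let PyTorch/JAX differentiate this w.r.t. (mu, log_sigma).

```python
import numpy as np

def elbo_estimate(mu, log_sigma, X, y, s2=1.0, n_mc=1, rng=None):
    """Reparameterized Monte Carlo estimate of the ELBO for Bayesian logistic
    regression with q(w) = prod_i N(mu_i, sigma_i^2) and prior p(w) = prod_i N(0, s2)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.exp(log_sigma)
    exp_loglik = 0.0
    for _ in range(n_mc):
        eps = rng.standard_normal(mu.size)
        w = mu + sigma * eps                                 # reparameterization: w = f(eps; nu)
        a = X @ w
        exp_loglik += np.sum(y * a - np.logaddexp(0.0, a))   # Bernoulli log-likelihood
    exp_loglik /= n_mc
    # Analytic KL between the factorized Gaussian q and the prior N(0, s2 * I):
    kl = 0.5 * np.sum(np.log(s2 / sigma**2) - 1.0 + sigma**2 / s2 + mu**2 / s2)
    return exp_loglik - kl
```

Maximizing this estimate by stochastic gradient ascent on the variational parameters, optionally replacing the full-data log-likelihood with the N/M-scaled mini-batch sum, is exactly the procedure described above.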
5 - K-means, Kernel K-means, and Mixture models

Unsupervised learning
Everything we've seen so far has been supervised: we were given a set of x_n and associated y_n. What if we just have x_n?

Aims
• Understand what clustering is.
• Understand the K-means algorithm.
• Understand the idea of mixture models.
• Be able to derive the update expressions for mixture model parameters.

Clustering
Example: each sample has two attributes, x_n = [x_{n1}, x_{n2}]^T, and we have N of them. It is a completely different setup, because before we knew the "colour": we had the labels and we had to figure out how to separate the classes. In unsupervised learning we just have the black data points, the inputs, and our algorithms have to determine some grouping. This is a very arbitrary problem: we may see 3 clusters where someone else sees just 2. Some people have formalized the problem by listing desirable properties of clustering algorithms. For example, if we scale the data by a factor α, we would like the algorithm to give the same result, i.e. to be consistent across scalings. We may want other properties too, but we will never be able to satisfy all of them at once.
Left: data. Right: data after clustering (points coloured according to cluster membership).

K-means
Assume that there are K clusters, i.e. we know the number of clusters. (How do we know how many clusters there are? We will see strategies for this.) Each cluster is defined by a position in the input space:

$$\mu_k = [\mu_{k1}, \mu_{k2}]^T$$

Each x_n is assigned to its closest cluster, i.e. to the closest mean μ_k. This is a mechanism of "compression", if we like. The distance is normally the Euclidean distance:

$$d_{nk} = (x_n - \mu_k)^T (x_n - \mu_k)$$

but obviously we are free to use another distance. How do we find μ_k? We would like to place the means strategically so that we compress our data nicely. But there is no analytical solution: we can't write μ_k down as a function of X. We have to use an iterative heuristic algorithm, which is guaranteed to converge (a code sketch follows the kernel motivation below):
1. Guess μ_1, μ_2, ..., μ_K.
2. Assign each x_n to its closest μ_k.
3. Use the indicator z_nk, a big matrix of zeros with a one where a point is associated to cluster k: z_nk = 1 if x_n is assigned to μ_k (0 otherwise).
4. Update μ_k to the average of the x_n assigned to it:

$$\mu_k = \frac{\sum_{n=1}^N z_{nk}\, x_n}{\sum_{n=1}^N z_{nk}}$$

5. Return to 2 until the assignments do not change.
The algorithm will converge, reaching a point where the assignments don't change, at the cost of N·K distance computations per iteration. But it is linear in N, so not so bad (the Gaussian models we saw earlier scale at least quadratically).

When does K-means break?
The outer cluster cannot be represented as a single point; K-means would split it into pieces. Why? Because K-means uses the Euclidean distance, and with that there is no way to achieve separations that are not linear, unless we use 100 clusters, which is not very elegant or interpretable (100 clusters to represent 2). So what if we could change these boundaries to be non-linear? One way would be to change the distance between points and clusters, but an easier way is to think as we did with Gaussian processes. There we said: we are good at linear regression, so instead of doing linear regression with linear functions, we transform the problem by introducing many new basis functions and do linear regression on those non-linear basis functions. For this particular problem it is very easy: imagine taking each x_i and constructing a new variable h_i = ||x_i||². What do I get? I get all the points separated, and this leads to an easy separation of the clusters by K-means. And what if we could use an infinite number of basis functions? We would have infinitely many ways to separate the points. So the idea is to apply a transformation that brings us into another space, do K-means there, and then go back to the original space. Obviously we don't like infinite stuff.
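Before kernelizing, here is a minimal sketch (mine, not the lecture's code) of the plain K-means loop described above:

```python
import numpy as np

def kmeans(X, K, n_iters=100, rng=None):
    """Plain K-means with Euclidean distance: assign each point to its closest
    mean, move each mean to the average of its points, repeat until assignments stop changing."""
    rng = np.random.default_rng() if rng is None else rng
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()     # step 1: guess the means
    z = -np.ones(len(X), dtype=int)
    for _ in range(n_iters):
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # N x K squared distances d_nk
        z_new = d.argmin(axis=1)                                  # step 2: closest mean
        if np.array_equal(z_new, z):
            break                                                 # assignments unchanged: converged
        z = z_new
        for k in range(K):
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)                    # step 4: mean of assigned points
    return mu, z
```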
Kernelizing K-means
Maybe we can kernelize K-means? Let's start with the distances:

$$d_{nk} = (x_n - \mu_k)^T (x_n - \mu_k)$$

This tells us whether a point belongs to a cluster: we compute it for all the clusters, find the minimum, and associate the point to the cluster for which the distance is minimum. The means are the "barycentres" of the clusters:

$$\mu_k = \frac{\sum_{m=1}^N z_{mk}\, x_m}{\sum_{m=1}^N z_{mk}}$$

Now, if we expand the distance by plugging the mean in, we get:

$$(x_n - \mu_k)^\top (x_n - \mu_k) = \Big(x_n - N_k^{-1}\sum_{m=1}^N z_{mk}\, x_m\Big)^{\!\top}\Big(x_n - N_k^{-1}\sum_{m=1}^N z_{mk}\, x_m\Big)$$

Multiplying out:

$$x_n^\top x_n \;-\; 2 N_k^{-1}\sum_{m=1}^N z_{mk}\, x_m^\top x_n \;+\; N_k^{-2}\sum_{m,l} z_{mk}\, z_{lk}\, x_m^\top x_l$$

In linear regression we saw that we can express predictions purely in terms of scalar products between points. Here, to decide whether a point belongs to a cluster we have to compute these distances to the means; and because the means are themselves averages of points, when we multiply everything out, every term contains scalar products between inputs. So we can say: what if we change the representation, mapping the x into some other space with some function φ, and replace each scalar product with a kernel function:

$$k(x_n, x_n) \;-\; 2 N_k^{-1}\sum_{m=1}^N z_{mk}\, k(x_n, x_m) \;+\; N_k^{-2}\sum_{m,l=1}^N z_{mk}\, z_{lk}\, k(x_m, x_l)$$

So what we are doing is again mapping the problem into another space, where the scalar product is just a kernel. If we do this, we work in the new space induced by the kernel k; choosing k is equivalent to choosing a mapping of our points.

Kernel K-means
Algorithm (a code sketch of the distance computation follows this section):
1. Choose a kernel and any necessary parameters.
2. Start with random assignments z_nk.
3. For each x_n, assign it to the nearest 'centre', where the distance is defined as:

$$k(x_n, x_n) \;-\; 2 N_k^{-1}\sum_{m=1}^N z_{mk}\, k(x_n, x_m) \;+\; N_k^{-2}\sum_{m,l=1}^N z_{mk}\, z_{lk}\, k(x_m, x_l)$$

4. If assignments have changed, return to 3.

It is very important to notice that we never compute the means: the kernel induces some mapping of the points, but we do not want to work with that mapping explicitly, which is why we use this implicit representation. We can compute the distances in the mapped, high-dimensional space without ever computing the mapping or the means in that space, and we don't need to, because the only thing we need is the distance between points and means. This is just what we did in linear regression: we didn't compute the weights in the infinite-dimensional representation, because we can't and we don't need to; all predictions in that space can be expressed in terms of scalar products, and the scalar product is just k. Here it is the same: we want to use the mapping without computing it, and all we need is to be able to compute the distance between the points and the means in the infinite-dimensional space.
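A minimal sketch (mine) of the kernelized distance above, given a precomputed kernel matrix K_mat[n, m] = k(x_n, x_m) and one-hot assignments Z[n, k]; it assumes no cluster is empty.

```python
import numpy as np

def kernel_kmeans_distances(K_mat, Z):
    """Distance of every point to every cluster 'centre' in feature space, using
    only kernel evaluations:
    d_nk = k(x_n,x_n) - (2/N_k) sum_m z_mk k(x_n,x_m) + (1/N_k^2) sum_{m,l} z_mk z_lk k(x_m,x_l)."""
    Nk = Z.sum(axis=0)                                        # points per cluster, shape (K,)
    term1 = np.diag(K_mat)[:, None]                           # k(x_n, x_n), shape (N, 1)
    term2 = -2.0 * (K_mat @ Z) / Nk                           # shape (N, K)
    term3 = np.einsum('mk,ml,lk->k', Z, K_mat, Z) / Nk**2     # shape (K,)
    return term1 + term2 + term3                              # broadcasts to (N, K)

# One assignment step of kernel K-means would then be:
# z_new = kernel_kmeans_distances(K_mat, Z).argmin(axis=1)
```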
So Kernel K-means:
• Makes the simple K-means algorithm more flexible.
• But we now have to set additional parameters.
• Very sensitive to initial conditions, because of lots of local optima.

K-means summary
• Simple (and effective) clustering strategy.
• Converges to a (local) minimum of:

$$\sum_n \sum_k z_{nk}\, (x_n - \mu_k)^\top (x_n - \mu_k)$$

• Sensitive to initialization.
• How do we choose K?
• Tricky: the quantity above always decreases as K increases.
• Can use CV (cross-validation) if we have a measure of 'goodness'.
• For clustering these will be application specific.

Mixture models: thinking generatively
When we looked at the Bayesian way of doing linear regression we were thinking about how to generate data, i.e. thinking in a generative fashion: "take some parameters, construct a function, add some noise, and that's how the data are generated". Thanks to that we could design a likelihood, design a prior for the parameters, and then do posterior inference to get a posterior over the parameters. Can we do the same here? The idea is to start with a hypothesis about how we would generate data like this.

A generative model
Assumption: each x_n comes from one of K different distributions. To generate x_n, for each n:
1. Pick one of the K components.
2. Sample x_n from that distribution.
We repeat this N times and we have our data. In practice we have the data and we would like to know how they were generated, which is exactly how linear regression worked. So we define the parameters of all these distributions as Δ, and we'd like to reverse-engineer this process: learn Δ, which we can then use to find which component each point came from. (There is some noise in the generation that we can model, and we try to capture this uncertainty with the parameters Δ.) As at the beginning of the course, we are going to try to do the same thing here: maximize the likelihood.

Mixture model likelihood
For the likelihood what we need is some probability of our data given the parameters; then we can maximize it. So let the k-th distribution have pdf p(x_n | z_nk = 1, Δ_k). This determines the probability of each x_n when x_n belongs to cluster k, and cluster k has parameters Δ_k. What we want is the likelihood p(X | Δ). The first assumption we make is that the likelihood factorizes:

$$p(X \mid \Delta) = \prod_{n=1}^N p(x_n \mid \Delta)$$

Then, un-marginalize k:

$$p(X \mid \Delta) = \prod_{n=1}^N \sum_{k=1}^K p(x_n, z_{nk} = 1 \mid \Delta) = \prod_{n=1}^N \sum_{k=1}^K p(x_n \mid z_{nk} = 1, \Delta_k)\, P(z_{nk} = 1 \mid \Delta)$$

Why is this useful? Because these terms are, for example, Gaussians. So this is basically our likelihood, and we want to find Δ:

$$\operatorname*{argmax}_\Delta\; \prod_{n=1}^N \sum_{k=1}^K p(x_n \mid z_{nk} = 1, \Delta_k)\, P(z_{nk} = 1 \mid \Delta)$$

But this already looks bad, because we have a product of N terms, each of which is a sum of K terms! Using logs we obtain:

$$\operatorname*{argmax}_\Delta\; \sum_{n=1}^N \log \sum_{k=1}^K p(x_n \mid z_{nk} = 1, \Delta_k)\, P(z_{nk} = 1 \mid \Delta)$$

but now we have the log of a sum... so we need to do something else.

Jensen's inequality

$$\log \mathbb{E}_{p(x)}[f(x)] \;\geq\; \mathbb{E}_{p(x)}[\log f(x)]$$

How does this help us? Our log-likelihood:

$$L = \sum_{n=1}^N \log \sum_{k=1}^K p(x_n \mid z_{nk} = 1, \Delta_k)\, P(z_{nk} = 1 \mid \Delta)$$

Add an (arbitrary-looking) distribution $q(z_{nk} = 1)$, with $\sum_k q(z_{nk} = 1) = 1$:

$$L = \sum_{n=1}^N \log \sum_{k=1}^K q(z_{nk} = 1)\, \frac{p(x_n \mid z_{nk} = 1, \Delta_k)\, P(z_{nk} = 1 \mid \Delta)}{q(z_{nk} = 1)}$$
q is going to act as our posterior over the membership of points to clusters; it will be an approximation of that posterior. So let's see the inner sum as an expectation under q(z_nk = 1):

$$L = \sum_{n=1}^N \log\, \mathbb{E}_{q(z_{nk}=1)}\!\left[\frac{p(x_n \mid z_{nk} = 1, \Delta_k)\, P(z_{nk} = 1 \mid \Delta)}{q(z_{nk} = 1)}\right]$$

So, using Jensen's inequality:

$$L \;\geq\; \sum_{n=1}^N \mathbb{E}_{q(z_{nk}=1)}\!\left[\log \frac{p(x_n \mid z_{nk} = 1, \Delta_k)\, P(z_{nk} = 1 \mid \Delta)}{q(z_{nk} = 1)}\right] = \sum_{n=1}^N \sum_{k=1}^K q(z_{nk} = 1)\, \log\!\left\{\frac{p(x_n \mid z_{nk} = 1, \Delta_k)\, P(z_{nk} = 1 \mid \Delta)}{q(z_{nk} = 1)}\right\}$$

There are only sums! What are we going to do with this? We can split the bound into three terms:

$$L \;\geq\; \sum_{n=1}^N \sum_{k=1}^K q(z_{nk} = 1) \log P(z_{nk} = 1 \mid \Delta) \;+\; \sum_{n=1}^N \sum_{k=1}^K q(z_{nk} = 1) \log p(x_n \mid z_{nk} = 1, \Delta_k) \;-\; \sum_{n=1}^N \sum_{k=1}^K q(z_{nk} = 1) \log q(z_{nk} = 1)$$

Grouped together, these terms are a model-fitting term (the expected log-likelihood) and a regularization term (how the posterior deviates from the prior). So we get the same structure we got in variational inference: some sort of log-likelihood plus a regularization.

Now we have to optimize this w.r.t. Δ and possibly other parameters, like q, so we take derivatives w.r.t. those too. Another thing we could try to optimize is P(z_nk = 1). So we define

$$q_{nk} = q(z_{nk} = 1), \qquad \pi_k = P(z_{nk} = 1 \mid \Delta)$$

(both just scalars). What we do now is differentiate the lower bound w.r.t. the parameters we want to optimize, q_nk, π_k and Δ_k, set the derivatives to zero, and obtain iterative updates.

Optimizing the lower bound
In particular, the updates for Δ_k and π_k will depend on q_nk: update q_nk, then use these values to update Δ_k and π_k, and so on. This is a form of the Expectation-Maximization (EM) algorithm, but we've derived it differently. Let's take an example...

Gaussian mixture model
Assume the component distributions are Gaussians, each with its own mean and variance and a diagonal covariance:

$$p(x_n \mid z_{nk} = 1, \mu_k, \sigma_k^2) = \mathcal{N}\big(x_n \mid \mu_k, \sigma_k^2 I\big)$$

Update for π_k. The only relevant bit of the bound is:

$$\sum_{n,k} q_{nk} \log \pi_k$$

Now, we have a constraint: $\sum_k \pi_k = 1$. A technique we can use when we have a constraint is the Lagrangian, which is just to add a term with a Lagrange multiplier λ:

$$\sum_{n,k} q_{nk}\log\pi_k \;-\; \lambda\Big(\sum_k \pi_k - 1\Big)$$

Taking the derivative and setting it to zero:

$$\frac{\partial}{\partial \pi_k} = \frac{1}{\pi_k}\sum_n q_{nk} - \lambda = 0$$

which allows us to obtain the Lagrange multiplier and substitute it back. Rearranging:

$$\sum_n q_{nk} = \lambda\, \pi_k$$

Summing both sides over k to find λ:

$$\sum_{n,k} q_{nk} = \lambda \times 1$$

Finally, substituting and rearranging again:

$$\pi_k = \frac{\sum_n q_{nk}}{\sum_{n,j} q_{nj}} = \frac{1}{N}\sum_n q_{nk}$$

In the end π_k is just the average of the posteriors. This is not surprising: if we take a prior as flexible as the posterior and allow it to be optimized, the prior only enters the KL/regularization term, so the KL can be made 0 by setting the prior equal to the posterior. That sounds like nonsense, because it seems to say the prior has to equal the posterior. Here it is almost the same thing, except that the prior is now cluster-specific; it is not a prior for each point individually, otherwise we would have π_k = q_nk.
But since we said that π_k is the prior for a given cluster, π_k has to be the average of the posterior memberships.

Update for q_nk
Now for q_nk. The whole bound is relevant, because all of it depends on q_nk. So take the derivative, set it to 0, and again add a Lagrange multiplier term $-\lambda\big(\sum_k q_{nk} - 1\big)$, because the q_nk have to sum to 1 over k. After some rearrangement we end up with (no need to remember the derivation; you can take the result directly):

$$q_{nk} = \frac{\pi_k\, p(x_n \mid z_{nk} = 1, \Delta_k)}{\sum_{j=1}^K \pi_j\, p(x_n \mid z_{nj} = 1, \Delta_j)}$$

And this is basically the same thing we got from the Bayesian classifier.

Updates for μ_k and σ_k²
These are easier: no constraints. Differentiate the following and set to zero (D is the dimension of x_n):

$$\sum_{n,k} q_{nk} \log\left[\frac{1}{(2\pi\sigma_k^2)^{D/2}} \exp\!\Big(-\frac{1}{2\sigma_k^2}(x_n - \mu_k)^\top (x_n - \mu_k)\Big)\right]$$

Result:

$$\mu_k = \frac{\sum_n q_{nk}\, x_n}{\sum_n q_{nk}}, \qquad \sigma_k^2 = \frac{\sum_n q_{nk}\, (x_n - \mu_k)^\top (x_n - \mu_k)}{D\, \sum_n q_{nk}}$$

It is just like what we had with K-means, but now we have the extra computation for the variances, because we are assuming that each cluster has its own variance. For the variance we get something very similar to the sample variance we would compute if we wanted to estimate the variance of some data.

Mixture model optimization algorithm
The optimization algorithm is the following (a code sketch of the full loop appears at the end of this section):
1. Guess μ_k, σ_k², π_k.
2. Compute q_nk.
3. Update μ_k, σ_k².
4. Update π_k.
5. Return to 2 unless the parameters are unchanged.
It is guaranteed to converge to a local maximum of the lower bound. Note the similarity with K-means, except that now we have something a bit more elaborate: we compute not only the means but also the variances, and also the π_k determining the prior mass we put on each cluster. In the literature it has been proven that it converges. K-means is a special case of what we derived here, where we fix the σ_k to be equal and π_k = 1/K; and because K-means is a special case of EM (expectation maximization), we also know that K-means is guaranteed to converge.

Mixture model clustering
Imagine now we want to know which points came from which distribution. q_nk is our posterior, the probability that x_n came from distribution k:

$$q_{nk} = P(z_{nk} = 1 \mid x_n, X)$$

This is conditioned on the data that we have. We can then either stick with the probabilities or assign each x_n to its most likely component.

Mixture model issues
How do we choose K? What happens when we increase it? The likelihood improves. In mixture models, if we have one cluster for each data point, we now have the extra flexibility that the variance is learned: what the model likes to do is put one mean on each point and shrink the variance to zero, and by doing that the likelihood becomes infinite. So we get this degeneracy if we keep increasing the number of clusters: one component can be attracted to a single data point, take no other points, and shrink its variance smaller and smaller, eventually giving infinite likelihood. So increasing the number of clusters is not a good idea in general. The likelihood always increases as σ² decreases. What can we do? (Answer: cross-validation.) We leave out some data and check the log-likelihood on the unseen data. If we do that, we end up with the usual plot we see when doing model selection, with a sweet spot, in this case around 5 components.
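A compact sketch (my own) of the EM loop above for a mixture of spherical Gaussians, following exactly the q_nk, π_k, μ_k, σ_k² updates derived in this section:

```python
import numpy as np

def em_gmm(X, K, n_iters=100, rng=None):
    """EM for a mixture of spherical Gaussians (one variance per cluster)."""
    rng = np.random.default_rng() if rng is None else rng
    N, D = X.shape
    mu = X[rng.choice(N, size=K, replace=False)].copy()   # step 1: guess the parameters
    sigma2 = np.full(K, X.var())
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E-step: q_nk proportional to pi_k * N(x_n | mu_k, sigma_k^2 I)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)          # N x K
        log_q = np.log(pi) - 0.5 * D * np.log(2 * np.pi * sigma2) - 0.5 * d2 / sigma2
        log_q -= log_q.max(axis=1, keepdims=True)                          # numerical stability
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: updates for pi_k, mu_k and sigma_k^2
        Nk = q.sum(axis=0)                                                 # effective counts
        pi = Nk / N
        mu = (q.T @ X) / Nk[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        sigma2 = (q * d2).sum(axis=0) / (D * Nk)
    return pi, mu, sigma2, q
```

Hard cluster assignments, if wanted, are then just `q.argmax(axis=1)`, i.e. the most likely component for each point.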
Mixture models: other distributions
We've seen Gaussian distributions, but we can actually use anything, as long as we can define p(x_n | z_nk = 1, Δ_k), e.g. for binary data.

Binary example
x_n = [0, 1, 0, 1, 1, ..., 0, 1]^T (D dimensions).

$$p(x_n \mid z_{nk} = 1, \Delta_k) = \prod_{d=1}^{D} p_{kd}^{\,x_{nd}}\, (1 - p_{kd})^{1 - x_{nd}}$$

The updates for p_kd are:

$$p_{kd} = \frac{\sum_n q_{nk}\, x_{nd}}{\sum_n q_{nk}}$$

q_nk and π_k are the same as before. Initialize with random p_kd (0 ≤ p_kd ≤ 1). With K = 5 clusters, clear structure is present.

Summary
• Introduced two clustering methods.
• K-means:
• Very simple.
• Iterative scheme.
• Can be kernelized.
• Need to choose K.
• Mixture models:
• Create a model of each class (similar to the Bayes classifier).
• Iterative scheme (EM).
• Can use any distribution for the components.
• Can set K by cross-validation (held-out likelihood).
• State of the art: no need to set K, treat it as a variable in a Bayesian sampling scheme.