Simple Bayesian Supervised Models
Saskia Klein & Steffen Bollmann

Content
• Recap from last week
• Bayesian Linear Regression
  What is linear regression?
  Application of Bayesian theory to linear regression
  Example
  Comparison to conventional linear regression
• Bayesian Logistic Regression
• Naive Bayes classifier
Source: Bishop (ch. 3, 4); Barber (ch. 10)

Maximum a posteriori estimation
• The Bayesian approach to estimating the parameters of a distribution from a set of observations is to maximize the posterior distribution:
  posterior = (likelihood × prior) / evidence
• This makes it possible to take prior information into account.

Conjugate prior
• In general, for a given probability distribution p(x|η), we can seek a prior p(η) that is conjugate to the likelihood function, so that the posterior distribution has the same functional form as the prior.
• For any member of the exponential family, there exists a conjugate prior that can be written in the form p(η|χ, ν) = f(χ, ν) g(η)^ν exp(ν ηᵀχ).
• Important conjugate pairs include:
  Binomial – Beta
  Multinomial – Dirichlet
  Gaussian – Gaussian (for the mean)
  Gaussian – Gamma (for the precision)
  Exponential – Gamma

Linear Regression
goal: predict the value of a target variable t given the value of a D-dimensional vector x of input variables: x → t
linear regression models: linear functions of the adjustable parameters
for example: t = 0.5·x1 + 0.3·x2 − 0.2·x3 + 0.1·x4

Linear Regression
Training
  {x_n} … training data set comprising N observations, where n = 1, …, N
  {t_n} … corresponding target values
  → compute the weights
Prediction
  goal: predict the value of t for a new value of x,
  i.e. model the predictive distribution p(t|x) and make predictions of t in such a way as to minimize the expected value of a loss function

Examples of linear regression models
simplest linear regression model:
  y(x, w) = w0 + w1·x1 + … + wD·xD = Σ_{j=0}^{M−1} w_j x_j = wᵀx
linear function of the weights/parameters w and of the data x
linear regression models using basis functions φ:
  y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = wᵀφ(x)
  w = (w0, …, w_{M−1})ᵀ, φ = (φ0, …, φ_{M−1})ᵀ

Bayesian Linear Regression
model: t = y(x, w) + ε
  t … target variable
  y … model
  x … data
  w … weights/parameters
  ε … additive Gaussian noise: p(ε) = N(0, β⁻¹) with zero mean and precision (inverse variance) β

Bayesian Linear Regression – Likelihood
likelihood function: p(t | x, w, β) = N(t | y(x, w), β⁻¹)
observation of N training inputs X = {x1, …, xN} and target values t = {t1, …, tN}, drawn independently from the distribution:
  p(t | X, w, β) = Π_{n=1}^{N} N(t_n | wᵀφ(x_n), β⁻¹)
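The worked example in this lecture is in MATLAB, so the sketches below use MATLAB as well. As a minimal illustration of the model and likelihood just defined, the following sketch builds a polynomial design matrix and evaluates the Gaussian likelihood; the toy data, the choice of basis functions, and all variable names are assumptions for illustration, not the lecture's example.

% Minimal sketch: polynomial basis functions and the Gaussian likelihood of a
% linear regression model (notation follows the slides: w, beta, phi).
% Toy data and basis choice are illustrative assumptions, not the lecture demo.

N = 20; M = 4; beta = 25;                  % observations, basis functions, noise precision
x = linspace(0, 1, N)';                    % 1-D inputs
t = sin(2*pi*x) + randn(N,1)/sqrt(beta);   % noisy targets

Phi = zeros(N, M);                         % design matrix: Phi(n,j+1) = phi_j(x_n)
for j = 0:M-1
    Phi(:, j+1) = x.^j;                    % polynomial basis phi_j(x) = x^j
end

w = randn(M, 1);                           % some candidate weight vector
y = Phi * w;                               % model predictions y(x_n, w)

% log likelihood: sum_n log N(t_n | w' * phi(x_n), 1/beta)
logLik = sum( 0.5*log(beta/(2*pi)) - 0.5*beta*(t - y).^2 );
fprintf('log likelihood = %.2f\n', logLik);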
Bayesian Linear Regression – Prior
prior probability distribution over the model parameters w
conjugate prior: Gaussian distribution p(w) = N(w | m0, S0) with mean m0 and covariance S0

Bayesian Linear Regression – Posterior Distribution
due to the conjugate prior, the posterior is also Gaussian:
  p(w | t) = N(w | m_N, S_N)
  m_N = S_N (S0⁻¹ m0 + β Φᵀ t)
  S_N⁻¹ = S0⁻¹ + β Φᵀ Φ
  w_MAP = m_N
(derivation: Bishop p. 112)

Example: linear regression (MATLAB)

Predictive Distribution
making predictions of t for new values of x
predictive distribution: p(t | x, t, α, β) = N(t | m_Nᵀ φ(x), σ_N²(x))
variance of the distribution: σ_N²(x) = 1/β + φ(x)ᵀ S_N φ(x)
  the first term represents the noise in the data
  the second term reflects the uncertainty associated with the parameters w
the optimal prediction for a new value of x is the conditional mean of the target variable: E[t | x] = ∫ t · p(t | x) dt = y(x, w)

Common Problem in Linear Regression: Overfitting / Model Complexity
Least-squares approach (maximizing the likelihood): gives only a point estimate of the weights
Regularization: the regularization term and its value need to be chosen
Cross-validation: requires large datasets and high computational power
Bayesian approach: gives a distribution over the weights, given a good prior; model comparison is computationally demanding but does not require validation data

From Regression to Classification
for regression problems, the target variable t was a vector of real numbers whose values we wish to predict
in classification, the target values represent class labels
  two-class problem: t ∈ {1, 0}
  K > 2 classes: t = (0, 1, 0, 0, 0)ᵀ → class 2

Classification
goal: take an input vector x and assign it to one of K discrete classes C_k
(figure: classes separated by a decision boundary)

Bayesian Logistic Regression
model the class-conditional densities p(x | C_k) and the prior probabilities p(C_k) and apply Bayes' theorem:
  p(C_k | x) = p(x | C_k) p(C_k) / p(x)

Bayesian Logistic Regression
exact Bayesian inference for logistic regression is intractable
the Laplace approximation aims to find a Gaussian approximation to a probability density defined over a set of continuous variables
the posterior distribution is approximated by a Gaussian centered at w_MAP

Example
Barber: DemosExercises\demoBayesLogRegression.m
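A minimal sketch of the posterior-update and predictive-distribution formulas above, with an assumed toy data set and an assumed zero-mean isotropic prior (m0 = 0, S0 = 10·I); it is not the MATLAB example referred to on the slides.

% Minimal sketch of the posterior and predictive equations on the slides.
% Toy data, basis functions and the prior (m0, S0) are assumptions for illustration.

N = 20; M = 4; beta = 25;                  % noise precision beta
x = linspace(0, 1, N)';
t = sin(2*pi*x) + randn(N,1)/sqrt(beta);   % noisy targets
Phi = x.^(0:M-1);                          % design matrix with polynomial basis

m0 = zeros(M, 1);  S0 = 10 * eye(M);       % prior p(w) = N(w | m0, S0)

SNinv = inv(S0) + beta * (Phi' * Phi);     % S_N^{-1} = S_0^{-1} + beta * Phi' * Phi
SN = inv(SNinv);
mN = SN * (S0 \ m0 + beta * Phi' * t);     % m_N = S_N (S_0^{-1} m0 + beta * Phi' * t)
% w_MAP = m_N: posterior mean and mode coincide for a Gaussian

% predictive distribution p(t | x*, t) = N(t | mN' * phi(x*), sigma_N^2(x*))
xs = 0.5;
phis = (xs.^(0:M-1))';
predMean = mN' * phis;
predVar  = 1/beta + phis' * SN * phis;     % noise term + parameter uncertainty
fprintf('predictive mean %.3f, variance %.3f\n', predMean, predVar);

The explicit inverses above only mirror the formulas on the slide; in practice it is numerically preferable to keep S_N⁻¹ and solve linear systems with the backslash operator.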
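For the logistic-regression part, the following sketch illustrates the Laplace-approximation idea mentioned above: a Newton-Raphson search for w_MAP, a Gaussian posterior whose inverse covariance is the Hessian at w_MAP, and a probit-style approximate predictive (cf. Bishop ch. 4.5). It is not Barber's demoBayesLogRegression.m; the toy data, prior precision, and iteration count are assumptions.

% Minimal sketch of the Laplace approximation for Bayesian logistic regression.
% Toy data, prior and step count are illustrative assumptions.

rng(0);
N = 100; D = 2;
X = [randn(N/2, D) + 1; randn(N/2, D) - 1];     % two Gaussian blobs
c = [ones(N/2,1); zeros(N/2,1)];                % class labels t in {1, 0}
Phi = [ones(N,1) X];                            % add a bias basis function

alpha = 1; S0inv = alpha * eye(D+1);            % prior p(w) = N(0, alpha^{-1} I)
sigm = @(a) 1./(1 + exp(-a));

w = zeros(D+1, 1);                              % Newton-Raphson for w_MAP
for it = 1:20
    y = sigm(Phi * w);
    g = Phi' * (y - c) + S0inv * w;             % gradient of the negative log posterior
    H = Phi' * diag(y .* (1 - y)) * Phi + S0inv;% Hessian of the negative log posterior
    w = w - H \ g;
end
y = sigm(Phi * w);
H = Phi' * diag(y .* (1 - y)) * Phi + S0inv;    % Hessian at w_MAP
SN = inv(H);                                    % Laplace: p(w|t) ~= N(w | w_MAP, S_N)

% approximate predictive p(C1 | x) via the probit approximation
xnew = [1, 0.3, -0.2];                          % new input, with bias term
mu_a  = xnew * w;
var_a = xnew * SN * xnew';
kappa = 1 / sqrt(1 + pi*var_a/8);
fprintf('p(C1 | xnew) ~= %.3f\n', sigm(kappa * mu_a));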
Naive Bayes Classifier
Why naive?
• strong independence assumptions: the presence or absence of a feature is assumed to be unrelated to the presence or absence of any other feature, given the class variable
• relations between features are ignored; all features are assumed to contribute independently to a class (see the sketch below)
[http://en.wikipedia.org/wiki/Naive_Bayes_classifier]

Thank you for your attention
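As a closing illustration of the naive Bayes slide above, a minimal Gaussian naive Bayes sketch: each feature is modeled independently given the class, so the class-conditional density factorizes into per-feature terms. The toy data and the Gaussian feature model are assumptions for illustration.

% Minimal sketch of a Gaussian naive Bayes classifier.
% Toy data and the per-feature Gaussian model are illustrative assumptions.

rng(1);
N = 100; D = 2; K = 2;
X = [randn(N/2, D) + 2; randn(N/2, D) - 2];     % two classes of 2-D points
c = [ones(N/2,1); 2*ones(N/2,1)];               % class labels 1 and 2

mu = zeros(K, D); s2 = zeros(K, D); prior = zeros(K, 1);
for k = 1:K                                      % fit per-class, per-feature Gaussians
    Xk = X(c == k, :);
    mu(k, :) = mean(Xk);
    s2(k, :) = var(Xk);
    prior(k) = size(Xk, 1) / N;
end

% classify a new point: p(C_k | x) propto p(C_k) * prod_d N(x_d | mu_kd, s2_kd)
xnew = [0.5, 1.0];
logpost = zeros(K, 1);
for k = 1:K
    logpost(k) = log(prior(k)) + ...
        sum(-0.5*log(2*pi*s2(k,:)) - (xnew - mu(k,:)).^2 ./ (2*s2(k,:)));
end
[~, khat] = max(logpost);
fprintf('predicted class: %d\n', khat);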