Introduction to Machine Learning
CSE474/574: Linear Regression
Varun Chandola <chandola@buffalo.edu>

Outline
1. Inference in Machine Learning
2. Joint Distributions
3. Linear Gaussian Systems (Example)
4. Linear Regression (Problem Formulation, Geometric Interpretation, Learning Parameters)
5. Recap (Issues with Linear Regression)
6. Handling Non-linear Relationships (Handling Overfitting via Regularization)
7. Bayesian Regression (Estimating Bayesian Regression Parameters, Prediction with Bayesian Regression)

What is Inference?
Definition: Given a joint distribution p(x_1, x_2), how do we compute the marginals p(x_1) and the conditionals p(x_1 | x_2)?

Discrete Random Variables
Example: X = length of a word, Y = number of vowels in the word.

           x = 2   x = 3   x = 4   x = 5   Σ_x p(x, y)
  y = 0     0       0.03    0       0       0.03
  y = 1     0.34    0.16    0.30    0       0.80
  y = 2     0       0       0.03    0.14    0.17
  Σ_y p(x, y)  0.34  0.19   0.33    0.14

The row and column sums of the joint table give the marginals p(y) and p(x).
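As a quick numerical check, here is a minimal NumPy sketch (not from the slides) that recovers the marginals and one conditional from the joint table above:

```python
import numpy as np

# Joint p(x, y) from the table: rows are y = 0, 1, 2 (number of vowels),
# columns are x = 2, 3, 4, 5 (word length).
p_xy = np.array([
    [0.00, 0.03, 0.00, 0.00],   # y = 0
    [0.34, 0.16, 0.30, 0.00],   # y = 1
    [0.00, 0.00, 0.03, 0.14],   # y = 2
])

p_x = p_xy.sum(axis=0)               # marginal p(x) = [0.34, 0.19, 0.33, 0.14]
p_y = p_xy.sum(axis=1)               # marginal p(y) = [0.03, 0.80, 0.17]
p_y_given_x3 = p_xy[:, 1] / p_x[1]   # conditional p(y | x = 3) ≈ [0.16, 0.84, 0.0]
print(p_x, p_y, p_y_given_x3)
```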
Joint Distributions
- Multiple random variables may be jointly modeled with a joint probability distribution p(x_1, x_2, ..., x_D).
- For example, x = {x_1, x_2, ..., x_D} has a joint Gaussian distribution if p(x) = N(µ, Σ).
- Marginal distribution: the distribution of one (or more) variable(s), "marginalized" over all others.
- Conditional distribution: the distribution of one (or more) variable(s), given a particular value taken by the others.

Inference for Multivariate Normal Random Variables
Let x = (x_1, x_2) be jointly Gaussian with parameters

  µ = (µ_1, µ_2),   Σ = [Σ_11  Σ_12; Σ_21  Σ_22],   Λ = Σ^{-1} = [Λ_11  Λ_12; Λ_21  Λ_22]

Both the marginals and the conditionals are also Gaussian:

  p(x_1) = N(µ_1, Σ_11)
  p(x_1 | x_2) = N(µ_{1|2}, Σ_{1|2})

where

  µ_{1|2} = Σ_{1|2} (Λ_11 µ_1 − Λ_12 (x_2 − µ_2)),   Σ_{1|2} = Λ_11^{-1}
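The conditioning formulas translate directly into code. Below is a small sketch; the parameter values are made up for illustration, not taken from the slides:

```python
import numpy as np

# Illustrative 2-d Gaussian over x = (x1, x2).
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
Lam = np.linalg.inv(Sigma)          # precision matrix Λ = Σ^{-1}

x2 = 2.0                            # observed value of x2
Sigma_1_2 = 1.0 / Lam[0, 0]         # Σ_{1|2} = Λ_{11}^{-1}
mu_1_2 = Sigma_1_2 * (Lam[0, 0] * mu[0] - Lam[0, 1] * (x2 - mu[1]))

# p(x1) = N(mu[0], Sigma[0, 0]);  p(x1 | x2 = 2) = N(mu_1_2, Sigma_1_2)
print(mu_1_2, Sigma_1_2)            # 0.8, 1.36 for these numbers
```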
Example: Locating an Airplane
[Figure: scatter plots of radar blips in latitude-longitude coordinates, shown over several slides; the blips are noisy readings scattered around the airplane's true position.]

Bayesian Approach
- Let the exact, but unknown, location be a (2-d) random variable that takes the value y.
- Each radar blip is also a random variable, taking the value x.
- We assume that x is generated from y (by adding noise); however, we can only observe x.
- Can we estimate the true location y, given the noisy observations?

Assumptions
1. y has a Gaussian prior, y ∼ N(µ_0, Σ_0).
2. Each x_i ∼ N(y, Σ_x), i.e., each radar blip is a noisy version of the true location.

Problem Statement
Given samples x_i, infer the posterior for y.

[Figure: the same latitude-longitude plot with the prior mean µ_0 marked.]

Linear Gaussian Systems
What if each x is a noisy linear combination of a hidden variable y?

  p(y) = N(µ_y, Σ_y)
  p(x | y) = N(Ay + b, Σ_x)

Here A is a matrix and b is a vector, so x is a (noisy) linear function of y.
Inference problem: can we infer y given x?

Examples
- In the univariate case, x can be a noisy measurement of an actual quantity y.
- x can be measurements from many sensors and y the actual phenomenon being measured.
- x and y can be vectors of different lengths (sensor fusion).

Applying Bayes Rule
Bayes rule for linear Gaussian systems:

  p(y | x) = N(µ_{y|x}, Σ_{y|x})
  Σ_{y|x}^{-1} = Σ_y^{-1} + A^T Σ_x^{-1} A
  µ_{y|x} = Σ_{y|x} [A^T Σ_x^{-1} (x − b) + Σ_y^{-1} µ_y]
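These update equations transcribe directly into NumPy. A hedged sketch, with illustrative choices for A, b, and the covariances:

```python
import numpy as np

# Prior p(y) = N(mu_y, Sigma_y) and likelihood p(x|y) = N(Ay + b, Sigma_x);
# all parameter values here are illustrative.
mu_y = np.zeros(2)
Sigma_y = np.eye(2)
A = np.array([[1.0, 0.5],
              [0.0, 1.0]])
b = np.array([0.1, -0.2])
Sigma_x = 0.5 * np.eye(2)

x = np.array([1.0, 2.0])                     # a single observation

Sy_inv = np.linalg.inv(Sigma_y)
Sx_inv = np.linalg.inv(Sigma_x)
Sigma_post = np.linalg.inv(Sy_inv + A.T @ Sx_inv @ A)
mu_post = Sigma_post @ (A.T @ Sx_inv @ (x - b) + Sy_inv @ mu_y)
print(mu_post, Sigma_post)                   # posterior p(y|x) = N(mu_post, Sigma_post)
```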
Bayesian Approach for Finding the Airplane
- Assume y and x_1, x_2, ..., x_N are jointly Gaussian.
- This is a linear Gaussian system with A = I and b = 0.
- Assume the same noise covariance Σ_x for each radar blip (same sensor).

Posterior for y:

  p(y | x_1, x_2, ..., x_N) = N(µ_N, Σ_N)
  Σ_N^{-1} = Σ_0^{-1} + N Σ_x^{-1}
  µ_N = Σ_N (Σ_x^{-1} (N x̄) + Σ_0^{-1} µ_0)

where x̄ is the sample mean of the blips.
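A small simulation of this special case (all values below are made up for illustration): generate noisy blips around a hypothetical true location, then apply the posterior update above.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = np.array([2.5, 3.0])       # hypothetical true location (simulation only)
Sigma_x = 0.3 * np.eye(2)           # blip noise covariance
blips = rng.multivariate_normal(y_true, Sigma_x, size=20)

mu_0 = np.array([2.0, 2.0])         # prior mean
Sigma_0 = 2.0 * np.eye(2)           # prior covariance

N = len(blips)
x_bar = blips.mean(axis=0)          # sample mean of the blips
Sigma_N = np.linalg.inv(np.linalg.inv(Sigma_0) + N * np.linalg.inv(Sigma_x))
mu_N = Sigma_N @ (np.linalg.inv(Sigma_x) @ (N * x_bar)
                  + np.linalg.inv(Sigma_0) @ mu_0)
print(mu_N)                         # pulled toward x_bar; approaches y_true as N grows
```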
Linear Regression
- There is one scalar target variable y (observed, rather than hidden) and one vector input variable x.

Linear Regression Learning Task
Learn w given training examples ⟨X, y⟩.

Two Interpretations

1. Probabilistic Interpretation
- y is assumed to be normally distributed: y ∼ N(w^T x, σ^2), or equivalently y = w^T x + ε, where ε ∼ N(0, σ^2).
- y is a linear combination of the input variables.
- Given w and σ^2, one can find the probability distribution of y for a given x.

2. Geometric Interpretation
- Fitting a straight line (hyperplane) to d-dimensional data:

  y = w^T x = w_1 x_1 + w_2 x_2 + ... + w_d x_d

- This line will pass through the origin. Adding an intercept,

  y = w_0 + w_1 x_1 + w_2 x_2 + ... + w_d x_d

  is equivalent to adding another column of 1s to X.

Learning Parameters - MLE Approach
Find the w and σ^2 that maximize the likelihood of the training data:

  ŵ_MLE = (X^T X)^{-1} X^T y
  σ̂^2_MLE = (1/N) (y − Xŵ)^T (y − Xŵ)

Learning Parameters - Least Squares Approach
Minimize the squared loss

  J(w) = (1/2) Σ_{i=1}^N (y_i − w^T x_i)^2

i.e., make each prediction w^T x_i as close to its target y_i as possible. The least squares estimate coincides with the MLE:

  ŵ = (X^T X)^{-1} X^T y
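A minimal sketch of both estimates on synthetic data (the data-generating values are arbitrary). For numerical stability, np.linalg.lstsq is preferable to explicitly inverting X^T X:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # intercept column of 1s
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ w_true + rng.normal(scale=0.1, size=N)

w_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # solves (X^T X) w = X^T y
resid = y - X @ w_hat
sigma2_hat = resid @ resid / N                 # MLE of the noise variance
print(w_hat, sigma2_hat)                       # close to w_true and 0.01
```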
Gradient Descent Based Method
Minimize the squared loss using gradient descent:

  J(w) = (1/2) Σ_{i=1}^N (y_i − w^T x_i)^2

Why? Solving the normal equations requires forming and inverting the D × D matrix X^T X, which can be expensive or numerically unstable for large D; gradient descent only needs the gradient ∇J(w) = −X^T (y − Xw).
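A minimal gradient-descent sketch for this loss. The step size and iteration count are arbitrary illustrative choices, and the gradient is averaged over N for a scale-free step size (this rescales the loss but not its minimizer):

```python
import numpy as np

def gd_linreg(X, y, lr=0.1, n_iters=5000):
    """Minimize J(w) = 0.5 * sum_i (y_i - w^T x_i)^2 by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / len(y)   # averaged gradient of the squared loss
        w -= lr * grad
    return w

# On the data from the least-squares sketch, gd_linreg(X, y) approaches w_hat.
```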
Recap - Linear Regression

  Geometric:      y = w^T x
  Probabilistic:  p(y) = N(w^T x, σ^2)

  Least squares / MLE:  ŵ = (X^T X)^{-1} X^T y
  Noise variance MLE:   σ̂^2 = (1/N) (y − Xŵ)^T (y − Xŵ)
  Gradient descent on:  J(w) = (1/2) Σ_{i=1}^N (y_i − w^T x_i)^2

Issues with Linear Regression
1. Susceptible to outliers.
   - Robust regression: use a "fat-tailed" distribution for y, such as a Student t-distribution or a Laplace distribution.
2. Too simplistic - underfitting.
3. Unstable in the presence of correlated input attributes.
4. Gets "confused" by unnecessary attributes.
What if the relationship is non-linear?

Handling Non-linear Relationships
Replace x with non-linear functions φ(x):

  p(y | x, θ) = N(w^T φ(x), σ^2)

- The model is still linear in w.
- Also known as basis function expansion.
- Example: φ(x) = [1, x, x^2, ..., x^d]; increasing d results in more complex fits.
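A sketch of a degree-d polynomial expansion for scalar inputs; the expanded design matrix Φ is then used exactly like X in ordinary linear regression:

```python
import numpy as np

def poly_features(x, d):
    """Map a vector of scalar inputs to the polynomial basis of degree d."""
    return np.vander(x, d + 1, increasing=True)   # columns: x^0, x^1, ..., x^d

x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + np.random.default_rng(0).normal(scale=0.1, size=50)
Phi = poly_features(x, d=5)
w_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]    # the model is still linear in w
```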
How to Control Overfitting?
- Use simpler models (linear instead of polynomial): might give poor results (underfitting).
- Use regularized complex models:

  Θ̂ = arg min_Θ J(Θ) + α R(Θ)

  where R(·) is the penalty paid for the complexity of the model.

Examples of Regularization
Ridge Regression

  ŵ = arg min_w J(w) + α ||w||_2^2

- Also known as l2 or Tikhonov regularization.
- Helps in reducing the impact of correlated inputs.

Least Absolute Shrinkage and Selection Operator (LASSO)

  ŵ = arg min_w J(w) + α ||w||_1

- Also known as l1 regularization.
- Helps in feature selection - favors sparse solutions.

Parameter Estimation for Ridge Regression
Exact loss function:

  J(w) = (1/2) Σ_{i=1}^N (y_i − w^T x_i)^2 + (λ/2) ||w||_2^2

MAP estimate of w:

  ŵ_MAP = (λ I_D + X^T X)^{-1} X^T y

Putting a Prior on w
- "Penalize" large values of w with a zero-mean Gaussian prior:

  p(w) = Π_j N(w_j | 0, τ^2)

- The posterior of w is then

  p(w | D) ∝ Π_i N(y_i | w_0 + w^T x_i, σ^2) p(w)

- The posterior is also Gaussian.

Posterior Estimates of the Weight Vector
MAP estimate of w:

  arg max_w Σ_{i=1}^N log N(y_i | w^T x_i, σ^2) + Σ_{j=1}^D log N(w_j | 0, τ^2)

This is equivalent to ridge regression with λ = σ^2 / τ^2.

Parameter Estimation for Bayesian Regression
Prior for w: w ∼ N(0, τ^2 I_D). Posterior for w:

  p(w | y, X) = p(y | X, w) p(w) / p(y | X) = N(w̄, Σ_w)

  w̄ = (X^T X + (σ^2/τ^2) I_D)^{-1} X^T y
  Σ_w = σ^2 (X^T X + (σ^2/τ^2) I_D)^{-1}

- The posterior distribution for w is also Gaussian.
- The MAP estimate of w is the posterior mean w̄, i.e., exactly the ridge solution.

Prediction with Bayesian Regression
For a new x*, predict y*.

Point estimate of y*:

  y* = ŵ_MLE^T x*

Treating y* as a Gaussian random variable:

  p(y* | x*) = N(ŵ_MLE^T x*, σ̂^2_MLE)
  p(y* | x*) = N(ŵ_MAP^T x*, σ̂^2_MAP)

Full Bayesian Treatment
Treating both y and w as random variables:

  p(y* | x*) = ∫ p(y* | x*, w) p(w | X, y) dw

This is also Gaussian!
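A sketch putting the last few slides together, treating σ^2 and τ^2 as known. Note that the posterior mean is exactly the ridge estimate with λ = σ^2/τ^2, and the full Bayesian predictive variance adds the parameter uncertainty x*^T Σ_w x* to the noise variance σ^2:

```python
import numpy as np

def bayes_linreg_posterior(X, y, sigma2, tau2):
    """Posterior N(w_bar, Sigma_w) for w under the prior w ~ N(0, tau2 * I_D)."""
    D = X.shape[1]
    A = X.T @ X + (sigma2 / tau2) * np.eye(D)
    w_bar = np.linalg.solve(A, X.T @ y)    # posterior mean = ridge / MAP estimate
    Sigma_w = sigma2 * np.linalg.inv(A)    # posterior covariance
    return w_bar, Sigma_w

def bayes_predictive(x_star, w_bar, Sigma_w, sigma2):
    """Full Bayesian predictive p(y* | x*): Gaussian after integrating out w."""
    mean = x_star @ w_bar
    var = sigma2 + x_star @ Sigma_w @ x_star   # noise + parameter uncertainty
    return mean, var
```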