Introduction to Machine Learning
CSE474/574: Linear Regression
Varun Chandola <chandola@buffalo.edu>

Contents

1 Inference in Machine Learning
2 Joint Distributions
3 Linear Gaussian Systems
  3.1 Example
4 Linear Regression
  4.1 Problem Formulation
  4.2 Geometric Interpretation
  4.3 Learning Parameters
5 Recap
  5.1 Issues with Linear Regression
6 Handling Non-linear Relationships
  6.1 Handling Overfitting via Regularization
7 Bayesian Regression
  7.1 Estimating Bayesian Regression Parameters
  7.2 Prediction with Bayesian Regression
1 Inference in Machine Learning

Definition 1. Given a joint distribution p(x1, x2), how do we compute the marginals, p(x1), and the conditionals, p(x1|x2)?

Example: Discrete Random Variables
Let X be the length of a word and Y the number of vowels in the word. The joint distribution p(x, y), together with its marginals, is:

    x \ y          0      1      2      ∑_y p(x, y)
    2              0      0.34   0      0.34
    3              0.03   0.16   0      0.19
    4              0      0.30   0.03   0.33
    5              0      0      0.14   0.14
    ∑_x p(x, y)    0.03   0.80   0.17

The main question that arises here is: why? Why are we interested in computing marginals and conditionals? Sometimes we wish to remove the "influence" of one or more variables from a probability. This may be because some variables are less reliable, uninteresting, or irrelevant to some analysis. In those cases we marginalize out the effect of those variables and compute the marginal distribution over the remaining, "more interesting" variables.
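For the discrete example above, marginalization is just summing the joint table along one axis, and conditioning is a renormalized slice. A minimal sketch using the word-length/vowel-count table (variable names are mine):

```python
import numpy as np

# Joint distribution p(x, y): rows are word lengths x in {2, 3, 4, 5},
# columns are vowel counts y in {0, 1, 2}.
p_xy = np.array([
    [0.00, 0.34, 0.00],
    [0.03, 0.16, 0.00],
    [0.00, 0.30, 0.03],
    [0.00, 0.00, 0.14],
])

p_x = p_xy.sum(axis=1)               # marginal p(x): [0.34, 0.19, 0.33, 0.14]
p_y = p_xy.sum(axis=0)               # marginal p(y): [0.03, 0.80, 0.17]
p_x_given_y1 = p_xy[:, 1] / p_y[1]   # conditional p(x | y = 1)

print(p_x, p_y, p_x_given_y1)
```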
2 Joint Distributions

• Multiple random variables may be jointly modeled.
• Joint probability distribution: p(x1, x2, ..., xD).
• Example: x = {x1, x2, ..., xD} has a joint Gaussian distribution if p(x) = N(µ, Σ).
• Marginal Distribution: probability distribution of one (or more) variable(s), "marginalized" over all others.
• Conditional Distribution: probability distribution of one (or more) variable(s), given a particular value taken by the others.

• Let x = (x1, x2) be jointly Gaussian with parameters:

    µ = [µ1]     Σ = [Σ11  Σ12]     Λ = Σ^{-1} = [Λ11  Λ12]
        [µ2]         [Σ21  Σ22]                  [Λ21  Λ22]

• Both marginals and conditionals are also Gaussian:

    p(x1) = N(µ1, Σ11)
    p(x1|x2) = N(µ1|2, Σ1|2)

  where

    Σ1|2 = Λ11^{-1}
    µ1|2 = Σ1|2 (Λ11 µ1 − Λ12 (x2 − µ2))
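As a quick numerical check of these formulas, here is a minimal sketch that computes the marginal and the conditional of a 2-d Gaussian from the partitioned parameters (the specific numbers are made up for illustration):

```python
import numpy as np

mu = np.array([1.0, 2.0])                 # [mu1, mu2]
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])            # [[S11, S12], [S21, S22]]
Lam = np.linalg.inv(Sigma)                # precision matrix Lambda

x2 = 3.0                                  # observed value of x2

# Marginal: p(x1) = N(mu1, Sigma11)
mu1, S11 = mu[0], Sigma[0, 0]

# Conditional: p(x1 | x2) = N(mu_1g2, S_1g2)
S_1g2 = 1.0 / Lam[0, 0]
mu_1g2 = S_1g2 * (Lam[0, 0] * mu[0] - Lam[0, 1] * (x2 - mu[1]))

print(mu1, S11, mu_1g2, S_1g2)
```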
3 Linear Gaussian Systems

• What if each x is a noisy linear combination of a hidden variable y?

    p(y) = N(µy, Σy)
    p(x|y) = N(Ay + b, Σx)

• A is a matrix and b is a vector.
• x is a (noisy) linear combination of y.
• Inference Problem: can we infer y given x?

Example 2.
• In the univariate case, x can be a noisy measurement of an actual quantity y.
• x can be measurements from many sensors and y can be the actual phenomenon being measured.
• x and y can be vectors of different lengths (sensor fusion).

Bayes Rule for Linear Gaussian Systems

    p(y|x) = N(µy|x, Σy|x)
    Σy|x^{-1} = Σy^{-1} + A^T Σx^{-1} A
    µy|x = Σy|x [A^T Σx^{-1} (x − b) + Σy^{-1} µy]
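The Bayes rule above amounts to a few matrix operations. A minimal sketch, with A, b, and the covariances chosen arbitrarily for illustration:

```python
import numpy as np

# Prior p(y) = N(mu_y, Sigma_y) and likelihood p(x|y) = N(A y + b, Sigma_x)
mu_y = np.array([0.0, 0.0])
Sigma_y = np.eye(2)
A = np.eye(2)
b = np.zeros(2)
Sigma_x = 0.5 * np.eye(2)

x = np.array([1.0, 2.0])                  # a single observation

Sx_inv = np.linalg.inv(Sigma_x)
Sy_inv = np.linalg.inv(Sigma_y)

# Posterior p(y|x) = N(mu_post, Sigma_post)
Sigma_post = np.linalg.inv(Sy_inv + A.T @ Sx_inv @ A)
mu_post = Sigma_post @ (A.T @ Sx_inv @ (x - b) + Sy_inv @ mu_y)

print(mu_post, Sigma_post)
```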
3.1 Example

Radar Blips

[Figure: radar blips plotted on a latitude–longitude grid; a second panel marks the prior mean µ0.]

The most straightforward approach here would be to take the mean location of the blips as the estimated location. A simple probabilistic approach would be to treat the radar blips as samples from a probability distribution; one can then estimate the distribution parameters using the MLE and MAP methods discussed earlier. A richer treatment models the blips as a linear Gaussian system:

• Let the exact, but unknown, location be a (2-d) random variable that takes the value y.
• Each radar blip is a random variable as well, taking the value x.
• Assume y and x1, x2, ..., xN are jointly Gaussian.
• We assume that each x is generated from y by adding noise; however, we can only observe x.
• Can we estimate the true location y, given the noisy observations?

Assumptions
1. y has a Gaussian prior, y ~ N(µ0, Σ0).
2. Each xi ~ N(y, Σx), i.e., each radar blip is a noisy version of the true location.

Here A = I and b = 0, and we assume the same noise covariance Σx for each radar blip (same sensor).

Problem Statement
Given samples x1, x2, ..., xN, infer the posterior for y.

Posterior for y

    p(y|x1, x2, ..., xN) = N(µN, ΣN)
    ΣN^{-1} = Σ0^{-1} + N Σx^{-1}
    µN = ΣN (Σx^{-1} (N x̄) + Σ0^{-1} µ0)
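This posterior is simple to compute once the blips are in hand. A minimal sketch with made-up blip coordinates and covariances:

```python
import numpy as np

# Noisy radar blips (latitude, longitude), assumed xi ~ N(y, Sigma_x)
X = np.array([[1.0, 2.1],
              [1.4, 2.4],
              [0.9, 1.8],
              [1.2, 2.2]])
N = X.shape[0]
x_bar = X.mean(axis=0)

mu0 = np.zeros(2)            # prior mean for the true location y
Sigma0 = 10.0 * np.eye(2)    # broad prior covariance
Sigma_x = 0.25 * np.eye(2)   # per-blip noise covariance (same sensor)

Sx_inv = np.linalg.inv(Sigma_x)
S0_inv = np.linalg.inv(Sigma0)

# Posterior p(y | x1..xN) = N(mu_N, Sigma_N)
Sigma_N = np.linalg.inv(S0_inv + N * Sx_inv)
mu_N = Sigma_N @ (Sx_inv @ (N * x_bar) + S0_inv @ mu0)

print(mu_N, Sigma_N)
```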
4 Linear Regression

4.1 Problem Formulation

• There is one scalar target variable y (instead of a hidden variable).
• There is one vector input variable x.

Linear Regression Learning Task
Learn w given training examples ⟨X, y⟩.

The training data is denoted ⟨X, y⟩, where X is an N × D data matrix consisting of N data examples, each a D-dimensional vector, and y is an N × 1 vector of the corresponding target values.

Probabilistic Interpretation
• y is assumed to be normally distributed:

    y ~ N(w^T x, σ²)

• or, equivalently:

    y = w^T x + ε,  where ε ~ N(0, σ²)

• y is a linear combination of the input variables.
• Given w and σ², one can find the probability distribution of y for a given x (see the sketch below).
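A minimal sketch of this generative view, with made-up values for w and σ: for each input x the model assigns y a Gaussian distribution, which we can sample from.

```python
import numpy as np

rng = np.random.default_rng(0)

w = np.array([2.0, -1.0, 0.5])     # assumed weights
sigma = 0.3                        # assumed noise standard deviation

x = np.array([1.0, 0.5, 2.0])      # a single input vector

mean_y = w @ x                                   # E[y | x] = w^T x
y_samples = rng.normal(mean_y, sigma, size=5)    # draws from N(w^T x, sigma^2)

print(mean_y, y_samples)
```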
4.2 Geometric Interpretation

• Fitting a straight line to d-dimensional data:

    y = w^T x = w1 x1 + w2 x2 + ... + wd xd

• This line will pass through the origin.
• Add an intercept:

    y = w0 + w1 x1 + w2 x2 + ... + wd xd

• This is equivalent to adding another column of 1s to X (see the sketch below).
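A minimal sketch of the intercept trick (the data values are made up):

```python
import numpy as np

X = np.array([[1.5, 2.0],
              [0.3, 4.0],
              [2.2, 1.1]])          # N x D data matrix

# Prepend a column of ones so that w0 acts as the intercept.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # N x (D + 1)

print(X_aug)
```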
4.3 Learning Parameters

• Find w and σ that maximize the likelihood of the training data.

The derivation of the MLE estimates can be done by maximizing the log-likelihood of the data set. The likelihood of the training data set is given by:

    L(w) = ∏_{i=1}^{N} (1 / (√(2π) σ)) exp(−(yi − w^T xi)² / (2σ²))

The log-likelihood is given by:

    LL(w) = −(N/2) log 2π − N log σ − (1/(2σ²)) ∑_{i=1}^{N} (yi − w^T xi)²

This can be rewritten in matrix notation as:

    LL(w) = −(N/2) log 2π − N log σ − (1/(2σ²)) (y − Xw)^T (y − Xw)

To maximize the log-likelihood, we first compute its derivative with respect to w and σ:

    ∂LL(w)/∂w = −(1/(2σ²)) ∂/∂w (y − Xw)^T (y − Xw)
              = −(1/(2σ²)) ∂/∂w (y^T y + w^T X^T X w − 2 y^T X w)
              = −(1/(2σ²)) (2 w^T X^T X − 2 y^T X)

Note that we use the fact that (Xw)^T y = y^T Xw, since both quantities are scalars and the transpose of a scalar is equal to itself. Setting the derivative to 0, we get:

    2 w^T X^T X − 2 y^T X = 0
    w^T X^T X = y^T X
    (X^T X)^T w = X^T y      (taking the transpose of both sides)
    (X^T X) w = X^T y
    w = (X^T X)^{-1} X^T y

In a similar fashion, one can set the derivative with respect to σ to 0 and plug in the optimal value of w. The MLE estimates are therefore:

    ŵ_MLE = (X^T X)^{-1} X^T y
    σ̂²_MLE = (1/N) (y − X ŵ_MLE)^T (y − X ŵ_MLE)

Least Squares
An equivalent geometric view is to minimize the squared loss

    J(w) = (1/2) ∑_{i=1}^{N} (yi − w^T xi)²

i.e., make each prediction w^T xi as close to the target yi as possible. Why does this give the same answer? The derivation of the least squares estimate follows the exact computation as the MLE derivation above, once the expression of the squared loss is converted into matrix notation, i.e.,

    J(w) = (1/2) (y − Xw)^T (y − Xw)

which yields the same estimate, ŵ = (X^T X)^{-1} X^T y.
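A minimal sketch of the closed-form estimate on synthetic data (values made up for illustration). In practice one solves the normal equations rather than forming the inverse explicitly; `np.linalg.lstsq` does this robustly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X w_true + noise
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=N)

# MLE / least squares estimate: w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
# Equivalent, numerically more robust:
w_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

sigma2_hat = np.mean((y - X @ w_hat) ** 2)   # MLE of the noise variance

print(w_hat, w_hat_lstsq, sigma2_hat)
```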
Gradient Descent
The analytical approach discussed earlier involves inverting X^T X, a (D+1) × (D+1) matrix (alternatively, one could solve the corresponding system of linear equations). When D is large, this inversion can be computationally expensive (O(D³) for standard matrix inversion). Moreover, the linear system may have singularities, and inverting it or solving the system of equations can then yield numerically unstable results.

An alternative is to minimize the squared loss using gradient descent. To compute the gradient update rule, one differentiates the error with respect to each entry of w:

    ∂J(w)/∂wj = (1/2) ∂/∂wj ∑_{i=1}^{N} (yi − w^T xi)²
              = ∑_{i=1}^{N} (w^T xi − yi) xij

Using the above result, one can perform repeated updates of the weights:

    wj := wj − α ∂J(w)/∂wj
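A minimal sketch of batch gradient descent for the squared loss (the step size and iteration count are arbitrary choices, not from the notes):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.001, n_iters=5000):
    """Minimize J(w) = 0.5 * sum_i (y_i - w^T x_i)^2 by batch gradient descent."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)    # j-th entry is sum_i (w^T x_i - y_i) x_ij
        w -= alpha * grad
    return w

# Usage on made-up data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
print(gradient_descent(X, y))
```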
5 Recap

Probabilistic
• Model: p(y|x) = N(w^T x, σ²)
• Maximum Likelihood Estimation:

    ŵ_MLE = (X^T X)^{-1} X^T y
    σ̂²_MLE = (1/N) (y − X ŵ_MLE)^T (y − X ŵ_MLE)

Geometric
• Model: y = w^T x
• Minimize the squared loss J(w) = (1/2) ∑_{i=1}^{N} (yi − w^T xi)², i.e., make each prediction w^T xi as close to the target yi as possible.
1. Least Squares: ŵ = (X^T X)^{-1} X^T y
2. Gradient Descent: minimize the squared loss by repeated gradient updates.

Bayesian
• Treat w as a random variable with a prior and compute its posterior (Section 7).
5.1 Issues with Linear Regression

1. Susceptible to outliers.
   • Robust Regression: use a "fat-tailed" distribution for y, e.g., the Student t-distribution or the Laplace distribution.
2. Too simplistic – underfitting.
   • What if the relationship is non-linear?
3. Unstable in the presence of correlated input attributes.
4. Gets "confused" by unnecessary attributes.
6 Handling Non-linear Relationships

• Replace x with non-linear functions φ(x):

    p(y|x, θ) = N(w^T φ(x), σ²)

• The model is still linear in w.
• Also known as basis function expansion.

Example 3.

    φ(x) = [1, x, x², ..., x^d]

• Increasing d results in more complex fits (see the sketch below).
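A minimal sketch of polynomial basis function expansion for scalar inputs (the degree and data are arbitrary choices for illustration):

```python
import numpy as np

def poly_features(x, degree):
    """Map scalar inputs x to phi(x) = [1, x, x^2, ..., x^degree]."""
    x = np.asarray(x).reshape(-1, 1)
    return np.hstack([x ** p for p in range(degree + 1)])

x = np.array([0.0, 0.5, 1.0, 2.0])
Phi = poly_features(x, degree=3)       # N x (d + 1) design matrix

# Linear regression in the expanded space is still linear in w:
# w_hat = (Phi^T Phi)^{-1} Phi^T y
y = np.sin(x)                          # made-up targets
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(Phi, w_hat)
```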
6.1 Handling Overfitting via Regularization

How to Control Overfitting?
• Use simpler models (linear instead of polynomial).
  – Might give poor results (underfitting).
• Use regularized complex models:

    Θ̂ = arg min_Θ J(Θ) + α R(Θ)

• R(·) corresponds to the penalty paid for the complexity of the model.

Examples of Regularization

Ridge Regression

    ŵ = arg min_w J(w) + α ||w||²

• Also known as l2 or Tikhonov regularization.
• Helps in reducing the impact of correlated inputs.

Least Absolute Shrinkage and Selection Operator (LASSO)

    ŵ = arg min_w J(w) + α ||w||₁

• Also known as l1 regularization.
• Helps in feature selection – favors sparse solutions.
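Ridge regression has a closed form analogous to least squares. A minimal sketch (the LASSO has no closed form, so it is omitted here; the data and α are made up):

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Ridge estimate: w_hat = (alpha * I + X^T X)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(alpha * np.eye(D) + X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=50)
print(ridge_fit(X, y, alpha=1.0))
```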
7 Bayesian Regression

• "Penalize" large values of w by placing a zero-mean Gaussian prior on them:

    p(w) = ∏_j N(wj | 0, τ²)

• What is the posterior of w?

    p(w|D) ∝ ∏_i N(yi | w^T xi, σ²) p(w)

• The posterior is also Gaussian.
• MAP estimate of w:

    arg max_w ∑_{i=1}^{N} log N(yi | w^T xi, σ²) + ∑_{j=1}^{D} log N(wj | 0, τ²)

Exact Loss Function

    J(w) = (1/2) ∑_{i=1}^{N} (yi − w^T xi)² + (λ/2) ||w||²₂    (here λ = σ²/τ²)

MAP Estimate of w

    ŵ_MAP = (λ I_D + X^T X)^{-1} X^T y

• This is equivalent to Ridge Regression.
7.1 Estimating Bayesian Regression Parameters

• Prior for w:

    w ~ N(0, τ² I_D)

• Posterior for w:

    p(w|y, X) = p(y|X, w) p(w) / p(y|X)
              = N(w̄, σ² (X^T X + (σ²/τ²) I_D)^{-1})

  where w̄ = (X^T X + (σ²/τ²) I_D)^{-1} X^T y.

The denominator term in the posterior above can be computed as the marginal likelihood of the data by marginalizing out w:

    p(y|X) = ∫ p(y|X, w) p(w) dw

One can compute the posterior for w as follows. We first show that the likelihood of y, i.e., of all target values in the training data, can be jointly modeled as a Gaussian:

    p(y|X, w) = ∏_{i=1}^{N} (1/√(2πσ²)) exp(−(1/(2σ²)) (yi − w^T xi)²)
              = (1/(2πσ²)^{N/2}) exp(−(1/(2σ²)) ||y − Xw||²)
              = N(Xw, σ² I_N)

Ignoring the denominator, which does not depend on w:

    p(w|y, X) ∝ exp(−(1/(2σ²)) (y − Xw)^T (y − Xw)) exp(−(1/(2τ²)) w^T w)
             ∝ exp(−(1/2) (w − w̄)^T ((1/σ²) X^T X + (1/τ²) I_D) (w − w̄))

where w̄ = (X^T X + (σ²/τ²) I_D)^{-1} X^T y, as given above.
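A minimal sketch that computes the posterior mean and covariance of w directly from these formulas (σ² and τ² are assumed known, and the data are made up):

```python
import numpy as np

def bayes_linreg_posterior(X, y, sigma2, tau2):
    """Posterior N(w_bar, S) for w under the prior w ~ N(0, tau2 * I)."""
    D = X.shape[1]
    A = X.T @ X + (sigma2 / tau2) * np.eye(D)
    A_inv = np.linalg.inv(A)
    w_bar = A_inv @ X.T @ y          # posterior mean
    S = sigma2 * A_inv               # posterior covariance
    return w_bar, S

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)
w_bar, S = bayes_linreg_posterior(X, y, sigma2=0.09, tau2=1.0)
print(w_bar, S)
```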
7.2 Prediction with Bayesian Regression

• For a new input x*, predict y*.
• Point estimate of y*:

    y* = ŵ_MLE^T x*

• Treating y* as a Gaussian random variable:

    p(y*|x*) = N(ŵ_MLE^T x*, σ̂²_MLE)
    p(y*|x*) = N(ŵ_MAP^T x*, σ̂²_MAP)

• Treating both y* and w as random variables:

    p(y*|x*) = ∫ p(y*|x*, w) p(w|X, y) dw

• This is also Gaussian!
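Because both the likelihood and the posterior over w are Gaussian, this predictive distribution has a closed form; the standard result (stated here without derivation, not taken from these notes) has mean w̄^T x* and variance σ² + x*^T S x*, where S is the posterior covariance from Section 7.1. A minimal, self-contained sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data (made up) and assumed noise/prior variances
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)
sigma2, tau2 = 0.09, 1.0

# Posterior over w (as in Section 7.1)
A_inv = np.linalg.inv(X.T @ X + (sigma2 / tau2) * np.eye(X.shape[1]))
w_bar = A_inv @ X.T @ y
S = sigma2 * A_inv

# Predictive distribution for a new input x*:
# p(y*|x*) = N(w_bar^T x*, sigma2 + x*^T S x*)   (standard Gaussian result)
x_star = np.array([0.5, -1.0, 2.0])
pred_mean = w_bar @ x_star
pred_var = sigma2 + x_star @ S @ x_star
print(pred_mean, pred_var)
```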