Linear Regression
Oliver Schulte
Machine Learning 726

The Linear Regression Model

Parameter Learning Scenarios
The general problem: predict the value of a continuous variable from one or more continuous features.
• Discrete parent node, discrete child node: maximum likelihood, decision trees.
• Discrete parent node, continuous child node: conditional Gaussian (not discussed).
• Continuous parent node, discrete child node: logit distribution (logistic regression).
• Continuous parent node, continuous child node: linear Gaussian (linear regression).

Example 1: Predict House Prices
Price vs. floor space of houses for sale in Berkeley, CA, July 2009.
[Figure: house price in $1000 vs. house size in square feet; Russell and Norvig]

Grading Example
• Predict: the final percentage mark for a student.
• Features: assignment grades, midterm exam, final exam.
• Questions we could ask:
  • I forgot the weights of the components. Can you recover them from a spreadsheet of the final grades?
  • I lost the final exam grades. How well can I still predict the final mark?
  • How important is each component, actually? Could I guess someone's final mark well given their assignments? Given their exams?

Line Fitting
Input: a data table X (N x D) and a target vector t (N x 1).
Output: a weight vector w (D x 1).
Prediction model: the predicted value is a weighted linear combination of the input features,
\hat{t}_n = \mathbf{x}_n \cdot \mathbf{w}, or in matrix form \hat{\mathbf{t}} = X\mathbf{w}.

Least Squares Error
We seek the closest fit of the predicted line to the data points.
Error = the sum of squared errors. Sensitive to outliers.
E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} (t_n - \mathbf{x}_n \cdot \mathbf{w})^2

Squared Error on House Price Example
[Figure 18.13, Russell and Norvig: squared-error fit of house price in $1000 vs. house size in square feet]

Intuition
Suppose that there is an exact solution and that the input matrix is invertible. Then we can find the solution by simple matrix inversion:
\mathbf{t} = X\mathbf{w} \Leftrightarrow \mathbf{w} = X^{-1}\mathbf{t}
Alas, X is hardly ever square, let alone invertible. But X^T X is square, and usually invertible. So multiply both sides of the equation by X^T, then invert:
\mathbf{t} = X\mathbf{w} \Leftrightarrow X^T\mathbf{t} = X^T X\mathbf{w} \Leftrightarrow \mathbf{w} = (X^T X)^{-1} X^T\mathbf{t}
The matrix (X^T X)^{-1} X^T is the pseudo-inverse of X.

Partial Derivative
Consider a single weight parameter w_j. The partial derivative of the error is
\frac{\partial E}{\partial w_j} = \sum_{n=1}^{N} (\mathbf{x}_n \cdot \mathbf{w} - t_n)\, x_{nj}
where \mathbf{x}_n \cdot \mathbf{w} is the prediction for input \mathbf{x}_n, so each term is the error on input \mathbf{x}_n weighted by x_{nj}. Moving against the gradient changes the weight to bring the prediction closer to the actual value.

Gradient
\frac{\partial E}{\partial w_j} = \sum_{n=1}^{N} (\mathbf{x}_n \cdot \mathbf{w} - t_n)\, x_{nj}
1. Find the gradient contribution for each input \mathbf{x}_n and add them up.
2. The result is a linear combination of the row vectors \mathbf{x}_n with coefficients \mathbf{x}_n \cdot \mathbf{w} - t_n:
\nabla E = \sum_{n=1}^{N} (\mathbf{x}_n \cdot \mathbf{w} - t_n)\, \mathbf{x}_n = X^T (X\mathbf{w} - \mathbf{t})

Solution: The Pseudo-Inverse
0 = \nabla E = X^T (X\mathbf{w} - \mathbf{t}) \Leftrightarrow X^T X\mathbf{w} = X^T\mathbf{t}
Assume that X^T X is invertible. Then the solution is given by the pseudo-inverse:
\mathbf{w} = (X^T X)^{-1} X^T \mathbf{t}

The w_0 Offset
Recall the formula for the partial derivative, and that x_{n0} = 1 for all n. Write \mathbf{w}^* = (w_1, w_2, \ldots, w_D) for the weight vector without w_0, and similarly \mathbf{x}^*_n = (x_{n1}, x_{n2}, \ldots, x_{nD}) for the n-th feature vector without the "dummy" input. Then
\frac{\partial E}{\partial w_0} = \sum_{n=1}^{N} (\mathbf{w}^* \cdot \mathbf{x}^*_n + w_0 - t_n)
Setting the partial derivative to 0, we get
w_0 = \frac{1}{N}\sum_{n=1}^{N} t_n - \frac{1}{N}\sum_{n=1}^{N} \mathbf{w}^* \cdot \mathbf{x}^*_n
i.e., the average target value minus the average predicted value (without the offset).

Geometric Interpretation
Any vector of the form y = Xw is a linear combination of the columns (variables) of X. If y is the least squares approximation, then y is the orthogonal projection of t onto this subspace.
[Figure, Bishop: \phi_i denotes column vector i]
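Before turning to the probabilistic interpretation, here is a minimal NumPy sketch of the closed-form solution from the preceding slides. The house-size and price numbers are invented for illustration (they are not the Berkeley data); the point is the dummy x_{n0} = 1 column that makes w_0 the offset, and the pseudo-inverse solution w = (X^T X)^{-1} X^T t.

```python
import numpy as np

# Toy data, made up for illustration: house size in square feet -> price in $1000.
sizes = np.array([900.0, 1400.0, 1800.0, 2300.0, 3000.0])
prices = np.array([350.0, 480.0, 540.0, 700.0, 880.0])

# Design matrix X with a dummy column of ones (x_n0 = 1) so that w_0 acts as the offset.
X = np.column_stack([np.ones_like(sizes), sizes])
t = prices

# Closed-form solution of the normal equations X^T X w = X^T t.
w = np.linalg.solve(X.T @ X, X.T @ t)

# Equivalent and numerically more robust: least squares via np.linalg.lstsq.
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)

print("weights (w0, w1):", w)
print("sum-of-squares error:", 0.5 * np.sum((t - X @ w) ** 2))
```

In practice one prefers np.linalg.lstsq or np.linalg.pinv over forming (X^T X)^{-1} explicitly, since X^T X can be badly conditioned.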
Probabilistic Interpretation

Noise Model
A linear function predicts a deterministic value y(\mathbf{x}_n, \mathbf{w}) for each input vector. We can turn this into a probabilistic prediction via the model
true value = predicted value + random noise.
Let's start with a Gaussian noise model:
t = y(\mathbf{x}, \mathbf{w}) + \varepsilon, \quad \text{where} \quad p(\varepsilon \mid \sigma^2) = \mathcal{N}(\varepsilon \mid 0, \sigma^2)

Curve Fitting With Noise [figure]

The Gaussian Distribution [figure]

Meet the Exponential Family
A common way to define a probability density p(x) is as an exponential function of x.
• Simple mathematical motivation: multiplying numbers between 0 and 1 yields a number between 0 and 1, e.g. (1/2)^n, (1/e)^x.
• Deeper mathematical motivation: exponential pdfs have good statistical properties for learning, e.g. conjugate priors, maximum likelihood, sufficient statistics.

Reading Exponential Probability Formulas
• Suppose there is a relevant feature f(x) and I want to express that "the greater f(x) is, the less probable x is".
• As f(x) increases, p(x) should decrease.
• Use p(x) = \alpha \exp(-f(x)).

Example: Exponential Form, Sample Size
• Fair coin: the longer the sample, the less likely it is: p(n) = 2^{-n}.
• So \ln p(n) decreases linearly in the sample size n.

Location Parameter
• The further x is from the center \mu, the less likely it is: \ln p(x) decreases with (x - \mu)^2.

Spread/Precision Parameter
• The greater the spread \sigma^2, the more likely x is (away from the mean).
• The greater the precision \beta = 1/\sigma^2, the less likely x is (away from the mean): \ln p(x) decreases with \beta.

Minimal Energy = Maximum Probability
• The greater the energy E(x) of the joint state, the less probable the state is: \ln p(x) decreases with E(x).

Normalization
Let p^*(x) be an unnormalized density function. To obtain a probability density function, we need a normalization constant \alpha such that
\int_{-\infty}^{\infty} \alpha\, p^*(x)\, dx = 1, \quad \text{hence} \quad \alpha = \frac{1}{\int_{-\infty}^{\infty} p^*(x)\, dx}.
For the Gaussian (Laplace 1782):
\int_{-\infty}^{\infty} \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right) dx = \sqrt{2\pi\sigma^2}.

Central Limit Theorem
The distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows (Laplace, 1810).
Example: N uniform [0,1] random variables.

Gaussian Likelihood Function
Exercise: assume a Gaussian noise model, so the likelihood function becomes (cf. Bishop, Eq. 3.10)
p(\mathbf{t} \mid X, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{1}{2\sigma^2}(t_n - \mathbf{w} \cdot \mathbf{x}_n)^2\right).
Show that the maximum likelihood solution minimizes the sum-of-squares error
E_X(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} (t_n - \mathbf{x}_n \cdot \mathbf{w})^2.

Regression With Basis Functions

Nonlinear Features
We can increase the power of linear regression by using functions of the input features instead of the input features themselves. These are called basis functions. Linear regression can then be used to assign weights to the basis functions.

Linear Basis Function Models (1)
Generally,
y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}),
where the \phi_j(\mathbf{x}) are known as basis functions and M is their number. Typically \phi_0(\mathbf{x}) = 1, so that w_0 acts as a bias. In the simplest case, we use linear basis functions: \phi_d(\mathbf{x}) = x_d.

Linear Basis Function Models (2)
Polynomial basis functions: \phi_j(x) = x^j. These are global, the same for all input vectors.

Linear Basis Function Models (3)
Gaussian basis functions:
\phi_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2s^2}\right).
These are local; a small change in x only affects nearby basis functions. \mu_j and s control location and scale (width). Related to kernel methods.

Linear Basis Function Models (4)
Sigmoidal basis functions:
\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right), \quad \text{where} \quad \sigma(a) = \frac{1}{1 + \exp(-a)}.
These are also local; a small change in x only affects nearby basis functions. \mu_j and s control location and scale (slope).

Basis Function Example Transformation [figure]

Limitations of Fixed Basis Functions
• M basis functions along each dimension of a D-dimensional input space require M^D basis functions: the curse of dimensionality.
• In later chapters, we shall see how we can get away with fewer basis functions by choosing them using the training data.
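To make the basis-function models above concrete, here is a short sketch that builds the design matrix Phi from a constant phi_0(x) = 1 plus Gaussian basis functions and then reuses the same least-squares machinery. The toy data, the number of basis functions, and the width s are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

def gaussian_design_matrix(x, centres, s):
    """Phi[n, j]: a constant basis function phi_0(x) = 1 followed by Gaussian
    bumps exp(-(x - mu_j)^2 / (2 s^2)) at the given centres."""
    bumps = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2.0 * s ** 2))
    return np.column_stack([np.ones_like(x), bumps])

# Illustrative 1-D data: noisy samples of a nonlinear function.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=30)

# 9 Gaussian basis functions spread over [0, 1]; s controls the width.
centres = np.linspace(0.0, 1.0, 9)
Phi = gaussian_design_matrix(x, centres, s=0.1)

# Same least-squares solution as before, now in basis-function space.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Predictions are linear in w but nonlinear in the original input x.
x_new = np.linspace(0.0, 1.0, 5)
print(gaussian_design_matrix(x_new, centres, s=0.1) @ w)
```

The model is still linear in w, so the pseudo-inverse solution carries over unchanged; only the features change.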
Overfitting and Regularization

Polynomial Curve Fitting [figure]
0th Order Polynomial [figure]
3rd Order Polynomial [figure]
9th Order Polynomial [figure]

Over-fitting
Root-Mean-Square (RMS) error: E_{RMS} = \sqrt{2E(\mathbf{w})/N}, evaluated at the fitted weights.

Polynomial Coefficients [table]

Data Set Size: 9th Order Polynomial [figure]
1st Order Polynomial [figure]
Data Set Size: 9th Order Polynomial [figure]

Quadratic Regularization
Penalize large coefficient values:
\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2

Regularization: [figure]
Regularization: [figure]
Regularization: vs. [figure]

Regularized Least Squares (1)
Consider the error function: data term + regularization term. With the sum-of-squares error function and a quadratic regularizer, we get
\frac{1}{2} \sum_{n=1}^{N} \{t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)\}^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w},
which is minimized by
\mathbf{w} = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T \mathbf{t},
where \Phi is the design matrix with entries \Phi_{nj} = \phi_j(\mathbf{x}_n). \lambda is called the regularization coefficient.

Regularized Least Squares (2)
With a more general regularizer, we have the penalty term \frac{\lambda}{2}\sum_{j} |w_j|^q: q = 1 gives the lasso, q = 2 the quadratic regularizer.

Regularized Least Squares (3)
The lasso tends to generate sparser solutions than a quadratic regularizer.

Evaluating Classifiers and Parameters: Cross-Validation

Evaluating Learners on Validation Set
Training data → learner → model; the model is evaluated on a separate validation set.

What If There Is No Validation Set?
What does the training error tell me about the generalization performance on a hypothetical validation set?
• Scenario 1: You run a big pharmaceutical company. Your new drug looks pretty good on the trials you have done so far. The government tells you to test it on another 10,000 patients.
• Scenario 2: Your friendly machine learning instructor provides you with another validation set.
What if you can't get more validation data?

Examples from Andrew Moore
Andrew Moore's slides

Cross-Validation for Evaluating Learners
Cross-validation estimates the performance of a learner on future unseen data: split the data into 4 folds, learn on 3 folds and test on the remaining one, and repeat so that each fold serves as the test set once.

Cross-Validation for Meta Methods
• If the learner requires setting a parameter, we can evaluate different parameter settings against the training data using training error or cross-validation.
• Then cross-validation is part of learning: training data → Learner(λ) → Model(λ).

Cross-Validation for Evaluating a Parameter Value
Cross-validation for a λ setting (e.g. λ = 1): learn with λ on 3 folds and test on 1, for each of the 4 folds. The average error over all 4 runs is the cross-validation estimate of the error for that λ value.

Stopping Criterion
For any type of validation (hold-out set validation, cross-validation).

Bayesian Linear Regression

Bayesian Linear Regression (1)
• Define a conjugate shrinkage prior over the weight vector w: p(\mathbf{w} \mid \sigma^2) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \sigma^2 I).
• Combining this with the likelihood function and using the results for marginal and conditional Gaussian distributions gives the posterior distribution over w.
• The log of the posterior = sum of squared errors + quadratic regularization.

Bayesian Linear Regression (3)
0 data points observed. [Figure: prior, data space]

Bayesian Linear Regression (4)
1 data point observed. [Figure: likelihood, posterior, data space]

Bayesian Linear Regression (5)
2 data points observed. [Figure: likelihood, posterior, data space]

Bayesian Linear Regression (6)
20 data points observed. [Figure: likelihood, posterior, data space]

Predictive Distribution (1)
• Predict t for new values of x by integrating over w.
• The integral can be solved analytically.
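Because the prior and the noise model are both Gaussian, the integration over w has a closed form. The sketch below implements the standard Gaussian-prior result (as in Bishop, Ch. 3): the posterior is N(w | m_N, S_N) with S_N^{-1} = alpha*I + beta*Phi^T Phi and m_N = beta*S_N Phi^T t, and the predictive distribution at a new input has mean m_N · phi(x) and variance 1/beta + phi(x)^T S_N phi(x). The prior precision alpha, the noise precision beta, and the toy data in the usage lines are assumptions for illustration.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Gaussian posterior N(w | m_N, S_N) for a zero-mean isotropic prior with
    precision alpha and Gaussian noise with precision beta."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predictive(phi_x, m_N, S_N, beta):
    """Predictive mean and variance at a single feature vector phi(x)."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var

# Illustrative usage with a linear basis phi(x) = (1, x) and assumed alpha, beta.
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=10)
t = -0.3 + 0.5 * x + rng.normal(0.0, 0.2, size=10)
Phi = np.column_stack([np.ones_like(x), x])
m_N, S_N = posterior(Phi, t, alpha=2.0, beta=25.0)
print(predictive(np.array([1.0, 0.5]), m_N, S_N, beta=25.0))
```

Note that m_N coincides with the regularized least-squares solution with λ = alpha/beta, which is the "log posterior = sum of squared errors + quadratic regularization" remark in code form.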
Predictive Distribution (2)
Example: sinusoidal data, 9 Gaussian basis functions, 1 data point. [Figure]

Predictive Distribution (3)
Example: sinusoidal data, 9 Gaussian basis functions, 2 data points. [Figure]

Predictive Distribution (4)
Example: sinusoidal data, 9 Gaussian basis functions, 4 data points. [Figure]

Predictive Distribution (5)
Example: sinusoidal data, 9 Gaussian basis functions, 25 data points. [Figure]
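As a closing sketch that ties the regularized least-squares solution to the cross-validation procedure from the earlier slides, the following chooses the regularization coefficient λ by 4-fold cross-validation. The noisy-sinusoid data, the 9th-order polynomial basis, the fold count, and the candidate λ grid are illustrative assumptions.

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Regularized least squares: w = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

def cv_error(Phi, t, lam, n_folds=4, seed=None):
    """Average held-out mean squared error: learn on n_folds - 1 folds, test on 1."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(t)), n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[i] for i in range(n_folds) if i != k])
        w = fit_ridge(Phi[train], t[train], lam)
        errors.append(np.mean((t[test] - Phi[test] @ w) ** 2))
    return np.mean(errors)

# Illustrative data: noisy sinusoid, fitted with a 9th-order polynomial basis.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=40)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, size=40)
Phi = np.vander(x, N=10, increasing=True)   # phi_j(x) = x^j, j = 0..9

# Choose the regularization coefficient from a grid by 4-fold cross-validation,
# then refit on all the data with the chosen value.
lams = [1e-8, 1e-6, 1e-4, 1e-2, 1.0]
best = min(lams, key=lambda lam: cv_error(Phi, t, lam, seed=0))
w = fit_ridge(Phi, t, best)
print("chosen lambda:", best)
```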