Linear Regression
Oliver Schulte
Machine Learning 726

The Linear Regression Model

Parameter Learning Scenarios
The general problem: predict the value of a continuous variable from one or more continuous features.
• Discrete parent node, discrete child node: maximum likelihood, decision trees.
• Discrete parent node, continuous child node: conditional Gaussian (not discussed).
• Continuous parent node, discrete child node: logit distribution (logistic regression).
• Continuous parent node, continuous child node: linear Gaussian (linear regression).

Example 1: Predict House Prices
Price vs. floor space of houses for sale in Berkeley, CA, July 2009.
[Figure: house price in $1000 vs. house size in square feet; Russell and Norvig]

Grading Example
• Predict: the final percentage mark for a student.
• Features: assignment grades, midterm exam, final exam.
• Questions we could ask:
  • I forgot the weights of the components. Can you recover them from a spreadsheet of the final grades?
  • I lost the final exam grades. How well can I still predict the final mark?
  • How important is each component, actually? Could I guess someone's final mark well given their assignments? Given their exams?

Line Fitting
Input: a data table X (N x D) and a target vector t (N x 1).
Output: a weight vector w (D x 1).
Prediction model: the predicted value is a weighted linear combination of the input features,
\hat{t}_n = \mathbf{x}_n \cdot \mathbf{w}, or in matrix form \hat{\mathbf{t}} = X\mathbf{w}.

Least Squares Error
We seek the closest fit of the predicted line to the data points.
Error = the sum of squared errors. Sensitive to outliers.
E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} (t_n - \mathbf{x}_n \cdot \mathbf{w})^2

Squared Error on House Price Example
[Figure 18.13, Russell and Norvig: squared-error fit of house price in $1000 vs. house size in square feet]

Intuition
Suppose that there is an exact solution and that the input matrix is invertible. Then we can find the solution by simple matrix inversion:
\mathbf{t} = X\mathbf{w} \Leftrightarrow \mathbf{w} = X^{-1}\mathbf{t}
Alas, X is hardly ever square, let alone invertible. But X^T X is square, and usually invertible. So multiply both sides of the equation by X^T, then invert:
\mathbf{t} = X\mathbf{w} \Leftrightarrow X^T\mathbf{t} = X^T X\mathbf{w} \Leftrightarrow \mathbf{w} = (X^T X)^{-1} X^T\mathbf{t}
The matrix (X^T X)^{-1} X^T is the pseudo-inverse of X.

Partial Derivative
Consider a single weight parameter w_j. The partial derivative of the error is
\frac{\partial E}{\partial w_j} = \sum_{n=1}^{N} (\mathbf{x}_n \cdot \mathbf{w} - t_n)\, x_{nj}
where \mathbf{x}_n \cdot \mathbf{w} is the prediction for input \mathbf{x}_n, so each term is the error on input \mathbf{x}_n weighted by x_{nj}. Moving against the gradient changes the weight to bring the prediction closer to the actual value.

Gradient
\frac{\partial E}{\partial w_j} = \sum_{n=1}^{N} (\mathbf{x}_n \cdot \mathbf{w} - t_n)\, x_{nj}
1. Find the gradient contribution for each input \mathbf{x}_n and add them up.
2. The result is a linear combination of the row vectors \mathbf{x}_n with coefficients \mathbf{x}_n \cdot \mathbf{w} - t_n:
\nabla E = \sum_{n=1}^{N} (\mathbf{x}_n \cdot \mathbf{w} - t_n)\, \mathbf{x}_n = X^T (X\mathbf{w} - \mathbf{t})

Solution: The Pseudo-Inverse
0 = \nabla E = X^T (X\mathbf{w} - \mathbf{t}) \Leftrightarrow X^T X\mathbf{w} = X^T\mathbf{t}
Assume that X^T X is invertible. Then the solution is given by the pseudo-inverse:
\mathbf{w} = (X^T X)^{-1} X^T \mathbf{t}

The w_0 Offset
Recall the formula for the partial derivative, and that x_{n0} = 1 for all n. Write \mathbf{w}^* = (w_1, w_2, \ldots, w_D) for the weight vector without w_0, and similarly \mathbf{x}^*_n = (x_{n1}, x_{n2}, \ldots, x_{nD}) for the n-th feature vector without the "dummy" input. Then
\frac{\partial E}{\partial w_0} = \sum_{n=1}^{N} (\mathbf{w}^* \cdot \mathbf{x}^*_n + w_0 - t_n)
Setting the partial derivative to 0, we get
w_0 = \frac{1}{N}\sum_{n=1}^{N} t_n - \frac{1}{N}\sum_{n=1}^{N} \mathbf{w}^* \cdot \mathbf{x}^*_n
i.e., the average target value minus the average predicted value (without the offset).

Geometric Interpretation
Any vector of the form y = Xw is a linear combination of the columns (variables) of X. If y is the least squares approximation, then y is the orthogonal projection of t onto this subspace.
[Figure, Bishop: \phi_i denotes column vector i]
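Before turning to the probabilistic interpretation, here is a minimal NumPy sketch of the closed-form solution from the preceding slides. The house-size and price numbers are invented for illustration (they are not the Berkeley data); the point is the dummy x_{n0} = 1 column that makes w_0 the offset, and the pseudo-inverse solution w = (X^T X)^{-1} X^T t.

```python
import numpy as np

# Toy data, made up for illustration: house size in square feet -> price in $1000.
sizes = np.array([900.0, 1400.0, 1800.0, 2300.0, 3000.0])
prices = np.array([350.0, 480.0, 540.0, 700.0, 880.0])

# Design matrix X with a dummy column of ones (x_n0 = 1) so that w_0 acts as the offset.
X = np.column_stack([np.ones_like(sizes), sizes])
t = prices

# Closed-form solution of the normal equations X^T X w = X^T t.
w = np.linalg.solve(X.T @ X, X.T @ t)

# Equivalent and numerically more robust: least squares via np.linalg.lstsq.
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)

print("weights (w0, w1):", w)
print("sum-of-squares error:", 0.5 * np.sum((t - X @ w) ** 2))
```

In practice one prefers np.linalg.lstsq or np.linalg.pinv over forming (X^T X)^{-1} explicitly, since X^T X can be badly conditioned.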
Probabilistic Interpretation

Noise Model
A linear function predicts a deterministic value y(\mathbf{x}_n, \mathbf{w}) for each input vector. We can turn this into a probabilistic prediction via the model
true value = predicted value + random noise.
Let's start with a Gaussian noise model:
t = y(\mathbf{x}, \mathbf{w}) + \varepsilon, \quad \text{where} \quad p(\varepsilon \mid \sigma^2) = \mathcal{N}(\varepsilon \mid 0, \sigma^2)

Curve Fitting With Noise [figure]

The Gaussian Distribution [figure]

Meet the Exponential Family
A common way to define a probability density p(x) is as an exponential function of x.
• Simple mathematical motivation: multiplying numbers between 0 and 1 yields a number between 0 and 1, e.g. (1/2)^n, (1/e)^x.
• Deeper mathematical motivation: exponential pdfs have good statistical properties for learning, e.g. conjugate priors, maximum likelihood, sufficient statistics.

Reading Exponential Probability Formulas
• Suppose there is a relevant feature f(x) and I want to express that "the greater f(x) is, the less probable x is".
• As f(x) increases, p(x) should decrease.
• Use p(x) = \alpha \exp(-f(x)).

Example: Exponential Form, Sample Size
• Fair coin: the longer the sample, the less likely it is: p(n) = 2^{-n}.
• So \ln p(n) decreases linearly in the sample size n.

Location Parameter
• The further x is from the center \mu, the less likely it is: \ln p(x) decreases with (x - \mu)^2.

Spread/Precision Parameter
• The greater the spread \sigma^2, the more likely x is (away from the mean).
• The greater the precision \beta = 1/\sigma^2, the less likely x is (away from the mean): \ln p(x) decreases with \beta.

Minimal Energy = Maximum Probability
• The greater the energy E(x) of the joint state, the less probable the state is: \ln p(x) decreases with E(x).

Normalization
Let p^*(x) be an unnormalized density function. To obtain a probability density function, we need a normalization constant \alpha such that
\int_{-\infty}^{\infty} \alpha\, p^*(x)\, dx = 1, \quad \text{hence} \quad \alpha = \frac{1}{\int_{-\infty}^{\infty} p^*(x)\, dx}.
For the Gaussian (Laplace 1782):
\int_{-\infty}^{\infty} \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right) dx = \sqrt{2\pi\sigma^2}.

Central Limit Theorem
The distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows (Laplace, 1810).
Example: N uniform [0,1] random variables.

Gaussian Likelihood Function
Exercise: assume a Gaussian noise model, so the likelihood function becomes (cf. Bishop, Eq. 3.10)
p(\mathbf{t} \mid X, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{1}{2\sigma^2}(t_n - \mathbf{w} \cdot \mathbf{x}_n)^2\right).
Show that the maximum likelihood solution minimizes the sum-of-squares error
E_X(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} (t_n - \mathbf{x}_n \cdot \mathbf{w})^2.

Regression With Basis Functions

Nonlinear Features
We can increase the power of linear regression by using functions of the input features instead of the input features themselves. These are called basis functions. Linear regression can then be used to assign weights to the basis functions.

Linear Basis Function Models (1)
Generally,
y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}),
where the \phi_j(\mathbf{x}) are known as basis functions and M is their number. Typically \phi_0(\mathbf{x}) = 1, so that w_0 acts as a bias. In the simplest case, we use linear basis functions: \phi_d(\mathbf{x}) = x_d.

Linear Basis Function Models (2)
Polynomial basis functions: \phi_j(x) = x^j. These are global, the same for all input vectors.

Linear Basis Function Models (3)
Gaussian basis functions:
\phi_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2s^2}\right).
These are local; a small change in x only affects nearby basis functions. \mu_j and s control location and scale (width). Related to kernel methods.

Linear Basis Function Models (4)
Sigmoidal basis functions:
\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right), \quad \text{where} \quad \sigma(a) = \frac{1}{1 + \exp(-a)}.
These are also local; a small change in x only affects nearby basis functions. \mu_j and s control location and scale (slope).

Basis Function Example Transformation [figure]

Limitations of Fixed Basis Functions
• M basis functions along each dimension of a D-dimensional input space require M^D basis functions: the curse of dimensionality.
• In later chapters, we shall see how we can get away with fewer basis functions by choosing them using the training data.
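To make the basis-function models above concrete, here is a short sketch that builds the design matrix Phi from a constant phi_0(x) = 1 plus Gaussian basis functions and then reuses the same least-squares machinery. The toy data, the number of basis functions, and the width s are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

def gaussian_design_matrix(x, centres, s):
    """Phi[n, j]: a constant basis function phi_0(x) = 1 followed by Gaussian
    bumps exp(-(x - mu_j)^2 / (2 s^2)) at the given centres."""
    bumps = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2.0 * s ** 2))
    return np.column_stack([np.ones_like(x), bumps])

# Illustrative 1-D data: noisy samples of a nonlinear function.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=30)

# 9 Gaussian basis functions spread over [0, 1]; s controls the width.
centres = np.linspace(0.0, 1.0, 9)
Phi = gaussian_design_matrix(x, centres, s=0.1)

# Same least-squares solution as before, now in basis-function space.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Predictions are linear in w but nonlinear in the original input x.
x_new = np.linspace(0.0, 1.0, 5)
print(gaussian_design_matrix(x_new, centres, s=0.1) @ w)
```

The model is still linear in w, so the pseudo-inverse solution carries over unchanged; only the features change.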
Overfitting and Regularization

Polynomial Curve Fitting [figure]
0th Order Polynomial [figure]
3rd Order Polynomial [figure]
9th Order Polynomial [figure]

Over-fitting
Root-Mean-Square (RMS) error: E_{RMS} = \sqrt{2E(\mathbf{w})/N}, evaluated at the fitted weights.

Polynomial Coefficients [table]

Data Set Size: 9th Order Polynomial [figure]
1st Order Polynomial [figure]
Data Set Size: 9th Order Polynomial [figure]

Quadratic Regularization
Penalize large coefficient values:
\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2

Regularization: [figure]
Regularization: [figure]
Regularization: vs. [figure]

Regularized Least Squares (1)
Consider the error function: data term + regularization term. With the sum-of-squares error function and a quadratic regularizer, we get
\frac{1}{2} \sum_{n=1}^{N} \{t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)\}^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w},
which is minimized by
\mathbf{w} = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T \mathbf{t},
where \Phi is the design matrix with entries \Phi_{nj} = \phi_j(\mathbf{x}_n). \lambda is called the regularization coefficient.

Regularized Least Squares (2)
With a more general regularizer, we have the penalty term \frac{\lambda}{2}\sum_{j} |w_j|^q: q = 1 gives the lasso, q = 2 the quadratic regularizer.

Regularized Least Squares (3)
The lasso tends to generate sparser solutions than a quadratic regularizer.

Evaluating Classifiers and Parameters: Cross-Validation

Evaluating Learners on Validation Set
Training data → learner → model; the model is evaluated on a separate validation set.

What If There Is No Validation Set?
What does the training error tell me about the generalization performance on a hypothetical validation set?
• Scenario 1: You run a big pharmaceutical company. Your new drug looks pretty good on the trials you have done so far. The government tells you to test it on another 10,000 patients.
• Scenario 2: Your friendly machine learning instructor provides you with another validation set.
What if you can't get more validation data?

Examples from Andrew Moore
Andrew Moore's slides

Cross-Validation for Evaluating Learners
Cross-validation estimates the performance of a learner on future unseen data: split the data into 4 folds, learn on 3 folds and test on the remaining one, and repeat so that each fold serves as the test set once.

Cross-Validation for Meta Methods
• If the learner requires setting a parameter, we can evaluate different parameter settings against the training data using training error or cross-validation.
• Then cross-validation is part of learning: training data → Learner(λ) → Model(λ).

Cross-Validation for Evaluating a Parameter Value
Cross-validation for a λ setting (e.g. λ = 1): learn with λ on 3 folds and test on 1, for each of the 4 folds. The average error over all 4 runs is the cross-validation estimate of the error for that λ value.

Stopping Criterion
For any type of validation (hold-out set validation, cross-validation).

Bayesian Linear Regression

Bayesian Linear Regression (1)
• Define a conjugate shrinkage prior over the weight vector w: p(\mathbf{w} \mid \sigma^2) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \sigma^2 I).
• Combining this with the likelihood function and using the results for marginal and conditional Gaussian distributions gives the posterior distribution over w.
• The log of the posterior = sum of squared errors + quadratic regularization.

Bayesian Linear Regression (3)
0 data points observed. [Figure: prior, data space]

Bayesian Linear Regression (4)
1 data point observed. [Figure: likelihood, posterior, data space]

Bayesian Linear Regression (5)
2 data points observed. [Figure: likelihood, posterior, data space]

Bayesian Linear Regression (6)
20 data points observed. [Figure: likelihood, posterior, data space]

Predictive Distribution (1)
• Predict t for new values of x by integrating over w.
• The integral can be solved analytically.
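Because the prior and the noise model are both Gaussian, the integration over w has a closed form. The sketch below implements the standard Gaussian-prior result (as in Bishop, Ch. 3): the posterior is N(w | m_N, S_N) with S_N^{-1} = alpha*I + beta*Phi^T Phi and m_N = beta*S_N Phi^T t, and the predictive distribution at a new input has mean m_N · phi(x) and variance 1/beta + phi(x)^T S_N phi(x). The prior precision alpha, the noise precision beta, and the toy data in the usage lines are assumptions for illustration.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Gaussian posterior N(w | m_N, S_N) for a zero-mean isotropic prior with
    precision alpha and Gaussian noise with precision beta."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predictive(phi_x, m_N, S_N, beta):
    """Predictive mean and variance at a single feature vector phi(x)."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var

# Illustrative usage with a linear basis phi(x) = (1, x) and assumed alpha, beta.
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=10)
t = -0.3 + 0.5 * x + rng.normal(0.0, 0.2, size=10)
Phi = np.column_stack([np.ones_like(x), x])
m_N, S_N = posterior(Phi, t, alpha=2.0, beta=25.0)
print(predictive(np.array([1.0, 0.5]), m_N, S_N, beta=25.0))
```

Note that m_N coincides with the regularized least-squares solution with λ = alpha/beta, which is the "log posterior = sum of squared errors + quadratic regularization" remark in code form.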
Predictive Distribution (2)
Example: sinusoidal data, 9 Gaussian basis functions, 1 data point. [Figure]

Predictive Distribution (3)
Example: sinusoidal data, 9 Gaussian basis functions, 2 data points. [Figure]

Predictive Distribution (4)
Example: sinusoidal data, 9 Gaussian basis functions, 4 data points. [Figure]

Predictive Distribution (5)
Example: sinusoidal data, 9 Gaussian basis functions, 25 data points. [Figure]
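As a closing sketch that ties the regularized least-squares solution to the cross-validation procedure from the earlier slides, the following chooses the regularization coefficient λ by 4-fold cross-validation. The noisy-sinusoid data, the 9th-order polynomial basis, the fold count, and the candidate λ grid are illustrative assumptions.

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Regularized least squares: w = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

def cv_error(Phi, t, lam, n_folds=4, seed=None):
    """Average held-out mean squared error: learn on n_folds - 1 folds, test on 1."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(t)), n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[i] for i in range(n_folds) if i != k])
        w = fit_ridge(Phi[train], t[train], lam)
        errors.append(np.mean((t[test] - Phi[test] @ w) ** 2))
    return np.mean(errors)

# Illustrative data: noisy sinusoid, fitted with a 9th-order polynomial basis.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=40)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, size=40)
Phi = np.vander(x, N=10, increasing=True)   # phi_j(x) = x^j, j = 0..9

# Choose the regularization coefficient from a grid by 4-fold cross-validation,
# then refit on all the data with the chosen value.
lams = [1e-8, 1e-6, 1e-4, 1e-2, 1.0]
best = min(lams, key=lambda lam: cv_error(Phi, t, lam, seed=0))
w = fit_ridge(Phi, t, best)
print("chosen lambda:", best)
```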