Machine Learning and Data Mining
Linear regression
(adapted from) Prof. Alexander Ihler
Supervised learning
• Notation
– Features x
– Targets y
– Predictions ŷ
– Parameters θ
[Diagram: the supervised learning loop]
• Training data (examples): features x, plus feedback / target values y
• Program (“Learner”): characterized by some “parameters” θ; a procedure (using θ) that outputs a prediction
• Score performance (“cost function”)
• Learning algorithm: change θ to improve performance
Linear regression
[Figure: training data (target y vs. feature x) with a fitted line]
• Define form of function f(x) explicitly
• Find a good f(x) within that family
• “Predictor”: evaluate the line r = θ₀ + θ₁ x; return r
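A minimal Matlab sketch of this predictor, assuming made-up values for θ₀, θ₁ and a few feature values:
% Evaluate the line yhat = th0 + th1*x at a few feature values
th0 = 1.5;  th1 = 0.8;     % example parameters (not fitted)
x    = [2; 5; 11];         % example feature values
yhat = th0 + th1 * x;      % the predictor: return r = yhat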
More dimensions?
[Figure: two 3-D views of data over features x₁, x₂ with target y, fit by a plane]
• Fit a plane: ŷ = θ₀ + θ₁ x₁ + θ₂ x₂
Notation
• Define “feature” x₀ = 1 (constant)
• Then ŷ = θ₀ x₀ + θ₁ x₁ + … + θₙ xₙ = θ xᵀ, with x = [1, x₁, …, xₙ] and θ = [θ₀, θ₁, …, θₙ]
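As a small sketch (toy numbers, same row-vector convention as the Matlab code later in the deck), the constant feature lets the whole prediction be one matrix product:
x1 = [2; 5; 11];            % raw feature column (m x 1)
X  = [ones(size(x1)) x1];   % add constant feature x0 = 1 as the first column
th = [1.5 0.8];             % parameter row vector [th0 th1]
yhat = th * X';             % 1 x m row of predictions, one per example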
Measuring error
[Figure: observations vs. the fitted line, with vertical residuals]
• Observation: y
• Prediction: ŷ(x) = θ xᵀ
• Error or “residual”: e = y − ŷ(x)
Mean squared error
• How can we quantify the error? Mean squared error:
  J(θ) = (1/m) Σᵢ ( y⁽ⁱ⁾ − ŷ(x⁽ⁱ⁾) )²
• Could choose something else, of course…
– Computationally convenient (more later)
– Measures the variance of the residuals
– Corresponds to likelihood under a Gaussian model of the “noise”
MSE cost function
• Rewrite using matrix form: J(θ) = (1/m) (yᵀ − θXᵀ)(yᵀ − θXᵀ)ᵀ, where X has one example per row (x₀ = 1 in the first column)
(Matlab) >> e = y' - th*X'; J = e*e'/m;
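For concreteness, a toy sketch of this computation (made-up data, conventions as above):
X  = [1 2; 1 5; 1 11];      % rows = examples, first column = constant feature
y  = [3; 6; 10];            % m x 1 targets (made up)
th = [1.5 0.8];             % candidate parameters
m  = length(y);
e  = y' - th*X';            % residuals, 1 x m
J  = e*e'/m;                % mean squared error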
Visualizing the cost function
[Figure: the MSE cost plotted as a surface and as contours over (θ₀, θ₁)]
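One way to generate such a picture — a sketch that reuses X and y from the toy MSE example above and sweeps an arbitrary grid of θ values:
[t0, t1] = meshgrid(-40:2:40, -1:0.1:3);   % grid of (th0, th1) values (ranges arbitrary)
J = zeros(size(t0));
for i = 1:numel(t0)
    e    = y' - [t0(i) t1(i)]*X';          % residual for this (th0, th1)
    J(i) = e*e'/length(y);                 % MSE at this point of the surface
end
surfc(t0, t1, J);                          % surface with contours underneath
xlabel('\theta_0'); ylabel('\theta_1'); zlabel('J(\theta)');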
Finding good parameters
• Want to find parameters which minimize our error…
• Think of a cost “surface”: the error residual for each value of θ…
Machine Learning and Data Mining
Linear regression: direct minimization
(adapted from) Prof. Alexander Ihler
MSE Minimum
• Consider a simple problem
– One feature, two data points
– Two unknowns: θ₀, θ₁
– Two equations:
  y⁽¹⁾ = θ₀ + θ₁ x⁽¹⁾
  y⁽²⁾ = θ₀ + θ₁ x⁽²⁾
• Can solve this system directly: θ = yᵀ (Xᵀ)⁻¹ (small sketch below)
• However, most of the time, m > n
– There may be no linear function that hits all the data exactly
– Instead, solve directly for the minimum of the MSE function
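A toy sketch of the two-point case, with X square so the system can be solved exactly (numbers made up):
X  = [1 2; 1 5];     % two examples, constant feature included
y  = [3; 6];         % two targets
th = y' / X';        % exact solution of th*X' = y'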
SSE Minimum
• Setting the MSE gradient to zero and reordering, we have θ = yᵀ X (XᵀX)⁻¹
• X (XᵀX)⁻¹ is called the “pseudo-inverse” (of Xᵀ)
• If Xᵀ is square and nonsingular (independent columns), this is exactly (Xᵀ)⁻¹
• If m > n: overdetermined; gives the minimum-MSE fit
Matlab SSE
• This is easy to solve in Matlab…
% y = [y1 ; ... ; ym]                          % m x 1 column of targets
% X = [x1_0 ... x1_n ; x2_0 ... x2_n ; ...]    % one example per row
% Solution 1: "manual"
th = y' * X * inv(X' * X);
% Solution 2: "mrdivide"
th = y' / X';                % th*X' = y'  =>  th = y'/X'
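A slightly longer sketch on made-up overdetermined data (m > n), checking that both solutions agree:
x1 = [2; 5; 7; 11];          % single raw feature
X  = [ones(size(x1)) x1];    % add constant feature x0 = 1
y  = [3; 6; 7; 10];          % targets (made up)
th1 = y' * X * inv(X' * X);  % Solution 1: "manual" pseudo-inverse
th2 = y' / X';               % Solution 2: mrdivide
% th1 and th2 agree up to numerical round-off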
Effects of MSE choice
• Sensitivity to outliers
[Figure: data with a single outlier; the squared loss gives that one datum a cost of 16² — a heavy penalty for large errors (numeric sketch below)]
L1 error
[Figure: fitted lines compared — L2 on the original data, L1 on the original data, and L1 on the outlier data]
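A tiny sketch of why the squared cost is so sensitive: with made-up residuals containing one outlier of size 16, compare its contribution under L2 and L1:
e  = [0.5 -1.2 0.3 16];       % residuals; the last one is the outlier
L2 = sum(e.^2) / length(e);   % outlier contributes 16^2 = 256 to the sum
L1 = sum(abs(e)) / length(e); % outlier contributes only 16 to the sum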
Cost functions for regression
J(θ) = (1/m) Σᵢ ( y⁽ⁱ⁾ − θ x⁽ⁱ⁾ᵀ )²     (MSE)
J(θ) = (1/m) Σᵢ | y⁽ⁱ⁾ − θ x⁽ⁱ⁾ᵀ |      (MAE)
Something else entirely…                (???)
• “Arbitrary” cost functions can’t be solved in closed form…
– use gradient descent
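A minimal batch gradient-descent sketch for the MSE cost (step size and iteration count are arbitrary; X and y as in the earlier toy examples):
alpha = 0.01;                 % learning rate (arbitrary)
th = zeros(1, size(X,2));     % start from th = 0
m  = length(y);
for iter = 1:5000
    e    = y' - th*X';        % residuals, 1 x m
    grad = -(2/m) * e * X;    % gradient of J(th) = e*e'/m
    th   = th - alpha * grad; % take a step downhill
end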
Machine Learning and Data Mining
Linear regression: nonlinear features
(adapted from) Prof. Alexander Ihler
Nonlinear functions
• What if our hypotheses are not lines?
– Ex: higher-order polynomials
[Figure: the same data fit with an order-1 polynomial (left) and an order-3 polynomial (right)]
Nonlinear functions
• Single feature x, predict target y:  ŷ = θ₀ + θ₁ x
• Add features:  x → [1, x, x², x³]
• Linear regression in the new features:  ŷ = θ₀ + θ₁ x + θ₂ x² + θ₃ x³
• Sometimes useful to think of a “feature transform” Φ(x), so that ŷ = θ Φ(x)ᵀ
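A quick sketch of the feature-transform view on made-up data — build Φ(x) explicitly and reuse the same solver:
x = [1; 2; 4; 6; 9];                  % raw feature (toy data)
y = [2; 3; 7; 8; 12];                 % targets (toy data)
Phi = [ones(size(x)) x x.^2 x.^3];    % Phi(x) = [1 x x^2 x^3], one row per example
th  = y' / Phi';                      % linear regression in the new features
yhat = th * Phi';                     % cubic predictions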
Higher-order polynomials
• Fit in the same way
• More “features”
[Figure: the same data fit with polynomials of increasing order (order 1, order 2, …)]
Features
• In general, can use any features we think are useful
• Other information about the problem
– Sq. footage, location, age, …
• Polynomial functions
– Features [1, x, x², x³, …]
• Other functions
– 1/x, √x, x₁·x₂, …
• “Linear regression” = linear in the parameters
– Features we can make as complex as we want!
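As a sketch (all values made up), any such transforms simply become columns of X, and the fit is unchanged:
x1 = [2; 5; 7; 8; 11; 3; 6; 9];   % two raw features (made up)
x2 = [1; 4; 9; 3; 6; 2; 8; 5];
y  = [3; 6; 7; 9; 12; 4; 8; 10];  % targets (made up)
X  = [ones(size(x1)) x1 x2 1./x1 sqrt(x2) x1.*x2];   % hand-chosen features
th = y' / X';                     % still linear in the parameters th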
Higher-order polynomials
• Are more features better?
• “Nested” hypotheses
– 2nd order is more general than 1st,
– 3rd order is more general than 2nd, …
• Fits the observed data better
Overfitting and complexity
• More complex models will always fit the training data better
• But they may “overfit” the training data, learning complex relationships that are not really present
[Figure: the same data (y vs. x) fit by a complex model and a simple model]
Test data
• After training the model…
• Go out and get more data from the world
– New observations (x, y)
• How well does our model perform on them?
[Figure: training data and new “test” data plotted against the fitted model]
Training versus test error
• Plot MSE as a function of model complexity
– Polynomial order
• Training error decreases
– A more complex function fits the training data better
• What about new data?
• 0th to 1st order
– Test error decreases
– Underfitting
• Higher order
– Test error increases
– Overfitting
[Figure: training MSE and test (“new data”) MSE vs. polynomial order]
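A sketch of this experiment on made-up data, fitting on the training set only and scoring both sets as the order grows (uses implicit expansion to build the polynomial features):
xtr = [1; 2; 4; 6; 9];   ytr = [2; 3; 7; 8; 12];   % training data (made up)
xte = [3; 5; 8];         yte = [5; 7; 10];         % new "test" data (made up)
Jtr = zeros(1,5);  Jte = zeros(1,5);
for k = 0:4
    Ptr = xtr .^ (0:k);          % polynomial features [1 x ... x^k]
    Pte = xte .^ (0:k);
    th  = ytr' / Ptr';           % fit on training data only
    etr = ytr' - th*Ptr';   Jtr(k+1) = etr*etr'/length(ytr);   % training MSE
    ete = yte' - th*Pte';   Jte(k+1) = ete*ete'/length(yte);   % test MSE
end
plot(0:4, Jtr, 'o-', 0:4, Jte, 's-');
xlabel('Polynomial order');  legend('Training MSE', 'Test MSE');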