
Machine Learning and Data Mining

Linear regression

(adapted from) Prof. Alexander Ihler

Supervised learning

• Notation

– Features x

– Targets y

– Predictions ŷ

– Parameters θ

[Diagram: the supervised learning loop]
Training data (examples) supply features and feedback / target values to the program ("learner"), which is characterized by some "parameters" θ. A procedure using θ outputs a prediction; a cost function scores its performance against the target values, and the learning algorithm changes θ to improve that performance.

Linear regression

[Figure: training data plotted as target y versus feature x, with a candidate straight-line fit]

• Define form of function f(x) explicitly

• Find a good f(x) within that family

"Predictor":
  Evaluate line:  r = θ₀ + θ₁ x₁
  return r
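A minimal Matlab sketch of such a predictor; the parameter values and feature values below are made up for illustration:

(Matlab)
th0 = 1.5;  th1 = 0.8;        % example parameters (illustrative values)
x1  = [2; 5; 10];             % example feature values (illustrative)
r   = th0 + th1 * x1;         % evaluate the line at each x1
disp(r)                       % "return" / display the predictions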

More dimensions?

[Figure: two 3-D scatter plots of target y against features x1 and x2; with two features the linear predictor becomes a plane, ŷ = θ₀ + θ₁ x₁ + θ₂ x₂]

Notation

Define "feature" x₀ = 1 (constant). Then

  ŷ(x) = θ₀ x₀ + θ₁ x₁ + … + θₙ xₙ = θ xᵀ   (θ and x treated as row vectors)

Measuring error

[Figure: data points and a fitted line; for one point, the vertical gap between the observation y⁽ʲ⁾ and the prediction ŷ(x⁽ʲ⁾) is marked as the error, or "residual"]

Mean squared error

• How can we quantify the error? A standard choice is the mean squared error on the training data:

    J(θ) = (1/m) Σⱼ ( y⁽ʲ⁾ − ŷ(x⁽ʲ⁾) )²

• Could choose something else, of course… but MSE has useful properties:
  – Computationally convenient (more later)
  – Measures the variance of the residuals
  – Corresponds to the likelihood under a Gaussian model of the "noise"
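To make the last bullet concrete, here is a brief sketch of that correspondence, assuming i.i.d. Gaussian noise with a fixed variance σ² (the σ² notation is introduced here for the sketch):

\[
y^{(j)} = \theta\, x^{(j)\mathsf{T}} + \epsilon^{(j)}, \qquad \epsilon^{(j)} \sim \mathcal{N}(0, \sigma^2)
\]
\[
\log p(y \mid X, \theta) = -\frac{1}{2\sigma^2} \sum_{j=1}^{m} \bigl( y^{(j)} - \theta\, x^{(j)\mathsf{T}} \bigr)^2 + \text{const},
\]

so maximizing the likelihood over θ is equivalent to minimizing the MSE.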

MSE cost function

• Rewrite using matrix form:

    J(θ) = (1/m) (yᵀ − θ Xᵀ)(yᵀ − θ Xᵀ)ᵀ

  where y is the m×1 vector of targets, X is the m×n matrix whose rows are the feature vectors, and θ is a 1×n row vector of parameters.

(Matlab) >> e = y' - th*X'; J = e*e'/m;
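A runnable toy version of this computation; the data and the parameter guess are made up for illustration:

(Matlab)
X  = [1 2; 1 5; 1 7; 1 10];   % rows are examples; first column is the constant feature x0 = 1
y  = [3; 8; 11; 16];          % target values (made up)
th = [1 1.4];                 % a guess at the parameters, as a 1-by-n row vector
e  = y' - th*X';              % 1-by-m row of residuals
J  = e*e'/length(y);          % mean squared error
disp(J)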

Visualizing the cost function

[Figure: the MSE cost plotted as a surface / contours over the parameters (θ₀, θ₁); each setting of θ corresponds to one candidate regression line through the data]


Finding good parameters

• Want to find the parameters θ which minimize our error…
• Think of a cost "surface": the error residual for each possible θ…


Machine Learning and Data Mining

Linear regression: direct minimization

(adapted from) Prof. Alexander Ihler

MSE Minimum

• Consider a simple problem
  – One feature, two data points
  – Two unknowns: θ₀, θ₁
  – Two equations:
      y⁽¹⁾ = θ₀ + θ₁ x⁽¹⁾
      y⁽²⁾ = θ₀ + θ₁ x⁽²⁾
• Can solve this system directly (two equations, two unknowns)
• However, most of the time, m > n
  – There may be no linear function that hits all the data exactly
  – Instead, solve directly for the minimum of the MSE function
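As a sketch of what "solve directly for the minimum" amounts to, one can set the gradient of the matrix-form MSE to zero (using the deck's conventions: θ a 1×n row vector, X the m×n matrix of stacked feature rows, y the m×1 target vector):

\[
J(\theta) = \frac{1}{m}\,\bigl(y^{\mathsf T} - \theta X^{\mathsf T}\bigr)\bigl(y^{\mathsf T} - \theta X^{\mathsf T}\bigr)^{\mathsf T}
\]
\[
\nabla_{\theta} J = \frac{2}{m}\,\bigl(\theta X^{\mathsf T} - y^{\mathsf T}\bigr) X = 0
\;\;\Longrightarrow\;\;
\theta\, X^{\mathsf T} X = y^{\mathsf T} X
\;\;\Longrightarrow\;\;
\theta = y^{\mathsf T} X \,\bigl(X^{\mathsf T} X\bigr)^{-1}.
\]

This is the closed-form solution used on the next slide.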

SSE Minimum

• Reordering, we have  θ = yᵀ X (XᵀX)⁻¹,  i.e.  θᵀ = (XᵀX)⁻¹ Xᵀ y
• (XᵀX)⁻¹ Xᵀ is called the "pseudo-inverse" of X
• If X is square and full rank, this is just the ordinary inverse X⁻¹
• If m > n: the system is overdetermined; this gives the minimum-MSE fit

Matlab SSE

• This is easy to solve in Matlab…

(Matlab)
% y = [y1 ; … ; ym]                     (m-by-1 column of targets)
% X = [x1_0 … x1_n ; x2_0 … x2_n ; …]   (row j holds example j's features)

% Solution 1: "manual"
th = y' * X * inv(X' * X);

% Solution 2: "mrdivide"
th = y' / X';        % th*X' = y'  =>  th = y'/X'
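A runnable toy check of the two solutions; the data are made up, and note that in practice X \ y or pinv(X)*y is usually preferred numerically over forming inv(X'*X) explicitly:

(Matlab)
X = [1 1; 1 2; 1 4; 1 6; 1 9];     % m = 5 examples, n = 2 features (constant + one input)
y = [2.1; 3.9; 8.2; 12.1; 17.8];   % made-up targets
th1 = y' * X * inv(X' * X);        % Solution 1: "manual" normal-equation solution
th2 = y' / X';                     % Solution 2: mrdivide, solves th*X' = y' in the least-squares sense
disp([th1; th2])                   % the two rows should agree up to numerical precision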

Effects of MSE choice

• Sensitivity to outliers

[Figure: a data set containing one large outlier; under squared error, that single datum contributes a 16² cost: a heavy penalty for large errors]

L1 error

[Figure: regression fits under L2 and L1 losses, with and without the outlier; legend: "L2, original data", "L1, original data", "L1, outlier data"]

Cost functions for regression

• Mean squared error (MSE):    J(θ) = (1/m) Σⱼ ( y⁽ʲ⁾ − ŷ(x⁽ʲ⁾) )²
• Mean absolute error (MAE):   J(θ) = (1/m) Σⱼ | y⁽ʲ⁾ − ŷ(x⁽ʲ⁾) |
• Something else entirely… (???)

"Arbitrary" cost functions can't be solved in closed form; use gradient descent instead.
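As a minimal sketch of gradient descent on the MSE cost (the step size, iteration count, and data below are illustrative choices, not from the slides):

(Matlab)
X = [1 1; 1 2; 1 4; 1 6; 1 9];     % rows are examples; first column is the constant feature
y = [2.1; 3.9; 8.2; 12.1; 17.8];   % made-up targets
m = length(y);
th = zeros(1, size(X,2));          % initial parameters
alpha = 0.01;                      % step size, chosen by hand
for iter = 1:5000
    e  = y' - th*X';               % residuals, 1-by-m
    dJ = -(2/m) * e * X;           % gradient of the MSE with respect to th
    th = th - alpha * dJ;          % take a step downhill
end
disp(th)                           % should approach the closed-form solution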


Machine Learning and Data Mining

Linear regression: nonlinear features

(adapted from) Prof. Alexander Ihler

Nonlinear functions

• What if our hypotheses are not lines?

– Ex: higher-order polynomials

[Figure: the same data fit with an order-1 polynomial (a straight line) and an order-3 polynomial]

Nonlinear functions

• Single feature x, predict target y:
    ŷ = θ₀ + θ₁ x + θ₂ x² + θ₃ x³ + …
• Add features:
    [ 1, x, x², x³, … ]
• Linear regression in the new features
• Sometimes useful to think of this as a "feature transform": map x to the new feature vector, then fit exactly as before
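A short Matlab sketch of this feature-transform idea; the data and the choice of a cubic transform are illustrative:

(Matlab)
x = [1; 2; 4; 6; 8; 10];               % made-up 1-D inputs
y = [1.2; 3.8; 9.5; 13.0; 14.2; 14.8]; % made-up targets
Phi = [ones(size(x)) x x.^2 x.^3];     % feature transform: each row becomes [1, x, x^2, x^3]
th  = y' / Phi';                       % ordinary linear regression in the new features
yhat = th * Phi';                      % predictions on the training inputs
disp(yhat)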

Higher-order polynomials

• Fit in the same way
• More "features"

[Figure: polynomial fits of increasing order to the same data (order-1 and order-2 panels shown)]

Features

• In general, can use any features we think are useful
• Other information about the problem
  – Sq. footage, location, age, …
• Polynomial functions
  – Features [1, x, x², x³, …]
• Other functions
  – 1/x, sqrt(x), x₁ * x₂, …
• "Linear regression" = linear in the parameters
  – Features we can make as complex as we want!

Higher-order polynomials

• Are more features better?
• "Nested" hypotheses
  – 2nd order is more general than 1st,
  – 3rd order more general than 2nd, …
• Fits the observed data better

Overfitting and complexity

• More complex models will always fit the training data better
• But they may "overfit" the training data, learning complex relationships that are not really present

[Figure: a complex model and a simple model fit to the same y-versus-x data]

Test data

• After training the model
• Go out and get more data from the world
  – New observations (x, y)
• How well does our model perform?

[Figure: the training data and the new "test" data plotted together with the fitted model]
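A minimal sketch of scoring a trained model on new data; th and the held-out examples below are made up so the snippet runs on its own:

(Matlab)
th = [0.3 1.9];                       % parameters (in practice, taken from the training fit)
Xtest = [1 3; 1 5; 1 8];              % hypothetical held-out examples (same feature layout)
ytest = [6.0; 9.8; 15.9];             % hypothetical held-out targets
etest = ytest' - th*Xtest';           % residuals on the new data
Jtest = etest*etest'/length(ytest);   % test MSE
disp(Jtest)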

Training versus test error

• Plot MSE as a function of model complexity
  – Polynomial order
• Training error decreases
  – A more complex function fits the training data better
• What about new data?
• 0th to 1st order
  – Error decreases
  – Underfitting
• Higher order
  – Error increases
  – Overfitting

[Figure: training MSE and MSE on new "test" data plotted against polynomial order]
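A sketch of how such a plot could be produced; the toy data, the polynomial-feature construction, and the range of orders are all illustrative (the feature line uses implicit expansion, so it needs Matlab R2016b+ or Octave):

(Matlab)
xtr = [1; 2; 4; 6; 8; 10];   ytr = [1.0; 3.7; 9.6; 12.8; 14.5; 14.9];   % made-up training data
xte = [3; 5; 7; 9];          yte = [6.5; 11.2; 13.9; 14.7];             % made-up test data
orders = 0:5;
Jtr = zeros(size(orders));   Jte = zeros(size(orders));
for k = 1:numel(orders)
    p = orders(k);
    Phitr = xtr .^ (0:p);                   % polynomial features [1, x, ..., x^p] per row
    Phite = xte .^ (0:p);
    th = ytr' / Phitr';                     % least-squares fit in the transformed features
    Jtr(k) = mean((ytr' - th*Phitr').^2);   % training MSE at this order
    Jte(k) = mean((yte' - th*Phite').^2);   % test MSE at this order
end
plot(orders, Jtr, '-o', orders, Jte, '-s');
legend('Training data', 'New, "test" data');
xlabel('Polynomial order');  ylabel('MSE');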
