Linear regression with gradient descent

Ingmar Schuster
Patrick Jähnichen
using slides by Andrew Ng
Institut für Informatik
This lecture covers
● Linear Regression
  – Hypothesis formulation, hypothesis space
  – Optimizing cost with gradient descent
  – Using multiple input features with linear regression
● Feature scaling
● Nonlinear regression
● Optimizing cost using derivatives
Linear Regression

Price for buying a flat in Berlin
● Supervised learning problem
  – Expected answer available for each example in the data
● Regression problem
  – Prediction of a continuous output
Training data of flat prices
● Notation:
  – m: number of training examples
  – x: input (predictor) variable ("features" in ML-speak)
  – y: output (response) variable

  Square meters (x)   Price in 1000 € (y)
  73                  174
  146                 367
  38                  69
  124                 257
  ...                 ...
Learning procedure
● Training data is fed to the learning algorithm, which outputs a hypothesis h
  (a mapping between input and output): size of flat → estimated price
● Linear regression with one input variable (univariate):
  h_θ(x) = θ_0 + θ_1 x
● θ_0 and θ_1 are the hypothesis parameters
● How to choose the parameters?
Optimization objective
● The purpose of a learning algorithm is expressed in an optimization
  objective and a cost function (often called J), e.g.
  – fit the data well
  – few false negatives
  – few false positives
  – ...

Fitting data well: least squares cost function
● In regression we almost always want to fit the data well
  – smallest average distance to the points in the training data
    (h(x) close to y for every (x, y) in the training data)
● The cost function, often named J:
  J(θ_0, θ_1) = 1/(2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
  where m is the number of training instances
● Squaring the deviations means
  – the penalty for positive and negative deviations is the same
  – the penalty for large deviations is stronger
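As a minimal NumPy sketch, the least squares cost can be computed directly from its definition; the sample values are the flat prices from the training-data table above:

```python
import numpy as np

# Least squares cost J(theta0, theta1) = 1/(2m) * sum_i (h(x_i) - y_i)^2
def compute_cost(theta0, theta1, x, y):
    m = len(x)
    predictions = theta0 + theta1 * x      # h_theta(x) for every training example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Flat prices from the table above (square meters -> price in 1000 EUR)
x = np.array([73.0, 146.0, 38.0, 124.0])
y = np.array([174.0, 367.0, 69.0, 257.0])
example_cost = compute_cost(0.0, 2.0, x, y)  # cost of the hypothesis h(x) = 2x
```

A hypothesis that reproduces every training example exactly has cost 0; any deviation makes J strictly positive.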
Optimizing Cost with Gradient Descent
Gradient Descent Outline
● Want to minimize J(θ_0, θ_1)
● Start with random θ_0, θ_1
● Keep changing θ_0, θ_1 to reduce J(θ_0, θ_1) until we end up at a minimum
3D plots and contour plots
[3D surface and contour plots of the cost function; plot by Andrew Ng]
● Stepwise descent towards the minimum
● Derivatives work only for few parameters
Gradient descent
● Repeat until convergence:
  θ_j := θ_j − α · ∂/∂θ_j J(θ_0, θ_1)   (simultaneously for j = 0 and j = 1)
  where ∂/∂θ_j is the partial derivative and α the learning rate
● Beware: an incremental update (updating θ_0 before computing θ_1's
  derivative) is incorrect!
● Steps become smaller towards the minimum without changing the learning rate
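One simultaneous update step can be sketched in Python as follows; the toy data and the learning rate are illustrative choices, not values from the slides:

```python
import numpy as np

def gradient_descent_step(theta0, theta1, x, y, alpha):
    # Compute BOTH partial derivatives first, then update: this is the
    # simultaneous update the slide warns about. Updating theta0 first and
    # reusing it for theta1's derivative would be the incorrect incremental update.
    m = len(x)
    error = theta0 + theta1 * x - y        # h_theta(x_i) - y_i
    grad0 = np.sum(error) / m              # dJ/dtheta0
    grad1 = np.sum(error * x) / m          # dJ/dtheta1
    return theta0 - alpha * grad0, theta1 - alpha * grad1

# Toy data on the line y = 2x: theta1 should approach 2, theta0 should approach 0
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x
theta0, theta1 = 0.0, 0.0
for _ in range(2000):
    theta0, theta1 = gradient_descent_step(theta0, theta1, x, y, alpha=0.1)
```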
Learning Rate considerations
● A small learning rate leads to slow convergence
● An overly large learning rate may not lead to convergence, or may even
  lead to divergence
● Often ...
Checking convergence
● Gradient descent works correctly if J decreases with every step
● Possible convergence criterion: converged if J decreases by less than a
  small constant
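The convergence criterion from the slide can be made concrete as a stopping rule; a sketch for the univariate case, where ε, α, and the data are illustrative choices:

```python
import numpy as np

def run_until_converged(x, y, alpha, eps=1e-9, max_iters=100_000):
    # Stop as soon as J decreases by less than eps between two steps.
    m = len(x)
    theta0 = theta1 = 0.0
    def cost(t0, t1):
        return np.sum((t0 + t1 * x - y) ** 2) / (2 * m)
    prev = cost(theta0, theta1)
    for step in range(1, max_iters + 1):
        error = theta0 + theta1 * x - y
        theta0, theta1 = (theta0 - alpha * np.sum(error) / m,
                          theta1 - alpha * np.sum(error * x) / m)
        cur = cost(theta0, theta1)
        if prev - cur < eps:     # also triggers if J *increases* (alpha too large)
            return theta0, theta1, step
        prev = cur
    return theta0, theta1, max_iters

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x
theta0, theta1, steps = run_until_converged(x, y, alpha=0.1)
```

Monitoring the cost this way also surfaces a badly chosen learning rate: if J ever increases, the difference `prev - cur` turns negative and the loop stops immediately.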
Local Minima
● Gradient descent can get stuck at local minima
  (e.g. when J is not the squared error of regression with only one variable)
● Remedy: random restart with different parameter(s)
Variants of Gradient Descent

Using multiple input features
Multiple features

  Square meters (x1)  Bedrooms (x2)  Floors (x3)  Age in years (x4)  Price in 1000 € (y)
  200                 5              1            45                 460
  131                 3              2            40                 232
  142                 3              2            30                 315
  756                 2              1            36                 178
  …                   …              …            …                  …

● Notation: n is the number of features, x^(i) the input of the i-th
  training example, and x_j^(i) the value of feature j in the i-th example
Hypothesis representation
● h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + … + θ_n x_n
● More compact with the definition x_0 = 1:
  h_θ(x) = θᵀx
Gradient descent for multiple variables
● Generalized cost function:
  J(θ) = 1/(2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
● Generalized gradient descent, repeated until convergence:
  θ_j := θ_j − α · ∂/∂θ_j J(θ)   (simultaneously for all j = 0, …, n)

Partial derivative of cost function for multiple variables
● Calculating the partial derivative:
  ∂/∂θ_j J(θ) = 1/m · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x_j^(i)

Gradient descent for multiple variables
● Simplified gradient descent, repeated until convergence:
  θ_j := θ_j − α/m · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x_j^(i)
  (simultaneously for all j)
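The simplified update for all parameters at once can be vectorized; a sketch that assumes a design matrix X without the x_0 = 1 column (the code prepends it), with synthetic data and α as illustrative choices:

```python
import numpy as np

def gradient_descent(X, y, alpha, iters):
    # theta_j := theta_j - alpha/m * sum_i (h(x_i) - y_i) * x_ij, for all j at once
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])     # prepend x0 = 1 for the intercept
    theta = np.zeros(n + 1)
    for _ in range(iters):
        error = Xb @ theta - y               # h_theta(x_i) - y_i for every example
        theta -= alpha / m * (Xb.T @ error)  # all partial derivatives in one product
    return theta

# Synthetic data with known parameters: y = 1 + 2*x1 + 3*x2
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(50, 2))
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]
theta = gradient_descent(X, y, alpha=0.5, iters=5000)
```

The matrix product `Xb.T @ error` evaluates Σ_i (h_θ(x^(i)) − y^(i)) · x_j^(i) for every j simultaneously, which also guarantees the simultaneous update.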
Conversion considerations for multiple variables
● With multiple variables, the variances of the features are no longer
  comparable (scales can vary strongly):
  – Square meters: 30 to 400
  – Bedrooms: 1 to 10
  – Price: 80 000 to 2 000 000
● Gradient descent converges faster when all features are on a similar scale
Feature Scaling
Feature scaling
● Different approaches exist for converting features to a comparable scale
● Min-max scaling makes all data fall into the range [0, 1]:
  x_j^(i) := (x_j^(i) − min_j) / (max_j − min_j)
  (for a single data point of feature j)
● Z-score conversion (next slide)
Z-score conversion
● Center the data on 0
● Scale the data so the majority falls into the range [−1, 1]
● Z-score conversion of a single data point of feature j:
  x_j^(i) := (x_j^(i) − μ_j) / σ_j
  where μ_j is the empirical mean (expected value) and σ_j the empirical
  standard deviation of feature j
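Both conversions are NumPy one-liners; the sample values are the square-meter column from the multiple-features table:

```python
import numpy as np

def min_max_scale(x):
    # Maps every value into [0, 1]
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    # Center on the empirical mean, divide by the empirical standard deviation
    return (x - x.mean()) / x.std()

square_meters = np.array([200.0, 131.0, 142.0, 756.0])
scaled = min_max_scale(square_meters)
standardized = z_score(square_meters)
```

After z-score conversion the feature has mean 0 and standard deviation 1 by construction, which puts all features on the similar scale that gradient descent prefers.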
Visualizing standard deviation
Nonlinear Regression (by cheap trickery)
Nonlinear Regression Problems
Nonlinear Regression Problems (linear approximation)
Nonlinear Regression Problems (nonlinear hypothesis)
Nonlinear Regression with cheap trickery
● Linear regression can be used for nonlinear problems
● Choose a nonlinear hypothesis space, e.g. with polynomial features:
  h_θ(x) = θ_0 + θ_1 x + θ_2 x² + θ_3 x³
● The hypothesis is nonlinear in x but still linear in the parameters θ,
  so the same machinery applies
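A sketch of the trick: expand x into polynomial feature columns, then fit exactly as in linear regression. Here `np.linalg.lstsq` stands in for the fitting step (gradient descent on the expanded features would work as well); the cubic target is an illustrative choice:

```python
import numpy as np

def polynomial_features(x, degree):
    # Columns x, x^2, ..., x^degree: nonlinear in x, but linear in theta
    return np.column_stack([x ** d for d in range(1, degree + 1)])

x = np.linspace(-2.0, 2.0, 40)
y = 1 - x + 0.5 * x ** 3                   # a cubic target function
Xb = np.hstack([np.ones((len(x), 1)), polynomial_features(x, degree=3)])
theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # fits theta for h = Xb @ theta
```

Because the target here is itself a cubic, the fit recovers the generating parameters (θ = 1, −1, 0, 0.5).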
Optimizing cost using derivatives
Comparison Gradient Descent vs. Setting derivative = 0
● Instead of gradient descent, solve
  ∂/∂θ_i J(θ) = 0   for all i
Comparison Gradient Descent vs. Setting derivative = 0

Gradient descent
● Need to choose the learning rate α
● Needs many iterations, random restarts etc.
● Works well for many features

Setting derivative = 0
● No need to choose α
● No iterations
● Slow for many features
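For the squared-error cost, solving ∂J/∂θ_i = 0 for all i yields a linear system whose solution is the well-known normal equation θ = (XᵀX)⁻¹Xᵀy; a sketch, assuming XᵀX is invertible (`np.linalg.solve` avoids forming the inverse explicitly), with illustrative synthetic data:

```python
import numpy as np

def normal_equation(X, y):
    # Solve (Xb^T Xb) theta = Xb^T y in one shot: no learning rate, no iterations.
    Xb = np.hstack([np.ones((len(y), 1)), X])
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(30, 2))
y = 4 + 0.5 * X[:, 0] - 2 * X[:, 1]        # known parameters to recover
theta = normal_equation(X, y)
```

Solving this n+1-dimensional system costs roughly O(n³), which is the "slow for many features" drawback from the comparison above.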
Pictures
● Some public domain plots from en.wikipedia.org and de.wikipedia.org