Intro to Machine Learning

More Machine Learning
Linear Regression
Squared Error
L1 and L2 Regularization
Gradient Descent
Recall: Key Components of Intelligent Agents
Representation Language: Graphs, Bayes Nets
Inference Mechanism: A*, variable elimination, Gibbs sampling
Learning Mechanism: Maximum Likelihood, Laplace Smoothing, and many more: linear regression, perceptron, k-Nearest Neighbor, …
Evaluation Metric: Likelihood, and many more: squared error, 0-1 loss, conditional likelihood, precision/recall, …
Recall: Types of Learning
The techniques we have discussed so far are examples of a particular kind of learning:
Supervised: the training examples included the correct labels or outputs.
Vs. Unsupervised (or semi-supervised, or distantly-supervised, …): None (or some, or only part, …) of the
labels in the training data are known.
Parameter Estimation: We only tried to learn the parameters in the BN, not the structure of the BN graph.
Vs. Structure learning: The BN graph is not given as an input, and the learning algorithm’s job is to figure
out what the graph should look like.
The distinctions below aren’t actually about the learning algorithm itself, but rather about the type of
model being learned:
Classification: the output is a discrete value, like Happy or not Happy, or Spam or Ham.
Vs. Regression: the output is a real number.
Generative: The model of the data represents a full joint distribution over all relevant variables.
Vs. Discriminative: The model assumes some fixed subset of the variables will always be “inputs” or
“evidence”, and it creates a distribution for the remaining variables conditioned on the evidence variables.
Parametric vs. Nonparametric: I will explain this later.
We won’t talk much about structure learning, but we will cover some other kinds of learning (regression,
unsupervised, discriminative, nonparametric, …) in later lectures.
Regression vs. Classification
Our Naive Bayes (NBC) spam detector was a classifier:
the output Y was one of two options, Ham or Spam.
More generally, classifiers give an output from a (usually small) finite
(or countably infinite) set of options.
E.g., predicting who will win the presidency in the next election is a
classification problem (finite set of possible outcomes: US citizens).
Regression models give a real number as output.
E.g., predicting what the temperature will be tomorrow is a regression
problem. Any real number greater than or equal to 0 (Kelvin) is a
possible outcome.
Quiz: regression vs. classification
For each prediction task below, determine whether regression or classification is more appropriate.
Predict who will win the Super Bowl next year
Predict the gender of a baby when it's born
Predict the weight of a child one year from now
Predict the average life expectancy of all babies born today
Predict the price of Apple, Inc.'s stock at the close of trading tomorrow
Predict whether Microsoft or Apple will have a higher valuation at the close of trading tomorrow
Answers: regression vs. classification
For each prediction task below, determine whether regression or classification is more appropriate.
Predict who will win the Super Bowl next year: Classification
Predict the gender of a baby when it's born: Classification
Predict the weight of a child one year from now: Regression
Predict the average life expectancy of all babies born today: Regression
Predict the price of Apple, Inc.'s stock at the close of trading tomorrow: Regression
Predict whether Microsoft or Apple will have a higher valuation at the close of trading tomorrow: Classification
Concrete Example
[Figure: scatter plot of house price ($, 0 to 250,000) vs. square footage (0 to 3,000).]
Suppose I want to buy a house that's 2000 square feet.
Predict how much it will cost.
More realistic data
[Figure: Reported Crime Statistics for U.S. Counties. Scatter plot of violent crime per capita vs. percentage of the population under the federal poverty level.]
Linear Regression
Suppose there are N input variables, X1, …, XN (all real
numbers).
A linear regression is a function that looks like this:
Y = w0 + w1X1 + w2X2 + … + wNXN
The wi variables are called weights or parameters. Each one is
a real number.
The set of all functions that look like this (one function for
each choice of weights w0 through wN) is called the
Hypothesis Class for linear regression.
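To make the hypothesis class concrete, here is a minimal Python sketch (my own illustration, not from the slides; the example weights are hypothetical) of one member of the class: a function that maps inputs to a prediction for one fixed choice of weights.

```python
def linear_hypothesis(weights, xs):
    """One member of the hypothesis class: Y = w0 + w1*X1 + ... + wN*XN.

    weights: [w0, w1, ..., wN]  (N+1 real numbers)
    xs:      [X1, ..., XN]      (N real numbers)
    """
    y = weights[0]                     # the intercept term w0
    for w, x in zip(weights[1:], xs):  # add each wi * Xi
        y += w * x
    return y

# Hypothetical weights for the house-price example (one input: square footage)
print(linear_hypothesis([50000, 75], [2000]))  # 50000 + 75*2000 = 200000
```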
Hypotheses
[Figure: the house-price scatter plot with several straight lines from the hypothesis class drawn over it, labeled with their equations (e.g., 100+900*X1 and 80000+270*X1).]
In this example, there is only one input variable: X1 is square footage.
The hypothesis class is all functions Y = w0 + w1 * (square footage).
Several example elements of the hypothesis class are drawn.
Learning for Linear Regression
Linear regression tells us a whole set of possible functions to
use for prediction.
How do we choose the best one from this set?
This is the learning problem for linear regression:
Input: a set of training examples, where each example
contains a value for (X1, …, XN, Y)
Output: a set of weights (w0, …, wN) for the “best-fitting”
linear regression model.
Quiz: Learning for Linear Regression
X    Y
10   80
30   40
15   70
55  -10
For the data above, what's the best-fit linear regression model?
Answer: Learning for Linear Regression
X    Y
10   80
30   40
15   70
55  -10
For the data above, what's the best-fit linear regression model?
The first two points give two equations in the two unknowns w0 and w1:
80 = w0 + w1 * 10
40 = w0 + w1 * 30
Subtracting the second equation from the first:
80 - 40 = w1 * 10 - w1 * 30
40 = w1 * (-20)
w1 = -2
Substituting back into the first equation:
80 = w0 + (-2) * 10
w0 = 100
Y = 100 + (-2) * X
(You can check that the remaining points also lie on this line.)
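For a quick machine check of this calculation (my own sketch, not from the lecture), the two equations can be written as a 2x2 linear system and solved with numpy:

```python
import numpy as np

# Two equations in (w0, w1):  w0 + 10*w1 = 80  and  w0 + 30*w1 = 40
A = np.array([[1.0, 10.0],
              [1.0, 30.0]])
b = np.array([80.0, 40.0])

w0, w1 = np.linalg.solve(A, b)
print(w0, w1)  # approximately 100.0 and -2.0
```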
Linear Regression with Noisy Data
[Figure: a scatter plot of house price ($) vs. square footage where the points are only approximately linear.]
In the previous example, we could use only two points and find a line that passed through all
of the remaining points.
In this example, points are only “approximately” linear. No single line passes through all
points exactly. We’ll need a more complex algorithm to handle this.
Quadratic Loss (a.k.a. “Squared Error”)
Let’s write our training data D with this notation:
Example 1: (X1_1, X2_1, …, XN_1, Y_1)
Example 2: (X1_2, X2_2, …, XN_2, Y_2)
…
Example M: (X1_M, X2_M, …, XN_M, Y_M)

Define

LOSS(f, D) = Σ_i (Y_i - f(X1_i, …, XN_i))²
           = Σ_i (Y_i - w0 - w1*X1_i - … - wN*XN_i)²
Intuitively, this is how much error the function makes on
the training data.
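As an illustration (my own sketch, not part of the slides), the quadratic loss can be computed directly from this definition; the example call reuses the small dataset from the earlier quiz.

```python
def quadratic_loss(weights, examples):
    """Sum of squared errors of the linear function defined by `weights`
    over `examples`, where each example is ([X1, ..., XN], Y)."""
    total = 0.0
    for xs, y in examples:
        pred = weights[0] + sum(w * x for w, x in zip(weights[1:], xs))
        total += (y - pred) ** 2
    return total

# Dataset from the earlier quiz, and the weights we found for it:
D = [([10.0], 80.0), ([30.0], 40.0), ([15.0], 70.0), ([55.0], -10.0)]
print(quadratic_loss([100.0, -2.0], D))  # this line fits exactly, so the loss is 0.0
```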
Objective Function
The goal of a linear regression is to find the best linear
function. We’ll say that “best” means the one with the
least amount of quadratic loss.
Mathematically, we say we want f* that satisfies:

f*(X1, …, XN) = argmin_{w0, …, wN} LOSS(w0 + w1*X1 + … + wN*XN, D)
We call LOSS the objective function for our training
algorithm, since it’s the function we’re trying to minimize.
Closed-form Solution for 1 input variable

f*(X1) = argmin_{w0, w1} LOSS(w0 + w1*X1, D)

To minimize the LOSS function, we'll take the partial derivatives and set them to zero. First, the derivative with respect to w1:

∂LOSS/∂w1 = ∂/∂w1 Σ_i (Y_i - w0 - w1*X1_i)²
          = -2 Σ_i (Y_i - w0 - w1*X1_i)*X1_i

Set this expression equal to zero:

Σ_i (X1_i*Y_i - w0*X1_i - w1*X1_i²) = 0

w1 * Σ_i X1_i² = Σ_i X1_i*Y_i - w0 * Σ_i X1_i

w1 = (Σ_i X1_i*Y_i - w0 * Σ_i X1_i) / (Σ_i X1_i²)
Closed-form Solution for 1 input variable

f*(X1) = argmin_{w0, w1} LOSS(w0 + w1*X1, D)

Next, the partial derivative with respect to w0:

∂LOSS/∂w0 = ∂/∂w0 Σ_i (Y_i - w0 - w1*X1_i)²
          = -2 Σ_i (Y_i - w0 - w1*X1_i)

Set this expression equal to zero:

Σ_i (Y_i - w0 - w1*X1_i) = 0

w0 = (1/M) Σ_i Y_i - (w1/M) Σ_i X1_i
“Closed-form” Result

From the second equation:

w0 = (1/M) Σ_i Y_i - (w1/M) Σ_i X1_i

Substituting for w0 in the equation for w1 gives:

w1 = (Σ_i X1_i*Y_i - [(1/M) Σ_i Y_i - (w1/M) Σ_i X1_i] * Σ_i X1_i) / (Σ_i X1_i²)

w1 * Σ_i X1_i² = Σ_i X1_i*Y_i - (1/M)(Σ_i Y_i)(Σ_i X1_i) + (w1/M)(Σ_i X1_i)²

w1 * [Σ_i X1_i² - (1/M)(Σ_i X1_i)²] = Σ_i X1_i*Y_i - (1/M)(Σ_i Y_i)(Σ_i X1_i)

w1 = (Σ_i X1_i*Y_i - (1/M)(Σ_i Y_i)(Σ_i X1_i)) / (Σ_i X1_i² - (1/M)(Σ_i X1_i)²)
Quiz: Learning for Linear Regression

w1 = (Σ_i X1_i*Y_i - (1/M)(Σ_i Y_i)(Σ_i X1_i)) / (Σ_i X1_i² - (1/M)(Σ_i X1_i)²)
w0 = (1/M) Σ_i Y_i - (w1/M) Σ_i X1_i

X    Y
10   80
30   40
15   70
55  -10

Using the closed-form solution for Quadratic Loss, compute w0 and w1 for this dataset.
Answer: Learning for Linear Regression

w1 = (Σ_i X1_i*Y_i - (1/M)(Σ_i Y_i)(Σ_i X1_i)) / (Σ_i X1_i² - (1/M)(Σ_i X1_i)²)
w0 = (1/M) Σ_i Y_i - (w1/M) Σ_i X1_i

X    Y
10   80
30   40
15   70
55  -10

Using the closed-form solution for Quadratic Loss, compute w0 and w1 for this dataset.

w1 = (800 + 1200 + 1050 - 550 - (1/4)*180*110) / (100 + 900 + 225 + 3025 - (1/4)*110²)
   = -2450 / 1225
   = -2

w0 = (1/4)*180 - (-2/4)*110 = 45 + 55 = 100

Note that w1 and w0 match what we calculated before!
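As a sanity check (my own sketch, not from the slides), the closed-form formulas can be implemented in a few lines of Python and run on this dataset:

```python
def fit_line_closed_form(xs, ys):
    """Closed-form least-squares fit for Y = w0 + w1*X (one input variable)."""
    M = len(xs)
    sum_x  = sum(xs)
    sum_y  = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    w1 = (sum_xy - sum_x * sum_y / M) / (sum_x2 - sum_x ** 2 / M)
    w0 = sum_y / M - w1 * sum_x / M
    return w0, w1

xs = [10, 30, 15, 55]
ys = [80, 40, 70, -10]
print(fit_line_closed_form(xs, ys))  # expected: (100.0, -2.0)
```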
Overfitting and Regularization
It is very common to use a technique called regularization to
combat overfitting for linear methods.
Regularization changes the objective function for training by
adding a penalty for the size of the weights:
LOSS(f, D) = Σ_i (Y_i - f(X1_i, …, XN_i))² + Σ_j |wj|^p
(the first term is the usual squared error; the second term is the "parameter loss")
When p=1, this is called L1 regularization.
When p=2, this is called L2 regularization.
1 and 2 are by far the two most commonly-used values of p.
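Here is a sketch of how the penalty changes the earlier loss computation (my own illustration; whether to include w0 in the penalty, and how strongly to weight the penalty, are choices the slides leave open):

```python
def regularized_loss(weights, examples, p):
    """Quadratic loss plus an Lp penalty on the weights (L1 when p=1, L2 when p=2).
    In practice the penalty is usually scaled by a constant and w0 is often left
    out of it; this sketch follows the slide's formula and omits both choices."""
    data_loss = 0.0
    for xs, y in examples:
        pred = weights[0] + sum(w * x for w, x in zip(weights[1:], xs))
        data_loss += (y - pred) ** 2
    parameter_loss = sum(abs(w) ** p for w in weights)
    return data_loss + parameter_loss

D = [([10.0], 80.0), ([30.0], 40.0), ([15.0], 70.0), ([55.0], -10.0)]
print(regularized_loss([100.0, -2.0], D, p=2))  # 0 data loss + 100^2 + 2^2 = 10004.0
```

In practice, the penalty term is usually multiplied by a constant that controls how strongly regularization is applied.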
Gradient Descent
For more complex loss functions, it is often NOT
POSSIBLE to find closed-form solutions.
Instead, people resort to “iterative methods” that
iteratively find better and better parameter
estimates, until they converge to the best setting.
We’ll go over one example of this kind of method,
called “gradient descent”.
Gradient Descent
Gradient Descent Algorithm
Notation: wj^i denotes weight j at iteration i. Set i ← 0.
1. (w0^0, w1^0) ← some initial values (often zero)
2. While |w1^i - w1^(i-1)| + |w0^i - w0^(i-1)| > threshold:
       for each j:
           wj^(i+1) ← wj^i - α * ∂LOSS/∂wj (evaluated at the current weights)
       i ← i+1
3. Return (w0^i, w1^i)
α is the learning rate.
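Below is a runnable Python translation of this pseudocode for the one-variable quadratic loss (my own sketch; the learning rate, threshold, iteration cap, and reuse of the earlier quiz dataset are choices I made so the loop converges).

```python
def gradient_descent_fit(xs, ys, alpha=1e-4, threshold=1e-6, max_iters=200000):
    """Gradient descent on the quadratic loss for Y = w0 + w1*X.

    alpha is the learning rate; the loop stops when the weights move by less
    than `threshold` in one step (or after max_iters steps, as a safety cap)."""
    w0, w1 = 0.0, 0.0                       # step 1: initialize (often zero)
    for _ in range(max_iters):
        # partial derivatives of sum_i (y_i - w0 - w1*x_i)^2
        d_w0 = sum(-2 * (y - w0 - w1 * x) for x, y in zip(xs, ys))
        d_w1 = sum(-2 * (y - w0 - w1 * x) * x for x, y in zip(xs, ys))
        new_w0 = w0 - alpha * d_w0          # step 2: move against the gradient
        new_w1 = w1 - alpha * d_w1
        change = abs(new_w0 - w0) + abs(new_w1 - w1)
        w0, w1 = new_w0, new_w1
        if change <= threshold:             # stop when the update is tiny
            break
    return w0, w1                           # step 3: return the weights

xs = [10, 30, 15, 55]
ys = [80, 40, 70, -10]
print(gradient_descent_fit(xs, ys))  # should be close to (100.0, -2.0)
```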
Quiz: Gradient
[Figure: a LOSS curve plotted against a single weight w, with points a, b, and c marked along the w axis.]
For each of the points a, b, and c: is ∂LOSS/∂w positive, about zero, or negative? Check the boxes that apply.
Answer: Gradient
[Figure: the same LOSS curve, with the sign of ∂LOSS/∂w at each of the points a, b, and c marked in the answer table.]
For each of the points a, b, and c: is ∂LOSS/∂w positive, about zero, or negative?
Quiz: Gradient
[Figure: a LOSS curve plotted against w, with points a, b, and c marked along the w axis.]
Where is ∂LOSS/∂w the largest: at a, at b, at c, or is it equal everywhere?
Answer: Gradient
[Figure: the same LOSS curve, with the point where ∂LOSS/∂w is largest marked.]
Where is ∂LOSS/∂w the largest: at a, at b, at c, or is it equal everywhere?
Quiz: Gradient Descent
[Figure: a LOSS curve plotted against w, with points a, b, and c marked along the w axis.]
Which point will allow gradient descent to reach the global minimum, if it is used as the initialization for parameter w: a, b, or c?
Answer: Gradient Descent
[Figure: the same LOSS curve, with the correct initialization point marked.]
Which point will allow gradient descent to reach the global minimum, if it is used as the initialization for parameter w: a, b, or c?