
Session 5

Machine Learning Programming
BDA712-00
Logistic Regression I:
Classification with Logistic Regression
Lecturer: Josué Obregón PhD
Kyung Hee University
Department of Big Data Analytics
October 5, 2022
Previously, in our course…
• Walking the gradient (Gradient descent algorithm)
• A discerning machine (From regression to classification)
• Your first learning program (Building a tiny supervised learning program)
• Hyperspace! (Multiple linear regression)
• Getting real (Recognize a single digit using MNIST)
And today…
• A discerning machine (From regression to classification)
Today's agenda
• What is Classification?
• Logistic Regression
• Implementation of Logistic Regression
• Intuition behind the loss function
• Transforming linear regression to logistic regression
Some motivation: Default data
• default: Has the customer defaulted on their debt? (values: Yes, No)
• student: Is the customer a student? (values: Yes, No)
• balance: Average credit card balance remaining after the monthly payment
• income: Income of the customer
Goal: Use student, balance and income information to predict
whether a customer is likely to default on their debts.
Classification
While introductory classes in statistics tend to focus on Regression, many real-world
tasks are Classification problems.
• An online banking service needs to determine whether a particular transaction is
fraudulent or not.
• A firm running a phone marketing campaign needs to identify which individuals would
be most likely to adopt their product if contacted.
• A law firm has obtained all the incoming and outgoing emails from a large corporation
accused of discriminatory hiring and promotion practices. They must identify which
documents are responsive (relevant to the case) and which are irrelevant.
• A judge deciding whether to release a defendant on bail may want to know the
likelihood that the defendant will show up for their trial.
• A client using satellite imaging wants to classify patches of images according to
whether they’re water, rural, or urban.
What is the classification task?
• For the purpose of this discussion, we’ll assume that we’re doing
binary classification: Our response Y has two possible values {0, 1}
• E.g., in the Default classification task,
  $Y = \begin{cases} 0 & \text{if No} \\ 1 & \text{if Yes} \end{cases}$
• Our goal is to predict Y using the input variables student, balance
and income
Is there an “ideal” classifier?
• When we talked about Regression, our goal was to minimize the Mean-Squared-Error, and
we saw that the best we could ever do at modeling an outcome Y from inputs $X = (X_1, X_2, \ldots, X_p)$ is to use the regression function
  $f(x) = E(Y \mid X = x)$
• In Classification, our goal is now to minimize the error rate:
  $\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$
  (a short computational sketch follows this list)
• It turns out that the ideal classifier assigns a test observation with predictor values $x_0$ to
the class $j$ that has the largest value of
  $P(Y = j \mid X = x_0)$
  i.e., the class that has the highest probability (is the most likely) given the observed
predictor vector $x_0$.
• This classification rule is called the Bayes classifier
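The error rate above is straightforward to compute; here is a minimal Python sketch with made-up labels (the arrays are purely illustrative):

```python
import numpy as np

y     = np.array([0, 1, 1, 0, 1])   # true labels
y_hat = np.array([0, 1, 0, 0, 0])   # predicted labels

error_rate = np.mean(y != y_hat)    # (1/n) * sum_i I(y_i != y_hat_i)
print(error_rate)                   # 0.4
```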
Ideal regressor vs. Ideal classifier
• The Bayes classifier in classification plays the same role as the regression function in prediction:
they are the best we could ever hope to do, but they are unknown functions that we seek to estimate
from the data
• It is interesting to note that when Y is binary, $E(Y \mid X = x) = P(Y = 1 \mid X = x)$, so our target becomes
  $p(x) = E(Y \mid X = x) = P(Y = 1 \mid X = x)$
  so by accurately estimating the regression function we would be able to mimic the Bayes
classifier.
• We can think of our intermediate goal as one of constructing good estimators of the unknown
true conditional probability function p(x).³
• Once we have an estimator $\hat{p}(x)$, we can classify an observation by predicting default = Yes if
$\hat{p}(x) > \alpha$ for some choice of $\alpha$
  ◦ We’ll see later why we may want to use $\alpha \neq 0.5$
³ While there exist classification methods that do not proceed by first estimating p(x), most methods do.
Is there an “ideal” classifier? Part II
• We just argued that we want to construct $\hat{p}(x)$ that are good estimators of
the true conditional probability function
  $p(x) = E(Y \mid X = x) = P(Y = 1 \mid X = x)$
• If we could do this, we could use $\hat{p}(x)$ to classify inputs according to the rule:
  $\hat{Y} = \begin{cases} 1 & \text{if } \hat{p}(x) > \alpha \\ 0 & \text{if } \hat{p}(x) \le \alpha \end{cases}$
  $\alpha = 0.5$ is a popular choice, and is what the Bayes classifier uses.
• Now... suppose we define $\tilde{p}(x) = \frac{1}{10}\hat{p}(x)$
• The rule $\hat{p}(x) > \alpha$ is the same as $\tilde{p}(x) > \frac{\alpha}{10}$, so both $\hat{p}(x)$ and $\tilde{p}(x)$ classify the
same way
• Thus, we can have estimators $\hat{p}(x)$ that result in good classification rules, but which
aren’t actually good estimators of p(x) (see the sketch below)
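To see the last point concretely, here is a minimal Python sketch (with made-up probabilities) showing that an estimator rescaled by 1/10 produces exactly the same labels once the threshold is rescaled too, even though it is a poor estimate of p(x):

```python
import numpy as np

# Made-up estimated probabilities for five observations
p_hat = np.array([0.02, 0.30, 0.55, 0.70, 0.95])
alpha = 0.5

y_hat = (p_hat > alpha).astype(int)            # rule: classify 1 when p_hat(x) > alpha

p_tilde = p_hat / 10                           # badly scaled "estimator" of p(x)
y_tilde = (p_tilde > alpha / 10).astype(int)   # same rule with threshold alpha / 10

print(y_hat)     # [0 0 1 1 1]
print(y_tilde)   # identical labels, despite p_tilde being a poor probability estimate
```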
The Default data set
Let’s look at a scatterplot of default (0 = No, 1 = Yes) vs. balance
This is really hard to make sense of. Since the outcome Y is binary, we don’t get a good
sense of the trends in the data from looking at scatterplots like this.
The Default data set
How about a histogram, where the bars are partitioned by default status?
That’s a little better. It clearly shows that most of the individuals in our data set did not
default. And it looks like the greater the average monthly balance an individual carries,
the more likely they are to default.
The Default data set
Here’s what’s called a conditional density plot. This plot is formed by first binning the x-axis
variable (balance), and then calculating the probability of default within each bin.
Now we can clearly see what P(default = Yes | balance) looks like
Does linear regression work?
What happens if we simply run linear regression here?
An issue with linear regression
• For balance < 500, the probability estimates are negative!
• Linear regression appears to do a poor job of estimating p(x)
• However, it’s actually going to perform OK as a classifier.
• Additionally, for problems with more than two classes, encoding the classes imposes an
order that has no meaning in general.
Logistic regression
Much better!
Linear regression vs. Logistic regression
[Figure: the linear regression fit and the logistic regression fit on the Default data, side by side]
• Logistic regression does a much better job of estimating the conditional probability
of default
• However, the Linear regression fit may still produce good classifications.
How does logistic regression work?
• Logistic regression is an example of a Generalized Linear Model (GLM)
• Remember we have a variable Y that’s 0 (if default = No) or 1 (if default = Yes)
• Instead of modeling
  $Y = \beta_0 + \beta_1 \cdot \text{balance} + \epsilon$
  logistic regression models:
  $\log\left(\dfrac{P(\text{default} = \text{Yes} \mid \text{balance})}{1 - P(\text{default} = \text{Yes} \mid \text{balance})}\right) = \beta_0 + \beta_1 \cdot \text{balance}$
• The quantity on the LHS of the equation is called the log-odds of default.
• This turns out to be the “right scale” on which to view things
What is logistic regression doing?
๐‘ƒ๐‘ƒ(default = Yes|balance
log
= ๐›ฝ๐›ฝ0 + ๐›ฝ๐›ฝ1 ๏ฟฝ balance + ๐œ–๐œ–
1 − ๐‘ƒ๐‘ƒ(default = Yes|balance)
• Logistic regression provides us with coefficient estimates ๐›ฝ๐›ฝฬ‚0 and ๐›ฝ๐›ฝฬ‚1
• By rearranging the equation ๐‘Ÿ๐‘Ÿ =
๐‘๐‘
log( )
1−๐‘๐‘
to get ๐‘๐‘ =
estimate of the conditional probability of default:
๐‘’๐‘’ ๐‘Ÿ๐‘Ÿ
(1+๐‘’๐‘’ ๐‘Ÿ๐‘Ÿ )
, we can get an
๐‘’๐‘’ ๐›ฝ๐›ฝ0+๐›ฝ๐›ฝ1๏ฟฝbalance
๐‘ƒ๐‘ƒ๏ฟฝ default = Yes balance =
1 + ๐‘’๐‘’๐›ฝ๐›ฝ๏ฟฝ0+๐›ฝ๐›ฝ๏ฟฝ1๏ฟฝbalance
๏ฟฝ
๏ฟฝ
• i.e., Logistic regression gives us a way of estimating the probability that an
individual will default given their average monthly balance.
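As a quick illustration of this rearrangement, here is a minimal Python sketch; the coefficient values are placeholders chosen for illustration, not estimates produced in this lecture:

```python
import numpy as np

# Placeholder coefficient estimates (illustrative only)
beta_0, beta_1 = -10.65, 0.0055

balance = np.array([500.0, 1000.0, 2000.0])
r = beta_0 + beta_1 * balance          # estimated log-odds of default
p_hat = np.exp(r) / (1 + np.exp(r))    # p = e^r / (1 + e^r)
print(p_hat)                           # estimated P(default = Yes | balance)
```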
Implementing Logistic Regression
• On busy nights, it’s common for noisy customers to hang out in
front of the pizzeria, until the neighbors eventually call the police.
• Roberto suspects that the same input variables that impact the
number of pizzas sold also affect the number of loud customers by
the entrance
• He wants to know in advance whether a police call is likely to
happen
• What type of machine learning task are we going to do to help
him? Why?
Let’s help Roberto (Lab Session 05)
Goal: Build a program on top of our previous implementation that looks
at historical data, grasps the relation between the predictors and the
likelihood of a police call, and uses it to classify tonight’s outcome: call or
not call.
Let’s do it!
• https://classroom.github.com/a/KoLlsV_a
Another problem using linear regression
Sigmoid function
• We want $0 \le y(x) \le 1$
  $y(x) = \beta^{T} x$ → for linear regression
  $y(x) = \sigma(\beta^{T} x)$ → for logistic regression
• $\sigma$ acts as a wrapper function that transforms its input into a value between 0 and 1, as
described in the figure.
• When we learn about neural networks, we will call such wrapper functions activation functions.
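A minimal NumPy sketch of the sigmoid wrapper described above, just to show the squashing behaviour from the figure:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# ≈ [0.0000454, 0.2689, 0.5, 0.7311, 0.99995]
```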
Training set of n examples:
$\left\{ (x^{(1)}, y^{(1)}),\ (x^{(2)}, y^{(2)}),\ \ldots,\ (x^{(n)}, y^{(n)}) \right\}, \qquad x_0 = 1,\ \ y \in \{0, 1\}$

$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix}, \qquad y(x) = \frac{1}{1 + e^{-\beta^{T} x}}$

How to choose the parameters $\beta$?
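Before turning to that question, here is a minimal sketch of the hypothesis above, assuming the n examples are stacked into a matrix X whose first column is the constant x0 = 1 (the name predict_proba is just an illustrative choice):

```python
import numpy as np

def predict_proba(X, beta):
    """y(x) = 1 / (1 + exp(-beta^T x)) for each row of X; X already includes the x0 = 1 column."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

# Tiny made-up training set: bias column plus two predictors
X = np.array([[1.0, 0.5, 1.2],
              [1.0, 2.0, 0.3],
              [1.0, 3.5, 2.1]])
beta = np.zeros(X.shape[1])      # an initial guess for the parameters
print(predict_proba(X, beta))    # all 0.5 before any training
```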
Logistic regression cost/loss function
๐‘๐‘ฬ‚ ๐‘ฅ๐‘ฅ
if ๐‘ฆ๐‘ฆ = 1
− log ๐‘ฆ๐‘ฆ๐›ฝ๐›ฝ ๐‘ฅ๐‘ฅ
๐ฟ๐ฟ ๐‘ฆ๐‘ฆ๐›ฝ๐›ฝ ๐‘ฅ๐‘ฅ , ๐‘ฆ๐‘ฆ = ๏ฟฝ
− log 1 − ๐‘ฆ๐‘ฆ๐›ฝ๐›ฝ ๐‘ฅ๐‘ฅ
if ๐‘ฆ๐‘ฆ = 1
if ๐‘ฆ๐‘ฆ = 0
Log-loss is indicative of how close
the prediction probability is to the
corresponding actual/true value (0
or 1 in case of binary classification).
if ๐‘ฆ๐‘ฆ = 0
The more the predicted probability
diverges from the actual value, the
higher is the log-loss value.
Higher cost ↔ Lower cost
Machine Learning Programming, KHU
Lower cost ↔ Higher cost
25
Log-loss function
$L\left(y_\beta(x),\, y\right) = \begin{cases} -\log\left(y_\beta(x)\right) & \text{if } y = 1 \\ -\log\left(1 - y_\beta(x)\right) & \text{if } y = 0 \end{cases}$

Averaged over the n training examples, the two cases combine into a single expression:

$L(\beta) = -\frac{1}{n}\sum_{i=1}^{n}\left[\, y^{(i)} \log\left(y_\beta(x^{(i)})\right) + \left(1 - y^{(i)}\right)\log\left(1 - y_\beta(x^{(i)})\right)\right]$

When $y^{(i)} = 1$ the second term inside the sum is 0, and when $y^{(i)} = 0$ the first term is 0, so the combined formula reproduces the piecewise definition above.
Log-loss function – Gradient Descent
• The gradient of the log-loss is:
  $\frac{\partial L(\beta)}{\partial \beta_j} = \frac{1}{n}\sum_{i=1}^{n} x_j^{(i)}\left( y_\beta(x^{(i)}) - y^{(i)} \right)$

• Gradient descent:
  Repeat until convergence {
    $\beta_j = \beta_j - \alpha \, \dfrac{\partial}{\partial \beta_j} L(\beta)$   (simultaneously update all $\beta_j$)
  }

• Substituting the gradient into the update:
  Repeat until convergence {
    $\beta_j = \beta_j - \alpha \, \dfrac{1}{n}\sum_{i=1}^{n} x_j^{(i)}\left( y_\beta(x^{(i)}) - y^{(i)} \right)$   (simultaneously update all $\beta_j$)
  }
Full explanation
https://medium.com/analytics-vidhya/derivative-of-logloss-function-for-logistic-regression-9b832f025c2d
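A minimal NumPy sketch of this update rule under the assumptions above (batch updates, a fixed iteration count in place of a real convergence check, and made-up data):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, iterations=1000):
    """Gradient descent on the log-loss; X already includes the x0 = 1 bias column."""
    beta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(iterations):
        y_hat = sigmoid(X @ beta)              # y_beta(x^(i)) for every example
        gradient = X.T @ (y_hat - y) / n       # (1/n) * sum_i x_j^(i) (y_hat^(i) - y^(i))
        beta = beta - alpha * gradient         # simultaneous update of all beta_j
    return beta

# Tiny made-up data set: the label is 1 when the second feature is large
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0, 0, 1, 1])
beta = train(X, y, alpha=0.5, iterations=5000)
print(beta)
print(sigmoid(X @ beta))   # probabilities should come out low, low, high, high
```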
Acknowledgements
Some of the lectures notes for this class feature content borrowed with
or without modification from the following sources:
• 95-791 Data Mining, Carnegie Mellon University, lecture notes (Prof.
Alexandra Chouldechova)
• An Introduction to Statistical Learning, with applications in R (Springer, 2013)
with permission from the authors: G. James, D. Witten, T. Hastie and R.
Tibshirani
• Machine learning online course from Andrew Ng