Machine Learning Programming BDA712-00
Logistic Regression I: Classification with Logistic Regression
Lecturer: Josué Obregón PhD
Kyung Hee University, Department of Big Data Analytics
October 5, 2022

Previously, in our course…
• Your first learning program: Building a tiny supervised learning program
• Walking the gradient: Gradient descent algorithm
• Hyperspace!: Multiple linear regression
• A discerning machine: From regression to classification
• Getting real: Recognize a single digit using MNIST

And today…
• A discerning machine: From regression to classification

Today's agenda
• What is Classification?
• Logistic Regression
• Implementation of Logistic Regression
• Intuition behind the loss function
• Transforming linear regression to logistic regression

Some motivation: Default data
• default: Has the customer defaulted on their debt? (values: Yes, No)
• student: Is the customer a student? (values: Yes, No)
• balance: Average credit card balance remaining after the monthly payment
• income: Income of the customer
Goal: Use the student, balance and income information to predict whether a customer is likely to default on their debts.

Classification
While introductory classes in statistics tend to focus on Regression, many real-world tasks are Classification problems.
• An online banking service needs to determine whether a particular transaction is fraudulent or not.
• A firm running a phone marketing campaign needs to identify which individuals would be most likely to adopt its product if contacted.
• A law firm has obtained all the incoming and outgoing emails from a large corporation accused of discriminatory hiring and promotion practices. It must identify which documents are responsive (relevant to the case) and which are irrelevant.
• A judge deciding whether to release a defendant on bail may want to know the likelihood that the defendant will show up for their trial.
• A client using satellite imaging wants to classify patches of images according to whether they are water, rural, or urban.

What is the classification task?
• For the purpose of this discussion, we'll assume that we're doing binary classification: our response Y has two possible values, {0, 1}.
• E.g., in the Default classification task, Y = 0 if No and Y = 1 if Yes.
• Our goal is to predict Y using the input variables student, balance and income.

Is there an "ideal" classifier?
• When we talked about Regression, our goal was to minimize the Mean-Squared-Error, and we saw that the best we could ever do at modeling an outcome Y from inputs X = (X_1, X_2, ..., X_p) is to use the regression function f(x) = E(Y | X = x).
• In Classification, our goal is now to minimize the error rate:
\[ \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i) \]
• It turns out that the ideal classifier assigns a test observation with predictor values x_0 to the class j that has the largest value of P(Y = j | X = x_0), i.e., the class that has the highest probability (is the most likely) given the observed predictor vector x_0.
• This classification rule is called the Bayes classifier.
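To make the error rate and the thresholding rule concrete, here is a minimal sketch in Python (NumPy); the probability values, labels, and the 0.5 threshold are illustrative assumptions, not data from the slides:

import numpy as np

# Illustrative estimated probabilities P(Y = 1 | X = x_i) and true labels.
p_hat = np.array([0.9, 0.2, 0.6, 0.4, 0.8])
y_true = np.array([1, 0, 0, 1, 1])

# Assign each observation to its most likely class: threshold at 0.5,
# which is what the Bayes classifier uses in the binary case.
y_pred = (p_hat > 0.5).astype(int)

# Error rate: (1/n) * sum of I(y_i != y_hat_i).
error_rate = np.mean(y_true != y_pred)
print(error_rate)  # 0.4 here: two of the five observations are misclassified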
Ideal regressor vs. ideal classifier
• The Bayes classifier in classification plays the same role as the regression function in prediction: they are the best we could ever hope to do, but they are unknown functions that we seek to estimate from the data.
• It is interesting to note that when Y is binary, E(Y | X = x) = P(Y = 1 | X = x), so our target becomes p(x) = E(Y | X = x) = P(Y = 1 | X = x). By accurately estimating the regression function, we would thus be able to mimic the Bayes classifier.
• We can think of our intermediate goal as one of constructing good estimators of the unknown true conditional probability function p(x).*
• Once we have an estimator p̂(x), we can classify an observation by predicting default = Yes if p̂(x) > α for some choice of α. We'll see later why we may want to use α ≠ 0.5.
* While there exist classification methods that do not proceed by first estimating p(x), most methods do.

Is there an "ideal" classifier? Part II
• We just argued that we want to construct p̂(x) that are good estimators of the true conditional probability function p(x) = E(Y | X = x) = P(Y = 1 | X = x).
• If we could do this, we could use p̂(x) to classify inputs according to the rule:
\[ \hat{Y} = \begin{cases} 1 & \text{if } \hat{p}(x) > \alpha \\ 0 & \text{if } \hat{p}(x) \le \alpha \end{cases} \]
α = 0.5 is a popular choice, and is what the Bayes classifier uses.
• Now... suppose we define p̃(x) = (1/10) p̂(x). The rule p̂(x) > α is the same as p̃(x) > α/10, so both p̂(x) and p̃(x) classify the same way.
• Thus, we can have estimators p̂(x) that result in good classification rules, but which aren't actually good estimators of p(x).

The Default data set
Let's look at a scatterplot of default (0 = No, 1 = Yes) vs. balance. This is really hard to make sense of: since the outcome Y is binary, we don't get a good sense of the trends in the data from looking at scatterplots like this.

The Default data set
How about a histogram, where the bars are partitioned by default status? That's a little better. It clearly shows that most of the individuals in our data set did not default, and it looks like the greater the average monthly balance an individual carries, the more likely they are to default.

The Default data set
Here's what's called a conditional density plot. This plot is formed by first binning the x-axis variable (balance) and then calculating the probability of default within each bin. Now we can clearly see what P(default = Yes | balance) looks like.

Does linear regression work?
What happens if we simply run linear regression here?

An issue with linear regression
• For balance < 500, the probability estimates are negative!
• Linear regression appears to do a poor job of estimating p(x).
• However, it's actually going to perform OK as a classifier.
• Additionally, for problems with more than two classes, encoding the classes imposes an ordering that has no meaning in general.

Logistic regression
Much better!

Linear regression vs. Logistic regression
(Figure: side-by-side fits of linear regression and logistic regression to the Default data.)
• Logistic regression does a much better job of estimating the conditional probability of default.
• However, the linear regression fit may still produce good classifications.
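To see this contrast in code, here is a minimal sketch, assuming scikit-learn and a synthetic stand-in for the Default data (the variable names, data-generating process, and query point are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the Default data: default probability rises with balance.
balance = rng.uniform(0, 2500, size=(1000, 1))
p_true = 1 / (1 + np.exp(-(balance[:, 0] - 1500) / 150))
default = (rng.uniform(size=1000) < p_true).astype(int)

# Linear regression: its "probabilities" can fall below 0 or above 1.
lin = LinearRegression().fit(balance, default)
print(lin.predict([[100.0]]))  # for a low balance, this may well be negative

# Logistic regression: predictions are always valid probabilities in (0, 1).
log_reg = LogisticRegression(max_iter=1000).fit(balance, default)
print(log_reg.predict_proba([[100.0]])[:, 1])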
How does logistic regression work?
• Logistic regression is an example of a Generalized Linear Model (GLM).
• Remember we have a variable Y that's 0 (if default = No) or 1 (if default = Yes).
• Instead of modeling Y = β_0 + β_1 · balance + ε, logistic regression models
\[ \log \frac{p(\text{default} = \text{Yes} \mid \text{balance})}{1 - p(\text{default} = \text{Yes} \mid \text{balance})} = \beta_0 + \beta_1 \cdot \text{balance} \]
• The quantity on the LHS of the equation is called the log-odds of default.
• This turns out to be the "right scale" on which to view things.

What is logistic regression doing?
• Logistic regression provides us with coefficient estimates β̂_0 and β̂_1.
• By rearranging η = log(p / (1 − p)) to get p = e^η / (1 + e^η), we can get an estimate of the conditional probability of default:
\[ \hat{p}(\text{default} = \text{Yes} \mid \text{balance}) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 \cdot \text{balance}}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 \cdot \text{balance}}} \]
• I.e., logistic regression gives us a way of estimating the probability that an individual will default given their average monthly balance.

Implementing Logistic Regression
• On busy nights, it's common for noisy customers to hang out in front of the pizzeria until the neighbors eventually call the police.
• Roberto suspects that the same input variables that impact the number of pizzas sold also affect the number of loud customers by the entrance.
• He wants to know in advance whether a police call is likely to happen.
• What type of machine learning task are we going to do to help him? Why?

Let's help Roberto (Lab Session 05)
Goal: Build a program on top of our previous implementation that looks at historical data, grasps the relation between the predictors and the likelihood of a police call, and uses it to classify tonight's outcome: call or no call.
Let's do it!
• https://classroom.github.com/a/KoLlsV_a

Another problem using linear regression

Sigmoid function
• We want 0 ≤ y(x) ≤ 1.
• y(x) = β^T x for linear regression; for logistic regression we instead use y(x) = g(β^T x).
• g acts as a wrapper function that transforms its input into a value between 0 and 1, as described in the figure. When we learn about neural networks, we will call such functions activation functions.

Training set: n examples
The hypothesis is
\[ y(x) = \frac{1}{1 + e^{-\beta^{T} x}} \]
with training set \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}, where
\[ x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix}, \qquad x_0 = 1, \qquad y \in \{0, 1\} \]
How do we choose the parameters β?
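Before turning to the loss, here is a minimal sketch of this hypothesis in Python (NumPy); the toy inputs and the β values are illustrative assumptions:

import numpy as np

def sigmoid(z):
    # Squashes any real-valued input into the interval (0, 1).
    return 1 / (1 + np.exp(-z))

def predict(X, beta):
    # y(x) = sigmoid(beta^T x), where each row of X is [x0, x1, ..., xp]
    # and the leading column x0 = 1 plays the role of the intercept.
    return sigmoid(X @ beta)

# Toy check with two examples and a made-up beta.
X = np.array([[1.0, 0.5],
              [1.0, 2.0]])
beta = np.array([-1.0, 1.0])
print(predict(X, beta))  # both outputs are strictly between 0 and 1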
Logistic regression cost/loss function
\[ L(y_\beta(x), y) = \begin{cases} -\log y_\beta(x) & \text{if } y = 1 \\ -\log\left(1 - y_\beta(x)\right) & \text{if } y = 0 \end{cases} \]
Log-loss indicates how close the predicted probability is to the corresponding actual/true value (0 or 1 in the case of binary classification). The more the predicted probability diverges from the actual value, the higher the log-loss.
(Figure: the curves −log(y_β(x)) and −log(1 − y_β(x)); the cost is low when the prediction agrees with the label and grows without bound as the prediction diverges from it.)

Log-loss function
The two cases combine into a single average over the n training examples:
\[ L(y_\beta(x), y) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log y_\beta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - y_\beta(x^{(i)})\right) \right] \]
When y^{(i)} = 1 the second term is zero, and when y^{(i)} = 0 the first term is zero, so this expression reproduces the case-by-case definition above.

Log-loss function – Gradient Descent
• Repeat until convergence:
\[ \beta_j := \beta_j - \alpha \frac{\partial L(\beta)}{\partial \beta_j} \]
(simultaneously update all β_j)
• The gradient of the log-loss is:
\[ \frac{\partial L(\beta)}{\partial \beta_j} = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)} \left( y_\beta(x^{(i)}) - y^{(i)} \right) \]
• Substituting the gradient, the update becomes: repeat until convergence
\[ \beta_j := \beta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)} \left( y_\beta(x^{(i)}) - y^{(i)} \right) \]
(simultaneously update all β_j; a runnable sketch of this loop appears at the end of these notes)
Full explanation: https://medium.com/analytics-vidhya/derivative-of-logloss-function-for-logistic-regression-9b832f025c2d

Acknowledgements
Some of the lecture notes for this class feature content borrowed, with or without modification, from the following sources:
• 95-791 Data Mining, Carnegie Mellon University, lecture notes (Prof. Alexandra Chouldechova)
• An Introduction to Statistical Learning, with Applications in R (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
• Machine learning online course by Andrew Ng
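As promised above, a minimal batch gradient descent sketch for the update rule (Python/NumPy; the toy data, learning rate, and iteration count are illustrative assumptions, not the lab's implementation):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, y, lr=0.1, iterations=5000):
    # Batch gradient descent on the log-loss; X carries a leading
    # column of ones (x0 = 1) so beta[0] is the intercept.
    beta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(iterations):
        # (1/n) * sum_i x_j^(i) * (y_beta(x^(i)) - y^(i)), computed for
        # every j at once, so all beta_j are updated simultaneously.
        gradient = X.T @ (sigmoid(X @ beta) - y) / n
        beta -= lr * gradient
    return beta

# Toy data: one predictor plus the intercept column.
X = np.column_stack([np.ones(6), np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])])
y = np.array([0, 0, 0, 1, 1, 1])
beta = train(X, y)
print(sigmoid(X @ beta).round(2))  # estimated P(y = 1) for each training example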