
Session 5

Machine Learning Programming
BDA712-00
Logistic Regression I:
Classification with Logistic Regression
Lecturer: Josué Obregón PhD
Kyung Hee University
Department of Big Data Analytics
October 5, 2022
Previously, in our course…
• Walking the gradient (Gradient descent algorithm)
• A discerning machine (From regression to classification)
• Your first learning program (Building a tiny supervised learning program)
• Hyperspace! (Multiple linear regression)
• Getting real (Recognize a single digit using MNIST)
And today…
• A discerning machine (From regression to classification)
Today's agenda
• What is Classification?
• Logistic Regression
• Implementation of Logistic Regression
• Intuition behind the loss function
• Transforming linear regression to logistic regression
Some motivation: Default data
• default: Has the customer defaulted on their debt? (values: Yes, No)
• student: Is the customer a student? (values: Yes, No)
• balance: Average credit card balance remaining after the monthly payment
• income: Income of the customer
Goal: Use student, balance and income information to predict
whether a customer is likely to default on their debts.
Classification
While introductory classes in statistics tend to focus on Regression, many real-world
tasks are Classification problems.
• An online banking service needs to determine whether a particular transaction is
fraudulent or not.
• A firm running a phone marketing campaign needs to identify which individuals would
be most likely to adopt their product if contacted.
• A law firm has obtained all the incoming and outgoing emails from a large corporation
accused of discriminatory hiring and promotion practices. They must identify which
documents are responsive (relevant to the case) and which are irrelevant.
• A judge deciding whether to release a defendant on bail may want to know the
likelihood that the defendant will show up for their trial.
• A client using satellite imaging wants to classify patches of images according to
whether they’re water, rural, or urban.
What is the classification task?
• For the purpose of this discussion, we’ll assume that we’re doing
binary classification: Our response Y has two possible values {0, 1}
• E.g., in the Default classification task,
  $Y = \begin{cases} 0 & \text{if No} \\ 1 & \text{if Yes} \end{cases}$
• Our goal is to predict Y using the input variables student, balance
and income
Is there an “ideal” classifier?
• When we talked about Regression, our goal was to minimize the Mean-Squared-Error, and
we saw that the best we could ever do at modeling an outcome Y from inputs $X = (X_1, X_2, \ldots, X_p)$ is to use the regression function
  $f(x) = E(Y \mid X = x)$
• In Classification, our goal is now to minimize the error rate:
  $\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$
  (a short computational sketch follows this list)
• It turns out that the ideal classifier assigns a test observation with predictor values $x_0$ to
the class $j$ that has the largest value of
  $P(Y = j \mid X = x_0)$
  i.e., the class that has the highest probability (is the most likely) given the observed
predictor vector $x_0$.
• This classification rule is called the Bayes classifier
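The error rate above is straightforward to compute; here is a minimal Python sketch with made-up labels (the arrays are purely illustrative):

```python
import numpy as np

y     = np.array([0, 1, 1, 0, 1])   # true labels
y_hat = np.array([0, 1, 0, 0, 0])   # predicted labels

error_rate = np.mean(y != y_hat)    # (1/n) * sum_i I(y_i != y_hat_i)
print(error_rate)                   # 0.4
```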
Ideal regressor vs. Ideal classifier
• The Bayes classifier in classification plays the same role as the regression function in prediction:
they are the best we could ever hope to do, but they are unknown functions that we seek to estimate
from the data
• It is interesting to note that when Y is binary, $E(Y \mid X = x) = P(Y = 1 \mid X = x)$, so our target becomes
  $p(x) = E(Y \mid X = x) = P(Y = 1 \mid X = x)$
  so by accurately estimating the regression function we would be able to mimic the Bayes
classifier.
• We can think of our intermediate goal as one of constructing good estimators of the unknown
true conditional probability function p(x).³
• Once we have an estimator $\hat{p}(x)$, we can classify an observation by predicting default = Yes if
$\hat{p}(x) > \alpha$ for some choice of $\alpha$
  ◦ We’ll see later why we may want to use $\alpha \neq 0.5$
³ While there exist classification methods that do not proceed by first estimating p(x), most methods do.
Is there an “ideal” classifier? Part II
• We just argued that we want to construct $\hat{p}(x)$ that are good estimators of
the true conditional probability function
  $p(x) = E(Y \mid X = x) = P(Y = 1 \mid X = x)$
• If we could do this, we could use $\hat{p}(x)$ to classify inputs according to the rule:
  $\hat{Y} = \begin{cases} 1 & \text{if } \hat{p}(x) > \alpha \\ 0 & \text{if } \hat{p}(x) \le \alpha \end{cases}$
  $\alpha = 0.5$ is a popular choice, and is what the Bayes classifier uses.
• Now... suppose we define $\tilde{p}(x) = \frac{1}{10}\hat{p}(x)$
• The rule $\hat{p}(x) > \alpha$ is the same as $\tilde{p}(x) > \frac{\alpha}{10}$, so both $\hat{p}(x)$ and $\tilde{p}(x)$ classify the
same way
• Thus, we can have estimators $\hat{p}(x)$ that result in good classification rules, but which
aren’t actually good estimators of p(x) (see the sketch below)
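To see the last point concretely, here is a minimal Python sketch (with made-up probabilities) showing that an estimator rescaled by 1/10 produces exactly the same labels once the threshold is rescaled too, even though it is a poor estimate of p(x):

```python
import numpy as np

# Made-up estimated probabilities for five observations
p_hat = np.array([0.02, 0.30, 0.55, 0.70, 0.95])
alpha = 0.5

y_hat = (p_hat > alpha).astype(int)            # rule: classify 1 when p_hat(x) > alpha

p_tilde = p_hat / 10                           # badly scaled "estimator" of p(x)
y_tilde = (p_tilde > alpha / 10).astype(int)   # same rule with threshold alpha / 10

print(y_hat)     # [0 0 1 1 1]
print(y_tilde)   # identical labels, despite p_tilde being a poor probability estimate
```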
The Default data set
Let’s look at a scatterplot of default (0 = No, 1 = Yes) vs. balance
This is really hard to make sense of. Since the outcome Y is binary, we don’t get a good
sense of the trends in the data from looking at scatterplots like this.
The Default data set
How about a histogram, where the bars are partitioned by default status?
That’s a little better. It clearly shows that most of the individuals in our data set did not
default. And it looks like the greater the average monthly balance an individual carries,
the more likely they are to default.
The Default data set
Here’s what’s called a conditional density plot. This plot is formed by first binning the x-axis
variable (balance), and then calculating the probability of default within each bin.
Now we can clearly see what P(default = Yes | balance) looks like
Does linear regression work?
What happens if we simply run linear regression here?
An issue with linear regression
• For balance < 500, the probability estimates are negative!
• Linear regression appears to do a poor job of estimating p(x)
• However, it’s actually going to perform OK as a classifier.
• Additionally, for problems with more than two classes, encoding the classes imposes an
order that has no meaning in general.
Logistic regression
Much better!
Linear regression vs. Logistic regression
[Figure: the linear regression fit and the logistic regression fit on the Default data, side by side]
• Logistic regression does a much better job of estimating the conditional probability
of default
• However, the Linear regression fit may still produce good classifications.
How does logistic regression work?
• Logistic regression is an example of a Generalized Linear Model (GLM)
• Remember we have a variable Y that’s 0 (if default = No) or 1 (if default = Yes)
• Instead of modeling
  $Y = \beta_0 + \beta_1 \cdot \text{balance} + \epsilon$
  logistic regression models:
  $\log\left(\dfrac{P(\text{default} = \text{Yes} \mid \text{balance})}{1 - P(\text{default} = \text{Yes} \mid \text{balance})}\right) = \beta_0 + \beta_1 \cdot \text{balance}$
• The quantity on the LHS of the equation is called the log-odds of default.
• This turns out to be the “right scale” on which to view things
What is logistic regression doing?
๐‘ƒ๐‘ƒ(default = Yes|balance
log
= ๐›ฝ๐›ฝ0 + ๐›ฝ๐›ฝ1 ๏ฟฝ balance + ๐œ–๐œ–
1 − ๐‘ƒ๐‘ƒ(default = Yes|balance)
• Logistic regression provides us with coefficient estimates ๐›ฝ๐›ฝฬ‚0 and ๐›ฝ๐›ฝฬ‚1
• By rearranging the equation ๐‘Ÿ๐‘Ÿ =
๐‘๐‘
log( )
1−๐‘๐‘
to get ๐‘๐‘ =
estimate of the conditional probability of default:
๐‘’๐‘’ ๐‘Ÿ๐‘Ÿ
(1+๐‘’๐‘’ ๐‘Ÿ๐‘Ÿ )
, we can get an
๐‘’๐‘’ ๐›ฝ๐›ฝ0+๐›ฝ๐›ฝ1๏ฟฝbalance
๐‘ƒ๐‘ƒ๏ฟฝ default = Yes balance =
1 + ๐‘’๐‘’๐›ฝ๐›ฝ๏ฟฝ0+๐›ฝ๐›ฝ๏ฟฝ1๏ฟฝbalance
๏ฟฝ
๏ฟฝ
• i.e., Logistic regression gives us a way of estimating the probability that an
individual will default given their average monthly balance.
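As a quick illustration of this rearrangement, here is a minimal Python sketch; the coefficient values are placeholders chosen for illustration, not estimates produced in this lecture:

```python
import numpy as np

# Placeholder coefficient estimates (illustrative only)
beta_0, beta_1 = -10.65, 0.0055

balance = np.array([500.0, 1000.0, 2000.0])
r = beta_0 + beta_1 * balance          # estimated log-odds of default
p_hat = np.exp(r) / (1 + np.exp(r))    # p = e^r / (1 + e^r)
print(p_hat)                           # estimated P(default = Yes | balance)
```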
Implementing Logistic Regression
• On busy nights, it’s common for noisy customers to hang out in
front of the pizzeria, until the neighbors eventually call the police.
• Roberto suspects that the same input variables that impact the
number of pizzas sold also affect the number of loud customers by
the entrance
• He wants to know in advance whether a police call is likely to
happen
• What type of machine learning task are we going to do to help
him? Why?
Let’s help Roberto (Lab Session 05)
Goal: Build a program on top of our previous implementation that looks
at historical data, grasps the relation between the predictors and the
likelihood of a police call, and uses it to classify tonight’s outcome: call or
not call.
Let’s do it!
• https://classroom.github.com/a/KoLlsV_a
Another problem using linear regression
Sigmoid function
• We want $0 \le y(x) \le 1$
  $y(x) = \beta^{T} x$ → for linear regression
  $y(x) = \sigma(\beta^{T} x)$ → for logistic regression
• $\sigma$ acts as a wrapper function that transforms its input into a value between 0 and 1, as
described in the figure.
• When we learn about neural networks, we will call such wrapper functions activation functions.
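A minimal NumPy sketch of the sigmoid wrapper described above, just to show the squashing behaviour from the figure:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# ≈ [0.0000454, 0.2689, 0.5, 0.7311, 0.99995]
```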
Training set of n examples:
$\left\{ (x^{(1)}, y^{(1)}),\ (x^{(2)}, y^{(2)}),\ \ldots,\ (x^{(n)}, y^{(n)}) \right\}, \qquad x_0 = 1,\ \ y \in \{0, 1\}$

$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix}, \qquad y(x) = \frac{1}{1 + e^{-\beta^{T} x}}$

How to choose the parameters $\beta$?
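Before turning to that question, here is a minimal sketch of the hypothesis above, assuming the n examples are stacked into a matrix X whose first column is the constant x0 = 1 (the name predict_proba is just an illustrative choice):

```python
import numpy as np

def predict_proba(X, beta):
    """y(x) = 1 / (1 + exp(-beta^T x)) for each row of X; X already includes the x0 = 1 column."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

# Tiny made-up training set: bias column plus two predictors
X = np.array([[1.0, 0.5, 1.2],
              [1.0, 2.0, 0.3],
              [1.0, 3.5, 2.1]])
beta = np.zeros(X.shape[1])      # an initial guess for the parameters
print(predict_proba(X, beta))    # all 0.5 before any training
```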
Logistic regression cost/loss function
๐‘๐‘ฬ‚ ๐‘ฅ๐‘ฅ
if ๐‘ฆ๐‘ฆ = 1
− log ๐‘ฆ๐‘ฆ๐›ฝ๐›ฝ ๐‘ฅ๐‘ฅ
๐ฟ๐ฟ ๐‘ฆ๐‘ฆ๐›ฝ๐›ฝ ๐‘ฅ๐‘ฅ , ๐‘ฆ๐‘ฆ = ๏ฟฝ
− log 1 − ๐‘ฆ๐‘ฆ๐›ฝ๐›ฝ ๐‘ฅ๐‘ฅ
if ๐‘ฆ๐‘ฆ = 1
if ๐‘ฆ๐‘ฆ = 0
Log-loss is indicative of how close
the prediction probability is to the
corresponding actual/true value (0
or 1 in case of binary classification).
if ๐‘ฆ๐‘ฆ = 0
The more the predicted probability
diverges from the actual value, the
higher is the log-loss value.
Higher cost ↔ Lower cost
Machine Learning Programming, KHU
Lower cost ↔ Higher cost
25
Log-loss function
$L\left(y_\beta(x),\, y\right) = \begin{cases} -\log\left(y_\beta(x)\right) & \text{if } y = 1 \\ -\log\left(1 - y_\beta(x)\right) & \text{if } y = 0 \end{cases}$

Averaged over the n training examples, the two cases combine into a single expression:

$L(\beta) = -\frac{1}{n}\sum_{i=1}^{n}\left[\, y^{(i)} \log\left(y_\beta(x^{(i)})\right) + \left(1 - y^{(i)}\right)\log\left(1 - y_\beta(x^{(i)})\right)\right]$

When $y^{(i)} = 1$ the second term inside the sum is 0, and when $y^{(i)} = 0$ the first term is 0, so the combined formula reproduces the piecewise definition above.
Log-loss function – Gradient Descent
• The gradient of the log-loss is:
  $\frac{\partial L(\beta)}{\partial \beta_j} = \frac{1}{n}\sum_{i=1}^{n} x_j^{(i)}\left( y_\beta(x^{(i)}) - y^{(i)} \right)$

• Gradient descent:
  Repeat until convergence {
    $\beta_j = \beta_j - \alpha \, \dfrac{\partial}{\partial \beta_j} L(\beta)$   (simultaneously update all $\beta_j$)
  }

• Substituting the gradient into the update:
  Repeat until convergence {
    $\beta_j = \beta_j - \alpha \, \dfrac{1}{n}\sum_{i=1}^{n} x_j^{(i)}\left( y_\beta(x^{(i)}) - y^{(i)} \right)$   (simultaneously update all $\beta_j$)
  }
Full explanation
https://medium.com/analytics-vidhya/derivative-of-logloss-function-for-logistic-regression-9b832f025c2d
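A minimal NumPy sketch of this update rule under the assumptions above (batch updates, a fixed iteration count in place of a real convergence check, and made-up data):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, iterations=1000):
    """Gradient descent on the log-loss; X already includes the x0 = 1 bias column."""
    beta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(iterations):
        y_hat = sigmoid(X @ beta)              # y_beta(x^(i)) for every example
        gradient = X.T @ (y_hat - y) / n       # (1/n) * sum_i x_j^(i) (y_hat^(i) - y^(i))
        beta = beta - alpha * gradient         # simultaneous update of all beta_j
    return beta

# Tiny made-up data set: the label is 1 when the second feature is large
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0, 0, 1, 1])
beta = train(X, y, alpha=0.5, iterations=5000)
print(beta)
print(sigmoid(X @ beta))   # probabilities should come out low, low, high, high
```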
Acknowledgements
Some of the lectures notes for this class feature content borrowed with
or without modification from the following sources:
• 95-791 Data Mining, Carnegie Mellon University, lecture notes (Prof.
Alexandra Chouldechova)
• An Introduction to Statistical Learning, with applications in R (Springer, 2013)
with permission from the authors: G. James, D. Witten, T. Hastie and R.
Tibshirani
• Machine learning online course from Andrew Ng