ECLT 5810 Brief Introduction to Logistic Regression

Review of Ordinary Least Squares (OLS) Regression

OLS is a "curve fitting on data points" procedure, achieved by minimizing the total squared distance between the curve and the data points. The model usually looks like

y = β0 + β1x1 + β2x2 + · · · + βnxn

Our analysis of such models usually covers:
- Whether the beta coefficients are significantly positive, negative, or different from a certain value, with estimation errors considered (done by t-statistics on the beta estimates).
- Whether the model has good explanatory power for the dependent variable, with estimation errors considered (done by the F-statistic on R^2 measures).
- The implications of the model: does y depend on x? To what extent? Are there any interaction effects? (Done by differentiation/differencing on the estimated model.)
- Prediction (as an interval) of the dependent variable given the independent variables.

Classical Assumptions for OLS

However, all of this analysis is done under the following assumptions.

A1 (Linear in parameters) y = β0 + β1x1 + β2x2 + · · · + error.

A2 (No perfect collinearity) No independent variable is constant or a perfect linear combination of the others.

A1 and A2 can be fulfilled by choosing a suitable form of equation.

A3 (Zero conditional mean of errors) E(error_t | X) = 0, t = 1, 2, · · · , # of data, where X = (x1, x2, · · · , xn) is the collection of all independent variables.

Under A1-A3 the OLS estimators are unbiased, i.e. E(estimated βj) = βj for all j.

A4 (Homoskedasticity of errors) Var(error_t | X) = σ^2 (i.e. independent of X), t = 1, 2, · · ·.

A5 (No serial correlation in errors) Corr(error_t, error_s | X) = 0 for t ≠ s.

Under A1-A5, the OLS estimators are the minimum-variance linear unbiased estimators conditional on X.

A6 (Normality of errors) The errors error_t are independently and identically distributed as N(0, σ^2).

Under A1-A6, the OLS estimators are normally distributed conditional on X.
The t-statistics on the parameters and the F-statistic on R^2 can then be used for statistical inference. A3-A6 are usually assumed to be true unless there is significant evidence or reason against them.

Early models for classification

As our main target in data mining is to make predictions, the dependent variable is usually nominal, ordinal, or binary in nature. Usually we use a binary y to represent this, i.e. y = 1 for yes and 0 for no.

An early model is the linear probability model, which regresses the binary y on the explanatory variables X. As y is binary, the predicted value usually falls roughly between 0 and 1, so people used this model to predict the probability of an event.

However, such a model violates A3, A4 and A6. Also, the predicted value can fall outside the range 0 to 1, which makes the model less useful.

The problem could be rectified by introducing a threshold such that when the predicted y is greater than the threshold, we classify y as 1. This becomes the simplest neural network model, which will be introduced later. However, what we obtain is then a decision rather than a probability, and a probability might be useful in some cases. Also, the relation between the probability and the explanatory variables becomes less clear.

Statisticians invented logistic regression to solve the problem.

Logistic Regression

The idea is to use a one-to-one mapping to map the probability from the range between 0 and 1 to all real numbers. Then there is no problem no matter what value the right-hand side takes.

Three common transformations/link functions (provided by SAS):
- Logit: ln(p/(1-p)) (we call this the log of the odds)
- Probit: the normal inverse of p (recall the normal table's mapping scheme)
- Complementary log-log: ln(-ln(1-p))

The choice of link function depends on your purpose rather than performance. They all perform about equally well, but their implications are a bit different.

However, as the model is no longer in linear form, ordinary least squares cannot be used.
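As a minimal sketch of the three link functions listed above, the following Python snippet evaluates each one at a few probabilities. The function names are hypothetical, and `statistics.NormalDist().inv_cdf` from the Python standard library is assumed here as a stand-in for the normal table lookup; this is not SAS output.

```python
import math
from statistics import NormalDist


def logit(p):
    """Log of the odds: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def probit(p):
    """Inverse of the standard normal cumulative distribution function."""
    return NormalDist().inv_cdf(p)


def cloglog(p):
    """Complementary log-log: ln(-ln(1 - p))."""
    return math.log(-math.log(1.0 - p))


def inv_logit(z):
    """Map any real number back to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Each link stretches the interval (0, 1) over the whole real line,
# so the values grow without bound as p approaches 0 or 1.
for prob in (0.001, 0.25, 0.5, 0.75, 0.999):
    print(f"p={prob:<6} logit={logit(prob):8.3f} "
          f"probit={probit(prob):8.3f} cloglog={cloglog(prob):8.3f}")
```

Logit and probit are symmetric around p = 0.5, while the complementary log-log is not, which is one way the implications of the links differ.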
Furthermore, if we put y directly into the transformation, we get positive or negative infinity, because y is exactly 0 or 1.

We use the Maximum Likelihood Estimator (MLE) method instead, in which we choose the beta coefficients that maximize the probability of observing the data we have.

MLE needs fewer assumptions than OLS, but much less inference can be made, especially for logistic regression.

Also, as both MLE and OLS use only one beta coefficient to describe the effect that an explanatory variable brings about, data scaling/normalization is particularly important.

Example on Logit

Assume we believe the relation between the probability p that an event is "yes" and the independent variable x can be described by the equation

ln(p(x)/(1-p(x))) = a + bx

Then

p(x) = exp(a+bx) / [1 + exp(a+bx)]

If we have 4 data points (Yes, x1), (No, x2), (Yes, x3), (No, x4) and assume they are mutually independent, then the probability that we see these 4 data points is the product

p(x1)[1-p(x2)]p(x3)[1-p(x4)]

and MLE tries to maximize this by choosing suitable a and b.

Reading the Report

Akaike's Information Criterion (AIC) and Schwarz's Bayesian Criterion (SBC) (compare to: F-test on adjusted R^2 for OLS)
- Both take a smaller value for a higher maximized likelihood, and a higher value if more explanatory variables are used (to penalize over-fitting).
- So a smaller value is preferred (though it is not the only consideration in choosing a model).

T-score (compare to: t-test on estimated betas for OLS)
It is the estimate divided by its standard error. We may treat it like the t-test in OLS and construct a confidence interval for the betas, but in practice this works only asymptotically. We just consider a large t-score as an indicator of a possibly significant effect, but no hypothesis testing can be done.

Wald's Chi-square (compare to: t-test for OLS)
We can treat an effect as significant if the tail probability is small enough (< 5%).
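The following Python sketch ties the logit example to the report quantities above. It rests on several assumptions that are not in the notes: the eight data points are hypothetical, the likelihood is maximized with Newton-Raphson steps (SAS may use a different optimizer), AIC and SBC are taken as -2 ln L + 2k and -2 ln L + k ln(n), and the Wald chi-square for a single coefficient is taken as the squared t-score with 1 degree of freedom.

```python
import math

# Hypothetical data (not from the notes): x values with 1 = "yes", 0 = "no".
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]


def p(a, b, x):
    """Logit model: p(x) = exp(a + bx) / (1 + exp(a + bx))."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def log_likelihood(a, b):
    """Log of the product of p(x) for each "yes" and 1 - p(x) for each "no"."""
    return sum(math.log(p(a, b, x) if y == 1 else 1.0 - p(a, b, x))
               for x, y in zip(xs, ys))


def score_and_information(a, b):
    """Gradient of the log-likelihood and the 2x2 information matrix."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        px = p(a, b, x)
        w = px * (1.0 - px)
        g0 += y - px
        g1 += (y - px) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return (g0, g1), (h00, h01, h11)


def fit(iterations=25):
    """Choose a and b to maximize the likelihood, using Newton-Raphson steps."""
    a = b = 0.0
    for _ in range(iterations):
        (g0, g1), (h00, h01, h11) = score_and_information(a, b)
        det = h00 * h11 - h01 * h01
        # Solve (information matrix) * step = gradient for the 2x2 case.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


a_hat, b_hat = fit()
_, (h00, h01, h11) = score_and_information(a_hat, b_hat)

# Standard error of b from the inverse of the information matrix.
se_b = math.sqrt(h00 / (h00 * h11 - h01 * h01))
t_score = b_hat / se_b
wald = t_score ** 2
tail_prob = math.erfc(math.sqrt(wald / 2.0))  # chi-square tail, 1 d.f.

k, n = 2, len(xs)  # two parameters (a and b), n observations
log_l = log_likelihood(a_hat, b_hat)
aic = -2.0 * log_l + 2.0 * k
sbc = -2.0 * log_l + k * math.log(n)

print(f"a={a_hat:.3f} b={b_hat:.3f} se(b)={se_b:.3f}")
print(f"t-score={t_score:.3f} Wald chi-square={wald:.3f} tail prob={tail_prob:.3f}")
print(f"log-likelihood={log_l:.3f} AIC={aic:.3f} SBC={sbc:.3f}")
```

With so few observations the asymptotic approximations behind the t-score and the Wald chi-square are weak, so the printed tail probability should be read only as a rough indicator.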
If we are using the model to predict the outcome rather than the probability of that outcome (the case when the criterion is set to minimize loss), the interpretation of the misclassification rate, profit and loss, ROC curve, and lift chart is similar to that for decision trees.

Some scholars suggest that the prediction interval for the probability P of the event given the independent variables be

P_estimated ± Z_(1-a/2) [P_estimated (1 - P_estimated) / #data]^(1/2)

with Z_(1-a/2) being the Z-score from the normal table and a being the significance level. But we do not have this in SAS.

The interpretation of the model form is similar to that for OLS, using techniques like differentiation and differencing. One common use is, for a Logit model of the form

f(x) = ln(P(x)/(1-P(x))) = a + bx, x being binary,

f(1) = a + b, f(0) = a
f(1) - f(0) = b ≈ ln(P(1)/P(0)) for small P(0), P(1)
P(1) ≈ exp(b) * P(0)

Hence P(1) is about exp(b) times as big as P(0). We can draw a conclusion like "Having something (x) done changes the probability to about exp(b) times that of not having it done."
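A short Python sketch can check how good the approximation P(1) ≈ exp(b) * P(0) is, and evaluate the suggested interval for P. The coefficient values and function names below are hypothetical, chosen only for illustration, and `statistics.NormalDist().inv_cdf` is assumed as a stand-in for the normal table.

```python
import math
from statistics import NormalDist


def p_logit(a, b, x):
    """P(x) implied by ln(P(x) / (1 - P(x))) = a + bx."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def odds(prob):
    """Odds of an event with probability prob."""
    return prob / (1.0 - prob)


def probability_ratio(a, b):
    """P(1) / P(0) for a binary x."""
    return p_logit(a, b, 1) / p_logit(a, b, 0)


def suggested_interval(p_est, n_data, alpha=0.05):
    """P_estimated +/- Z_(1-a/2) * sqrt(P_estimated * (1 - P_estimated) / #data)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(p_est * (1.0 - p_est) / n_data)
    return p_est - half_width, p_est + half_width


b = 0.7  # hypothetical slope
for a in (-6.0, 0.0):  # hypothetical intercepts: a rare event, then a common one
    p0, p1 = p_logit(a, b, 0), p_logit(a, b, 1)
    print(f"a={a:5.1f}  P(0)={p0:.4f}  P(1)={p1:.4f}  "
          f"P(1)/P(0)={p1 / p0:.4f}  odds ratio={odds(p1) / odds(p0):.4f}  "
          f"exp(b)={math.exp(b):.4f}")

low, high = suggested_interval(0.5, 100)
print(f"interval for P_estimated=0.5, #data=100: ({low:.3f}, {high:.3f})")
```

The ratio of odds equals exp(b) for any intercept, while the ratio of probabilities is close to exp(b) only when both probabilities are small, which is why the "exp(b) times" reading applies to rare events.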