ECLT 5810
Brief Introduction to Logistic Regression
Review of Ordinary Least Squares (OLS) Regression
• A "curve fitting on data points" procedure
• Achieved by minimizing the total squared distance between the curve and the data points
• The model usually looks like
  y = β0 + β1x1 + β2x2 + · · · + βnxn
• Our analysis of such models usually covers the following (a code sketch follows this list):
  • Whether the beta coefficients are significantly positive/negative/different from a certain value, with estimation errors taken into account (done by t-statistics on the beta estimates).
  • Whether the model has good explanatory power for the dependent variable, with estimation errors taken into account (done by an F-statistic on the R^2 measure).
  • The implications of the model, i.e., does y depend on x? To what extent? Are there any interaction effects? (Done by differentiation/differencing on the estimated model.)
  • Prediction (as an interval) of the dependent variable given the independent variables.
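A minimal sketch of this analysis in Python, assuming numpy and statsmodels are available; the coefficients and data are made up for illustration and are not part of the course material.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)   # hypothetical true model plus noise

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()

print(model.params)    # estimated beta coefficients
print(model.tvalues)   # t-statistics on the beta estimates
print(model.rsquared)  # R^2: explanatory power
print(model.fvalue)    # F-statistic on the overall fit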
Classical Assumptions for OLS
However, all these analyses are done under the following assumptions.
A1 (Linear in parameters)
y = β0 + β1x1 + β2x2 + · · · + error.
A2 (No perfect collinearity) No independent variable is constant or a perfect linear combination of the others.
A1 and A2 can be fulfilled by choosing a suitable form for the equation.




A3 (Zero conditional mean of errors)
E(error_t | X) = 0, t = 1, 2, ..., number of observations,
where X is the collection of all independent variables, X = (x1, x2, ..., xn).
• Under A1-A3 the OLS estimators are unbiased, i.e.
  E(estimated βj) = βj for all j.
• A4 (Homoskedasticity in errors)
  Var(error_t | X) = σ^2 (i.e. independent of X), t = 1, 2, ....
• A5 (No serial correlation in errors)
  Corr(error_t, error_s | X) = 0 for t ≠ s.
• Under A1-A5, the OLS estimators are the minimum-variance linear unbiased estimators conditional on X.
• A6 (Normality of errors) The errors are independently and identically distributed as N(0, σ^2).
• Under A1-A6, the OLS estimators are normally distributed conditional on X, and the t-statistic on the parameters and the F-statistic on R^2 can be used for different kinds of statistical reasoning.
• A3-A6 are usually assumed to be true unless there is significant evidence/reason against them. (A small simulation sketch follows.)
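A minimal simulation sketch in Python, assuming numpy is available: data are generated so that A1-A6 hold, and the average of the OLS estimates over many repetitions comes out close to the true betas (unbiasedness). The true coefficients are made up for illustration.

import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 2000
true_beta = np.array([1.0, 2.0])       # hypothetical beta0, beta1
estimates = np.empty((reps, 2))

for r in range(reps):
    x = rng.normal(size=n)                     # A2: x is not constant
    u = rng.normal(scale=0.5, size=n)          # A3-A6: iid N(0, sigma^2) errors
    y = true_beta[0] + true_beta[1] * x + u    # A1: linear in parameters
    X = np.column_stack([np.ones(n), x])
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]

print(estimates.mean(axis=0))   # close to (1.0, 2.0): unbiasedness under A1-A3
print(estimates.std(axis=0))    # spread of the estimates; their histogram looks normal under A6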
Early models for classification
• As our main target in data mining is to make predictions, the dependent variable is usually nominal/ordinal/binary in nature. Usually we use a binary y to represent this, i.e. y = 1 for yes and y = 0 for no.
• An early model is the linear probability model, which regresses the binary y on the other explanatory variables X. As y is binary, the predicted value is usually around the range 0 to 1, so people used this model to predict the probability of an event.
• However, such a model violates A3, A4 and A6. Also, the predicted value can fall outside the range 0 to 1 (as in the sketch below), so the model becomes not so useful.
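A minimal sketch, assuming numpy and statsmodels and a made-up data-generating process, of how the linear probability model can predict values outside [0, 1]:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))        # hypothetical true probabilities
y = (rng.uniform(size=n) < p).astype(float)   # binary outcome: 1 = yes, 0 = no

X = sm.add_constant(x)
lpm = sm.OLS(y, X).fit()        # linear probability model: OLS on a binary y
pred = lpm.predict(X)
print(pred.min(), pred.max())   # typically falls outside the range [0, 1]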
• The problem could be rectified by introducing a threshold such that when the predicted y is greater than the threshold, we classify y as 1. This becomes the simplest neural network model, which will be introduced later.
• However, what we obtain is then a decision rather than a probability, and the probability itself might be useful in some cases. Also, the relation between the probability and the explanatory variables becomes less clear.
• Statisticians invented logistic regression to solve this problem.
Logistic Regression
• The idea is to use a one-to-one mapping that maps the probability from the range [0, 1] to the whole real line. Then there is no problem no matter what value the right-hand side takes.
• Three common transformations/link functions (provided by SAS); a sketch follows this list:
  • Logit: ln(p/(1-p)) (we call this the log odds)
  • Probit: the normal inverse of p (recall the normal table's mapping scheme)
  • Complementary log-log: ln(-ln(1-p))
• The choice of link function depends on your purpose rather than performance. They all perform about equally well, but the interpretations differ slightly.
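A minimal sketch of the three link functions in Python, assuming numpy and scipy are available (SAS itself is the tool referenced in these slides):

import numpy as np
from scipy.stats import norm

p = np.array([0.05, 0.25, 0.5, 0.75, 0.95])

logit   = np.log(p / (1 - p))      # log odds
probit  = norm.ppf(p)              # inverse of the standard normal CDF
cloglog = np.log(-np.log(1 - p))   # complementary log-log

print(logit)    # each maps (0, 1) one-to-one onto the real line
print(probit)
print(cloglog)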
• However, as the model is no longer in linear form, ordinary least squares cannot be used. Furthermore, if we put the binary y directly into the transformation, we get positive/negative infinity.
• We use Maximum Likelihood Estimation (MLE) instead, in which we choose the beta coefficients that maximize the probability of the data we actually observe.
• MLE needs fewer assumptions than OLS, but much less inference can be made, especially for logistic regression.
• Also, as both MLE and OLS use only one beta coefficient to describe the effect an explanatory variable brings about, data scaling/normalization is particularly important (a sketch follows below).
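A minimal sketch, assuming numpy and statsmodels and a hypothetical "income" variable, of an ML-fitted logistic regression and of why scaling matters when reading the coefficients:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
income = rng.normal(50_000, 15_000, size=n)        # hypothetical explanatory variable
p = 1 / (1 + np.exp(-(-4.0 + 0.0001 * income)))    # made-up true relation
y = (rng.uniform(size=n) < p).astype(float)

z = (income - income.mean()) / income.std()        # scaled/normalized copy
fit_raw    = sm.Logit(y, sm.add_constant(income)).fit(disp=0)   # MLE fit on the raw scale
fit_scaled = sm.Logit(y, sm.add_constant(z)).fit(disp=0)        # MLE fit on the scaled copy

print(fit_raw.params)      # slope looks tiny only because income is measured in dollars
print(fit_scaled.params)   # slope per standard deviation is easier to compare across variables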
Example on Logit
• Assume we believe the relation between the probability p that an event is "yes" and an independent variable x can be described by the equation
  ln(p(x)/(1-p(x))) = a + bx
Then
  p(x) = exp(a+bx) / [1 + exp(a+bx)]
If we have 4 data points (Yes, x1), (No, x2), (Yes, x3), (No, x4) and assume they are mutually independent, then the probability that we see these 4 data points is the product
  p(x1)[1-p(x2)]p(x3)[1-p(x4)]
and MLE tries to maximize this by choosing suitable a and b (a sketch follows below).
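A minimal sketch of this maximization in Python, assuming numpy and scipy are available; the four x values are made up, and the optimizer works on the log of the likelihood product for numerical stability:

import numpy as np
from scipy.optimize import minimize

# hypothetical x values for (Yes, x1), (No, x2), (Yes, x3), (No, x4)
x = np.array([1.0, 2.0, 3.0, 0.5])
y = np.array([1, 0, 1, 0])          # 1 = Yes, 0 = No

def neg_log_likelihood(params):
    a, b = params
    p = 1 / (1 + np.exp(-(a + b * x)))   # p(x) = exp(a+bx) / [1 + exp(a+bx)]
    # likelihood = p(x1)[1-p(x2)]p(x3)[1-p(x4)]; minimize the negative of its log
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(result.x)   # the (a, b) that make the observed 4 data points most probable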
Reading the Report
• Akaike's Information Criterion (AIC) and Schwarz's Bayesian Criterion (SBC)
  (compare to: F-test on adjusted R^2 for OLS)
  - Both take a smaller value for a higher maximized likelihood, and a larger value when more explanatory variables are used (to penalize over-fitting); see the sketch after this item.
  - So a smaller value is preferred (though it is not the only consideration when choosing a model).
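A minimal sketch of the textbook AIC/SBC formulas in Python with made-up log-likelihood values; the exact constants SAS reports may differ, so treat this as the definition rather than the SAS output:

import numpy as np

def aic_sbc(log_likelihood, k, n):
    """Textbook AIC and SBC from the maximized log-likelihood, k parameters, n observations."""
    aic = -2 * log_likelihood + 2 * k
    sbc = -2 * log_likelihood + k * np.log(n)
    return aic, sbc

# hypothetical comparison: the larger model must raise the likelihood enough
# to offset its extra parameters before its AIC/SBC become smaller
print(aic_sbc(log_likelihood=-120.0, k=3, n=400))
print(aic_sbc(log_likelihood=-118.5, k=5, n=400))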
• T-score
  (compare to: t-test on estimated betas for OLS)
  It is the estimate divided by its standard error. We may treat it like the t-test in OLS and construct a confidence interval for the betas, but in practice this works only asymptotically. We just take a large t-score as an indicator of a possibly significant effect; no exact hypothesis test can be done.
• Wald's chi-square (compare to: t-test for OLS)
  We can treat an effect as significant if the tail probability is small enough (< 5%); a sketch follows below.
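A minimal sketch, assuming scipy and a made-up estimate and standard error, of how the t-score and the Wald chi-square relate:

from scipy.stats import chi2

estimate, std_error = 0.80, 0.25   # hypothetical beta estimate and its standard error

t_score = estimate / std_error     # the "t-score": estimate divided by its standard error
wald = t_score ** 2                # Wald chi-square with 1 degree of freedom
p_value = chi2.sf(wald, df=1)      # tail probability

print(t_score, wald, p_value)      # treat the effect as significant if the tail probability < 0.05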
• If we are using the model to predict the outcome rather than the probability of that outcome (the case when the criterion is set to minimize loss), the interpretation of the misclassification rate/profit and loss/ROC curve/lift chart is similar to that for decision trees.
• Some scholars suggest a prediction interval for the probability P of the event, given the independent variables, of
  P_estimated ± z_{1-a/2} [P_estimated (1 - P_estimated) / n]^(1/2),
  z being the z-score from the normal table, a the significance level and n the number of data points. But we do not have this in SAS. A sketch follows below.
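A minimal sketch of that interval in Python, assuming numpy and scipy and made-up values for the estimated probability and sample size:

import numpy as np
from scipy.stats import norm

def prob_interval(p_hat, n, alpha=0.05):
    """P_estimated +/- z_{1-alpha/2} * sqrt(P_estimated * (1 - P_estimated) / n)."""
    z = norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

print(prob_interval(p_hat=0.30, n=200))   # roughly (0.236, 0.364) at the 5% significance level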
• The interpretation of the model form is similar to OLS, using techniques like differentiation and differencing.
• One common use: for a Logit model of the form
  f(x) = ln(P(x)/(1-P(x))) = a + bx, with x binary,
  f(1) = a + b, f(0) = a,
  f(1) - f(0) = b = ln[odds(1)/odds(0)] ≈ ln(P(1)/P(0)) for small P(0), P(1),
  so P(1) ≈ exp(b) · P(0).
  Hence P(1) is about exp(b) times as big as P(0). We can draw conclusions like "having x done makes the event about exp(b) times as likely as not having it done" (a sketch follows below).
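A minimal numeric sketch, with made-up values of a and b, checking that exp(b) approximates the probability ratio when both probabilities are small:

import numpy as np

a, b = -4.0, 0.9   # hypothetical logit coefficients, x binary

def p(x):          # P(x) = exp(a + bx) / [1 + exp(a + bx)]
    return np.exp(a + b * x) / (1 + np.exp(a + b * x))

print(p(1) / p(0))   # ratio of the two probabilities
print(np.exp(b))     # exp(b) is close to that ratio because both probabilities are small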