Introduction to Machine Learning
CSE474/574: Logistic Regression
Varun Chandola <chandola@buffalo.edu>
Contents

1 Generative vs. Discriminative Classifiers
2 Logistic Regression
3 Logistic Regression - Training
  3.1 Using Gradient Descent for Learning Weights
  3.2 Using Newton's Method
  3.3 Regularization with Logistic Regression
  3.4 Handling Multiple Classes
  3.5 Bayesian Logistic Regression
  3.6 Laplace Approximation
  3.7 Posterior of w for Logistic Regression
  3.8 Approximating the Posterior
  3.9 Getting Prediction on Unseen Examples
1 Generative vs. Discriminative Classifiers

• Probabilistic classification task: estimate $p(Y = \text{benign}|X = x)$ and $p(Y = \text{malicious}|X = x)$
• How do you estimate $p(y|x)$?

\[
p(y|x) = \frac{p(y, x)}{p(x)} = \frac{p(x|y)\,p(y)}{p(x)}
\]

• Two-step approach: estimate a generative model and then the posterior for $y$ (Naïve Bayes)
• This solves a more general problem than is needed [2, 1]
• Why not directly model $p(y|x)$? This is the discriminative approach
• The number of training examples needed to PAC-learn a classifier is proportional to the VC-dimension of the hypothesis space
• The VC-dimension of a probabilistic classifier is proportional to the number of parameters [2] (or a small polynomial in the number of parameters)
• The number of parameters for $p(y, x)$ is larger than the number of parameters for $p(y|x)$

Discriminative classifiers therefore need fewer training examples for PAC learning than generative classifiers.

2 Logistic Regression

• Directly model $p(y|x)$, with $y \in \{0, 1\}$
• $p(y|x) \sim \mathrm{Bernoulli}(\theta)$ with $\theta = \mathrm{sigmoid}(w^\top x)$, i.e., $y|x$ is a Bernoulli random variable whose parameter is the sigmoid of a linear function of $x$
• When a new input $x^*$ arrives, we toss a coin that has $\mathrm{sigmoid}(w^\top x^*)$ as the probability of heads
• If the outcome is heads, the predicted class is 1, else 0
• Learns a linear decision boundary

Geometric Interpretation

• Use regression to predict discrete values
• Squash the output to $[0, 1]$ using the sigmoid function
• An output less than 0.5 is one class and an output greater than 0.5 is the other

Learning Task for Logistic Regression
Given training examples $\langle x_i, y_i \rangle_{i=1}^{N}$, learn $w$.
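To make the Bernoulli-plus-sigmoid view concrete, here is a minimal Python/NumPy sketch of how a prediction is produced for a new input $x^*$, assuming a weight vector has already been learned; the values of `w` and `x_new` are purely hypothetical. The sampling branch mirrors the coin-toss interpretation above, while a deterministic classifier simply thresholds at 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, x_new, rng=None):
    """Predict the class of x_new under p(y|x) = Bernoulli(sigmoid(w^T x))."""
    theta = sigmoid(w @ x_new)           # P(y = 1 | x_new)
    if rng is not None:                  # "coin toss" interpretation
        return int(rng.random() < theta)
    return int(theta > 0.5)              # deterministic: threshold at 0.5

# Hypothetical example: 2-D input with a bias feature appended
w = np.array([1.5, -2.0, 0.3])
x_new = np.array([0.8, 0.1, 1.0])
print(predict(w, x_new))                                  # thresholded prediction
print(predict(w, x_new, np.random.default_rng(0)))        # sampled prediction
```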
3 Logistic Regression - Training

• MLE approach
• Assume that $y \in \{0, 1\}$
• What is the likelihood of a Bernoulli sample?
  – If $y_i = 1$, $p(y_i) = \theta_i = \frac{1}{1 + \exp(-w^\top x_i)}$
  – If $y_i = 0$, $p(y_i) = 1 - \theta_i = \frac{1}{1 + \exp(w^\top x_i)}$
  – In general, $p(y_i) = \theta_i^{y_i} (1 - \theta_i)^{1 - y_i}$

Log-likelihood

\[
LL(w) = \sum_{i=1}^{N} \left[ y_i \log \theta_i + (1 - y_i) \log (1 - \theta_i) \right]
\]

• There is no closed form solution for maximizing the log-likelihood.

To understand why, we first differentiate $LL(w)$ with respect to $w$, making use of the useful result for the sigmoid:

\[
\frac{d\theta_i}{dw} = \theta_i (1 - \theta_i)\, x_i
\]
Using this result we obtain:

\[
\frac{d}{dw} LL(w) = \sum_{i=1}^{N} \left[ \frac{y_i}{\theta_i}\,\theta_i(1 - \theta_i)\,x_i - \frac{1 - y_i}{1 - \theta_i}\,\theta_i(1 - \theta_i)\,x_i \right]
 = \sum_{i=1}^{N} \big( y_i(1 - \theta_i) - (1 - y_i)\theta_i \big)\,x_i
 = \sum_{i=1}^{N} (y_i - \theta_i)\,x_i
\]

Obviously, given that $\theta_i$ is a non-linear function of $w$, a closed form solution is not possible, and we resort to iterative methods.
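As a sanity check on this derivation, here is a minimal NumPy sketch that evaluates $LL(w)$ and its gradient exactly as derived above; the design matrix `X` (shape $N \times d$) and the 0/1 label vector `y` are hypothetical inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    """LL(w) = sum_i [ y_i log(theta_i) + (1 - y_i) log(1 - theta_i) ]."""
    theta = sigmoid(X @ w)
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

def gradient(w, X, y):
    """dLL/dw = sum_i (y_i - theta_i) x_i, written as a matrix-vector product."""
    theta = sigmoid(X @ w)
    return X.T @ (y - theta)
```

A quick finite-difference comparison of `gradient` against `log_likelihood` is an easy way to confirm the formula numerically.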
3.1 Using Gradient Descent for Learning Weights

• Compute the gradient of $LL$ with respect to $w$
• $LL(w)$ is a concave function of $w$ with a unique global maximum

\[
\frac{d}{dw} LL(w) = \sum_{i=1}^{N} (y_i - \theta_i)\,x_i
\]

• Update rule (gradient ascent with learning rate $\eta$):

\[
w_{k+1} = w_k + \eta\, \frac{d\, LL(w_k)}{d w_k}
\]
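Continuing the sketch above (and reusing the hypothetical `gradient` helper), a minimal gradient-ascent loop could look like this; the learning rate and iteration count are illustrative choices, not values prescribed in the notes.

```python
def fit_gradient_ascent(X, y, eta=0.1, n_iters=1000):
    """Maximize LL(w) by repeatedly applying w <- w + eta * dLL/dw."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w = w + eta * gradient(w, X, y)
    return w
```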
3.2 Using Newton's Method

• Setting $\eta$ is sometimes tricky:
  – too large – incorrect results (the iterates overshoot or diverge)
  – too small – slow convergence
• Another way to speed up convergence is Newton's method:

\[
w_{k+1} = w_k - \eta\, H_k^{-1} \frac{d\, LL(w_k)}{d w_k}
\]

• The Hessian $H$ is the second order derivative of the objective function; since $LL(w)$ is concave, $H$ is negative definite, which is why the Newton step subtracts $H^{-1}$ times the gradient.
• Newton's method belongs to the family of second order optimization algorithms.
• For logistic regression, the Hessian is:

\[
H = -\sum_{i} \theta_i (1 - \theta_i)\, x_i x_i^\top
\]
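A sketch of the corresponding Newton iteration, continuing with the hypothetical helpers above; `np.linalg.solve` is used in place of an explicit matrix inverse as a standard numerical choice, and `eta = 1` gives the usual full Newton step.

```python
def fit_newton(X, y, eta=1.0, n_iters=20):
    """Newton's method: w <- w - eta * H^{-1} dLL/dw, with H the Hessian of LL(w)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta = sigmoid(X @ w)
        H = -(X.T * (theta * (1 - theta))) @ X   # H = -sum_i theta_i (1 - theta_i) x_i x_i^T
        w = w - eta * np.linalg.solve(H, gradient(w, X, y))
    return w
```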
3.3 Regularization with Logistic Regression

• Overfitting is an issue, especially with a large number of features
• Add a Gaussian prior, $w \sim \mathcal{N}(0, \tau^2 I)$
• Easy to incorporate into the gradient descent based approach:

\[
LL'(w) = LL(w) - \frac{1}{2}\lambda\, w^\top w
\]

\[
\frac{d}{dw} LL'(w) = \frac{d}{dw} LL(w) - \lambda w
\]

\[
H' = H - \lambda I
\]

where $I$ is the identity matrix.
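A small extension of the earlier hypothetical helpers to the penalized objective; `lam` plays the role of $\lambda$ above.

```python
def penalized_gradient(w, X, y, lam):
    """dLL'/dw = dLL/dw - lambda * w."""
    return gradient(w, X, y) - lam * w

def fit_regularized(X, y, lam=1.0, eta=0.1, n_iters=1000):
    """Gradient ascent on the penalized log-likelihood LL'(w); returns a MAP-style estimate."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w = w + eta * penalized_gradient(w, X, y, lam)
    return w
```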
3.4 Handling Multiple Classes

• One vs. Rest and One vs. One strategies reduce the problem to binary classification; alternatively, the model can be extended directly:
• $p(y|x) \sim \mathrm{Multinoulli}(\theta)$
• The Multinoulli parameter vector $\theta$ is defined as:

\[
\theta_j = \frac{\exp(w_j^\top x)}{\sum_{k=1}^{C} \exp(w_k^\top x)}
\]

• Multiclass logistic regression has $C$ weight vectors to learn, as sketched below.
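A minimal sketch of the softmax (Multinoulli) parameterization; `W` is assumed to be a $C \times d$ matrix stacking the $C$ weight vectors, and the names are illustrative.

```python
def softmax_probs(W, x):
    """theta_j = exp(w_j^T x) / sum_k exp(w_k^T x), computed for all j at once."""
    scores = W @ x
    scores = scores - scores.max()   # subtract the max for numerical stability; theta is unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def predict_class(W, x):
    """Return the index of the most probable class."""
    return int(np.argmax(softmax_probs(W, x)))
```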
3.5 Bayesian Logistic Regression

• How do we get the posterior for $w$?
• Not easy - why?

One should note that we used a Gaussian prior for $w$, which is not a conjugate prior for the Bernoulli distribution used in logistic regression. In fact, there is no convenient conjugate prior for logistic regression, so the posterior cannot be obtained in closed form and must be approximated.
3.6 Laplace Approximation

Problem Statement: How can we approximate a posterior with a Gaussian distribution?

• When is this needed?
  – When direct computation of the posterior is not possible
  – When there is no conjugate prior
• We do not know what the true posterior distribution for $w$ is.
• Is there a close-enough (approximate) Gaussian distribution?
• Assume that the posterior is:

\[
p(w|\mathcal{D}) = \frac{1}{Z} e^{-E(w)}
\]

where $E(w)$ is the energy function, i.e., the negative log of the unnormalized posterior, and $Z$ is the normalization constant.

• Let $w^*$ be the mode of the posterior distribution of $w$.
• Taylor series expansion of $E(w)$ around the mode:

\[
E(w) = E(w^*) + (w - w^*)^\top E'(w^*) + \frac{1}{2}(w - w^*)^\top E''(w^*)(w - w^*) + \ldots
\approx E(w^*) + (w - w^*)^\top E'(w^*) + \frac{1}{2}(w - w^*)^\top E''(w^*)(w - w^*)
\]

where $E'(w^*) = \nabla E$ is the first derivative (gradient) and $E''(w^*) = H$ is the second derivative (Hessian), both evaluated at the mode.

• Since $w^*$ is the mode, the first derivative (gradient) term is zero, so

\[
E(w) \approx E(w^*) + \frac{1}{2}(w - w^*)^\top H (w - w^*)
\]

• The posterior $p(w|\mathcal{D})$ may then be written as:

\[
p(w|\mathcal{D}) \approx \frac{1}{Z} e^{-E(w^*)} \exp\left( -\frac{1}{2}(w - w^*)^\top H (w - w^*) \right) = \mathcal{N}(w^*, H^{-1})
\]

where $w^*$ is the mode obtained by maximizing the posterior, e.g., using gradient ascent.
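Continuing the earlier sketches, here is a minimal Laplace approximation for logistic regression with the Gaussian prior above. It assumes a MAP estimate is already available (for example, the output of the hypothetical `fit_regularized` above, with `lam = 1/tau**2`) and returns the mean and covariance of $\mathcal{N}(w_{MAP}, H^{-1})$, where $H$ is the Hessian of the negative log-posterior.

```python
def laplace_approximation(w_map, X, y, lam):
    """Gaussian approximation N(w_MAP, H^{-1}) to the posterior p(w|D).

    H is the Hessian of the negative log-posterior: the negative log-likelihood
    contributes sum_i theta_i (1 - theta_i) x_i x_i^T and the Gaussian prior
    contributes lam * I (with lam = 1/tau^2).
    """
    theta = sigmoid(X @ w_map)
    H = (X.T * (theta * (1 - theta))) @ X + lam * np.eye(X.shape[1])
    return w_map, np.linalg.inv(H)
```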
3.7 Posterior of w for Logistic Regression

• Prior:

\[
p(w) = \mathcal{N}(0, \tau^2 I)
\]

• Likelihood of the data:

\[
p(\mathcal{D}|w) = \prod_{i=1}^{N} \theta_i^{y_i} (1 - \theta_i)^{1 - y_i},
\quad \text{where } \theta_i = \frac{1}{1 + e^{-w^\top x_i}}
\]

• Posterior:

\[
p(w|\mathcal{D}) = \frac{\mathcal{N}(0, \tau^2 I)\, \prod_{i=1}^{N} \theta_i^{y_i} (1 - \theta_i)^{1 - y_i}}{\int p(\mathcal{D}|w)\, p(w)\, dw}
\]

3.8 Approximating the Posterior

• Approximate the posterior distribution by the Gaussian

\[
p(w|\mathcal{D}) \approx \mathcal{N}(w_{MAP}, H^{-1})
\]

• $H$ is the Hessian of the negative log-posterior with respect to $w$.

The Hessian $H$ is the matrix obtained by twice differentiating a scalar function of a vector; in this case the scalar function is the negative log-posterior of $w$:

\[
H = \nabla^2 \left( -\log p(\mathcal{D}|w) - \log p(w) \right)
\]

3.9 Getting Prediction on Unseen Examples

\[
p(y|x) = \int p(y|x, w)\, p(w|\mathcal{D})\, dw
\]

There are three ways to handle this integral:

1. Use a point estimate of $w$ (MLE or MAP). Using a point estimate of $w$ means that the integral disappears from the Bayesian averaging equation.
2. Analytical result, when the integral can be approximated in closed form.
3. Monte Carlo approximation (numerical integration):
   • Sample a finite number of "versions" of $w$ from $p(w|\mathcal{D})$
   • Compute $p(y|x)$ for each sample and average the results, as sketched below.
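A sketch of the Monte Carlo approximation, drawing posterior samples from the Laplace approximation built above; the sample count is an illustrative choice.

```python
def predict_proba_mc(x_new, w_map, cov, n_samples=1000, rng=None):
    """Approximate p(y = 1 | x_new) = E_{p(w|D)}[ sigmoid(w^T x_new) ] by averaging over samples."""
    rng = np.random.default_rng() if rng is None else rng
    W = rng.multivariate_normal(w_map, cov, size=n_samples)   # samples from N(w_MAP, H^{-1})
    return float(sigmoid(W @ x_new).mean())
```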
References

[1] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, NIPS, pages 841–848. MIT Press, 2001.

[2] V. Vapnik. Statistical Learning Theory. Wiley, 1998.