Logistic Regression
Rong Jin
Logistic Regression
• Generative models often lead to a linear decision boundary
• Logistic regression is a linear discriminative model
• It directly models the linear decision boundary
• w is the parameter to be learned (the model is written out below)
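In its standard form (assuming the y ∈ {-1, +1} label convention used later in the heart-disease example), the model is:

p(y \mid x; w) = \frac{1}{1 + \exp(-y\, w^\top x)}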
Logistic Regression
Logistic Regression
Learn the parameter w by Maximum Likelihood Estimation (MLE)
• Given training data {(x_i, y_i)}_{i=1}^n
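With the model above and labels y_i ∈ {-1, +1}, MLE maximizes the log-likelihood (equivalently, minimizes the logistic loss):

\ell(w) = \sum_{i=1}^{n} \log p(y_i \mid x_i; w) = -\sum_{i=1}^{n} \log\left(1 + \exp(-y_i\, w^\top x_i)\right)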
Logistic Regression
• Convex objective function: any local optimum is the global optimum
• Optimize by gradient descent (a minimal sketch follows below)
[Figure: classification error vs. iteration]
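A minimal sketch of gradient ascent on the log-likelihood above, in plain NumPy (the function names, step size, and iteration count are illustrative, not from the original slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    # y has entries in {-1, +1}; X has shape (n, d)
    margins = y * (X @ w)
    return -np.sum(np.log1p(np.exp(-margins)))

def grad_log_likelihood(w, X, y):
    # Gradient of sum_i -log(1 + exp(-y_i w^T x_i))
    margins = y * (X @ w)
    return X.T @ (sigmoid(-margins) * y)

def fit_logistic(X, y, step=0.1, iters=1000):
    # The log-likelihood is concave, so ascent with a small
    # enough step size reaches the global optimum.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w + step * grad_log_likelihood(w, X, y)
    return w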
Illustration of Gradient Descent
How to Decide the Step Size?
• Backtracking line search (a sketch follows below)
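A minimal sketch of backtracking line search for the ascent step, reusing log_likelihood and grad_log_likelihood from the sketch above (the shrinkage factor beta and slope parameter alpha are illustrative defaults, not from the slides):

def backtracking_step(w, X, y, alpha=0.3, beta=0.5, step=1.0):
    # Armijo backtracking: shrink the step until the ascent step
    # yields a sufficient increase in the log-likelihood.
    g = grad_log_likelihood(w, X, y)
    f0 = log_likelihood(w, X, y)
    gg = g @ g
    while log_likelihood(w + step * g, X, y) < f0 + alpha * step * gg:
        step *= beta
    return w + step * g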
Example: Heart Disease
[Figure: bar chart of the number of people with and without heart disease in each age group. Age groups: 1: 25-29, 2: 30-34, 3: 35-39, 4: 40-44, 5: 45-49, 6: 50-54, 7: 55-59, 8: 60-64. Axes: age group vs. number of people.]
• Input feature x: age group id (1-8)
• Output y: whether the person has heart disease
• y = 1: having heart disease
• y = -1: no heart disease
Example: Heart Disease
[Figure: the same bar chart of heart disease counts by age group.]
Example: Text Categorization
Learn to classify text into two categories
• Input d: a document, represented by a word histogram (see the sketch below)
• Output y: +1 for a political document, -1 for a non-political document
Example: Text Categorization
• Training data: labeled documents {(d_i, y_i)}_{i=1}^n
Example 2: Text Classification
• Dataset: Reuters-21578
• Classification accuracy
• Naïve Bayes: 77%
• Logistic regression: 88%
Logistic Regression vs. Naïve Bayes
• Both produce linear decision boundaries
• Naïve Bayes: weights come from the estimated class-conditional word probabilities
• Logistic regression: weights learned by MLE
• Both can be viewed as modeling p(d|y)
• Naïve Bayes: assumes words are independent given the class
• Logistic regression: assumes an exponential family distribution for p(d|y), a much broader assumption (see the derivation below)
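To see why the exponential family assumption yields a linear boundary, here is the standard derivation (reconstructed, not verbatim from the slides). If p(d \mid y) = h(d)\exp(\theta_y^\top \phi(d) - A(\theta_y)) for each class, then by Bayes' rule the log-odds are linear in the features \phi(d):

\log \frac{p(y=+1 \mid d)}{p(y=-1 \mid d)} = (\theta_{+1} - \theta_{-1})^\top \phi(d) + \log\frac{p(y=+1)}{p(y=-1)} - A(\theta_{+1}) + A(\theta_{-1})

Setting w = \theta_{+1} - \theta_{-1} and absorbing the constants into a bias recovers the logistic model; Naïve Bayes is the special case where \phi(d) is the word-count vector and the independence assumption fixes the form of \theta_y.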
Logistic Regression vs. Naïve Bayes
Discriminative vs. Generative
Discriminative Models
Model P(y|x)
Pros
• Usually good performance
Cons
• Slow convergence
• Expensive computation
• Sensitive to noisy data
Generative Models
Model P(x|y)
Pros
• Usually fast convergence
• Cheap computation
• Robust to noisy data
Cons
• Usually worse performance
Overfitting Problem
Consider text categorization
• What is the weight for a word j that appears in only one training document dk?
• MLE drives that weight to infinity: increasing it always improves the likelihood of dk and never affects the other documents (the claim is checked below)
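A short check of this claim (standard reasoning, with \sigma the sigmoid function and d_{i,j} the count of word j in document d_i): if word j occurs only in d_k, the partial derivative of the log-likelihood reduces to a single term,

\frac{\partial \ell}{\partial w_j} = \sum_{i} \sigma(-y_i\, w^\top d_i)\, y_i\, d_{i,j} = \sigma(-y_k\, w^\top d_k)\, y_k\, d_{k,j}

which always has the sign of y_k and never vanishes, so MLE pushes w_j toward infinity in magnitude to fit that one document.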
Overfitting Problem
[Figure: test-set classification accuracy vs. iteration, with and without regularization; without regularization the test accuracy decreases as training proceeds.]
Solution: Regularization
Regularized log-likelihood (written out below)
The effects of the regularizer
• Favors small weights
• Guarantees a bounded norm of w
• Guarantees a unique solution
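In its standard L2 form (the regularization weight \lambda is notation introduced here, consistent with the objective used earlier):

\ell_{\mathrm{reg}}(w) = -\sum_{i=1}^{n} \log\left(1 + \exp(-y_i\, w^\top x_i)\right) - \frac{\lambda}{2}\|w\|_2^2

For \lambda > 0 the objective is strictly concave, which is what guarantees the unique solution above.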
Regularized Logistic Regression
[Figure: classification performance vs. iteration, with and without regularization.]
Regularization as Robust Optimization
• Assume each data point is unknown but bounded within a sphere of radius r centered at xi (the resulting worst-case objective is sketched below)
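A sketch of why this recovers regularization (standard argument, reconstructed rather than taken verbatim from the slide): the logistic loss decreases in the margin, and the worst perturbation inside the sphere shrinks each margin by exactly r\|w\|_2:

\min_{\|\delta_i\|_2 \le r} y_i\, w^\top (x_i + \delta_i) = y_i\, w^\top x_i - r\,\|w\|_2

so maximizing the worst-case log-likelihood implicitly penalizes large \|w\|_2, just as an explicit regularizer does.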
Sparse Solution by Lasso Regularization
RCV1 collection:
• 800K documents
• 47K unique words
Sparse Solution by Lasso Regularization
How to solve the optimization problem?
• Subgradient descent (a minimal sketch follows below)
• Minimax formulation
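A minimal sketch of the subgradient method on the Lasso-regularized log-likelihood, reusing grad_log_likelihood from the earlier sketch (the step-size schedule and lam are illustrative):

import numpy as np

def fit_lasso_logistic(X, y, lam=0.1, iters=1000):
    # Subgradient ascent on log_likelihood(w) - lam * ||w||_1.
    # np.sign(w) is a valid subgradient choice for ||w||_1 (0 at w_j = 0).
    w = np.zeros(X.shape[1])
    for t in range(1, iters + 1):
        g = grad_log_likelihood(w, X, y) - lam * np.sign(w)
        w = w + (0.1 / np.sqrt(t)) * g   # diminishing step size
    return w

In practice a proximal (soft-thresholding) step yields exactly sparse iterates; plain subgradient steps only approach sparsity.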
Bayesian Treatment
• Compute the posterior distribution of w
• Laplace approximation
Bayesian Treatment
• Laplace approximation (written out below)
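In its standard form (notation introduced here, not verbatim from the slides): approximate the posterior by a Gaussian centered at the MAP estimate, with covariance given by the inverse Hessian of the negative log-posterior:

p(w \mid D) \approx \mathcal{N}\left(w_{\mathrm{MAP}},\ H^{-1}\right), \qquad H = -\nabla^2 \log p(w \mid D)\,\Big|_{w = w_{\mathrm{MAP}}}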
Multi-class Logistic Regression
• How to extend the logistic regression model to multi-class classification?
Conditional Exponential Model
• Let the classes be y ∈ {1, ..., K}
• A normalization factor (partition function) makes the class probabilities sum to one (see the model below)
• Need to learn one weight vector per class: w_1, ..., w_K
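The standard form of the conditional exponential (softmax) model, with Z(x) the partition function mentioned above:

p(y \mid x) = \frac{\exp(w_y^\top x)}{Z(x)}, \qquad Z(x) = \sum_{s=1}^{K} \exp(w_s^\top x)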
Conditional Exponential Model
• Learn the weights w_s by maximum likelihood estimation
• Any problem? (One known issue: the model is over-parameterized; adding the same vector to every w_s leaves p(y|x) unchanged, so the MLE solution is not unique.)
Modified Conditional Exponential Model
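One standard modification that removes the redundancy above (a common fix, though not necessarily the exact formulation on the original slide) is to anchor one class by fixing its weights to zero:

p(y \mid x) = \frac{\exp(w_y^\top x)}{1 + \sum_{s=1}^{K-1} \exp(w_s^\top x)}, \qquad w_K = 0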