Learning with Positive and Unlabeled Examples using Weighted Logistic Regression
Wee Sun Lee, National University of Singapore
Bing Liu, University of Illinois, Chicago

Personalized Web Browser
• Learn which web pages are of interest to you!
• Information available to the browser when it is installed:
– Your bookmarks (or cached documents) – positive examples
– All documents on the web – unlabeled examples!

Direct Marketing
• A company has a database with details of its customers – positive examples
• It wants to find people who are similar to its own customers
• It buys a database with details of other people, some of whom may be potential customers – unlabeled examples

Assumptions
• All examples are drawn independently from a fixed underlying distribution
• Negative examples are never labeled
• With a fixed probability α, a positive example is independently left unlabeled

Are Unlabeled Examples Helpful?
• The target function is known to be either x1 < 0 or x2 > 0
• Which one is it?
[Figure: positive (+) and unlabeled (u) examples plotted against the two candidate boundaries x1 < 0 and x2 > 0]
• Not learnable with only positive examples; the addition of unlabeled examples makes it learnable

Related Works
• Denis (1998) showed that function classes learnable in the statistical query model are learnable from positive and unlabeled examples
• Muggleton (2001) showed that learning from positive examples is possible if the distribution of inputs is known
• Liu et al. (2002) give sample complexity bounds and an algorithm based on EM
• Yu et al. (2002) give an algorithm based on SVM
• …

Approach
• Label all unlabeled examples as negative (Denis 1998)
– Negative examples are always labeled negative
– Positive examples are labeled negative with probability α
• Training with one-sided noise
• Problem: α is not known
• Also, what if there is some noise on the negative examples? Negative examples are occasionally labeled positive with small probability.

Selecting Threshold and Robustness to Noise
• Approach: reweigh the examples and learn the conditional probability P(Y=1|X)
• Weight the examples by
– multiplying the negative examples by a weight equal to the number of positive examples, and
– multiplying the positive examples by a weight equal to the number of negative examples

Selecting Threshold and Robustness to Noise
• Then P(Y=1|X) > 0.5 when X is a positive example and P(Y=1|X) < 0.5 when X is a negative example, as long as α + β < 1, where
– α is the probability that a positive example is labeled negative
– β is the probability that a negative example is labeled positive
• Okay even if some of the positive examples are not actually positive (noise)

Weighted Logistic Regression
• Practical algorithm: reweigh the examples, then do logistic regression with a linear function to learn P(Y=1|X) (a code sketch follows the regularization slide below)
– Compose the linear function with a sigmoid, then do maximum likelihood estimation
• Convex optimization problem
• Learns the correct conditional probability if it can be represented
• Minimizes an upper bound on the weighted classification error if it cannot be represented – still makes sense

Selecting Regularization Parameter
• Regularization is important when learning with noise
• Add c times the sum of squared weight values to the cost function as regularization
• How to choose the value of c?
– When both positive and negative examples are available, a validation set can be used to choose c
– Can use weighted examples in a validation set to choose c, but it is not clear that this makes sense
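A minimal sketch of the reweighing-plus-logistic-regression recipe above, assuming scikit-learn is available; the array names (X_pos, X_unl) and the toy data are illustrative, not from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=1.0, size=(50, 5))    # labeled positive examples
X_unl = rng.normal(loc=0.0, size=(200, 5))   # unlabeled examples

# Step 1: treat every unlabeled example as negative.
X = np.vstack([X_pos, X_unl])
y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_unl))])

# Step 2: reweigh -- positives get weight = number of "negatives",
# "negatives" (the unlabeled) get weight = number of positives.
w = np.where(y == 1, float(len(X_unl)), float(len(X_pos)))

# Step 3: L2-regularized logistic regression on a linear function.
# scikit-learn's C is the inverse of the regularization constant c in the slides.
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X, y, sample_weight=w)

# With these weights, thresholding P(Y=1|X) at 0.5 separates positives
# from negatives whenever the noise rates satisfy alpha + beta < 1.
pred = clf.predict_proba(X)[:, 1] > 0.5
```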
Selecting Regularization Parameter
• The performance criterion pr/P(Y=1) can be estimated directly from a validation set as r²/P(f(X) = 1)
– Recall r = P(f(X) = 1 | Y = 1)
– Precision p = P(Y = 1 | f(X) = 1)
• Can be used for
– tuning the regularization parameter c
– comparing different algorithms when only positive and unlabeled examples (no negatives) are available
• Behavior is similar to the commonly used F-score F = 2pr/(p+r)
– Reasonable whenever use of the F-score is reasonable
• (An estimation sketch appears after the conclusions below)

Experimental Setup
• 20 Newsgroups dataset
• 1 group positive, the other 19 negative
• Term frequencies as features, normalized to length 1
• Random split:
– 50% train
– 20% validation
– 30% test
• Validation set used to select the regularization parameter from a small discrete set, then retrain on training + validation data

Results
[Figure: F-score averaged over the 20 newsgroups, comparing selection of c by the pr/P(Y=1) criterion and by weighted error against the optimal choice (Opt), S-EM, and a one-class SVM (1-Cls SVM)]

Conclusions
• Learning from positive and unlabeled examples by learning P(Y=1|X) after setting all unlabeled examples to negative
– Reweighing the examples allows a threshold at 0.5 and makes the method tolerant to negative examples that are mislabeled as positive
• The performance measure pr/P(Y=1) can be estimated from the data
– Useful when the F-score is reasonable
– Can be used to select the regularization parameter
• Logistic regression with linear functions, together with these methods, works well on text classification
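A minimal sketch of estimating the model-selection criterion from a positive-and-unlabeled validation split, as described in the "Selecting Regularization Parameter" slide: r is estimated on the labeled positive validation examples and P(f(X) = 1) on the whole validation set. The function name pu_criterion and the variable names are mine, not from the paper.

```python
import numpy as np

def pu_criterion(pred_on_val_pos, pred_on_val_all):
    """Estimate pr/P(Y=1) = r^2 / P(f(X)=1) from 0/1 predictions f(X).

    pred_on_val_pos: predictions on the labeled positive validation examples
    pred_on_val_all: predictions on the whole validation set (positive + unlabeled)
    """
    r = np.mean(pred_on_val_pos)        # estimates recall P(f(X)=1 | Y=1)
    p_f1 = np.mean(pred_on_val_all)     # estimates P(f(X)=1)
    if p_f1 == 0.0:
        return 0.0                      # classifier never predicts positive
    return r * r / p_f1                 # larger is better

# Example use for tuning c: train one classifier per candidate value of c
# and keep the value whose validation predictions maximize pu_criterion.
```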