talk (ppt)

Learning with Positive and Unlabeled Examples using Weighted Logistic Regression
Wee Sun Lee, National University of Singapore
Bing Liu, University of Illinois, Chicago

Personalized Web Browser
• Learn web pages that are of interest to you!
• Information that is available to the browser when it is installed:
  – Your bookmarks (or cached documents): positive examples
  – All documents on the web: unlabeled examples!!

Direct Marketing
• Company has a database with details of its customers: positive examples
• Wants to find people who are similar to its own customers
• Buys a database consisting of details of people, some of whom may be potential customers: unlabeled examples

Assumptions
• All examples are drawn independently from a fixed underlying distribution
• Negative examples are never labeled
• With fixed probability α, each positive example is independently left unlabeled

Are Unlabeled Examples Helpful?
• Function known to be either x1 < 0 or x2 > 0
• Which one is it?
[Figure: positive (+) and unlabeled (u) examples plotted against the regions x1 < 0 and x2 > 0]
• Not learnable with only positive examples. However, the addition of unlabeled examples makes it learnable.

Related Work
• Denis (1998) showed that function classes learnable in the statistical query model are learnable from positive and unlabeled examples.
• Muggleton (2001) showed that learning from positive examples is possible if the distribution of inputs is known.
• Liu et al. (2002) give sample complexity bounds and an algorithm based on EM
• Yu et al. (2002) give an algorithm based on SVM
• …

Approach
• Label all unlabeled examples as negative (Denis 1998)
  – Negative examples are always labeled negative
  – Positive examples are labeled negative with probability α
• Training with one-sided noise
• Problem: α is not known
• Also, what if there is some noise on the negative examples? Negative examples are occasionally labeled positive with small probability.
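The labeling scheme on the Approach slide can be illustrated with a small simulation. This is a minimal sketch, not the authors' code: the function name `make_pu_labels`, the variable names, and the synthetic data are assumptions made for illustration, and the missing-label probability is called `alpha` here.

```python
import numpy as np

def make_pu_labels(y_true, alpha, rng):
    """Return observed labels: 1 = labeled positive, 0 = unlabeled (treated as negative).

    Hypothetical helper: each true positive is left unlabeled independently
    with probability alpha; true negatives are never labeled positive.
    """
    y_true = np.asarray(y_true)
    keep = rng.random(y_true.shape[0]) >= alpha  # True with probability 1 - alpha
    return np.where((y_true == 1) & keep, 1, 0)

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.3).astype(int)  # synthetic true labels
s = make_pu_labels(y_true, alpha=0.4, rng=rng)

# One-sided noise: only positives can be mislabeled.
flipped = np.mean(s[y_true == 1] == 0)  # fraction of positives labeled negative
print(f"fraction of positives left unlabeled: {flipped:.3f}")
```

The observed labels `s` are what a learner actually sees; `y_true` is available here only because the data is simulated.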
Selecting Threshold and Robustness to Noise
• Approach: Reweight examples and learn the conditional probability P(Y=1|X)
• If you weight the examples by
  – multiplying the negative examples by a weight equal to the number of positive examples, and
  – multiplying the positive examples by a weight equal to the number of negative examples,
• then P(Y=1|X) > 0.5 when X is a positive example and P(Y=1|X) < 0.5 when X is a negative example, as long as α + β < 1, where
  – α is the probability that a positive example is labeled negative
  – β is the probability that a negative example is labeled positive
• Okay, even if some of the positive examples are not actually positive (noise).

Weighted Logistic Regression
• Practical algorithm: Reweight the examples and then do logistic regression with a linear function to learn P(Y=1|X).
  – Compose the linear function with a sigmoid, then do maximum likelihood estimation
• Convex optimization problem
• Will learn the correct conditional probability if it can be represented
• Minimizes an upper bound on the weighted classification error if it cannot be represented, so it still makes sense.

Selecting Regularization Parameter
• Regularization is important when learning with noise
• Add c times the sum of squared values of the weights to the cost function as regularization
• How to choose the value of c?
  – When both positive and negative examples are available, can use a validation set to choose c.
  – Can use weighted examples in a validation set to choose c, but not sure if this makes sense?
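The reweighting, the sigmoid composed with a linear function, and the squared-weight penalty described on the slides above can be sketched in NumPy as follows. Several details are assumptions rather than anything stated in the talk: the function name `fit_weighted_logreg`, plain gradient descent as the optimizer, normalizing the example weights to sum to 1 (which rescales the effective value of c), leaving the bias term unpenalized, and the synthetic one-dimensional data.

```python
import numpy as np

def fit_weighted_logreg(X, s, c=0.0, lr=0.5, n_iter=2000):
    """Weighted, L2-regularized logistic regression by gradient descent (sketch).

    s holds observed labels: 1 = labeled positive, 0 = unlabeled treated as negative.
    """
    X = np.asarray(X, dtype=float)
    s = np.asarray(s, dtype=float)
    n_pos, n_neg = s.sum(), (1.0 - s).sum()
    # Positives are weighted by the number of negatives and vice versa,
    # then normalized (an assumption) so the weights sum to 1.
    sample_w = np.where(s == 1, n_neg, n_pos)
    sample_w = sample_w / sample_w.sum()
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid of the linear function
        err = sample_w * (p - s)                # weighted log-loss gradient terms
        w -= lr * (X.T @ err + 2.0 * c * w)     # penalty c * sum(w**2) on weights only
        b -= lr * err.sum()
    return w, b

# Synthetic check: positives near +2, negatives near -2, half the positives unlabeled.
rng = np.random.default_rng(0)
n = 4000
y = (rng.random(n) < 0.3).astype(int)
X = rng.normal(loc=np.where(y == 1, 2.0, -2.0), scale=1.0)[:, None]
s = np.where((y == 1) & (rng.random(n) >= 0.5), 1, 0)

w, b = fit_weighted_logreg(X, s, c=1e-3)
pred = (X @ w + b > 0).astype(int)  # sigmoid > 0.5 exactly when the linear score > 0
accuracy_vs_true = np.mean(pred == y)
print(f"accuracy against the true labels: {accuracy_vs_true:.3f}")
```

The model is trained only on the noisy labels `s`, yet thresholding at 0.5 is evaluated against the true labels `y`, which is the property the reweighting is meant to provide.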
Selecting Regularization Parameter
• Performance criterion pr/P(Y=1) can be estimated directly from the validation set as r²/P(f(X) = 1)
  – Recall r = P(f(X) = 1 | Y = 1)
  – Precision p = P(Y = 1 | f(X) = 1)
• Can be used for
  – tuning the regularization parameter c
  – comparing different algorithms when only positive and unlabeled examples (no negatives) are available
• Behavior similar to the commonly used F-score F = 2pr/(p+r)
  – Reasonable when use of the F-score is reasonable

Experimental Setup
• 20 Newsgroups dataset
• 1 group positive, 19 others negative
• Term frequency as features, normalized to length 1
• Randomly split
  – 50% train
  – 20% validation
  – 30% test
• Validation set used to select the regularization parameter from a small discrete set, then retrain on the training + validation set

Results
F-score averaged over 20 groups

α     Opt     pr/P(Y=1)   Weighted Error   S-EM    1-Cls SVM
0.3   0.757   0.754       0.646            0.661   0.15
0.7   0.675   0.659       0.619            0.59    0.153

Conclusions
• Learning from positive and unlabeled examples by learning P(Y=1|X) after setting all unlabeled examples negative.
  – Reweighting examples allows a threshold at 0.5 and makes the method tolerant to negative examples that are mislabeled as positive
• Performance measure pr/P(Y=1) can be estimated from data
  – Useful when the F-score is reasonable
  – Can be used to select the regularization parameter
• Logistic regression with a linear function, combined with these methods, works well on text classification
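The validation criterion from the Selecting Regularization Parameter slide follows from p · P(f(X) = 1) = P(Y = 1, f(X) = 1) = r · P(Y = 1), so pr/P(Y=1) = r²/P(f(X) = 1). A minimal sketch of the estimate is below. The function name `pu_criterion` and the toy arrays are assumptions; recall is estimated on the labeled positives only, which relies on the slides' assumption that positives are left unlabeled independently with a fixed probability.

```python
import numpy as np

def pu_criterion(pred, s):
    """Estimate r^2 / P(f(X) = 1) from a positive + unlabeled validation set.

    pred: classifier outputs f(X) in {0, 1} for every validation example.
    s:    1 for labeled positives, 0 for unlabeled examples.
    """
    pred = np.asarray(pred).astype(bool)
    s = np.asarray(s).astype(bool)
    frac_pred_pos = pred.mean()        # estimate of P(f(X) = 1)
    if frac_pred_pos == 0 or not s.any():
        return 0.0                     # nothing predicted positive, or no labeled positives
    recall_est = pred[s].mean()        # estimate of r from labeled positives
    return recall_est ** 2 / frac_pred_pos

# Toy example: 3 labeled positives among 8 validation examples.
pred = [1, 1, 0, 0, 1, 0, 0, 0]
s    = [1, 1, 1, 0, 0, 0, 0, 0]
score = pu_criterion(pred, s)          # (2/3)^2 / (3/8)
print(f"criterion: {score:.4f}")
```

To tune c under this sketch, one would compute the score for each candidate value on the validation set and keep the value with the highest score.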
