Co-training
LING 572
Fei Xia
02/21/06

Overview
• Proposed by Blum and Mitchell (1998)
• Important work:
  – (Nigam and Ghani, 2000)
  – (Goldman and Zhou, 2000)
  – (Abney, 2002)
  – (Sarkar, 2002)
  – …
• Used in document classification, parsing, etc.

Outline
• Basic concept: (Blum and Mitchell, 1998)
• Relation with other SSL algorithms: (Nigam and Ghani, 2000)

An example
• Web-page classification: e.g., find homepages of faculty members.
  – Page text: words occurring on that page, e.g., “research interest”, “teaching”
  – Hyperlink text: words occurring in hyperlinks that point to that page, e.g., “my advisor”

Two views
• Features can be split into two sets:
  – The instance space: $X = X_1 \times X_2$
  – Each example: $x = (x_1, x_2)$
• $D$: the distribution over $X$
• $C_1$: the set of target functions over $X_1$, with $f_1 \in C_1$
• $C_2$: the set of target functions over $X_2$, with $f_2 \in C_2$

Assumption #1: compatibility
• The instance distribution $D$ is compatible with the target function $f = (f_1, f_2)$ if, for any $x = (x_1, x_2)$ with non-zero probability, $f(x) = f_1(x_1) = f_2(x_2)$.
• That is, each set of features is sufficient for classification.
• The compatibility of $f$ with $D$: $p = 1 - \Pr_D[(x_1, x_2) : f_1(x_1) \neq f_2(x_2)]$

Assumption #2: conditional independence

Co-training algorithm

Co-training algorithm (cont)
• Why use U’ in addition to U?
  – Using U’ yields better results.
  – Possible explanation: this forces h1 and h2 to select examples that are more representative of the underlying distribution D that generates U.
• Choosing p and n: the ratio p/n should match the ratio of positive to negative examples in D.
• Choosing the number of iterations and the size of U’.

Intuition behind the co-training algorithm
• h1 adds examples to the labeled set that h2 will be able to use for learning, and vice versa.
• If the conditional independence assumption holds, then on average each added document will be as informative as a random document, and the learning will progress.
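The loop discussed on the slides above can be sketched in code. The following is a minimal illustration that assumes the procedure as usually stated for Blum and Mitchell (1998): train one classifier per view, let each classifier move its p most confident positives and n most confident negatives from the pool U’ into the labeled set, then refill U’ from U. The names `train_nb`, `co_train`, and `combined` and the toy data are hypothetical and are not taken from the slides. Two simplifications are assumed: examples chosen by h1 are removed from U’ before h2 chooses, and U’ is topped up to its original size. The combined classifier multiplies the two views' probabilities, which is one simple way to merge them.

```python
import math
import random
from collections import Counter

def train_nb(examples):
    """Tiny multinomial Naive Bayes (illustrative helper, not from the slides).
    examples: list of (words, label) with label in {0, 1}.
    Returns a function mapping a word list to P(label = 1 | words)."""
    counts = {0: Counter(), 1: Counter()}   # word counts per class
    docs = Counter()                        # document counts per class
    for words, label in examples:
        counts[label].update(words)
        docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])

    def prob_pos(words):
        logp = {}
        for c in (0, 1):
            # Laplace smoothing; the extra +1 reserves mass for unseen words.
            total = sum(counts[c].values()) + len(vocab) + 1
            logp[c] = math.log((docs[c] + 1) / (len(examples) + 2))
            logp[c] += sum(math.log((counts[c][w] + 1) / total) for w in words)
        m = max(logp.values())              # subtract the max for stability
        e0, e1 = math.exp(logp[0] - m), math.exp(logp[1] - m)
        return e1 / (e0 + e1)
    return prob_pos

def co_train(labeled, unlabeled, p=1, n=3, iterations=30, pool_size=75, seed=0):
    """labeled: list of ((x1, x2), y); unlabeled: list of (x1, x2)."""
    rng = random.Random(seed)
    L, U = list(labeled), list(unlabeled)
    rng.shuffle(U)
    pool = [U.pop() for _ in range(min(pool_size, len(U)))]   # the pool U'
    for _ in range(iterations):
        # Train one classifier per view on the current labeled set.
        h = [train_nb([(x[v], y) for x, y in L]) for v in (0, 1)]
        for v in (0, 1):
            # Rank the pool by this view's P(positive), lowest first.
            pool.sort(key=lambda x: h[v](x[v]))
            negatives, pool = pool[:n], pool[n:]       # n most confident negatives
            cut = max(0, len(pool) - p)
            pool, positives = pool[:cut], pool[cut:]   # p most confident positives
            L += [(x, 0) for x in negatives] + [(x, 1) for x in positives]
        # Replenish U' from U (assumption: top up to the original pool size).
        while len(pool) < pool_size and U:
            pool.append(U.pop())
    h1, h2 = (train_nb([(x[v], y) for x, y in L]) for v in (0, 1))

    def combined(x):
        # Multiply the per-view probabilities and pick the larger product.
        p1, p2 = h1(x[0]), h2(x[1])
        return int(p1 * p2 > (1 - p1) * (1 - p2))
    return h1, h2, combined

# Toy data: view 1 = page words, view 2 = hyperlink words (made up).
labeled = [((["research", "interest"], ["my", "advisor"]), 1),
           ((["homework", "syllabus"], ["course", "page"]), 0)]
unlabeled = [(["research", "papers"], ["advisor"]),
             (["syllabus", "exam"], ["course"]),
             (["research", "group"], ["my", "advisor"]),
             (["homework", "exam"], ["course", "page"])]
h1, h2, combined = co_train(labeled, unlabeled, p=1, n=1, iterations=3, pool_size=4)
```

The default arguments mirror the settings reported for the web-page experiment (p=1, n=3, 30 iterations, |U’| = 75). Ranking by a single probability works here because the task is binary; a multi-class version would need a per-class confidence score.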
Experiments: setting
• 1051 web pages from 4 CS departments
  – 263 pages (25%) as test data
  – The remaining 75% of pages: 12 labeled examples (3 positive and 9 negative) and the rest (776 pages) unlabeled
• Manually labeled into a number of categories, e.g., “course home page”.
• Two views:
  – View #1 (page-based): words in the page
  – View #2 (hyperlink-based): words in the hyperlinks
• Learner: Naïve Bayes

Naïve Bayes classifier (Nigam and Ghani, 2000)

Experiment: results
Error rate (%):
                       Page-based    Hyperlink-based   Combined
                       classifier    classifier        classifier
  Supervised training  12.9          12.4              11.1
  Co-training          6.2           11.6              5.0
• Settings: p=1, n=3; # of iterations: 30; |U’| = 75

Questions
• Can co-training algorithms be applied to datasets without natural feature divisions?
• How sensitive are the co-training algorithms to the correctness of the assumptions?
• What is the relation between co-training and other SSL methods (e.g., self-training)?

(Nigam and Ghani, 2000)

EM
• Pool the features together.
• Use the initial labeled data to get initial parameter estimates.
• In each iteration, use all the data (labeled and unlabeled) to re-estimate the parameters.
• Repeat until convergence.

Experimental results: WebKB course database
• EM performs better than co-training.
• Both are close to the supervised method when trained on more labeled data.

Another experiment: The News 2*2 dataset
• A semi-artificial dataset
• The conditional independence assumption holds.
• Co-training outperforms EM and the “oracle” result.

Co-training vs. EM
• Co-training splits features, EM does not.
• Co-training uses the unlabeled data incrementally.
• EM probabilistically labels all the data at each round; EM uses the unlabeled data iteratively.

Co-EM: EM with feature split
• Repeat until convergence:
  – Train the A-feature-set classifier using the labeled data and the unlabeled data with B’s labels.
  – Use classifier A to probabilistically label all the unlabeled data.
  – Train the B-feature-set classifier using the labeled data and the unlabeled data with A’s labels.
  – B re-labels the data for use by A.

Four SSL methods

Results on the News 2*2 dataset
                 Natural feature split   Random feature split
  Co-training:   3.7%                    5.5%
  Co-EM:         3.3%                    5.1%
• When the conditional independence assumption does not hold, but there is sufficient redundancy among the features, co-training still works well.

Assumptions
• Assumptions made by the underlying classifier (supervised learner):
  – Naïve Bayes: words occur independently of each other, given the class of the document.
  – Co-training uses the classifier to rank the unlabeled examples by confidence.
  – EM uses the classifier to assign probabilities to each unlabeled example.
• Assumptions made by the SSL method:
  – Co-training: the conditional independence assumption.
  – EM: maximizing likelihood correlates with reducing classification errors.

Summary of (Nigam and Ghani, 2000)
• Comparison of four SSL methods: self-training, co-training, EM, co-EM.
• The performance of the SSL methods depends on how well the underlying assumptions are met.
• Randomly splitting the features is not as good as a natural split, but it still works if there is sufficient redundancy among the features.

Variations of co-training
• Goldman and Zhou (2000) use two learners of different types, but both take the whole feature set.
• Zhou and Li (2005) use three learners. If two agree, the data is used to teach the third learner.
• Balcan et al. (2005) relax the conditional independence assumption to a much weaker expansion condition.

An alternative?
• L → L1, L → L2
• U → U1, U → U2
• Repeat
  – Train h1 using L1 on Feat Set1
  – Train h2 using L2 on Feat Set2
  – Classify U2 with h1 and let U2’ be the subset with the most confident scores; L2 + U2’ → L2, U2 - U2’ → U2
  – Classify U1 with h2 and let U1’ be the subset with the most confident scores; L1 + U1’ → L1, U1 - U1’ → U1

Yarowsky’s algorithm
• One-sense-per-discourse: View #1 is the ID of the document that a word is in.
• One-sense-per-collocation: View #2 is the local context of the word in the document.
• Yarowsky’s algorithm is a special case of co-training (Blum & Mitchell, 1998).
• Is this correct? No, according to (Abney, 2002).

Summary of co-training
• The original paper: (Blum and Mitchell, 1998)
  – Two “independent” views: split the features into two sets.
  – Train a classifier on each view.
  – Each classifier labels data that can be used to train the other classifier.
• Extension:
  – Relax the conditional independence assumption.
  – Instead of using two views, use two or more classifiers trained on the whole feature set.

Summary of SSL
• Goal: use both labeled and unlabeled data.
• Many algorithms: EM, co-EM, self-training, co-training, …
• Each algorithm is based on some assumptions.
• SSL works well when the assumptions are satisfied.

Additional slides

Rule independence
• H1 (H2) consists of rules that are functions of X1 (X2, resp.) only.
• EM: the data is generated according to some simple known parametric model.
  – Ex: the positive examples are generated according to an n-dimensional Gaussian D+ centered around the point