Co-training
LING 572
Fei Xia
02/21/06
Overview
• Proposed by Blum and Mitchell (1998)
• Important work:
– (Nigam and Ghani, 2000)
– (Goldman and Zhou, 2000)
– (Abney, 2002)
– (Sarkar, 2002)
– …
• Used in document classification, parsing, etc.
Outline
• Basic concept: (Blum and Mitchell, 1998)
• Relation with other SSL algorithms:
(Nigam and Ghani, 2000)
An example
• Web-page classification: e.g., find
homepages of faculty members.
– Page text: words occurring on that page
e.g., “research interest”, “teaching”
– Hyperlink text: words occurring in hyperlinks
that point to that page:
e.g., “my advisor”
Two views
• Features can be split into two sets:
– The instance space: $X = X_1 \times X_2$
– Each example: $x = (x_1, x_2)$
• D: the distribution over X
• C1: the set of target functions over X1, with $f_1 \in C_1$
• C2: the set of target functions over X2, with $f_2 \in C_2$
Assumption #1: compatibility
• The instance distribution D is compatible
with the target function f=(f1, f2) if for any
x=(x1, x2) with non-zero prob,
f(x)=f1(x1)=f2(x2).
⇒ Each set of features is sufficient for classification
• The compatibility of f with D:
$p = 1 - \Pr_D[(x_1, x_2) : f_1(x_1) \neq f_2(x_2)]$
Assumption #2: conditional independence
• x1 and x2 are conditionally independent given the label:
$\Pr(x_1, x_2 \mid f(x)) = \Pr(x_1 \mid f(x)) \, \Pr(x_2 \mid f(x))$
Co-training algorithm
• Given: a labeled set L and an unlabeled set U.
• Create a pool U' by drawing examples at random from U.
• Repeat for k iterations:
– Train classifier h1 on view X1 of L; train classifier h2 on view X2 of L.
– Let each classifier label the p most confidently positive and the n most confidently negative examples from U' and add them to L.
– Replenish U' by drawing examples from U.
Co-training algorithm (cont)
• Why use U', in addition to U?
– Using U' yields better results.
– Possible explanation: this forces h1 and h2 to select examples that are more representative of the underlying distribution D that generates U.
• Choosing p and n: the ratio p/n should match the ratio of positive to negative examples in D.
• Choosing the iteration number and the size of
U’.
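To make the loop concrete, here is a minimal Python sketch of the co-training procedure outlined above. The helpers `make_classifier`, `view1`, and `view2`, and the classifier interface (`fit` over (features, label) pairs, `predict_proba` returning a label-to-probability mapping) are illustrative assumptions, not part of the original slides.

```python
import random

def cotrain(L, U, view1, view2, make_classifier,
            p=1, n=3, iterations=30, pool_size=75):
    """Co-training loop after Blum and Mitchell (1998).

    L: list of (example, label) pairs, labels in {0, 1}.
    U: list of unlabeled examples.
    view1 / view2: functions mapping an example to its two feature views.
    make_classifier: assumed factory returning a learner with
      fit(list of (features, label)) and predict_proba(features) -> {label: prob}.
    """
    U = list(U)
    random.shuffle(U)
    U_prime = [U.pop() for _ in range(min(pool_size, len(U)))]

    for _ in range(iterations):
        # Train one classifier per view on the current labeled set.
        h1 = make_classifier().fit([(view1(x), y) for x, y in L])
        h2 = make_classifier().fit([(view2(x), y) for x, y in L])

        # Each classifier labels its p most confidently positive and
        # n most confidently negative examples from the pool U'.
        newly_labeled = []
        for h, view in ((h1, view1), (h2, view2)):
            by_pos = sorted(U_prime, key=lambda x: h.predict_proba(view(x))[1],
                            reverse=True)
            by_neg = sorted(U_prime, key=lambda x: h.predict_proba(view(x))[0],
                            reverse=True)
            newly_labeled += [(x, 1) for x in by_pos[:p]]
            newly_labeled += [(x, 0) for x in by_neg[:n]]

        # Move the self-labeled examples from U' to L, then replenish U' from U.
        chosen = {id(x) for x, _ in newly_labeled}
        U_prime = [x for x in U_prime if id(x) not in chosen]
        L = L + newly_labeled
        while len(U_prime) < pool_size and U:
            U_prime.append(U.pop())

    return h1, h2
```

With p = 1, n = 3, 30 iterations, and |U'| = 75, the defaults correspond to the settings reported in the experiment below.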
Intuition behind the co-training
algorithm
• h1 adds examples to the labeled set that h2 will be able to use for learning, and vice versa.
• If the conditional independence
assumption holds, then on average each
added document will be as informative as
a random document, and the learning will
progress.
Experiments: setting
• 1051 web pages from 4 CS depts
– 263 pages (25%) as test data
– The remaining 75% of pages
• Labeled data: 3 positive and 9 negative examples
• Unlabeled data: the rest (776 pages)
• Manually labeled into a number of categories: e.g.,
“course home page”.
• Two views:
– View #1 (page-based): words in the page
– View #2 (hyperlink-based): words in the hyperlinks
• Learner: Naïve Bayes
Naïve Bayes classifier
(Nigam and Ghani, 2000)
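The Naïve Bayes formulas are not reproduced on the slide, so the following is a rough multinomial Naïve Bayes sketch of the kind of per-view learner that could plug into the co-training loop above. The class name, the bag-of-words input format, and the add-one smoothing are assumptions for illustration, not details taken from the slides.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes over bag-of-words feature views."""

    def fit(self, examples):
        """examples: list of (word_list, label) pairs."""
        self.class_counts = Counter(label for _, label in examples)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for words, label in examples:
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict_proba(self, words):
        """Return {label: P(label | words)}, using add-one smoothing."""
        n_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, c_count in self.class_counts.items():
            score = math.log(c_count / n_docs)
            total = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / total)
            log_scores[label] = score
        # Convert log scores to normalized probabilities.
        m = max(log_scores.values())
        exps = {lab: math.exp(s - m) for lab, s in log_scores.items()}
        z = sum(exps.values())
        return {lab: v / z for lab, v in exps.items()}
```

In the experiment above, one such classifier is trained on the page-based view and another on the hyperlink-based view.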
Experiment: results
Error rates (%):

                      Page-based    Hyperlink-based    Combined
                      classifier    classifier         classifier
Supervised training   12.9          12.4               11.1
Co-training           6.2           11.6               5.0

Settings: p = 1, n = 3, number of iterations = 30, |U'| = 75
Questions
• Can co-training algorithms be applied to
datasets without natural feature divisions?
• How sensitive are the co-training algorithms to
the correctness of the assumptions?
• What is the relation between co-training and
other SSL methods (e.g., self-training)?
(Nigam and Ghani, 2000)
EM
• Pool the features together.
• Use initial labeled data to get initial parameter
estimates.
• In each iteration use all the data (labeled and
unlabeled) to re-estimate the parameters.
• Repeat until convergence.
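A rough sketch of this EM procedure, assuming a learner that can be trained on fractionally weighted examples; the `fit(examples, weights)` interface (a small extension of the Naïve Bayes sketch above) and the fixed iteration cap standing in for "repeat until convergence" are illustrative assumptions.

```python
def em_ssl(L, U, make_classifier, iterations=10):
    """Semi-supervised EM with all features pooled into a single view.

    L: list of (features, label) pairs; U: list of unlabeled feature bundles.
    make_classifier: assumed factory returning a learner with
      fit(examples, weights) and predict_proba(features) -> {label: prob}.
    """
    labels = sorted({y for _, y in L})

    # Initial parameter estimates from the labeled data alone.
    h = make_classifier().fit(list(L), weights=[1.0] * len(L))

    for _ in range(iterations):
        # E-step: probabilistically label every unlabeled example.
        soft = [h.predict_proba(x) for x in U]

        # M-step: re-estimate parameters from all the data, with each
        # unlabeled example counted fractionally under every label.
        examples = list(L)
        weights = [1.0] * len(L)
        for x, dist in zip(U, soft):
            for y in labels:
                examples.append((x, y))
                weights.append(dist[y])
        h = make_classifier().fit(examples, weights=weights)

    return h
```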
Experimental results: WebKB course database
• EM performs better than co-training.
• Both are close to the supervised method when trained on more labeled data.
Another experiment:
The News 2*2 dataset
• A semi-artificial dataset
• Conditional independence assumption
holds.
Co-training outperforms EM and the “oracle” result.
Co-training vs. EM
• Co-training splits features, EM does not.
• Co-training incrementally uses the
unlabeled data.
• EM probabilistically labels all the data at
each round; EM iteratively uses the
unlabeled data.
Co-EM: EM with feature split
• Repeat until convergence:
– Train the A-feature-set classifier using the labeled data and the unlabeled data with B's labels.
– Use classifier A to probabilistically label all the unlabeled data.
– Train the B-feature-set classifier using the labeled data and the unlabeled data with A's labels.
– B re-labels the data for use by A.
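A sketch of the co-EM loop above, reusing the same assumed weighted-learner interface as in the EM sketch; each classifier sees only its own feature view and is trained on the other classifier's probabilistic labels.

```python
def co_em(L, U, view_a, view_b, make_classifier, iterations=10):
    """Co-EM: EM with a feature split (Nigam and Ghani, 2000), sketched with
    the assumed fit(examples, weights) / predict_proba interface."""
    labels = sorted({y for _, y in L})

    def weighted_set(view, soft):
        # Labeled data with weight 1 plus soft-labeled unlabeled data.
        examples = [(view(x), y) for x, y in L]
        weights = [1.0] * len(L)
        for x, dist in zip(U, soft):
            for y in labels:
                examples.append((view(x), y))
                weights.append(dist[y])
        return examples, weights

    # B's initial labels come from a classifier trained on the labeled data only.
    h_b = make_classifier().fit([(view_b(x), y) for x, y in L],
                                weights=[1.0] * len(L))
    for _ in range(iterations):
        # Train A on the labeled data plus the unlabeled data with B's labels.
        soft_b = [h_b.predict_proba(view_b(x)) for x in U]
        ex, w = weighted_set(view_a, soft_b)
        h_a = make_classifier().fit(ex, weights=w)

        # A probabilistically re-labels the unlabeled data for B.
        soft_a = [h_a.predict_proba(view_a(x)) for x in U]
        ex, w = weighted_set(view_b, soft_a)
        h_b = make_classifier().fit(ex, weights=w)

    return h_a, h_b
```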
Four SSL methods

                        Uses the feature split    Ignores the feature split
Incremental labeling    co-training               self-training
Iterative labeling      co-EM                     EM
Results on the News 2*2 dataset
• Random feature split:
– Co-training: 3.7% → 5.5%
– Co-EM: 3.3% → 5.1%
• When the conditional independence assumption does not hold, but there is sufficient redundancy among the features, co-training still works well.
Assumptions
• Assumptions made by the underlying classifier
(supervised learner):
– Naïve Bayes: words occur independently of each other, given
the class of the document.
– Co-training uses the classifier to rank the unlabeled examples by
confidence.
– EM uses the classifier to assign probabilities to each unlabeled
example.
• Assumptions made by SSL method:
– Co-training: conditional independence assumption.
– EM: maximizing likelihood correlates with reducing classification
errors.
Summary of (Nigam and Ghani, 2000)
• Comparison of four SSL methods: self-training,
co-training, EM, co-EM.
• The performance of the SSL methods depends
on how well the underlying assumptions are
met.
• Randomly splitting the features is not as good as a natural split, but it still works if there is sufficient redundancy among the features.
Variations of co-training
• Goldman and Zhou (2000) use two learners of different types, but both take the whole feature set.
• Zhou and Li (2005) use three learners. If two
agree, the data is used to teach the third learner.
• Balcan et al. (2005) relax the conditional independence assumption with a much weaker expansion condition.
An alternative?
• L → L1, L → L2
• U → U1, U → U2
• Repeat:
– Train h1 using L1 on Feat Set1
– Train h2 using L2 on Feat Set2
– Classify U2 with h1 and let U2' be the subset with the most confident scores; L2 + U2' → L2, U2 − U2' → U2
– Classify U1 with h2 and let U1' be the subset with the most confident scores; L1 + U1' → L1, U1 − U1' → U1
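A compact sketch of this alternative under the same assumed classifier interface as the earlier co-training sketch; the difference is that L and U are split up front, so each classifier trains only on examples labeled by the other. The parameter `k` (how many confident examples to move per round) is illustrative.

```python
def cotrain_split_pools(L1, L2, U1, U2, view1, view2, make_classifier,
                        k=5, iterations=30):
    """Alternative loop: L and U are split into disjoint halves up front.
    h1 labels examples from U2 for h2's training set L2, and h2 labels
    examples from U1 for h1's training set L1."""
    for _ in range(iterations):
        h1 = make_classifier().fit([(view1(x), y) for x, y in L1])
        h2 = make_classifier().fit([(view2(x), y) for x, y in L2])

        # h1 labels its most confident examples in U2 and moves them to L2.
        scored = sorted(U2, reverse=True,
                        key=lambda x: max(h1.predict_proba(view1(x)).values()))
        for x in scored[:k]:
            dist = h1.predict_proba(view1(x))
            L2.append((x, max(dist, key=dist.get)))
        U2[:] = scored[k:]

        # h2 labels its most confident examples in U1 and moves them to L1.
        scored = sorted(U1, reverse=True,
                        key=lambda x: max(h2.predict_proba(view2(x)).values()))
        for x in scored[:k]:
            dist = h2.predict_proba(view2(x))
            L1.append((x, max(dist, key=dist.get)))
        U1[:] = scored[k:]

    return h1, h2
```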
Yarowsky’s algorithm
• one-sense-per-discourse
⇒ View #1: the ID of the document that a word is in
• one-sense-per-collocation
⇒ View #2: the local context of the word in the document
• Yarowsky's algorithm is a special case of co-training (Blum & Mitchell, 1998).
• Is this correct? No, according to (Abney, 2002).
Summary of co-training
• The original paper: (Blum and Mitchell, 1998)
– Two “independent” views: split the features into two
sets.
– Train a classifier on each view.
– Each classifier labels data that can be used to train
the other classifier.
• Extension:
– Relax the conditional independence assumptions
– Instead of using two views, use two or more
classifiers trained on the whole feature set.
Summary of SSL
• Goal: use both labeled and unlabeled
data.
• Many algorithms: EM, co-EM, self-training,
co-training, …
• Each algorithm is based on some
assumptions.
• SSL works well when the assumptions are
satisfied.
Additional slides
Rule independence
• H1 (H2) consists of rules that are
functions of X1 (X2, resp) only.
• EM: the data is generated according to some
simple known parametric model.
– Ex: the positive examples are generated according to an n-dimensional Gaussian D+ centered around the point μ+.