Semi-Supervised Learning

Semi-Supervised Learning over Text
Tom M. Mitchell
Machine Learning Department
Carnegie Mellon University
September 2006
Statistical learning methods require LOTS of training data.
Can we use all that unlabeled text?
Outline
• Maximizing likelihood in probabilistic models
– EM for text classification
• Co-Training and redundantly predictive features
– Document classification
– Named entity recognition
– Theoretical analysis
• Sample of additional tasks
– Word sense disambiguation
– Learning HTML-based extractors
– Large-scale bootstrapping: extracting from the web
Many text learning tasks
• Document classification
– f: Doc → Class
– Spam filtering, relevance rating, web page classification, ...
– and unsupervised document clustering
• Information extraction
– f: Sentence → Fact, f: Doc → Facts
• Parsing
– f: Sentence → ParseTree
– Related: part-of-speech tagging, co-reference resolution, prepositional phrase attachment
• Translation
– f: EnglishDoc → FrenchDoc
1. Semi-supervised Document classification
(probabilistic model and EM)
Document Classification: Bag of Words Approach

[Figure: a document is represented as a vector of word counts over the vocabulary, e.g. aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, ..., Zaire 0]
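A minimal sketch of this representation in Python (the vocabulary and document below are illustrative, not from the slides):

```python
import re
from collections import Counter

# Bag of words: a document becomes a vector of word counts over a
# fixed vocabulary, ignoring word order entirely.
vocabulary = ["aardvark", "about", "africa", "gas", "oil", "zaire"]

def bag_of_words(doc, vocab):
    counts = Counter(re.findall(r"[a-z]+", doc.lower()))
    return [counts[w] for w in vocab]

doc = "Oil and gas exports from Africa rose; Zaire was not affected."
print(bag_of_words(doc, vocabulary))   # [0, 0, 1, 1, 1, 1]
```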
Accuracy vs. # training examples

[Plot: classification accuracy as a function of the number of labeled training examples]
For code and data, see www.cs.cmu.edu/~tom/mlbook.html (click on "Software and Data").
What if we have labels for only some documents?
[Figure: Naive Bayes model with class Y and word features X1, X2, X3, X4; learn P(Y|X). The training table lists feature values for each document, with Y given (0/1) for the labeled rows and "?" for the unlabeled ones.]
EM: Repeat until convergence
1. Use probabilistic labels to train classifier h
2. Apply h to assign probabilistic labels to unlabeled data
From [Nigam et al., 2000]
E Step: re-estimate probabilistic labels for the unlabeled documents
$$P(c_j \mid d_i) = \frac{P(c_j)\,\prod_t P(w_t \mid c_j)^{N_{t,i}}}{\sum_k P(c_k)\,\prod_t P(w_t \mid c_k)^{N_{t,i}}}$$
M Step: re-estimate parameters from the probabilistic labels
$$P(w_t \mid c_j) = \frac{1 + \sum_i N_{t,i}\,P(c_j \mid d_i)}{|V| + \sum_s \sum_i N_{s,i}\,P(c_j \mid d_i)}$$
where wt is the t-th word in vocabulary V and Nt,i is the number of times wt occurs in document di.
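To make the loop concrete, here is a minimal sketch of EM for semi-supervised Naive Bayes in Python, in the spirit of [Nigam et al., 2000]. The matrix layout and all names are ours; a real implementation would add a convergence check:

```python
import numpy as np

# X: documents-by-vocabulary count matrix.
# y: class label (0..n_classes-1) for labeled rows, -1 for unlabeled rows.
def em_naive_bayes(X, y, n_classes=2, n_iters=20):
    n_docs, n_words = X.shape
    labeled = y >= 0
    # Responsibilities R[i, j] = P(class j | doc i); labeled docs are clamped.
    R = np.full((n_docs, n_classes), 1.0 / n_classes)
    R[labeled] = 0.0
    R[labeled, y[labeled]] = 1.0
    for _ in range(n_iters):
        # M step: re-estimate priors and per-class word distributions
        # from the (probabilistic) labels, with Laplace smoothing.
        priors = (1.0 + R.sum(axis=0)) / (n_classes + n_docs)
        word_counts = R.T @ X                          # (n_classes, n_words)
        theta = (1.0 + word_counts) / (n_words + word_counts.sum(axis=1, keepdims=True))
        # E step: recompute probabilistic labels for the unlabeled documents.
        log_post = np.log(priors) + X @ np.log(theta).T
        log_post -= log_post.max(axis=1, keepdims=True)  # for numerical stability
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        R[~labeled] = post[~labeled]
    return priors, theta, R
```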
[Figure: classifier learned using one labeled example per class plus unlabeled data; words sorted by P(w|course) / P(w|¬course)]
[Plots: classification accuracy on the 20 Newsgroups dataset as unlabeled data is added]
Elaboration 1: Downweight the influence of unlabeled examples by a factor λ

New M step: same as before, but each unlabeled document's probabilistic counts are multiplied by λ before re-estimating parameters; λ is chosen by cross-validation.
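Continuing the sketch above (with the same assumed variable names), the downweighted M step changes only how the soft counts are accumulated:

```python
lam = 0.1                                   # illustrative value; chosen by
                                            # cross-validation in practice
W = np.where(labeled[:, None], 1.0, lam)    # per-document weight
word_counts = (R * W).T @ X                 # replaces the unweighted counts
```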
Why/When will this work?
• What's the best case? The worst case? How can we test which one we have?
EM for Semi-Supervised Doc Classification
• If all data is labeled, corresponds to supervised training of a Naïve Bayes classifier
• If all data is unlabeled, corresponds to mixture-of-multinomials clustering
• If both labeled and unlabeled data, it helps if and only if the mixture-of-multinomials modeling assumption is correct
• Of course we could extend this to Bayes net models other than Naïve Bayes (e.g., TAN tree)
• Other extensions: model the negative class as a mixture of N multinomials
2. Using Redundantly Predictive Features
(Co-Training)
Redundantly Predictive Features

[Example: a faculty web page can be classified either from the words on the page itself ("Professor Faloutsos") or from the anchor text of hyperlinks pointing to it ("my advisor").]
Co-Training
Key idea: Classifier1 and Classifier2 must:
1. Correctly classify the labeled examples
2. Agree on the classification of the unlabeled examples

[Diagram: Classifier1 and Classifier2, each trained on a different view of the input, produce Answer1 and Answer2]
CoTraining Algorithm #1
[Blum & Mitchell, 1998]
Given: labeled data L, unlabeled data U
Loop:
  Train g1 (hyperlink classifier) using L
  Train g2 (page classifier) using L
  Allow g1 to label p positive, n negative examples from U
  Allow g2 to label p positive, n negative examples from U
  Add these self-labeled examples to L
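A minimal sketch of this loop in Python. Here `fit` is any supervised learner over (feature, label) pairs, and `most_confident` returns the p most confidently positive and n most confidently negative items of U under a classifier; both helpers are assumptions, not given in the slides:

```python
# L holds triples (x1, x2, y); U holds pairs (x1, x2).
def co_train(L, U, fit, most_confident, p=1, n=3, rounds=30):
    for _ in range(rounds):
        g1 = fit([(x1, y) for x1, _, y in L])   # e.g., hyperlink classifier
        g2 = fit([(x2, y) for _, x2, y in L])   # e.g., page classifier
        for g, view in ((g1, 0), (g2, 1)):
            for x, label in most_confident(g, U, view, p, n):
                if x in U:                      # may already be taken by the other view
                    U.remove(x)                 # move from U to L
                    L.append((x[0], x[1], label))  # self-labeled example
        if not U:
            break
    return g1, g2
```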
CoTraining: Experimental Results
• begin with 12 labeled web pages (academic course)
• provide 1,000 additional unlabeled web pages
• average error, learning from labeled data: 11.1%
• average error, cotraining: 5.0%

[Plot: a typical run, error vs. co-training iterations]
Co-Training for Named Entity Extraction
(i.e., classifying which strings refer to people, places, dates, etc.)
[Riloff & Jones 98; Collins et al. 98; Jones 05]

[Diagram: Classifier1 sees the string itself ("New York"); Classifier2 sees its context ("I flew to ____ today"); each produces its own answer for "I flew to New York today."]
CoTraining setting:
• wish to learn f: X → Y, given L and U drawn from P(X)
• features describing X can be partitioned (X = X1 × X2) such that f can be computed from either X1 or X2

One result [Blum & Mitchell, 1998]:
• If
  – X1 and X2 are conditionally independent given Y
  – f is PAC learnable from noisy labeled data
• Then
  – f is PAC learnable from a weak initial classifier plus unlabeled data
Co-Training Rote Learner

[Figure: bipartite graph with page nodes on one side and hyperlink nodes ("my advisor") on the other; an edge connects a page to each hyperlink pointing at it. The +/- labels known for a few nodes propagate to every node in the same connected component.]
Co-Training
• What's the best-case graph? (most benefit from unlabeled data)
• What's the worst case?
• What does conditional independence imply about the graph?

[Figure: example graphs over views x1, x2 with +/- labeled nodes]
Expected Rote CoTraining error given m examples

CoTraining setting: learn f: X → Y where X = X1 × X2, with x drawn from an unknown distribution D(x), and g1(x1) = g2(x2) = f(x).

$$E[\text{error}] = \sum_j P(x \in g_j)\,\big(1 - P(x \in g_j)\big)^m$$

where gj is the j-th connected component of the graph of L+U, and m is the number of labeled examples.
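A tiny worked example of this formula (the component probabilities are hypothetical). A component contributes error only when none of the m labeled examples lands in it, so more labeled examples shrink the expected error:

```python
# E[error] = sum_j P(x in g_j) * (1 - P(x in g_j))**m
def expected_rote_error(component_probs, m):
    return sum(p * (1 - p) ** m for p in component_probs)

# Four connected components covering the distribution:
print(expected_rote_error([0.4, 0.3, 0.2, 0.1], m=1))    # ~0.70
print(expected_rote_error([0.4, 0.3, 0.2, 0.1], m=10))   # ~0.067
```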
How many unlabeled examples suffice?

Want to assure that connected components in the underlying distribution, GD, are connected components in the observed sample, GS.

O(log(N)/α) examples assure that, with high probability, GS has the same connected components as GD [Karger, 94], where N is the size of GD and α is the min cut over all connected components of GD.
PAC Generalization Bounds on CoTraining
[Dasgupta et al., NIPS 2001]
[Theorem statement omitted; roughly, the error of each classifier can be bounded in terms of the observed rate of disagreement between the two classifiers on unlabeled data.]
This theorem assumes X1 and X2 are conditionally independent given Y.
Co-Training Theory
How can we tune the learning environment to enhance the effectiveness of Co-Training?

[Diagram: final accuracy and the correctness of confidence assessments depend on the number of labeled examples, the number of unlabeled examples, the number of redundantly predictive inputs, and the dependencies among input features]

→ best case: inputs conditionally independent given the class, increased number of redundant inputs, ...
What if CoTraining Assumption Not Perfectly Satisfied?

[Figure: data where the two views' decision boundaries disagree on some +/- points]
• Idea: Want classifiers that produce a maximally consistent labeling of the data
• If learning is an optimization problem, what function should we optimize?
Example 2: Learning to extract named entities

I arrived in Beijing on Saturday. (is "Beijing" a location?)
If: "I arrived in <X> on Saturday."
Then: Location(X)
Co-Training for Named Entity Extraction
(i.e., classifying which strings refer to people, places, dates, etc.)
[Riloff & Jones 98; Collins et al. 98; Jones 05]

[Diagram: Classifier1 sees the string itself ("Beijing"); Classifier2 sees its context ("I arrived in __ Saturday"); each produces its own answer for "I arrived in Beijing Saturday."]
Bootstrap learning to extract named entities
[Riloff and Jones, 1999], [Collins and Singer, 1999], ...

Initialization (seed locations): Australia, Canada, China, England, France, Germany, Japan, Mexico, Switzerland, United_states

Iterations:
• learned context patterns such as: "locations in ?x", "operations in ?x", "republic of ?x"
• extracted terms grow to include: South Africa, United Kingdom, Warrenton, Far_East, Oregon, Lexington, Europe, U.S._A., Eastern Canada, Blair, Southwestern_states, Texas, States, Singapore, Thailand, Maine, production_control, northern_Los, New_Zealand, eastern_Europe, Americas, Michigan, New_Hampshire, Hungary, south_america, district, Latin_America, Florida, ... (note the semantic drift: errors such as Blair, production_control, and northern_Los creep in)
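A toy sketch of this bootstrapping loop: known entities suggest context patterns, and patterns extract new entities. The corpus, seeds, and matching here are illustrative placeholders, far simpler than the cited systems:

```python
import re

corpus = [
    "operations in Canada", "operations in Thailand",
    "locations in Japan", "republic of France", "republic of Hungary",
]
entities = {"Canada", "Japan", "France"}           # seed locations
patterns = set()
for _ in range(3):                                 # a few bootstrapping iterations
    # 1. Learn patterns that co-occur with known entities ("locations in ?x", ...)
    for sent in corpus:
        for e in entities:
            if sent.endswith(e):
                patterns.add(sent[: -len(e)] + "?x")
    # 2. Extract new entities with the current patterns
    for sent in corpus:
        for p in patterns:
            m = re.fullmatch(re.escape(p).replace(r"\?x", "(.+)"), sent)
            if m:
                entities.add(m.group(1))
print(sorted(entities))   # seeds plus Thailand and Hungary
```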
Co-EM
[Nigam & Ghani, 2000; Jones 2005]
Idea:
• Like co-training, use one set of features to label the other
• Like EM, iterate
– Assigning probabilistic values to unobserved class labels
– Updating model parameters (= labels of other feature set)
CoEM applied to Named Entity Recognition
[Rosie Jones, 2005], [Ghani & Nigam, 2000]

Update rules (alternating between the two views): estimate the class probability of each context from the current probabilistic labels of the noun phrases that fill it, and the class probability of each noun phrase from the current labels of the contexts it appears in.
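A minimal sketch of these alternating updates, under our own naming (the exact update rules are in the cited papers). C is a noun-phrase-by-context co-occurrence matrix, assumed to have no all-zero rows or columns; seed noun phrases are clamped to their labels:

```python
import numpy as np

def co_em(C, seed_idx, seed_labels, n_iters=50):
    n_np, n_ctx = C.shape
    p_np = np.full(n_np, 0.5)          # P(target class | noun phrase)
    p_np[seed_idx] = seed_labels
    for _ in range(n_iters):
        # Contexts inherit the co-occurrence-weighted average label of
        # the noun phrases that fill them ...
        p_ctx = (C.T @ p_np) / C.sum(axis=0)
        # ... and noun phrases inherit it back from their contexts,
        # except for the clamped seeds.
        p_np = (C @ p_ctx) / C.sum(axis=1)
        p_np[seed_idx] = seed_labels
    return p_np, p_ctx
```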
[Plots from Jones, 2005: precision/recall of coEM-learned lexicons]
Can use this for active learning... [Jones, 2005]
What if CoTraining Assumption Not Perfectly Satisfied?

[Figure: data where the two views' decision boundaries disagree on some +/- points]
• Idea: Want classifiers that produce a maximally consistent labeling of the data
• If learning is an optimization problem, what function should we optimize?
What Objective Function?

$$E = E_1 + E_2 + c_3 E_3 + c_4 E_4$$

Error on labeled examples:
$$E_1 = \sum_{\langle x,y\rangle \in L} \big(y - \hat{g}_1(x_1)\big)^2 \qquad E_2 = \sum_{\langle x,y\rangle \in L} \big(y - \hat{g}_2(x_2)\big)^2$$

Disagreement over unlabeled examples:
$$E_3 = \sum_{x \in U} \big(\hat{g}_1(x_1) - \hat{g}_2(x_2)\big)^2$$

Misfit to estimated class priors:
$$E_4 = \left( \frac{1}{|L|}\sum_{\langle x,y\rangle \in L} y \;-\; \frac{1}{|L|+|U|}\sum_{x \in L \cup U} \frac{\hat{g}_1(x_1) + \hat{g}_2(x_2)}{2} \right)^2$$
What Function Approximators?

$$\hat{g}_1(x) = \frac{1}{1 + e^{-\sum_j w_{j,1} x_j}} \qquad \hat{g}_2(x) = \frac{1}{1 + e^{-\sum_j w_{j,2} x_j}}$$

• Same functional form as logistic regression
• Use gradient descent to simultaneously learn g1 and g2, directly minimizing E = E1 + E2 + c3 E3 + c4 E4
• No word independence assumption; uses both labeled and unlabeled data
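A compact sketch of this setup: two logistic classifiers, one per view, trained by gradient descent on E. The gradient is taken numerically for brevity (a real implementation would derive it analytically), and all data shapes and names are ours:

```python
import numpy as np

def g(w, X):                       # logistic unit; same form for both views
    return 1.0 / (1.0 + np.exp(-X @ w))

def E(w1, w2, X1L, X2L, yL, X1U, X2U, c3=1.0, c4=1.0):
    e1 = np.sum((yL - g(w1, X1L)) ** 2)            # error on labeled, view 1
    e2 = np.sum((yL - g(w2, X2L)) ** 2)            # error on labeled, view 2
    e3 = np.sum((g(w1, X1U) - g(w2, X2U)) ** 2)    # disagreement on unlabeled
    prior = yL.mean()                              # class prior from labeled data
    avg = 0.5 * (np.concatenate([g(w1, X1L), g(w1, X1U)]).mean()
               + np.concatenate([g(w2, X2L), g(w2, X2U)]).mean())
    e4 = (prior - avg) ** 2                        # misfit to class priors
    return e1 + e2 + c3 * e3 + c4 * e4

def grad_descent(E_fn, w1, w2, lr=0.05, steps=500, eps=1e-5):
    for _ in range(steps):
        for w in (w1, w2):                         # numeric gradient per view
            grad = np.zeros_like(w)
            for j in range(len(w)):
                w[j] += eps; hi = E_fn(w1, w2)
                w[j] -= 2 * eps; lo = E_fn(w1, w2)
                w[j] += eps
                grad[j] = (hi - lo) / (2 * eps)
            w -= lr * grad                         # in-place update
    return w1, w2

# Usage with hypothetical data:
#   E_fn = lambda a, b: E(a, b, X1L, X2L, yL, X1U, X2U)
#   w1, w2 = grad_descent(E_fn, np.zeros(d1), np.zeros(d2))
```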
Gradient CoTraining
Classifying capitalized sequences as person names
E.g., "Company president Mary Smith said today..."
(two views: x1 = the capitalized sequence itself, x2 = its surrounding context)

Error rates:

                                                25 labeled,      2300 labeled,
                                                5000 unlabeled   5000 unlabeled
  Using labeled data only                       .24              .13
  Cotraining                                    .15 *            .11 *
  Cotraining without fitting class priors (E4)  .27 *            --

* sensitive to weights of error terms E3 and E4
Example 3: Word sense disambiguation
[Yarowsky]
• "bank" = river bank, or financial bank?
• Assumes a single word sense per document
  – X1: the document containing the word
  – X2: the immediate context of the word ("swim near the __")
Successfully learns "context → word sense" rules when the word occurs multiple times in a document.
Example 4: Bootstrap learning for IE from HTML
structure
[Muslea, et al. 2001]
X1: HTML preceding
the target
X2: HTML following
the target
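A toy sketch of these two views: either the HTML preceding the target or the HTML following it suffices to locate the field. The page snippet and patterns are illustrative only:

```python
import re

page = "<tr><td><b>Phone:</b></td><td>412-268-1234</td></tr>"
x1 = "<b>Phone:</b></td><td>"   # preceding-HTML view
x2 = "</td></tr>"               # following-HTML view
m = re.search(re.escape(x1) + "(.*?)" + re.escape(x2), page)
print(m.group(1))               # -> 412-268-1234
```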
Example Bootstrap learning algorithms:
• Classifying web pages [Blum & Mitchell 98; Slattery 99]
• Classifying email [Kiritchenko & Matwin 01; Chan et al. 04]
• Named entity extraction [Collins & Singer 99; Jones & Riloff 99]
• Wrapper induction [Muslea et al. 01; Mohapatra et al. 04]
• Word sense disambiguation [Yarowsky 96]
• Discovering new word senses [Pantel & Lin 02]
• Synonym discovery [Lin et al. 03]
• Relation extraction [Brin et al.; Yangarber et al. 00]
• Statistical parsing [Sarkar 01]
What to Know
• Several approaches to semi-supervised learning
  – EM with probabilistic model
  – Co-Training
  – Graph similarity methods
  – ...
  – See reading list below
• Redundancy is important
• Much more to be done:
  – Better theoretical models of when/how unlabeled data can help
  – Bootstrap learning from the web (e.g., Etzioni, 2005, 2006)
  – Active learning (use limited labeling time of humans wisely)
  – Never-ending bootstrap learning?
  – ...
Further Reading
• Semi-Supervised Learning, Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien (eds.), MIT Press, 2006.
• "Semi-Supervised Learning Literature Survey," Xiaojin Zhu, 2006.
• "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods," D. Yarowsky, 1995.
• "Semi-Supervised Text Classification Using EM," K. Nigam, A. McCallum, and T. Mitchell, in Semi-Supervised Learning, Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien (eds.), MIT Press, 2006.
• "Text Classification from Labeled and Unlabeled Documents using EM," K. Nigam, A. McCallum, S. Thrun, and T. Mitchell, Machine Learning, Kluwer Academic Press, 1999.
• "Combining Labeled and Unlabeled Data with Co-Training," A. Blum and T. Mitchell, Proceedings of the 1998 Conference on Computational Learning Theory, July 1998.
• "Discovering Word Senses from Text," P. Pantel and D. Lin, 2002.
• "Creating Subjective and Objective Sentence Classifiers from Unannotated Texts," J. Wiebe and E. Riloff, 2005.
• "Graph Based Semi-Supervised Approach for Information Extraction," H. Hassan, A. Hassan, and S. Noeman, 2006.
• "The Use of Unlabeled Data to Improve Supervised Learning for Text Summarization," M.-R. Amini and P. Gallinari, 2002.
Further Reading
• "Preemptive Information Extraction using Unrestricted Relation Discovery," Y. Shinyama and S. Sekine.
• "Named Entity Transliteration and Discovery from Multilingual Comparable Corpora," A. Klementiev and D. Roth.
• "Learning Syntactic Patterns for Automatic Hypernym Discovery," R. Snow, D. Jurafsky, and A. Y. Ng.
• "Applying Co-training Methods to Statistical Parsing," A. Sarkar, 2001.
• "Extracting Patterns and Relations from the World Wide Web," S. Brin, EDBT '98.
• "Unsupervised Named-Entity Extraction from the Web: An Experimental Study," O. Etzioni et al., AI Journal, 2005.