Text Classification from Labeled and
Unlabeled Documents using EM
-
Kamal Nigam
Andrew Kachites McCallum
Sebastian Thrun
Tom Mitchell
Presented by
Yuan Fang, Fengyuan Hu and Sandhya Prabhakaran
Job Hunting?
Roadmap

Part 1 – Text Classification

Part 2 – Incorporating Unlabeled data with EM

Part 3 – Results and Recap
Part I
Text Classification
Text Classification – the Definition

“Text classification systems categorize documents into
one (or several) of a set of pre-defined topics of interest”
How Are Automatic Text Classifiers
Created

Before: Manual construction of rule sets (Painful and
time-consuming )

Present : Supervised learning to construct a classifier
(efficient and successful)
What To Provide

An algorithm with an example set of documents for
each class and allow it to find a representation or
decision rule for classifying future documents
automatically

This approach will :
- give high-accuracy classifiers
- be significantly less expensive
What Data is Available

Key difficulty : A large number of labeled training
examples are required to learn accurately - What we
need but don't have

One would obviously prefer algorithms that can
provide accurate classifications after hand labeling
only a dozen articles, rather than thousands

What other sources of information can reduce the need
for labeled data?
Unlabeled data

How unlabeled data can be used to increase
classification accuracy, especially when labeled data
are scarce

An intuitive example
Goal And Merit

The goal –
To demonstrate that supervised learning algorithms
can use a small number of labeled examples with a
large number of unlabeled examples to create high-accuracy text classifiers

The merit –
Unlabeled examples are much less expensive and
easily available
Parametric Generative Model
Overview

Assumption : a statistical process generates the
documents (words and class labels)

statistical process - parametric generative model
Incorporating Unlabeled Data with
Generative Models

Using EM to find high-probability parameters of the
model given a combination of labeled and unlabeled
data

Experimental evidence shows that using unlabeled data
with EM can increase classification accuracy
Assumptions In the Model
(1) Documents are generated by a mixture of
multinomials model, where each mixture
component corresponds to a class (1 class to 1
component)
(2) The mixture components are multinomial
distributions of individual words - the words are
produced independently of each other given the
class
Two Multisided Dice

Let there be |C| classes and a vocabulary of size |V|;
each document d has |d| words in it.

First, we roll a biased |C|-sided die to determine the
class of our document.

We roll the biased |V|-sided die that corresponds to the
chosen class |d| times and write down the indicated
words. These words form the generated document.
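The two-dice process above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the vocabulary, the two classes, and all probabilities are made-up assumptions, and the document length is simply passed in rather than drawn from a length distribution.

```python
import random

def generate_document(class_priors, word_probs, vocab, length, rng):
    """Sample (class_index, words) from a mixture of multinomials."""
    # Roll the biased |C|-sided die to pick the class.
    j = rng.choices(range(len(class_priors)), weights=class_priors, k=1)[0]
    # Roll that class's biased |V|-sided die `length` times.
    words = rng.choices(vocab, weights=word_probs[j], k=length)
    return j, words

# Hypothetical toy model: class 0 is "sports", class 1 is "politics".
vocab = ["ball", "game", "vote", "law"]
class_priors = [0.5, 0.5]                    # mixture weights P(c_j | theta)
word_probs = [[0.45, 0.45, 0.05, 0.05],      # P(w_t | c_0; theta)
              [0.05, 0.05, 0.45, 0.45]]      # P(w_t | c_1; theta)

rng = random.Random(0)
label, doc = generate_document(class_priors, word_probs, vocab, 6, rng)
print(label, doc)
```

Each word is drawn independently given the class, which is exactly the naive Bayes assumption stated in the model.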
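Parametric Generative Model

Equation (1):
$$P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta) \, P(d_i \mid c_j; \theta)$$

$\theta$ - parameters for the mixture model
$c_j \in C$ - mixture components
$P(c_j \mid \theta)$ - mixture weights or class probabilities
$P(d_i \mid c_j; \theta)$ - document distribution of the selected class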
Notation

$c_j$ - the jth mixture component, as well as the jth class.
$y_i$ - the class label for a particular document $d_i$.

A document $d_i$ is considered to be an ordered list of word events.

We write $w_{d_i,k}$ for the word in position k of $d_i$.
$w_t$ - a word in the vocabulary $V$.
$|d_i|$ - document length, chosen independently of the component, with its own probability $P(|d_i|)$.
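Parametric Generative Model

Expanding Equation (1) with the document length and the words in the document gives Equation (2):
$$P(d_i \mid c_j; \theta) = P(|d_i|) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta; w_{d_i,q}, q < k)$$

The words of a document are generated independently of context, Equation (3):
$$P(w_{d_i,k} \mid c_j; \theta; w_{d_i,q}, q < k) = P(w_{d_i,k} \mid c_j; \theta)$$

Combining these last two equations gives the naive Bayes expression for the probability of a document given its class, Equation (4):
$$P(d_i \mid c_j; \theta) = P(|d_i|) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta)$$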
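Model Parameters

Collection of word probabilities, each written $\theta_{w_t \mid c_j} \equiv P(w_t \mid c_j; \theta)$

Document length is identically distributed across classes, so it does not need to be parameterized for classification

$\theta_{c_j} \equiv P(c_j \mid \theta)$ denotes the mixture weights (class probabilities)

The complete collection of model parameters: $\theta = \{\theta_{w_t \mid c_j} : w_t \in V, c_j \in C;\ \theta_{c_j} : c_j \in C\}$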
Naive Bayes Text Classification

Using a collection of labeled documents for training

Finding the most probable parameters for the statistical
model introduced
Training A Naive Bayes Classifier
With Labeled Data

Estimating the parameters of the generative model by
using a set of labeled training data
(the estimate of the parameters $\theta$ is written $\hat{\theta}$)

Finding $\arg\max_{\theta} P(\theta \mid D)$ (MAP), the value of $\theta$ that
is most probable given the evidence of the training data
and a prior.
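Training A Naive Bayes Classifier
With Labeled Data

The word probability estimates $\hat{\theta}_{w_t \mid c_j}$ are given by Equation (6):
$$\hat{\theta}_{w_t \mid c_j} = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i) \, P(y_i = c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i) \, P(y_i = c_j \mid d_i)}$$
where $N(w_t, d_i)$ is the number of times word $w_t$ occurs in document $d_i$.

Class probabilities $\hat{\theta}_{c_j}$ are given by Equation (7):
$$\hat{\theta}_{c_j} = \frac{1 + \sum_{i=1}^{|D|} P(y_i = c_j \mid d_i)}{|C| + |D|}$$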
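As a rough illustration of Equations (6) and (7), the sketch below computes the smoothed word and class estimates from hard-labeled documents, so P(y_i = c_j | d_i) is 1 for the given label and 0 otherwise. The function name `train_naive_bayes` and the toy documents are assumptions made for this example only.

```python
from collections import Counter

def train_naive_bayes(docs, labels, vocab, n_classes):
    """MAP estimates with add-one smoothing from hard-labeled documents."""
    index = {w: t for t, w in enumerate(vocab)}
    word_counts = [[0.0] * len(vocab) for _ in range(n_classes)]
    class_counts = [0.0] * n_classes
    for words, j in zip(docs, labels):
        class_counts[j] += 1.0
        for w, n in Counter(words).items():
            if w in index:                      # ignore out-of-vocabulary words
                word_counts[j][index[w]] += n
    # Word estimates: (1 + count) / (|V| + total count in the class).
    theta_w = [[(1.0 + c) / (len(vocab) + sum(row)) for c in row]
               for row in word_counts]
    # Class estimates: (1 + documents in class) / (|C| + |D|).
    theta_c = [(1.0 + c) / (n_classes + len(docs)) for c in class_counts]
    return theta_c, theta_w

# Hypothetical toy data: class 0 is "sports", class 1 is "politics".
vocab = ["ball", "game", "vote", "law"]
docs = [["ball", "game", "ball"], ["vote", "law"], ["game"]]
labels = [0, 1, 0]
theta_c, theta_w = train_naive_bayes(docs, labels, vocab, 2)
print(theta_c)      # class probabilities
print(theta_w[0])   # word probabilities for class 0
```

The added 1 in each numerator keeps every word probability strictly positive, so a word unseen in a class does not zero out the whole document probability.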
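Classifying New Documents with
Naive Bayes
Equation (8):
$$P(y_i = c_j \mid d_i; \hat{\theta}) = \frac{P(c_j \mid \hat{\theta}) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \hat{\theta})}{\sum_{r=1}^{|C|} P(c_r \mid \hat{\theta}) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_r; \hat{\theta})}$$

If the task is to classify a test document into a single class, then
the class with the highest posterior probability $\arg\max_j P(y_i = c_j \mid d_i; \hat{\theta})$
is selected.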
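A minimal sketch of the posterior computation in Equation (8) follows. It works in log space to avoid underflow on long documents, which is an implementation choice of this example rather than something the slides prescribe. The parameter values are hypothetical.

```python
import math

def class_posteriors(words, theta_c, theta_w, vocab):
    """Return P(y = c_j | d) for every class j under naive Bayes."""
    index = {w: t for t, w in enumerate(vocab)}
    log_scores = []
    for prior, row in zip(theta_c, theta_w):
        score = math.log(prior)
        for w in words:
            if w in index:                      # skip out-of-vocabulary words
                score += math.log(row[index[w]])
        log_scores.append(score)
    # Normalize with the max subtracted for numerical stability.
    top = max(log_scores)
    exp_scores = [math.exp(s - top) for s in log_scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]

# Hypothetical parameters for two classes over a four-word vocabulary.
vocab = ["ball", "game", "vote", "law"]
theta_c = [0.6, 0.4]
theta_w = [[0.375, 0.375, 0.125, 0.125],
           [1 / 6, 1 / 6, 1 / 3, 1 / 3]]

post = class_posteriors(["ball", "game"], theta_c, theta_w, vocab)
predicted = max(range(len(post)), key=lambda j: post[j])
print(post, predicted)
```

The document length term P(|d_i|) is omitted because it is the same for every class and cancels in the ratio.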
Part Ⅱ
Incorporating Unlabeled Data with
EM
The Problem





The case with only labeled data has already been covered.
MAP: maximize the posterior probability of the parameters.
Naive Bayes: classify documents after training on labeled data.
Now we are given both labeled and unlabeled data.
Searching for a solution? Here it is!
Revision of EM



Recall the EM material from PMR. It might be painful, but it is
helpful.
Mixture model
Hidden variable z selects the active component
Revision of EM





EM applied to a Gaussian mixture model
Maximum likelihood estimation of the parameters µ and Σ
E step: evaluate the responsibilities using the current
parameter estimates
M step: re-estimate the parameters using the current
responsibilities
Run the demo
Back to the paper
Back to the paper




Collection of labeled and unlabeled documents: $D = D^l \cup D^u$
MAP
Try to maximize $P(\theta \mid D)$
Bayesian method: $P(\theta \mid D) \propto P(\theta) \, P(D \mid \theta)$
Back to the paper


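Log likelihood
Incomplete-data log probability (it contains a log of sums for the unlabeled data):
$$l(\theta \mid D) = \log P(\theta) + \sum_{d_i \in D^u} \log \sum_{j=1}^{|C|} P(c_j \mid \theta) \, P(d_i \mid c_j; \theta) + \sum_{d_i \in D^l} \log \big( P(y_i = c_j \mid \theta) \, P(d_i \mid y_i = c_j; \theta) \big)$$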
Back to the paper


$z_{ij}$: binary indicator variable, set to 1 if $y_i = c_j$ and 0
otherwise.
With these indicators, the incomplete log probability can be
replaced by the complete log probability of the parameters.
Back to the paper



Methods used in the paper
Basic EM
Augmented EM
(1) Weighting the unlabeled data
(2) Multiple mixture components per class
Basic EM


Initialize the NB classifier $\hat{\theta}$ using MAP parameter estimation
from the labeled data only.
E step: estimate the component membership
$\hat{z}^{(k+1)} = E[z \mid D; \hat{\theta}^{(k)}]$
by calculating its expected value, given by $P(c_j \mid d_i; \hat{\theta})$,
for the unlabeled data only.
M step: re-estimate the classifier on the whole data set
using MAP, then loop from the E step: $\hat{\theta}^{(k+1)} = \arg\max_{\theta} P(\theta \mid D; \hat{z}^{(k+1)})$
Monitor $l_c(\theta \mid D; z)$ to measure the improvement of the
parameters and decide when to stop the loop.
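The basic EM loop can be sketched as follows. This is a simplified illustration under stated assumptions: it runs a fixed number of iterations (`n_iter`, a made-up parameter) instead of monitoring the complete log likelihood, uses add-one smoothing in the M step, and works on hypothetical toy documents.

```python
import math
from collections import Counter

def m_step(docs, resp, vocab, n_classes):
    """MAP re-estimation from soft memberships resp[i][j] = P(y_i = c_j | d_i)."""
    index = {w: t for t, w in enumerate(vocab)}
    wc = [[0.0] * len(vocab) for _ in range(n_classes)]
    cc = [0.0] * n_classes
    for words, r in zip(docs, resp):
        counts = Counter(w for w in words if w in index)
        for j in range(n_classes):
            cc[j] += r[j]
            for w, n in counts.items():
                wc[j][index[w]] += r[j] * n
    theta_w = [[(1.0 + c) / (len(vocab) + sum(row)) for c in row] for row in wc]
    theta_c = [(1.0 + c) / (n_classes + len(docs)) for c in cc]
    return theta_c, theta_w

def e_step(words, theta_c, theta_w, vocab):
    """Posterior class memberships of one document under the current model."""
    index = {w: t for t, w in enumerate(vocab)}
    logs = [math.log(p) + sum(math.log(row[index[w]]) for w in words if w in index)
            for p, row in zip(theta_c, theta_w)]
    top = max(logs)
    ex = [math.exp(s - top) for s in logs]
    total = sum(ex)
    return [e / total for e in ex]

def basic_em(labeled, labels, unlabeled, vocab, n_classes, n_iter=10):
    # Labeled documents keep fixed one-hot memberships throughout.
    one_hot = [[1.0 if j == y else 0.0 for j in range(n_classes)] for y in labels]
    # Initialize the classifier from the labeled data only.
    theta_c, theta_w = m_step(labeled, one_hot, vocab, n_classes)
    for _ in range(n_iter):
        # E step: soft memberships for the unlabeled documents only.
        soft = [e_step(d, theta_c, theta_w, vocab) for d in unlabeled]
        # M step: re-estimate from labeled and unlabeled data together.
        theta_c, theta_w = m_step(labeled + unlabeled, one_hot + soft,
                                  vocab, n_classes)
    return theta_c, theta_w

vocab = ["ball", "game", "vote", "law"]
labeled = [["ball", "game"], ["vote", "law"]]
labels = [0, 1]
unlabeled = [["ball", "ball", "game"], ["law", "vote", "vote"], ["game", "ball"]]
theta_c, theta_w = basic_em(labeled, labels, unlabeled, vocab, 2)
post = e_step(["ball"], theta_c, theta_w, vocab)
print(post)
```

A fuller version would compute the complete log likelihood after each M step and stop once its change falls below a threshold.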
Restrictions of Basic EM



Assumptions/Restrictions:
Large unlabeled data set, small labeled data set → if this does not hold,
unlabeled data can hurt accuracy.
One-to-one correspondence between components and classes →
inaccurate when classes contain subtopics.
Augmented EM – weighting unlabeled
data


Method: weaken the contribution of the unlabeled data when
the labeled set is already good enough for classification.
Equation:
$$l_c(\theta \mid D; z) = \log P(\theta) + \sum_{d_i \in D^l} \sum_{j=1}^{|C|} z_{ij} \log \big( P(c_j \mid \theta) \, P(d_i \mid c_j; \theta) \big) + \lambda \Big( \sum_{d_i \in D^u} \sum_{j=1}^{|C|} z_{ij} \log \big( P(c_j \mid \theta) \, P(d_i \mid c_j; \theta) \big) \Big)$$
Augmented EM – weighting unlabeled
data

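λ is chosen by leave-one-out cross-validation.
$\Lambda(i)$ indicates whether document $d_i$ is labeled or unlabeled: $\Lambda(i) = \lambda$ if $d_i \in D^u$, and $\Lambda(i) = 1$ if $d_i \in D^l$.

Modified MAP parameters:
$$\hat{\theta}_{w_t \mid c_j} = \frac{1 + \sum_{i=1}^{|D|} \Lambda(i) \, N(w_t, d_i) \, P(y_i = c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} \Lambda(i) \, N(w_s, d_i) \, P(y_i = c_j \mid d_i)}$$
$$\hat{\theta}_{c_j} = \frac{1 + \sum_{i=1}^{|D|} \Lambda(i) \, P(y_i = c_j \mid d_i)}{|C| + |D^l| + \lambda |D^u|}$$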
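The weighted M step can be sketched as below, assuming the weight λ simply scales every count contributed by an unlabeled document while labeled documents keep weight 1. The function name `weighted_m_step` and the toy data are hypothetical.

```python
from collections import Counter

def weighted_m_step(docs, resp, is_labeled, lam, vocab, n_classes):
    """MAP re-estimation where unlabeled documents are down-weighted by lam."""
    index = {w: t for t, w in enumerate(vocab)}
    word_counts = [[0.0] * len(vocab) for _ in range(n_classes)]
    class_counts = [0.0] * n_classes
    total_weight = 0.0
    for words, r, labeled in zip(docs, resp, is_labeled):
        weight = 1.0 if labeled else lam        # plays the role of Lambda(i)
        total_weight += weight                  # |D^l| + lam * |D^u|
        counts = Counter(w for w in words if w in index)
        for j in range(n_classes):
            class_counts[j] += weight * r[j]
            for w, n in counts.items():
                word_counts[j][index[w]] += weight * r[j] * n
    theta_w = [[(1.0 + c) / (len(vocab) + sum(row)) for c in row]
               for row in word_counts]
    theta_c = [(1.0 + c) / (n_classes + total_weight) for c in class_counts]
    return theta_c, theta_w

vocab = ["ball", "game", "vote", "law"]
docs = [["ball", "game"], ["vote", "vote", "law"]]
resp = [[1.0, 0.0], [0.0, 1.0]]     # memberships from the E step
is_labeled = [True, False]

# lam = 0 ignores the unlabeled document; lam = 1 recovers basic EM.
ignore_c, ignore_w = weighted_m_step(docs, resp, is_labeled, 0.0, vocab, 2)
full_c, full_w = weighted_m_step(docs, resp, is_labeled, 1.0, vocab, 2)
print(ignore_c, full_c)
```

Values of λ between 0 and 1 interpolate between naive Bayes on the labeled data alone and basic EM.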

Augmented EM -- multiple mixture
components per class


Method: relax the assumption of a one-to-one
correspondence between components and classes.
Allow a many-to-one relationship between components and classes.
Augmented EM – multiple mixture
components per class



How?
Decide the number of components per class, again by cross-validation.
Mapping from components to classes: $P(t_a \mid c_j; \theta) \in \{0, 1\}$
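With a deterministic many-to-one mapping, one way to obtain a class posterior is to sum the posteriors of the components assigned to that class. The sketch below shows only that aggregation step; the component posteriors and the mapping are made-up values.

```python
def class_posteriors_from_components(component_post, component_to_class, n_classes):
    """Sum component posteriors P(c_j | d) into class posteriors P(t_a | d)."""
    class_post = [0.0] * n_classes
    for p, a in zip(component_post, component_to_class):
        class_post[a] += p          # each component belongs to exactly one class
    return class_post

# Hypothetical example: components 0 and 1 are subtopics of class 0,
# component 2 is the single component of class 1.
component_post = [0.5, 0.2, 0.3]
component_to_class = [0, 0, 1]
class_post = class_posteriors_from_components(component_post, component_to_class, 2)
print(class_post)
```

The list `component_to_class` encodes the 0/1 mapping: component j contributes to class a only when the mapping assigns it there.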
The complete algorithm








Collections of labeled and unlabeled documents: $D = D^l \cup D^u$
Set λ by cross-validation.
Set the number of components per class.
Randomly assign $P(c_j \mid d_i; \hat{\theta})$ for the mixture components.
Initialize the parameters θ of the NB classifier using MAP.
Loop until the complete log likelihood of the labeled and unlabeled
data stops improving.
E step: estimate the component membership of each document
using θ.
M step: re-estimate θ given the memberships, again using MAP.
Comparison



Basic EM: performs well compared with the naive Bayes
classifier alone, given a large unlabeled data set and a small set of
labeled data.
EM-λ: can improve accuracy when the
assumption above does not hold.
Multiple components: dramatically outperforms basic
EM.
Part III
Results and Recap
Experimental Results

Empirical evidence that combining labeled with
unlabeled data using EM outperforms naive Bayes.

20 Newsgroups, WebKB, Reuters

Improvements in accuracy due to unlabeled data are
dramatic, especially when the number of labeled data is
low.

Augmented EM can increase performance even when
basic EM performs poorly due to a large number of
unlabeled documents.
Data sets and Protocols
- 20 Newsgroups

20017 articles divided evenly among 20 different UseNet
discussion groups.
Task - to classify an article into the one newsgroup to
which it was posted.
Many categories fall into confusable clusters.
Stop words are removed – 62258 unique words
Word counts are normalized and scaled – each document
has constant length.




Data sets and Protocols
- WebKB





8145 Web pages gathered from university computer science
departments.
Choosing 4199 pages covering categories: student, faculty,
course and project.
Task - to classify a web page into one of the four categories.
Stemming and a stoplist are not used.
Vocabulary is limited to the 300 most informative words, chosen using
leave-one-out cross-validation.
Data sets and Protocols
- Reuters

12902 articles and 90 topic categories.
Task - to build a binary classifier for each of the ten most
populous classes to identify the news topic.
Words inside <TEXT> tags are used; the REUTERS and &# tags are
not used.
A stoplist is used, but no stemming.
Metrics are recall and precision instead of accuracy.




Precision-Recall breakeven point
• Standard information retrieval measure
• Recall = (number of correct positive predictions) / (number of positive examples)
• Precision = (number of correct positive predictions) / (number of positive predictions)
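The two ratios above, and one common way to locate a breakeven point, can be sketched as follows. The breakeven routine assumes a ranking-based approach: predict as positive the k highest-scoring documents, with k equal to the number of true positives, at which point precision and recall coincide. This is an illustrative choice, not necessarily the procedure used in the paper, and the labels and scores are made up.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels encoded as 0/1."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    n_predicted = sum(y_pred)
    n_positive = sum(y_true)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_positive if n_positive else 0.0
    return precision, recall

def breakeven_point(y_true, scores):
    """Precision (= recall) when the top-k scored items are predicted positive,
    with k set to the number of positive examples."""
    k = sum(y_true)
    if k == 0:
        return 0.0
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    correct = sum(y_true[i] for i in ranked[:k])
    return correct / k

y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]      # hypothetical classifier scores
y_pred = [1, 1, 1, 0, 0]                # top three predicted positive

precision, recall = precision_recall(y_true, y_pred)
breakeven = breakeven_point(y_true, scores)
print(precision, recall, breakeven)
```

Because k equals the number of positive examples, both ratios share the same numerator and denominator at that cutoff, which is why a single number summarizes the trade-off.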
Wall-clock timing

EM usually converges after 10 iterations

Less than 1 minute for the WebKB

Less than 15 minutes for 20 Newsgroups – huge vocabulary
and more documents
EM with unlabeled data increases Accuracy
Figure 1:- Accuracy versus # of Labeled Documents. (20 Newsgroups)
Effect of varying the # of unlabeled documents
Figure 2:- Accuracy versus # of unlabeled documents. (20 Newsgroups)
EM algorithm in action
Figure 3:- ‘Course’ class for WebKB dataset
EM performance degradation
Figure 4:- As the # of labeled documents increases, adding more unlabeled data
lowers classifier accuracy. Importance of weighting factor λ. (WebKB)
Effects of different EM
Figure 5:- Comparison between EM, CV EM-λ and EM-λ (WebKB)
Performance of EM on different # of
mixture components
Figure 6:- Too few or too many mixture components result in poor
performance. Unlabeled data is used. (Reuters)
Precision-Recall breakeven points
Figure 7:- Comparison between NB and EM on Reuters dataset
Related Work





EM is a well-known family of algorithms that works by
treating unclassified data as incomplete.
According to Miller et al - EM on non-textual tasks using
mixture of Gaussians – assumed unlabeled data to be
sufficient to estimate parameter values.
Castelli and Cover - unlabeled data does not improve the
classification results in the absence of labeled data.
EM can be combined with active learning to improve
performance: slightly more than half as much labeled
data was then enough!
EM can be applied with other machine learning algorithms
like SVM, kNN.
Punchwords
• Text classification
• Naive Bayes
• Expectation Maximisation Algorithm
• EM-λ
• Multiple Mixture models for subclass
• Leave-one-out cross validation
• Stemming and stoplist words
• Accuracy, Precision, Recall
Recap



A family of algorithms has been presented to address text
classification using voluminous unlabeled data and scarce
labeled data.
When the data is consistent with the assumptions, basic EM
performs well.
When the data is not consistent, two extensions help:
- EM-λ: controlling the contribution of unlabeled data.
- Multiple Mixture Components per Class: relaxing the one-to-one
constraint to “many-to-one”.
References
• Using Unlabeled Data to Improve Text Classification, May 2001, at www.kamalnigam.com/papers/thesis-nigam.pdf
• Netlab toolkit - www.ncrg.aston.ac.uk/netlab/
• Validation Lecture, Intelligent Sensor Systems, Ricardo Gutierrez-Osuna, Wright State University
Question Time!!
Route further questions to ...
Ryan - 0789317
Neo - 0785401
Sandhya - 0671562
