Importance of Semantic Representation:
Dataless Classification
Ming-Wei Chang
Lev Ratinov
Dan Roth
Vivek Srikumar
University of Illinois at Urbana-Champaign
Text Categorization
Classify the following sentence:
Syd Millar was the chairman of the International Rugby Board in 2003.
Pick a label:
Class1 vs. Class2

Traditionally, we need annotated data to train a classifier
Slide 1
Text Categorization

Humans don’t seem to need labeled data
Syd Millar was the chairman of the International Rugby Board in 2003.
Pick a label:
Sports vs. Finance
Label names carry a lot of information!
Slide 2
Text Categorization
Do we really always need labeled data?
Slide 3
Contributions

We can often go quite far without annotated data


… if we “know” the meaning of text
This works for text categorization

… and is consistent across different domains
Slide 4
Outline

Semantic Representation

On-the-fly Classification

Datasets

Exploiting unlabeled data

Robustness to different domains
Slide 5
Semantic Representation

One common representation is the Bag of Words (BoW) representation

Each text is a vector in the space of words.
Slide 7
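As a concrete illustration, here is a minimal Python sketch of a bag-of-words vector; the tiny vocabulary and whitespace tokenization are illustrative choices, not the authors' setup:

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count-vector of a text over a fixed vocabulary (whitespace tokens)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical toy vocabulary, using the example sentence from the opening slide.
vocab = ["chairman", "rugby", "board", "bank", "fund"]
doc = "Syd Millar was the chairman of the International Rugby Board in 2003."
print(bag_of_words(doc, vocab))   # [1, 1, 1, 0, 0]
```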
Semantic Representation

Explicit Semantic Analysis

[Gabrilovich & Markovitch, 2006, 2007]

Text is a vector in the space of concepts

Concepts are defined by Wikipedia articles
Slide 8
Explicit Semantic Analysis: Example
[Figure: two texts and their ESA representations as Wikipedia article titles]

"Apple IPod" → IPod mini, IPod photo, IPod nano, Apple Computer, IPod shuffle, ITunes

"Monetary Policy" → International Monetary Fund, Monetary policy, Economic and Monetary Union, Hong Kong Monetary Authority, Monetarism, Central bank
Slide 9
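A minimal sketch of the idea behind ESA, using a hypothetical four-article "Wikipedia"; real ESA uses TF-IDF weights over hundreds of thousands of articles, but the shape of the computation is the same: each concept contributes one coordinate, scored by weighted word overlap.

```python
import math
from collections import Counter

# Hypothetical toy "Wikipedia": concept title -> article text.
concepts = {
    "IPod mini": "apple portable music player ipod mini",
    "Apple Computer": "apple computer company ipod itunes macintosh",
    "Monetary policy": "central bank interest rate monetary policy money supply",
    "International Monetary Fund": "international organization lending monetary economic policy",
}

def concept_vectors(articles):
    """TF-IDF weight each concept's article over the shared vocabulary."""
    n = len(articles)
    df = Counter(w for text in articles.values() for w in set(text.split()))
    return {name: {w: tf * math.log(n / df[w])
                   for w, tf in Counter(text.split()).items()}
            for name, text in articles.items()}

def esa(text, vectors):
    """ESA representation: one coordinate per concept, scored by word overlap."""
    words = Counter(text.lower().split())
    return {name: sum(words[w] * weight for w, weight in vec.items())
            for name, vec in vectors.items()}

vectors = concept_vectors(concepts)
print(esa("apple ipod", vectors))   # iPod/Apple concepts score high, monetary ones 0
```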
Semantic Representation

Two semantic representations

Bag of words

ESA
Slide 10
Outline

Semantic Representation

On-the-fly Classification

Datasets

Exploiting unlabeled data

Robustness to different domains
Slide 11
Traditional Text Categorization
[Diagram: a labeled corpus (Sports and Finance documents) is mapped into a semantic space, where a classifier is trained]
Slide 12
Dataless Classification
[Diagram: the labeled corpus is replaced by just the label names, Sports and Finance]

What can we do using just the labels?
Slide 13
But labels are text too!
Slide 14
Dataless Classification
[Diagram: a new unlabeled document and the labels Sports and Finance are mapped into the same semantic space]
Slide 15
What is Dataless Classification?

Humans don’t need training for classification

Annotated training data not always needed

Look for the meaning of words
Slide 16
On-the-fly Classification
[Diagram: a new unlabeled document and the labels Sports and Finance are mapped into the same semantic space]
Slide 18
On-the-fly Classification

No training data needed

We know the meaning of label names

Pick the label that is closest in meaning to the document

Nearest neighbors
Slide 19
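A sketch of the on-the-fly classifier in Python: nearest neighbor between the document and the label names in whichever semantic space is chosen. The `represent` argument is a placeholder for any text-to-vector map (bag of words or ESA), not a function from the paper:

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts of feature -> weight)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def on_the_fly_classify(document, labels, represent):
    """Pick the label whose representation is closest in meaning to the
    document's. The label names themselves are the only 'training data'."""
    doc_vec = represent(document)
    return max(labels, key=lambda lab: cosine(represent(lab), doc_vec))

# e.g. on_the_fly_classify(post, ["Sports", "Finance"], represent=some_esa_map)
```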
On-the-fly Classification
[Diagram: a new unlabeled document and new labels, Hockey and Baseball, are mapped into the semantic space]
Slide 20
On-the-fly Classification

No need to even know the labels beforehand

Compare with traditional classification

Annotated training data for each label
Slide 21
Outline

Semantic Representation

On-the-fly Classification

Datasets

Exploiting unlabeled data

Robustness to different domains
Slide 22
Dataset 1: Twenty Newsgroups

Posts to newsgroups

Newsgroups have descriptive names
sci.electronics = Science Electronics
rec.motorcycles = Motorcycles
Slide 23
Dataset 2: Yahoo Answers

Posts to Yahoo! Answers



Posts categorized into a two-level hierarchy
20 top-level categories
280 categories in total at the second level
Arts and Humanities → Theater Acting
Sports → Rugby League
Slide 24
Experiments

20 Newsgroups

10 binary problems (from [Raina et al, ‘06])
Religion vs. Politics.guns
Motorcycles vs. MS Windows

Yahoo! Answers

20 binary problems
Health → Diet & Fitness vs. Health → Allergies
Consumer Electronics → DVRs vs. Pets → Rodents
Slide 25
Results: On-the-fly classification
Dataset      Supervised Baseline    Bag of Words    ESA
Newsgroup    71.7                   65.7            85.3
Yahoo!       84.3                   66.8            88.6

All numbers are accuracy (%).
Supervised baseline: Naïve Bayes classifier (uses annotated data, ignores the labels)
Bag of Words and ESA: nearest neighbors (use the labels, no annotated data)
Slide 26
Outline

Semantic Representation

On-the-fly Classification

Datasets

Exploiting unlabeled data

Robustness to different domains
Slide 27
Using Unlabeled Data

Knowing the data collection helps


We can learn specific biases of the dataset
Potential for semi-supervised learning
Slide 28
Bootstrapping

Each label name is a "labeled" document
One "example" in word or concept space

Train initial classifier
Same as the on-the-fly classifier

Loop:
Classify all documents with current classifier
Retrain classifier with highly confident predictions
Slide 29
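A minimal sketch of this bootstrapping loop, reusing cosine() from the on-the-fly sketch above; the nearest-centroid classifier and the confidence cutoff (`keep`) are illustrative assumptions, not necessarily the authors' exact setup:

```python
def centroid(vecs):
    """Average of a list of sparse vectors (dicts)."""
    out = {}
    for vec in vecs:
        for k, w in vec.items():
            out[k] = out.get(k, 0.0) + w / len(vecs)
    return out

def bootstrap(labels, docs, represent, rounds=5, keep=0.2):
    """Seed with the label names (round 0 is exactly the on-the-fly
    classifier), then repeatedly self-label the corpus and retrain on
    the most confident predictions. cosine() as defined earlier."""
    doc_vecs = [represent(d) for d in docs]
    seeds = {lab: represent(lab) for lab in labels}
    centroids = dict(seeds)
    for _ in range(rounds):
        # Classify all documents with the current classifier.
        scored = []
        for vec in doc_vecs:
            sims = {lab: cosine(vec, c) for lab, c in centroids.items()}
            best = max(sims, key=sims.get)
            scored.append((sims[best], best, vec))
        scored.sort(key=lambda s: s[0], reverse=True)
        confident = scored[:max(1, int(keep * len(scored)))]
        # Retrain: label-name seed plus the confident self-labeled documents.
        centroids = {lab: centroid([seeds[lab]] +
                                   [v for _, l, v in confident if l == lab])
                     for lab in labels}
    return centroids
```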
Co-training

Words and concepts are two independent “views”

Each view is a teacher for the other
[Blum & Mitchell ‘98]
Slide 30
Co-training

Train initial classifiers in word space and concept space

Loop:


Classify documents with current classifiers
Retrain with highly confident predictions of both classifiers
Slide 31
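A sketch of the co-training loop, reusing the cosine() and centroid() helpers from the sketches above; again, the centroid classifier and the `keep` fraction are assumptions for illustration:

```python
def co_train(labels, docs, views, rounds=5, keep=0.1):
    """Co-training: `views` holds two text -> vector maps, e.g. a bag-of-words
    map and an ESA map. Both classifiers start from the label names alone;
    each round, the most confident predictions of BOTH views are added to a
    shared pool and both nearest-centroid classifiers are retrained."""
    pool = [(lab, lab) for lab in labels]            # seed: (label, text) pairs
    for _ in range(rounds):
        added = []
        for view in views:
            cents = {lab: centroid([view(t) for l, t in pool if l == lab])
                     for lab in labels}
            scored = []
            for doc in docs:
                vec = view(doc)
                sims = {lab: cosine(vec, c) for lab, c in cents.items()}
                best = max(sims, key=sims.get)
                scored.append((sims[best], best, doc))
            scored.sort(key=lambda s: s[0], reverse=True)
            added += scored[:max(1, int(keep * len(scored)))]
        # Each view teaches the other: both confident sets enter the pool.
        pool = [(lab, lab) for lab in labels] + [(l, d) for _, l, d in added]
    return pool
```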
Using unlabeled data

Three approaches

Bootstrapping with labels using Bag of Words

Bootstrapping with labels using ESA

Co-training
Slide 32
More Results
[Chart: accuracy (%) on Newsgroup and Yahoo! for Supervised (10 examples), Supervised (100 examples), Bootstrap (Words), Bootstrap (Concepts), and Co-Train. Co-training, using just the labels and no annotated data, does as well as supervision with 100 examples.]
Slide 33
Outline

Semantic Representation

On-the-fly Classification

Datasets

Exploiting unlabeled data

Robustness to different domains
Slide 34
Domain Adaptation

Classifiers trained on one domain and tested on another

Performance usually decreases across domains
Slide 35
But the label names are the same

Label names don’t depend on the domain

Label names are robust across domains

On-the-fly classifiers are domain independent
Slide 36
Example
[Chart: Baseball vs. Hockey. Accuracy (%) when testing on Newsgroup and on Yahoo!, for classifiers trained on Newsgroup, trained on Yahoo!, and the dataless classifier.]
Slide 37
Conclusion

Sometimes, label names tell us more about a class than annotated examples do

The standard learning practice of treating labels as unique identifiers loses information

The right semantic representation helps
What is the right one?
Slide 38