Text classifiers

Hands on Classification with Learning Based Java
Gourab Kundu
Adapted from a talk by Vivek Srikumar
Goals of this tutorial
At the end of these lectures, you will be able to
1. Get started with Learning Based Java
2. Use a generic, black box text classifier for different applications
   … and write your own text classifier, if needed
3. Understand how features can impact the classifier performance
   … and add features to improve your application
4. Build a badge classifier based on character features
A Quick Recap

Given: Examples (x, f(x)) of some unknown function f
Find: A good approximation of f

x provides some representation of the input
The process of mapping a domain element into a representation is called Feature Extraction. (Hard; ill-understood; important)
$x \in \{0,1\}^n$ or $x \in \mathbb{R}^n$

The target function (label)
Binary classification: $f(x) \in \{-1,+1\}$
Multi-class classification: $f(x) \in \{1,2,3,\ldots,k-1\}$
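To make the representation $x \in \{0,1\}^n$ concrete, here is a minimal sketch of one possible feature extraction step: a binary bag of words over a fixed vocabulary. The class name, the whitespace tokenization, and the four-word vocabulary are assumptions made for illustration; they are not part of LBJ or of the course code.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: map a document to a binary vector x in {0,1}^n over a fixed vocabulary. */
class BinaryBagOfWords {

    /** Returns x where x[i] = 1 if vocabulary word i occurs in the text, else 0. */
    static int[] extract(String text, List<String> vocabulary) {
        // Assumption: lowercase, whitespace-separated tokens are good enough here.
        List<String> tokens = Arrays.asList(text.toLowerCase().trim().split("\\s+"));
        int[] x = new int[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            x[i] = tokens.contains(vocabulary.get(i)) ? 1 : 0;
        }
        return x;
    }

    public static void main(String[] args) {
        // Hypothetical vocabulary, so n = 4.
        List<String> vocabulary = Arrays.asList("save", "software", "meeting", "enron");
        int[] x = extract("save over 70 % on name brand software", vocabulary);
        System.out.println(Arrays.toString(x)); // prints [1, 1, 0, 0]
    }
}
```

A real-valued representation in $\mathbb{R}^n$ could be obtained the same way by storing word counts instead of 0/1 indicators.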
What is text classification?
[Figure: A document → A classifier (black box) → Some labels, with one label marked ✓ and the rest ✗]
Several applications fit this framework
Spam detection
Sentiment classification
What else could you do if you had such a black box system that can classify text?
Try to spend 30 seconds brainstorming
Outline of this session

Getting started with LBJ

Writing our first classifier: Spam/Ham

Playing with features

Looking inside the black box classifier for feature weights
Writing classifiers
LEARNING BASED JAVA
What is Learning Based Java?

A modeling language for learning and inference

Supports





Programming using learned models
High level specification of features and constraints between classifiers
Inference with constraints
Different learning algorithms
The learning operator


Classifiers are functions defined in terms of data
Learning happens at compile time
What does LBJ do for you?

Abstracts away the feature representation, learning and
inference

Allows you to write learning based programs

Application developers can reason about the application at
hand
Demo

A learning based program
First, we will write an application that assumes the existence of
a black box classifier
SPAM DETECTION
Spam detection
Which of these (if any) are email spam? How do you know?

Subject: save over 70 % on name brand software
ppharmacy devote fink tungstate brown lexicon pawnshop crescent railroad distaff cytosine barium cain application elegy donnelly hydrochloride common embargo shakespearean bassett trustee nucleolus chicano narbonne telltale tagging swirly lank delphinus bragging bravery cornea asiatic susanne

Subject: please keep in touch
just like to say that it has been great meeting and working with you all . i will be leaving enron effective july 5 th to do investment banking in hong kong . i will initially be based in new york and will be moving to hong kong after a few months . do contact me when you are in the vicinity .
What do we need to build a classifier?
1. Annotated documents*
2. A feature representation of the documents
3. A learning algorithm
* Here we are dealing with supervised learning
Our first LBJ program
/** A learned text classifier; its definition comes from data. */
discrete TextClassifier(Document d) <- learn TextLabel
    using WordFeatures
    from new DocumentReader("data/spam/train")
    with SparseAveragedPerceptron {
        learningRate = 0.1;
        thickness = 3.5;
    }
    5 rounds
    testFrom new DocumentReader("data/spam/test")
end

discrete TextClassifier(...): defines a classifier
Document d: the object being classified
TextLabel: the function being learned
WordFeatures: the feature representation
DocumentReader("data/spam/train"): the source of the training data
SparseAveragedPerceptron: the learning algorithm
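To give a feel for what a learner configured with a learning rate and a thickness might do, here is a rough plain-Java sketch of an averaged perceptron with a thick separator over sparse binary features. This is not LBJ's SparseAveragedPerceptron implementation: the class name, the use of string feature names, the omission of a bias term, and the exact update rule are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch (not LBJ code): averaged perceptron with a thick separator. */
class AveragedPerceptronSketch {
    private final Map<String, Double> weights = new HashMap<>(); // current weight vector
    private final Map<String, Double> summed = new HashMap<>();  // sum of weights over all steps
    private final double learningRate;
    private final double thickness;

    AveragedPerceptronSketch(double learningRate, double thickness) {
        this.learningRate = learningRate;
        this.thickness = thickness;
    }

    /** Dot product of a weight vector with a sparse binary feature set. */
    private static double score(Map<String, Double> w, Set<String> features) {
        double s = 0.0;
        for (String f : features) {
            s += w.getOrDefault(f, 0.0);
        }
        return s;
    }

    /** Makes `rounds` passes over the data; each label is +1 or -1. */
    void train(List<Set<String>> examples, List<Integer> labels, int rounds) {
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < examples.size(); i++) {
                Set<String> x = examples.get(i);
                int y = labels.get(i);
                // Update on mistakes and on correct predictions inside the margin.
                if (y * score(weights, x) <= thickness) {
                    for (String f : x) {
                        weights.merge(f, learningRate * y, Double::sum);
                    }
                }
                // Accumulate the current weights so prediction can use their average.
                for (Map.Entry<String, Double> e : weights.entrySet()) {
                    summed.merge(e.getKey(), e.getValue(), Double::sum);
                }
            }
        }
    }

    /** Predicts with the summed weights; the sign matches that of the average. */
    int predict(Set<String> features) {
        return score(summed, features) >= 0 ? 1 : -1;
    }
}
```

In this sketch, `rounds` plays the role of "5 rounds" in the LBJ program, and a larger thickness forces more updates even on examples that are already classified correctly.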
Demo

Let’s build a spam detector

How to train?

How do different learning algorithms perform? Does this choice
matter much?
Features

Our current spam detector uses words as features

Can we do better?
Let’s try it out
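One direction to try is adding pairs of adjacent words (bigrams) next to the single-word features. The sketch below is plain Java rather than LBJ, and the class name, the feature prefixes, and the whitespace tokenization are assumptions made for illustration. Whether bigrams actually help on the spam data is something to check experimentally.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: emit one feature per word and one per pair of adjacent words. */
class WordAndBigramFeatures {

    static List<String> extract(String text) {
        // Assumption: lowercase, whitespace-separated tokens.
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        List<String> features = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            features.add("w:" + tokens[i]);
            if (i + 1 < tokens.length) {
                features.add("bi:" + tokens[i] + "_" + tokens[i + 1]);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // Prints [w:name, bi:name_brand, w:brand, bi:brand_software, w:software]
        System.out.println(extract("name brand software"));
    }
}
```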
MORE TEXT CLASSIFICATION
Sentiment classification
Which of these product reviews is positive? How do you know?

I recently made the switch from PC to Mac, and I can say that I'm not sure why I waited so long. Considering that I have only had my computer a few weeks I can't say much about the durability and longevity of the hardware, but I can say that the operating system (mine shipped with Lion) and software is top notch.

I've been an Apple user for a long time, but my most recent MacBook Pro purchase has convinced me to reconsider. I've had several hardware issues, including a failed keyboard, battery failure, and a bad DVD drive. Now, the backlight on the display fails to turn on when waking from sleep
Classifying news groups
Which mailing list should this message be posted to?
I am looking for Quick C or Microsoft C code for image decoding from file for
VGA viewing and saving images from/to GIF, TIFF, PCX, or JPEG format. I have
scoured the Internet, but its like trying to find a Dr. Seuss spell checker
TSR. It must be out there, and there's no need to reinvent the wheel.
How do you know?

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc
Demo

Converting our spam classifier into
    a sentiment classifier
    a newsgroup classifier
Note: How different are these at the implementation level?
Most of the engineering lies in the features
[Figure: A document → A classifier (black box) → Some labels, with one label marked ✓ and the rest ✗]
Summary

What is LBJ? How do we use it?

Writing a simple spam detector

Playing with features

How much do we need to change to move to a different
application?
Assignment before Next Class (Not Graded)

Download the code & data
(http://l2r.cs.uiuc.edu/~danr/Teaching/CS446-12/handsonclassification.html)
for this class and play with it

Try to solve the Badges game puzzle with LBJ



Think about what features are needed
Write a parser for reading the data
Write a classifier for solving the puzzle
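As a starting point for the parser, here is a minimal sketch. It assumes that each line of the badges data holds a label character (`+` or `-`) followed by a person's name; check the downloaded data before relying on this, since the format is an assumption here. The class names are hypothetical and the names in the test input are made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: parse lines of the assumed form "+ First Last" or "- First Last". */
class BadgeParser {

    /** One labeled example: a name and whether its badge label is '+'. */
    static final class Badge {
        final String name;
        final boolean positive;

        Badge(String name, boolean positive) {
            this.name = name;
            this.positive = positive;
        }
    }

    static List<Badge> parse(List<String> lines) {
        List<Badge> badges = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Unexpected label in line: " + raw);
            }
            badges.add(new Badge(line.substring(1).trim(), sign == '+'));
        }
        return badges;
    }
}
```

Reading from a list of lines keeps the sketch independent of where the file lives; `java.nio.file.Files.readAllLines` can supply that list from disk.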
Next Class

We will solve the Badges Game puzzle by Machine Learning

We will look at more text classification examples

We will think about a famous people classifier
Questions
Badge Classifier

Brainstorm the possible Features





Characters in entire name
Two consecutive Characters
Character as Vowel, Character as Consonant
….
…
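The brainstormed features above can be sketched as follows. This is one possible encoding in plain Java, not the course solution: the class name, the feature name prefixes, and the choice to tie vowel/consonant features to character positions are assumptions made for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Sketch: character-level features for a name. */
class BadgeCharacterFeatures {

    static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    static Set<String> extract(String name) {
        String s = name.toLowerCase();
        Set<String> features = new LinkedHashSet<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isLetter(c)) {
                continue; // ignore spaces and punctuation
            }
            // Characters in the entire name.
            features.add("char:" + c);
            // Character as vowel or consonant, tied to its position.
            features.add("pos" + i + ":" + (isVowel(c) ? "vowel" : "consonant"));
            // Two consecutive characters.
            if (i + 1 < s.length() && Character.isLetter(s.charAt(i + 1))) {
                features.add("pair:" + c + s.charAt(i + 1));
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // Prints [char:a, pos0:vowel, pair:al, char:l, pos1:consonant]
        System.out.println(extract("Al"));
    }
}
```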

Feature Engineering is Important (especially if labeled data is
small)

What is the baseline? 70 +, 24 -
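Assuming "70 +, 24 -" gives the number of positive and negative examples, a natural baseline is to always predict the majority label, which would be right on 70 of 94 examples (about 74.5%). The small sketch below computes that figure; the class and method names are hypothetical.

```java
/** Sketch: accuracy of always predicting the majority label. */
class MajorityBaseline {

    static double accuracy(int positives, int negatives) {
        return (double) Math.max(positives, negatives) / (positives + negatives);
    }

    public static void main(String[] args) {
        // Under the stated assumption: 70 positive and 24 negative examples.
        System.out.println(accuracy(70, 24));
    }
}
```

A learned badge classifier is only interesting if it beats this number.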
THE FAMOUS PEOPLE CLASSIFIER
The Famous People Classifier
f([image]) = Politician
f([image]) = Athlete
f([image]) = Corporate Mogul
The NLP version of the fame classifier
Each person is represented by:
All sentences in the news in which the string Barack Obama occurs
All sentences in the news in which the string Roger Federer occurs
All sentences in the news in which the string Bill Gates occurs
Our goal

Find famous athletes, corporate moguls and politicians
Athlete
• Michael Schumacher
• Michael Jordan
• …
Politician
• Bill Clinton
• George W. Bush
• …
Corporate Mogul
• Warren Buffett
• Larry Ellison
• …
Let’s brainstorm

How do we build a fame classifier?
Remember, we start off with just raw text from a news website
One solution

Let us label entities using features defined on mentions
All sentences in the news in which the string Barack Obama occurs



Identify mentions using the named entity recognizer
Define features based on the words, parts of speech and
dependency trees
Train a classifier
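The first two steps above can be sketched in a much simplified form. The code below uses plain substring matching as a stand-in for the named entity recognizer and counts the surrounding words as features; it does not use parts of speech or dependency trees. The class name, the tokenization, and the example person "Jane Doe" in the usage note are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: represent an entity by the words in the sentences that mention it. */
class MentionContextFeatures {

    static Map<String, Integer> extract(String entity, List<String> sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : sentences) {
            // Stand-in for mention detection: exact string match only.
            if (!sentence.contains(entity)) {
                continue;
            }
            // Remove the mention itself so only context words become features.
            String context = sentence.replace(entity, " ");
            for (String token : context.toLowerCase().split("[^a-z]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

For example, calling `extract("Jane Doe", sentences)` on the sentences "Jane Doe won the final." and "Jane Doe won again." would count `won` twice. The resulting counts could then be fed to a classifier over the labels Athlete, Politician, and Corporate Mogul.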
Summary
1. Get started with Learning Based Java
2. Use a generic, black box text classifier for different applications
   … and write your own text classifier, if needed
3. Understand how features can impact the classifier performance
   … and add features to improve your application
4. Build a badge classifier based on character features
Questions