INTRODUCTION TO ARTIFICIAL
INTELLIGENCE
Massimo Poesio
LECTURE 14: Text categorization with
Decision Trees and Naïve Bayes
REMINDER: DECISION TREES
• A DECISION TREE is a classifier in the form of a
tree structure, where each node is either a:
– Leaf node - indicates the value of the target
attribute (class) of examples, or
– Decision node - specifies some test to be carried
out on a single attribute-value, with one branch
and sub-tree for each possible outcome of the
test.
• A decision tree can be used to classify an example by
starting at the root of the tree and following the branches
selected by the example's attribute values until a leaf
node is reached; the leaf provides the classification of
the instance.
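To make the two node types concrete, here is a minimal illustrative sketch (not from the slides) of this structure in Python; the class and field names are my own:

```python
from dataclasses import dataclass
from typing import Dict, Union

@dataclass
class Leaf:
    label: str    # value of the target attribute (the class)

@dataclass
class Decision:
    attribute: str                                   # attribute tested at this node
    branches: Dict[str, Union["Leaf", "Decision"]]   # one branch/sub-tree per outcome

def classify(node: Union[Leaf, Decision], example: Dict[str, str]) -> str:
    """Start at the root and follow branches until a leaf is reached."""
    while isinstance(node, Decision):
        node = node.branches[example[node.attribute]]
    return node.label
```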
Decision Tree Example
Goal: learn when we can play Tennis and when we cannot
Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No
Decision Tree for PlayTennis
Outlook?
  Sunny    -> Humidity?
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind?
                Strong -> No
                Weak   -> Yes
Decision Tree for PlayTennis
(The same tree as on the previous slide.)
• Each internal node tests an attribute.
• Each branch corresponds to an attribute value.
• Each leaf node assigns a classification.
Decision Tree for PlayTennis
A new instance to classify:

Outlook  Temperature  Humidity  Wind  PlayTennis
Sunny    Hot          High      Weak  ?

Applying the tree from the previous slides (Outlook = Sunny, then Humidity = High) gives the classification: No.

(Source: www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp)
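As an illustrative sketch (not part of the original slides), the learned PlayTennis tree can be written as a nested dictionary and used to answer the query above; the representation and function name are my own:

```python
# The PlayTennis tree from the slide as nested dicts: each inner dict names
# the attribute being tested; leaves are class labels.
play_tennis_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, example):
    """Walk from the root to a leaf, following the example's attribute values."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))              # attribute tested at this node
        tree = tree[attribute][example[attribute]]
    return tree

example = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak"}
print(classify(play_tennis_tree, example))        # -> "No"
```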
TEXT CLASSIFICATION WITH DT
• As an example of an actual application of
decision trees, we'll consider the problem of
TEXT CLASSIFICATION
IS THIS SPAM?
From: "" <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down
Stop paying rent TODAY !
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the
methods outlined in this truly INCREDIBLE ebook.
Change your life NOW !
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================
TEXT CATEGORIZATION
• Given:
  – A description of an instance, x ∈ X, where X is the
    instance language or instance space.
    • Issue: how to represent text documents.
  – A fixed set of categories: C = {c1, c2, …, cn}
• Determine:
  – The category of x: c(x) ∈ C, where c(x) is a
    categorization function whose domain is X and whose
    range is C.
• We want to know how to build categorization functions
  ("classifiers").
Document Classification
Training data: classes, grouped by area, with typical words from their training documents:
• (AI) ML: learning, intelligence, algorithm, reinforcement, network, ...
• (AI) Planning: planning, temporal, reasoning, plan, language, ...
• (Programming) Semantics: programming, semantics, language, proof, ...
• (Programming) Garb.Coll.: garbage, collection, memory, optimization, region, ...
• (HCI) Multimedia: ...
• (HCI) GUI: ...

Testing data: a document containing "planning, language, proof, intelligence". Which class should it be assigned to?
(Note: in real life there is often a hierarchy, not present in
the above problem statement; and you get papers on ML
approaches to Garb. Coll.)
Text Categorization Examples
Assign labels to each document or web-page:
• Labels are most often topics such as Yahoo-categories
  e.g., "finance", "sports", "news>world>asia>business"
• Labels may be genres
  e.g., "editorials", "movie-reviews", "news"
• Labels may be opinion
  e.g., "like", "hate", "neutral"
• Labels may be domain-specific binary
  e.g., "interesting-to-me" : "not-interesting-to-me"
  e.g., "spam" : "not-spam"
  e.g., "is a toner cartridge ad" : "isn't"
TEXT CATEGORIZATION WITH DT
• Build a separate decision tree for each
category
• Use WORD COUNTS as features
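A minimal sketch of extracting word-count features from a document; the tokenizer below is an illustrative assumption, not the preprocessing actually used in the experiments that follow:

```python
from collections import Counter
import re

def word_counts(text: str) -> Counter:
    """Lower-case the text, keep alphabetic tokens, and count each word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

doc = "Stop paying rent TODAY! Anyone can buy real estate with no money down."
print(word_counts(doc).most_common(3))   # [('stop', 1), ('paying', 1), ('rent', 1)]
```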
Reuters Data Set
(21578 - ModApte split)
• 9603 training, 3299 test articles; ave. 200 words
• 118 categories
– An article can be in more than one category
– Learn 118 binary category distinctions
Common categories (#train, #test):
• Earn (2877, 1087)
• Acquisitions (1650, 179)
• Money-fx (538, 179)
• Grain (433, 149)
• Crude (389, 189)
• Trade (369, 119)
• Interest (347, 131)
• Ship (197, 89)
• Wheat (212, 71)
• Corn (182, 56)
AN EXAMPLE OF REUTERS TEXT
(Example Reuters article; from Foundations of Statistical Natural Language Processing, Manning and Schuetze.)
Decision Tree for Reuter classification
(Figure from Foundations of Statistical Natural Language Processing, Manning and Schuetze.)
OTHER LEARNING METHODS USED
FOR TEXT CLASSIFICATION
• Bayesian methods (Naïve Bayes)
• Neural nets (e.g., the perceptron)
• Vector-space methods (k-NN, Rocchio,
unsupervised)
• SVMs
BAYESIAN METHODS
• Learning and classification methods based on
probability theory.
• Bayes theorem plays a critical role in probabilistic
learning and classification.
• Build a generative model that approximates how
data is produced
• Uses prior probability of each category given no
information about an item.
• Categorization produces a posterior probability
distribution over the possible categories given a
description of an item.
Bayes’ Rule
P(C, X) = P(C \mid X)\, P(X) = P(X \mid C)\, P(C)

P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)}
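As a worked instance of the rule, using made-up numbers purely for illustration: suppose 20% of incoming mail is spam, and the word "estate" occurs in 30% of spam messages but only 1% of non-spam messages. Then

P(\text{spam} \mid \text{estate}) = \frac{P(\text{estate} \mid \text{spam})\, P(\text{spam})}{P(\text{estate})} = \frac{0.3 \times 0.2}{0.3 \times 0.2 + 0.01 \times 0.8} \approx 0.88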
Maximum a posteriori Hypothesis
h_{MAP} = \arg\max_{h \in H} P(h \mid D)
        = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)}
        = \arg\max_{h \in H} P(D \mid h)\, P(h)
Naive Bayes Classifiers
Task: classify a new instance described by a tuple of attribute values <x1, x2, …, xn>.

c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n)
        = \arg\max_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \ldots, x_n)}
        = \arg\max_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)
Naïve Bayes Classifier: Assumptions
• P(cj)
– Can be estimated from the frequency of classes in
the training examples.
• P(x1, x2, …, xn | cj)
  – O(|X|^n · |C|) parameters
  – Could only be estimated if a very, very large
    number of training examples was available.
• Conditional Independence Assumption:
  – Assume that the probability of observing the conjunction of
    attributes is equal to the product of the individual probabilities.
The Naïve Bayes Classifier
(Graphical model: class variable Flu with five observed features,
X1 = runny nose, X2 = sinus, X3 = cough, X4 = fever, X5 = muscle ache.)

• Conditional Independence Assumption:
  features are independent of each other
  given the class:

P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdots P(X_5 \mid C)
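A tiny sketch of what this factorization buys us computationally, using made-up conditional probabilities for the five symptoms (the numbers are assumptions, not from the lecture):

```python
# Hypothetical values of P(X_i = true | C = flu), for illustration only.
p_given_flu = {"runny nose": 0.9, "sinus": 0.8, "cough": 0.7,
               "fever": 0.6, "muscle ache": 0.8}

# Under conditional independence, the joint conditional probability is just
# the product of the per-symptom conditionals: five numbers to estimate
# instead of a full table over all symptom combinations.
p_all_given_flu = 1.0
for symptom, p in p_given_flu.items():
    p_all_given_flu *= p

print(round(p_all_given_flu, 5))   # 0.9 * 0.8 * 0.7 * 0.6 * 0.8 = 0.24192
```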
Learning the Model
(Graphical model: class variable C with features X1, …, X6.)

• Common practice: maximum likelihood
  – simply use the frequencies in the data

\hat{P}(c_j) = \frac{N(C = c_j)}{N}

\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j)}{N(C = c_j)}
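A minimal sketch of these maximum-likelihood estimates computed from a toy labelled dataset (the data and feature names are invented for illustration):

```python
from collections import Counter, defaultdict

# Toy training set: (attribute-value dict, class) pairs, invented for illustration.
data = [
    ({"cough": "yes", "fever": "yes"}, "flu"),
    ({"cough": "yes", "fever": "no"},  "flu"),
    ({"cough": "no",  "fever": "no"},  "noflu"),
    ({"cough": "yes", "fever": "no"},  "noflu"),
]

N = len(data)
class_counts = Counter(c for _, c in data)            # N(C = c_j)
joint_counts = defaultdict(Counter)                   # N(X_i = x_i, C = c_j)
for features, c in data:
    for attr, value in features.items():
        joint_counts[c][(attr, value)] += 1

p_class = {c: n / N for c, n in class_counts.items()}                  # estimated P(c_j)
p_cond = {c: {xv: n / class_counts[c] for xv, n in counts.items()}
          for c, counts in joint_counts.items()}                       # estimated P(x_i | c_j)

print(p_class["flu"])                     # 0.5
print(p_cond["flu"][("fever", "yes")])    # 0.5
```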
Problem with Max Likelihood
(Same Flu model as above:)

P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdots P(X_5 \mid C)

• What if we have seen no training cases where the patient had no flu
  but did have muscle aches?

\hat{P}(X_5 = t \mid C = nf) = \frac{N(X_5 = t, C = nf)}{N(C = nf)} = 0

• Zero probabilities cannot be conditioned away, no matter the other
  evidence!

\ell = \arg\max_c \hat{P}(c) \prod_i \hat{P}(x_i \mid c)
Smoothing to Avoid Overfitting
• Laplace (add-one) smoothing:

\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j) + 1}{N(C = c_j) + k}

  where k is the number of values of Xi.

• Somewhat more subtle version (the m-estimate):

\hat{P}(x_{i,k} \mid c_j) = \frac{N(X_i = x_{i,k}, C = c_j) + m\, p_{i,k}}{N(C = c_j) + m}

  where p_{i,k} is the overall fraction of the data in which Xi = xi,k,
  and m controls the extent of smoothing.
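A sketch of both smoothed estimators applied to hypothetical counts (the counts and the prior are assumptions chosen for illustration):

```python
def laplace_estimate(count_x_and_c: int, count_c: int, k: int) -> float:
    """Add-one (Laplace) smoothing; k = number of values attribute X_i can take."""
    return (count_x_and_c + 1) / (count_c + k)

def m_estimate(count_x_and_c: int, count_c: int, prior: float, m: float) -> float:
    """m-estimate; prior = overall fraction of data with X_i = x_ik,
    m = extent of smoothing."""
    return (count_x_and_c + m * prior) / (count_c + m)

# Hypothetical counts: X_5 = t never observed together with C = nf,
# out of 40 training cases with C = nf.
print(laplace_estimate(0, 40, k=2))         # 1/42  ~ 0.024 rather than 0
print(m_estimate(0, 40, prior=0.3, m=5))    # 1.5/45 ~ 0.033 rather than 0
```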
Using Naive Bayes Classifiers to
Classify Text: Basic method
• Attributes are text positions, values are words.

c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j)
       = \arg\max_{c_j \in C} P(c_j)\, P(x_1 = \text{"our"} \mid c_j) \cdots P(x_n = \text{"text"} \mid c_j)

• The Naive Bayes assumption is clearly violated. Example?
• Still too many possibilities.
• Assume that classification is independent of the positions of the
  words:
  – Use the same parameters for each position.
Text Classification Algorithms: Learning
• From the training corpus, extract the Vocabulary
• Calculate the required P(cj) and P(xk | cj) terms (a code sketch
  follows this list):
  – For each cj in C do:
    • docsj ← the subset of documents whose target class is cj
    • P(cj) ← |docsj| / (total number of documents)
    • Textj ← a single document containing all of docsj
    • For each word xk in Vocabulary:
      – nk ← number of occurrences of xk in Textj
      – P(xk | cj) ← (nk + 1) / (n + |Vocabulary|),
        where n is the total number of word positions in Textj
Text Classification Algorithms:
Classifying
• positions ← all word positions in the current document
  that contain tokens found in Vocabulary
• Return cNB, where

c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)
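A matching sketch of the classification step, taking a model in the same format as the learning sketch above; it multiplies raw probabilities, so in practice the log-space version discussed on the next slide is preferable:

```python
def classify_naive_bayes_text(doc, vocabulary, p_class, p_word):
    """doc: list of tokens. Returns argmax_c P(c) * prod_i P(x_i | c),
    taking the product only over tokens found in the vocabulary."""
    positions = [tok for tok in doc if tok in vocabulary]
    best_class, best_score = None, -1.0
    for c, pc in p_class.items():
        score = pc
        for tok in positions:
            score *= p_word[c][tok]
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```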
Underflow Prevention
• Multiplying lots of probabilities, which are
between 0 and 1 by definition, can result in
floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to
perform all computations by summing logs of
probabilities rather than multiplying
probabilities.
• Class with highest final un-normalized log
probability score is still the most probable.
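A sketch of the same decision computed in log space (the function and variable names are my own):

```python
import math

def classify_log_space(doc, vocabulary, p_class, p_word):
    """Sum logs of probabilities instead of multiplying the probabilities."""
    positions = [tok for tok in doc if tok in vocabulary]
    scores = {c: math.log(pc) + sum(math.log(p_word[c][tok]) for tok in positions)
              for c, pc in p_class.items()}
    return max(scores, key=scores.get)   # highest un-normalized log score wins
```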
Naïve Bayes Posterior Probabilities
• Classification results of naïve Bayes (the class
with maximum posterior probability) are
usually fairly accurate.
• However, due to the inadequacy of the
  conditional independence assumption, the
  actual posterior-probability estimates are not
  accurate.
  – Output probabilities are generally very close to 0
    or 1.
READINGS
• Fabrizio Sebastiani. Machine Learning in
Automated Text Categorization. ACM
Computing Surveys, 34(1):1-47, 2002
REMAINING LECTURES
DAY        HOUR   TOPIC
Wed 25/11  12-14  Text classification with Artificial Neural Nets
Fri 27/11  14-16  Lab: Supervised ML with Weka
Fri 4/12   10-12  Unsupervised methods & text classification
Fri 4/12   14-16  Lab: Linear algebra
Wed 9/12   10-12  Lexical acquisition by clustering
Thu 10/12  10-12  Psychological evidence on learning
Fri 11/12  10-12  Psychological evidence on language processing
Fri 11/12  14-16  Lab: Clustering
Mon 14/12  10-12  Intro to NLP
REMAINING LECTURES
DAY        HOUR   TOPIC
Tue 15/12  10-12  Ling. & psychological evidence on anaphora
Thu 17/12  10-12  Machine learning appr. to anaphora
Fri 18/12  10-12  Corpora for anaphora
Fri 18/12  14-16  Lab: BART
Mon 21/12  10-12  Lexical & commons. knowledge for anaphora
Tue 22/12  10-12  Salience
Tue 22/12  14-16  Discourse new detection
ACKNOWLEDGMENTS
• Several slides come from Chris Manning &
Hinrich Schuetze’s course on IR and text
classification