Text Categorization - Microsoft Research

Research in IR at MS
Microsoft Research (http://research.microsoft.com)
Adaptive Systems and Interaction - IR/UI
Machine Learning and Applied Statistics
Data Mining
Natural Language Processing
Collaboration and Education
MSR Cambridge
MSR Beijing
Microsoft Product Groups … many IR-related
IR Research at MSR
Improvements to representations and
matching algorithms
New directions
User modeling
Domain modeling
Interactive interfaces
An example: Text categorization
Traditional View of IR
[Diagram: Query Words -> Ranked List]
What’s Missing?
[Diagram: the Query Words -> Ranked List pipeline extended with User Modeling, Domain Modeling, and Information Use]
Domain/Obj Modeling
Not all objects are equal … potentially big win
A priori importance
Information use (“readware”; collab filtering)
Inter-object relationships
Link structure / hypertext
Subject categories - e.g., text categorization, text clustering
Metadata
E.g., reliability, recency, cost -> combining
User/Task Modeling
Demographics
Task -- What’s the user’s goal?
e.g., Lumiere
Short and long-term content interests
e.g., Implicit queries
Interest model = f(content_similarity, time, interest)
e.g., Letizia, WebWatcher, Fab
Information Use
Beyond batch IR model (“query->results”)
Consider larger task context
Knowledge management
Human attention is critical resource
Techniques for automatic information summarization,
organization, discovery, filtering, mining, etc.
Advanced UIs and interaction techniques
E.g., tight coupling of search and browsing to support
information management
The Broader View of IR
[Diagram: Query Words -> Ranked List, augmented with User Modeling, Domain Modeling, and Information Use]
Text Categorization: Road Map
Text Categorization Basics
Inductive Learning Methods
Reuters Results
Future Plans
Text Categorization
Text Categorization: assign objects to one
or more of a predefined set of categories
using text features
Example Applications:
Sorting new items into existing structures (e.g.,
general ontologies, file folders, spam vs. not)
Information routing/filtering/push
Topic specific processing
Structured browsing & search
Text Categorization - Methods
Human classifiers (e.g., Dewey, LCSH,
MeSH, Yahoo!, CyberPatrol)
Hand-crafted knowledge engineered
systems (e.g., CONSTRUE)
Inductive learning of classifiers
(Semi-) automatic classification
Classifiers
A classifier is a function: f(x) = conf(class)
from attribute vectors, x=(x1,x2, … xd)
to target values, confidence(class)
Example classifiers
if (interest AND rate) OR (quarterly),
then confidence(interest) = 0.9
confidence(interest) = 0.3*interest + 0.4*rate +
0.1*quarterly
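A minimal sketch of the two example classifiers above (a rule and a linear model) over binary word-presence features; the words and weights simply mirror the slide's illustration and are not learned:

```python
# Sketch: the two example "interest" classifiers above, over binary word features.
# The rule and the weights mirror the slide's illustration; they are not learned here.

def rule_classifier(doc_words):
    """If (interest AND rate) OR (quarterly), then confidence(interest) = 0.9."""
    words = set(doc_words)
    if ("interest" in words and "rate" in words) or "quarterly" in words:
        return 0.9
    return 0.0

def linear_classifier(doc_words):
    """confidence(interest) = 0.3*interest + 0.4*rate + 0.1*quarterly (binary features)."""
    words = set(doc_words)
    weights = {"interest": 0.3, "rate": 0.4, "quarterly": 0.1}
    return sum(w for term, w in weights.items() if term in words)

doc = ["the", "bank", "raised", "the", "interest", "rate"]
print(rule_classifier(doc), linear_classifier(doc))  # 0.9 0.7
```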
Inductive Learning Methods
Supervised learning to build classifiers
Labeled training data (i.e., examples of each category)
“Learn” classifier
Test effectiveness on new instances
Statistical guarantees of effectiveness
Classifiers easy to construct and update;
require only subject knowledge
Customizable for individual’s categories
and tasks
Text Representation
Vector space representation of documents
        word1  word2  word3  word4 ...
Doc 1 = <  1,     0,     3,     0, … >
Doc 2 = <  0,     1,     0,     0, … >
Doc 3 = <  0,     0,     0,     5, … >
Text can have 10^7 or more dimensions
e.g., 100k web pages had 2.5 million distinct words
Mostly use: simple words, binary weights, subset
of features
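A minimal sketch of this vector-space representation: documents mapped to binary word vectors over a small vocabulary (the documents and vocabulary here are made up for illustration):

```python
# Sketch: binary bag-of-words vectors, as in the Doc 1..3 example above.
from collections import Counter

docs = ["rate rises again", "bank leaves rate unchanged", "grain exports fall"]

# Build the vocabulary (in practice only a subset of features is kept; see Feature Selection).
vocab = sorted({w for d in docs for w in d.split()})

def to_binary_vector(text, vocab):
    counts = Counter(text.split())
    return [1 if counts[w] > 0 else 0 for w in vocab]

for d in docs:
    print(to_binary_vector(d, vocab), d)
```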
Feature Selection
Word distribution - remove frequent and
infrequent words based on Zipf's law:
frequency × rank ≈ constant (so log(frequency) + log(rank) ≈ constant)
[Plot: number of words (f) vs. words by rank order (r)]
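A sketch of the frequency-based pruning described above: drop words that appear in too many or too few documents. The document-frequency thresholds used here are illustrative assumptions, not values from the slide:

```python
# Sketch: drop very frequent and very infrequent words by document frequency.
# The thresholds (appear in >80% of documents, or in fewer than 2) are illustrative.
from collections import Counter

def filter_by_frequency(tokenized_docs, max_df=0.8, min_df=2):
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))               # document frequency, not term frequency
    n = len(tokenized_docs)
    return {w for w, c in df.items() if c >= min_df and c / n <= max_df}

docs = [d.split() for d in ["the rate rises", "the bank cut the rate", "the grain deal"]]
print(sorted(filter_by_frequency(docs)))  # 'the' is too frequent; one-off words are too rare
```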
Feature Selection (cont’d)
Fit to categories - use mutual information
to select features which best discriminate
category vs. not:
MI(x, C) = \sum_{x, C} p(x, C) \log \frac{p(x, C)}{p(x)\, p(C)}
Designer features - domain specific,
including non-text features
Use 100-500 best features from this
process as input to learning methods
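A sketch of the mutual-information selection step above: score each binary feature against the binary category label and keep the highest-scoring ones. The toy vocabulary, vectors, and labels are made up:

```python
# Sketch: select features by mutual information between a binary feature and the category.
import math
from collections import Counter

def mutual_information(feature_values, labels):
    """MI(x, C) = sum over x, C of p(x, C) * log( p(x, C) / (p(x) p(C)) )."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    px, pc = Counter(feature_values), Counter(labels)
    mi = 0.0
    for (x, c), count in joint.items():
        pxc = count / n
        mi += pxc * math.log(pxc / ((px[x] / n) * (pc[c] / n)))
    return mi

def select_features(doc_vectors, labels, vocab, k=3):
    scores = {w: mutual_information([v[i] for v in doc_vectors], labels)
              for i, w in enumerate(vocab)}
    return sorted(scores, key=scores.get, reverse=True)[:k]

vocab = ["rate", "grain", "the"]
docs = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 1, 1]]   # binary features per document
labels = [1, 1, 0, 0]                                  # 1 = "interest", 0 = not
print(select_features(docs, labels, vocab, k=2))       # "rate" and "grain" score highest
```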
Inductive Learning Methods
Find Similar
Decision Trees
Naïve Bayes
Bayes Nets
Support Vector Machines (SVMs)
All support:
“Probabilities” - graded membership; comparability across categories
Adaptive - over time; across individuals
Find Similar
Aka, relevance feedback
Rocchio:
w_j = \frac{\sum_{i \in rel} x_{i,j}}{n} + \gamma\, \frac{\sum_{i \in non\_rel} x_{i,j}}{N - n}
Classifier parameters are a weighted
combination of weights in positive and
negative examples -- “centroid”
New items classified using: \sum_j w_j x_j
Use all features, idf weights, γ = 0
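A minimal sketch of this centroid ("find similar") classifier with γ = 0, so only positive examples contribute; the idf weighting used on the slide is omitted for brevity:

```python
# Sketch: Rocchio-style centroid classifier with gamma = 0 (positive examples only).
# idf weighting, used on the slide, is omitted here for brevity.

def train_centroid(positive_vectors):
    n, d = len(positive_vectors), len(positive_vectors[0])
    return [sum(v[j] for v in positive_vectors) / n for j in range(d)]

def score(weights, x):
    """Classify a new item with sum_j w_j * x_j."""
    return sum(w * xj for w, xj in zip(weights, x))

positives = [[1, 0, 1], [1, 1, 0]]   # feature vectors of in-category documents
w = train_centroid(positives)         # centroid: [1.0, 0.5, 0.5]
print(score(w, [1, 1, 0]))            # 1.5
```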
Decision Trees
Learn a sequence of tests on features,
typically using top-down, greedy search
Binary (yes/no) or continuous decisions
[Example decision tree branching on features f1 and f7; leaf probabilities P(class) = .9, .6, and .2]
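A hedged sketch of learning such a tree with a greedy, top-down method. scikit-learn's DecisionTreeClassifier is used as a stand-in; it is not the implementation behind these slides, and the toy data is made up:

```python
# Sketch: greedy top-down decision tree learning on binary word features.
# scikit-learn is a stand-in here, not the system described in the slides.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]   # binary features f1, f2, f3
y = [1, 1, 0, 0]                                    # in category / not

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["f1", "f2", "f3"]))
print(tree.predict_proba([[1, 0, 1]]))              # graded class membership
```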
Naïve Bayes
Aka, binary independence model
Maximize: Pr(Class | Features)
P(class \mid \vec{x}) = \frac{P(\vec{x} \mid class)\, P(class)}{P(\vec{x})}
Assume features are conditionally independent
- math easy; surprisingly effective
[Graphical model: class node C with conditionally independent feature nodes x1, x2, x3, …, xn]
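A sketch of the binary-independence scoring above. Laplace (+1) smoothing is an added assumption to avoid zero probabilities; it is not stated on the slide:

```python
# Sketch: Naive Bayes over binary features, assuming conditional independence.
# Laplace (+1) smoothing is an added assumption to avoid zero probabilities.
import math

def train_nb(X, y):
    classes = sorted(set(y))
    d = len(X[0])
    prior = {c: y.count(c) / len(y) for c in classes}
    cond = {c: [(sum(x[j] for x, yy in zip(X, y) if yy == c) + 1) /
                (y.count(c) + 2) for j in range(d)] for c in classes}
    return prior, cond

def log_posterior(prior, cond, x, c):
    """log P(c) + sum_j log P(x_j | c), proportional to log P(c | x)."""
    lp = math.log(prior[c])
    for j, xj in enumerate(x):
        p = cond[c][j]
        lp += math.log(p if xj else 1 - p)
    return lp

X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
y = [1, 1, 0, 0]
prior, cond = train_nb(X, y)
print(max((log_posterior(prior, cond, [1, 0, 0], c), c) for c in prior))  # picks class 1
```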
Bayes Nets
Maximize: Pr (Class | Features)
Does not assume independence of
features - dependency modeling
[Graphical model: class node C with dependencies among feature nodes x1, x2, x3, …, xn]
Support Vector Machines
Vapnik (1979)
Binary classifiers that maximize margin
Find hyperplane separating positive and negative examples
2  
 
Optimization for maximum margin: min w , w  x  b  1, w  x  b  1
 
Classify new items using: w  x

w
support vectors
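A hedged sketch of a maximum-margin linear classifier on binary word features. scikit-learn's LinearSVC stands in for the linear SVM of the slides (the original work used SMO-trained SVMs), and the data is a placeholder:

```python
# Sketch: a linear SVM on binary word features; LinearSVC is a stand-in
# for the linear SVM of the slides.
from sklearn.svm import LinearSVC

X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
y = [1, 1, 0, 0, 1, 0]

svm = LinearSVC(C=1.0).fit(X, y)
print(svm.coef_, svm.intercept_)           # learned w and b of the separating hyperplane
print(svm.decision_function([[1, 0, 1]]))  # signed margin; its sign gives the class
```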
Support Vector Machines
Extendable to:
Non-separable problems (Cortes & Vapnik, 1995)
Non-linear classifiers (Boser et al., 1992)
Good generalization performance
Handwriting recognition (LeCun et al.)
Face detection (Osuna et al.)
Text classification (Joachims)
Platt’s Sequential Minimal Optimization
(SMO) algorithm very efficient
SVMs: Platt’s SMO Algorithm
SMO is a very fast way to train SVMs
SMO works by analytically solving the smallest
possible QP sub-problem
Substantially faster than chunking
Scales somewhere between O(N) and O(N^2)
Text Classification Process
[Pipeline: text files -> Index Server -> word counts per file -> Feature selection -> data set -> Learning Methods (Find Similar, Decision tree, Naïve Bayes, Bayes nets, Support vector machine) -> test classifier]
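The pipeline above, sketched end to end with scikit-learn components standing in for the Index Server, feature-selection, and learning stages; the corpus, category, and chi-squared selection (in place of mutual information) are illustrative assumptions:

```python
# Sketch of the pipeline: text -> word counts -> feature selection -> linear SVM -> test.
# chi2 stands in for the mutual-information selection described earlier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = ["rates rise again", "bank cuts discount rate",
               "grain exports fall", "corn harvest up"]
train_labels = [1, 1, 0, 0]            # 1 = "interest", 0 = not (placeholder data)

clf = make_pipeline(
    CountVectorizer(binary=True),      # word counts per file, binarized
    SelectKBest(chi2, k=3),            # keep only the best features (100-500 in practice)
    LinearSVC(),
)
clf.fit(train_texts, train_labels)
print(clf.predict(["discount rate unchanged"]))
```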
Reuters Data Set
(21578 - ModApte split)
9603 training articles; 3299 test articles
Example “interest” article
2-APR-1987 06:35:19.50
west-germany
b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
FRANKFURT, March 2
The Bundesbank left credit policies unchanged after today's regular
meeting of its council, a spokesman said in answer to enquiries. The
West German discount rate remains at 3.0 pct, and the Lombard
emergency financing rate at 5.0 pct.
REUTER
Average article 200 words long
Reuters Data Set
(21578 - ModApte split)
118 categories
An article can be in more than one category
Learn 118 binary category distinctions
Most common categories (#train, #test):
Earn (2877, 1087)
Acquisitions (1650, 179)
Money-fx (538, 179)
Grain (433, 149)
Crude (389, 189)
Trade (369, 119)
Interest (347, 131)
Ship (197, 89)
Wheat (212, 71)
Corn (182, 56)
Category: “interest”
[Decision tree for "interest", with node tests such as rate=1, rate.t=1, lending=0, prime=0, discount=0, pct=1, year=0, year=1]
Category: Interest
Example SVM feature weights w:
• 0.70 prime
• 0.67 rate
• 0.63 interest
• 0.60 rates
• 0.46 discount
• 0.43 bundesbank
• 0.43 baker
• -0.71 dlrs
• -0.35 world
• -0.33 sees
• -0.25 year
• -0.24 group
• -0.24 dlr
• -0.24 january
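To illustrate how these weights are used, a sketch that scores a document by summing the listed w_j over the binary features present; the decision threshold of 0 is an illustrative assumption:

```python
# Sketch: scoring a document with the listed "interest" SVM feature weights.
# A decision threshold of 0 is assumed here for illustration.
weights = {
    "prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
    "discount": 0.46, "bundesbank": 0.43, "baker": 0.43,
    "dlrs": -0.71, "world": -0.35, "sees": -0.33, "year": -0.25,
    "group": -0.24, "dlr": -0.24, "january": -0.24,
}

def svm_score(tokens):
    present = set(tokens)
    return sum(w for term, w in weights.items() if term in present)

doc = "the bundesbank left the discount rate unchanged".split()
score = svm_score(doc)
print(score, "interest" if score > 0 else "not interest")   # 1.56 -> interest
```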
Accuracy Scores
Based on a contingency table:
              Truth: Yes   Truth: No
System: Yes       a            b
System: No        c            d
Effectiveness measure for binary classification
error rate = (b+c)/n
accuracy = 1 - error rate
precision (P) = a/(a+b)
recall (R) = a/(a+c)
break-even = (P+R)/2
F measure = 2PR/(P+R)
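A sketch computing these measures from the contingency counts a, b, c, d; the counts in the example call are made up:

```python
# Sketch: the effectiveness measures above, computed from contingency counts a, b, c, d.
def measures(a, b, c, d):
    n = a + b + c + d
    p = a / (a + b)            # precision
    r = a / (a + c)            # recall
    return {
        "error rate": (b + c) / n,
        "accuracy": 1 - (b + c) / n,
        "precision": p,
        "recall": r,
        "break-even": (p + r) / 2,
        "F": 2 * p * r / (p + r),
    }

print(measures(a=80, b=20, c=40, d=860))   # illustrative counts
```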
Reuters - Accuracy ((R+P)/2)

             Findsim   NBayes   BayesNets   Trees   LinearSVM
earn          92.9%     95.9%     95.8%     97.8%     98.0%
acq           64.7%     87.8%     88.3%     89.7%     93.6%
money-fx      46.7%     56.6%     58.8%     66.2%     74.5%
grain         67.5%     78.8%     81.4%     85.0%     94.6%
crude         70.1%     79.5%     79.6%     85.0%     88.9%
trade         65.1%     63.9%     69.0%     72.5%     75.9%
interest      63.4%     64.9%     71.3%     67.1%     77.7%
ship          49.2%     85.4%     84.4%     74.2%     85.6%
wheat         68.9%     69.7%     82.7%     92.5%     91.8%
corn          48.2%     65.3%     76.4%     91.8%     90.3%
Avg Top 10    64.6%     81.5%     85.0%     88.4%     92.0%
Avg All Cat   61.7%     75.2%     80.0%     N/A       87.0%

Recall: % labeled in category among those stories that are really in category
Precision: % really in category among those stories labeled in category
Break Even: (Recall + Precision) / 2
ROC Curves by Category
[Precision vs. recall plots for the categories Grain, Earn, Acq, Money-Fx, Crude, Trade, Interest, Ship, Wheat, and Corn; each plot compares Linear SVM, Decision Tree, Naïve Bayes, and Find Similar]
Recall: % labeled in category among those stories that are really in category
Precision: % really in category among those stories labeled in category
SVM: Dumais et al. vs. Joachims
Top 10 Reuters categories, microavg BE

                     Dumais et al.           Joachims
Accuracy             92.0                    84.2; 86.5 (rbf, g=.8)
Features, type       binary                  tf*idf
Features, number     300/category            9947
Stemming             none                    stemmed
Title/Body           distinguished           ?
Number of categs     118                     90
Split                ModApte (9603/3299)     ModApte (9603/3299)
Reuters - Sample Size (SVM)

Sample set 1:
                   100%               10%                5%                 1%
category       samp sz (p+r)/2   samp sz (p+r)/2   samp sz (p+r)/2   samp sz (p+r)/2
0-acq            2876   98.3%      281    97.8%      145    97.4%       35    93.6%
1-earn           1650   97.0%      162    94.6%       80    90.4%       14    65.6%
3-money-fx        538   80.2%       55    66.3%       28    63.9%        3    41.9%
4-grain           433   95.9%       46    91.5%       21    87.0%        3    50.3%
5-crude           389   90.4%       45    82.9%       18    76.9%        3    ??
6-trade           369   80.9%       40    78.2%       21    76.4%        2    12.0%
7-interest        347   79.9%       32    68.4%       17    55.3%        2    50.8%
8-ship            197   85.5%       20    57.4%       11    53.9%        2    ??
9-wheat           212   92.5%       24    84.8%       11    65.7%        2    50.7%
10-corn           182   93.0%       23    78.2%        9    60.3%        1    50.9%
microtop10              93.9%              89.7%              86.0%             70.3%

Sample set 2:
                   100%               10%                5%                 1%
category       samp sz (p+r)/2   samp sz (p+r)/2   samp sz (p+r)/2   samp sz (p+r)/2
0-acq            2876   98.3%      264    97.6%      139    97.3%       26    94.6%
1-earn           1650   97.0%      184    94.1%       79    90.6%       16    73.5%
3-money-fx        538   80.2%       62    72.9%       29    71.0%        5    49.9%
4-grain           433   95.9%       40    85.8%       28    88.6%        7    76.5%
5-crude           389   90.4%       35    81.6%       15    65.6%        2    ??
6-trade           369   80.9%       41    80.1%       22    72.7%        5    51.6%
7-interest        347   79.9%       30    71.8%       15    63.0%        3    45.1%
8-ship            197   85.5%       21    62.8%       10    54.5%        2    50.6%
9-wheat           212   92.5%       19    80.1%       15    80.1%        5    68.3%
10-corn           182   93.0%       16    76.6%       11    69.6%        4    43.4%
microtop10              93.9%              89.6%              86.4%             75.5%
Reuters - Other Experiments
Simple words vs. NLP-derived phrases
NLP-derived phrases
factoids (April_8, Salomon_Brothers_International)
multi-word dictionary entries (New_York, interest_rate)
noun phrases (first_quarter, modest_growth)
No advantage for Find Similar, Naïve Bayes, SVM
Vary number of features
Binary vs. 0/1/2 features
No advantage of 0/1/2 for Decision Trees, SVM
Number of Features - NLP
[Plot: break-even point (0.75-0.95) vs. number of features (0-2000) for plain words, NLP phrases, and LF; series: Words-all, Words-10, Words+Phrases-all, Words+Phrases-10, Words+Phrases+LF-all, Words+Phrases+LF-10]
Reuters Summary
Accurate classifiers can be learned automatically from training examples
Linear SVMs are efficient and provide very good classification accuracy
Best results for this test collection
Widely applicable, flexible, and adaptable representations
Text Classification Horizon
Other applications
OHSUMED, TREC, spam vs. not-spam, Web
Text representation enhancements
Use of hierarchical category structure
UI for semi-automatic classification
Dynamic interests
More Information
General stuff http://research.microsoft.com/~sdumais
SMO http://research.microsoft.com/~jplatt
Optimization Problem to Train SVMs
Let y_i = desired output of the i-th example (+1/-1)
Let u_i = actual output: u_i = \vec{w} \cdot \vec{x}_i - b
Margin = 1 / \|\vec{w}\|
Optimization: \min \tfrac{1}{2} \|\vec{w}\|^2 \quad \text{s.t.} \quad y_i u_i \geq 1 \;\; \forall i
Dual Quadratic Programming Problem
Equivalent dual problem: Lagrangian saddle point
One-to-one relationship between Lagrange multipliers α_n and examples
Lagrangian: L = \tfrac{1}{2}\|\vec{w}\|^2 - \sum_j \alpha_j \left[ y_j (\vec{w}\cdot\vec{x}_j - b) - 1 \right], \quad \alpha_j \geq 0
Dual: \min_{\alpha} \; \tfrac{1}{2} \sum_{n,m} y_n y_m (\vec{x}_n \cdot \vec{x}_m)\, \alpha_n \alpha_m - \sum_n \alpha_n
subject to \alpha_n \geq 0 \text{ and } \sum_n y_n \alpha_n = 0
Solution: \vec{w} = \sum_n y_n \alpha_n \vec{x}_n, \quad b = \vec{w}\cdot\vec{x}_m - y_m \text{ for any } \alpha_m > 0