Statistical Methods for
Text Mining
David Madigan
Rutgers University & DIMACS
www.stat.rutgers.edu/~madigan
David D. Lewis
www.daviddlewis.com
joint work with Alex Genkin, Vladimir Menkov, Aynur Dayanik,
Dmitriy Fradkin
Statistical Analysis of Text
•Statistical text analysis has a long history in literary analysis
and in solving disputed authorship problems
•The first (?) such study was by Thomas C. Mendenhall in 1887
Mendenhall
•Mendenhall was Professor of Physics at Ohio State and
at the University of Tokyo, Superintendent of the U.S. Coast
and Geodetic Survey, and later, President of Worcester
Polytechnic Institute
[Photo: Mendenhall Glacier, Juneau, Alaska]
[Figure: $\chi^2 = 127.2$, df = 12]
•Used Naïve Bayes with Poisson and Negative Binomial models
•Out-of-sample predictive performance
Today
• Statistical methods routinely used for textual
analyses of all kinds
• Machine translation, part-of-speech tagging,
information extraction, question-answering,
text categorization, etc.
• Not reported in the statistical literature (no
statisticians?)
Outline
• Part-of-Speech Tagging, Entity Recognition
• Text categorization
• Logistic regression and friends
• The richness of Bayesian regularization
• Sparseness-inducing priors
• Word-specific priors: stop words, IDF, domain
knowledge, etc.
• Polytomous logistic regression
Part-of-Speech Tagging
• Assign grammatical tags to words
• Basic task in the analysis of natural language
data
• Phrase identification, entity extraction, etc.
• Ambiguity: “tag” could be a noun or a verb
• “a tag is a part-of-speech label” – context
resolves the ambiguity
The Penn Treebank POS Tag Set
POS Tagging Process
[Figure credit: Berlin Chen]
POS Tagging Algorithms
• Rule-based taggers: large numbers of hand-crafted
rules
• Probabilistic taggers: use a tagged corpus to train
some sort of model, e.g., an HMM.
[Diagram: HMM with hidden tag sequence tag1 → tag2 → tag3 emitting word1, word2, word3]
• clever tricks for reducing the number of parameters (aka priors)
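To make the HMM idea above concrete, here is a minimal bigram-tagger sketch in Python: transition and emission probabilities are estimated by counting over a tiny hand-made corpus (the sentences, tags, and smoothing constants are invented for illustration, not from the Brown Corpus), and Viterbi search recovers the most probable tag sequence.

```python
from collections import defaultdict

tagged_corpus = [  # tiny invented training corpus of (word, tag) pairs
    [("the", "DT"), ("tag", "NN"), ("is", "VBZ"), ("red", "JJ")],
    [("they", "PRP"), ("tag", "VBP"), ("the", "DT"), ("photo", "NN")],
]

trans = defaultdict(lambda: defaultdict(int))  # tag -> next-tag counts
emit = defaultdict(lambda: defaultdict(int))   # tag -> word counts
for sent in tagged_corpus:
    prev = "<s>"
    for word, tag in sent:
        trans[prev][tag] += 1
        emit[tag][word] += 1
        prev = tag

def prob(table, given, outcome, smooth=1e-3):
    # add-a-little smoothing so unseen transitions/emissions keep nonzero probability
    total = sum(table[given].values())
    return (table[given][outcome] + smooth) / (total + smooth * 50)

def viterbi(words, tags):
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (prob(trans, "<s>", t) * prob(emit, t, words[0]), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            score, path = max(
                ((best[p][0] * prob(trans, p, t) * prob(emit, t, w), best[p][1] + [t])
                 for p in tags),
                key=lambda cand: cand[0],
            )
            new[t] = (score, path)
        best = new
    return max(best.values(), key=lambda cand: cand[0])[1]

all_tags = sorted({t for sent in tagged_corpus for _, t in sent})
print(viterbi(["they", "tag", "the", "photo"], all_tags))
```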
some details…
Charniak et al., 1993, achieved 95% accuracy on the Brown Corpus with:
$p(t_i \mid w_j) = \dfrac{\text{number of times word } w_j \text{ appears with tag } t_i}{\text{number of times word } w_j \text{ appears}}$

and, for previously unseen words,

$p(t_i \mid \text{new word}) = \dfrac{\text{number of times a word that had never been seen with tag } t_i \text{ gets tag } t_i}{\text{number of such occurrences in total}}$

plus a modification that uses word suffixes.
Recent Developments
• Toutanova et al., 2003,
use a dependency
network and richer
feature set
• Log-linear model for $t_i \mid t_{-i}, \mathbf{w}$
• Model included, for example, a feature for whether the word
contains a number, uppercase characters, hyphen, etc.
• Regularization of the estimation process critical
• 96.6% accuracy on the Penn corpus
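Word-shape features of the kind listed above (digits, uppercase characters, hyphens, suffixes) are cheap to compute; a small illustrative sketch (the feature names are mine, not Toutanova et al.'s):

```python
import re

def word_shape_features(word):
    """Illustrative word-shape features of the kind used in rich-feature taggers."""
    return {
        "contains_digit": bool(re.search(r"\d", word)),
        "contains_upper": any(c.isupper() for c in word),
        "contains_hyphen": "-" in word,
        "all_caps": word.isupper(),
        "suffix_3": word[-3:].lower(),
    }

print(word_shape_features("Mid-1990s"))
# {'contains_digit': True, 'contains_upper': True, 'contains_hyphen': True,
#  'all_caps': False, 'suffix_3': '90s'}
```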
Named-Entity Classification
• “Mrs. Frank” is a person
• “Steptoe and Johnson” is a company
• “Honduras” is a location
• etc.
• Bikel et al. (1998) from BBN: “Nymble,” a
statistical approach using HMMs
[Diagram: HMM with hidden name-class sequence nc1 → nc2 → nc3 emitting word1, word2, word3]

$P(w_i \mid w_{i-1}, nc_i, nc_{i-1}) =
\begin{cases}
P(w_i \mid w_{i-1}, nc_i) & \text{if } nc_i = nc_{i-1} \\
P(w_i \mid nc_i, nc_{i-1}) & \text{if } nc_i \neq nc_{i-1}
\end{cases}$
• “Name classes”: Not-A-Name, Person, Location, etc.
• Smoothing for sparse training data + word features
• Training = 100,000 words from WSJ
• Accuracy = 93%
• 450,000 words → same accuracy
• Training-development-test
Text Categorization
•Automatic assignment of documents with respect to a
manually defined set of categories
•Applications: automated indexing, spam filtering,
content filters, medical coding, CRM, essay grading
•Dominant technology is supervised machine
learning:
Manually classify some documents, then learn a
classification rule from them (possibly with manual
intervention)
Terminology, etc.
•Binary versus Multi-Class
•Single-Label versus Multi-Label
•Document representation via “bag of words”:
$d = (w_1, \ldots, w_N)$, where $N \approx 10^4$–$10^5$
•$w_i$'s might be 0/1, counts, or weights (e.g., tf/idf, LSI)
•Phrases, syntactic information, synonyms, NLP, etc.?
•Stopwords, stemming
Test Collections
•Reuters-21578
•9603 training, 3299 test, 90 categories, ~multi-label
•New Reuters – 800,000 documents
•Medline – 11,000,000 documents; MeSH headings
•TREC conferences and collections
•Newsgroups, WebKB
Reuters Evaluation
•Binary classifiers:

              true 0   true 1
  predict 0     a        b
  predict 1     c        d

  recall = d/(b+d)   (“sensitivity”)
  precision = d/(c+d)   (“predictive value positive”)

•Multiple binary classifiers:

              cat 1              cat 2
              true   predict     true   predict
  test doc 1    1       1          1       1
  test doc 2    0       0          0       1

              p = 1.0, r = 1.0   p = 0.5, r = 1.0

  macro-precision = (1.0 + 0.5)/2 = 0.75
  micro-averaged precision = 2/3

•F1 measure: harmonic mean of precision and recall
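A short sketch that reproduces the averaging arithmetic in the example above (two test documents, two categories):

```python
# True and predicted labels per category, matching the worked example:
# cat 1 is predicted perfectly; cat 2 has one false positive on test doc 2.
truth = {"cat1": [1, 0], "cat2": [1, 0]}
preds = {"cat1": [1, 0], "cat2": [1, 1]}

def precision(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    pp = sum(p == 1 for p in y_pred)
    return tp / pp if pp else 0.0

# Macro: average the per-category precisions.
macro = sum(precision(truth[c], preds[c]) for c in truth) / len(truth)

# Micro: pool the counts over categories first, then compute one precision.
tp = sum(t == 1 and p == 1 for c in truth for t, p in zip(truth[c], preds[c]))
pp = sum(p == 1 for c in preds for p in preds[c])
micro = tp / pp

print(macro, micro)   # 0.75 and 0.666...
```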
Reuters Results

  Model           F1
  AdaBoost.MH     0.86
  SVM             0.84–0.87
  k-NN            0.82–0.86
  Neural Net      0.84
  “Naïve Bayes”   0.72–0.78
  Rocchio         0.62–0.76
Naïve Bayes

[Diagram: class node X0 with arrows to feature nodes X1, X2, ..., Xp]
•Naïve Bayes for document classification dates back
to the early 1960’s
•The NB model assumes features are conditionally
independent given class
•Estimation is simple; scales well
•Empirical performance usually not bad
•High bias-low variance (Friedman, 1997; Domingos
& Pazzani, 1997)
Poisson NB

[Diagram: class node with arrows to count-valued feature nodes Xc1, Xc2, ..., Xcp]

$X_{cj} \sim \mathrm{Poisson}(\lambda_{cj})$
•Natural extension of binary model to word
frequencies
•ML-equivalent to the multinomial model with Poisson-distributed document length
•Bayesian equivalence requires constraints on conjugate
priors (Poisson NB has 2p hyper-parameters per class;
Multinomial-Poisson has p+2)
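A minimal sketch of Poisson Naive Bayes scoring under the model above, with invented per-class rates $\lambda_{cj}$ (in practice these would be estimated, e.g., as mean word counts per class):

```python
import numpy as np
from scipy.stats import poisson

# Toy setup: 2 classes, 3 word features; rows are per-class Poisson rates.
lam = np.array([[0.5, 2.0, 0.1],    # class 0 rates
                [1.5, 0.2, 0.3]])   # class 1 rates
log_prior = np.log(np.array([0.6, 0.4]))

def poisson_nb_log_posterior(x):
    """log P(c) + sum_j log Poisson(x_j; lam_cj), up to a constant in c."""
    log_lik = poisson.logpmf(x, lam).sum(axis=1)   # broadcasts x over both classes
    return log_prior + log_lik

x = np.array([2, 0, 1])               # word counts for one document
scores = poisson_nb_log_posterior(x)
print(scores, scores.argmax())        # predicted class = argmax
```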
Poisson NB - Reuters

  Model                         µPrecision   µRecall
  SVM                           0.89         0.84
  Multinomial                   0.78         0.76
  Poisson NB                    0.67         0.66
  Multinomial + logspline       0.79         0.76
  Multinomial + negative bin.   0.78         0.75
  Negative Binomial NB          0.77         0.76

Different story for FAA dataset
overdispersion
AdaBoost.MH
•Multiclass-Multilabel
•At each iteration, learns a simple score-producing classifier on
weighted training data and then updates the weights
•Final decision averages over the classifiers
Data (class labels):
              A      B      C      D
  doc 1      +1     +1     -1     -1

Initial weights:
              A      B      C      D
  doc 1     0.25   0.25   0.25   0.25

Score from simple classifier:
              A      B      C      D
  doc 1       2     -2     -1     0.1

Revised weights:
              A      B      C      D
  doc 1     0.02   0.82   0.04   0.12
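The revised weights in the example follow the usual exponential update: each (document, class) weight is multiplied by exp(-label × score) and the weights are renormalized. A sketch reproducing the numbers above:

```python
import numpy as np

labels = np.array([+1, +1, -1, -1])        # doc 1's labels for classes A, B, C, D
scores = np.array([2.0, -2.0, -1.0, 0.1])  # weak classifier's scores
weights = np.full(4, 0.25)                 # initial weights

# Exponential update: up-weight (doc, class) pairs the weak learner got wrong.
weights = weights * np.exp(-labels * scores)
weights /= weights.sum()
print(weights.round(2))   # approximately [0.02, 0.82, 0.04, 0.12]
```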
AdaBoost.MH
Schapire and Singer, 2000
AdaBoost.MH’s weak learner is a stump
two words!
AdaBoost.MH Comments
•Software implementation: BoosTexter
•Some theoretical support in terms of bounds on
generalization error
•3 days of CPU time for Reuters with 10,000 boosting
iterations
Document Representation
•Documents usually represented as “bag of words:”
$\mathbf{x}_i = (x_{i1}, \ldots, x_{ij}, \ldots, x_{id})$
•xi’s might be 0/1, counts, or weights (e.g. tf/idf, LSI)
•Many text processing choices: stopwords, stemming,
phrases, synonyms, NLP, etc.
Classifier Representation
•For instance, linear classifier:
IF $\sum_j \beta_j x_{ij} \geq \tau$, THEN $y_i = +1$; ELSE $y_i = -1$
• xi’s derived from text of document
• yi indicates whether to put document in category
• βj are parameters chosen to give good classification
effectiveness
Logistic Regression Model
•Linear model for log odds of category membership:
$\ln \dfrac{P(y_i = +1 \mid \mathbf{x}_i)}{P(y_i = -1 \mid \mathbf{x}_i)} = \sum_j \beta_j x_{ij} = \boldsymbol{\beta}^T \mathbf{x}_i$

• Equivalent to

$P(y_i = +1 \mid \mathbf{x}_i) = \dfrac{e^{\boldsymbol{\beta}^T \mathbf{x}_i}}{1 + e^{\boldsymbol{\beta}^T \mathbf{x}_i}}$
• Conditional probability model
Logistic Regression as a Linear
Classifier
•If estimated probability of category membership is
greater than p, assign document to category:
IF $\sum_j \beta_j x_{ij} > \ln \dfrac{p}{1-p}$, THEN $y_i = +1$
•Choose p to optimize expected value of your
effectiveness measure (may need different form of
test)
•Can change measure w/o changing model
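A small sketch of the decision rule: thresholding the estimated probability at p is the same as thresholding the linear score at the log-odds ln(p/(1-p)). The coefficients, document vector, and cutoff below are invented:

```python
import numpy as np

beta = np.array([0.8, -1.2, 0.3])      # toy coefficients
x = np.array([1.0, 0.5, 2.0])          # toy document vector
p_threshold = 0.7                      # desired probability cutoff

score = beta @ x
prob = 1.0 / (1.0 + np.exp(-score))

# Thresholding the probability at p is the same as thresholding the
# linear score at the log-odds ln(p / (1 - p)).
decide_by_prob = prob > p_threshold
decide_by_score = score > np.log(p_threshold / (1 - p_threshold))
print(prob, decide_by_prob, decide_by_score)   # the two decisions agree
```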
Maximum Likelihood Training
• Choose parameters (βj's) that maximize
probability (likelihood) of class labels (yi's)
given documents (xi’s)
$\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} \sum_i \left( -\ln\left(1 + \exp(-\boldsymbol{\beta}^T \mathbf{x}_i \, y_i)\right) \right)$
• Maximizing (log-)likelihood can be viewed as
minimizing a loss function
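A sketch of that loss, computed directly from the formula above on toy data (labels coded as ±1):

```python
import numpy as np

def logistic_loss(beta, X, y):
    """Negative log-likelihood sum_i ln(1 + exp(-y_i * beta^T x_i)); y in {-1, +1}."""
    margins = y * (X @ beta)
    # np.logaddexp(0, -m) = ln(1 + exp(-m)), computed stably
    return np.logaddexp(0.0, -margins).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                  # toy design matrix
y = np.array([1, -1, 1, 1, -1])
print(logistic_loss(np.zeros(3), X, y))      # = 5 * ln 2 at beta = 0
```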
[Figure: loss functions; credit: Hastie, Friedman & Tibshirani]
Shrinkage Methods
► Subset selection is a discrete process – individual variables
are either in or out. Combinatorial nightmare.
► This method can have high variance – a different dataset
from the same source can result in a totally different
model
► Shrinkage methods allow a variable to be partly included in
the model. That is, the variable is included but with a
shrunken coefficient
► Elegant way to tackle over-fitting
Ridge Regression
$\hat{\boldsymbol{\beta}}^{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Bigr)^2$

subject to: $\sum_{j=1}^{p} \beta_j^2 \leq s$

Equivalently:

$\hat{\boldsymbol{\beta}}^{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \Bigl\{ \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Bigr)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Bigr\}$

This leads to:

$\hat{\boldsymbol{\beta}}^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$

Choose $\lambda$ by cross-validation. (Works even when $X^T X$ is singular.)
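The closed form is a one-liner in practice; a sketch on synthetic data, with λ fixed arbitrarily rather than chosen by cross-validation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

lam = 1.0                                               # ridge penalty
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# As lam -> 0 this approaches ordinary least squares (when X^T X is invertible);
# for lam > 0 the system is well-posed even if X^T X is singular.
print(beta_ridge.round(3))
```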
[Plot: Posterior Modes with Varying Hyperparameter (Gaussian prior): coefficient paths for intercept, npreg, glu, bp, skin, bmi/100, ped, and age/100 as functions of tau (0 to 0.3)]
Ridge Regression = Bayesian MAP Regression
►Suppose we believe each βj is a small value near 0
►Encode this belief as separate Gaussian probability
distributions over values of βj
►Choosing the maximum a posteriori value of β gives the
same result as ridge regression
$y_i \sim N(\beta_0 + \mathbf{x}_i^T \boldsymbol{\beta}, \; \sigma^2)$
$\beta_j \sim N(0, \tau^2)$
Same as ridge with $\lambda = \sigma^2 / \tau^2$
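Spelling the equivalence out: with the Gaussian likelihood and prior above, the negative log posterior is, up to additive constants,

$$-\log p(\boldsymbol{\beta} \mid y, X) = \frac{1}{2\sigma^2} \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \mathbf{x}_i^T \boldsymbol{\beta} \Bigr)^2 + \frac{1}{2\tau^2} \sum_{j=1}^{p} \beta_j^2 + \text{const},$$

so maximizing the posterior is the same as minimizing the ridge criterion with $\lambda = \sigma^2 / \tau^2$.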
Least Absolute Shrinkage & Selection Operator (LASSO)
$\hat{\boldsymbol{\beta}}^{\text{lasso}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Bigr)^2$

subject to: $\sum_{j=1}^{p} |\beta_j| \leq s$

A quadratic programming algorithm is needed to solve for the parameter
estimates.
$\tilde{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \Bigl\{ \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \Bigr\}$

q = 0: variable selection
q = 1: lasso
q = 2: ridge
Learn q?
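One quick way to see the q = 1 vs. q = 2 contrast is to compare coefficient sparsity on synthetic data; a sketch using scikit-learn (assuming it is installed; the penalty strengths are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 50))
beta_true = np.zeros(50)
beta_true[:5] = [3, -2, 1.5, -1, 0.5]          # only 5 features matter
y = X @ beta_true + 0.5 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("lasso nonzero coefs:", np.sum(lasso.coef_ != 0))   # typically far fewer than 50
print("ridge nonzero coefs:", np.sum(ridge.coef_ != 0))   # all 50: ridge shrinks but does not zero out
```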
[Plot: Posterior Modes with Varying Hyperparameter (Laplace prior): coefficient paths for intercept, npreg, glu, bp, skin, bmi/100, ped, and age/100 as functions of lambda (120 down to 0)]
Ridge & LASSO - Theory
► Lasso estimates are consistent
► But, the lasso does not have the “oracle property.” That is, it
does not recover the correct model with probability tending to 1
► Fan & Li’s SCAD penalty function has the oracle property
LARS (Least Angle Regression)
► New geometrical insights into Lasso and “Stagewise”
► Leads to a highly efficient Lasso algorithm for linear
regression
LARS
► Start with all coefficients bj = 0
► Find the predictor xj most correlated with y
► Increase bj in the direction of the sign of its correlation
with y. Take residuals r=y-yhat along the way. Stop
when some other predictor xk has as much correlation
with r as xj has
► Increase (bj,bk) in their joint least squares direction
until some other predictor xm has as much correlation
with the residual r.
► Continue until all predictors are in the model
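scikit-learn exposes this procedure as lars_path; a small sketch (assuming scikit-learn is available) on synthetic data:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 20))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.3 * rng.normal(size=80)

# method="lar" follows the plain LARS steps described above;
# method="lasso" adds the modification that yields the full lasso path.
alphas, active, coefs = lars_path(X, y, method="lar")

print("active predictors at the end of the path:", active)
print("coefficient path shape (n_features, n_steps):", coefs.shape)
```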
Zhang & Oles Results

  Model                             F1
  Naïve Bayes                       0.852
  Ridge Logistic Regression + FS    0.914
  SVM                               0.911

•Reuters-21578 collection
•Ridge logistic regression plus feature selection (FS)
Bayes!
• MAP logistic regression with Gaussian prior gives
state of the art text classification effectiveness
• But Bayesian framework more flexible than SVM
for combining knowledge with data:
– Feature selection
– Stopwords, IDF
– Domain knowledge
– Number of classes
• (and kernels.)
Data Sets
• ModApte subset of Reuters-21578
– 90 categories; 9603 training docs; 18978 features
• Reuters RCV1-v2
– 103 cats; 23149 training docs; 47152 features
• OHSUMED heart disease categories
– 77 cats; 83944 training docs; 122076 features
• Cosine normalized TFxIDF weights
Dense vs. Sparse Models
(Macroaveraged F1, Preliminary)

  Model       ModApte   RCV1-v2   OHSUMED
  Lasso       52.03     56.54     51.30
  Ridge       39.71     51.40     42.99
  Ridge/500   38.82     46.27     36.93
  Ridge/50    45.80     41.61     42.59
  Ridge/5     46.20     28.54     41.33
  SVM         53.75     57.23     50.58
[Boxplots: log(Number of Errors + 1) plus jitter for ridge, lasso, and SVM on ModApte (90 categories), RCV1-v2 (103 categories), and OHSUMED (77 categories)]
[Histograms: number of categories vs. number of features with non-zero posterior mode, for ModApte (21,989 features), RCV1 (47,152 features), and OHSUMED (122,076 features)]
Bayesian Unsupervised Feature
Selection and Weighting
• Stopwords: low-content words that typically
are discarded
– Give them a prior with mean 0 and low variance
• Inverse document frequency (IDF) weighting
– Rare words more likely to be content indicators
– Make variance of prior inversely proportional to
frequency in collection (see the sketch below)
• Experiments in progress
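One way to realize "variance inversely proportional to frequency" is to set each word's prior variance from its document frequency; a sketch with invented counts and an arbitrary scaling constant:

```python
import numpy as np

n_docs = 10000
doc_freq = np.array([9500, 4200, 310, 42, 7])   # hypothetical document frequencies
vocab = ["the", "said", "profit", "dividend", "agribusiness"]

# Low-variance prior for frequent (stopword-like) terms pins their coefficients near 0;
# rare terms get a wider prior and can earn larger weights from the data.
idf = np.log(n_docs / doc_freq)
prior_var = 0.1 * idf            # arbitrary scale; prior mean stays at 0

for w, v in zip(vocab, prior_var):
    print(f"{w:>14s}  prior variance {v:.3f}")
```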
Bayesian Use of Domain
Knowledge
• Often believe that certain words are positively
or negatively associated with category
• Prior mean can encode strength of positive
or negative association
• Prior variance encodes confidence
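A sketch of MAP logistic regression with a per-feature Gaussian prior whose means and variances encode such beliefs (the data and prior numbers below are invented for illustration; this is not the authors' BBR implementation):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 3))                 # toy documents, 3 features
y = np.where(X[:, 0] - X[:, 2] + 0.2 * rng.normal(size=30) > 0, 1, -1)

prior_mean = np.array([1.0, 0.0, -0.5])      # believed positive / neutral / negative
prior_var = np.array([0.5, 0.1, 2.0])        # larger variance = less confidence in the belief

def neg_log_posterior(beta):
    margins = y * (X @ beta)
    nll = np.logaddexp(0.0, -margins).sum()              # logistic loss
    penalty = ((beta - prior_mean) ** 2 / (2 * prior_var)).sum()
    return nll + penalty

beta_map = minimize(neg_log_posterior, x0=np.zeros(3)).x
print(beta_map.round(2))
```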
First Experiments
• 27 RCV1-v2 Region categories
• CIA World Factbook entry for country
– Give content words higher mean and/or variance
• Only 10 training examples per category
– Shows off prior knowledge
– Limited data often the case in applications
Results (Preliminary)

  Model                        Macro F1   ROC
  Gaussian w/ standard prior   0.242      87.2
  Gaussian w/ DK prior #1      0.608      91.2
  Gaussian w/ DK prior #2      0.542      90.0
Polytomous Logistic Regression
• Logistic regression trivially generalizes to 1-of-k
problems
– Cleaner than SVMs, error correcting codes, etc.
• Laplace prior particularly cool here:
– Suppose 99 classes and a word that predicts class 17
– Word gets used 100 times if we build 100 models, or if we use
polytomous with a Gaussian prior
– With a Laplace prior and polytomous, it's used only once
• Experiments in progress, particularly on author id
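A sketch of the 1-of-k setup with an L1 (Laplace-like) penalty, using scikit-learn's multinomial logistic regression as a stand-in for the authors' BMR software (toy data; the sparsity pattern is invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 40))               # toy "documents"
true_w = rng.normal(size=(40, 4))
true_w[8:] = 0.0                             # only the first 8 features matter
y = (X @ true_w).argmax(axis=1)              # 4 classes

# L1-penalized multinomial ("polytomous") logistic regression; the saga solver
# handles the L1 penalty. C is the inverse penalty strength.
clf = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000).fit(X, y)

nonzero = np.sum(clf.coef_ != 0)
print(f"{nonzero} of {clf.coef_.size} class-feature weights are nonzero")
```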
1-of-K Sample Results: brittany-l

  Feature Set                          % errors   Number of Features
  “Argamon” function words, raw tf     74.8         380
  POS                                  75.1          44
  1suff                                64.2         121
  1suff*POS                            50.9         554
  2suff                                40.6        1849
  2suff*POS                            34.9        3655
  3suff                                28.7        8676
  3suff*POS                            27.9       12976
  3suff+POS+3suff*POS+Argamon          27.6       22057
  All words                            23.9       52492

89 authors with at least 50 postings. 10,076 training documents, 3,322 test documents.
BMR-Laplace classification, default hyperparameter. The all-words model has 4.6 million parameters.
Future
• Choose exact number of features desired
• Faster training algorithm for polytomous
– Currently using cyclic coordinate descent
• Hierarchical models
– Sharing strength among categories
– Hierarchical relationships among features
• Stemming, thesaurus classes, phrases, etc.
Text Categorization Summary
• Conditional probability models (logistic, probit, etc.)
• As powerful as other discriminative models
(SVM, boosting, etc.)
• Bayesian framework provides much richer ability
to insert task knowledge
• Code: http://stat.rutgers.edu/~madigan/BBR
• Polytomous, domain-specific priors soon
The Last Slide
• Statistical methods for text mining work well on
certain types of problems
• Many problems remain unsolved
•Which financial news stories are likely to impact
the market?
•Where did soccer originate?
•Attribution
Approximate Online Sparse Bayes
Shooting algorithm (Fu, 1998)