Distributional Clustering of Words for Text Classification

L. Douglas Baker (Carnegie Mellon University)
Andrew Kachites McCallum (Justsystem Pittsburgh Research Center)
Presentation by: Thomas Walsh (Rutgers University)
Clustering
• Define what it means for words to be
“similar”.
• “Collapse” the word space by grouping similar words into “clusters”.
• Key Idea for Distributional Clustering:
– The class probabilities given each word in a labeled document collection, P(C|w), provide the rules for correlating words with classifications (a small estimation sketch follows below).
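A minimal sketch (my own illustration, not the authors' code) of estimating P(C|w) by counting how often each word appears in documents of each class; the toy documents and labels are made up.

```python
from collections import Counter, defaultdict

def class_given_word(labeled_docs):
    """Estimate P(C|w) by counting how often each word occurs in each class.

    labeled_docs: iterable of (list_of_words, class_label) pairs.
    Returns {word: {class: probability}}.
    """
    counts = defaultdict(Counter)              # word -> class -> count
    for words, label in labeled_docs:
        for w in words:
            counts[w][label] += 1
    return {w: {c: n / sum(per_class.values()) for c, n in per_class.items()}
            for w, per_class in counts.items()}

docs = [(["nasa", "orbit", "launch"], "space"),
        (["goal", "match", "launch"], "sports")]
print(class_given_word(docs)["launch"])        # {'space': 0.5, 'sports': 0.5}
```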
Voting
• Text classification with word features can be understood through a voting model:
• Each word in a document casts a weighted vote for the classification.
• Words that normally vote alike can be clustered together and cast the average of their weighted votes without hurting performance (see the sketch below).
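A toy sketch of this voting view, with made-up vote weights: each word adds its weighted votes to the class totals, and after clustering the members of a cluster share one averaged vote profile.

```python
from collections import defaultdict

def vote(document, word_weights):
    """Each word casts its weighted votes; the class with the highest total wins."""
    scores = defaultdict(float)
    for w in document:
        for c, weight in word_weights.get(w, {}).items():
            scores[c] += weight
    return max(scores, key=scores.get)

def cluster_vote(members, word_weights):
    """A cluster votes with the average of its members' weighted votes."""
    merged = defaultdict(float)
    for w in members:
        for c, weight in word_weights[w].items():
            merged[c] += weight / len(members)
    return dict(merged)

weights = {"nasa":  {"space": 2.0, "sports": 0.1},
           "orbit": {"space": 1.8, "sports": 0.2},
           "goal":  {"space": 0.1, "sports": 2.1}}
# After clustering, "nasa" and "orbit" share one averaged vote profile.
averaged = cluster_vote(["nasa", "orbit"], weights)
weights["nasa"] = weights["orbit"] = averaged
print(vote(["nasa", "goal", "orbit"], weights))   # -> 'space'
```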
Benefits of Word Clustering
• Useful semantic word clusters
– Automatically generates a “thesaurus”
• Higher classification accuracy
– Sort of; we’ll discuss this in the results section
• Smaller classification models
– Size reductions as dramatic as 50,000 → 50
Benefits of Smaller Models
• Easier to compute – with the constantly increasing amount of available text, reducing the memory footprint matters.
• Memory-constrained devices like PDAs could now run text classification algorithms to organize documents.
• More complex algorithms become feasible that would be intractable in 50,000 dimensions.
The Framework
• Start with training data consisting of:
– A set of classes C = {c1, c2, …, cm}
– A set of documents D = {d1, …, dn}
– Each document has a class label (a minimal representation sketch follows below)
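A minimal sketch of how this training data might be represented; the class names and documents are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LabeledDocument:
    words: list        # the document as a sequence of word events
    label: str         # its class label, one of C = {c1, ..., cm}

classes = ["space", "sports"]                                   # C
documents = [LabeledDocument(["nasa", "orbit"], "space"),       # D, with labels
             LabeledDocument(["goal", "match"], "sports")]
```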
Mixture Models
• f(xi | θ) = Σk πk h(xi | λk)
• The mixture weights πk sum to 1
• h is a distribution function for x (such as a Gaussian), with λk as its parameters ((μ, Σ) in the Gaussian case)
• Thus θ = (π1, …, πK, λ1, …, λK) (a small numeric sketch follows below)
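A small numeric sketch of the general mixture density with Gaussian components; the mixture weights and parameters below are made up for illustration.

```python
import math

def gaussian(x, mu, sigma):
    """Component density h(x | lambda_k), with lambda_k = (mu, sigma)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(x, mixture_weights, component_params):
    """f(x | theta) = sum_k pi_k * h(x | lambda_k); the pi_k must sum to 1."""
    assert abs(sum(mixture_weights) - 1.0) < 1e-9
    return sum(pi * gaussian(x, mu, sigma)
               for pi, (mu, sigma) in zip(mixture_weights, component_params))

# theta = (pi_1, pi_2, lambda_1, lambda_2), with made-up values
print(mixture_density(0.5, [0.3, 0.7], [(0.0, 1.0), (1.0, 0.5)]))
```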
What is θ in this case?
• Assumption: there is a one-to-one correspondence between the mixture model components and the classes.
• The class priors P(cj | θ) are contained in θ
• They are estimated as (instances of each class) / (total number of documents)
What is θ in this case? (cont.)
• The remaining entries of θ fall into disjoint sets; the jth set contains the probability of each word wt in the vocabulary V given the class cj.
• N(wt, di) is the number of times word wt appears in document di.
• P(cj | di) ∈ {0, 1}, since each training document carries exactly one class label (an estimation sketch using these counts follows below).
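A sketch of estimating these entries of θ from the counts N(wt, di) and the {0, 1} labels. I assume the usual add-one (Laplace) smoothing that Naive Bayes text classifiers commonly use, so treat the exact formula as an assumption rather than the paper's.

```python
from collections import Counter, defaultdict

def estimate_theta(labeled_docs, vocabulary):
    """Estimate the class priors and the word probabilities P(wt | cj).

    labeled_docs: iterable of (list_of_words, class_label); the labels play the
    role of P(cj | di) in {0, 1}.  Add-one (Laplace) smoothing keeps unseen
    words from getting zero probability.
    """
    word_counts = defaultdict(Counter)     # cj -> wt -> sum_i N(wt, di)
    doc_counts = Counter()                 # cj -> number of documents
    for words, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(words)

    priors = {c: n / sum(doc_counts.values()) for c, n in doc_counts.items()}
    word_probs = {c: {w: (1 + counts[w]) / (len(vocabulary) + sum(counts.values()))
                      for w in vocabulary}
                  for c, counts in word_counts.items()}
    return priors, word_probs

docs = [(["nasa", "orbit", "launch"], "space"), (["goal", "match"], "sports")]
vocab = {w for words, _ in docs for w in words}
priors, word_probs = estimate_theta(docs, vocab)
print(priors, word_probs["space"]["orbit"])
```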
Prob. of a given Document in the Model
• The mixture model can be used to produce documents with probability:
• P(di | θ) = Σj P(cj | θ) P(di | cj; θ)
• This is just the sum, over the classes, of the probability of generating the document from each class.
Documents as Collections of Words
• Treat each document as an ordered collection of word events.
• dik = the word occurring at position k of document di.
• Each word is (in general) dependent on the preceding words.
Apply Naïve Bayes Assumption
• Assume each word is independent of both its context and its position
• Write dik = wt for the word at position k
• Update formulas (2) and (1):
– (2) P(di | cj; θ) = Πk P(wdik | cj; θ)
– (1) P(di | θ) = Σj P(cj | θ) Πk P(wdik | cj; θ)
Incorporate Expanded Formulae for θ
• We can calculate the model parameters θ from the training data.
• Now we wish to calculate P(cj | di; θ), the probability that document di belongs to class cj.
Final Equation
• P(cj | di; θ) = [ P(cj | θ) Πk P(wdik | cj; θ) ] / [ Σr P(cr | θ) Πk P(wdik | cr; θ) ]
• Numerator: the class prior times (2), the product of the probabilities of each word in the document given class cj
• Denominator: the sum over all classes cr of the class prior times the product of the word probabilities given cr
• The cj that maximizes this quantity is the class assigned to the document (a classification sketch follows below)
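A sketch of this final classification rule: score each class by its prior times the product of word probabilities (in log space, for numerical stability) and pick the maximizer. The parameter values below are made up; this is an illustration, not the authors' implementation.

```python
import math

def classify(document, priors, word_probs):
    """argmax_cj  P(cj | theta) * prod_k P(w_dik | cj; theta), computed in log space."""
    best_class, best_score = None, -math.inf
    for c in priors:
        score = math.log(priors[c])
        for w in document:
            if w in word_probs[c]:              # ignore out-of-vocabulary words
                score += math.log(word_probs[c][w])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

priors = {"space": 0.5, "sports": 0.5}          # made-up parameters for illustration
word_probs = {"space":  {"orbit": 0.4, "goal": 0.1},
              "sports": {"orbit": 0.1, "goal": 0.4}}
print(classify(["orbit", "orbit", "goal"], priors, word_probs))   # -> 'space'
```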
Shortcomings of the Framework
• In real-world data (documents) there isn’t actually an underlying mixture model, and the independence assumption doesn’t actually hold.
• But empirical evidence and some theoretical work (Domingos and Pazzani 1997) indicate that the damage from this is negligible.
What about clustering?
• So assuming the Framework holds… how
does clustering fit into all this?
How Does Clustering Affect Probabilities?
• When words wt and ws are merged into one cluster, the cluster’s class distribution is the weighted average of its members’ distributions:
• P(C | wt ∨ ws) = [P(wt) / (P(wt) + P(ws))] P(C | wt) + [P(ws) / (P(wt) + P(ws))] P(C | ws)
• I.e., the fraction of the cluster contributed by wt plus the fraction contributed by ws (see the sketch below)
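A minimal sketch of this merge step; the example distributions and word frequencies are made up.

```python
def merge_class_distributions(p_c_given_wt, p_c_given_ws, p_wt, p_ws):
    """P(C | wt or ws): average of P(C | wt) and P(C | ws),
    weighted by the words' relative frequencies P(wt), P(ws)."""
    total = p_wt + p_ws
    return {c: (p_wt / total) * p_c_given_wt[c] + (p_ws / total) * p_c_given_ws[c]
            for c in p_c_given_wt}

print(merge_class_distributions({"space": 0.9, "sports": 0.1},
                                {"space": 0.7, "sports": 0.3},
                                p_wt=0.02, p_ws=0.01))
```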
Vs. other forms of learning
• Measures similarity based on the property it
is trying to estimate (the classes)
– Makes the supervision in the training data
really important.
• Clustering is based on the similarity of the
class variable distributions
• Key Idea: Clustering preserves the “shape”
of the class distributions.
Kullback-Leibler Divergence
• Measures the similarity between class distributions
• D(P(C | wt) || P(C | ws)) = Σj P(cj | wt) log [ P(cj | wt) / P(cj | ws) ]
• If P(cj | wt) = P(cj | ws) for every cj, then each term is log(1) = 0, so the divergence is 0 (see the sketch below)
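A small sketch of the K-L divergence between two class distributions:

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_j p[cj] * log(p[cj] / q[cj]); 0 when the distributions match."""
    # Note: this blows up if q[cj] == 0 while p[cj] > 0 -- the problem on the next slide.
    return sum(pj * math.log(pj / q[c]) for c, pj in p.items() if pj > 0)

p = {"space": 0.9, "sports": 0.1}
print(kl_divergence(p, p))                                # 0.0 -- identical distributions
print(kl_divergence(p, {"space": 0.5, "sports": 0.5}))    # > 0 -- they differ
```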
Problems with K-L Divergence
• Not symmetric
• The denominator P(cj | ws) can be 0 if ws does not appear in any documents of class cj
K-L Divergence from the Mean
• For each word in the cluster: (that word’s share of the cluster’s occurrences) × (its KL divergence to the cluster’s distribution), summed over the members
• New and improved: uses a frequency-weighted average rather than a plain mean
• Justification: fits clustering well, because the members’ individual distributions are exactly what combine into the cluster’s statistics (see the sketch below)
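A sketch of "K-L divergence to the mean" as a similarity measure between two candidate cluster members, weighting each word by its frequency; the variable names and example values are my own.

```python
import math

def kl(p, q):
    return sum(pj * math.log(pj / q[c]) for c, pj in p.items() if pj > 0)

def kl_to_the_mean(p_t, p_s, freq_t, freq_s):
    """Frequency-weighted average of each word's KL divergence to the merged
    ("mean") class distribution of the would-be cluster."""
    a = freq_t / (freq_t + freq_s)
    b = freq_s / (freq_t + freq_s)
    mean = {c: a * p_t[c] + b * p_s[c] for c in p_t}   # P(C | wt or ws)
    return a * kl(p_t, mean) + b * kl(p_s, mean)

print(kl_to_the_mean({"space": 0.9, "sports": 0.1},
                     {"space": 0.7, "sports": 0.3}, freq_t=0.02, freq_s=0.01))
```

Because the mean distribution is zero only where both members are zero, the zero-denominator problem above goes away.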
Minimizing Error in Naïve Bayes Scores
• Assuming uniform class priors allows us to drop P(cj | θ) and the whole denominator from (6)
• A little algebra then turns the log of the Naïve Bayes score into a cross-entropy term
• So the error introduced by clustering can be measured as the change in this cross entropy
• Minimizing that change yields equation (9), so clustering by this method minimizes the error (a small illustration follows below)
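A small illustration (my own paraphrase, not the paper's equation) of the cross-entropy view: with uniform priors, a class's log Naive Bayes score is, up to a constant factor, the negative cross entropy between the document's empirical word distribution and that class's word distribution.

```python
import math

def cross_entropy(doc_word_dist, class_word_dist):
    """H(P(W|di), P(W|cj)) = -sum_t P(wt|di) * log P(wt|cj)."""
    return -sum(p * math.log(class_word_dist[w])
                for w, p in doc_word_dist.items() if p > 0)

doc = {"orbit": 0.5, "launch": 0.5}                       # empirical P(W | di)
space = {"orbit": 0.4, "launch": 0.4, "goal": 0.2}        # P(W | c_space)
sports = {"orbit": 0.1, "launch": 0.1, "goal": 0.8}       # P(W | c_sports)
print(cross_entropy(doc, space) < cross_entropy(doc, sports))   # True: 'space' scores higher
```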
The Clustering Algorithm
• Comparing the similarity of all possible word clusters would be O(V²)
• Instead, a number M is set as the total number of desired clusters
– More supervision
• The M clusters are initialized with the M words that have the highest mutual information with the class variable
• Properties: greedy, scales efficiently
Algorithm
• [Pseudocode figure omitted: sort the vocabulary by mutual information with the class, then greedily merge the clusters whose class distributions P(C | wt) are most similar; a sketch follows below]
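A sketch implementing the greedy procedure described on the previous slide (my own reading of it, not the authors' code): start with M singleton clusters for the highest-mutual-information words, then repeatedly admit the next word and re-merge the two clusters whose class distributions are most similar under frequency-weighted K-L divergence to the mean.

```python
import math
from itertools import combinations

def kl(p, q):
    return sum(pj * math.log(pj / q[c]) for c, pj in p.items() if pj > 0)

def merge(a, b):
    """Merge two (frequency, class-distribution) clusters into one."""
    (fa, pa), (fb, pb) = a, b
    f = fa + fb
    return (f, {c: (fa * pa[c] + fb * pb[c]) / f for c in pa})

def distance(a, b):
    """Frequency-weighted KL divergence of both members to their mean."""
    (fa, pa), (fb, pb) = a, b
    _, mean = merge(a, b)
    return (fa * kl(pa, mean) + fb * kl(pb, mean)) / (fa + fb)

def cluster_words(word_stats, M):
    """word_stats: {word: (frequency, P(C|word))}, assumed already sorted so the
    first M words have the highest mutual information with the class variable."""
    words = list(word_stats)
    clusters = {w: word_stats[w] for w in words[:M]}      # M singleton clusters
    for w in words[M:]:
        clusters[w] = word_stats[w]                       # admit the next word ...
        a, b = min(combinations(clusters, 2),             # ... then re-merge the two
                   key=lambda pair: distance(clusters[pair[0]], clusters[pair[1]]))
        clusters[a + "+" + b] = merge(clusters.pop(a), clusters.pop(b))
    return clusters

stats = {"nasa":  (0.03, {"space": 0.9, "sports": 0.1}),
         "goal":  (0.03, {"space": 0.1, "sports": 0.9}),
         "orbit": (0.02, {"space": 0.8, "sports": 0.2})}
print(list(cluster_words(stats, M=2)))    # ['goal', 'nasa+orbit']
```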
Related Work
• ChiMerge / Chi2
– Use distributional clustering to discretize numbers
• Class-based clustering
– Uses the reduction in mutual information to determine when to cluster
– Not effective for text classification
• Feature Selection by Mutual Information
– Cannot capture dependencies between words
• Markov-blanket-based Feature Selection
– Also attempts to preserve the shapes of P(C | wt)
• Latent Semantic Indexing
– Unsupervised, based on PCA
The Experiment: Competitors to Distributional Clustering
• Clustering with LSI
• Information Gain Based Feature Selection
• Mutual-Information Feature Selection
• Feature selection works by cutting out redundant words
• Clustering instead combines these redundancies
The Experiment: Testbeds
• 20 Newsgroups
– 20,000 articles from 20 Usenet groups (approx. 62,000 words)
• ModApte “Reuters-21578”
– 9,603 training docs, 3,299 testing docs, 135 topics (approx. 16,000 words)
• Yahoo! Science (July 1997)
– 6,294 pages in 41 classes (approx. 44,000 words)
– Very noisy data
20 Newsgroups Results
• Averaged over 5-20 trials
• Computational constraints forced Markov blanket
to a smaller data set (second graph)
• LSI uses only 1/3 training ratio
20 Newsgroups Analysis
• Distributional Clustering achieves 82.1% accuracy at
50 features, almost as good as having the full
vocabulary.
• More accurate than all non-clustering approaches
• LSI did not add any improvement to clustering (claim: because it is unsupervised)
• On the smaller data set, D.C. reaches 80% accuracy far more quickly than the others, in some cases doubling their performance at small numbers of features.
• Claim: Clustering outperforms Feature selection
because it conserves information rather than
discarding it.
Speed in 20-Newsgroups Test
• Distributional Clustering: 7.5 minutes
• LSI: 23 minutes
• Markov Blanket: 10 hours
• Mutual information feature selection (???): 30 seconds
Reuters-21578 Results
• D.C. outperforms others for small numbers of
features
• Information-Gain based feature selection does better
for larger feature sets.
• In this data set, documents can have multiple labels.
Yahoo! Results
• Feature selection performs almost as well or better
in these cases
• Claim: The data is so noisy that it is actually
beneficial to “lose data” via feature selection.
Performance Summary
• Only a slight loss in accuracy despite the reduction in feature space
• Preserves “redundant” information better than feature selection.
• The improvement is not as drastic with
noisy data.
Improvements on Earlier D.C. Work
• Does not show much improvement on
sparse data because the performance
measure is related to the data distribution
– D.C. preserves class distributions, even if these
are poor estimates to begin with.
• Thus the whole method relies on accurate estimates of P(C | wi)
Future Work
• Improve D.C.’s handling of sparse data (ensure good estimates of P(C | wi))
• Find ways to combine feature selection and
D.C. to utilize the strengths of both (perhaps
increase performance on noisy data sets?)
Some Thoughts
• Extremely supervised
• Needs to be retrained when new documents
come in
• In a document that covers many topics, does Naïve Bayes (each word independent of its context) make sense?
• Didn’t work as well on noisy data
• How can we ensure accurate θ values?