
Text classification

Text Classification: Definition
• Text classification is the assignment of text documents to one or more predefined categories based on their content.
[Figure: a classifier takes text documents as input and assigns each to one of the classes A, B, or C]
• The classifier:
  – Input: a set of m hand-labeled documents (x1, y1), …, (xm, ym)
  – Output: a learned classifier f: x → y
Text Classification: Applications
• Classify news stories as World, US, Business, SciTech, Sports, Entertainment, Health, or Other.
• Classify business names by industry.
• Classify student essays as A, B, C, D, or F.
• Classify email as Spam or Other.
• Classify PDF files as ResearchPaper or Other.
• Classify documents as WrittenByReagan or GhostWritten.
• Classify movie reviews as Favorable, Unfavorable, or Neutral.
• Classify technical papers as Interesting or Uninteresting.
• Classify jokes as Funny or NotFunny.
• Classify company web sites by Standard Industrial Classification (SIC) code.
Cost of Manual Text Categorization
◦ Yahoo!
  – 200 (?) people for manual labeling of Web pages
  – using a hierarchy of 500,000 categories
◦ MEDLINE (National Library of Medicine)
  – $2 million/year for manual indexing of journal articles
  – using Medical Subject Headings (18,000 categories)
◦ Mayo Clinic
  – $1.4 million annually for coding patient-record events
  – using the International Classification of Diseases (ICD) for billing insurance companies
◦ US Census Bureau decennial census (1990: 22 million responses)
  – 232 industry categories and 504 occupation categories
  – $15 million if fully done by hand
Text Classification Framework
Documents → Preprocessing → Indexing → Feature selection → Applying classification algorithms → Performance measure
Classification Using Vector Spaces
• Each document is a vector, one component for
each term.
• Terms are axes.
• High dimensionality: 100,000s of dimensions
• Normalize vectors (documents) to unit length, as in the sketch below
• How can we do classification in this space?
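
A minimal normalization sketch in numpy (the three term axes and toy weights are invented for illustration; real document vectors have hundreds of thousands of components):

import numpy as np

# Toy term-frequency vectors over three term axes.
docs = np.array([
    [3.0, 0.0, 1.0],   # mostly about term 1
    [0.0, 4.0, 2.0],   # mostly about term 2
])

# Normalize each document to unit length, so the dot product of two
# document vectors equals their cosine similarity.
unit_docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
print(unit_docs @ unit_docs.T)  # pairwise cosine similarities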
Classification Using Vector Spaces
• In vector space classification, training set
corresponds to a labeled set of points (equivalently,
vectors)
• Premise 1: Documents in the same class form a
contiguous region of space
• Premise 2: Documents from different classes don’t
overlap (much)
• Learning a classifier: build surfaces to delineate
classes in the space
Documents in a Vector Space
[Figure: documents in a vector space, clustered into Government, Science, and Arts regions]
Test Document of what class?
[Figure: a test document falls near the Government cluster in the vector space]
Test Document = Government
[Figure: the test document lies closest to the Government region]
Is this similarity hypothesis true in general?
Our focus: how to find good separators
Relevance feedback vs Text classification
• Relevance feedback is a form of text
classification
• The principal difference between relevance
feedback and text classification:
– The training set is given as part of the input in text
classification.
– It is interactively created in relevance feedback.
Rocchio classification: Basic idea
• Compute a centroid for each class. The centroid is the average of all documents in the class.
• Assign each test document to the class of its
closest centroid.
Definition of centroid
  μ(c) = (1 / |Dc|) · Σ_{d ∈ Dc} v(d)

• where Dc is the set of all documents that belong to class c and v(d) is the vector space representation of d.
• Note that the centroid will in general not be a unit vector even when the inputs are unit vectors.
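
A minimal Rocchio sketch in Python (numpy only; the class names and toy vectors are hypothetical): compute each class centroid, then assign a test document to the class of its nearest centroid.

import numpy as np

def train_rocchio(X, y):
    # One centroid per class: the mean of the class's document vectors.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify_rocchio(centroids, x):
    # Assign x to the class of its closest centroid (Euclidean distance).
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# Toy unit-normalized document vectors and labels (hypothetical data).
X = np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.2]])
y = np.array(["Government", "Government", "Science"])
centroids = train_rocchio(X, y)
print(classify_rocchio(centroids, np.array([0.7, 0.3, 0.0])))  # -> Government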
Rocchio classification
• Rocchio forms a simple representative for each class: the centroid/prototype
• Classification: nearest prototype/centroid
• It does not guarantee that classifications are consistent with the given training data
Rocchio algorithm
• Rocchio cannot handle multimodal classes: if a class forms two separate clusters in the space, its single centroid may lie between them and misclassify documents from both clusters.
Rocchio classification
• Little used outside text classification
– It has been used quite effectively for text
classification
– But in general worse than Naïve Bayes
• Again, cheap to train and to classify test documents
Bayes Classifiers
• Consider a document D, whose class is given by C. In the case of email spam filtering there are two classes, C = S (spam) and C = H (ham). We classify D as the class with the highest posterior probability P(C|D), which can be re-expressed using Bayes' Theorem:

  P(C|D) = P(D|C) · P(C) / P(D)
Bayes Classifiers
Task: classify a new instance D, described by a tuple of attribute values D = (x1, x2, …, xn), into one of the classes cj ∈ C.

  cMAP = argmax_{cj ∈ C} P(cj | x1, x2, …, xn)
       = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) · P(cj) / P(x1, x2, …, xn)
       = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) · P(cj)

(The denominator P(x1, x2, …, xn) is the same for every class, so it can be dropped from the argmax.)
Naïve Bayes Assumption
• P(cj)
– Can be estimated from the frequency of classes in the
training examples.
• P(x1,x2,…,xn|cj)
  – Naïve Bayes Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi|cj).
The Naïve Bayes Classifier
[Figure: a naive Bayes network with class node Flu and feature nodes X1 = runny nose, X2 = sinus, X3 = cough, X4 = fever, X5 = muscle-ache]
• Conditional Independence Assumption: features detect term presence and are independent of each other given the class:

  P(X1, …, X5 | C) = P(X1|C) · P(X2|C) · … · P(X5|C)
Learning the Model
[Figure: naive Bayes graphical model with class node C and feature nodes X1, …, X6]
• First attempt: maximum likelihood estimates
  – simply use the frequencies in the data

  P̂(cj) = N(C = cj) / N

  P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)
Problem with Max Likelihood
[Figure: the same Flu network, with feature nodes X1 = runny nose, X2 = sinus, X3 = cough, X4 = fever, X5 = muscle-ache]

  P(X1, …, X5 | C) = P(X1|C) · P(X2|C) · … · P(X5|C)

• What if we have seen no training cases where a patient had muscle aches but not the flu?

  P̂(X5 = t | C = nf) = N(X5 = t, C = nf) / N(C = nf) = 0

• Zero probabilities cannot be conditioned away, no matter the other evidence!
Smoothing to Eliminate Zeros

  P̂(xi | cj) = (N(Xi = xi, C = cj) + k) / (N(C = cj) + k · |X|)

  where |X| is the number of values Xi can take.

• Add-one (Laplace) smoothing uses k = 1.
• It acts as a uniform prior (each attribute value occurs once for each class) that is then updated as evidence from the training data comes in, as sketched below.
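
A minimal sketch of the smoothed estimator in Python (the counts are invented for illustration):

def smoothed_estimate(n_xi_cj, n_cj, n_values, k=1.0):
    # Laplace-smoothed estimate of P(X_i = x_i | C = c_j).
    # n_xi_cj  : count of training cases with X_i = x_i and C = c_j
    # n_cj     : count of training cases with C = c_j
    # n_values : number of values X_i can take (|X|)
    return (n_xi_cj + k) / (n_cj + k * n_values)

# Unseen event: 0 co-occurrences out of 10 class examples, binary feature.
print(smoothed_estimate(0, 10, 2))  # ~0.083 instead of a hard zero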
Naive Bayes: Two Document Models
• Two probabilistic models of documents, both of which represent documents as a bag of words, using the Naive Bayes assumption.
• Both models represent documents using feature vectors
whose components correspond to word types. If we have a
vocabulary V, containing |V| word types, then the feature
vector dimension d=|V|.
• Bernoulli document model: a document is
represented by a feature vector with binary elements
taking value 1 if the corresponding word is present in
the document and 0 if the word is not present.
• Multinomial document model: a document is
represented by a feature vector with integer
elements whose value is the frequency of that word
in the document.
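
To make the difference concrete, a small sketch (plain Python; the vocabulary and document are invented): the same document yields a binary vector under the Bernoulli model and a count vector under the multinomial model.

from collections import Counter

vocab = ["chinese", "beijing", "shanghai", "tokyo", "japan"]
doc = "chinese chinese tokyo".split()

counts = Counter(doc)
bernoulli_vec   = [1 if w in counts else 0 for w in vocab]   # presence/absence
multinomial_vec = [counts[w] for w in vocab]                 # term frequencies

print(bernoulli_vec)    # [1, 0, 0, 1, 0]
print(multinomial_vec)  # [2, 0, 0, 1, 0]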
• In the Bernoulli model a document is represented by a binary
vector.
• Let bi be the feature vector for the i-th document Di; the t-th element of bi, written bit, is either 0 or 1, representing the absence or presence of word wt in the i-th document.
• Let P(wt|C) be the probability of word wt occurring in a document of class C; the probability of wt not occurring in a document of this class is given by (1 − P(wt|C)). If we make the naive Bayes assumption, that the probability of each word occurring in the document is independent of the occurrences of the other words, then we can write the document likelihood P(Di|C) in terms of the individual word likelihoods P(wt|C):

  P(Di | C) = Π_t [ bit · P(wt|C) + (1 − bit) · (1 − P(wt|C)) ]
1. Define the vocabulary V; the number of words in the
vocabulary defines the dimension of the feature vectors
2. Count the following in the training set:
• N the total number of documents
• Nk the number of documents labelled with class C=k, for
k=1, . . . , K
• nk(wt) the number of documents of class C=k containing
word wt for every class and for each word in the vocabulary
3. Estimate the likelihoods: P̂(wt | C=k) = nk(wt) / Nk
4. Estimate the priors: P̂(C=k) = Nk / N
• To classify an unlabelled document Dj with binary feature vector bj, we estimate the posterior probability for each class:

  P(C=k | Dj) ∝ P(C=k) · Π_t [ bjt · P(wt|C=k) + (1 − bjt) · (1 − P(wt|C=k)) ]
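
The steps above map directly to code. A minimal Bernoulli NB sketch in plain Python (the add-one smoothing and log-space computation are common practical additions, assumed here rather than taken from the slides); on the worked example that follows it reproduces c(5) ≠ China.

import math
from collections import defaultdict

def train_bernoulli_nb(docs, labels, vocab):
    # docs: list of token lists; labels: class label per doc; vocab: list of word types
    N = len(docs)
    Nk = defaultdict(int)                        # N_k: number of documents per class
    nk = defaultdict(lambda: defaultdict(int))   # n_k(w): docs of class k containing w
    for words, k in zip(docs, labels):
        Nk[k] += 1
        for w in set(words):
            nk[k][w] += 1
    prior = {k: Nk[k] / N for k in Nk}
    # Add-one smoothing (an assumption here; the unsmoothed MLE is nk(w)/Nk).
    likelihood = {k: {w: (nk[k][w] + 1) / (Nk[k] + 2) for w in vocab} for k in Nk}
    return prior, likelihood

def classify_bernoulli_nb(prior, likelihood, vocab, words):
    present = set(words)
    def log_posterior(k):
        s = math.log(prior[k])
        for w in vocab:  # absent words contribute log(1 - P(w|k))
            p = likelihood[k][w]
            s += math.log(p if w in present else 1.0 - p)
        return s
    return max(prior, key=log_posterior)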
NB Example
• Training and test data (the worked example from Chapter 13 of Introduction to Information Retrieval):
  docID 1: "Chinese Beijing Chinese" (class: China)
  docID 2: "Chinese Chinese Shanghai" (class: China)
  docID 3: "Chinese Macao" (class: China)
  docID 4: "Tokyo Japan Chinese" (class: not China)
  docID 5 (test): "Chinese Chinese Chinese Tokyo Japan" (class: ?)
• c(5) = ?
Bernoulli NB Classifier
• Feature likelihood estimates (add-one smoothing over document counts):
  P̂(Chinese|China) = (3+1)/(3+2) = 4/5
  P̂(Tokyo|China) = P̂(Japan|China) = (0+1)/(3+2) = 1/5
  P̂(Beijing|China) = P̂(Macao|China) = P̂(Shanghai|China) = (1+1)/(3+2) = 2/5
  P̂(Chinese|¬China) = P̂(Tokyo|¬China) = P̂(Japan|¬China) = (1+1)/(1+2) = 2/3
  P̂(Beijing|¬China) = P̂(Macao|¬China) = P̂(Shanghai|¬China) = (0+1)/(1+2) = 1/3
• Posterior (absent words contribute factors of (1 − P̂)):
  P(China|d5) ∝ 3/4 · 4/5 · 1/5 · 1/5 · (1 − 2/5)³ ≈ 0.005
  P(¬China|d5) ∝ 1/4 · 2/3 · 2/3 · 2/3 · (1 − 1/3)³ ≈ 0.022
• Result: c(5) ≠ China
Parameter estimation
• Bernoulli model: P̂(Xw = t | cj) = fraction of documents of topic cj in which word w appears
• Multinomial model: P̂(Xi = w | cj) = fraction of times word w appears across all documents of topic cj
Multinomial NB Classifier
• Feature likelihood estimates (add-one smoothing over token positions; class China has 8 tokens, ¬China has 3, and |V| = 6):
  P̂(Chinese|China) = (5+1)/(8+6) = 3/7
  P̂(Tokyo|China) = P̂(Japan|China) = (0+1)/(8+6) = 1/14
  P̂(Chinese|¬China) = P̂(Tokyo|¬China) = P̂(Japan|¬China) = (1+1)/(3+6) = 2/9
• Posterior:
  P(China|d5) ∝ 3/4 · (3/7)³ · 1/14 · 1/14 ≈ 0.0003
  P(¬China|d5) ∝ 1/4 · (2/9)³ · 2/9 · 2/9 ≈ 0.0001
• Result: c(5) = China
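
As a cross-check, a sketch using scikit-learn (assuming it is installed; with alpha=1 its smoothing matches the add-one estimates above), which reproduces both predictions on this example:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

train = ["Chinese Beijing Chinese", "Chinese Chinese Shanghai",
         "Chinese Macao", "Tokyo Japan Chinese"]
y = ["China", "China", "China", "notChina"]
test = ["Chinese Chinese Chinese Tokyo Japan"]

vec = CountVectorizer()
Xtr, Xte = vec.fit_transform(train), vec.transform(test)

print(MultinomialNB(alpha=1.0).fit(Xtr, y).predict(Xte))              # ['China']
print(BernoulliNB(alpha=1.0, binarize=0.0).fit(Xtr, y).predict(Xte))  # ['notChina']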
Naive Bayes is Not So Naive
• Very fast learning and testing (basically just count words)
• Low storage requirements
• Very good in domains with many equally important features
• More robust to irrelevant features than many learning methods
• Naive Bayes won 1st and 2nd place in the KDD-CUP 97 competition out of 16 systems
  – Goal: financial services industry direct mail response prediction: predict if the recipient of mail will actually respond to the advertisement (750,000 records)
• A good, dependable baseline for text classification (but not the best)!
k Nearest Neighbor Classification
• kNN = k Nearest Neighbor
• To classify a document d:
  – Define the k-neighborhood as the k nearest neighbors of d
  – Pick the majority class label in the k-neighborhood
Example: k = 6 (6NN)
[Figure: the 6 nearest neighbors of a test document span the Government, Science, and Arts regions; the majority class among them decides]
Nearest-Neighbor Learning
• Learning: just store the labeled training examples D
• Testing instance x (under 1NN):
– Compute similarity between x and all examples in D.
– Assign x the category of the most similar example in D.
• Does not compute anything beyond storing the
examples
• Also called:
– Case-based learning
– Memory-based learning
– Lazy learning
• Rationale of kNN: contiguity hypothesis
k Nearest Neighbor
• Using only the closest example (1NN) is subject to errors due to:
  – A single atypical example.
  – Noise (i.e., an error) in the category label of a single training example.
• More robust: find the k nearest examples and return the majority category of these k, as in the sketch below.
• k is typically odd to avoid ties; 3 and 5 are most common.
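
A minimal kNN sketch in Python (numpy only; on unit-normalized vectors cosine similarity reduces to a dot product, and the data here is invented for illustration):

import numpy as np
from collections import Counter

def knn_classify(X, y, x, k=3):
    # Classify x by majority vote among its k nearest training examples.
    # X: (n, d) matrix of unit-normalized training document vectors
    # y: length-n list of class labels; x: unit-normalized test vector
    sims = X @ x                       # cosine similarity to every training doc
    nearest = np.argsort(-sims)[:k]    # indices of the k most similar docs
    votes = Counter(y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy vectors over 3 term axes (hypothetical data), normalized to unit length.
X = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]])
X = X / np.linalg.norm(X, axis=1, keepdims=True)
y = ["Government", "Government", "Science"]
x = np.array([0.8, 0.2, 0.0]); x = x / np.linalg.norm(x)
print(knn_classify(X, y, x, k=3))  # -> Government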
kNN decision boundaries
[Figure: locally defined decision boundaries between the Government, Science, and Arts regions]
• Boundaries are in principle arbitrary surfaces, but usually polyhedra.
• kNN gives locally defined decision boundaries between classes: far-away points do not influence each classification decision (unlike in Naïve Bayes, Rocchio, etc.).
kNN: Discussion
• No feature selection necessary
• No training necessary
• Scales well with large number of classes
– Don’t need to train n classifiers for n classes
• Classes can influence each other
– Small changes to one class can have ripple effect
• May be expensive at test time
• In most cases it’s more accurate than NB or
Rocchio