Classifying text

NLTK Chapter 6
Chapter 6 topics
• How can we identify particular features
of language data that are salient for
classifying it?
• How can we construct models of
language that can be used to perform
language processing tasks
automatically?
• What can we learn about language from
these models?
From words to larger units
• We looked at how words are identified
with a part of speech. That is an
essential part of “understanding” textual
material
• Now, how can we classify whole
documents?
– These techniques are used for spam
detection, for identifying the subject matter
of a news feed, and for many other tasks
related to categorizing text
A supervised classifier
We saw a smaller version of this in our part of speech taggers
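The supervised workflow can be sketched without NLTK: a feature extractor turns each input into a dictionary, labeled feature sets train a model, and the same extractor is applied to new inputs at prediction time. The tiny classifier below is an illustrative naive Bayes with add-one smoothing, not NLTK's implementation; the names `last_letter`, `train_naive_bayes`, `classify`, and the toy data are assumptions made for this sketch.

```python
import math
from collections import Counter, defaultdict

def last_letter(word):
    # Feature extractor: maps an input to a feature dictionary.
    return {'last_letter': word[-1].lower()}

def train_naive_bayes(labeled_featuresets):
    # Count labels and, per (label, feature name), how often each value occurs.
    label_counts = Counter()
    value_counts = defaultdict(Counter)   # (label, fname) -> Counter of values
    seen_values = defaultdict(set)        # fname -> set of values seen in training
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for fname, fval in features.items():
            value_counts[(label, fname)][fval] += 1
            seen_values[fname].add(fval)
    return label_counts, value_counts, seen_values

def classify(model, features):
    label_counts, value_counts, seen_values = model
    total = sum(label_counts.values())
    best_label, best_score = None, float('-inf')
    for label, count in label_counts.items():
        # log P(label) + sum of log P(value | label), with add-one smoothing
        score = math.log(count / total)
        for fname, fval in features.items():
            n_values = len(seen_values[fname]) + 1  # one extra slot for unseen values
            seen = value_counts[(label, fname)][fval]
            score += math.log((seen + 1) / (count + n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for illustration.
toy_names = [('Anna', 'female'), ('Maria', 'female'), ('Julia', 'female'),
             ('Mark', 'male'), ('Frank', 'male'), ('Erik', 'male')]
model = train_naive_bayes([(last_letter(n), g) for n, g in toy_names])
guess_a = classify(model, last_letter('Sophia'))
guess_k = classify(model, last_letter('Patrick'))
```

With six training names the model only learns that a final "a" suggests female and a final "k" suggests male, which is the same kind of evidence the case study below exploits on a much larger list.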
Case study
Male and female names
• Note this is language biased (English)
• These distinctions are harder given
modern naming conventions
– I have a granddaughter named Sydney, for
example
Step 1: features and encoding
• Deciding what features to look for and how
to represent those features is the first step,
and is critical.
– All the training and classification will be based
on these decisions
• Initial choice for name identification: look
at the last letter:
>>> def gender_features(word):
...     return {'last_letter': word[-1]}
>>> gender_features('Shrek')
{'last_letter': 'k'}
returns a dictionary (note the { } ) with a feature
name and the corresponding value
Step 2: Provide training values
• We provide a list of examples and their
corresponding feature values.
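>>> import nltk
>>> from nltk.corpus import names
>>> import random
>>> # Note: this rebinds `names` from the corpus reader to a list of (name, label) pairs
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...          [(name, 'female') for name in names.words('female.txt')])
>>> random.shuffle(names)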
>>> names
[('Kate', 'female'), ('Eleonora', 'female'), ('Germaine', 'male'), ('Helen', 'female'),
('Rachelle', 'female'), ('Nanci', 'female'), ('Aleta', 'female'), ('Catherin', 'female'),
('Clementia', 'female'), ('Keslie', 'female'), ('Callida', 'female'), ('Horatius', 'male'),
('Kraig', 'male'), ('Cindra', 'female'), ('Jayne', 'female'), ('Fortuna', 'female'),
('Yovonnda', 'female'), ('Pam', 'female'), ('Vida', 'female'), ('Margurite', 'female'),
('Maryellen', 'female'), …
>>> featuresets = [(gender_features(n), g) for (n,g) in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
• Try it. Apply the classifier to your name:
>>> classifier.classify(gender_features('Sydney'))
'female'
• Try it on the test data and see how it
does:
>>> print(nltk.classify.accuracy(classifier, test_set))
0.758
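Accuracy here is the fraction of test items whose predicted label matches the known label. A minimal sketch of that calculation follows, using a hand-written stand-in rule (`toy_classify`, a hypothetical name) in place of a trained classifier, and a four-name test set invented for illustration.

```python
def gender_features(word):
    return {'last_letter': word[-1]}

def toy_classify(features):
    # Hypothetical stand-in for classifier.classify: a fixed hand-written rule.
    return 'female' if features['last_letter'] in 'aeiy' else 'male'

def accuracy(classify, labeled_featuresets):
    # Fraction of items for which the guess equals the known label.
    correct = sum(1 for feats, label in labeled_featuresets
                  if classify(feats) == label)
    return correct / len(labeled_featuresets)

toy_test = [(gender_features(n), g) for n, g in
            [('Anna', 'female'), ('Sydney', 'female'),
             ('Mark', 'male'), ('Kyle', 'male')]]
score = accuracy(toy_classify, toy_test)   # 3 of 4 correct: 'Kyle' is misclassified
```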
Your turn
• Modify the gender_features function to
look at more of the name than the last
letter. Does it help to look at the last
two letters? the first letter? the length
of the name? Try a few variations
What is most useful
• There is even a function to show what
was most useful in the classification:
>>> classifier.show_most_informative_features(10)
Most Informative Features
    last_letter = 'k'      male : female =   45.7 : 1.0
    last_letter = 'a'    female : male   =   38.4 : 1.0
    last_letter = 'f'      male : female =   28.7 : 1.0
    last_letter = 'v'      male : female =   11.2 : 1.0
    last_letter = 'p'      male : female =   11.2 : 1.0
    last_letter = 'd'      male : female =    9.8 : 1.0
    last_letter = 'm'      male : female =    8.9 : 1.0
    last_letter = 'o'      male : female =    8.3 : 1.0
    last_letter = 'r'      male : female =    6.7 : 1.0
    last_letter = 'g'      male : female =    5.6 : 1.0
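Each ratio compares how often a feature value occurs with one label versus the other. A rough sketch of that calculation from raw counts follows. The counts are invented for illustration, `likelihood_ratio` is a hypothetical helper, and NLTK smooths its probability estimates, so its reported ratios will not match an unsmoothed calculation exactly.

```python
def likelihood_ratio(value_count_a, total_a, value_count_b, total_b):
    # P(value | label a) divided by P(value | label b), with no smoothing.
    p_a = value_count_a / total_a
    p_b = value_count_b / total_b
    return p_a / p_b

# Invented counts: names ending in 'k' among 1000 male and 1000 female names.
ratio = likelihood_ratio(40, 1000, 2, 1000)
print('male : female = %.1f : 1.0' % ratio)
```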
What features to use
• Overfitting
– Being too specific about the characteristics
that you search for
– Picks up idiosyncrasies of the training data
and may not transfer well to the test data
• Choose an initial feature set and then
test.
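To make the overfitting risk concrete, here is a sketch of a deliberately over-specific extractor, modeled on the kind of example used in the NLTK book chapter these slides follow. The function name `gender_features_overfit` is an assumption for illustration. With a small training set, so many features give the classifier room to latch onto accidents of the data.

```python
import string

def gender_features_overfit(name):
    lowered = name.lower()
    features = {
        'first_letter': lowered[0],
        'last_letter': lowered[-1],
    }
    # Two extra features per letter of the alphabet: 52 more features.
    for letter in string.ascii_lowercase:
        features['count(%s)' % letter] = lowered.count(letter)
        features['has(%s)' % letter] = (letter in lowered)
    return features

feats = gender_features_overfit('Anna')
n_features = len(feats)   # 2 + 2 * 26
```

Whether the extra features help or hurt is an empirical question; compare accuracy on held-out data against the single last-letter feature before keeping them.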
Dev test
• Divide the corpus into three parts: training,
development testing, final testing
Testing stages
>>> train_names = names[1500:]
>>> devtest_names = names[500:1500]
>>> test_names = names[:500]
>>> train_set = [(gender_features(n), g) for (n,g) in train_names]
>>> devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
>>> test_set = [(gender_features(n), g) for (n,g) in test_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print(nltk.classify.accuracy(classifier, devtest_set))
0.765
Accuracy noted, but where were the problems?
• Check the classifier against the known
values and see where it failed:
>>> errors = []
>>> for (name, tag) in devtest_names:
...     guess = classifier.classify(gender_features(name))
...     if guess != tag:
...         errors.append((tag, guess, name))
>>> for (tag, guess, name) in sorted(errors):
...     print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))
...
correct=female guess=male name=Cindely ...
correct=female guess=male name=Katheryn
correct=female guess=male name=Kathryn ...
correct=male guess=female name=Aldrich ...
correct=male guess=female name=Mitch
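A long error list is easier to read once it is grouped. The sketch below tallies errors by the final two letters of the name and the correct tag. The four tuples are taken from the sample output above, and `suffix_tally` is a hypothetical name; in practice the full `errors` list would be used.

```python
from collections import Counter

# (correct tag, guess, name) tuples in the same shape as the errors list.
sample_errors = [('female', 'male', 'Katheryn'), ('female', 'male', 'Kathryn'),
                 ('male', 'female', 'Aldrich'), ('male', 'female', 'Mitch')]

# Count how often each (two-letter suffix, correct tag) pair is misclassified.
suffix_tally = Counter((name[-2:], tag) for tag, guess, name in sample_errors)

for (suffix, tag), n in suffix_tally.most_common():
    print('suffix=%-3s correct=%-8s errors=%d' % (suffix, tag, n))
```

Suffixes that recur in the tally are candidates for new features.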
Error analysis
• It turns out that using the last two
letters improves the accuracy.
• Did you find that in your
experimentation?
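One way to encode the last two letters is to keep the one-letter suffix and add the two-letter suffix as a second feature. This is a sketch along the lines of the NLTK book's revised extractor; it replaces the earlier `gender_features`, so the feature sets must be rebuilt and the classifier retrained before measuring accuracy again.

```python
def gender_features(word):
    # Slicing (rather than indexing) also works for one-letter names.
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

example = gender_features('Kathryn')
```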
Document classification
• Many uses.
• Case study, classifying movie reviews
>>> from nltk.corpus import movie_reviews
>>> documents = [(list(movie_reviews.words(fileid)), category)
...              for category in movie_reviews.categories()
...              for fileid in movie_reviews.fileids(category)]
>>> random.shuffle(documents)
• Feature extraction for documents will use
words
• Find most common words in the
document set and see which words are in
which types of documents
Feature extractor: are the words
present in the documents?
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# Take the 2000 most frequent words; in NLTK 3 keys() is no longer sorted by frequency
word_features = [w for (w, count) in all_words.most_common(2000)]
def document_features(document):
    document_words = set(document)  # set lookup is faster than scanning a list
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
>>> print(document_features(movie_reviews.words('pos/cv957_8737.txt')))
{'contains(waste)': False, 'contains(lot)': False, ...}
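The same idea can be tried on a toy corpus without downloading the movie reviews. This is a minimal sketch: the three documents are invented, `collections.Counter` stands in for `nltk.FreqDist`, and the vocabulary is cut to the 3 most frequent words instead of 2000.

```python
from collections import Counter

# Invented toy corpus of (word list, category) pairs.
toy_docs = [
    (['a', 'wonderful', 'film'], 'pos'),
    (['a', 'wasted', 'evening'], 'neg'),
    (['wonderful', 'cast', 'wasted', 'script'], 'neg'),
]

# Vocabulary: the most frequent words across all documents.
counts = Counter(w.lower() for words, _ in toy_docs for w in words)
word_features = [w for w, _ in counts.most_common(3)]

def document_features(document):
    document_words = set(document)
    # One boolean feature per vocabulary word: is it present in this document?
    return {'contains(%s)' % w: (w in document_words) for w in word_features}

feats = document_features(['a', 'wonderful', 'surprise'])
```

Words outside the vocabulary, such as "surprise" here, produce no feature at all, which is why the choice of vocabulary size matters.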
Compute accuracy and see what
are the most useful feature values
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print(nltk.classify.accuracy(classifier, test_set))
0.81
>>> classifier.show_most_informative_features(5)
Most Informative Features
    contains(outstanding) = True    pos : neg =   11.1 : 1.0
    contains(seagal) = True         neg : pos =    7.7 : 1.0
    contains(wonderfully) = True    pos : neg =    6.8 : 1.0
    contains(damon) = True          pos : neg =    5.9 : 1.0
    contains(wasted) = True         neg : pos =    5.8 : 1.0
There is more
• As time allows, let’s look at other
sections of this chapter. We do not have
time to do justice to all the topics, but
we can take a few and look into them.