Classification

Classifying text
NLTK Chapter 6
Chapter 6 topics
• How can we identify particular features of
language data that are salient for classifying
it?
• How can we construct models of language
that can be used to perform language
processing tasks automatically?
• What can we learn about language from these
models?
From words to larger units
• We looked at how words are identified with
a part of speech. That is an essential part of
“understanding” textual material
• Now, how can we classify whole documents?
– These techniques are used for spam detection, for
identifying the subject matter of a news feed, and
for many other tasks related to categorizing text
A supervised classifier
We saw a smaller version of this in our part of speech taggers
Case study
Male and female names
• Note this is language biased (English)
• These distinctions are harder given modern
naming conventions
– I have a granddaughter named Sydney, for
example
Step 1: features and encoding
• Deciding what features to look for and how to
represent those features is the first step, and is
critical.
– All the training and classification will be based on
these decisions
• Initial choice for name identification: look at the
last letter:
>>> def gender_features(word):
...     return {'last_letter': word[-1]}
>>> gender_features('Shrek')
{'last_letter': 'k'}
– Returns a dictionary (note the { }) with a feature name and the corresponding value
First gender check
import nltk
def gender_features(word):
    return {'last_letter': word[-1]}
name = raw_input("What name shall we check? ")
features = gender_features(name)
print "Gender features for", name, ":", features
Step 2: Provide training values
• We provide a list of examples and their
corresponding feature values.
>>> from nltk.corpus import names
>>> import random
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...          [(name, 'female') for name in names.words('female.txt')])
>>> random.shuffle(names)
>>> names
[('Kate', 'female'), ('Eleonora', 'female'), ('Germaine', 'male'), ('Helen', 'female'), ('Rachelle',
'female'), ('Nanci', 'female'), ('Aleta', 'female'), ('Catherin', 'female'), ('Clementia', 'female'),
('Keslie', 'female'), ('Callida', 'female'), ('Horatius', 'male'), ('Kraig', 'male'), ('Cindra', 'female'),
('Jayne', 'female'), ('Fortuna', 'female'), ('Yovonnda', 'female'), ('Pam', 'female'), ('Vida', 'female'),
('Margurite', 'female'), ('Maryellen', 'female'), …
>>> featuresets = [(gender_features(n), g) for (n,g) in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
• Try it. Apply the classifier to your name:
>>> classifier.classify(gender_features('Sydney'))
'female'
• Try it on the test data and see how it does:
>>> print nltk.classify.accuracy(classifier, test_set)
0.758
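The accuracy figure is simply the fraction of test items whose guessed label matches the known label. A minimal sketch of that computation follows. It uses a hand-written stand-in rule (`toy_classify`, hypothetical, not the trained NLTK classifier) and a tiny made-up test set, so the number it prints says nothing about the real corpus.

```python
def gender_features(word):
    return {'last_letter': word[-1]}

def toy_classify(features):
    # Hypothetical stand-in rule, NOT the trained classifier:
    # names ending in a, e or i are guessed female, all others male.
    if features['last_letter'] in 'aei':
        return 'female'
    return 'male'

def accuracy(classify, labeled_featuresets):
    # Fraction of items where the guess equals the known label.
    correct = 0
    for (features, label) in labeled_featuresets:
        if classify(features) == label:
            correct += 1
    return float(correct) / len(labeled_featuresets)

# Made-up test data, for illustration only.
toy_names = [('Kate', 'female'), ('Kraig', 'male'),
             ('Helen', 'female'), ('Horatius', 'male')]
toy_test_set = [(gender_features(n), g) for (n, g) in toy_names]

print(accuracy(toy_classify, toy_test_set))   # 3 of 4 correct: 0.75
```

'Helen' is the one miss here: it ends in 'n', so the stand-in rule guesses male.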
Your turn
• Modify the gender_features function to look
at more of the name than the last letter. Does
it help to look at the last two letters? the first
letter? the length of the name? Try a few
variations
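One possible starting point for the exercise, as a sketch only: the function name `gender_features2` and the feature names are illustrative choices, not part of NLTK. Whether any of these features actually helps is something to check by retraining and measuring accuracy.

```python
def gender_features2(word):
    # Illustrative variant: looks at more of the name than the last letter.
    # Lowercasing is an assumption made here so 'S' and 's' count as the same letter.
    word = word.lower()
    return {
        'first_letter': word[0],
        'last_letter': word[-1],
        'last_two': word[-2:],
        'length': len(word),
    }

print(gender_features2('Shrek'))
```

To try it, build the feature sets with `gender_features2` in place of `gender_features` and train again.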
What is most useful
• There is even a function to show what was most
useful in the classification:
>>> classifier.show_most_informative_features(10)
Most Informative Features
    last_letter = 'k'      male : female =   45.7 : 1.0
    last_letter = 'a'    female : male   =   38.4 : 1.0
    last_letter = 'f'      male : female =   28.7 : 1.0
    last_letter = 'v'      male : female =   11.2 : 1.0
    last_letter = 'p'      male : female =   11.2 : 1.0
    last_letter = 'd'      male : female =    9.8 : 1.0
    last_letter = 'm'      male : female =    8.9 : 1.0
    last_letter = 'o'      male : female =    8.3 : 1.0
    last_letter = 'r'      male : female =    6.7 : 1.0
    last_letter = 'g'      male : female =    5.6 : 1.0
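Each row is a likelihood ratio: how much more often that feature value shows up with one label than with the other. A rough sketch of the idea on made-up data, using plain relative frequencies. NLTK smooths its probability estimates, so its numbers will not match a raw count like this one; the helper name `likelihood_ratio` is hypothetical.

```python
from collections import Counter

# Made-up labeled names, for illustration only.
toy_names = [('Mark', 'male'), ('Erik', 'male'), ('Luca', 'male'), ('Tom', 'male'),
             ('Anna', 'female'), ('Maria', 'female'), ('Ella', 'female'), ('Jan', 'female')]

def likelihood_ratio(labeled_names, letter, label_a, label_b):
    # P(last_letter = letter | label_a) / P(last_letter = letter | label_b),
    # estimated from raw counts. Fails with ZeroDivisionError if the letter
    # never occurs with label_b, which is one reason real classifiers smooth.
    letter_counts = Counter((name[-1], label) for (name, label) in labeled_names)
    label_totals = Counter(label for (name, label) in labeled_names)
    p_a = float(letter_counts[(letter, label_a)]) / label_totals[label_a]
    p_b = float(letter_counts[(letter, label_b)]) / label_totals[label_b]
    return p_a / p_b

# 3 of 4 female names end in 'a', against 1 of 4 male names.
print('%.1f : 1.0' % likelihood_ratio(toy_names, 'a', 'female', 'male'))   # 3.0 : 1.0
```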
What features to use
• Overfitting
– Being too specific about the characteristics that
you search for
– Picks up idiosyncrasies of the training data and
may not transfer well to the test data
• Choose an initial feature set and then test.
The chair example.
What features would you use?
Dev test
• Divide the corpus into three parts: training,
development testing, final testing
Testing stages
>>> train_names = names[1500:]         # from 1500 to end
>>> devtest_names = names[500:1500]
>>> test_names = names[:500]           # first 500 items
>>> train_set = [(gender_features(n), g) for (n,g) in train_names]
>>> devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
>>> test_set = [(gender_features(n), g) for (n,g) in test_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, devtest_set)
0.765
Accuracy noted, but where were the problems?
import nltk
from nltk.corpus import names
import random

def gender_features(word):
    return {'last_letter': word[-1]}

names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)
print "Number of names: ", len(names)

train_names = names[1500:]
devtest_names = names[500:1500]
test_names = names[:500]

train_set = [(gender_features(n), g) for (n, g) in train_names]
devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
test_set = [(gender_features(n), g) for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, devtest_set)
# show_most_informative_features prints its own table and returns None,
# so it is called directly rather than passed to print
classifier.show_most_informative_features(10)
Output from previous code
Number of names: 7944
0.771
Most Informative Features
    last_letter = 'k'      male : female =   39.7 : 1.0
    last_letter = 'a'    female : male   =   31.4 : 1.0
    last_letter = 'f'      male : female =   16.0 : 1.0
    last_letter = 'v'      male : female =   14.1 : 1.0
    last_letter = 'd'      male : female =   10.3 : 1.0
    last_letter = 'p'      male : female =    9.8 : 1.0
    last_letter = 'm'      male : female =    8.6 : 1.0
    last_letter = 'o'      male : female =    7.8 : 1.0
    last_letter = 'r'      male : female =    6.6 : 1.0
    last_letter = 'w'      male : female =    4.8 : 1.0
Checking where the errors are
• The following code collects the names that the classifier got wrong
import nltk
from nltk.corpus import names
import random

def gender_features(word):
    return {'last_letter': word[-1]}

names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)
print "Number of names: ", len(names)

train_names = names[1500:]
devtest_names = names[500:1500]
test_names = names[:500]

train_set = [(gender_features(n), g) for (n, g) in train_names]
devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
test_set = [(gender_features(n), g) for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)

print "Look for error cases:"
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        # keep the correct tag, the wrong guess, and the name
        errors.append((tag, guess, name))
for (tag, guess, name) in sorted(errors):
    print 'correct= %-8s guess= %-8s name =%-30s' % (tag, guess, name)
print "Number of errors: ", len(errors)
print nltk.classify.accuracy(classifier, devtest_set)
• Check the classifier against the known
values and see where it failed:
Number of names: 7944
Look for error cases:
correct= female   guess= male     name =Abagail
correct= female   guess= male     name =Adrian
correct= female   guess= male     name =Alex
correct= female   guess= male     name =Amargo
correct= female   guess= male     name =Anabel
correct= female   guess= male     name =Annabal
correct= female   guess= male     name =Annabel
correct= female   guess= male     name =Arabel
correct= female   guess= male     name =Ardelis
…
Finding the error cases
• Look through the list of error cases.
• Do you see any patterns?
• Are there adjustments that we could make in
our feature extractor to make it more
accurate?
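One way to look for patterns is to tally the endings of the misclassified names. A small sketch, assuming an `errors` list of (correct tag, guess, name) tuples like the one the error-checking loop builds; the sample entries are taken from the error output shown above, and the helper name `suffix_counts` is hypothetical.

```python
from collections import Counter

# A few (correct tag, guess, name) tuples from the error listing.
errors = [('female', 'male', 'Abagail'), ('female', 'male', 'Adrian'),
          ('female', 'male', 'Alex'), ('female', 'male', 'Anabel'),
          ('female', 'male', 'Annabel'), ('female', 'male', 'Arabel')]

def suffix_counts(errors, n=2):
    # Count how often each n-letter ending appears among the error cases.
    return Counter(name[-n:].lower() for (tag, guess, name) in errors)

# The most frequent ending among the misclassified names.
for (suffix, count) in suffix_counts(errors).most_common(1):
    print('%s %d' % (suffix, count))   # el 3
```

An ending that keeps turning up among the errors is a hint that a feature covering more than the final letter might help.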
Error analysis
• It turns out that using the last two letters
improves the accuracy.
• Did you find that in your experimentation?
Summarize the process
• Train on a subset of the available data
– Look for characteristics that relate to the “right” answer.
Write the feature extractor to look at those characteristics
• Run the classifier on other data – whose
characteristics are known! – to see how well it
performs
– You have to know the answers to know whether the
classifier got them right.
• When satisfied with the performance of the classifier,
run it on new data for which you do not know the
answer.
The disease example. If 98% of your cases are disease free …
– How confident can you be?
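One reading of the disease example, worked as arithmetic: with 98% of cases disease free, a "classifier" that never looks at its input and always answers disease free is already 98% accurate, yet it finds no disease cases at all. The labels and the function `always_healthy` below are made up for illustration.

```python
# Made-up labels: 98 disease-free cases and 2 cases with the disease.
labels = ['healthy'] * 98 + ['disease'] * 2

def always_healthy(case):
    # Hypothetical baseline that ignores its input entirely.
    return 'healthy'

correct = sum(1 for label in labels if always_healthy(None) == label)
accuracy = float(correct) / len(labels)
found = sum(1 for label in labels
            if label == 'disease' and always_healthy(None) == 'disease')

print('accuracy: %.2f' % accuracy)          # accuracy: 0.98
print('disease cases found: %d' % found)    # disease cases found: 0
```

So a high accuracy figure on its own is weak evidence when one category dominates the data.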
Document classification
• So far, classified names as Male/Female
– Not much to work with, not much to look at
• Now, look at whole documents
– How can you classify a document?
– Subject matter in a syllabus collection, positive and
negative movie/restaurant/other reviews, bias in a
summary or review, subject matter in a news feed,
separate works by author, …
• Case study, classifying movie reviews
Classifying documents
• To classify words (names), we looked at
letters.
• Feature extraction for documents will use
words
• Find the most common words in the
document set and see which words are in
which types of documents
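The idea can be tried without the corpus first. A toy sketch with made-up "reviews", using `collections.Counter` from the standard library in place of NLTK's frequency distribution; it keeps only the 3 most frequent words as features.

```python
from collections import Counter

# Made-up miniature "reviews", for illustration only.
toy_documents = [
    (['a', 'wonderful', 'film'], 'pos'),
    (['a', 'dull', 'film'], 'neg'),
    (['wonderful', 'cast'], 'pos'),
]

# Count every word across the whole document set.
word_counts = Counter(w.lower() for (words, category) in toy_documents
                      for w in words)
# Keep the 3 most frequent words as the features to look for.
word_features = [w for (w, count) in word_counts.most_common(3)]

def document_features(document):
    # One True/False feature per chosen word: is it in this document?
    document_words = set(document)
    return dict(('contains(%s)' % w, w in document_words)
                for w in word_features)

for (words, category) in toy_documents:
    print('%s %s' % (category, sorted(document_features(words).items())))
```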
import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

cats = list(cat for cat in movie_reviews.categories())
print "Movie review Categories:", cats
print "Number of reviews:", len(documents)
Feature extractor.
Are the words present in the documents
import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = all_words.keys()[:2000]

# Line by line, what does this do?
def document_features(document):
    # This is something different, but we have seen its like before
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

# What is this?
print document_features(movie_reviews.words('pos/cv957_8737.txt'))
And if you are not sure …
• What do you do?
– Enter the code and run it
– Go to a search engine and type “Python <issue
description>”
Compute accuracy and see what are
the most useful feature values
• Just as we did with classifying names
• Create a feature set
• Create a training set and a testing set
• Apply to new data
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
0.81
Most Informative Features
    contains(outstanding) = True    pos : neg =   11.1 : 1.0
         contains(seagal) = True    neg : pos =    8.3 : 1.0
          contains(mulan) = True    pos : neg =    8.3 : 1.0
          contains(damon) = True    pos : neg =    8.1 : 1.0
    contains(wonderfully) = True    pos : neg =    6.8 : 1.0
Full code for this example
import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# In NLTK 2, keys() is sorted by decreasing frequency, so this takes the
# 2000 most frequent words. In NLTK 3, use
# [w for (w, count) in all_words.most_common(2000)] instead.
word_features = all_words.keys()[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
# prints its own table and returns None, so no print here
classifier.show_most_informative_features(5)
From the text
• This note from the text attracted my
attention:
Note
The reason that we compute the set of all words in a document in
<figure reference>, rather than just checking if word in document, is
that checking whether a word occurs in a set is much faster than
checking whether it occurs in a list (4.7).
• What does that suggest?
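The claim in the note is easy to check with the standard `timeit` module. A small sketch, with a made-up list of 10,000 strings standing in for a document's words; the actual times depend on the machine, so no expected timings are given.

```python
import timeit

words = [str(i) for i in range(10000)]   # stand-in for a document's word list
word_set = set(words)                    # same items, stored as a set
target = '9999'                          # last item: worst case for the list scan

# Time 1000 membership tests against each container.
list_time = timeit.timeit(lambda: target in words, number=1000)
set_time = timeit.timeit(lambda: target in word_set, number=1000)

print('list: %.4f s   set: %.4f s' % (list_time, set_time))
```

The list test has to walk through the items one by one, while the set test uses a hash lookup, so the gap widens as the document gets longer.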
The time has come …
• We have learned a lot of Python
• Something about object-oriented
programming
• A bit about Text Analysis
• A bit about network programming, web
crawling, servers, etc.
• There is lots more to all of those subjects.
I am happy to review or discuss anything we did this semester.
If you are doing some Python programming later and want to discuss it, I will be
happy to talk to you about it.