Python & Web Mining
Lecture 5
10-03-12
Old Dominion University
Department of Computer Science
CS 495 Fall 2012
Presented & Prepared by:
Justin F. Brunelle (jbrunelle@cs.odu.edu)
Hany SalahEldeen Khalil (hany@cs.odu.edu)
Chapter 6:
“Document Filtering”
Document Filtering
In a nutshell: classifying documents
based on their content. The
classification can be binary (good/bad,
spam/not-spam) or n-ary (school-related
emails, work-related, commercials, etc.).
Why do we need Document filtering?
• Eliminating spam.
• Removing unrelated comments in forums
and public message boards.
• Classifying social/work-related emails
automatically.
• Forwarding information-request emails
to the expert who is most capable of
answering them.
Spam Filtering
• At first, spam filters were rule-based
classifiers that flagged features such as:
• Overuse of capital letters
• Words related to pharmaceutical
products
• Garish HTML colors
Cons of using Rule-based classifiers
• Easy to trick by simply avoiding the known
patterns of capital letters, etc.
• What is considered spam varies from one
user to another.
• Ex: the inbox of a medical rep vs. the
email of a housewife.
Solution
• Develop programs that learn.
• Teach them the differences and how to
recognize each class by providing
examples of each.
Features
• We need to extract features from
documents to classify them.
• Feature: anything that you can
determine as being either present or
absent in the item.
Definitions
• item = document
• feature = word
• classification = {good|bad}
Dictionary Building
• Remember:
• Removing capital letters (lowercasing) reduces
the total number of features by collapsing
the SHOUTING style.
• The granularity of the features is also crucial
(using the entire email as one feature vs. each
letter as a feature); see the sketch below.
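A minimal word-extraction sketch in the spirit of the book's getwords(): lowercase, split on non-letters, and keep mid-length words as present/absent features. The exact regex and length bounds here are assumptions, not the book's verbatim code.

import re

def getwords(doc):
    # Split the document on any non-alphanumeric characters.
    words = re.split(r'\W+', doc)
    # Lowercase everything (collapses the SHOUTING style) and drop
    # very short or very long tokens.
    words = [w.lower() for w in words if 2 < len(w) < 20]
    # A feature is merely present or absent, so a dict of flags suffices.
    return {w: 1 for w in words}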
Classifier Training
• The classifier is designed to start off very
uncertain and to grow more certain as it
learns features; a sketch follows below.
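Conceptually, training just counts how often each feature appears with each classification. A bare-bones sketch (the names mirror the book's classifier class, but this is an illustration, not the full code):

class classifier:
    def __init__(self, getfeatures):
        self.fc = {}   # feature -> {category: count}
        self.cc = {}   # category -> number of items trained
        self.getfeatures = getfeatures

    def train(self, item, cat):
        # Every feature in this item adds one vote for this category.
        for f in self.getfeatures(item):
            self.fc.setdefault(f, {}).setdefault(cat, 0)
            self.fc[f][cat] += 1
        self.cc[cat] = self.cc.get(cat, 0) + 1

    def fcount(self, f, cat):
        # How often this feature has appeared in this category.
        return float(self.fc.get(f, {}).get(cat, 0))

    def catcount(self, cat):
        # How many items have been trained as this category.
        return float(self.cc.get(cat, 0))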
Probabilities
• A probability is a number between 0 and 1
indicating how likely an event is.
• ‘quick’ appeared in 2 documents classified
as good, and the total number of good
documents is 3.
Conditional Probabilities
Pr(A|B) = “probability of A given B”
fprob(quick|good) = “probability of quick given good”
= (quick classified as good) / (total good items)
= 2/3
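As a method on the classifier sketched earlier, this conditional probability might look like the following (a sketch, not the book's verbatim code):

    def fprob(self, f, cat):
        # Pr(feature|category): fraction of the items trained in this
        # category that contain the feature,
        # e.g. fprob('quick','good') = 2/3.
        if self.catcount(cat) == 0:
            return 0
        return self.fcount(f, cat) / self.catcount(cat)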
Starting with a Reasonable Guess
• Using only the information we have seen so
far makes the classifier extremely sensitive
in early training stages.
• Ex: “money”
• “money” appeared in the casino training
document, which is bad.
• It therefore appears with probability = 0 for
good, which is not right!
Solution: Start with an assumed probability
• Start, for instance, with a 0.5 probability for
each feature.
• Also decide the weight to give to the
assumed probability.
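A sketch of such a weighted probability, following the formula worked out in the example below (the weight of 1.0 and assumed probability of 0.5 are the defaults used there; count is how often the feature has appeared across all categories):

    def weightedprob(self, f, cat, prf, weight=1.0, ap=0.5):
        basicprob = prf(f, cat)   # e.g. fprob(f, cat)
        # Total number of times this feature appears in any category.
        totals = sum(self.fcount(f, c) for c in self.cc)
        # Weighted mean of the assumed and the observed probability.
        return (weight * ap + totals * basicprob) / (weight + totals)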
Assumed Probability
>>> cl.fprob('money','bad')
0.5
>>> cl.fprob('money','good')
0.0
we have data for bad, but should we start
with 0 probability for money given good?
>>> cl.weightedprob('money','good',cl.fprob)
0.25
>>> docclass.sampletrain(cl)
Nobody owns the water.
the quick rabbit jumps fences
buy pharmaceuticals now
make quick money at the online casino
the quick brown fox jumps
>>> cl.weightedprob('money','good',cl.fprob)
0.16666666666666666
>>> cl.fcount('money','bad')
3.0
>>> cl.weightedprob('money','bad',cl.fprob)
0.5
Define an assumed probability of 0.5;
weightedprob() then returns
the weighted mean of fprob() and
the assumed probability, where count is the
total number of times the feature has
appeared in any category:
weightedprob(money,good) =
(weight * assumed + count * fprob())
/ (count + weight)
= (1*0.5 + 1*0) / (1+1)
= 0.5 / 2
= 0.25
(double the training)
= (1*0.5 + 2*0) / (2+1)
= 0.5 / 3
= 0.166
Pr(money|bad) remains = (0.5 + 3*0.5) / (3+1) = 0.5
Naïve Bayesian Classifier
• Move from terms to documents:
Pr(document) = Pr(term_1) * Pr(term_2) * … * Pr(term_n)
• Naïve because we assume all terms occur
independently
• we know this is a simplifying assumption; it is
naïve to think all terms have equal probability for
completing this phrase:
• “Shave and a hair cut ___ ____”
• Bayesian because we use Bayes’ Theorem to
invert the conditional probabilities
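Under the independence assumption, the document probability is just the product of the per-term probabilities. A sketch of a naive Bayes subclass of the classifier above, using the smoothed weightedprob() for each term:

class naivebayes(classifier):
    def docprob(self, item, cat):
        # Pr(document|category) = product of Pr(term|category).
        p = 1.0
        for f in self.getfeatures(item):
            p *= self.weightedprob(f, cat, self.fprob)
        return p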
Bayes Theorem
• Given our training data, we know:
Pr(feature|classification)
• What we really want to know is:
Pr(classification|feature)
• Bayes’ Theorem*:
Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)
Or:
Pr(good|doc) = Pr(doc|good) Pr(good) / Pr(doc)
• Pr(doc|good): we know how to calculate this
• Pr(good): #good / #total
• Pr(doc): we skip this since it is the same for
each classification
* http://en.wikipedia.org/wiki/Bayes%27_theorem
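In code: since Pr(doc) is identical for every classification, we can drop it and compare unnormalized scores. A sketch:

    def prob(self, item, cat):
        # Pr(category) = #items in category / #items total.
        catprob = self.catcount(cat) / sum(self.cc.values())
        # Pr(category|document) is proportional to
        # Pr(document|category) * Pr(category); Pr(document) is
        # omitted since it is the same for all categories.
        return self.docprob(item, cat) * catprob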
Our Bayesian Classifier
>>> import docclass
>>> cl=docclass.naivebayes(docclass.getwords)
>>> docclass.sampletrain(cl)
Nobody owns the water.
the quick rabbit jumps fences
buy pharmaceuticals now
make quick money at the online casino
the quick brown fox jumps
>>> cl.prob('quick rabbit','good')
quick rabbit
0.15624999999999997
>>> cl.prob('quick rabbit','bad')
quick rabbit
0.050000000000000003
>>> cl.prob('quick rabbit jumps','good')
quick rabbit jumps
0.095486111111111091
>>> cl.prob('quick rabbit jumps','bad')
quick rabbit jumps
0.0083333333333333332
Note: we use these values only for comparison, not as “real” probabilities.
Bayesian Classifier
• http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Testing
Classification Thresholds
>>> cl.prob('quick rabbit','good')
quick rabbit
0.15624999999999997
>>> cl.prob('quick rabbit','bad')
quick rabbit
0.050000000000000003
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'
>>> cl.prob('quick money','good')
quick money
0.09375
>>> cl.prob('quick money','bad')
quick money
0.10000000000000001
>>> cl.classify('quick money',default='unknown')
quick money
u'bad'
>>> cl.setthreshold('bad',3.0)
>>> cl.classify('quick money',default='unknown')
quick money
'unknown'
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'
Note: setthreshold('bad',3.0) means we only
classify something as bad if it is 3x more likely
to be bad than good.
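A sketch of how such a threshold might be applied (assuming a self.thresholds dict filled in by setthreshold(), defaulting to 1.0): pick the highest-scoring category, then require it to beat every rival by its threshold factor, falling back to the default otherwise.

    def classify(self, item, default=None):
        # Score every category we have trained on.
        probs = dict((c, self.prob(item, c)) for c in self.cc)
        best = max(probs, key=probs.get)
        for cat, p in probs.items():
            if cat == best:
                continue
            # e.g. 'bad' must be 3x more likely than any rival to win.
            if p * self.thresholds.get(best, 1.0) > probs[best]:
                return default
        return best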
Classification Thresholds…cont
>>> for i in range(10): docclass.sampletrain(cl)
>>> cl.prob('quick money','good')
quick money
0.016544117647058824
>>> cl.prob('quick money','bad')
quick money
0.10000000000000001
>>> cl.classify('quick money',default='unknown')
quick money
u'bad'
>>> cl.prob('quick rabbit','good')
quick rabbit
0.13786764705882351
>>> cl.prob('quick rabbit','bad')
quick rabbit
0.0083333333333333332
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'
Fisher Method
• Normalize the frequencies for each category
• e.g., we might have far more “bad” training data
than good, so the net cast by the bad data will be
“wider” than we’d like
• Calculate the normalized Bayesian probability, then
fit the result to an inverse chi-square function
to see how likely it is that a random
document of that classification would have
those features (i.e., terms)
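The per-feature probability used by the Fisher method normalizes fprob() by its sum over all categories, so a lopsided amount of training data in one category cannot dominate. A sketch, as a method of a fisherclassifier subclass:

    def cprob(self, f, cat):
        # Frequency of this feature within the category...
        clf = self.fprob(f, cat)
        if clf == 0:
            return 0
        # ...normalized by its frequency across all categories.
        freqsum = sum(self.fprob(f, c) for c in self.cc)
        return clf / freqsum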
Fisher Example
>>> import docclass
>>> cl=docclass.fisherclassifier(docclass.getwords)
>>> cl.setdb('mln.db')
>>> docclass.sampletrain(cl)
>>> cl.cprob('quick','good')
0.57142857142857151
>>> cl.fisherprob('quick','good')
quick
0.5535714285714286
>>> cl.fisherprob('quick rabbit','good')
quick rabbit
0.78013986588957995
>>> cl.cprob('rabbit','good')
1.0
>>> cl.fisherprob('rabbit','good')
rabbit
0.75
>>> cl.cprob('quick','good')
0.57142857142857151
>>> cl.cprob('quick','bad')
0.4285714285714286
Fisher Example
>>> cl.cprob('money','good')
0
>>> cl.cprob('money','bad')
1.0
>>> cl.cprob('buy','bad')
1.0
>>> cl.cprob('buy','good')
0
>>> cl.fisherprob('money buy','good')
money buy
0.23578679513998632
>>> cl.fisherprob('money buy','bad')
money buy
0.8861423315082535
>>> cl.fisherprob('money quick','good')
money quick
0.41208671548422637
>>> cl.fisherprob('money quick','bad')
money quick
0.70116895256207468
Classification with Inverse Chi-Square
>>> cl.fisherprob('quick rabbit','good')
quick rabbit
0.78013986588957995
>>> cl.classify('quick rabbit')
quick rabbit
u'good'
>>> cl.fisherprob('quick money','good')
quick money
0.41208671548422637
>>> cl.classify('quick money')
quick money
u'bad'
>>> cl.setminimum('bad',0.8)
>>> cl.classify('quick money')
quick money
u'good'
>>> cl.setminimum('good',0.4)
>>> cl.classify('quick money')
quick money
u'good'
>>> cl.setminimum('good',0.42)
>>> cl.classify('quick money')
quick money
Note: in practice, we’ll tolerate false positives for “good” more than
false negatives for “good” -- we’d rather see a message that is spam
than lose a message that is not spam.
Note: this version of the classifier does not print “unknown” as a
classification (the last call above returns nothing).
Fisher -- Simplified
• Improves the signal-to-noise ratio
• Assumes documents occur with a normal
distribution
• Estimates differences in corpus size with chi-squared
• Chi-squared is a “goodness-of-fit” test between an
observed distribution and a theoretical distribution
• Utilizes confidence interval & std. dev. estimations
for a corpus
• http://en.wikipedia.org/w/index.php?title=File:Chisquare_pdf.svg&page=1
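A sketch of the Fisher score itself: multiply the (weighted) per-feature probabilities, take -2*ln(p), and feed the result to an inverse chi-square function with twice the number of features as degrees of freedom:

import math

class fisherclassifier(classifier):
    def invchi2(self, chi, df):
        # Probability that a chi-square value at least this large
        # would occur by chance with df degrees of freedom
        # (df is always even here).
        m = chi / 2.0
        total = term = math.exp(-m)
        for i in range(1, df // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

    def fisherprob(self, item, cat):
        # Combine the normalized per-feature probabilities (cprob).
        p = 1.0
        features = self.getfeatures(item)
        for f in features:
            p *= self.weightedprob(f, cat, self.cprob)
        # Fisher's method: -2*ln(p) follows a chi-square distribution
        # under the null hypothesis.
        return self.invchi2(-2 * math.log(p), len(features) * 2)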
Assignment 4
• Pick one question from the end of the
chapter.
• Implement the function and briefly state
the differences.
• Utilize the Python files associated
with the class if needed.
• Deadline: Next week