Bayesian Spam Filter - University of St. Thomas

Bayesian Spam Filter
By Joshua Spaulding
Statement of Problem
“Spam email now accounts for
more than half of all messages
sent and imposes huge
productivity costs… By 2007,
Spam-stopping should grow to a
$2.4 Billion Business.”
Technology Review 8/03
Objective
Using Bayes’ rule I will attempt to
classify an email message as spam
or non-spam (ham). I will use a
corpus of spam and ham to
determine the probability that a new
email is spam given the tokens in the
message.
Definition of Spam
Unsolicited automated email
Bayes’ Rule
P(A|B) = P(B|A)P(A) / P(B)
P(A|B) is the conditional probability that event A occurs
given that event B has occurred;
P(B|A) is the conditional probability of event B occurring
given that event A has occurred;
P(A) is the probability of event A occurring;
P(B) is the probability of event B occurring.
Bayes’ Rule
P(spam|token) = P(token|spam)P(spam) / P(token)
P(spam|token) – probability that an email is spam given that it contains the token
P(token|spam) – probability that the token appears given that the email is spam
P(spam) – probability that any email is spam
P(token) – probability that the token appears in any email
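As a minimal sketch of the formula above, each term can be estimated from simple counts over a labelled corpus. The class name, method name, and the counts in `main` are hypothetical and chosen only for illustration, and returning 0.5 for a token that never appears is an assumption of this sketch. With a token seen in 40 of 100 spam and 10 of 100 ham, the formula gives 0.4 × 0.5 / 0.25 = 0.8.

```java
public class TokenBayes {
    /**
     * Estimates P(spam|token) = P(token|spam) * P(spam) / P(token)
     * from message counts. All parameter names are illustrative.
     */
    public static double spamGivenToken(int spamWithToken, int hamWithToken,
                                        int spamTotal, int hamTotal) {
        int total = spamTotal + hamTotal;
        double pSpam = (double) spamTotal / total;                   // P(spam)
        double pTokenGivenSpam = (double) spamWithToken / spamTotal; // P(token|spam)
        double pToken = (double) (spamWithToken + hamWithToken) / total; // P(token)
        if (pToken == 0.0) {
            return 0.5; // assumption: a token never seen is treated as neutral
        }
        return pTokenGivenSpam * pSpam / pToken;
    }

    public static void main(String[] args) {
        // Hypothetical counts: token in 40 of 100 spam and 10 of 100 ham.
        System.out.println(spamGivenToken(40, 10, 100, 100));
    }
}
```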
Project Design (orig)





Read in a large text file containing 1000 spam messages.
Read in a large text file containing 1000 ham messages.
Create a file for each corpus listing each token and its
number of occurrences in that corpus.
Create another file listing each token and the
probability that an email containing it is spam,
computed with Bayes’ rule.
When an email arrives, parse it into tokens and look up,
for each token, the probability that the email is spam
given that token. Then combine these per-token
probabilities to determine the probability that the
email is spam.
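The counting and combining steps of this design could be sketched as follows. The slides do not say how the per-token probabilities are combined, so the `combine` method below uses one common choice as an assumption: the naive Bayes combination, which treats tokens as independent and spam and ham as equally likely beforehand. The class and method names are hypothetical, and in-memory lists stand in for the corpus files.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

public class CorpusSketch {
    /** Counts how often each lower-cased token occurs across one corpus. */
    public static Map<String, Integer> countTokens(List<String> messages) {
        Map<String, Integer> counts = new HashMap<>();
        for (String message : messages) {
            StringTokenizer st = new StringTokenizer(message.toLowerCase());
            while (st.hasMoreTokens()) {
                counts.merge(st.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    /**
     * Combines per-token spam probabilities p1..pn as
     * (p1*...*pn) / (p1*...*pn + (1-p1)*...*(1-pn)).
     * Assumption: tokens are independent and priors are equal.
     */
    public static double combine(double[] probs) {
        double spamProduct = 1.0;
        double hamProduct = 1.0;
        for (double p : probs) {
            spamProduct *= p;
            hamProduct *= (1.0 - p);
        }
        double denominator = spamProduct + hamProduct;
        // Assumption: fall back to neutral when the evidence cancels out.
        return denominator == 0.0 ? 0.5 : spamProduct / denominator;
    }

    public static void main(String[] args) {
        // Hypothetical two-message corpus and per-token probabilities.
        System.out.println(countTokens(Arrays.asList("Buy now", "buy cheap")));
        System.out.println(combine(new double[] {0.8, 0.9}));
    }
}
```

With many tokens the products can underflow, so a fuller version would likely sum logarithms instead of multiplying raw probabilities.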
Project Design



Create a Narl model from 100 spam and 100
ham messages contained in two separate CSV files,
using Narl’s built-in Excel Model function.
(emailCorpus.narl)
Parse the body slot from emailCorpus.narl, create
word nodes and calculate the probability for each.
(kb.narl)
Examine the incoming text body, tokenize it and
create nodeNames. If a nodeName is already
in the kb, look up its probability.
Otherwise assign a probability value of 0.5.
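The lookup step described above might look roughly like this. The Narl API is not shown in the slides, so a plain `Map<String, Double>` stands in for kb.narl here. The class name, method name, delimiter set, and sample entry are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

public class LookupSketch {
    // Probability assigned to tokens that are not yet in the knowledge base.
    static final double UNKNOWN_TOKEN_PROBABILITY = 0.5;

    /**
     * Tokenizes the message body and returns one probability per token.
     * The map is a stand-in for the word nodes stored in kb.narl.
     */
    public static List<Double> lookup(String body, Map<String, Double> kb) {
        List<Double> probabilities = new ArrayList<>();
        // Assumption: split on whitespace and common punctuation.
        StringTokenizer st =
            new StringTokenizer(body.toLowerCase(), " \t\n\r\f.,;:!?\"()");
        while (st.hasMoreTokens()) {
            String nodeName = st.nextToken();
            probabilities.add(kb.getOrDefault(nodeName, UNKNOWN_TOKEN_PROBABILITY));
        }
        return probabilities;
    }

    public static void main(String[] args) {
        Map<String, Double> kb = new HashMap<>();
        kb.put("viagra", 0.99); // hypothetical knowledge-base entry
        System.out.println(lookup("Cheap VIAGRA, today!", kb));
    }
}
```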
Model
Email node
Word Node
Issues
• Text is unknown and often incomplete.
• Java data structures
  - Vector, StringTokenizer, floating-point operations
• Unfamiliar with Narl
Enhancements
• Read slots other than body.
• Read data in from another format. Gain
  more knowledge about the email.
• Better error handling.
• Read emails as they enter the mail server.
• Regular expression matching of
  StringTokenizer tokens.
• Performance tuning with more data.
• Take advantage of Narl functionality?
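One possible reading of the regular-expression enhancement is to extract tokens with `java.util.regex` rather than splitting on delimiters. The sketch below is an assumption about what that could look like, not the project's code. The pattern, which keeps letters, digits, `$` and apostrophes, and the class name are chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizer {
    // Assumption: a token is a run of letters, digits, '$' or apostrophes.
    private static final Pattern TOKEN = Pattern.compile("[a-z0-9$']+");

    /** Returns the lower-cased tokens found in the message body. */
    public static List<String> tokenize(String body) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(body.toLowerCase());
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Win $100 now, don't wait!"));
    }
}
```

Keeping characters such as `$` inside tokens lets strings like "$100" survive as a single token, which may carry more signal than the bare number.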
Demonstration
Questions?