CSC 380 Algorithm Project Presentation
Spam Detection Algorithms
Kyle McCombs
Bridget Kelly

Objective
• Create a text-filtering algorithm that can accurately and efficiently identify spam emails based on data collected from past spam emails.

Background
• spam: e-mail that is not wanted; e-mail that is sent to large numbers of people and that consists mostly of advertising; unsolicited, usually commercial, e-mail sent to a large number of addresses
• Spam is estimated to account for 70–95% of all emails

Method
• Create a word bank by parsing the bodies of the spam emails in the database
– Our methods disregard the sender address and subject line
• Each word is associated with its frequency of appearance across all emails evaluated during the learning phase
• Use this data to evaluate emails with one of two methods:
– Naïve Bayes classifier
– Markov model

Naïve Bayes Classifier - Background
• One of the oldest and most popular methods of spam detection; first known use in 1996
• Common text-identification method that uses features from the “bag of words” model
– Disregards grammar and word order, but not multiplicity
• Assumes independence among features: the value of any particular feature is unrelated to the presence or absence of any other feature
• Tailored to a specific user
• Offers a low false-positive rate

Naïve Bayes Classifier - Process
• Each word has a probability of being in a spam email
– The training phase builds these probabilities (for example, from an email user marking an email as spam)
• The probabilities of individual words are used to compute the probability that an email containing a particular set of words is spam
• If this probability meets a certain threshold, the email is determined to be spam

Naïve Bayes Classifier - Process
Considering one word’s effect on an email being spam:
$$\Pr(S|W) = \frac{\Pr(W|S)\,\Pr(S)}{\Pr(W|S)\,\Pr(S) + \Pr(W|H)\,\Pr(H)}$$
Pr(S|W) – probability that an email is spam, given that it contains word W
Pr(W|S) – probability that word W appears in spam
Pr(S) – probability that any given message is spam
Pr(W|H) – probability that word W appears in non-spam (ham)
Pr(H) – probability that any given message is not spam
Pr(S) = 0.8, Pr(H) = 0.2? Pr(S) = 0.9, Pr(H) = 0.1? (based on recent statistics)
Most Bayesian spam software makes no such assumption about incoming emails and treats both outcomes as equally likely, Pr(S) = Pr(H) = 0.5, so the formula can be simplified to:
$$\Pr(S|W) = \frac{\Pr(W|S)}{\Pr(W|S) + \Pr(W|H)}$$

Naïve Bayes Classifier - Process
Combining individual probabilities:
$$p = \frac{p_1 p_2 \cdots p_n}{p_1 p_2 \cdots p_n + (1 - p_1)(1 - p_2) \cdots (1 - p_n)}$$
p = probability that the email in question is spam
p_i = Pr(S|W_i), the probability that an email is spam given that it contains the i-th word
n = number of words being evaluated
* The multiplication shown here is actually done as addition in the log domain, because the numbers involved are very small.
Compare p to a determined threshold:
• If p is below the threshold, the email cannot be classified as spam
• If p is equal to or above the threshold, the email can be classified as spam

Naïve Bayes Classifier - Results
• 15,000 spam emails evaluated during learning phase
• Average classifier value of emails in the learning phase used as threshold
– 2.86% success rate in testing (86/3000 emails could be confidently identified as spam)
• Median: a better summary statistic for data that is not normally distributed
– 52.03% success rate when using the median value as threshold (1561/3000)
The SAS output shown on the right displays results from a PROC UNIVARIATE procedure run on a data set containing the Bayes classifier values for the 15,000 emails in the learning set. The data is highly skewed, and three different normality tests indicate that it is not normally distributed. This suggests that a model considering the individual probability of every word within an email is not the best fit for our data.
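
The per-word formula and the log-domain combination from the Process slides can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the function names, the ham counts (the word bank described here was built from spam only), the neutral value of 0.5 for unseen words, and the clamping constant `eps` are all assumptions made for the example.

```python
import math

def word_spam_probability(spam_count, ham_count, n_spam, n_ham):
    """Simplified Pr(S|W) = Pr(W|S) / (Pr(W|S) + Pr(W|H)), assuming equal priors."""
    p_w_spam = spam_count / n_spam  # Pr(W|S): fraction of spam emails containing W
    p_w_ham = ham_count / n_ham     # Pr(W|H): fraction of ham emails containing W (assumed available)
    if p_w_spam + p_w_ham == 0:
        return 0.5                  # unseen word: treated as neutral (assumption)
    return p_w_spam / (p_w_spam + p_w_ham)

def combine(probabilities, eps=1e-6):
    """Combine p_1..p_n into p, replacing the products with sums of logarithms."""
    log_spam = 0.0  # log(p_1 * ... * p_n)
    log_ham = 0.0   # log((1 - p_1) * ... * (1 - p_n))
    for p in probabilities:
        p = min(max(p, eps), 1.0 - eps)  # clamp so log(0) never occurs
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    diff = log_ham - log_spam
    if diff > 700:                       # exp() would overflow; p is effectively 0
        return 0.0
    # p = spam / (spam + ham) = 1 / (1 + ham / spam)
    return 1.0 / (1.0 + math.exp(diff))

def is_spam(p, threshold):
    """Classify as spam when p is equal to or above the threshold."""
    return p >= threshold
```

For example, `combine([0.9, 0.9])` evaluates 0.81 / (0.81 + 0.01), and thousands of small factors can be combined without underflow because only their logarithms are summed.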
Naïve Bayes Classifier - Results
• Only consider the 15 most “interesting” (highest) probabilities for each email in the classifier
• Neutral words (words associated with a low spam probability) should not affect the statistical significance of highly incriminating words, no matter how many there are
• 97.13% success rate (2914/3000 spam emails correctly identified)
– using the average Bayes value from the learning set as threshold

Markov Model - Background
• Models the statistical behavior of spam emails.
• Widely used in current spam classification systems.
• In essence, a Bayes filter works on single words alone, while a Markovian filter works on phrases or possibly whole sentences.

Markov Model - Process
• Training – analyze a training set of emails that are all known to be spam
• Examining adjacent words ‘A’ and ‘B’, compute the frequency with which word ‘B’ follows word ‘A’, for every word in the body of an email.
• If word ‘A’ is followed by a period, question mark, or exclamation point, skip it.

Markov Model - Process
• Calculate and store the average occurrence rate of word ‘B’ following word ‘A’, for every word in each email in the training set.
avgPerEmail(‘A’’B’) =
• Summing the average occurrence rates of ‘B’ following ‘A’ over all emails and dividing by the total number of emails in the training set gives the final average rate at which word ‘B’ followed word ‘A’ in the training set.
• $$\text{Final Avg. Occurrence (‘B’ follows ‘A’)} = \frac{\text{avgPerEmail(‘A’’B’)}_{\text{Email } 1} + \dots + \text{avgPerEmail(‘A’’B’)}_{\text{Email } n}}{\text{Number of Emails in Training Set}}$$
• Using a weighted directed graph, store each word encountered as a vertex, with edges between adjacent words holding the average rate of occurrence across all spam emails in the training set.

Markov Model - Process
Classification: when “grading” an email in question,
• Examine adjacent words
• Look up the corresponding edge weight in the graph (the average rate at which a word follows another word in the training set)
• Accumulate these weights for each email and calculate the average weight as the final grade for the email.
• If this grade is greater than or equal to a determined threshold, consider the email spam; otherwise, consider it not spam.
• If an edge does not exist (the two words were never adjacent in the training collection), it is skipped and has no effect on the overall grade.
• Skip common words that could be frequent in both spam and non-spam emails (e.g., the, this, I).

Markov Model - Results
• 3000 spam emails evaluated during learning phase
• 1000 spam emails used in the testing set.
• Average classifier grade of emails in the learning phase used as threshold.
• 920 spam emails correctly identified as spam.
• 92% success rate.
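
The Markov training and grading steps can be sketched in Python as shown here. This is an illustrative sketch under stated assumptions, not the project's implementation: all function names are hypothetical, the per-email rate is assumed to be the pair's count divided by the number of adjacent pairs kept in that email (the slides do not preserve the avgPerEmail formula), the stop-word list is only a sample, and stop words are assumed to be skipped during training as well as grading.

```python
from collections import Counter, defaultdict

SENTENCE_END = (".", "?", "!")
STOP_WORDS = {"the", "this", "i"}  # sample list only (assumption)

def adjacent_pairs(text):
    """Return (A, B) pairs of adjacent words, skipping sentence ends and stop words."""
    tokens = text.lower().split()
    pairs = []
    for a, b in zip(tokens, tokens[1:]):
        if a.endswith(SENTENCE_END):
            continue  # 'A' is followed by sentence-ending punctuation: skip
        a = a.strip(",;:\"'()")
        b = b.strip(".,;:!?\"'()")
        if not a or not b or a in STOP_WORDS or b in STOP_WORDS:
            continue
        pairs.append((a, b))
    return pairs

def train(emails):
    """Build the weighted directed graph: graph[A][B] = final average occurrence rate."""
    totals = defaultdict(float)
    for text in emails:
        pairs = adjacent_pairs(text)
        for pair, count in Counter(pairs).items():
            totals[pair] += count / len(pairs)  # assumed per-email rate
    graph = defaultdict(dict)
    for (a, b), total in totals.items():
        graph[a][b] = total / len(emails)       # average over the whole training set
    return graph

def grade(graph, text):
    """Average edge weight over adjacent pairs; pairs with no edge are skipped."""
    weights = [graph[a][b] for a, b in adjacent_pairs(text) if b in graph.get(a, {})]
    return sum(weights) / len(weights) if weights else 0.0

def average_grade(graph, emails):
    """Average grade of the training emails, used as the threshold."""
    return sum(grade(graph, text) for text in emails) / len(emails)

def is_spam(graph, text, threshold):
    """Classify as spam when the grade is equal to or above the threshold."""
    return grade(graph, text) >= threshold
```

A typical use would be `graph = train(spam_emails)`, then `threshold = average_grade(graph, spam_emails)`, then `is_spam(graph, new_email, threshold)` for each email in question. Storing the graph as a dictionary of dictionaries keeps each edge lookup constant time on average.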