Final Project


June 9, 2007

DATA MINING & MACHINE LEARNING FINAL PROJECT

Group 2

R95922027 李庭閣

R95922034 孔垂玖

R95922081 許守傑

R95942129 鄭力維

Contents

Experiment setting

Feature extraction

Date of mail

Number of receivers

Mail with attachment

Mail with image

Mail with URL

Mail Title

Mail body

Model training

Naïve Bayes

Knn

Maximum Entropy

SVM

Hybrid-Model

Vote

Neural network

Conclusion

Reference

Experiment setting

We selected a corpus from the Internet called enron-spam-preprocessed, derived from the Enron e-mail collection. It contains six folders (enron1 ~ enron6), with 13,496 spam mails and 15,045 ham mails in total. It is a preprocessed mail corpus: the HTML tags have been removed and the important headers have been factored out.
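A minimal sketch of how such a corpus can be loaded is shown below. It assumes the usual layout of the preprocessed Enron-Spam release, where each enronN directory contains ham/ and spam/ subfolders of plain-text mails; the folder layout and the function name load_enron are assumptions, not something stated in this report.

import os

def load_enron(root="enron-spam-preprocessed"):
    """Load all mails from enron1 ~ enron6 into (text, label) pairs.

    Assumes each enronN directory has ham/ and spam/ subfolders of .txt files.
    """
    data = []
    for i in range(1, 7):
        for label in ("ham", "spam"):
            folder = os.path.join(root, "enron%d" % i, label)
            if not os.path.isdir(folder):
                continue
            for name in sorted(os.listdir(folder)):
                with open(os.path.join(folder, name), encoding="latin-1") as f:
                    data.append((f.read(), label))
    return data

# mails = load_enron()   # should yield roughly 13496 spam and 15045 ham texts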

Feature Extraction

Feature 1 : Date of the mail

Figure 1

Figure 1 shows the distribution of mails over the hours of a day. Spam mails are nearly uniformly distributed across the twenty-four hours, while ham mails are concentrated in the daytime, from 7 a.m. to 6 p.m. This is reasonable, because most people work during the day.

Feature 2 : Number of receivers

Figure 2

Figure 2 shows the number of receivers of each mail. We can see that most spam mails have only one receiver, and no spam mail has more than 20 receivers. However, some ham mails have many receivers, because we sometimes send information to a group of people, such as coworkers in a company or classmates at school. The maximum number of receivers in the training data is 206.

Given a mail, we can check its date and its number of receivers and assign it a probability of being ham or spam.

We assign the probability of being ham as

P(ham | date = h) = P(h | ham) / ( P(h | ham) + P(h | spam) )

P(ham | #receivers = r) = P(r | ham) / ( P(r | ham) + P(r | spam) )
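The sketch below shows how these two conditional probabilities can be estimated from per-class training histograms. The function name ham_probability, the parameter names, and the add-one smoothing (used to avoid zero counts for unseen hours or receiver counts) are our own additions and are not described in the report.

from collections import Counter

def ham_probability(value, ham_counts, spam_counts, n_ham, n_spam, n_bins, smooth=1.0):
    """P(ham | feature = value), computed from per-class histograms.

    ham_counts / spam_counts: Counter mapping a feature value (the hour of the
    day, or the number of receivers) to how often it occurs in each class.
    n_bins is the number of possible values (e.g. 24 hours) used for smoothing.
    """
    p_v_ham = (ham_counts[value] + smooth) / (n_ham + smooth * n_bins)
    p_v_spam = (spam_counts[value] + smooth) / (n_spam + smooth * n_bins)
    return p_v_ham / (p_v_ham + p_v_spam)

# Toy example with hour-of-day histograms:
ham_hours = Counter({9: 300, 14: 250, 23: 10})
spam_hours = Counter({9: 40, 14: 45, 23: 50})
print(ham_probability(23, ham_hours, spam_hours, n_ham=560, n_spam=135, n_bins=24))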


Figure 3

Table 1

       Attachment (with / without : ratio)   Image (with / without : ratio)   URL (with / without : ratio)
Spam   5 / 13491 : 0.0307%                   92 / 13404 : 0.6816%             4154 / 9342 : 30.779%
Ham    1109 / 13936 : 7.3712%                0 / 15045 : 0%                   1061 / 13984 : 7.0521%

Feature 3 : Mail with Attachment


Ham : Spam = 7.3712 : 0.0307 ≈ 229 : 1

Mail with attachment: P(spam | with attached file) = 1 / (1 + 229) ≈ 0.004

Comparing the first column of Table 1, we can see that when an e-mail has an attachment, the probability that it is spam is extremely low.

Feature 4 : Mail with Image (img src = )

Ham : Spam = 0 : 1

Mail with image: P(spam | with image) = 0.999

From the second column we see that no ham mail in our corpus contains an image, which would imply that a mail with an image is spam with probability 1. To avoid the zero-probability problem, we assign the conditional probability 0.999 instead.

Feature 5: Mail with URL (http://)

Spam : Ham = 30.779 : 7.0521 ≈ 4 : 1

Mail with URL: P(spam | with URL) = 0.8

Given a mail containing a URL, the probability that it is spam is still noticeably higher.
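A small sketch of how these three flags can be extracted follows. The image marker ("img src =") and the URL marker ("http://") follow the section headings above; looking for a Content-Disposition header to detect attachments is our own assumption about how attachments appear in the preprocessed mails.

import re

def header_flags(mail_text):
    """Return the three binary features used above: attachment, image, URL."""
    lower = mail_text.lower()
    return {
        "has_attachment": "content-disposition: attachment" in lower,  # assumption
        "has_image": re.search(r"img\s+src\s*=", lower) is not None,
        "has_url": "http://" in lower,
    }

# Fixed conditional probabilities from the report:
FLAG_SPAM_PROB = {"has_attachment": 0.004, "has_image": 0.999, "has_url": 0.8}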

Feature 6 : Mail Title

Previous research has mentioned that non-alphanumeric characters, Arabic numerals, and punctuation marks in mail titles, or even the absence of a title, can be viewed as features that discriminate spam from ham.

Some papers say that spam mails sometimes have no title. That is true, but sometimes people simply forget to write a title as well. In our mail corpus, mails without titles account for 6% of hams and 7% of spams, respectively.


For the Arabic numerals in our mail corpus, this is not a powerful feature for discriminating the two classes. Many commercial hams contain IDs (receiver ID, product ID) or serial numbers (part #). Dates are another common numeral in mail titles, and they are equally likely to appear in the titles of both spam and ham. We did some further analysis by counting the numerals larger and smaller than 31; for both types of numerals, the split between the two mail classes is close to fifty-fifty.

In the experiment, punctuation marks and non-alphanumeric characters can be classified into three types: spam-biased, ham-biased, and unbiased. For example, "!" and "?" are widely used marks in spam, since spam often prefers a surprising tone. In our computations, "~ ^ | * % [] ! ? =" are spam-biased punctuation marks, while "\ / ; &" are ham-preferred. The others, such as "," and "-", are unbiased, since they are common punctuation marks in ordinary writing.

Table 2

Marks                 Probability of being Spam Mail    Feature Showing Rate
~ ^ | * % [] ! ? =    0.911                             28% in spams
\ / ; &               0.182                             16% in hams
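A small sketch of counting the two biased mark classes of Table 2 in a title is given below; the function and field names are ours.

SPAM_BIASED = set("~^|*%[]!?=")
HAM_BIASED = set("\\/;&")

def title_mark_features(title):
    """Count spam-biased and ham-biased marks in a mail title.

    The counts can be turned into the two binary features whose
    P(spam | mark present) values are 0.911 and 0.182 in Table 2.
    """
    return {
        "spam_biased_marks": sum(1 for ch in title if ch in SPAM_BIASED),
        "ham_biased_marks": sum(1 for ch in title if ch in HAM_BIASED),
        "has_title": bool(title.strip()),
    }

print(title_mark_features("!!! FREE [limited] offer ???"))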

Feature 7 : Mail-body


Once we have the mail documents, there is an important issue concerning word morphology, the field within linguistics that studies the internal structure of words (for example, the verb "do" can appear as "did" or "doing"). To reduce the data sparseness caused by the morphology of English words, we perform word stemming in preprocessing. There are many useful tools on the web, and here we use a good NLP tool called TreeTagger, developed by the Institute for Computational Linguistics at the University of Stuttgart. TreeTagger provides two main functions: word stemming (lemmatization) and part-of-speech tagging. The input is a sequence of words, and it returns the corresponding prototype and part-of-speech of each word. For example:

Table 3

POS    Prototype     Word
DT     the           the
NP     Treetagger    Treetagger
VBZ    be            is
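A sketch of invoking TreeTagger and parsing its output is shown below. TreeTagger is normally run as an external command that prints one token per line as word, POS tag, and lemma separated by tabs; the wrapper command name tree-tagger-english and the exact output handling here are assumptions about the local installation and should be checked against the TreeTagger documentation.

import subprocess

def lemmatize(text, tagger_cmd="tree-tagger-english"):
    """Run TreeTagger on a text and return (word, pos, lemma) triples.

    Assumes the wrapper script reads plain text on stdin and writes
    tab-separated "word  POS  lemma" lines, as in the example above.
    """
    out = subprocess.run([tagger_cmd], input=text, capture_output=True,
                         text=True, check=True).stdout
    triples = []
    for line in out.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:
            word, pos, lemma = parts
            # Keep the surface form when no lemma is found.
            triples.append((word, pos, word if lemma == "<unknown>" else lemma))
    return triples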

Model training


Our experiments are based on the enron1 mail corpus, including 1,406 spam mails and 3,671 ham mails.

Naïve Bayes method:

Given a bag of words, Naïve Bayes is a common technique in NLP for document classification, so we first use this method to solve the spam filtering problem: given a document X = (x_1, x_2, x_3, ..., x_n), we need to calculate the posterior probability and expand it using Bayes' theorem, the independence assumption, and by ignoring the evidence P(X):

C = argmax_{C_i} P(C_i | X),   C_i ∈ {ham, spam}
  = argmax_{C_i} P(X | C_i) P(C_i)
  = argmax_{C_i} Π_j P(x_j | C_i) · P(C_i)

So our task is to calculate the likelihood P(x_j | C_i) by simply counting:

log P(x_j | C_i) = log [ c(x_j, C_i) / Σ_x c(x, C_i) ]

where c(x_j, C_i) is the number of times word x_j occurs in training mails of class C_i, and the sum in the denominator runs over all words in the vocabulary.

The logarithm used here turns the product into a summation and avoids the underflow caused by successive floating-point multiplications.
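A compact sketch of this counting-based classifier follows. The add-one (alpha) smoothing is our own addition to keep unseen words from producing log(0); the report does not mention smoothing.

import math
from collections import Counter

class NaiveBayes:
    """Multinomial naive Bayes with the count-based log-likelihood above."""

    def fit(self, docs, labels, alpha=1.0):
        # docs: list of word lists; labels: list of "ham"/"spam" strings
        self.classes = set(labels)
        self.prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for words, c in zip(docs, labels):
            self.counts[c].update(words)
            self.vocab.update(words)
        self.total = {c: sum(self.counts[c].values()) for c in self.classes}
        self.alpha = alpha
        return self

    def log_likelihood(self, word, c):
        # log P(x_j | C_i) with add-alpha smoothing over the vocabulary
        return math.log((self.counts[c][word] + self.alpha) /
                        (self.total[c] + self.alpha * len(self.vocab)))

    def predict(self, words):
        # argmax over classes of log P(C_i) + sum_j log P(x_j | C_i)
        scores = {c: self.prior[c] + sum(self.log_likelihood(w, c) for w in words)
                  for c in self.classes}
        return max(scores, key=scores.get)

# nb = NaiveBayes().fit([["cheap", "pills"], ["meeting", "at", "noon"]], ["spam", "ham"])
# print(nb.predict(["cheap", "meeting"]))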

Figure 4

Vector space method with k-nearest neighbor:

Another popular approach in document processing is to transform each document into a high-dimensional vector, so we also use this idea to solve the problem. First we concatenate all the mails in our corpus into one big file and then use a well-known language modeling tool, SRILM, which provides many useful functions such as language model building, word counting, the Viterbi algorithm, and so on. Here we use the function "ngram-count" to create the dictionary we need from the big file:

ngram-count -text big.file -write dict -order 1

Using this dictionary, we process each document and create a word-document matrix:


w_ij = c_ij / n_j

where c_ij is the count of word i in document j, n_j is the length of document j, and w_ij is therefore the normalized word count. After the matrix is prepared, we use k-nearest neighbor with the cosine similarity function to solve our problem:

similarity(d_i, d_j) = (d_i · d_j) / ( ||d_i|| · ||d_j|| )

In document processing, the cosine similarity function is more reasonable than the common Euclidean distance.
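The following sketch builds the normalized vectors and classifies a query mail by cosine-similarity kNN. Representing each sparse row as a Python dict and choosing k = 5 are our own choices; the report does not state the value of k it used.

import math
from collections import Counter

def doc_vector(words, dictionary):
    """w_ij = c_ij / n_j : word counts normalized by document length."""
    counts = Counter(w for w in words if w in dictionary)
    n = sum(counts.values()) or 1
    return {w: c / n for w, c in counts.items()}

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_predict(query, train_vectors, train_labels, k=5):
    """Label the query by majority vote among its k most similar training mails."""
    nearest = sorted(range(len(train_vectors)),
                     key=lambda i: cosine(query, train_vectors[i]), reverse=True)[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]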

Figure 5

Maximum Entropy method:

Maximum Entropy is a state-of-the-art extension of logistic regression in the machine learning and natural language processing areas. It has a dual property: it not only maximizes the log-likelihood but also maximizes the entropy, minimizing the Kullback-Leibler distance between the model and the real distribution. Similar to the naïve Bayes method, it makes the independence assumption.

C = argmax_{C_i} P(C_i | X)
  = argmax_{C_i} Π_j P(C_i | x_j)
  = argmax_{C_i} Σ_j log P(C_i | x_j)

with

P(C_i | x_j) = exp( Σ_k λ_k f_k(x_j, C_i) ) / Σ_{C'} exp( Σ_k λ_k f_k(x_j, C') )

where the f_k(x_j, C_i) are the feature functions and the λ_k are the weights learned by the model.

So we adopted this model for our problem, using a good tool, the Maxent toolkit developed by Zhang Le at the University of Edinburgh. Because the ME model only accepts nominal attributes for the feature function f(x_j, C_i), we need to change the elements of the word-document matrix to binary values {0, 1}. In the model training, we set the iteration parameter of the convex optimization to 30.
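A tiny sketch of this binarization step is given below, writing each mail as one training event for the toolkit. The one-event-per-line "outcome feature1 feature2 ..." text format is our assumption about the toolkit's interface and should be checked against the Maxent toolkit documentation; the file name train.ev is hypothetical.

def to_maxent_events(docs, labels, path="train.ev"):
    """Write binarized documents as maximum entropy training events.

    Each line is the outcome followed by the names of the features that are
    active (value 1) in the document, i.e. the words that occur at least once.
    """
    with open(path, "w") as f:
        for words, label in zip(docs, labels):
            active = sorted(set(words))          # binary {0, 1}: presence only
            f.write(label + " " + " ".join(active) + "\n")

# to_maxent_events([["cheap", "pills", "cheap"]], ["spam"])  ->  line "spam cheap pills"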

Figure 6

SVM:

We found a tool called SVMlight, which can do the SVM model training and classification for us. We fed our extracted files (in sparse format) into it for cross-validation, and used two ways to represent our mail-body features (a small sketch of both encodings follows the list):

1. Binary: use 0 or 1 to indicate whether the word appears in the mail.

2. Normalized: the count of each word, divided by the maximum word count in that mail.
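The sketch below writes mails in SVMlight's sparse input format ("<label> index:value ..."), supporting both encodings. The vocabulary-to-index mapping, the file path argument, and the convention of mapping spam to +1 are our own choices.

from collections import Counter

def to_svmlight(docs, labels, vocab, path, binary=True):
    """Write mails in SVMlight sparse format: "<+1|-1> idx:value ...".

    vocab maps each word to a 1-based feature index. With binary=True a word
    contributes 1 if present; otherwise its count is divided by the maximum
    count in the document (the "normalized" encoding described above).
    """
    with open(path, "w") as f:
        for words, label in zip(docs, labels):
            counts = Counter(w for w in words if w in vocab)
            if not counts:
                continue
            max_c = max(counts.values())
            feats = {vocab[w]: (1.0 if binary else c / max_c) for w, c in counts.items()}
            line = " ".join("%d:%g" % (i, v) for i, v in sorted(feats.items()))
            f.write(("+1" if label == "spam" else "-1") + " " + line + "\n")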

Figure 7

Figure 7 shows the results of the SVM model using only mail-body features. Comparing the blue line and the red line, we can clearly see that the binary format, with accuracy around 97%, outperforms the normalized format, with accuracy around 92%.

Figure 8

Figure 8 shows the SVM model with the mail-header features added. Comparing the blue line with the green line, we found that adding the header information indeed helps us judge correctly. The red and purple lines are the results of the normalized method, which still performs worse than the binary method.

We think the binary method performs better because a vector representing the occurrence of words in a file can, in a sense, capture the "semantics of a document". Since the binary method is better, we use it in the later experiments.

Hybrid Model

Committee-based approach

From the above experiments, we now have several trained classifiers for filtering spam mail. Instead of using a single classifier independently, we can build a committee-based classifier. We propose two kinds of decision makers: voting and a single-layer neural network.

Single layer neural network:

We can use a linear combination as the decision maker:

C = σ( Σ_i α_i · c_i + β )

where c_i is the output of the i-th classifier.

But the question is how to decide the weights α and β efficiently. Here we use the popular neural network learning method, the backpropagation algorithm, to learn the weights iteratively. Because ordinary backpropagation is a gradient descent method, and it is quite slow if the initialization is bad, we use several acceleration tricks such as sample shuffling and momentum. The sigmoid function we use is σ(x) = 1 / (1 + e^(-x)). Compared with the voting method, backpropagation is more error-driven and adaptive.
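A minimal sketch of learning α and β by gradient descent on the sigmoid output, with the shuffling and momentum mentioned above, is shown below. The learning rate, momentum factor, epoch count, and squared-error loss are our own choices, not values taken from the report.

import math
import random

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^-x), as used above."""
    return 1.0 / (1.0 + math.exp(-x))

def train_combiner(outputs, targets, epochs=50, lr=0.1, momentum=0.9):
    """Learn weights alpha_i and bias beta for C = sigma(sum_i alpha_i c_i + beta).

    outputs: one classifier-output vector (c_1, ..., c_m) per mail;
    targets: 1.0 for spam, 0.0 for ham.
    """
    m = len(outputs[0])
    alpha = [random.uniform(-0.1, 0.1) for _ in range(m)]
    beta = 0.0
    v_alpha, v_beta = [0.0] * m, 0.0
    data = list(zip(outputs, targets))
    for _ in range(epochs):
        random.shuffle(data)                       # sample shuffling
        for c, t in data:
            y = sigmoid(sum(a * ci for a, ci in zip(alpha, c)) + beta)
            delta = (y - t) * y * (1.0 - y)        # dE/dnet for squared error
            v_alpha = [momentum * v - lr * delta * ci for v, ci in zip(v_alpha, c)]
            v_beta = momentum * v_beta - lr * delta
            alpha = [a + v for a, v in zip(alpha, v_alpha)]
            beta += v_beta
    return alpha, beta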

Vote:

Voting is an ad hoc way to judge whether a mail is spam or not: the more classifiers that consider the test document to be spam, the more confident we are about the decision.
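A one-function sketch of this majority vote over the individual classifiers:

from collections import Counter

def vote(predictions):
    """Majority vote over the labels predicted by several classifiers."""
    return Counter(predictions).most_common(1)[0][0]

print(vote(["spam", "ham", "spam"]))   # -> "spam"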

Vote1 – Knn + naïve Bayes + Maximum Entropy:

We first tried using KNN, naïve Bayes, and Maximum Entropy to build our vote model. The result is shown in Figure 9. In this case, we can see that voting can indeed improve the performance slightly.

Figure 9

Vote 2 – naïve Bayes + Maximum Entropy + SVM

Secondly, we combined naïve Bayes, Maximum Entropy, and SVM. Originally we expected the accuracy of "vote" always to be the highest. However, we can see from Figure 10 that voting is the best in three of the ten cross-validation folds and the second best in the other seven. If two of the models predict correctly for most instances, the accuracy of "vote" increases; but if two of the models predict incorrectly for most instances, the accuracy of "vote" decreases. We think this is why "vote" is only the second best in some cases.

Figure 10

Conclusion

In this work, we implemented several well-known machine learning techniques, including Naïve Bayes, Maximum Entropy, SVM, KNN (vector space), and a neural network, to build a spam filter. Some statistical computations were also done for feature selection: besides the mail-body word counts, another six features were shown to be useful in discriminating spam from ham mails. All five techniques gave impressive performance with the features we selected. Cross-validation was performed on the Bayesian, vector space, and SVM models, and the results give us confidence in our experiments. The hybrid model, or voting model, combines the classification results and improves the filter slightly. However, voting can sometimes reduce accuracy when the majority of the classifiers are wrong.

Besides, we also tried other approaches such as Latent Dirichlet Allocation, but the results for indicating the ham or spam labels were discouraging. Spam filtering is an ongoing issue. Through our analysis, we have gained a deeper understanding of it and become familiar with the machine learning techniques we implemented.

Reference

[1] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian Approach to Filtering Junk E-Mail," in Proc. AAAI 1998, Jul. 1998.

[2] P. Graham, "A Plan for Spam": http://www.paulgraham.com/spam.html

[3] Enron-Spam corpus: http://www.aueb.gr/users/ion/

[4] TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

[5] Maximum Entropy toolkit: http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html

[6] SRILM: http://www.speech.sri.com/projects/srilm/

[7] SVMlight: http://svmlight.joachims.org/

