A REVIEW ON VARIOUS CLASSIFICATION ALGORITHMS FOR AN INCREMENTAL SPAM FILTER


International Journal of Application or Innovation in Engineering & Management (IJAIEM), Volume 2, Issue 11, November 2013, ISSN 2319-4847


Munde Kusum M.1, Asst. Prof. Mangrule Rupali A.2

1 M.E. Student, 2 Asst. Prof., Department of Computer Science and Engineering, Marathwada Institute of Technology, Aurangabad, Maharashtra, India.

ABSTRACT

Email spam has grown steadily in recent years. The cost of spam can be measured in lost human time, lost server time and loss of valuable mail. As the volume of e-mail keeps increasing, methods for filtering and classifying mail are becoming a necessity. Classification of an email consists of assigning it a class label indicating whether it is a legitimate or a spam email. Machine learning is efficient for this classification task. In this paper we review different machine learning methods such as Naïve Bayes, SVMs, k-NN, ANNs, decision trees and decision stumps.

Keywords: Data mining, machine learning, Naïve Bayes, SVMs, k-NN, ANNs, Decision trees, Decision stump, Ensemble of classifiers, Bagging, Boosting, Incremental learning.

1. INTRODUCTION

Data mining is a subfield of computer science. It is the computational process of discovering patterns in large data sets. Data mining uses different machine learning algorithms. Machine learning algorithms discover the relationships between the variables of a system (input, output and hidden) from direct samples of the system. Different machine learning algorithms are used for text classification, e.g. Naïve Bayes, support vector machines, neural networks, k-nearest neighbor, decision trees, decision stumps, etc.

2. MACHINE LEARNING

The goal of machine learning is to build computer systems that can adapt and learn from their experience.

Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. Machine learning is often adopted for reasons of cost, and because relationships and correlations can be hidden within large amounts of data; machine learning or data mining may be able to find these relationships. Various machine learning algorithms for data mining are available. These are discussed in the following sections.

2.1 Naïve Bayes

This method computes the probability that a document belongs to a particular class, c, using 1] the words of the document to be classified and 2] the estimated probability of each of these words as they appear in the set of training documents for class c. It is based on Bayes' theorem, and is particularly suited when the dimensionality of the inputs is high. Parameter estimation for a naive Bayes model uses the method of maximum likelihood. In the Bayesian framework, the probability that a message with representation $\mathbf{x} = (x_1, x_2, \dots, x_N)$ belongs to class $c$ is given by:

$$P(c \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid c)\, P(c)}{P(\mathbf{x})} \qquad (1)$$

where $P(\mathbf{x} \mid c)$ and $P(c)$ are the probabilities that a message classified as $c$ is represented by $\mathbf{x}$ and that the message belongs to class $c$, respectively, and $P(\mathbf{x})$ is the a priori probability of a random message being represented by $\mathbf{x}$ [6]. The naive Bayes classifier is obtained by assuming that the components $x_i$, $i = 1, 2, \dots, N$, are conditionally independent given the class, so that $P(\mathbf{x} \mid c)$ can be written as:

$$P(\mathbf{x} \mid c) = \prod_{i=1}^{N} P(x_i \mid c) \qquad (2)$$

Therefore, equation (1) becomes:

$$P(c \mid \mathbf{x}) = \frac{P(c) \prod_{i=1}^{N} P(x_i \mid c)}{P(\mathbf{x})} \qquad (3)$$


with $P(x_i \mid c)$ given by:

$$P(x_i \mid c) = f(x_i, c) \qquad (4)$$

where the function $f$ depends on the representation of the message, the probabilistic event model used, and the documents used for training. The probability $P(x_i \mid c)$ is determined based on the occurrence of the term in the training data, and it also depends on the event model adopted. Advantages: naive Bayes requires only a small amount of training data to estimate its parameters, and it performs well in many complex real-world situations.
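As an illustration only (not part of the reviewed work), the following minimal Python sketch trains a multinomial naive Bayes spam filter with scikit-learn; the messages, labels and bag-of-words feature choices are hypothetical toy data.

```python
# Illustrative sketch: multinomial naive Bayes spam filter (toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_messages = [
    "win a free prize now",                # spam
    "limited offer claim your money",      # spam
    "meeting rescheduled to monday",       # ham
    "please review the attached report",   # ham
]
train_labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()             # message representation x = (x1, ..., xN)
X_train = vectorizer.fit_transform(train_messages)

clf = MultinomialNB()                      # estimates P(c) and P(x_i | c) by maximum likelihood
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(["free money offer", "monday meeting report"])
print(clf.predict(X_test))                 # predicted class labels
print(clf.predict_proba(X_test))           # posterior P(c | x) for each class
```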

2.2 Support Vector Machines

Support Vector Machines are based on the concept of decision planes that define decision boundaries [4]. A decision plane is one that separates a set of objects having different class memberships. An SVM finds an optimal hyperplane with the maximal margin to separate two classes, which requires solving the following optimization problem:

Maximize

$$\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j \qquad (5)$$

subject to

$$\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \dots, n \qquad (6)$$

SVMs maximize the margin around the separating hyperplane; the support vectors are the data points that lie closest to the decision surface. After the weights are determined, a test sample $\mathbf{x}$ is classified by

$$f(\mathbf{x}) = \operatorname{sign}\!\left(\sum_{i=1}^{n} \alpha_i y_i\, \mathbf{x}_i \cdot \mathbf{x} + b\right) \qquad (7)$$

SVMs are used especially for pattern classification and regression applications.
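A minimal sketch, again on hypothetical toy messages rather than the reviewed setup: a linear SVM learns a maximum-margin hyperplane separating spam from ham; the TF-IDF representation and the value of C are illustrative assumptions.

```python
# Illustrative sketch: linear SVM spam classifier on toy messages.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_messages = [
    "win a free prize now",
    "limited offer claim your money",
    "meeting rescheduled to monday",
    "please review the attached report",
]
train_labels = ["spam", "spam", "ham", "ham"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_messages)

svm = LinearSVC(C=1.0)            # C is the soft-margin trade-off in the optimization problem
svm.fit(X_train, train_labels)

X_test = vectorizer.transform(["claim your free prize", "monday report review"])
print(svm.predict(X_test))        # e.g. ['spam' 'ham']
```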

2.3 K-Nearest Neighbor

K-nearest neighbor is an example of instance-based learning, in which the training data set is stored and classification is done by comparing a new record to the most similar records in the training set. It finds the k closest training points. When a new document needs to be categorized, the k most similar documents (neighbors) are found. If a large enough proportion of them have been assigned to a certain category, the new document is also assigned to this category; otherwise not.

Consider an email classification example: given a message x, determine its k nearest neighbors among the messages in the training set. If there are more spam messages among these neighbors, classify the given message as spam; otherwise classify it as ham [4].

Given a training set $D$ and a test object $z = (\mathbf{x}', y')$, the algorithm computes the distance (or similarity) between $z$ and all the training objects $(\mathbf{x}, y) \in D$ to determine its nearest-neighbor list, $D_z$ ($\mathbf{x}$ is the data of a training object, while $y$ is its class; likewise, $\mathbf{x}'$ is the data of the test object and $y'$ is its class). Once the nearest-neighbor list is obtained, the test object is classified based on the majority class of its nearest neighbors:

$$y' = \arg\max_{v} \sum_{(\mathbf{x}_i, y_i) \in D_z} I(v = y_i) \qquad (8)$$

where $v$ is a class label, $y_i$ is the class label of the $i$-th nearest neighbor, and $I(\cdot)$ is an indicator function that returns the value 1 if its argument is true and 0 otherwise [6].

Fig -1: KNN Algorithm

KNN is slow at classification time and requires a lot of memory. It is a classification strategy that is an example of a "lazy learner."
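A small from-scratch sketch of the majority vote in equation (8); the Euclidean distance and the toy numeric feature vectors are illustrative assumptions, not the representation used in the reviewed work.

```python
# Illustrative sketch: k-nearest-neighbor classification by majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # distance from the test object z to every training object
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # indices of the k nearest neighbors: the list D_z
    nearest = np.argsort(distances)[:k]
    # majority class: argmax_v of sum_i I(v = y_i), as in equation (8)
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [7.5, 8.2]])
y_train = ["ham", "ham", "spam", "spam"]
print(knn_predict(X_train, y_train, np.array([7.8, 7.9]), k=3))  # expected: spam
```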


2.4 Artificial Neural Network

An artificial neural network (ANN), also called simply a "Neural Network" (NN), is a computational model based on biological neural networks. It consists of an interconnected collection of artificial neurons. During training, a neural network looks at the patterns of features (e.g. words, phrases) that appear in a document of the training set and attempts to produce classifications for the document. If its attempt doesn’t match the set of desired classifications, it adjusts the weights of the connections between neurons. It repeats this process until the attempted classifications match the desired classifications.

There are two types of neural networks. 1] Perceptron: The perceptron is a linear classifier, shown in Fig. 2 [4].

Fig -2: Perceptron as neuron
Fig -3: Structure of multilayer perceptron

2] Multilayer perceptron: It is a nonlinear classifier; it models a nonlinear decision boundary between classes. The black circles represent perceptrons in the input layer, the white circles represent perceptrons in the hidden layer(s), and the gray circles represent perceptrons at the output layer. The computational power of an MLP is proportional to the total number of perceptrons and the total number of hidden layers. The structure of a multilayer perceptron is shown in Fig. 3.
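As a brief illustration (toy data, not from the reviewed work), the sketch below trains a one-hidden-layer MLP on the XOR function, which a single linear perceptron cannot learn; the layer size, solver and seed are arbitrary assumptions.

```python
# Illustrative sketch: a multilayer perceptron learning XOR.
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])   # XOR labels: not linearly separable

mlp = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=2000, random_state=0)
mlp.fit(X, y)                # weights adjusted until outputs match the desired labels
print(mlp.predict(X))        # expected [0 1 1 0] (may vary with initialization)
```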

2.5 Decision Tree

A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision. Decision trees classify instances by traversing from the root node to a leaf node [6]. We start at the root node of the decision tree, test the attribute specified by this node, then move down the tree branch according to the attribute value in the given set. This process is then repeated at the sub-tree level. Decision tree learning algorithms have been used successfully in expert systems for capturing knowledge. A decision tree is relatively fast compared to other classification models, and it often obtains similar and sometimes better accuracy compared to other models.
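A minimal sketch of such a tree on hypothetical word-count features (the features and data are invented for illustration):

```python
# Illustrative sketch: decision tree classifier on toy word-count features.
from sklearn.tree import DecisionTreeClassifier, export_text

# hypothetical features: [count of "free", count of "meeting"]
X = [[3, 0], [2, 0], [0, 2], [0, 3], [1, 1]]
y = ["spam", "spam", "ham", "ham", "ham"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=["free", "meeting"]))  # the learned branches
print(tree.predict([[4, 0]]))                                # traversal root -> leaf
```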

2.6 Decision stump

A decision stump is a very simple decision tree: a machine learning model consisting of a one-level decision tree. It is a decision tree with one internal node (the root) which is immediately connected to the terminal nodes (its leaves). A decision stump makes a prediction based on the value of just a single input feature. Sometimes decision stumps are also called 1-rules. It is a tree with only one split, so it is a stump. The decision stump algorithm looks at all possible values for each attribute and selects the best attribute based on minimum entropy. Entropy is a measure of uncertainty. We measure the entropy of the dataset (S) with respect to each attribute; for each attribute A, the algorithm computes a score measuring how well A separates the classes [3].
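A minimal sketch, using the same hypothetical features as above: a stump can be approximated in scikit-learn by restricting a decision tree to depth one with the entropy criterion.

```python
# Illustrative sketch: a decision stump as a depth-one decision tree.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[3, 0], [2, 1], [0, 2], [0, 3], [1, 1]]     # toy [count "free", count "meeting"] features
y = ["spam", "spam", "ham", "ham", "ham"]

stump = DecisionTreeClassifier(criterion="entropy", max_depth=1, random_state=0)
stump.fit(X, y)
print(export_text(stump, feature_names=["free", "meeting"]))  # a single split on one feature
print(stump.predict([[2, 0]]))
```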

3. Ensemble of Classifiers

An ensemble is a supervised learning algorithm, because it can be trained and then used to make predictions. It combines many weak classifiers to produce a powerful committee, turning weak learners, which are only slightly better than random guessing, into strong learners that can make very accurate predictions [5]. For this reason, base learners are also referred to as weak learners. An ensemble contains a number of learners, usually called base learners, which are typically generated from training data by a base learning algorithm. Ensemble learning is appealing because it is able to boost weak learners into strong ones.

It improves accuracy by combining predictions. Ensemble methods have two phases: 1] Production of the different models: a) homogeneous, from different executions of the same algorithm (changing parameters) on the same dataset; b) heterogeneous, from different algorithms on the same dataset. 2] Combination of the different models, by voting, weighted voting, etc. A minimal voting sketch is given after Fig. 4.


Fig -4: Traditional vs. Ensemble
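A minimal sketch of a heterogeneous ensemble on toy numeric features (the base learners, data and hard-voting choice are illustrative assumptions):

```python
# Illustrative sketch: heterogeneous ensemble combined by majority voting.
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X = [[3, 0], [2, 1], [0, 2], [0, 3], [1, 1], [2, 2]]
y = ["spam", "spam", "ham", "ham", "ham", "spam"]

ensemble = VotingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",          # plain (unweighted) majority voting over the base learners
)
ensemble.fit(X, y)
print(ensemble.predict([[3, 1], [0, 2]]))
```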

3.1 Bagging

Bootstrap aggregating is often abbreviated as bagging. Bagging trains each model in the ensemble on a randomly drawn subset of the training set. Bagging always uses resampling; it does not modify the distribution over examples but instead always uses the uniform distribution. In forming the final hypothesis, bagging gives equal weight to each of the weak hypotheses. As an example, the random forest algorithm combines random decision trees with bagging to achieve very high classification accuracy. Bagging improves performance for unstable learning algorithms. A minimal sketch is given after Fig. 5.

Fig -5 : Bagging Algorithm
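A minimal sketch on the same toy data: each tree is trained on a bootstrap (uniformly resampled) subset, and predictions are combined with equal weight.

```python
# Illustrative sketch: bagging with decision trees as the unstable base learner.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X = [[3, 0], [2, 1], [0, 2], [0, 3], [1, 1], [2, 2]]
y = ["spam", "spam", "ham", "ham", "ham", "spam"]

bagger = BaggingClassifier(
    DecisionTreeClassifier(),   # base learner
    n_estimators=25,
    bootstrap=True,             # sample the training set with replacement
    random_state=0,
)
bagger.fit(X, y)
print(bagger.predict([[3, 1]]))   # equal-weight vote of the 25 trees
```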

3.2 Boosting

"Boosting" converts a weak learning algorithm into a strong one. Boosting involves incrementally building an ensemble by training each new model instance to emphasize the training instances that previous models misclassified.

Boosting steps:

1. Weight the predictions of the classifiers with their error.

2. Perform multiple iterations each time using different example weights.

3. Weight update between iterations.

4. Increase the weight of incorrectly classified examples.

This ensures that they will become more important in the next iterations.

5. Combine results of all iterations.

Boosting can significantly reduce the bias in addition to reducing the variance, and therefore, on weak learners such as decision stumps, Boosting is usually more effective.

E.g. AdaBoost

3.2.1 AdaBoost Algorithm

AdaBoost is one of the most widely used classifiers in applications. AdaBoost's weights keep changing even if the training error is minimal. It can improve classifier accuracy for many problems. Fig. 6 shows the AdaBoost algorithm [1]; a minimal sketch follows the figure.


Fig -6 : AdaBoost Algorithm

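A minimal sketch on toy data (not the reviewed experimental setup): AdaBoost over decision stumps, reweighting the training examples after each round as described above.

```python
# Illustrative sketch: AdaBoost with decision stumps as weak learners.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X = [[3, 0], [2, 1], [0, 2], [0, 3], [1, 1], [2, 2]]
y = ["spam", "spam", "ham", "ham", "ham", "spam"]

booster = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),   # decision stump as the weak learner
    n_estimators=50,
    random_state=0,
)
booster.fit(X, y)                           # example weights updated between iterations
print(booster.predict([[3, 1], [0, 3]]))    # weighted combination of all rounds
```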


4. Incremental Learning

Incremental learning is one of the important machine learning issues. It can also be viewed as relearning of previous data that was misclassified. Incremental learning is used where training data are not all available at the beginning of the training process but arrive in small batches over a period of time [5]. It is capable of learning new information while remembering previous information. Objectives: 1) acquire additional knowledge when new datasets are introduced; 2) retain previously learned information; 3) learn new classes if they are introduced by new data. A minimal illustration is sketched after Fig. 7.

Fig -7: An incremental spam filter algorithm (broad view)
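The sketch below illustrates the general idea of incremental training on batches only; it is not the RotBoost or Learn++ algorithm discussed next, and the linear SGD model and toy batches are assumptions made purely for illustration.

```python
# Illustrative sketch: incremental training with partial_fit on small batches.
import numpy as np
from sklearn.linear_model import SGDClassifier

classes = np.array(["ham", "spam"])        # all classes must be declared up front
clf = SGDClassifier(random_state=0)        # linear model updated by stochastic gradient descent

# training data arrive in small batches over a period of time
batches = [
    (np.array([[3.0, 0.0], [0.0, 2.0]]), np.array(["spam", "ham"])),
    (np.array([[2.0, 1.0], [0.0, 3.0]]), np.array(["spam", "ham"])),
]
for X_batch, y_batch in batches:
    clf.partial_fit(X_batch, y_batch, classes=classes)   # earlier batches are not revisited

print(clf.predict(np.array([[2.5, 0.5]])))
```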

4.1 Incremental Spam Detection Algorithm

In this work, an incremental version of the RotBoost algorithm is presented. RotBoost is a batch learning algorithm that combines two algorithms, AdaBoost and Rotation Forest. The proposed incremental learning algorithm creates an ensemble of weak classifiers, each trained on a subset of the different distributions of the current training dataset, and then combined through weighted majority voting. The proposed algorithm performs much better than Learn++ on most tested datasets [7].

4.2 Learn++

Learn++ is an incremental algorithm capable of learning new information. The algorithm was developed to improve the classification performance of weak classifiers. For a two-class problem, a weak learner that can do better than random guessing can be transformed into a strong learner using a procedure called boosting.

Learn++ uses multiple weak learners in order to construct a system based on incremental learning [2]. When a new batch of training data is entered, this algorithm generates multiple classifiers for the dataset and then combines them with the previous classifiers based on weighted majority voting. The Learn++ algorithm is shown in Fig. 8 [2].


Fig -8 : Algorithm Learn++



5. Performance Parameter Discussions

The supervised learning process consists of two phases: 1] Training phase, whose output is knowledge; 2] Testing phase, in which the output of the training phase (knowledge) is used with test data to generate results. Comparison of various machine learning algorithms is done by evaluating different parameters such as accuracy, precision, etc.

Fig -9 : Learning process

A confusion matrix contains information about actual and predicted classifications done by a classification system.

Performance of such systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two class classifier. The entries in the confusion matrix have the following meaning in the context of our study:

a is the number of correct predictions that an instance is negative, b is the number of incorrect predictions that an instance is positive, c is the number of incorrect predictions that an instance is negative, and d is the number of correct predictions that an instance is positive.

Table 1: Confusion Matrix

                    Predicted Negative    Predicted Positive
Actual Negative             a                     b
Actual Positive             c                     d

The accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using the equation:

$$AC = \frac{a + d}{a + b + c + d} \qquad (10)$$

The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified, as calculated using the equation:

$$TP = \frac{d}{c + d} \qquad (11)$$

Finally, precision (P) is the proportion of the predicted positive cases that were correct, as calculated using the equation:

$$P = \frac{d}{b + d} \qquad (12)$$
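As a worked example with made-up counts (not results from the reviewed work), the metrics of equations (10)-(12) can be computed directly from the confusion-matrix entries a, b, c, d:

```python
# Illustrative sketch: accuracy, recall and precision from toy confusion-matrix counts.
a, b = 50, 5    # actual negatives: correctly / incorrectly classified
c, d = 3, 42    # actual positives: incorrectly / correctly classified

accuracy  = (a + d) / (a + b + c + d)   # equation (10)
recall    = d / (c + d)                 # equation (11), true positive rate
precision = d / (b + d)                 # equation (12)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```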

CONCLUSION

In this paper, different classification algorithms for data mining are discussed, including base learning algorithms and incremental algorithms. We also reviewed ensemble methods used for combining multiple classifiers and increasing their performance. These classification algorithms are applied to detecting credit card fraud, stock market analysis, classifying DNA sequences, spam filtering, optical character recognition, etc.

References:

[1] Zhi-Hua Zhou, "Ensemble Learning", National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China.
[2] Robi Polikar, Lalita Udpa, Satish Udpa, and Vasant Honavar, "Learn++: An Incremental Learning Algorithm for Supervised Neural Networks", IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 31, No. 4, November 2001.


[3] Wayne Iba, Pat Langley, "Induction of One-Level Decision Trees", Machine Learning, 1992.
[4] "Machine Learning Techniques in Spam Filtering", Data Mining Problem-oriented Seminar, MTAT.03.177, May 2004.
[5] Jeffrey Byorick and Robi Polikar, "Confidence Estimation Using the Incremental Learning Algorithm, Learn++", O. Kaynak et al. (Eds.): ICANN/ICONIP 2003, LNCS 2714, pp. 181–188, 2003.
[6] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg, "Top 10 Algorithms in Data Mining", Knowledge and Information Systems, 2008.
[7] Elham Ghanbari, Hamid Beigy, "An Incremental Spam Detection Algorithm", IEEE, 2011.

