ISSN: 2231-5381 http://www.ijettjournal.org

International Journal of Engineering Trends and Technology (IJETT) – Volume 5 number 1 - Nov 2013

An Efficient Model of Detection and Filtering Technique over Malicious and Spam E-Mails

V S Kumar, Ravi Kumar

Final MTech Student, Assoc. Professor & Head of the Dept

Dept of CSE, Kakinada Institute of Engineering & Technology, Kakinada

Abstract:

Malicious mail detection and filtering is an active research area. In this paper we propose an efficient malicious mail detection system based on a supervised learning classification approach. The proposed approach handles spam and phishing emails based on the metadata characteristics of each email. The filtering and detection techniques presented here perform well in detecting targeted malicious mails. A further step is presented for the detection of persistent and recipient-oriented attacks.

Introduction:

Targeted email attacks to enable computer network exploitation have become more prevalent, more insidious, and more widely documented in recent years.

Beyond nuisance spam or phishing designed to trick users into revealing personal information, targeted malicious email (TME) facilitates computer network exploitation and the gathering of sensitive information from targeted networks. These targeted email attacks are not singular, unrelated events; they are coordinated and persistent attack campaigns that can span years. This work surveys and categorizes existing email filtering techniques, proposes and implements new methods for detecting targeted malicious email, and compares these newly developed techniques to traditional detection methods. Current research and commercial methods for detecting illegitimate email are limited to addressing Internet-scale email abuse, such as spam, and are not focused on targeted malicious emails. Furthermore, conventional tools such as anti-virus are vulnerability focused: they examine only the binary code of an email and ignore all relevant contextual metadata.

Related work:

Classification is the process of grouping together documents or data that have similar properties or are related. Data and documents become easier to understand once they are classified, and we can also infer logic based on the classification. Most of all, classification makes new data easy to sort and makes retrieval faster, with better results.

Dewey Decimal Classification is the system most widely used in libraries. It is hierarchical: there are ten parent classes, each divided into ten divisions, which are in turn divided into ten sections.

Each book is assigned a number according to its class, division and section alphabetically. Dewey Decimal Classification is very successful in libraries, but unfortunately it cannot be implemented in Information Retrieval. Somebody would need to maintain a central catalogue of all the documents on the web, and whenever a new document was added, a central committee would have to look at it, classify it, assign it a number and publish it on the web. This strongly violates the way the Internet works. An authority controlling the contents of the web would restrict the amount of data that can be added to it. We need a web that allows everyone to upload their content, together with a machine learning technique that finds these new data and classifies them as they arrive.

Confidentiality issues in data mining: A key problem that arises in any en masse collection of data is that of confidentiality. The need for privacy is sometimes due to law (e.g., for medical databases) or can be motivated by business interests. However, there are situations where the sharing of data can lead to mutual gain. A key utility of large databases today is research, whether it be scientific, or economic and market oriented. Thus, for example, the medical field has much to gain by pooling data for research, as do even competing businesses with mutual interests. Despite the potential gain, this is often not possible due to the confidentiality issues which arise.

We address this problem and show that highly efficient solutions are possible. Our scenario is the following: let P1 and P2 be parties owning (large) private databases D1 and D2. The parties wish to apply a data-mining algorithm to the joint databases D1, D2 without revealing any unnecessary information about their individual databases. That is, the only information learned by P1 about D2 is that which can be learned from the output of the data mining algorithm, and vice versa. We do not assume any “trusted” third party who computes the joint output.

Bayesian spam filtering: This is a statistical technique of email filtering [18]. In the process of filtering, it makes use of a naive Bayes classifier [11], which classifies words and features to identify spam e-mail [19], an approach commonly used in text classification. Naive Bayes classifiers [17] work by correlating the use of tokens (typically words, or sometimes other things) with spam [20] and non-spam e-mails, and then using Bayesian inference to calculate a probability that an email is or is not spam.
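As an illustration of the token-based Bayesian inference described above, the following is a minimal sketch and not the implementation used in this paper. The class name `NaiveBayesSpamFilter`, the whitespace tokenizer, the add-one smoothing and the default threshold of 0.95 are all assumptions made for the example.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Hypothetical word-based naive Bayes filter with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.mail_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # The user labels each training mail as "spam" or "ham".
        self.mail_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def spam_probability(self, text):
        # Assumes at least one mail of each class has been seen in training.
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_mails = sum(self.mail_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # log prior + sum of smoothed log likelihoods of each token
            score = math.log(self.mail_counts[label] / total_mails)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            log_score[label] = score
        # Normalise the two joint scores into a posterior for "spam".
        top = max(log_score.values())
        weight = {k: math.exp(v - top) for k, v in log_score.items()}
        return weight["spam"] / (weight["spam"] + weight["ham"])

    def is_spam(self, text, threshold=0.95):
        return self.spam_probability(text) > threshold


spam_filter = NaiveBayesSpamFilter()
spam_filter.train("cheap viagra offer now", "spam")
spam_filter.train("viagra discount offer", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("project report attached", "ham")
probability = spam_filter.spam_probability("viagra offer")
```

Working in log space avoids numeric underflow when a mail contains many tokens; the smoothing keeps a token that was never seen in one class from forcing that class's probability to zero.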

Naive Bayes spam filtering: Naive Bayes spam filtering is a filtering technique that can tailor itself to the email needs of individual users and gives low false positive spam detection rates that are generally acceptable to users. Particular words have particular probabilities of occurring in spam email and in genuine email. For instance, most email users will frequently encounter the word “viagra” in spam email, but will rarely see it in other email. The filter does not know these probabilities in advance, so it must be trained to build them up. To train the filter, the user must manually indicate whether a new email is spam or not. For all words in each training email [9], the filter adjusts the probabilities that each word will appear in spam or legitimate email in its database. After training, the word probabilities are used to compute the probability that an email with a particular set of words in it belongs to either category. Either every word in the email contributes to the email's spam probability, or only the most interesting words do. This contribution is called the posterior probability and is computed using Bayes' theorem [7]. Then the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter marks the email as spam. As in any other spam filtering technique, email marked as spam can then be automatically moved to a "Junk" [20] email folder, or even deleted outright. Some software implements quarantine mechanisms that define a time frame during which the user is allowed to review the software's decision.

Proposed work:

We propose an efficient malicious email detection and filtering technique. The testing dataset, which contains the metadata of newly received emails, is compared against the training dataset, which holds the metadata of previously received unauthorized, spam and malicious mails. The attribute values of each testing record are analyzed with respect to the probabilities of the corresponding attributes in the training dataset.

A dataset is a collection of tuples over a set of attributes, with possible values for each attribute. It is given to the classification process so that the behavior of the testing set can be analyzed with a machine learning approach. A synthetic dataset can be gathered for evaluating the classification results. The metadata of previously received unauthorized, malicious and spam emails is maintained at the server end and is used for classifying the metadata in the testing dataset. Each new access made over the network is also recorded in the training dataset, because this data helps in future classification.

To target persistent threats, we introduce a metadata environment. The metadata structure contains fields related to the recipients, such as email addresses, subject lines and attached files. Verification is performed on these fields, and the resulting conditions are used to control Internet-wide attacks originating from different locations.

The dataset contains the relevant features of the different attacks. The training process starts from this dataset alone. After training is complete, attack detection is performed and the attack-related emails are classified as targeted malicious mails, non-targeted malicious mails, or persistent and recipient-oriented mails.
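The comparison of testing metadata against training metadata by attribute probability, as described in this section, can be sketched as a naive Bayes classifier over categorical attributes. This is only an illustrative sketch: the attribute names (`sender_domain`, `has_attachment`), the labels and the smoothing scheme are assumptions, not values taken from the paper's dataset.

```python
from collections import Counter, defaultdict


def train(records):
    """Build counts from a list of (attributes_dict, label) training tuples."""
    label_counts = Counter(label for _, label in records)
    value_counts = defaultdict(Counter)  # (label, attribute) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values seen
    for attrs, label in records:
        for name, value in attrs.items():
            value_counts[(label, name)][value] += 1
            values_seen[name].add(value)
    return label_counts, value_counts, values_seen


def classify(model, attrs):
    """Return the posterior probability of each label for one testing tuple."""
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        p = n / total  # prior P(label)
        for name, value in attrs.items():
            count = value_counts[(label, name)][value]
            # Add-one smoothing with one extra slot for unseen values.
            p *= (count + 1) / (n + len(values_seen[name]) + 1)
        scores[label] = p
    norm = sum(scores.values())
    return {label: p / norm for label, p in scores.items()}


# Hypothetical training metadata kept at the server end.
training = [
    ({"sender_domain": "evil.example", "has_attachment": True}, "malicious"),
    ({"sender_domain": "evil.example", "has_attachment": True}, "malicious"),
    ({"sender_domain": "corp.example", "has_attachment": False}, "legitimate"),
    ({"sender_domain": "corp.example", "has_attachment": False}, "legitimate"),
]
model = train(training)
posterior = classify(model, {"sender_domain": "evil.example", "has_attachment": True})
```

Each newly classified tuple could then be appended to the training list, which matches the idea of keeping every access in the training dataset for future classification.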

The dataset contains emails exchanged between customers and a company. Emails in the dataset that fall under anti-spam are treated as non-targeted emails, while spam mails are treated as targeted malicious mails. Repeated intrusion attempts are identified as persistent emails. When a sender repeatedly sends content to a particular recipient, that recipient's mails carry high reputation values; such reputation-related mails are treated as recipient-oriented mails.

Here, preprocessing is the process of extracting the necessary information (metadata) from previously received unauthorized and malicious mails at the receiver end, from fields such as the email address, subject line and attached files. The testing dataset contains the metadata of new emails. It is compared against the training dataset by calculating the probabilities of the attributes in the testing and training datasets.
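The preprocessing step above can be sketched with Python's standard `email` package. This is a minimal example under stated assumptions: the function name `extract_metadata` and the three output keys are hypothetical, and a real system would extract many more fields than the three named in the text.

```python
from email import message_from_string, policy
from email.message import EmailMessage
from email.utils import parseaddr


def extract_metadata(raw_mail):
    """Extract sender address, subject line and attachment names from a raw mail."""
    msg = message_from_string(raw_mail, policy=policy.default)
    _, sender = parseaddr(str(msg["From"] or ""))
    return {
        "sender": sender.lower(),
        "subject": str(msg["Subject"] or ""),
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }


# Build a small sample mail so the example is self-contained.
sample = EmailMessage()
sample["From"] = "Alice <alice@example.com>"
sample["To"] = "bob@example.org"
sample["Subject"] = "Quarterly report"
sample.set_content("See attached.")
sample.add_attachment(
    b"dummy bytes",
    maintype="application",
    subtype="octet-stream",
    filename="report.pdf",
)
metadata = extract_metadata(sample.as_string())
```

The resulting dictionary is one metadata tuple; a collection of such tuples forms the training or testing dataset.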


[Figure: system flow diagram with blocks Raw mail details, Preprocessing, Training meta dataset, Testing meta dataset, Classified data]

Classification uses a random forest [8], which is built and applied as follows:

1. Let k = number of trees to create, m = number of random features to select for node splitting, and d = maximum depth of the trees. In this study, trees grow to maximum size.

2. Select k bootstrap samples from the training data such that the vector $\theta_k$ is chosen independently of $\theta_1, \ldots, \theta_{k-1}$.

3. For each of the bootstrap samples, grow a tree $T_k$, where each node splits using the best split from m randomly selected features. The result is multiple tree classifiers $T_k : h(x, \theta_k)$, where x is an input vector of unknown classification.

4. To classify x, process that feature vector down each tree in the forest. Each tree will output a classification, also known as a vote. If $C_k(x)$ represents the classification of the kth tree in the forest, then the aggregate classification of the forest is $C_{\mathrm{forest}}(x) = \text{majority vote}\,\{C_k(x)\}_{1}^{k}$.
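The four steps above can be sketched in plain Python as follows. This is a simplified illustration, not the classifier used in the study: the Gini impurity split criterion, the function names and the toy two-feature data are assumptions, while k, m and d follow the definitions in step 1.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Return the (feature, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(rows, labels, m, d, rng):
    """Step 3: grow one tree; each node considers m randomly selected features."""
    majority = Counter(labels).most_common(1)[0][0]
    if d == 0 or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), m))
    if split is None:
        return majority
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], m, d - 1, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], m, d - 1, rng))


def tree_predict(tree, x):
    """Walk x down one tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree


def grow_forest(rows, labels, k, m, d, seed=0):
    """Step 2: draw k bootstrap samples and grow one tree per sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(k):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx], m, d, rng))
    return forest


def forest_predict(forest, x):
    """Step 4: aggregate the per-tree votes by majority."""
    return Counter(tree_predict(t, x) for t in forest).most_common(1)[0][0]


# Toy feature vectors (assumed data): two numeric features per email.
rows = [[5, 1], [6, 1], [7, 1], [0, 0], [1, 0], [2, 0]]
labels = ["TME", "TME", "TME", "NTME", "NTME", "NTME"]
forest = grow_forest(rows, labels, k=25, m=1, d=3)
```

Calling `forest_predict(forest, x)` on a new feature vector then returns the majority label across the 25 trees.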

The 83 features extracted from each email are represented as a feature vector. The output of the random forest classifier for a particular email is binary: the email is classified as either targeted malicious email (TME) or non-targeted malicious email (NTME), using the email’s specific vector of persistent threat and recipient-oriented features as input. When the classifier correctly predicts a TME, it’s a true positive (TP). When the classifier correctly predicts an NTME, it’s a true negative (TN). When the classifier predicts an NTME as TME, it’s a false positive (FP) or Type I error. When the classifier predicts a TME as NTME, it’s a false negative (FN) or Type II error. Table 3 shows the possible outcomes from the classifier.


The false positive rate (FPR) is the proportion of NTME that was incorrectly classified as TME. The specificity is equal to 1 − FPR, where $\mathrm{FPR} = FP / (FP + TN)$.

The false negative rate (FNR) is the proportion of TME that was incorrectly classified as NTME. The sensitivity is equal to 1 − FNR, where $\mathrm{FNR} = FN / (FN + TP)$.
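As a worked example of these definitions, the short sketch below computes all four measures from confusion-matrix counts. The function name `classifier_rates` and the counts are made up for illustration; they are not results from this study.

```python
def classifier_rates(tp, tn, fp, fn):
    """Compute FPR, FNR, specificity and sensitivity from outcome counts."""
    fpr = fp / (fp + tn)  # NTME wrongly flagged as TME
    fnr = fn / (fn + tp)  # TME wrongly passed as NTME
    return {
        "FPR": fpr,
        "FNR": fnr,
        "specificity": 1 - fpr,
        "sensitivity": 1 - fnr,
    }


# Hypothetical counts: 50 TME and 100 NTME emails in the testing dataset.
rates = classifier_rates(tp=45, tn=90, fp=10, fn=5)
```

With these counts, both error rates are 10 percent, so specificity and sensitivity are both 90 percent.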

Conclusion and Future work:

We conclude our research work with an efficient probability-based supervised learning approach that classifies the testing dataset against the training dataset.

The approach can be enhanced with improved classifiers, such as an improved naïve Bayesian classifier or the C4.5 algorithm, together with their probability and classification measures.

References:

1. Targeted Trojan Email Attacks, briefing 08/2005, Nat’l Infrastructure Security Co-ordination Centre, 2005; www.egovmonitor.com/reports/rep11599.pdf.

2. Targeted Trojan Email Attacks, tech. cybersecurity alert TA05-189A, US-CERT, 2005; www.uscert.gov/cas/techalerts/TA05-189A.html.

3. J.A. Lewis, “Holistic Approaches to Cybersecurity to Enable Network Centric Operations,” statement before Armed Services Committee, Subcommittee on Terrorism, Unconventional Threats and Capabilities, 110th Cong., 2nd sess., 1 April 2008.

4. 2009 Report to Congress of the U.S.-China Economic and Security Review Commission, report, Nov. 2009; www.uscc.gov/annual_report/2009/annual_report_full_09.pdf.

5. B. Krekel, Capability of the People’s Republic of China to Conduct Cyber Warfare and Computer Network Exploitation, Oct. 2009; www.uscc.gov/researchpapers/2009/NorthropGrumman_PRC_Cyber_Paper_FINAL_Approved%20Report_16Oct2009.pdf.

6. I. Androutsopoulos et al., “An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages,” Proc. 23rd Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval, ACM, 2000, pp. 160–167.

7. R.M. Amin, “Detecting Targeted Malicious Email through Supervised Classification of Persistent Threat and Recipient Oriented Features,” PhD thesis, Dept. Eng. and Applied Sciences, George Washington Univ., 2011.

8. L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, 2001, pp. 5–32.

9. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, 2008.

10. E. Hutchins, M. Cloppert, and R. Amin, “Intelligence-Driven Computer Network Defense Informed by Analysis of Adversary Campaigns and Intrusion Kill Chains,” Proc. 6th Int’l Conf. Information Warfare and Security (ICIW 11), Academic Conferences, 2011, pp. 113–125.

Bibliography:

Malireddi V S Kumar completed his BTech at Sri Sai Aditya Institute of Science & Technology. He is currently pursuing an MTech at Kakinada Institute of Engineering & Technology. His research interests are data mining and network security.

Mr. K. Ravi Kumar is currently working as Assoc. Professor & Head of the Department at Kakinada Institute of Engineering & Technology. He completed his BTech and MTech at Pragati Engg College, Surampalem. His areas of interest are data mining, bioinformatics and networking.
