International Journal of Engineering Trends and Technology (IJETT) – Volume 5 number 1 - Nov 2013
Abstract:
Malicious mail detection and filtering is an active research area. In this paper we propose an efficient malicious mail detection system based on a supervised learning classification approach. The proposed approach handles spam and phishing emails using the metadata characteristics of each email. These filtering and detection techniques perform well in detecting targeted malicious mail. A further step detects persistent and recipient-oriented attacks.
Introduction:
Targeted email attacks to enable computer network exploitation have become more prevalent, more insidious, and more widely documented in recent years.
Beyond nuisance spam or phishing designed to trick users into revealing personal information, targeted malicious email (TME) facilitates computer network exploitation and the gathering of sensitive information from targeted networks. These targeted email attacks are not singular, unrelated events; instead, they are coordinated and persistent attack campaigns that can span years. This work surveys and categorizes existing email filtering techniques, proposes and implements new methods for detecting targeted malicious email, and compares these newly developed techniques to traditional detection methods. Current research and commercial methods for detecting illegitimate email are limited to addressing Internet-scale email abuse, such as spam, and are not focused on targeted malicious email. Furthermore, conventional tools such as anti-virus are vulnerability focused, examining only the binary code of an email while ignoring all relevant contextual metadata.
Related work:
Classification is the process of grouping together documents or data that have similar properties or are related. Our understanding of the data and documents becomes greater and easier once they are classified. We can also infer logic based on the classification. Above all, it makes new data easier to sort and retrieval faster, with better results.
Dewey Decimal Classification is the system most used in libraries. It is hierarchical: there are ten main classes, each divided into ten divisions, each of which is in turn divided into ten sections. Each book is assigned a number according to its class, division, and section. Dewey Decimal Classification is very successful in libraries, but unfortunately it cannot be applied to information retrieval on the web. Someone would need to maintain a central catalogue of all the documents on the web, and whenever a new document was added, a central committee would have to look at it, classify it, assign it a number, and publish it on the web. This strongly violates the way the Internet works. An authority controlling the contents of the web would restrict the amount of data that can be added to it. We need a web that allows everyone to upload their content, together with a machine learning technique that finds these new data and classifies them as they arrive.
Confidentiality issues in data mining. A key problem that arises in any en masse collection of data is that of confidentiality. The need for privacy is sometimes due to law (e.g., for medical databases) or can be motivated by business interests. However, there are situations where the sharing of data can lead to mutual gain. A key utility of large databases today is research, whether it be scientific, or economic and market oriented. Thus, for example, the medical field has much to gain by pooling data for research, as can even competing businesses with mutual interests. Despite the potential gain, this is often not possible due to the confidentiality issues which arise.
We address this problem and show that highly efficient solutions are possible. Our scenario is the following: let P1 and P2 be parties owning (large) private databases D1 and D2. The parties wish to apply a data-mining algorithm to the joint databases D1 and D2 without revealing any unnecessary information about their individual databases. That is, the only information learned by P1 about D2 is that which can be learned from the output of the data-mining algorithm, and vice versa. We do not assume any “trusted” third party who computes the joint output.
Bayesian spam filtering: This is a statistical technique for email filtering [18]. In the filtering process, it makes use of a naive Bayes classifier [11], which uses words and other features to identify spam email [19], an approach commonly used in text classification. Naive Bayes classifiers [17] work by correlating the use of tokens (typically words, or sometimes other things) with spam [20] and non-spam emails and then using Bayesian inference to calculate the probability that an email is or is not spam.
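The token-based Bayesian inference described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the filter used in any cited work: the function names (`train`, `spam_probability`, `is_spam`), whitespace tokenization, Laplace (add-one) smoothing, the four toy training messages, and the 0.95 decision threshold are all hypothetical choices made for the example.

```python
import math
from collections import Counter

# Hypothetical toy training set: (message text, is_spam) pairs.
TRAINING = [
    ("cheap pills buy now", True),
    ("win money now claim prize", True),
    ("meeting agenda for monday", False),
    ("please review the attached report", False),
]

def train(emails):
    """Count tokens per class and the number of emails per class."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, spam in emails:
        counts[spam].update(text.lower().split())
        totals[spam] += 1
    return counts, totals

def spam_probability(text, counts, totals):
    """Posterior P(spam | tokens) under the naive independence assumption."""
    vocab_size = len(set(counts[True]) | set(counts[False]))
    log_score = {}
    for label in (True, False):
        n_tokens = sum(counts[label].values())
        # Log prior: fraction of training emails in this class.
        score = math.log(totals[label] / (totals[True] + totals[False]))
        for token in text.lower().split():
            # Add-one smoothing keeps unseen tokens from zeroing the product.
            score += math.log((counts[label][token] + 1) / (n_tokens + vocab_size))
        log_score[label] = score
    # Normalise the two class scores into a probability.
    top = max(log_score.values())
    weight = {label: math.exp(s - top) for label, s in log_score.items()}
    return weight[True] / (weight[True] + weight[False])

def is_spam(text, counts, totals, threshold=0.95):
    """Flag the message when its spam probability exceeds the threshold."""
    return spam_probability(text, counts, totals) > threshold

COUNTS, TOTALS = train(TRAINING)
```

Calling `is_spam("cheap pills buy now win money", COUNTS, TOTALS)` flags the message, while a message such as "monday meeting report" receives a low spam probability. Working in log space avoids numeric underflow when many small token probabilities are multiplied.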
Naive Bayes spam filtering: Naive Bayes spam filtering is a technique that can tailor itself to the email needs of individual users and gives low false-positive detection rates that users generally find acceptable. Particular words have particular probabilities of occurring in spam email and in genuine email. For instance, most email users will frequently encounter the word “viagra” in spam email but will rarely see it in other email. The filter does not know these probabilities in advance, so it must be trained to build them up. To train the filter, the user must manually indicate whether a new email is spam or not. For all words in each training email [9], the filter adjusts the probabilities that each word will appear in spam or legitimate email in its database. After training, the word probabilities are used to compute the probability that an email with a particular set of words in it belongs to either category. Each word in the email, or only the most interesting words, contributes to the email's spam probability. This contribution is called the posterior probability and is computed using Bayes' theorem [7]. Then the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter marks the email as spam. As in any other spam filtering technique, email marked as spam can then be automatically moved to a "Junk" [20] email folder, or even deleted outright. Some software implements quarantine mechanisms that define a time frame during which the user is allowed to review the software's decision.
Proposed work:
We propose an efficient malicious email detection and filtering technique. The testing dataset, which contains the metadata of newly received emails, is classified against the training dataset, which holds the metadata of previously received unauthorized spam and malicious mail. The attribute values are analyzed with respect to the probability of each attribute in the training dataset when compared with the testing dataset.
Datasets are collections of tuples over different attributes, together with the possible values of each attribute. They are given to the classification process so that the behavior of the testing set can be analyzed with a machine learning approach. A synthetic dataset can be gathered for the classification results. The metadata of previously received unauthorized malicious and spam emails is maintained at the server end and is used for classifying the testing dataset metadata. When an access is made over the network, it is also recorded in the training dataset, because this data helps in future classification. To target persistent threats, we introduce a metadata environment. The metadata structure contains fields related to recipients, such as email addresses, subject lines, and attached files. The verification operation is performed using these fields. Using these conditions, Internet-wide attacks from different locations can be controlled.
The dataset contains the related features of the different attacks. The training process starts from this dataset alone. After training is complete, attack detection is performed and the attack-related emails are classified. The emails are classified as targeted malicious mail, non-targeted malicious mail, persistent mail, or recipient-oriented mail.
The dataset contains emails exchanged between customers and the company. Emails in the dataset that fall under anti-spam are treated as non-targeted emails. Spam mails are treated as targeted malicious mails. Repeated intrusion attempts are identified as persistent emails. When a sender repeatedly sends content to a particular recipient, that recipient's mails carry high reputation values; such reputation-related mails are treated as recipient-oriented mails.
Here, preprocessing is the process of extracting the necessary information (metadata) from previously received unauthorized or malicious mails at the receiver end, such as the email address, subject line, and attached files. The testing dataset contains the metadata of new emails. These are compared against the training dataset by calculating the probability of the attributes of the testing and training datasets.
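The preprocessing step can be illustrated with a short Python sketch that pulls the metadata fields named above (email address, subject line, attached files) out of a raw message using the standard `email` package. The function name `extract_metadata`, the dictionary keys, and the sample message are assumptions made for illustration; the paper does not specify a concrete implementation or field layout.

```python
from email import message_from_string
from email.utils import parseaddr

def extract_metadata(raw_message):
    """Return a small metadata record for one raw RFC 822 style message."""
    msg = message_from_string(raw_message)
    # Collect the file names of all MIME parts that declare one.
    attachments = [part.get_filename() for part in msg.walk() if part.get_filename()]
    return {
        "sender": parseaddr(msg.get("From", ""))[1],
        "recipient": parseaddr(msg.get("To", ""))[1],
        "subject": msg.get("Subject", ""),
        "attachments": attachments,
    }

# Hypothetical sample message with one text part and one attachment.
SAMPLE = (
    "From: Alice <alice@example.com>\n"
    "To: bob@example.org\n"
    "Subject: Quarterly report\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "See attached.\n"
    "--XYZ\n"
    "Content-Type: application/pdf\n"
    'Content-Disposition: attachment; filename="report.pdf"\n'
    "\n"
    "JVBERi0=\n"
    "--XYZ--\n"
)

RECORD = extract_metadata(SAMPLE)
```

Each record produced this way corresponds to one tuple of the testing or training dataset; the message body itself is not inspected.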
The random forest classifier is built and applied as follows:
1. Define the parameters: k = number of trees to create; m = number of random features to select for node splitting; and d = maximum depth of the trees. In this study, trees grow to maximum size.
2. Select k bootstrap sample vectors from the training data such that vector θ_k is chosen independently of θ_1, …, θ_(k−1).
3. For each of the bootstrap samples, grow a tree T_k, where each node splits using the best split from m randomly selected features. The result is multiple tree classifiers T_k: h(x, θ_k), where x is an input vector of unknown classification.
4. To classify x, process that feature vector down each tree in the forest. Each tree will output a classification, also known as a vote. If C_k(x) represents the classification of the kth tree in the forest, then the aggregate classification of the forest is C_forest(x) = majority vote {C_k(x)}, taken over all k trees.
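The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the study: the function names (`grow_tree`, `train_forest`, `classify`), the use of Gini impurity to pick the best split, the two-feature toy data, and the parameter values k = 25, m = 1, d = 10 are all assumptions chosen to keep the example small.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    """Most common label in the list."""
    return Counter(labels).most_common(1)[0][0]

def grow_tree(rows, labels, m, d, rng):
    """Step 3: split each node on the best of m randomly chosen features."""
    if d == 0 or len(set(labels)) == 1:
        return majority(labels)  # leaf node
    best = None
    for f in rng.sample(range(len(rows[0])), m):
        for t in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return majority(labels)  # no usable split among the sampled features
    _, f, t, left, right = best
    return (f, t,
            grow_tree([rows[i] for i in left], [labels[i] for i in left], m, d - 1, rng),
            grow_tree([rows[i] for i in right], [labels[i] for i in right], m, d - 1, rng))

def tree_vote(tree, x):
    """Pass feature vector x down one tree and return its vote."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

def train_forest(rows, labels, k, m, d, seed=0):
    """Steps 1-3: draw k bootstrap samples and grow one tree per sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(k):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], m, d, rng))
    return forest

def classify(forest, x):
    """Step 4: aggregate the per-tree votes by majority."""
    return majority([tree_vote(tree, x) for tree in forest])

# Hypothetical toy data: two numeric features per email.
ROWS = [(0, 1), (1, 0), (1, 1), (0, 0), (8, 9), (9, 8), (9, 9), (8, 8)]
LABELS = ["NTME"] * 4 + ["TME"] * 4
FOREST = train_forest(ROWS, LABELS, k=25, m=1, d=10)
```

With this toy data, `classify(FOREST, (9, 9))` returns "TME" and `classify(FOREST, (0, 0))` returns "NTME". Choosing m smaller than the number of features is what decorrelates the trees; setting d large lets each tree grow until its leaves are pure.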
The 83 features extracted from each email are represented as a feature vector. The output of the random forest classifier for a particular email is binary: the email is classified as either TME or non-targeted malicious email (NTME), using the email’s specific vector of persistent threat and recipient-oriented features as input. When the classifier correctly predicts a TME, it’s a true positive (TP). When the classifier correctly predicts an NTME, it’s a true negative (TN). When the classifier predicts an NTME as TME, it’s a false positive (FP) or Type I error. When the classifier predicts a TME as NTME, it’s a false negative (FN) or Type II error. Table 3 shows the possible outcomes from the classifier.
The false positive rate (FPR) is the proportion of NTME that was incorrectly classified as TME. The specificity is equal to 1 − FPR, where FPR = FP / (FP + TN).
The false negative rate (FNR) is the proportion of TME that was incorrectly classified as NTME. The sensitivity is equal to 1 − FNR, where FNR = FN / (FN + TP).
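As a worked example of these definitions, the following Python sketch counts TP, TN, FP, and FN from paired actual and predicted labels and derives the four rates. The function name `error_rates` and the ten sample labels are hypothetical and carry no experimental meaning.

```python
def error_rates(actual, predicted, positive="TME"):
    """Compute FPR, FNR, specificity and sensitivity from paired labels."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # TME correctly predicted
        elif a != positive and p != positive:
            tn += 1  # NTME correctly predicted
        elif a != positive and p == positive:
            fp += 1  # NTME predicted as TME (Type I error)
        else:
            fn += 1  # TME predicted as NTME (Type II error)
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    return {"FPR": fpr, "FNR": fnr, "specificity": 1 - fpr, "sensitivity": 1 - fnr}

# Hypothetical labels: 4 TME and 6 NTME emails, with one error of each type.
ACTUAL = ["TME"] * 4 + ["NTME"] * 6
PREDICTED = ["TME", "TME", "TME", "NTME"] + ["TME"] + ["NTME"] * 5
RATES = error_rates(ACTUAL, PREDICTED)
```

Here FP = 1 and TN = 5, so FPR = 1/6, and FN = 1 and TP = 3, so FNR = 1/4; the specificity and sensitivity are therefore 5/6 and 3/4.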
Conclusion and Future work:
We conclude our research work with an efficient probability-based supervised learning approach that classifies the testing dataset against the training dataset.
The approach can be enhanced with improved classifiers, such as an improved naïve Bayesian classifier or the C4.5 algorithm, together with their probability and classification measures.
References:
1. Targeted Trojan Email Attacks, briefing 08/2005, Nat’l Infrastructure Security Co-ordination Centre, 2005; www.egovmonitor.com/reports/rep11599.pdf.
2. Targeted Trojan Email Attacks, tech. cybersecurity alert TA05-189A, US-CERT, 2005; www.uscert.gov/cas/techalerts/TA05-189A.html.
3. J.A. Lewis, “Holistic Approaches to Cybersecurity to Enable Network Centric Operations,” statement before Armed Services Committee, Subcommittee on Terrorism, Unconventional Threats and Capabilities, 110th Cong., 2nd sess., 1 April 2008.
4. 2009 Report to Congress of the U.S.-China Economic and Security Review Commission, report, Nov. 2009; www.uscc.gov/annual_report/2009/annual_report_full_09.pdf.
5. B. Krekel, Capability of the People’s Republic of China to Conduct Cyber Warfare and Computer Network Exploitation, Oct. 2009; www.uscc.gov/researchpapers/2009/NorthropGrumman_PRC_Cyber_Paper_FINAL_Approved%20Report_16Oct2009.pdf.
6. I. Androutsopoulos et al., “An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages,” Proc. 23rd Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval, ACM, 2000, pp. 160–167.
7. R.M. Amin, “Detecting Targeted Malicious Email through Supervised Classification of Persistent Threat and Recipient Oriented Features,” PhD thesis, Dept. Eng. and Applied Sciences, George Washington Univ., 2011.
8. L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, 2001, pp. 5–32.
9. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, 2008.
10. E. Hutchins, M. Cloppert, and R. Amin, “Intelligence-Driven Computer Network Defense Informed by Analysis of Adversary Campaigns and Intrusion Kill Chains,” Proc. 6th Int’l Conf. Information Warfare and Security (ICIW 11), Academic Conferences, 2011, pp. 113–125.