A Hybrid Data Mining Classification Technique for the Detection of Suspicious Criminal Activities on Emails Using Neuro-Genetic Approach Hitesh Mahanand Computer Science & Engineering School of Engineering & IT MATS University, Aarang, Raipur(C.G), India hitesh.mahanand@gmail.com Deepak Kumar Xaxa (Assistant Professor) Computer Science & Engineering School of Engineering & IT MATS University, Aarang, Raipur(C.G), India xaxadeepak@gmail.com Abstract— The paper presents a hybrid data mining classification technique for detection of suspicious criminal activities on e-mails. In the paper we proposed a Neuro-Genetic approach in which, a Genetic Algorithm is used for pre-processing of e-mails and Neural Network with back-propagation is used for classification of the e-mails. The presented hybrid model focuses on the evaluation of knowledge based learning algorithms to detecting the suspicious criminal activities on e-mails. In the literature, various machine learning algorithms achieved good accuracy for the related task. We have identified the combination of those algorithms results that improve the performance and accuracy of the existing algorithm. that requires human brilliance and expertise and data mining is one technology that can help us to problem for detection of suspicious activities in crimes [3]. When the criminals and their activities are detected and restricted properly, then it is probable to significantly cut down the crime rate. Keywords—Classification techniques; Criminal Activities; feature selection;Genetic,Hybrid;Neuro,Suspicious. I. INTRODUCTION Present days, the major concerns for us are security and it is given top priority by the National Crime Records Bureau, India. Yet it keeps on encountering terrorist and extremist exercises such as many recent activities like in New Delhi on 13th February 2012 an explosive attacked on an Israeli diplomatic vehicle, 12 peoples killed at New Delhi’s High Court on 7th September, 2011by a bomb blast, many people killed due to three separate bomb attacks in the crowded areas in Mumbai on 13th July, 2011. Day by day the criminal activity based graph of India through web or internet is increasing. Online way of correspondence such as internet can increase their publicity to criminal activities, or chances to attack their own, as well as broadcast their activities to others. Preferably crime based activity should be stopped; if it cannot be stopped then it should at least be detected. Criminology is a most critical field where data mining technologies can create vital results. In crime analysis, we investigate and detect the criminal activities and their relationships with the criminals [1]. With growing number of criminal movements, there is a huge space of criminal database and also the complexity of these kinds of data have made to detect criminal based activities is a good enough area for employing data mining technologies. The knowledge that is gained from data mining approaches is very fruitful which can help and support police forces [2]. Dealing with crimes is a challenging and time taking task Present days the criminals are planning and committing crimes [4], intelligently by using becoming new technologies. One of the most popular means of communications is E-mail or social media. The growing electronic transaction of data is also discovering the demand for automated analyzing technologies. Therefore to raising & catching the criminals we required such a detection of criminal activities and also required analysis approach. The security agencies should also use the present and new growing technologies [5] to give themselves the muchneeded edge. Email messages can be sent to particular or group of users. A single email can broadcast among millions of users within few moments. It contributes effortless and reliable approach of correspondence. Nowadays, most individual users cannot imagine their life without internet and social media. For those reasons, E-mail has become a most widely used mode of correspondence to the terrorists as well. Most of the researchers [6],[7], concentrate in the area of counter terrorism after the disastrous events of 9/11 trying to envision the terrorist plans from suspicious correspondence. This also motivated us to contribute in this area. Classification is one of the classic data mining approaches, which is used to classify each item in a set of data into one of predefined set of classes or groups. The idea is to define the criteria use for the segmentation of the whole database, once this is done, individual dataset can then fall into one or more groups naturally. With the help of classification, existing dataset can easily be understood and it also helps to predict how new individual dataset will behave based on the classification criteria. Data mining creates classification models by observing already classified data and finding a predictive pattern among those data [8]. In this paper, we have proposed a hybrid model to detect suspicious criminal activities on E-mails. A hybrid classification technique involves two different machine learning approaches that are Genetic Algorithm and Neural Network with back-propagation and this hybrid model is called here a Neuro-Genetic approach. In this NeuroGenetic; Genetic Algorithm is used for feature selection and reduction where as the Neural Network with backpropagation is used to classify the E-mails either as suspicious or non-suspicious. We have collected some Email dataset that contains such suspicious criminal activities; there is no any benchmark dataset available in the domain. This paper is organized as follows: Section II presents the details of existing work done in the area of detection of the suspicious activities on E-mails and Classification techniques; the proposed Neuro-Genetic approach with its advantages is explained in Section III. Section IV describes the Motivation of the New Approach. Section V illustrates our Experimental Design & Results and finally the paper is concluded with section VI along with Conclusion. Different feature selection approaches along with classification systems was incorporated by authors [14][15]. According to [15], one can improve the precision, applicability, and understandability of the classification process by using feature selection strategies. The most efficient and majority used algorithm for e-mail classification and content analysis systems [4][6] is ID3 algorithm [17]. An advanced ID3 algorithm with enhanced feature selection mechanism is used to classify the e-mails as suspicious and non-suspicious e-mails with experimental analysis [16]. This paper proposed a hybrid classification data mining to detect the criminal activities on E-mails with using a Neuro-Genetic approach. A Neuro-Genetic approach has several advantages over existing systems and it may improve the performance and accuracy of the process. III. II. RELATED WORK Suspicious emails are additional class of illicit emails. Suspicious emails are those which have some word which is and contains some indication regarding some illicit activities, which is to be required further inspected by law enforcement agencies. Most of the researchers have worked in the area of detecting the criminal based activities on Emails, that includes the content analysis and applying the different Data Mining approaches [6][7][9] which helped as a motivation to contribute in this area. Various data mining techniques can be used to overcome the critical issues to identified the fraud through E-mail communication [10][11]. The authors proposed the data mining approaches can be used to detect the fraud activities and criminal activities in different domains. A singular value decomposition method is used to detect suspicious and deceptive e-mails is proposed by authors [12]. The drawback of this approach was that it does not deal with inadequate data in proper efficient manner and can not maintain the new data incrementally. For updating the new data properly they had to reprocess the entire algorithm. To minimizing the classification error in an e-mail sequence and for a good performance authors S.Kiritchenko et al. [13] projected an approach by observing temporal relations and embedding the discovered information into content-based learning methods. Other authors [6] have also applied a decision tree to detect the suspicious e-mails by considering the keyword base approach. The problem with this approach was that they have not used any information about the context of the identified keywords from the e-mails. An association rule mining is used to further classify the suspicious e-mail message [4] and e-mails can be classified as suspicious alert if there is suspicious keywords present in the future tense or else it is classified as suspicious information only. METHODOLOGY A. Proposed Algorithm The suspicious E-mail detection process works in flow which is given as follows: Step 1: Input E-mail dataset E Step2: Extract E-mail content Step3: Construct feature set from E-mail content. Step4: Make different feature selection. Step5: For each set of features Fi Step6: For each set of classification algorithm Cj Step7: Outputi,j=Suspicious_Detection (E,Fi,Cj) Step8: Display Outputi,j Step9: End loop at step5 Step10:End loop at step6 B. Proposed Model In this research work a Hybrid Data Mining Classification Technique for the Detection of Suspicious Criminal Activities Using Neuro-Genetic Approach is proposed. Classification model is built describing a pre-determined set of data classes or concepts. The proposed methodology has two machine learning algorithms combined together: a Genetic algorithm for feature reduction and selection that would be best for the Neural Network with back propagation the best information about the different classes being used. Classification consists of two steps: Training: It is the supervised learning of a training set of data to build a model. Testing: It includes classifying the data according to that model. The proposed model as shown in the Figure-1 consists of two phase model building and validation. Model building performs on training set of data and model validation performs on test set of data. Model Building Phase: In this phase a model is built using a set of operational steps. The operational steps are preprocessing of dataset, feature selection and classification. First step is preprocessing of training set of data. Preprocessing consists of data to remove noise and achieve data linearity. Here’s E-mail dataset are used for proposed work. Feature selection is also a preprocessing activity. Here a Genetic Algorithm is used for feature selection process. Feature selection consists of detecting the relevant features and discarding the irrelevant ones, with the goal of obtaining a subset of features that describes properly the given problem with minimum degradation of performance. Using Neural Network a classifier is trained to classify item set known as suspicious, non suspicious E-mail through different feature sets. population size and number of generations (or iterations) the algorithm is run for. Genetic algorithms are a particular class of evolutionary algorithms that use techniques inspired by evolutionary biology such as Crossover (also called recombination), Mutation and Selection. Crossover: Crossover is the most important operation that a GA performs on its gene pool of candidate solutions. Given the binary string representation that the system uses, crossover is achieved by the following process: Select two entities that are to breed Randomly select a point in their chromosomes Break the chromosomes apart at this point, and exchange two of the halves Recombine the halves to form two new entities The probability that this happens is controlled by the crossover rate. This parameter is typically set to 0.5 in most GA applications. Other similar parameters for the GA are mutation and migration. Mutation: The mutation parameter controls the probability that an entity undergoes a change without the involvement of a mate. Selection: The selection of entities that should produce offspring is done by evaluating each entity in terms of their fitness; this is the crux of the GA’s operation, and is done by the fitness function. Fitness Function: The fitness function must be defined by the problem that the GA is being applied to. For email classification, the resulting selection must be a suitable collection of words that can be used to predict email from all possible classes. Figure 1. Hybrid Model i. Genetic Algorithm: A genetic algorithm (or GA) is a search technique used in computing to find true or approximate solutions to optimization and search problems. The system uses a genetic algorithm as the primary means of feature selection. In order to give the GA a helping hand, some initial reduction of the feature size needed to be done before the GA was allowed to do its work. Genetic algorithms simulate the interactions between a numbers of different entities in a "gene pool" of candidate solutions. Immediately this has created several variables to manipulate in our quest for the perfect settings: the population size to be simulated, and how many generations to simulate are two of them. These two alone can have a big impact on the effectiveness of the GA; it will also determine the total time taken by the algorithm, since the number of entity evaluations required will be the product of the Steps for the Algorithm shown in the Figure-2: The evolution usually starts from a population of randomly generated individuals and happens in generations. Step 1: In each generation, the fitness of every individual in the population is evaluated, multiple individuals are Step 2: selected from the current population (based on their fitness), and modified (recombined and possibly mutated) to form a new population. Step 3: The new population is then used in the next iteration of the algorithm. Step 4: Commonly, the algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population. Step 5: If the algorithm has terminated due to a maximum number of generations, a satisfactory solution may or may not have been reached. IV. MOTIVATION FOR THE NEW APPROACH The traditional filter based on the old technique is difficult to adapt to the suspicious emails. The Hybrid Classification Technique to the design of email filters are the innovations. GAs model the principles of evolution and natural selection to quickly search through a large space of solutions to a problem. The potential of GA to be used in a hybrid system with Neural Networks, either as a feature selection step or by evolving the NN Architecture. There’s no GUI for NN, so the topology cannot be modified during training time. V. Figure-2 Flow-Chart of Genetic Algorithm ii. Neural Network: The neural network in this system as shown in the Figure-3 is the component that does the classification, from the features that the GA has chosen. Compared to the GA, the process of creating the NN was relatively painless. Once the GA and NN are in place, the interactions between then can be studied further. A neural network machine learning algorithm is an attempt to mimic the way real neural networks in the human brain work. Each neural network has an input layer and an output layer, with one or more hidden layers between them. Depending on the input to each neuron, they will pass their output on to the next layer of neurons. Propagating values forwards through the network like this makes values eventually reach the output layer, where predictions can be made. Training is accomplished through use of the back-propagation algorithm. EXPERIMENTAL DESIGN & RESULTS Experimental work is performed in MATLAB R2012a (7.14.0.739) data mining tool and Stanford parser. MATLAB is widely used in all areas of applied mathematics, in education and research at universities, and in the industry. According to the framework as mentioned in Figure-1, the experiment basically consists of different focal operational steps E-mail contents extraction, stemming words removals stop word removals, features reduction, feature selection and classification. Results are analyzed and discussed here. Datasets: The experiment is done through the pool of some E-mails data. The labeled suspicious and nonsuspicious E-mails are collected over the different security web sites, e-papers and different journals. Genetic Algorithms: Simulate the interactions between a numbers of different entities in a "gene pool" of candidate solutions. Immediately this has created several variables to manipulate in our quest for the perfect settings: the population size to be simulated, and how many generations to simulate are two of them. These two alone can have a big impact on the effectiveness of the GA; it will also determine the total time taken by the algorithm, since the number of entity evaluations required will be the product of the population size and number of generations (or iterations) the algorithm is run for. The main options as shown in the Table-1 that were used were first set to the following: Table-1 Genetic Algorithm Parameters Parameter Name Value Figure-3 Neural Network The following steps are carrying out to classify the back propagation preparation: Step 1: Initialize the weights Step 2: Propagate the inputs forward Step 3: Back propagate the error. GA Population Size 200 GA Maximum Generation 80 The best-fit for the features are calculated by the algorithm as per shown figure-2 given in Methodology Section. Neural Network Specifics: At the start of the classification stage only requiring a few parameters (as mentioned above). The most influential parameter for the NN is the architecture; the number of neurons at each hidden layer. It is mainly the details external to the NN itself, such as the training of the NN and the formatting of input data that affects its performance. The main options as shown in the Table-2 that were used were first set to the following: Table-2 Neural Network Parameters Parameter Name Value Neural Network Inputs 2500 Number of Hidden Layers 10 Numbers of Outputs 1 The trick to training a NN is not to overdo it. Training a NN too much on the training data will cause it to overfit the data, leading to excellent predictions on the training data but poor predictions on data that was not seen during training. Training of the network will always stop after the maximum epoch limit (1000 epochs), but it may be desirable to finish training earlier if the NN begins to show signs of over training. To accomplish this, any data to be used for training the NN is further split into training data and self-testing data. For each class, the values are calculated as follows: i. Precision: The number of E-mails correctly assigned to this class/The number correctly assigned to this class + the number incorrectly assigned to this class ii. Recall: The number of E-mails correctly assigned to this class/The number correctly assigned to this class + the number incorrectly rejected from this class iii. F1measure=((β2+1)*precision*recall)/(β2*precisi on + recall) where β is set to 1.0 Table-3 Performance of Classifiers with different feature sets Classification Technique Hybrid Classification SVM Precision % Recall % FMeasure % Feature Set 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 90.77 94.62 96.54 96.92 74.23 90.77 96.54 96.15 95.16 97.23 98.24 98.44 85.21 95.16 98.24 98.04 50 100 400 700 50 100 400 700 A graph plot of performance of Hybrid Classifier & SVM Classifier using precision, recall and f-measure parameters with different features is given in figure-5. Figure 4: NN error rates in sample training session Figure-4 shows the progress of the NN during a sample training session for different features selected by the GA.The value marked with red bubbles at red line is the error values for the training data. The values using blue bubbles at blue line is the error on the testing data. As can be seen, during the training the error rate on the training data decreases; this is to be expected. As training progresses, error on both sets of data slowly decreases, up until a certain point where error on the test set begins to increase (as shown in figures). This is the indication that over training is being done. Performance Measures: The term “Accuracy” is used to often in this document and it the primary evaluation criteria for the experimental results. It needs to be defined as there are many different performance measures that are often used in the field of text categorization. Accuracy is calculated simply by dividing the number of correctly classified E-mails by the total of E-mails classified. Other performance scores are calculated: these are precision and recall. Then calculation is performed at a per-class level and then later averaging those values. Figure:5 Performance of Classifiers with different feature sets From Table-3 & Figure:5 it is observed that a Hybrid Classification which is based on Neuro-Genetic approach provides highest recall 96.92% and f-measure 98.44% for suspicious and non-suspicious E-mails with 700 feature set. It is observed that a Hybrid Classification based on NeuroGenetic approach have provided better results compare to single classifier SVM. VI. CONCLUSION & FUTURE WORK To detect suspicious criminal activities is a new active area of text mining. The main aim of this work was to show that the Neuro-Genetic approach fed with appropriate features will be result in high accuracy compared with individual classifier. In particular, presented work showed that the feature sets are different for different classifier with result in high accuracy. Based on this approach, it’s believed that security experts and knowledge based learning experts should work together with the aim of improving the current E-mail classification techniques. Many improvements can be added to the detection of suspicious criminal activities of E-mails developed in this thesis. Since many techniques can be applied and the list is quite long, it’s decided to focus on the most important ones which are described in this section. Here proposed work is on Hybrid Data Mining Classification Techniques which is combined with Neural Network and Genetic Algorithm. The work can be extended with taken other classification techniques classify all categories of attacks. References [1]. Hussain,Durairaj et al., “Criminal behavior analysis by using data mining techniques”, International Conference on Advances in Engineering, Science and Management (ICAESM), IEEE, March 2012. [2]. Keyvanpour, Javideh, et al., “Detecting and investigating crime by means of data mining: a general crime matching framework”, World Conference on Information Technology, Elsvier B.V., 2010. [3]. S. Nath, “Crime data mining, Advances and innovations in systems”, K.Elleithy (ed.), Computing Sciences and Software Engineering, 2007. [4]. S. Appavvu, MuthuPandian, et al., “Association Rule Mining for Suspicious Email Detection: A Data Mining Approach”, Intelligence and Security Informatics, IEEE, 2007. [5]. M. Mansourvar, et al., “A computer- Based System to Support Intelligent Forensic Study”, Computational Intelligence, Modelling and Simulation (CIMSiM), Fourth International Conference, IEEE,September 2012. [6]. S. Appavu, R. Rajaram, “Suspicious email detection via decision tree: a data mining approach,” Journal of Computing and Information Technology -CIT 15 , 2007, pp. 161-169. [7]. S. Appavu and R. Rajaram, “Learning to classify threatening e-mail”, Internation Journal of Artificial Intelligence and Soft Computing. vol. 1, no. 1, 2008, pp. 39-51. [8]. T. Abraham and de Vel, O., “Investigative profiling with computer forensic log data and association rules”, Data Mining, Proceedings, IEEE International Conference, 2002. [9]. Sarwat Nizamani, Nasrullah Memon, Uffe Kock Wiil and Panagiotis Karampelas, “Modeling Suspicious Email Detection Using Enhanced Feature Selection”, International Journals of Modeling and Optimization, Vol 2, No. 4 , August 2012. [10]. Anirudha Khshirsagar U Lalit Dole “A review on data mining methods for identity crime detection”, International Journal of Electrical, Electronics and Computer Science (IJEECS), January, 2014. [11]. N. Hanmant Renushe, R. Prasanna Rasal & S. Abhijit Desai ,”Data Mining Practices for Effective Investigation of Crime”, IJCTA, May-June 2012. [12]. P.S.Keila, D.B. Skillicorn, “Detecting Unusual and Deceptive Communication in Email”, Centers for Advanced study conference, June 2005. [13]. S. Kiritchenko, S.Matwin, S. Abu-Hakima, “Email Classification with Temporal Features”, Intelligent Information Systems, 2004. [14]. K.Selvakuberan et al., “Combined feature selection and classification –A novel approach for categorization of web pages”, Journal of Information and Computing Science, 2008. [15]. A. Arauzo-Azofra, J. M. Benitez, “Empirical Study of Feature Selection Methods in Classification”, 8th International Conference on Hybrid Intelligent Systems, IEEE 2008. [16]. Mugdha Sharma , “Z - CRIME: A Data Mining Tool for the Detection of Suspicious Criminal Activities Based on Decision Tree”, Proceedings of International Conference on Data Mining and Intelligent Computing (ICDMIC) September 5-6, 2014, Technically Cosponsored by IEEE Delhi Section. [17]. Rui-Min Chai, Miao Wang, “A more efficient classification scheme for ID3”, 2nd International Conference on Computer Engineering and Technology (ICCET), IEEE, 2010.