A Hybrid Data Mining Classification Technique for the Detection of

advertisement
A Hybrid Data Mining Classification Technique for the
Detection of Suspicious Criminal Activities on Emails
Using Neuro-Genetic Approach
Hitesh Mahanand
Computer Science & Engineering
School of Engineering & IT
MATS University, Aarang, Raipur(C.G), India
hitesh.mahanand@gmail.com
Deepak Kumar Xaxa (Assistant Professor)
Computer Science & Engineering
School of Engineering & IT
MATS University, Aarang, Raipur(C.G), India
xaxadeepak@gmail.com
Abstract— The paper presents a hybrid data mining
classification technique for detection of suspicious
criminal activities on e-mails. In the paper we proposed
a Neuro-Genetic approach in which, a Genetic
Algorithm is used for pre-processing of e-mails and
Neural Network with back-propagation is used for
classification of the e-mails. The presented hybrid model
focuses on the evaluation of knowledge based learning
algorithms to detecting the suspicious criminal activities
on e-mails. In the literature, various machine learning
algorithms achieved good accuracy for the related task.
We have identified the combination of those algorithms
results that improve the performance and accuracy of
the existing algorithm.
that requires human brilliance and expertise and data mining
is one technology that can help us to problem for detection
of suspicious activities in crimes [3]. When the criminals
and their activities are detected and restricted properly, then
it is probable to significantly cut down the crime rate.
Keywords—Classification techniques; Criminal Activities;
feature selection;Genetic,Hybrid;Neuro,Suspicious.
I.
INTRODUCTION
Present days, the major concerns for us are security and it
is given top priority by the National Crime Records Bureau,
India. Yet it keeps on encountering terrorist and extremist
exercises such as many recent activities like in New Delhi
on 13th February 2012 an explosive attacked on an Israeli
diplomatic vehicle, 12 peoples killed at New Delhi’s High
Court on 7th September, 2011by a bomb blast, many people
killed due to three separate bomb attacks in the crowded
areas in Mumbai on 13th July, 2011. Day by day the
criminal activity based graph of India through web or
internet is increasing. Online way of correspondence such as
internet can increase their publicity to criminal activities, or
chances to attack their own, as well as broadcast their
activities to others. Preferably crime based activity should
be stopped; if it cannot be stopped then it should at least be
detected.
Criminology is a most critical field where data mining
technologies can create vital results. In crime analysis, we
investigate and detect the criminal activities and their
relationships with the criminals [1]. With growing number
of criminal movements, there is a huge space of criminal
database and also the complexity of these kinds of data have
made to detect criminal based activities is a good enough
area for employing data mining technologies. The
knowledge that is gained from data mining approaches is
very fruitful which can help and support police forces [2].
Dealing with crimes is a challenging and time taking task
Present days the criminals are planning and committing
crimes [4], intelligently by using becoming new
technologies. One of the most popular means of
communications is E-mail or social media. The growing
electronic transaction of data is also discovering the demand
for automated analyzing technologies. Therefore to raising
& catching the criminals we required such a detection of
criminal activities and also required analysis approach. The
security agencies should also use the present and new
growing technologies [5] to give themselves the muchneeded edge.
Email messages can be sent to particular or group of
users. A single email can broadcast among millions of users
within few moments. It contributes effortless and reliable
approach of correspondence. Nowadays, most individual
users cannot imagine their life without internet and social
media. For those reasons, E-mail has become a most widely
used mode of correspondence to the terrorists as well. Most
of the researchers [6],[7], concentrate in the area of counter
terrorism after the disastrous events of 9/11 trying to
envision the terrorist plans from suspicious correspondence.
This also motivated us to contribute in this area.
Classification is one of the classic data mining
approaches, which is used to classify each item in a set of
data into one of predefined set of classes or groups. The idea
is to define the criteria use for the segmentation of the whole
database, once this is done, individual dataset can then fall
into one or more groups naturally. With the help of
classification, existing dataset can easily be understood and
it also helps to predict how new individual dataset will
behave based on the classification criteria. Data mining
creates classification models by observing already classified
data and finding a predictive pattern among those data [8].
In this paper, we have proposed a hybrid model to detect
suspicious criminal activities on E-mails. A hybrid
classification technique involves two different machine
learning approaches that are Genetic Algorithm and Neural
Network with back-propagation and this hybrid model is
called here a Neuro-Genetic approach. In this NeuroGenetic; Genetic Algorithm is used for feature selection and
reduction where as the Neural Network with backpropagation is used to classify the E-mails either as
suspicious or non-suspicious. We have collected some Email dataset that contains such suspicious criminal
activities; there is no any benchmark dataset available in the
domain.
This paper is organized as follows: Section II presents the
details of existing work done in the area of detection of the
suspicious activities on E-mails and Classification
techniques; the proposed Neuro-Genetic approach with its
advantages is explained in Section III. Section IV describes
the Motivation of the New Approach. Section V illustrates
our Experimental Design & Results and finally the paper is
concluded with section VI along with Conclusion.
Different feature selection approaches along with
classification systems was incorporated by authors [14][15].
According to [15], one can improve the precision,
applicability, and understandability of the classification
process by using feature selection strategies.
The most efficient and majority used algorithm for e-mail
classification and content analysis systems [4][6] is ID3
algorithm [17]. An advanced ID3 algorithm with enhanced
feature selection mechanism is used to classify the e-mails
as suspicious and non-suspicious e-mails with experimental
analysis [16]. This paper proposed a hybrid classification
data mining to detect the criminal activities on E-mails with
using a Neuro-Genetic approach. A Neuro-Genetic approach
has several advantages over existing systems and it may
improve the performance and accuracy of the process.
III.
II.
RELATED WORK
Suspicious emails are additional class of illicit emails.
Suspicious emails are those which have some word which is
and contains some indication regarding some illicit
activities, which is to be required further inspected by law
enforcement agencies. Most of the researchers have worked
in the area of detecting the criminal based activities on Emails, that includes the content analysis and applying the
different Data Mining approaches [6][7][9] which helped as
a motivation to contribute in this area.
Various data mining techniques can be used to overcome
the critical issues to identified the fraud through E-mail
communication [10][11]. The authors proposed the data
mining approaches can be used to detect the fraud activities
and criminal activities in different domains. A singular
value decomposition method is used to detect suspicious
and deceptive e-mails is proposed by authors [12]. The
drawback of this approach was that it does not deal with
inadequate data in proper efficient manner and can not
maintain the new data incrementally. For updating the new
data properly they had to reprocess the entire algorithm. To
minimizing the classification error in an e-mail sequence
and for a good performance authors S.Kiritchenko et al. [13]
projected an approach by observing temporal relations and
embedding the discovered information into content-based
learning methods.
Other authors [6] have also applied a decision tree to
detect the suspicious e-mails by considering the keyword
base approach. The problem with this approach was that
they have not used any information about the context of the
identified keywords from the e-mails. An association rule
mining is used to further classify the suspicious e-mail
message [4] and e-mails can be classified as suspicious alert
if there is suspicious keywords present in the future tense or
else it is classified as suspicious information only.
METHODOLOGY
A. Proposed Algorithm
The suspicious E-mail detection process works in flow
which is given as follows:
Step 1: Input E-mail dataset E
Step2: Extract E-mail content
Step3: Construct feature set from E-mail content.
Step4: Make different feature selection.
Step5: For each set of features Fi
Step6: For each set of classification algorithm Cj
Step7: Outputi,j=Suspicious_Detection (E,Fi,Cj)
Step8: Display Outputi,j
Step9: End loop at step5
Step10:End loop at step6
B. Proposed Model
In this research work a Hybrid Data Mining Classification
Technique for the Detection of Suspicious Criminal
Activities Using Neuro-Genetic Approach is proposed.
Classification model is built describing a pre-determined set
of data classes or concepts. The proposed methodology has
two machine learning algorithms combined together: a
Genetic algorithm for feature reduction and selection that
would be best for the Neural Network with back
propagation the best information about the different classes
being used.
Classification consists of two steps:
 Training: It is the supervised learning of a training
set of data to build a model.
 Testing: It includes classifying the data according
to that model.
The proposed model as shown in the Figure-1 consists of
two phase model building and validation. Model building
performs on training set of data and model validation
performs on test set of data.
Model Building Phase:
In this phase a model is built using a set of operational
steps. The operational steps are preprocessing of dataset,
feature selection and classification.
First step is preprocessing of training set of data.
Preprocessing consists of data to remove noise and achieve
data linearity.
 Here’s E-mail dataset are used for proposed work.
 Feature selection is also a preprocessing activity.
Here a Genetic Algorithm is used for feature
selection process. Feature selection consists of
detecting the relevant features and discarding the
irrelevant ones, with the goal of obtaining a subset
of features that describes properly the given
problem
with
minimum
degradation
of
performance.
 Using Neural Network a classifier is trained to
classify item set known as suspicious, non
suspicious E-mail through different feature sets.
population size and number of generations (or
iterations) the algorithm is run for.
Genetic algorithms are a particular class of
evolutionary algorithms that use techniques inspired by
evolutionary biology such as Crossover (also called
recombination), Mutation and Selection.

Crossover: Crossover is the most important
operation that a GA performs on its gene pool of
candidate solutions. Given the binary string
representation that the system uses, crossover is
achieved by the following process:
 Select two entities that are to breed
 Randomly select a point in their chromosomes
 Break the chromosomes apart at this point, and
exchange two of the halves
 Recombine the halves to form two new entities
The probability that this happens is controlled by
the crossover rate. This parameter is typically set to 0.5
in most GA applications. Other similar parameters for
the GA are mutation and migration.
 Mutation: The mutation parameter controls the
probability that an entity undergoes a change
without the involvement of a mate.
 Selection: The selection of entities that should
produce offspring is done by evaluating each entity
in terms of their fitness; this is the crux of the GA’s
operation, and is done by the fitness function.
 Fitness Function: The fitness function must be
defined by the problem that the GA is being
applied to. For email classification, the resulting
selection must be a suitable collection of words
that can be used to predict email from all possible
classes.
Figure 1. Hybrid Model
i.
Genetic Algorithm: A genetic algorithm (or GA) is a
search technique used in computing to find true or
approximate solutions to optimization and search
problems. The system uses a genetic algorithm as the
primary means of feature selection. In order to give the
GA a helping hand, some initial reduction of the feature
size needed to be done before the GA was allowed to
do its work.
Genetic algorithms simulate the interactions
between a numbers of different entities in a "gene pool"
of candidate solutions. Immediately this has created
several variables to manipulate in our quest for the
perfect settings: the population size to be simulated, and
how many generations to simulate are two of them.
These two alone can have a big impact on the
effectiveness of the GA; it will also determine the total
time taken by the algorithm, since the number of entity
evaluations required will be the product of the
Steps for the Algorithm shown in the Figure-2:
The evolution usually starts from a population of
randomly generated individuals and happens in
generations.
Step 1: In each generation, the fitness of every
individual in the population is evaluated, multiple
individuals are
Step 2: selected from the current population (based on
their fitness), and modified (recombined and possibly
mutated) to form a new population.
Step 3: The new population is then used in the next
iteration of the algorithm.
Step 4: Commonly, the algorithm terminates when
either a maximum number of generations has been
produced, or a satisfactory fitness level has been
reached for the population.
Step 5: If the algorithm has terminated due to a
maximum number of generations, a satisfactory
solution may or may not have been reached.
IV.





MOTIVATION FOR THE NEW APPROACH
The traditional filter based on the old technique is
difficult to adapt to the suspicious emails.
The Hybrid Classification Technique to the design of email filters are the innovations.
GAs model the principles of evolution and natural
selection to quickly search through a large space of
solutions to a problem.
The potential of GA to be used in a hybrid system with
Neural Networks, either as a feature selection step or by
evolving the NN Architecture.
There’s no GUI for NN, so the topology cannot be
modified during training time.
V.
Figure-2 Flow-Chart of Genetic Algorithm
ii.
Neural Network: The neural network in this system as
shown in the Figure-3 is the component that does the
classification, from the features that the GA has chosen.
Compared to the GA, the process of creating the NN
was relatively painless. Once the GA and NN are in
place, the interactions between then can be studied
further. A neural network machine learning algorithm is
an attempt to mimic the way real neural networks in the
human brain work.
Each neural network has an input layer and an
output layer, with one or more hidden layers between
them. Depending on the input to each neuron, they will
pass their output on to the next layer of neurons.
Propagating values forwards through the network like
this makes values eventually reach the output layer,
where predictions can be made. Training is
accomplished through use of the back-propagation
algorithm.
EXPERIMENTAL DESIGN & RESULTS
Experimental work is performed in MATLAB R2012a
(7.14.0.739) data mining tool and Stanford parser.
MATLAB is widely used in all areas of applied
mathematics, in education and research at universities, and
in the industry. According to the framework as mentioned in
Figure-1, the experiment basically consists of different focal
operational steps E-mail contents extraction, stemming
words removals stop word removals, features reduction,
feature selection and classification. Results are analyzed and
discussed here.
 Datasets: The experiment is done through the pool of
some E-mails data. The labeled suspicious and nonsuspicious E-mails are collected over the different
security web sites, e-papers and different journals.
 Genetic Algorithms: Simulate the interactions between
a numbers of different entities in a "gene pool" of
candidate solutions. Immediately this has created
several variables to manipulate in our quest for the
perfect settings: the population size to be simulated, and
how many generations to simulate are two of them.
These two alone can have a big impact on the
effectiveness of the GA; it will also determine the total
time taken by the algorithm, since the number of entity
evaluations required will be the product of the
population size and number of generations (or
iterations) the algorithm is run for.
The main options as shown in the Table-1 that were
used were first set to the following:
Table-1 Genetic Algorithm Parameters
Parameter Name
Value
Figure-3 Neural Network
The following steps are carrying out to classify the back
propagation preparation:
Step 1: Initialize the weights
Step 2: Propagate the inputs forward
Step 3: Back propagate the error.

GA Population Size
200
GA Maximum Generation
80
The best-fit for the features are calculated by the
algorithm as per shown figure-2 given in Methodology
Section.
Neural Network Specifics: At the start of the
classification stage only requiring a few parameters (as
mentioned above). The most influential parameter for
the NN is the architecture; the number of neurons at
each hidden layer. It is mainly the details external to the
NN itself, such as the training of the NN and the
formatting of input data that affects its performance.
The main options as shown in the Table-2 that were
used were first set to the following:
Table-2 Neural Network Parameters
Parameter Name
Value
Neural Network Inputs
2500
Number of Hidden Layers
10
Numbers of Outputs
1
The trick to training a NN is not to overdo it. Training a
NN too much on the training data will cause it to overfit the data, leading to excellent predictions on the
training data but poor predictions on data that was not
seen during training. Training of the network will
always stop after the maximum epoch limit (1000
epochs), but it may be desirable to finish training earlier
if the NN begins to show signs of over training. To
accomplish this, any data to be used for training the NN
is further split into training data and self-testing data.
For each class, the values are calculated as follows:
i. Precision: The number of E-mails correctly
assigned to this class/The number correctly
assigned to this class + the number incorrectly
assigned to this class
ii. Recall: The number of E-mails correctly assigned
to this class/The number correctly assigned to this
class + the number incorrectly rejected from this
class
iii. F1measure=((β2+1)*precision*recall)/(β2*precisi
on + recall) where β is set to 1.0
Table-3 Performance of Classifiers with different feature sets
Classification
Technique
Hybrid
Classification
SVM
Precision
%
Recall
%
FMeasure
%
Feature
Set
100.00
100.00
100.00
100.00
100.00
100.00
100.00
100.00
90.77
94.62
96.54
96.92
74.23
90.77
96.54
96.15
95.16
97.23
98.24
98.44
85.21
95.16
98.24
98.04
50
100
400
700
50
100
400
700
A graph plot of performance of Hybrid Classifier & SVM
Classifier using precision, recall and f-measure parameters
with different features is given in figure-5.
Figure 4: NN error rates in sample training session

Figure-4 shows the progress of the NN during a sample
training session for different features selected by the
GA.The value marked with red bubbles at red line is the
error values for the training data. The values using blue
bubbles at blue line is the error on the testing data. As
can be seen, during the training the error rate on the
training data decreases; this is to be expected. As
training progresses, error on both sets of data slowly
decreases, up until a certain point where error on the
test set begins to increase (as shown in figures). This is
the indication that over training is being done.
Performance Measures: The term “Accuracy” is used
to often in this document and it the primary evaluation
criteria for the experimental results. It needs to be
defined as there are many different performance
measures that are often used in the field of text
categorization. Accuracy is calculated simply by
dividing the number of correctly classified E-mails by
the total of E-mails classified.
Other performance scores are calculated: these are
precision and recall. Then calculation is performed at a
per-class level and then later averaging those values.
Figure:5 Performance of Classifiers with different feature sets
From Table-3 & Figure:5 it is observed that a Hybrid
Classification which is based on Neuro-Genetic approach
provides highest recall 96.92% and f-measure 98.44% for
suspicious and non-suspicious E-mails with 700 feature set.
It is observed that a Hybrid Classification based on NeuroGenetic approach have provided better results compare to
single classifier SVM.
VI.
CONCLUSION & FUTURE WORK
To detect suspicious criminal activities is a new active area
of text mining. The main aim of this work was to show that
the Neuro-Genetic approach fed with appropriate features
will be result in high accuracy compared with individual
classifier. In particular, presented work showed that the
feature sets are different for different classifier with result in
high accuracy. Based on this approach, it’s believed that
security experts and knowledge based learning experts
should work together with the aim of improving the current
E-mail classification techniques.
Many improvements can be added to the detection of
suspicious criminal activities of E-mails developed in this
thesis. Since many techniques can be applied and the list is
quite long, it’s decided to focus on the most important ones
which are described in this section. Here proposed work is
on Hybrid Data Mining Classification Techniques which is
combined with Neural Network and Genetic Algorithm. The
work can be extended with taken other classification
techniques classify all categories of attacks.
References
[1]. Hussain,Durairaj et al., “Criminal behavior analysis by
using data mining techniques”, International
Conference on Advances in Engineering, Science and
Management (ICAESM), IEEE, March 2012.
[2]. Keyvanpour, Javideh, et al., “Detecting and
investigating crime by means of data mining: a general
crime matching framework”, World Conference on
Information Technology, Elsvier B.V., 2010.
[3]. S. Nath, “Crime data mining, Advances and innovations
in systems”, K.Elleithy (ed.), Computing Sciences and
Software Engineering, 2007.
[4]. S. Appavvu, MuthuPandian, et al., “Association Rule
Mining for Suspicious Email Detection: A Data Mining
Approach”, Intelligence and Security Informatics,
IEEE, 2007.
[5]. M. Mansourvar, et al., “A computer- Based System to
Support Intelligent Forensic Study”, Computational
Intelligence, Modelling and Simulation (CIMSiM),
Fourth International Conference, IEEE,September
2012.
[6]. S. Appavu, R. Rajaram, “Suspicious email detection via
decision tree: a data mining approach,” Journal of
Computing and Information Technology -CIT 15 ,
2007, pp. 161-169.
[7]. S. Appavu and R. Rajaram, “Learning to classify
threatening e-mail”, Internation Journal of Artificial
Intelligence and Soft Computing. vol. 1, no. 1, 2008,
pp. 39-51.
[8]. T. Abraham and de Vel, O., “Investigative profiling
with computer forensic log data and association rules”,
Data Mining, Proceedings, IEEE International
Conference, 2002.
[9]. Sarwat Nizamani, Nasrullah Memon, Uffe Kock Wiil
and Panagiotis Karampelas, “Modeling Suspicious
Email Detection Using Enhanced Feature Selection”,
International Journals of Modeling and Optimization,
Vol 2, No. 4 , August 2012.
[10]. Anirudha Khshirsagar U Lalit Dole “A review on data
mining methods for identity crime detection”,
International Journal of Electrical, Electronics and
Computer Science (IJEECS), January, 2014.
[11]. N. Hanmant Renushe, R. Prasanna Rasal & S. Abhijit
Desai ,”Data Mining Practices for Effective
Investigation of Crime”, IJCTA, May-June 2012.
[12]. P.S.Keila, D.B. Skillicorn, “Detecting Unusual and
Deceptive Communication in Email”, Centers for
Advanced study conference, June 2005.
[13]. S. Kiritchenko, S.Matwin, S. Abu-Hakima, “Email
Classification with Temporal Features”, Intelligent
Information Systems, 2004.
[14]. K.Selvakuberan et al., “Combined feature selection
and classification –A novel approach for categorization
of web pages”, Journal of Information and Computing
Science, 2008.
[15]. A. Arauzo-Azofra, J. M. Benitez, “Empirical Study of
Feature Selection Methods in Classification”, 8th
International Conference on Hybrid Intelligent Systems,
IEEE 2008.
[16]. Mugdha Sharma , “Z - CRIME: A Data Mining Tool
for the Detection of Suspicious Criminal Activities
Based on Decision Tree”, Proceedings of International
Conference on Data Mining and Intelligent Computing
(ICDMIC) September 5-6, 2014, Technically Cosponsored by IEEE Delhi Section.
[17]. Rui-Min Chai, Miao Wang, “A more efficient
classification scheme for ID3”, 2nd International
Conference on Computer Engineering and Technology
(ICCET), IEEE, 2010.
Download