www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242

Volume 3, Issue 10 October, 2014 Page No. 8964-8971
Internet traffic classification using Hybrid Aggregated
classifier and Neural Network
Ms. G. Rubadevi1, Mrs. R. Amsaveni2
1 Research Scholar, Department of Computer Science, PSGR Krishnammal College for Women, Tamil Nadu, India.
2 Assistant Professor, Department of Information Technology, PSGR Krishnammal College for Women, Tamil Nadu, India.
Abstract: Internet traffic classification is a fundamental technology for modern network management and security, for example quality of service (QoS) control. It is useful for tackling a number of network security problems, including lawful interception and intrusion detection. There is an increasing demand for modern traffic classification techniques due to the development of new applications. In this work, internet traffic classification is carried out using supervised classification techniques, namely neural networks such as the Multilayer Perceptron (MLP) and Radial Basis Function (RBF) network, and a Hybrid Aggregated Classifier. The tasks involved in this work are IP packet capturing, preprocessing, flow container construction (if the flows observed in a certain period of time share the same destination IP, destination port, and transport layer protocol, they are determined to be correlated flows and modeled as a "Flow Container"), separation of low density and high density flows, feature extraction, and classification. The accuracy of the Hybrid Aggregated classifier is better than that of the Neural Network classifiers.
Keywords: Internet traffic, Hybrid aggregated classifier, Neural Network.
1. Introduction

Internet traffic is the flow of data across the Internet. Because of the distributed nature of the Internet, there is no single point of measurement for total Internet traffic. Today, a connection to the Internet can be an organisation's most vital link to the outside world, and too much Internet traffic can cause even the fastest connections to bog down. Learning to identify common sources of Internet traffic can help keep bandwidth available and prevent congestion issues from interfering with important data transfers.

Accurate network traffic classification plays a vital role in numerous network activities, from security monitoring to providing essential forecasts for long-term provisioning, and from Quality of Service to accounting. Internet traffic classification schemes are difficult to model accurately because of the limited information commonly available in the network. Traditional methods of internet traffic classification include flow-based application detection, packet-based methods and payload methods. Packet headers alone, for example, do not contain the information required for an accurate methodology, so traditional traffic classification techniques often achieve only approximately 50% to 70% accuracy.

Recent research on traffic classification has focused on correlation-based statistical features. Flow-statistical-feature-based internet traffic classification can be implemented using supervised (classification) algorithms or unsupervised (clustering) algorithms. Unsupervised traffic classification is very difficult to construct without knowing the real traffic classes.

In this research work, a set of pre-labeled data is given to the supervised traffic classification. Based on the pre-labeled data, the classifiers are divided into two categories: parametric and non-parametric. Parametric classifiers include Naïve Bayes, the C4.5 decision tree, Bayesian networks, SVM and neural networks; non-parametric classifiers include k-Nearest Neighbor (k-NN).
The proposed internet traffic classification uses a parametric classifier, the Hybrid Aggregated classifier (which combines the advantages of Naïve Bayes and C4.5), together with neural networks (MLP and RBF).

As reported, the NN (nearest neighbour) classifier can achieve superior performance similar to that of the parametric classifiers such as SVM and neural networks. The NN classifier has several important advantages: for example, it does not require a training procedure, it avoids overfitting of parameters, and it is able to process a large number of classes. However, the accuracy of the NN classifier is affected by a small training set: when the number of training samples is reduced from 100 to 10 for each class, the classification accuracy of the NN-based traffic classifier goes down by approximately 20%.
The Hybrid Aggregated classifier has the advantages of both NB and C4.5, and it provides higher accuracy than the NN classifier. In this research work, the traffic dataset is divided into low density flows and high density flows: for high density traffic the NB classifier is used for classification, and for low density flows the C4.5 algorithm is used.
2. Methodology

2.1 The Proposed Framework
The proposed framework of internet traffic classification consists of the following modules: IP packet capturing, preprocessing, Flow Container (FC) construction, low density and high density flow separation, feature extraction, feature discretization, classification and performance evaluation. The classification performance, in terms of accuracy, precision, recall and F-measure, is compared between our hybrid classifier and machine learning algorithms such as simple Naïve Bayes and Neural Networks (Multilayer Perceptron, Radial Basis Function). The proposed hybrid classifier gives higher accuracy than the other algorithms.

IP packet capturing:
The internet traffic packets were captured using the Wireshark tool from the ISP of an educational institution. Wireshark is a well-known open-source packet capturing tool; it captures network packets and extracts the details of each captured packet. To create the data set, internet traffic packets were captured for a duration of 1 minute.

Fig 1: Flow diagram for internet traffic classification

Preprocessing
Pre-processing is the process of removing noise and incorrect data by data cleaning and data reduction techniques. Real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin in multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results, so preprocessing is required. Data preprocessing techniques include data cleaning, data integration, data transformation, and data reduction.

In the preprocessing step, the system captures IP packets crossing a target network and constructs traffic flows by checking the headers of the IP packets. A flow consists of successive IP packets with the same 5-tuple: source IP, source port, destination IP, destination port, and transport layer protocol. We apply a heuristic way to determine the correlated flows and model them using a "Flow Container (FC)".

Correlated flows
If the flows observed in a certain period of time share the same destination IP, destination port, and transport layer protocol, they are determined to be correlated flows.
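As a concrete illustration of the flow and correlated-flow grouping described above, the sketch below groups captured packets into 5-tuple flows and then groups correlated flows into containers keyed by (destination IP, destination port, protocol). The packet record layout and field names are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

# Each captured packet is assumed to be a dict of header fields, e.g. as
# exported from a Wireshark/pcap parser (field names are illustrative).
packets = [
    {"src_ip": "10.0.0.5", "src_port": 51234, "dst_ip": "172.16.0.9",
     "dst_port": 80, "proto": "TCP", "length": 1460, "time": 0.001},
    {"src_ip": "10.0.0.5", "src_port": 51235, "dst_ip": "172.16.0.9",
     "dst_port": 80, "proto": "TCP", "length": 60, "time": 0.004},
]

def build_flows(packets):
    """Group successive packets into flows by the 5-tuple."""
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt["src_ip"], pkt["src_port"], pkt["dst_ip"],
               pkt["dst_port"], pkt["proto"])
        flows[key].append(pkt)
    return flows

def build_flow_containers(flows):
    """Group correlated flows (same destination IP, destination port and
    protocol within the capture window) into Flow Containers."""
    containers = defaultdict(list)
    for (src_ip, src_port, dst_ip, dst_port, proto), pkts in flows.items():
        containers[(dst_ip, dst_port, proto)].append(pkts)
    return containers

flows = build_flows(packets)
containers = build_flow_containers(flows)
print(len(flows), "flows grouped into", len(containers), "flow containers")
```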
Flow Container construction
In the proposed scheme, a set of correlated flows generated by the same application is modeled using a "Flow Container". Since the flows belong to the same application-based class, this correlation information can be utilized to improve the classification results. Therefore, we aim to aggregate the individual predictions of the correlated flows so as to conduct more accurate classification. Our research shows that this goal can be achieved by following the approach of classifier combination; an analysis of classifier combination using bagging and random subspaces is provided. There is a strong assumption that the average performance of all the individual classifiers, each trained on a subset of features and a replica of the training set, is similar to that of a classifier which uses the full feature set and the whole training set. This assumption is not always true, but we do not make such an assumption here. From the inequality, one can see that a more accurate aggregated classifier can be obtained with higher diversity of the simple predictor. In our work, the simple predictor is unstable due to the small set of training data. Consequently, the aggregation of correlated flow predictions can improve the performance of the generated aggregated predictor.

Low density flow and high density flow
After constructing the Flow Container (FC), the traffic flow is divided into low density flow and high density flow based upon the size of each packet.
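The paper does not state the exact packet-size threshold used to separate low and high density flows, so the short sketch below uses a hypothetical threshold on the mean packet length of a flow, purely for illustration; the `flows` dictionary is the one built in the previous sketch.

```python
# Hypothetical threshold (bytes); the original work does not specify the value.
DENSITY_THRESHOLD = 500

def split_by_density(flows, threshold=DENSITY_THRESHOLD):
    """Split flows into (low density, high density) groups by mean packet length."""
    low, high = [], []
    for pkts in flows.values():
        mean_len = sum(p["length"] for p in pkts) / len(pkts)
        (high if mean_len >= threshold else low).append(pkts)
    return low, high

# low_density, high_density = split_by_density(flows)   # 'flows' from the previous sketch
```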
2.2. Feature Extraction
The proposed research work follows statistics-based classification, in which statistical features of the packet-level trace are extracted and used to classify the network traffic. For example, a jump in the rate of packets generated by a host might be the sign of worm propagation; however, a jump in the rate of packets might also be an indication of a P2P application, which generates plenty of zero-payload flows while peers try to connect to each other. With statistical approaches it is feasible to determine the application type, but the specific application or client cannot be determined in general: e.g., it cannot be stated that a flow belongs to Skype or MSN Messenger voice traffic, but it can be assumed that it is the traffic of some kind of VoIP application, which generates packets with a constant bit rate in both directions. These flow characteristics can be hardcoded manually, or the features of a specific kind of traffic can be discovered automatically. To achieve this, statistical methods are combined with methods coming from the field of artificial intelligence.

The following features were found to match the above criteria and became the base feature set for our experiments:
 Protocol
 Flow duration
 Flow volume in bytes and packets
 Packet length (minimum, mean, maximum and standard deviation)
 Inter-arrival time between packets (minimum, mean, maximum and standard deviation)

Packet lengths are based on the IP length, excluding link-layer overhead. Inter-arrival times have at least microsecond precision and accuracy (traces were captured using Wireshark). As the traces contain both directions of the flows, the features were calculated in both directions (except protocol and flow duration). This produces a total of 22 flow features, which we refer to as the 'full feature set'. Our features are simple and well understood within the networking community.
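A minimal sketch of this per-flow feature extraction, assuming each flow is a list of packet records as in the earlier sketches. The exact feature layout is an interpretation of the list above (2 direction-independent features plus 10 per direction = 22), and the raw protocol value would still need to be discretized/encoded before training, as in the feature discretization module.

```python
from statistics import mean, pstdev

def direction_stats(values):
    """Min/mean/max/std of a list of values (zeros if the direction is empty)."""
    if not values:
        return [0.0, 0.0, 0.0, 0.0]
    return [min(values), mean(values), max(values), pstdev(values)]

def inter_arrival(times):
    return [t2 - t1 for t1, t2 in zip(times, times[1:])]

def extract_features(pkts, client_ip):
    """Protocol, flow duration, then per-direction volume, packet-length and
    inter-arrival statistics: 2 + 2 * 10 = 22 features per flow."""
    fwd = [p for p in pkts if p["src_ip"] == client_ip]   # client -> server
    bwd = [p for p in pkts if p["src_ip"] != client_ip]   # server -> client
    times = sorted(p["time"] for p in pkts)
    features = [
        pkts[0]["proto"],            # protocol (to be discretized/encoded later)
        times[-1] - times[0],        # flow duration
    ]
    for direction in (fwd, bwd):
        lengths = [p["length"] for p in direction]
        features.append(sum(lengths))      # flow volume in bytes
        features.append(len(direction))    # flow volume in packets
        features += direction_stats(lengths)
        features += direction_stats(inter_arrival(sorted(p["time"] for p in direction)))
    return features
```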
3. Machine learning classification

3.1. Neural Network
A neural network is a type of computational model which is able to solve problems in various fields. It processes information in a way similar to how the human brain processes information. Basically, a neural network consists of a large number of processing elements called neurons working together to perform specific tasks. As in the human brain, there are thousands of dendrites which carry information signals; they transmit the signals to the axon in the form of electrical spikes, and the axon then sends the signals to other dendrites across a synapse. A synapse fires when the excitatory input is sufficiently larger than the inhibitory input, and this concept of signal transmission is also reflected in how a neural network processes the inputs it receives.

Multilayer Perceptron (MLP):
A neural network is characterized by 1) its pattern of connections between the neurons (called its architecture), 2) its algorithm for determining the weights on the connections (called its training, or learning, algorithm), and 3) its activation function. The Multilayer Perceptron (MLP) is the most common neural network. This type of neural network is known as a supervised network because it requires a desired output in
order to learn. The purpose of the MLP is to develop a model that correctly maps the input data to the output using historical data, so that the model can then be used to produce the output when the desired output is unknown. A graphical representation of an MLP is shown in Figure 2. In the first step, the MLP is used to learn the behavior of the input data using the back-propagation algorithm; this step is called the training phase. In the second step, the trained MLP is tested using unknown input data.

Figure 2: MLP architecture with two hidden layers

There are different training algorithms, and it is very difficult to know which training algorithm is the fastest for a given problem. In order to determine the fastest training algorithm, many parameters should be considered: for instance, the complexity of the problem, the number of data points in the training set, the number of weights and biases in the network, and the error goal should all be evaluated.

In this research work, the Multilayer Perceptron provides only 65% to 75% accuracy.
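The paper trains the MLP with back-propagation in MATLAB 2012; as a rough equivalent, the sketch below uses scikit-learn's MLPClassifier (an assumption about tooling, not the authors' code) on a feature matrix X and labels y such as those produced by the feature extraction step. Random data stands in for the captured traffic features.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# X: one row of flow features per sample, y: application class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 22))
y = rng.integers(0, 7, size=500)          # e.g. 7 protocol classes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Two hidden layers, trained by back-propagation of the error gradient.
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), activation="relu",
                    solver="adam", max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print("MLP test accuracy:", mlp.score(X_test, y_test))
```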
Radial Basis Function Neural Network
Radial basis function (RBF) networks typically have three layers: an input layer, a hidden layer with a non-linear RBF activation function, and a linear output layer. The Radial Basis Function (RBF) neural network is a multilayer feed-forward artificial neural network which uses radial basis functions as the activation functions of the hidden layer neurons. The output of the RBF neural network is a weighted linear superposition of these basis functions.

Figure 3: Radial Basis Function Architecture

In this research work, the Radial Basis Function network provides only 70% to 75% accuracy.
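scikit-learn has no built-in RBF network, so the following is a minimal hand-rolled sketch of the architecture described above: Gaussian basis functions centred with k-means, followed by a linear (least-squares) output layer. It is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

class SimpleRBFNetwork:
    """Input layer -> Gaussian RBF hidden layer -> linear output layer."""

    def __init__(self, n_centers=20, gamma=1.0):
        self.n_centers = n_centers
        self.gamma = gamma

    def _hidden(self, X):
        # Gaussian activation for each (sample, center) pair.
        d2 = ((X[:, None, :] - self.centers_[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-self.gamma * d2)

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.centers_ = KMeans(n_clusters=self.n_centers, n_init=10,
                               random_state=0).fit(X).cluster_centers_
        H = self._hidden(X)
        # One-hot targets; output weights via least squares (linear output layer).
        T = (y[:, None] == self.classes_[None, :]).astype(float)
        self.weights_, *_ = np.linalg.lstsq(H, T, rcond=None)
        return self

    def predict(self, X):
        scores = self._hidden(np.asarray(X, dtype=float)) @ self.weights_
        return self.classes_[scores.argmax(axis=1)]

# Usage on the same synthetic data as the MLP sketch:
# rbf = SimpleRBFNetwork(n_centers=30, gamma=0.1).fit(X_train, y_train)
# print("RBF test accuracy:", (rbf.predict(X_test) == y_test).mean())
```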
3.2. Hybrid aggregated classifier
The Hybrid Aggregated classifier takes advantage of both the NB classifier and the C4.5 classifier. The traffic flow is divided into low density flow and high density flow based upon the length of the packets; the high density flow is classified using the NB algorithm and the low density flow is classified using the C4.5 algorithm.

Our proposed classifier's performance is also evaluated in the presence of attacks. In our system, an unknown source address in the data transmission is considered to be an attack. The classification performance, in terms of accuracy, precision, recall and F-measure, is compared between our hybrid classifier without attacks and with attacks; the proposed hybrid classifier performs well even in the attack environment.
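A minimal sketch of the hybrid aggregated idea described in this section: high density flows go to a Naïve Bayes model, low density flows go to a decision tree (scikit-learn's CART classifier stands in for C4.5, which scikit-learn does not provide), and the per-flow predictions inside each Flow Container are aggregated by majority vote. The threshold, the feature layout and the vote rule are illustrative assumptions.

```python
from collections import Counter
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier   # CART, used here in place of C4.5

DENSITY_THRESHOLD = 500   # hypothetical mean-packet-length threshold (bytes)

class HybridAggregatedClassifier:
    def __init__(self, threshold=DENSITY_THRESHOLD, length_col=0):
        self.threshold = threshold
        self.length_col = length_col          # column holding mean packet length
        self.nb = GaussianNB()                # high density flows
        self.tree = DecisionTreeClassifier()  # low density flows

    def _is_high(self, X):
        return X[:, self.length_col] >= self.threshold

    def fit(self, X, y):
        high = self._is_high(X)
        if high.any():
            self.nb.fit(X[high], y[high])
        if (~high).any():
            self.tree.fit(X[~high], y[~high])
        return self

    def predict_flows(self, X):
        high = self._is_high(X)
        pred = np.empty(len(X), dtype=object)
        if high.any():
            pred[high] = self.nb.predict(X[high])
        if (~high).any():
            pred[~high] = self.tree.predict(X[~high])
        return pred

    def predict_containers(self, X, container_ids):
        """Aggregate per-flow predictions by majority vote within each container."""
        flow_pred = self.predict_flows(X)
        votes = {}
        for cid, p in zip(container_ids, flow_pred):
            votes.setdefault(cid, Counter())[p] += 1
        return {cid: counter.most_common(1)[0][0] for cid, counter in votes.items()}
```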
High density flow – NB classifier
Naïve Bayes is a simple ("naive") classification method based on Bayes' rule. Bayesian classification represents a supervised learning method as well as a statistical method for classification. It assumes an underlying probabilistic model, and it allows us to capture uncertainty about the model in a principled way by determining the probabilities of the outcomes. It can solve diagnostic and predictive problems. Bayesian classification provides practical learning algorithms in which prior knowledge and observed data can be combined, and it provides a useful perspective for understanding and evaluating many learning algorithms. It calculates explicit probabilities for hypotheses and it is robust to noise in the input data.

The Naïve Bayes classifier is based on Bayes' theorem with independence assumptions between predictors. A Naïve Bayes model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Despite its simplicity, the Naïve Bayes classifier often does surprisingly well and is widely used because it often outperforms more sophisticated classification methods. Bayes' theorem provides a way of calculating the posterior probability, P(c|x), from P(c), P(x), and P(x|c). The Naïve Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors; this assumption is called class conditional independence.

P(c|x) = P(x|c) P(c) / P(x)

• P(c|x) is the posterior probability of the class (target) given the predictor (attribute).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, which is the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.
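A small sketch of the Bayes rule stated above applied to flow classification, using scikit-learn's GaussianNB (Gaussian per-feature likelihoods are an assumption for illustration; the original work's exact likelihood model is not specified, and the data below is a placeholder).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy stand-in for flow feature vectors and application classes.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 22))
y = rng.integers(0, 3, size=300)

nb = GaussianNB().fit(X, y)

# P(c|x) for each class c, computed from P(c) and the per-feature likelihoods
# P(x|c) under the class-conditional independence assumption.
posteriors = nb.predict_proba(X[:5])
print("class priors P(c):", nb.class_prior_)
print("posterior P(c|x) for first 5 flows:\n", posteriors.round(3))
```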
Low density flow – C4.5 classifier
C4.5 is a well-known decision tree machine learning algorithm used to generate univariate decision trees. It is an extension of the Iterative Dichotomiser 3 (ID3) algorithm, which is used to find simple decision trees. C4.5 is also called a statistical classifier because of its classification capability. C4.5 builds decision trees from a set of training data samples with the help of the information entropy concept. The training dataset consists of a large number of training samples which are characterized by various features, together with a target class. At each node of the tree, C4.5 selects one particular feature of the data to split its set of samples into subsets enriched in one or another class. The selection is based upon the criterion of normalized information gain obtained from selecting a feature for splitting the data. The feature with the highest normalized information gain is selected and a decision is made; the C4.5 algorithm then repeats the same action on the smaller subsets. C4.5 has made a number of improvements to ID3: it can handle both continuous and discrete attributes, it can handle training data with missing attribute values, and it can also handle attributes with differing costs.

Let S be a set consisting of s data samples with m distinct classes. The expected information needed to classify a given sample is given by

I(s1, s2, ..., sm) = - Σ (i = 1..m) pi log2(pi)

where pi is the probability that an arbitrary sample belongs to class Ci and is estimated by si/s. Let attribute A have v distinct values, let sij be the number of samples of class Ci in a subset Sj, and let Sj contain those samples in S that have value aj of A. The entropy, or expected information based on the partitioning into subsets by A, is given by

E(A) = Σ (j = 1..v) ((s1j + s2j + ... + smj) / s) · I(s1j, s2j, ..., smj)

The encoding information that would be gained by branching on A is

Gain(A) = I(s1, s2, ..., sm) - E(A)

C4.5 uses the gain ratio, which applies normalization to the information gain using a value defined as

SplitInfoA(S) = - Σ (j = 1..v) (|Sj| / |S|) log2(|Sj| / |S|)

This value represents the information generated by splitting the training data set S into v partitions corresponding to the v outcomes of a test on the attribute A. The gain ratio is defined as

GainRatio(A) = Gain(A) / SplitInfoA(S)

The attribute with the highest gain ratio is selected as the splitting attribute. The non-leaf nodes of the generated decision tree are considered to be the relevant attributes. The authors have integrated decision tree and neural network, which resulted in improved classification accuracy.
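The entropy, gain and gain ratio formulas given above can be computed directly; the short sketch below evaluates them for a single categorical attribute of a toy dataset (the dataset is illustrative only).

```python
from collections import Counter
from math import log2

def info(labels):
    """Expected information I(s1..sm) = -sum(pi * log2(pi))."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain_ratio(values, labels):
    """Gain ratio of an attribute (given as per-sample values) w.r.t. the labels."""
    total = len(labels)
    subsets = {}
    for v, c in zip(values, labels):
        subsets.setdefault(v, []).append(c)
    entropy_a = sum(len(s) / total * info(s) for s in subsets.values())   # E(A)
    gain = info(labels) - entropy_a                                       # Gain(A)
    split_info = -sum((len(s) / total) * log2(len(s) / total)
                      for s in subsets.values())                          # SplitInfo_A(S)
    return gain / split_info if split_info else 0.0

# Toy example: attribute "protocol" vs. traffic class.
protocol = ["TCP", "TCP", "UDP", "UDP", "TCP", "UDP"]
traffic  = ["web", "web", "dns", "dns", "web", "voip"]
print("gain ratio:", round(gain_ratio(protocol, traffic), 3))
```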
In this research work, the hybrid classifier provides higher accuracy than the other machine learning algorithms; the accuracy of the hybrid classifier is 90% to 96%.

4. Experimental Result

4.1. Dataset
The internet traffic packets were captured using the Wireshark tool from the ISP of an educational institution. Wireshark captures network packets and extracts the details of each captured packet. To create the data set, internet traffic packets were captured for a duration of 1 minute; 2730 packets were captured in this period. For our experimentation, 2330 data instances are given to the training phase of the classifiers. In the training phase, the data instances are given along with their class labels, so the classifiers learn the data features according to the class label. For the testing phase, we take the remaining 400 traffic flows, covering protocols such as ALC, ARP, UDP, CDP, DHCP, LLMNR and NBNS. These datasets are classified using the existing NB classifier and the proposed hybrid classifier.
4.2. Result

Working Environment
Various experiments have been carried out by implementing the classification algorithms, namely the Neural Networks (Multilayer Perceptron and Radial Basis Function) and the Hybrid Aggregated classifier; these supervised learning algorithms are implemented using MATLAB 2012. The results of the experiments are compared using accuracy and F-measure.

Comparative result of all the classifiers and overall performance
Comparative results of the three experiments, carried out by implementing the Neural Network algorithms (Multilayer Perceptron and Radial Basis Function) and the Hybrid Aggregated classifier, are summarised in Table 1. The comparative results show that the Hybrid Aggregated classifier gives better results than the other classifiers.

Table 1: Performance of all the classifiers

ALGORITHM | PRECISION (%) | RECALL (%) | F-MEASURE (%) | ACCURACY (%)
Multilayer Perceptron | 65.09 | 68.65 | 66.74 | 69.85
Radial Basis Function | 69.26 | 72.24 | 70.70 | 75
Hybrid Aggregated classifier | 93.96 | 88.80 | 90.77 | 96.28
Accuracy
Accuracy is the percentage of correctly classified samples over all classified samples. It can be calculated from the formula given as follows:

Accuracy = (number of correctly classified samples) / (total number of classified samples)

Figure 4: Accuracy for different classifiers

Recall
Recall is the proportion of samples of a particular class Z correctly classified as belonging to that class Z. It is equivalent to the True Positive Rate (TPR). Recall can be calculated from the formula given as follows:

Recall = TP / (TP + FN)

Figure 5: Recall for different classifiers

Precision
Precision is the proportion of the samples which truly have class Z among all those which were classified as class Z. Precision can be calculated from the formula given as follows:

Precision = TP / (TP + FP)

Figure 6: Precision for different classifiers

F-measure comparison
The F-measure distinguishes the correct classification of labels within the different classes. In essence, it assesses the effectiveness of the algorithm on a single class, and the higher it is, the better the classification. It is defined as follows:

F = 2 · (Precision · Recall) / (Precision + Recall)

Figure 7: F-measure for different classifiers
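The accuracy, precision, recall and F-measure defined above can be computed per class from the predicted and true labels; a brief sketch using scikit-learn's metric functions (an assumption about tooling, not the paper's MATLAB code) is given below on placeholder labels.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Placeholder true and predicted class labels for a handful of flows.
y_true = ["web", "dns", "web", "voip", "dns", "web"]
y_pred = ["web", "dns", "voip", "voip", "web", "web"]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("Recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F-measure:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```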
In this research work, the performance evaluation of the different classifiers shows that the Hybrid Aggregated classifier provides better results than the Neural Network classifiers.

5. Conclusion
The existing system has several drawbacks: it does not analyse the density of the data and uses the same classifier for both high density and low density flows, which degrades the performance of the system, and it extracts fewer features from the data, which degrades the accuracy rate. With the intention of overcoming these problems and increasing the accuracy rate of traffic classification, we propose a novel hybrid aggregated classifier.

The proposed hybrid aggregated classifier combines the advantages of the Naïve Bayes classifier and the C4.5 classifier; based on the density of the data, the system uses these two classifiers for traffic classification. In addition, the proposed system extracts more relevant features from the traffic data in order to enhance the accuracy rate and improve the performance of the system. The proposed classifier's performance is also evaluated in the presence of attacks. The experimental results show that the proposed scheme can achieve much better classification performance than existing internet traffic classification methods.

Future scope of work:
To improve the performance of the ML classifiers, our future work will include:
 Increasing the capture duration for the training data set (in this research work, the internet traffic dataset was developed by considering a packet flow duration of 1 minute), so that a significant variation in the feature values of the different classes can be observed.
 Capturing internet traffic from various real-time environments such as universities, offices, homes and shopping malls.
 Detecting other types of attacks in real-time traffic using different techniques.