An Intrusion Detection Model with
Hierarchical Attention Mechanism
CHANG LIU1, (Member, IEEE), YANG LIU2, YU YAN3, (Student Member, IEEE), AND JI WANG1
1 Guangdong Ocean University, Zhanjiang, Guangdong, 524088, China (e-mail: byndgjc@163.com; zjouwangji@163.com)
2 Beijing Institute of Astronautical Systems Engineering, Donggaodi, Fengtai District, Beijing, 100076, China (e-mail: yangliu_npu@163.com)
3 Harbin Engineering University, No. 145 Nantong Street, Nangang District, Harbin, Heilongjiang, 150001, China (e-mail: yanyuyikey@hrbeu.edu.cn)
Corresponding author: Ji Wang (e-mail: zjouwangji@163.com).
The work was supported by the program for scientific research start-up funds of Guangdong Ocean University.
ABSTRACT Network security has always been a hot topic, as security and reliability are vital to software and hardware. A network intrusion detection system (NIDS) is an effective solution for the identification of
attacks in computer and communication systems. A necessary condition for high-quality intrusion detection
is the gathering of useful and precise intrusion information. Machine learning, particularly deep learning,
has achieved considerable success in industry and academia due to its strong ability of feature
representation and extraction. In this paper, deep learning methods are integrated into the NIDS. The
intrusion activity is regarded as a time-series event and a bidirectional gated recurrent unit (GRU) based
network intrusion detection model with hierarchical attention mechanism is presented. The influence of
different lengths of previous traffic on the performance is then studied. Some experiments are performed on
the dataset UNSW-NB15, in which the proposed hierarchical attention model achieves satisfactory detection
accuracy of more than 98.76% and a false alarm rate (FAR) of lower than 1.2%. An attention probability
map reflecting the importance of features is then visualized using the attention mechanism. The visualization assists in understanding the varied importance of the same features for different traffic classes and in guiding feature selection in the future.
INDEX TERMS Intrusion Detection System, Recurrent Neural Network, Attention Mechanism, Visualization.
I. INTRODUCTION
Vast amounts of data are generated, processed, and exchanged in the use and interaction of numerous
devices. Such data has become the target of illegal activity,
which has caused significant damage to network systems [1],
[2]. Research into advanced security methods has become
increasingly important in both industry and academia in order
to consistently improve and update security threat detection [3]. The basic general components of network security
mechanisms include firewall, user authentication technology,
anti-virus software, and an intrusion detection system (IDS)
[4], [5]. As a proactive security technology, IDS monitors
a host or network and alerts when an attack is detected.
Cybersecurity can be further guaranteed through intrusion
detection methods in which network attack behavior can
be obtained and learned by data analysis and modeling.
According to the location of the deployment and the scope
of monitoring, IDS products can be loosely divided into
network intrusion detection system (NIDS) and host-based
intrusion detection system (HIDS) [6]. The NIDS works at
the network layer to detect network threats by taking all
traffic from the target network as its data source to protect
the entire network segment [7]–[9]. The HIDS serves as a
monitor and analyzer of a computer system that does not act
on the external interface, but focuses on the internal system
[10], [11]. This framework commonly analyzes system logs,
processes, or files to monitor the dynamic behavior of all or
part of the system and the state of the entire computer system.
In the network system, many devices or components require
IDS support such as web server, file server, and workstations
[12]. A scenario illustrating how IDS works at different sites
in the network system is provided in Fig. 1.
FIGURE 1: Intrusion Detection Systems Working at Different Places in the Network System.

The development of network technology and hardware devices creates issues for the application and upgrading of IDS [13]–[15]. Current challenges include the following: 1) Diversity: an increase in the types of network protocols makes it increasingly hard to distinguish between normal and abnormal data. 2) Low-frequency attacks: the imbalanced distribution of different attack types results in weak detection precision of IDSs, particularly for data-driven methods. 3) Adaptability: the diverse and flexible character of the network significantly shortens the lifespan of detection models, because an IDS requires updating to adapt to the evolving environment. 4) Placement: distributed, centralized, or hybrid deployments must be adopted according to specific considerations of financial, computational, and time costs. 5) Accuracy: existing traditional techniques cannot achieve the required high-level accuracy due to the aforementioned challenges. To ensure the performance of IDS, a deeper, more granular, and more comprehensive understanding of the nature of intrusion events is required.
Many scholars have carried out substantial work on intrusion detection, using methods including expert knowledge, data mining, and machine learning [16]–[18].
Among them, the deep learning method is unique, providing
a high level of detection performance. Deep neural networks mimic human nerves and use a large number of non-linear processing units to deal with complex problems [19]–[21]. They
can automatically learn features and extract core data information. Due to the improvement of hardware and optimization of the algorithm, recurrent neural network (RNN) has
received widespread acclaim. RNN has become a star model
in applications including natural language processing (NLP),
semantic understanding, and speech recognition [13], [22].
Learning to identify whether network traffic is normal or anomalous can be understood as learning to perform sentiment analysis or document classification given several sentences.
From this perspective, network intrusion detection is partly similar to sentiment analysis tasks, for which RNN-based methods have proven suitable.
In this study, network traffic activity is treated as a time-series event, meaning that the assessment of the traffic type
at the current time depends not only on the current data
but also on data at the previous moments. To provide the
ability to process such data, an RNN-based method is used
as a benchmark approach for intrusion detection. In reality,
the traffic information at different moments or features in a
sample of traffic contributes differently to the judgment of the
current traffic type. To take full advantage of this characteristic, an attention mechanism is adopted to enhance the model, with two kinds of attention applied to the features and the traffic slices, respectively. The attention mechanism
provides the ability of visualization to discern which feature
or traffic slice is important. The proposed attention-based
models are then evaluated on the benchmark dataset UNSW-NB15, which has been used frequently in various recent
studies. The experiments show that the proposed attention-based model outperforms other models. The main contributions of this paper
include:
1) Different features or traffic slices contribute uniquely to the classification of the current traffic. To account for this sensitivity, the proposed model includes two levels of attention mechanism: feature-based and slice-based.
The attention mechanism guides the model to provide
increased attention to some individual features or traffic
slices when constructing the representation of traffic
information. Based on the above, an attention map is
visualized, contributing to an understanding of the importance of features or slices of traffic.
2) Three RNN-based detection models are individually
compared with the different attention mechanisms of
no-attention, one-layer attention, and hierarchical attention. It is observed that the attention mechanism contributes to improved model performance. The influence of the timestep on the performance of the IDS is also studied, and the concept of cost-performance is applied to determine whether the value of the timestep should be increased.
3) The entire UNSW-NB15 dataset is utilized in this study,
rather than partial data. The results show that when the
timestep equals 10, the hierarchical attention model
achieves the highest detection accuracy of over 98.76%
and the false alarm rate (FAR) is as low as 1.49%.
The rest of this paper is organized as follows. Section II
details existing NIDS works, mainly using RNN as the base
model. In Section III, a number of basic methods of RNN
and attention mechanism are introduced. Section IV details
the proposed work, and Section V presents the results and
analysis of experiments. Finally, Section VI describes the
conclusion of this paper and the direction of future work.
II. RELATED WORKS
The three predominant types of NIDS are misuse-based,
anomaly-based, and hybrid. Among them, the misuse-based
method works by constructing a pattern matching template to
detect intrusion. The constructed template is built on artificial
knowledge and the analysis of existing data. The template
is fixed, so the benefit of this method exists in detecting
known attack types with high accuracy [23]. However, this
feature also leads to an inherent disadvantage of this method
as in a dynamic network environment, new attack types or
variations may appear at any time. It is thus difficult for the
misuse-based approach to perform adequately in the static
background [24], [25]. Another kind of intrusion detection
method is the anomaly-based approach, which operates by
only utilizing normal data so that samples with different
behaviors may all be judged as anomaly [26]. When an
attack occurs in a real device where the misuse-based method
is deployed, the NIDS will alert the alarm, but provide
no information about the exact attack type. However, the
disadvantage of this method is poor accuracy performance
as some attacks behave like normal data or it is difficult to
separate the attack data and the normal data in the extracted
features.
Several machine learning based approaches proposed in
previous studies have achieved success in intrusion detection
systems. In [27], Anwer et al. presented a framework for feature selection considering irrelevant and redundant features. In their model, five different kinds of feature selection
strategies are used and the J48 decision tree classifier with
gain ratio filter is determined to have the best performance.
In [28], Tian et al. proposed a robust and sparse method using
one-class support vector machine (OSVM), which aimed to
locate samples that are different from the majority of data.
However, the anomaly detection method is limited by outliers and noise during the training phase. To improve the performance of this model, the ramp loss function is adopted, making the algorithm more robust and sparse.
Deep learning has become an important branch of machine learning and the preferred solution to many problems. It has been applied in the intrusion detection field, achieving remarkable results. In [29], Khan et al.
presented a two-stage intrusion detection model based on the
stacked autoencoder network. In the initial phase, the traffic
is judged as normal or abnormal by the value of classification
probability. In the second stage, the result of the first stage
is regarded as an extra feature for the following multi-class
classification process. However, the detection accuracy could
only reach 89.134% on the UNSW-NB15 dataset. In [30],
Tian et al. presented a hybrid method of shallow and deep
learning using a stacked autoencoder to reduce the dimension
of features. The SVM is then combined with the artificial bee
colony algorithm for classification. Their experiments reached an accuracy of 89.62%.
Many scholars have explored works using RNN to solve
network intrusion detection problems. In 2012, Sheikhan et al. presented a three-layer RNN model to solve misuse-based intrusion detection problems [31]. The input features in their experiment are divided into four categories according to the feature attributes. However, the RNN in this method is reduced, meaning that the connections between the neural layers are partial, which diminishes performance. In
2016, Kim et al. explored the possibility of applying RNN
to intrusion detection using a variant of RNN to build an
intrusion detection model [32]. Instances from the KDD Cup99 dataset were extracted in their experiment, which focused on tuning the hyperparameters and evaluating model performance. In 2017, Yin et al. used a standard RNN to build
an IDS, and evaluated their approach with benchmark dataset
NSL-KDD [33]. In their work, the number of hidden nodes, the number of layers, and the learning rate were the main variables. Unfortunately, the accuracy of their proposed model is not adequate. In 2018, Xu et al. constructed a
new DNN model that applied gated recurrent unit (GRU)
and multilayer perceptron (MLP) to extract data information
[34]. Their simulation results show that the GRU cell can
be more effective than the long short-term memory (LSTM)
cell for the intrusion detection problem. In [35], Anani et al.
used the full KDD Cup99 dataset to compare the model
detection performance based on LSTM, bidirectional long
short-term memory (BiLSTM), skip-LSTM, and GRU. The
results illustrate that the GRU achieves superior performance
compared to other models. In [36], Agarap sought to
enhance the ability of classification by building a GRU model
and introducing linear SVM to replace the softmax classifier.
Similarly, L2-SVM loss function was adopted to replace the
cross-entropy function. In [37], Roy et al. selected samples from the UNSW-NB15 dataset and built a BiLSTM network.
Five features were selected, reaching an accuracy of over
95%. However, only part of the dataset is utilized in this
approach, which may cause some bias in the results. In [38],
the authors used the unsupervised version of different variants of RNN cells to construct an autoencoder for intrusion
detection. In [39], an end-to-end intrusion detection approach
was proposed. Network packets were adopted as the input
and processed sequentially. This method involves no feature engineering or domain knowledge; instead, the payloads are divided into characters, and the RNN model is trained to identify specific sequences. However, the drawback of this end-to-end approach is that it has too many parameters, which makes the model overly complex.
III. BASIC THEORY
A. GRU-BASED METHOD
The RNN is unique in that its neural units are self-connected, meaning that when the cycle unfolds, the data flow over time is preserved in the neurons [40]. The cyclic structure of the neurons enables them to preserve historical information and provides sequence modeling capabilities. The RNN calculates a mapping from the input $x = (x_1, x_2, \ldots, x_T)$ to the hidden state $h = (h_1, h_2, \ldots, h_T)$ as follows:

$$h_t = \sigma(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \qquad (1)$$

where $\sigma$ is a non-linear function and $t \in [1, T]$; $W_{xh}$ and $W_{hh}$ are the corresponding weight matrices and $b_h$ is a bias term.

As is well known, the gradient descent method is often used to train deep learning models, and back propagation (BP) is a way to obtain the gradient. In particular, back propagation through time (BPTT) is an algorithm that specifically solves the computation of the parameters in RNN models. However, limited by its structure, the gradient in an RNN can easily explode or vanish due to the repeated product of $W$ [41].

As the traditional RNN is limited by gradient vanishing or exploding, variants of the RNN have been proposed. The gated recurrent unit (GRU) was proposed to address such issues by introducing a gating mechanism [42]. There are two kinds of gates, the reset gate $r_t$ and the update gate $z_t$, which work together to decide the information update process. The structure of the GRU is shown in Fig. 2.

FIGURE 2: Structure of the Gated Recurrent Unit.

Suppose the current input is $x_t$; the new state $h_t$ at time $t$ combines two parts, the candidate state $\tilde{h}_t$ and the past state $h_{t-1}$:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \qquad (2)$$

The reset gate $r_t$ works in the process of deriving the candidate state. The way to obtain the candidate state is similar to that in the traditional RNN, except for the gate mechanism:

$$\tilde{h}_t = \tanh(x_t W_{xh} + W_{hh}(r_t \odot h_{t-1}) + b_h) \qquad (3)$$

where $\odot$ stands for the Hadamard product, $W$ is a weight matrix, and $b$ is a bias.

Here, $r_t$ helps to control how much information from the past state can be added into the candidate state. $r_t$ is updated as follows:

$$r_t = \sigma(x_t W_{xr} + h_{t-1} W_{hr} + b_r) \qquad (4)$$

According to the equation for the new state $h_t$, the update gate $z_t$ balances the previous state $h_{t-1}$ and the current candidate state $\tilde{h}_t$; $z_t$ can thus be regarded as a valve distributing the past information and the new information. The update of $z_t$ is similar to that of $r_t$:

$$z_t = \sigma(x_t W_{xz} + h_{t-1} W_{hz} + b_z) \qquad (5)$$

Former experiments have shown that the BiGRU cell performs better than three other cells, LSTM, GRU, and BiLSTM. A bidirectional GRU (BiGRU) is an enhanced version of the GRU that works in two directions, summarizing the forward information $\overrightarrow{h}_t$ and the backward information $\overleftarrow{h}_t$ to enhance the feature extraction ability:

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{GRU}}(x_t), \quad \overleftarrow{h}_t = \overleftarrow{\mathrm{GRU}}(x_t), \quad t \in [1, T] \qquad (6)$$
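To make the gating equations concrete, the following is a minimal NumPy sketch of a single GRU step following Eqs. (2)–(5); the weight shapes, the random initialization, and the toy sequence are illustrative assumptions, not the trained parameters of the proposed model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step following Eqs. (2)-(5); p holds the weight matrices and biases."""
    r_t = sigmoid(x_t @ p["W_xr"] + h_prev @ p["W_hr"] + p["b_r"])             # reset gate, Eq. (4)
    z_t = sigmoid(x_t @ p["W_xz"] + h_prev @ p["W_hz"] + p["b_z"])             # update gate, Eq. (5)
    h_cand = np.tanh(x_t @ p["W_xh"] + (r_t * h_prev) @ p["W_hh"] + p["b_h"])  # candidate state, Eq. (3)
    return (1.0 - z_t) * h_prev + z_t * h_cand                                 # new state, Eq. (2)

# Illustrative sizes: 196 input features (as in Section V) and 32 hidden units.
rng = np.random.default_rng(0)
n_in, n_h = 196, 32
p = {}
for gate in ("r", "z", "h"):
    p[f"W_x{gate}"] = rng.normal(scale=0.1, size=(n_in, n_h))
    p[f"W_h{gate}"] = rng.normal(scale=0.1, size=(n_h, n_h))
    p[f"b_{gate}"] = np.zeros(n_h)

h = np.zeros(n_h)
for x_t in rng.normal(size=(10, n_in)):  # a toy sequence of 10 timesteps
    h = gru_step(x_t, h, p)
```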
B. ATTENTION MECHANISM
The attention mechanism is inspired by human behavior: human attention arises, to some extent, when humans predominantly focus on particular local regions of an image or on special words in a sentence. The attention mechanism thus helps to fully utilize limited resources. The regular pipeline of the attention mechanism is illustrated in Fig. 3: the attention value is obtained from the pairing of keys and a query. The attention mechanism is not a specific method but a mode of thinking, which contains the two important components of addressing and calculating.

FIGURE 3: The Regular Pipeline of the Attention Mechanism.

Using an attention model, an input can be written as $X = [x_1, x_2, \ldots, x_n]$, where $n$ can be treated as the number of timesteps for 3-D data or the number of features for a 1-D vector. Addressing, also called the alignment score function, is used to obtain the attention probability, which represents how much weight should be given to the hidden state of each input. Numerous addressing methods are available; in this paper, the location-based attention and dot-product attention methods are utilized.

Location-based attention was initially proposed in [43] and is illustrated in Fig. 4. It computes the alignment solely from the current hidden state in a simple way:

$$\alpha_t = \mathrm{softmax}(W_a h_t) \qquad (7)$$

where $W_a$ is the weight matrix and $h_t$ is the current hidden state.

FIGURE 4: The Illustration of Location-based Attention.

Dot-product attention consists of three parts: a learned key matrix $K$, a value matrix $V$, and a query vector $q$. The process to obtain the attention vector is illustrated in Fig. 5. First, the key matrix is obtained:

$$K = \tanh(V W_a) \qquad (8)$$

where $W_a$ is a randomly initialized weight matrix. After determining the current key matrix, the similarity between the query and each key is calculated to obtain a normalized probability vector $d$, which is the weight vector:

$$d = \mathrm{softmax}(q K^T) \qquad (9)$$

Finally, the attention vector can be obtained by:

$$a = dV \qquad (10)$$

FIGURE 5: The Illustration of Dot-product Attention.

After deriving the probability vector, the final attention representation, that is, the context vector, can be calculated. Depending on the range of hidden states used, the attention mechanism can be divided into global attention and local attention [44]. In this paper, we use the global attention shown in Fig. 6. The global attention model absorbs all the hidden states when deriving the context vector $c_t$; calculated in this way, $c_t$ can capture relevant source-side features.

FIGURE 6: Global Attention Model.
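As a concrete illustration of Eqs. (8)–(10), the following NumPy sketch computes a dot-product attention vector over a toy set of value vectors; the sizes and the randomly initialized matrices are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(1)
T, d_v, d_q = 5, 12, 12            # 5 timesteps, 12-dimensional values/queries (toy sizes)
V = rng.normal(size=(T, d_v))      # value matrix: one hidden state per timestep
W_a = rng.normal(size=(d_v, d_q))  # randomly initialized weight matrix, as in Eq. (8)
q = rng.normal(size=(d_q,))        # query vector

K = np.tanh(V @ W_a)               # Eq. (8): learned key matrix
d = softmax(q @ K.T)               # Eq. (9): normalized weight vector
a = d @ V                          # Eq. (10): attention vector as a weighted sum of values
```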
IV. PROPOSED MODEL
The proposed model is introduced in this section. A hierarchical attention mechanism, consisting of feature-based attention and slice-based attention, is applied in the IDS. The overall architecture of the hierarchical attention intrusion detection model is shown in Fig. 7. The model consists of three main steps. To begin with, data preprocessing is required; the main operations at this stage include missing value processing, feature transformation, and feature normalization. Feature-based attention is then utilized to enhance the expression ability of the traffic features, and finally the slice-based attention is applied to several pieces of traffic data.
FIGURE 7: Proposed Model for Intrusion Detection.
A. FEATURE-BASED ATTENTION
Not all features have the same importance in the representation of a single piece of traffic information. Thus, to fully exploit certain features and capture the features that are truly significant to the representation of the traffic, the feature-based attention mechanism is adopted to determine which features should be the focus. Moreover, the location-based attention mechanism has no additional objects of interest and depends only on each input in the data source itself, so it is well suited to processing the input features.

Given a sample at time $i$ with $N$ dimensions, $X_i = [x_i^0, x_i^1, \ldots, x_i^{N-1}]$, the $\mathrm{softmax}$ function is adopted to obtain the probability vector, that is, the weight for each feature. The normalized weight of the $j$-th feature at time $i$ can be computed by:

$$\alpha_i^j = \mathrm{softmax}(x_i^j) = \frac{e^{x_i^j}}{\sum_{k=0}^{N-1} e^{x_i^k}} \qquad (11)$$

The value of $\alpha_i^j$ shows the importance of feature $j$. Based on the above definition, the final output $h_i^j$ with location-based attention can be derived as:

$$h_i^j = x_i^j \times \alpha_i^j \qquad (12)$$

In this part, a fully-connected layer with $\mathrm{softmax}$ activation is therefore adopted to determine the weight vector $\alpha$; the input $X_i$ is then multiplied element-wise by $\alpha$ to derive the output $h_i$.
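A minimal Keras sketch of this feature-based attention block, assuming the 196-dimensional input described in Section V; the layer arrangement is an illustrative reading of Eqs. (11)–(12), not the authors' exact code.

```python
import tensorflow as tf
from tensorflow.keras import layers

n_features = 196  # dimensionality after one-hot encoding (Section V)

inputs = layers.Input(shape=(n_features,))
# A Dense layer with softmax activation produces the weight vector alpha (Eq. 11);
# its width equals the input width, as described in Section V-C.
alpha = layers.Dense(n_features, activation="softmax")(inputs)
# Element-wise multiplication implements h_i = x_i * alpha_i (Eq. 12).
weighted = layers.Multiply()([inputs, alpha])
feature_attention = tf.keras.Model(inputs, weighted)
```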
B. SLICE-BASED ATTENTION
We believe the traffic data is time-related: traffic information at multiple adjacent moments helps significantly in judging the type of the current traffic. Thus, several pieces of traffic information are grouped together, which is called slice traffic. The dot-product attention is adopted here because optimized matrix multiplication routines reduce resource consumption during calculation.

For each timestep, the corresponding hidden state $h_i$ is fed through a single-layer perceptron to obtain $u_i$ as a hidden representation of $h_i$:

$$u_i = \tanh(W_w h_i + b_w) \qquad (13)$$

The importance of the piece of traffic at each moment $i$ is then evaluated using the similarity of $u_i$ with a context vector $u_s$. A normalized importance vector $\alpha$, also called the attention weight, is computed through a $\mathrm{softmax}$ function:

$$\alpha_i = \frac{\exp(u_i^T u_s)}{\sum_k \exp(u_k^T u_s)} \qquad (14)$$

The output of the slice-based attention is then computed as a weighted sum. The context vector $v$ can be regarded as a high-level representation of the slice traffic:

$$v = \sum_i \alpha_i h_i \qquad (15)$$
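As a concrete sketch, Eqs. (13)–(15) can be written as a custom Keras layer; the class name, weight names, and initializers are illustrative assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

class SliceAttention(layers.Layer):
    """Attention over timesteps following Eqs. (13)-(15)."""
    def build(self, input_shape):
        d = int(input_shape[-1])
        self.W_w = self.add_weight(name="W_w", shape=(d, d), initializer="glorot_uniform")
        self.b_w = self.add_weight(name="b_w", shape=(d,), initializer="zeros")
        self.u_s = self.add_weight(name="u_s", shape=(d, 1), initializer="glorot_uniform")

    def call(self, h):                                  # h: (batch, timesteps, d)
        u = tf.tanh(tf.matmul(h, self.W_w) + self.b_w)  # Eq. (13): hidden representation u_i
        scores = tf.matmul(u, self.u_s)                 # similarity with the context vector u_s
        alpha = tf.nn.softmax(scores, axis=1)           # Eq. (14): weights normalized over time
        return tf.reduce_sum(alpha * h, axis=1)         # Eq. (15): context vector v
```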
A summary of the algorithmic phases of the proposed hierarchical attention intrusion detection model is provided in Algorithm 1.
V. EXPERIMENT
A. DATASET
A modern dataset that can represent actual situations in a real network is required to build and evaluate the performance of an NIDS. The KDDCUP'99, NSL-KDD, and UNSW-NB15 datasets are compared in this paper, considering multiple factors such as dataset size, number of traffic types, and data distribution.
TABLE 1: Comparison of several training datasets for intrusion detection

Name        Year   Normal Samples   Anomaly Samples   Proportion (anomaly/normal)   Category   Incomplete Samples
KDDCUP'99   1999   972,781          2,952,869         3.03:1                        5          Many
NSL-KDD     2009   67,343           58,631            0.87:1                        5          None
UNSW-NB15   2015   56,000           119,341           2.13:1                        10         Few
Referring to Table 1 and Table 2, it can be determined that UNSW-NB15 is an ideal candidate dataset for intrusion detection. UNSW-NB15 was created by Moustafa et al. to overcome the shortcomings of KDDCUP'99 and has gradually become one of the benchmark datasets in the field of IDS [45].
Algorithm 1 Algorithm for the hierarchical attention intrusion detection model

Input: The training dataset X with n samples, each denoted x^(i), i ∈ (1, ..., n); the weight matrices W of the GRU cells, initialized along with the attention layer matrix W_a; the learning rate l; the number of timesteps N_t; the number of epochs K.
Output: The classification category y, together with the feature-based attention probability α1 and the slice-based attention probability α2.
1: Data preprocessing: fill missing values, transform the nominal features into numerical data, and normalize the numerical data into the range of 0 to 1;
2: Merge the current data x_t with the history data, where the length of the history data is determined by N_t;
3: for k = 1 : K do
4:    Obtain the feature-based attention probability: α1_t = softmax(x_t);
5:    s1_t = α1_t ⊙ x_t;
6:    Feed the BiGRU cells with s1_t and obtain the output [h_t, h'_t];
7:    u_t = tanh(W_w h_t + b_w);
8:    α2 = softmax(u_t);
9:    v = Σ_i α_i h_i;
10:   Train the model using the BPTT algorithm with learning rate l;
11:   Obtain the output o_t of the model;
12:   if o_t > 0.5 then
13:      y_t = 1
14:   else
15:      y_t = 0
16:   end if
17: end for
18: return y_t, α1, α2
TABLE 2: Comparison of several testing datasets for intrusion detection

Name        Year   Normal Samples   Anomaly Samples   Proportion (anomaly/normal)   Category   Incomplete Samples
KDDCUP'99   1999   60,591           250,436           4.13:1                        5          Many
NSL-KDD     2009   9,711            12,834            1.32:1                        5          None
UNSW-NB15   2015   37,000           45,332            1.22:1                        10         Few
UNSW-NB15 includes rich traffic types, so that it can more accurately reflect the characteristics of modern network traffic data. Ten types of traffic data exist: Normal, DoS, Fuzzers, Analysis, Exploits, Reconnaissance, Worms, Backdoors, Generic, and Shellcode. Besides, the distribution of normal and anomaly data is balanced in both the training and testing datasets.

Each sample in UNSW-NB15 contains 49 features, which can be divided into five sections, beginning with the flow features. The
detailed descriptions of every feature are listed in Table 3. The official website provides a pair of training and testing datasets; there are 82,332 records in the testing set, where normal traffic accounts for 45% and anomalies for 55%. The training set comprises 175,341 samples, with normal to abnormal records in a ratio of 32% to 68%. In this research, the entire UNSW-NB15 dataset is adopted for model evaluation and analysis.
To meet the requirements of the neural network's input format, data preprocessing is required, mainly including feature transformation and feature normalization. Feature transformation converts symbolic features, such as service, state, and proto, into numerical data. This step is necessary because neural network calculations only allow numerical operations. Several feature transformation techniques exist, among which one-hot encoding is frequently adopted, especially for attributes that are not ordinal and cannot be compared by value. After encoding, the dimension of the samples changes from 42 to 196.
Feature normalization is highly useful in deep learning methods and is utilized in most neural network calculations. This relates to the activation characteristics of neurons and the updating of the weights [46]: in some regions the response of neurons is stronger than in others, which accelerates training. In this paper, the min-max technique is adopted as follows:

$$x^* = \frac{x - \min}{\max - \min} \qquad (16)$$
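The preprocessing described above can be sketched with pandas and scikit-learn as follows; the CSV file name and column handling are illustrative assumptions about the official UNSW-NB15 files.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative path; the official UNSW-NB15 training CSV is assumed.
df = pd.read_csv("UNSW_NB15_training-set.csv")

# One-hot encode the symbolic features (proto, service, state).
df = pd.get_dummies(df, columns=["proto", "service", "state"])

# Min-max normalization of all feature columns, Eq. (16).
feature_cols = [c for c in df.columns if c not in ("id", "attack_cat", "label")]
df[feature_cols] = MinMaxScaler().fit_transform(df[feature_cols])
```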
B. EVALUATION
To evaluate the performance of a classifier, the confusion matrix is defined in Table 4. True Negative (TN) is the total number of normal examples correctly classified. False Negative (FN) is the number of attack samples wrongly judged as normal. True Positive (TP) is the number of attacks correctly classified. False Positive (FP) is the number of normal samples wrongly classified as attacks.
TABLE 4: Confusion Matrix for Binary Classification

                   Predicted Class
Actual Class       Attack    Normal
Attack             TP        FN
Normal             FP        TN
Based on the above definitions, other advanced metrics can be obtained. Accuracy is a good measure when the classes are balanced:

$$Accuracy = \frac{TP + TN}{TP + FP + FN + TN} \qquad (17)$$

The FAR is a traditional metric and reflects the situation in which records are misclassified. It is defined as follows:

$$FAR = \frac{1}{2}\left(\frac{FP}{FP + TP} + \frac{FN}{FN + TN}\right) \qquad (18)$$

Precision is the ratio of records correctly classified as attacks to all records detected as attacks, and Recall is the fraction of correctly classified attacks among all actual attack records:

$$Precision = \frac{TP}{TP + FP} \qquad (19)$$

$$Recall = \frac{TP}{TP + FN} \qquad (20)$$
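A small Python helper computing Eqs. (17)–(20) from the confusion-matrix counts (the counts in the example are toy values for illustration only; note that the FAR follows the paper's definition in Eq. (18)):

```python
def metrics(tp, fp, fn, tn):
    """Compute Accuracy, FAR, Precision, and Recall per Eqs. (17)-(20)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    far = 0.5 * (fp / (fp + tp) + fn / (fn + tn))  # Eq. (18), as defined in the paper
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, far, precision, recall

# Toy counts for illustration only.
print(metrics(tp=45000, fp=500, fn=600, tn=36000))
```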
C. MODEL CONFIGURATION AND TRAINING
In this paper, Keras with TensorFlow as the backend is used to build the model. To meet the input dimension requirement of the BiGRU, the dataset is reorganized into a 3-D shape, with all 196 features of a single piece of data arranged into a vector. Some samples at the end of the dataset are dropped in order to make the total number an integer multiple of the batch size. Thus, the final training dataset
has a shape of (175340, timestep, 196) and the testing dataset a shape of (82331, timestep, 196), where timestep is a hyperparameter representing the length of the historical events. To create such data, the TimeseriesGenerator in Keras is adopted.
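A sketch of this reshaping with Keras's TimeseriesGenerator; the placeholder arrays stand in for the preprocessed features and labels.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

timestep = 10
X = np.random.rand(1000, 196)           # placeholder feature matrix (1000 samples, 196 features)
y = np.random.randint(0, 2, size=1000)  # placeholder binary labels

# Each generated item is the sequence of `timestep` past samples used to classify the next one.
gen = TimeseriesGenerator(X, y, length=timestep, batch_size=1024)
batch_x, batch_y = gen[0]               # batch_x has shape (batch, timestep, 196)
```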
In the proposed model, to build the feature-based attention mechanism, a Dense layer with softmax activation is connected to the input layer; the number of hidden units in this dense layer equals that of the input layer. Two BiGRU layers, with 32 and 12 units respectively, are then stacked to process the time-series data. Each timestep produces an output, and the dot-product attention is applied to all the steps. Finally, Dense layers are connected to the output of the attention layer, and the output layer has only one unit. In the training phase, a batch size of 1024 is used and the Adam optimizer is adopted, with parameters lr = 0.1, beta_1 = 0.9, beta_2 = 0.999. Binary cross-entropy is adopted as the loss function.
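Putting the pieces together, the following is a minimal Keras sketch of the architecture described above (timestep = 10 as selected in Section V-D; the slice-based attention is written inline as a tanh-scored softmax weighted sum, an illustrative stand-in for the authors' exact implementation):

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

timestep, n_features = 10, 196  # timestep = 10, as selected in Section V-D

inputs = layers.Input(shape=(timestep, n_features))
# Feature-based attention: a softmax Dense as wide as the input, applied per timestep (Eqs. 11-12).
alpha = layers.TimeDistributed(layers.Dense(n_features, activation="softmax"))(inputs)
weighted = layers.Multiply()([inputs, alpha])
# Two stacked BiGRU layers with 32 and 12 units; every timestep emits an output.
h = layers.Bidirectional(layers.GRU(32, return_sequences=True))(weighted)
h = layers.Bidirectional(layers.GRU(12, return_sequences=True))(h)
# Slice-based attention (Eqs. 13-15): score each timestep, softmax over time, weighted sum.
u = layers.TimeDistributed(layers.Dense(24, activation="tanh"))(h)
scores = layers.TimeDistributed(layers.Dense(1, use_bias=False))(u)
alpha_t = layers.Softmax(axis=1)(scores)
v = layers.Lambda(lambda ts: tf.reduce_sum(ts[0] * ts[1], axis=1))([h, alpha_t])
outputs = layers.Dense(1, activation="sigmoid")(v)

model = models.Model(inputs, outputs)
model.compile(optimizer=optimizers.Adam(learning_rate=0.1, beta_1=0.9, beta_2=0.999),
              loss="binary_crossentropy", metrics=["accuracy"])
```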
D. RESULT AND ANALYSIS
The three different structures, no attention, single attention, and hierarchical attention, were individually explored in this research (other components were kept the same except the attention module). The influence of the timestep on the convergence performance was explored first: corresponding experiments were conducted on the hierarchical attention model with several randomly selected timesteps.

The convergence curves plotted in Fig. 8 illustrate how the loss function changes with the iterations during the training phase. It can be observed that the model finally converges regardless of the timestep. Additionally, the larger the timestep, the lower the loss value, meaning that model performance improves when the timestep is larger. The value of the timestep appears to have no effect on the speed of convergence.

FIGURE 8: Convergence Curve of the Hierarchical Attention Model During the Training Phase with Different Timesteps.
TABLE 3: UNSW-NB15 Dataset Features

Name                 Description
dur                  Record total duration
proto                Transaction protocol
service              Service, such as http, ftp, smtp, ssh, dns, and (-) if not a frequently used service
state                Indicates the state and its dependent protocol, e.g., ACC, CLO, CON, ECO, ECR, and (-) if not a used state
spkts                Source to destination packet count
dpkts                Destination to source packet count
sbytes               Source to destination transaction bytes
dbytes               Destination to source transaction bytes
rate                 Transaction bytes per second
sttl                 Source to destination time to live value
dttl                 Destination to source time to live value
sload                Source bits per second
dload                Destination bits per second
sloss                Source packets retransmitted or dropped
dloss                Destination packets retransmitted or dropped
sintpkt              Source interpacket arrival time (mSec)
dintpkt              Destination interpacket arrival time (mSec)
sjit                 Source jitter (mSec)
djit                 Destination jitter (mSec)
swin                 Source TCP window advertisement value
stcpb                Source TCP base sequence number
dtcpb                Destination TCP base sequence number
dwin                 Destination TCP window advertisement value
tcprtt               TCP connection setup round-trip time, the sum of synack and ackdat
synack               TCP connection setup time, the time between the SYN and the SYN_ACK packets
ackdat               TCP connection setup time, the time between the SYN_ACK and the ACK packets
smeansz              Mean of the flow packet size transmitted by the src
dmeansz              Mean of the flow packet size transmitted by the dst
trans_depth          Represents the pipelined depth into the connection of the http request or response transaction
response_body_len    Actual uncompressed content size of the data transferred from the server's http service
ct_srv_src           No. of connections that contain the same service (14) and source address (1) in 100 connections according to the last time (26)
ct_state_ttl         No. for each state (6) according to a specific range of values for source/destination time to live (10) (11)
ct_dst_ltm           No. of connections of the same destination address (3) in 100 connections according to the last time (26)
ct_src_dport_ltm     No. of connections of the same source address (1) and destination port (4) in 100 connections according to the last time (26)
ct_dst_sport_ltm     No. of connections of the same destination address (3) and source port (2) in 100 connections according to the last time (26)
ct_dst_src_ltm       No. of connections of the same source (1) and destination (3) address in 100 connections according to the last time (26)
is_ftp_login         If the ftp session is accessed by user and password then 1, else 0
ct_ftp_cmd           No. of flows that have a command in the ftp session
ct_flw_http_mthd     No. of flows that have methods such as Get and Post in the http service
ct_src_ltm           No. of connections of the same source address (1) in 100 connections according to the last time (26)
ct_srv_dst           No. of connections that contain the same service (14) and destination address (3) in 100 connections according to the last time (26)
is_sm_ips_ports      If the source (1) and destination (3) IP addresses are equal and the port numbers (2)(4) are equal, this variable takes value 1, else 0
attack_cat           The name of each attack category
label                0 for normal and 1 for attack records
The influence of the timestep on accuracy and false alarm rate was also studied, and the results are provided in Fig. 9. As illustrated in Fig. 9(a), as the value of the timestep increases, the accuracy on the testing dataset also increases gradually. The performance gain due to timestep growth is clearly evident: detection accuracy reaches 91.69% at timestep = 1 and 98.88% at timestep = 11.

To better characterize the impact of the timestep, a gain ratio is introduced, defined as the accuracy improvement for each additional timestep. Starting from timestep = 2, a vertical line serves as the indicator of the increase in accuracy: the longer the line, the greater the increase. Generally, the development of the model experiences a fast-ascension period and a slow-convergence period, which is evident in the proposed model. Although the length of the black line reaches its maximum when the timestep equals 2, the detection rate can still improve further and is therefore considered to remain in the fast-ascension period. When timestep = 11, the accuracy has improved to a relatively high level, the gain ratio is at its smallest, and there is little room for further improvement; at the same time, the black line nearly disappears. In summary, 10 is determined to be an optimal candidate value for the timestep parameter. Fig. 9(b) shows a comparison of the false alarm rate under different timestep values; the FAR stays at a relatively low level when timestep = 10, again indicating that the optimal value of the timestep is 10.
It can also be concluded that the hierarchical attention-based model performs best, and the single-level attention-based model performs better than the model with no attention mechanism. When the value of the timestep is small, for example when it equals 2 or 3, the attention mechanism
has a significant impact on the performance improvement of the model. As the timestep increases, the effect of the attention mechanism is gradually reduced. This is likely because the features extracted by the BiGRU with a high timestep value are sufficient to characterize the data, as illustrated by the performance of the model without the attention mechanism.

FIGURE 9: Experimental results on the testing dataset: (a) comparison of accuracy and (b) comparison of FAR, using different structures with different timesteps.

The detection process on the testing dataset using the hierarchical attention model with a timestep of 10 is shown in Fig. 10. The blue bold dots represent the real labels of the testing samples, the orange middle dots denote correct predictions, and the red small dots are incorrectly classified samples. In the beginning, despite the fluctuation in the classified data (orange dots), satisfactory performance is still achieved. However, as time passes and samples accumulate, the performance of the model degrades, beginning at roughly the 40,000th sample. This may be attributed to the emergence of new types of attacks as time goes on: for some features, certain values exist in the testing dataset that are not present in the training dataset. Another reason may be the fluctuation of the Normal data, especially in features like proto and state. Online learning could provide a solution to these problems, and the inclusion of online operations in this model will be considered in future research.

FIGURE 10: The Detection Process on the Testing Dataset using BiGRU with a Timestep of 10.

The proposed method was also compared with other works using the UNSW-NB15 dataset, as shown in Table 5. The comparison results further illustrate the effectiveness and improvement of the proposed hierarchical attention model.
TABLE 5: Comparison Between Our Proposed Model and Other Machine Learning Algorithms

Method                              Accuracy   Precision   Recall
Decision Tree (J48) [27]            88.3%      77.78%      94.59%
Ramp OCSVM [28]                     97.24%     91.33%      98.5%
Autoencoder [29]                    89.71%     89.74%      89.85%
Autoencoder & SVM [30]              89.62%     77.93%      94.18%
DAE-DFFNN [47]                      92.5%      98.2%       99%
DFEL-GBT [48]                       91.22%     90.38%      90.69%
BiLSTM [37]                         95.71%     100%        96.00%
Our Single Attention model          98.64%     98.54%      98.43%
Our Hierarchical Attention model    98.76%     99.35%      98.94%
E. VISUALIZATION OF ATTENTION
To validate that attention effectively helps to select informative features or pieces of traffic, two pieces of traffic were
randomly selected, representing normal and anomaly. The
attention probability was then visualized separately, using
both slice-based attention and feature-based attention. To
classify the current traffic, several previous traffic data points
were also considered.
Attention maps for a case of normal traffic and a case of anomaly events are illustrated in Fig. 11 and Fig. 12, respectively. The x-axis is the value of the timestep and the y-axis is the feature
number or feature name. There are two different ways to
illustrate attention probability. A bar chart is used to represent the slice-based attention, and the color block is adopted
for the illustration of feature-based attention. The darker the
color, the greater the probability.
In Fig. 11(a) and Fig. 11(b), the lower section of each subgraph displays the slice-based attention probability for 10 timesteps. The values of the slice-based attention probability at the various timesteps are close to one another, which is reasonable because the 10 traffic data points all belong to the normal class and are similar to each other. As can be seen from the dark areas of the attention map, the features extracted are similar across timesteps. This is also reasonable, as similar features have a similar effect on the final classification. Figure 11(a) also illustrates that dload may be the most important feature for this kind of normal traffic.

FIGURE 11: Attention Map for a Case of Normal Traffic: (a) and (b) show parts of the attention map.
An example of the attention map for anomaly traffic is provided in Fig. 12. The probability distribution is completely different from that of the normal case in Fig. 11. First, the features with strong responses at each timestep differ from one another; that is, the data points are all of different types. Further evidence that the data is of different types is provided by the assignment of the attention probability: when classifying the current data, the data at the other timesteps contributes nothing, which is reasonable. Additionally, features including sttl, dttl, dload, and ct_srv_dst play an important role in judging this kind of attack data. The attention mechanism strengthens the influence of these features, resulting in improved model performance.

FIGURE 12: Attention Map for a Case of Anomaly Traffic: (a) and (b) show parts of the attention map.
The hierarchical attention mechanism proposed in this
work not only enhances the detection ability, but also helps
to determine which feature plays a substantial role in the
detection process. Feature selection can thus be conducted
based on attention probability, which will be the focus of
future work.
VI. CONCLUSION
This paper presented an intrusion detection model with a hierarchical attention mechanism. Several pieces of traffic data are merged in order, and the influence of different numbers of previous traffic records on performance was investigated. The proposed
model was demonstrated to achieve satisfactory performance
on the UNSW-NB15 dataset, with accuracy of more than
98.76% and FAR lower than 1.2%. With the assistance of the attention mechanism, an attention map was presented. The
visualization may provide assistance for feature selection and
contributes to the understanding of the differences between
varied traffic classes. Future developments will focus on the evolution of the attention mechanism and on attempts at parallel computing. Besides, work will also be conducted on classifying specific types of attacks using the attention mechanism.
VII. ACKNOWLEDGMENT
The work was supported by the program for scientific research start-up funds of Guangdong Ocean University. Meanwhile, all the authors declare that there is no conflict of interest regarding the publication of this article. We gratefully thank the reviewers for very useful discussions.
REFERENCES
[1] C. Alcaraz, R. Roman, P. Najera, and J. Lopez, “Security of industrial
sensor network-based remote substations in the context of the internet of
things,” Ad Hoc Networks, vol. 11, no. 3, pp. 1091–1104, 2013.
[2] H. Huang, J. Yang, H. Huang, Y. Song, and G. Gui, “Deep learning for
super-resolution channel estimation and doa estimation based massive
mimo system,” IEEE Transactions on Vehicular Technology, vol. 67, no. 9,
pp. 8549–8560, 2018.
[3] F. Salo, A. B. Nassif, and A. Essex, “Dimensionality reduction with IG-PCA and ensemble classifier for network intrusion detection,” Computer Networks, vol. 148, pp. 164–175, 2019.
[4] A.-R. Sadeghi, C. Wachsmann, and M. Waidner, “Security and privacy challenges in industrial internet of things,” in 2015 52nd
ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6, IEEE,
2015.
[5] Y. Lin, M. Wang, X. Zhou, G. Ding, and S. Mao, “Dynamic spectrum
interaction of uav flight formation communication with priority: A deep
reinforcement learning approach,” IEEE Transactions on Cognitive Communications and Networking, 2020.
[6] S. Agrawal and J. Agrawal, “Survey on anomaly detection using data
mining techniques,” Procedia Computer Science, vol. 60, pp. 708–713,
2015.
[7] P. Mishra, E. S. Pilli, V. Varadharajan, and U. Tupakula, “Intrusion
detection techniques in cloud environment: A survey,” Journal of Network
and Computer Applications, vol. 77, pp. 18–47, 2017.
[8] T. Liu, Y. Guan, and Y. Lin, “Research on modulation recognition with
ensemble learning,” EURASIP Journal on Wireless Communications and
Networking, vol. 2017, no. 1, p. 179, 2017.
[9] Y. Lin, C. Wang, J. Wang, and Z. Dou, “A novel dynamic spectrum access
framework based on reinforcement learning for cognitive radio sensor
networks,” Sensors, vol. 16, no. 10, p. 1675, 2016.
[10] Z. Zhang, X. Guo, and Y. Lin, “Trust management method of d2d communication based on rf fingerprint identification,” IEEE Access, vol. 6,
pp. 66082–66087, 2018.
[11] H. Wang, L. Guo, Z. Dou, and Y. Lin, “A new method of cognitive signal
recognition based on hybrid information entropy and ds evidence theory,”
Mobile Networks and Applications, vol. 23, no. 4, pp. 677–685, 2018.
[12] R. Zuech, T. M. Khoshgoftaar, and R. Wald, “Intrusion detection and big
heterogeneous data: a survey,” Journal of Big Data, vol. 2, no. 1, p. 3, 2015.
[13] Y. Lin, X. Zhu, Z. Zheng, Z. Dou, and R. Zhou, “The individual identification method of wireless device based on dimensionality reduction
and machine learning,” The Journal of Supercomputing, vol. 75, no. 6,
pp. 3010–3027, 2019.
[14] C. Shi, Z. Dou, Y. Lin, and W. Li, “Dynamic threshold-setting for RF-powered cognitive radio networks in non-Gaussian noise,” Physical Communication, vol. 27, pp. 99–105, 2018.
[15] Y. Xiao, C. Xing, T. Zhang, and Z. Zhao, “An intrusion detection model
based on feature reduction and convolutional neural networks,” IEEE
Access, vol. 7, pp. 42210–42219, 2019.
[16] M. Ahmed, A. N. Mahmood, and J. Hu, “A survey of network anomaly
detection techniques,” Journal of Network and Computer Applications,
vol. 60, pp. 19–31, 2016.
[17] Y. Tu, Y. Lin, J. Wang, and J.-U. Kim, “Semi-supervised learning with generative adversarial networks on digital signal modulation classification,”
Comput. Mater. Continua, vol. 55, no. 2, pp. 243–254, 2018.
[18] Q. Shi, J. Kang, R. Wang, H. Yi, Y. Lin, and J. Wang, “A framework of intrusion detection system based on bayesian network in iot,” International Journal of Performability Engineering, vol. 14, no. 10, 2018.
[19] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521,
no. 7553, p. 436, 2015.
[20] D. Kwon, H. Kim, J. Kim, C. S. Sang, I. Kim, and K. J. Kim, “A survey of
deep learning-based network anomaly detection,” Clust. Comput., no. 5,
pp. 1–13, 2017.
[21] R. Wu, X. Chen, H. Han, H. Zhao, and Y. Lin, “Abnormal information
identification and elimination in cognitive networks,” International Journal
of Performability Engineering, vol. 14, no. 10, pp. 2271–2279, 2018.
[22] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio,
“Attention-based models for speech recognition,” in Advances in neural
information processing systems, pp. 577–585, 2015.
[23] S. T. Ikram and A. K. Cherukuri, “Improving accuracy of intrusion
detection model using pca and optimized svm,” Journal of computing and
information technology, vol. 24, no. 2, pp. 133–148, 2016.
[24] N. Shone, T. N. Ngoc, V. D. Phai, and Q. Shi, “A deep learning approach
to network intrusion detection,” IEEE Transactions on Emerging Topics in
Computational Intelligence, vol. 2, no. 1, pp. 41–50, 2018.
[25] F. Farahnakian and J. Heikkonen, “A deep auto-encoder based approach
for intrusion detection system,” in 2018 20th International Conference
on Advanced Communication Technology (ICACT), pp. 178–183, IEEE,
2018.
[26] M. S. Islam, W. Khreich, and A. Hamou-Lhadj, “Anomaly detection
techniques based on kappa-pruned ensembles,” IEEE Transactions on
Reliability, vol. 67, no. 1, pp. 212–229, 2018.
[27] H. M. Anwer, M. Farouk, and A. Abdel-Hamid, “A framework for efficient
network anomaly intrusion detection with features selection,” in 2018
9th International Conference on Information and Communication Systems
(ICICS), pp. 157–162, IEEE, 2018.
[28] Y. Tian, M. Mirzabagheri, S. M. H. Bamakan, H. Wang, and Q. Qu, “Ramp
loss one-class support vector machine; a robust and effective approach to
anomaly detection problems,” Neurocomputing, vol. 310, pp. 223–235,
2018.
[29] F. A. Khan, A. Gumaei, and A. Hussain, “A novel two-stage deep
learning model for efficient network intrusion detection,” IEEE Access,
vol. 7, pp. 30373–30385, 2019.
[30] Q. Tian, J. Li, and H. Liu, “A method for guaranteeing wireless communication based on a combination of deep and shallow learning,” IEEE
Access, vol. 7, pp. 38688–38695, 2019.
[31] M. Sheikhan, Z. Jadidi, and A. Farrokhi, “Intrusion detection using
reduced-size rnn based on feature grouping,” Neural Computing and
Applications, vol. 21, no. 6, pp. 1185–1190, 2012.
[32] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, “Long short term memory
recurrent neural network classifier for intrusion detection,” in Proc. Int.
Conf. Platform Technol. Service, pp. 1–5, 2016.
[33] C. L. Yin, Y. F. Zhu, J. L. Fei, and X. Z. He, “A deep learning approach for
intrusion detection using recurrent neural networks,” IEEE Access, vol. 5,
no. 99, pp. 21954–21961, 2017.
[34] C. Xu, J. Shen, X. Du, and F. Zhang, “An intrusion detection system using
a deep neural network with gated recurrent units,” IEEE Access, vol. 6,
pp. 48697–48707, 2018.
[35] W. Anani and J. Samarabandu, “Comparison of recurrent neural network
algorithms for intrusion detection based on predicting packet sequences,”
in 2018 IEEE Canadian Conf. on Electrical & Computer Engineering
(CCECE), pp. 1–4, IEEE, 2018.
[36] A. F. M. Agarap, “A neural network architecture combining gated recurrent unit (gru) and support vector machine (svm) for intrusion detection
in network traffic data,” in Proc. of the 2018 10th Int. Conf. on Machine
Learning and Computing, pp. 26–30, ACM, 2018.
[37] B. Roy and H. Cheung, “A deep learning approach for intrusion detection
in internet of things using bi-directional long short-term memory recurrent
neural network,” in 2018 28th Int. Telecommun. Netw. and Appl. Conf.
(ITNAC), pp. 1–6, IEEE, 2018.
[38] A. H. Mirza and S. Cosan, “Computer network intrusion detection using
sequential lstm neural networks autoencoders,” in Proc. IEEE Sign. Process. Commun. Appl. Conf. (SIU), pp. 1–4, IEEE, 2018.
[39] H. Liu, B. Lang, M. Liu, and H. Yan, “Cnn and rnn based payload
classification methods for attack detection,” Knowledge-Based Systems,
vol. 163, pp. 332–341, 2019.
[40] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,”
IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681,
1997.
[41] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural
Networks, vol. 61, pp. 85–117, 2015.
[42] T. A. Tang, L. Mhamdi, D. McLernon, S. A. R. Zaidi, and M. Ghogho,
“Deep recurrent neural network for intrusion detection in sdn-based networks,” in Proc. IEEE NetSoft, pp. 202–206, IEEE, 2018.
[43] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv preprint arXiv:1508.04025, 2015.
[44] Y. Guo, J. Ji, X. Lu, H. Huo, T. Fang, and D. Li, “Global-local attention
network for aerial scene classification,” IEEE Access, 2019.
[45] N. Moustafa and J. Slay, “The evaluation of network anomaly detection
systems: Statistical analysis of the unsw-nb15 data set and the comparison
with the kdd99 data set,” Information Security Journal: A Global Perspective, vol. 25, no. 1-3, pp. 18–31, 2016.
[46] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lock-free approach to
parallelizing stochastic gradient descent,” in Advances in neural information processing systems, pp. 693–701, 2011.
[47] A.-H. Muna, N. Moustafa, and E. Sitnikova, “Identification of malicious
activities in industrial internet of things based on deep learning models,”
Journal of Information Security and Applications, vol. 41, pp. 1–11, 2018.
[48] Y. Zhou, M. Han, L. Liu, J. S. He, and Y. Wang, “Deep learning approach
for cyberattack detection,” in IEEE INFOCOM 2018-IEEE Conference on
Computer Communications Workshops (INFOCOM WKSHPS), pp. 262–
267, IEEE, 2018.
JI WANG received the B.S. degree in electronics and communication technology from Liaoning University, China, in 1994, and the M.S. degree in engineering from Guangdong University of Technology in 2010. He is currently a Professor in the Institute of Electronics and Information Engineering, Guangdong Ocean University. He is the director of the Guangdong intelligent ocean sensor network and equipment engineering technology research center, a senior member of the China Electronics Society, and a member of the Guangdong electronic information education and reference committee. His main studying areas are wireless sensor networks, the ocean Internet of Things, information processing, and communication systems.
CHANG LIU received the B.S. degree in computer science and technology from Kharkiv National University of Ukraine in 2008 and the M.S. degree in computer science and technology from the same university in 2009. He received his doctorate in Radio Technology and Television Systems from the Kharkiv National University of Radio Electronics, Kharkiv, Ukraine, in 2013. He has been a teacher with the Heilongjiang Agricultural University of China since 2011, and became an Associate Professor in 2018. He is currently an Associate Professor in the Institute of Electronics and Information Engineering, Guangdong Ocean University. He is a member of IEEE. His main studying areas are signal processing and artificial intelligence.
YANG LIU received the B.S. degree in electronic information engineering from the College of Electronic Information, Northwestern Polytechnical University, Xi'an, China, in 2015, and the M.S. degree from the China Aerospace Science and Technology Corporation in 2018. He is currently working at the Beijing Institute of Astronautical Systems Engineering. His research interests include command and control, and information security.
YU YAN received the B.S. degree from the College of Information and Communication Engineering, Harbin Engineering University, Harbin, China, in 2019. She is currently pursuing a master's degree with the College of Information and Communication Engineering, Harbin Engineering University. Her current research interests include network intrusion detection, machine learning, and data analysis.