An Intrusion Detection Model with Hierarchical Attention Mechanism

CHANG LIU1 (Member, IEEE), YANG LIU2, YU YAN3 (Student Member, IEEE), AND JI WANG1

1 Guangdong Ocean University, Zhanjiang, Guangdong, 524088, China (e-mail: byndgjc@163.com; zjouwangji@163.com)
2 Beijing Institute of Astronautical Systems Engineering, Donggaodi, Fengtai District, Beijing, 100076, China (e-mail: yangliu_npu@163.com)
3 Harbin Engineering University, No. 145 Nantong Street, Nangang District, Harbin, Heilongjiang, 150001, China (e-mail: yanyuyikey@hrbeu.edu.cn)

Corresponding author: Ji Wang (e-mail: zjouwangji@163.com).

The work was supported by the program for scientific research start-up funds of Guangdong Ocean University.

ABSTRACT Network security has always been a hot topic, as security and reliability are vital to software and hardware. The network intrusion detection system (NIDS) is an effective solution for identifying attacks in computer and communication systems. A necessary condition for high-quality intrusion detection is the gathering of useful and precise intrusion information. Machine learning, particularly deep learning, has achieved much success in various fields of industry and academia due to its good ability for feature representation and extraction. In this paper, deep learning methods are integrated into the NIDS. Intrusion activity is regarded as a time-series event, and a bidirectional gated recurrent unit (GRU) based network intrusion detection model with a hierarchical attention mechanism is presented. The influence of different lengths of previous traffic on performance is then studied. Experiments are performed on the UNSW-NB15 dataset, in which the proposed hierarchical attention model achieves a satisfactory detection accuracy of more than 98.76% and a false alarm rate (FAR) of lower than 1.2%. An attention probability map reflecting the importance of features is then visualized using the attention mechanism. This visualization ability assists in understanding the varied importance of the same features for different traffic classes and can inform feature selection in the future.

INDEX TERMS Intrusion Detection System, Recurrent Neural Network, Attention Mechanism, Visualization.

I. INTRODUCTION
Vast amounts of data are generated, processed, and exchanged in the use and interaction of numerous devices. Such data has become the target of illegal activity, which has caused significant damage to network systems [1], [2]. Research into advanced security methods has become increasingly important in both industry and academia in order to consistently improve and update security threat detection [3]. The basic components of network security mechanisms include the firewall, user authentication technology, anti-virus software, and the intrusion detection system (IDS) [4], [5]. As a proactive security technology, an IDS monitors a host or network and raises an alert when an attack is detected.
Cybersecurity can be further guaranteed through intrusion detection methods in which network attack behavior is obtained and learned by data analysis and modeling. According to the location of deployment and the scope of monitoring, IDS products can be loosely divided into the network intrusion detection system (NIDS) and the host-based intrusion detection system (HIDS) [6]. The NIDS works at the network layer to detect network threats, taking all traffic from the target network as its data source to protect the entire network segment [7]–[9]. The HIDS serves as a monitor and analyzer of a computer system; it does not act on the external interface but focuses on the internal system [10], [11]. This framework commonly analyzes system logs, processes, or files to monitor the dynamic behavior of all or part of the system and the state of the entire computer system. In a network system, many devices or components require IDS support, such as web servers, file servers, and workstations [12]. A scenario illustrating how IDSs work at different sites in the network system is provided in Fig. 1.

FIGURE 1: Intrusion detection systems deployed at different places in the network system.

The development of network technology and hardware devices creates issues for the application and upgrading of IDSs [13]–[15]. Current challenges include the following:
1) Diversity: An increase in the number of network protocol types makes it increasingly hard to distinguish between normal and abnormal data.
2) Low-frequency attacks: The imbalanced distribution of different attack types results in weak detection precision of IDSs, particularly for data-driven methods.
3) Adaptability: The diverse and flexible characteristics of the network cause a significant reduction in the lifespan of detection models, because an IDS requires updating to adapt to the evolving environment.
4) Placement: Distributed, centralized, or hybrid methods must be adopted according to specific considerations of financial, computational, and time costs.
5) Accuracy: Existing traditional techniques cannot achieve the required high level of accuracy due to the aforementioned challenges.

To ensure the performance of an IDS, a deeper, more granular, and increasingly comprehensive understanding of the nature of intrusion events is required. Many scholars have worked on the problem of intrusion detection, using methods including expert knowledge, data mining, and machine learning [16]–[18]. Among these, the deep learning method stands out, providing a high level of detection performance. Deep neural networks mimic human nerves and use a large number of non-linear processing units to deal with complex problems [19]–[21]. They can automatically learn features and extract core data information.
Due to improvements in hardware and optimization of algorithms, the recurrent neural network (RNN) has received widespread acclaim. The RNN has become a star model in applications including natural language processing (NLP), semantic understanding, and speech recognition [13], [22]. Learning to identify whether network traffic is normal or anomalous can be understood as learning to perform sentiment analysis or document classification given several sentences. From this perspective, network intrusion detection is partly similar to sentiment analysis tasks, for which RNN-based methods have been suitable.

In this study, network traffic activity is treated as a time-series event, meaning that the assessment of the traffic type at the current time depends not only on the current data but also on data at previous moments. To provide the ability to process such data, an RNN-based method is used as a benchmark approach for intrusion detection. In reality, the traffic information at different moments, or the features within a sample of traffic, contribute differently to the judgment of the current traffic type. To take full advantage of this characteristic, the attention mechanism is adopted to enhance the model, with two kinds of attention mechanism applied respectively to the features and the traffic slice. The attention mechanism also provides the ability to visualize which feature or traffic slice is important. The proposed attention-based models are then evaluated on the benchmark dataset UNSW-NB15, which has been used frequently in various recent studies. The experiments show that the proposed attention-based model demonstrates superior performance compared with other models.

The main contributions of this paper include:
1) Different features or traffic slices contribute uniquely to the classification of the current traffic. To account for this sensitivity, the proposed model includes two levels of attention mechanism, feature-based and slice-based. The attention mechanism guides the model to pay increased attention to some individual features or traffic slices when constructing the representation of traffic information. Based on the above, an attention map is visualized, contributing to an understanding of the importance of features or slices of traffic.
2) Three RNN-based detection models, with no attention, one-layer attention, and hierarchical attention respectively, are compared individually. It is observed that the attention mechanism contributes to improved model performance. The influence of the timestep on the performance of the IDS is also studied, and the concept of cost-performance is applied to determine whether the value of the timestep should be increased.
3) The entire UNSW-NB15 dataset is utilized in this study, rather than partial data. The results show that when the timestep equals 10, the hierarchical attention model achieves the highest detection accuracy of over 98.76%, with a false alarm rate (FAR) as low as 1.49%.

The rest of this paper is organized as follows.
Section II details existing NIDS works, mainly those using the RNN as the base model. In Section III, the basic methods of the RNN and the attention mechanism are introduced. Section IV details the proposed work, and Section V presents the results and analysis of the experiments. Finally, Section VI presents the conclusion of this paper and directions for future work.

II. RELATED WORKS
The three predominant types of NIDS are misuse-based, anomaly-based, and hybrid. Among them, the misuse-based method works by constructing a pattern matching template to detect intrusions. The constructed template is built on artificial knowledge and the analysis of existing data. The template is fixed, so the benefit of this method lies in detecting known attack types with high accuracy [23]. However, this feature also leads to an inherent disadvantage: in a dynamic network environment, new attack types or variations may appear at any time, so it is difficult for the misuse-based approach, built against a static background, to perform adequately [24], [25]. The other kind of intrusion detection method is the anomaly-based approach, which operates by utilizing only normal data, so that samples with different behaviors may all be judged as anomalies [26]. When an attack occurs on a real device where the anomaly-based method is deployed, the NIDS will raise an alarm but provide no information about the exact attack type. A further disadvantage of this method is poor accuracy, as some attacks behave like normal data, or it is difficult to separate the attack data from the normal data in the extracted features.

Several machine learning based approaches proposed in previous studies have achieved success in intrusion detection systems. In [27], Anwer et al. presented a framework for feature selection considering irrelevant and redundant features. In their model, five different feature selection strategies are used, and the J48 decision tree classifier with a gain ratio filter is determined to have the best performance. In [28], Tian et al. proposed a robust and sparse method using a one-class support vector machine (OCSVM), which aims to locate samples that differ from the majority of the data. However, the anomaly method is limited by outliers and noise during the training phase; to improve the performance of the model, the Ramp loss function is adopted, making the algorithm more robust and sparse.

Deep learning has become an important branch of machine learning and the preferred solution to many problems. It has been applied in the intrusion detection field, achieving remarkable results. In [29], Khan et al. presented a two-stage intrusion detection model based on a stacked autoencoder network. In the initial phase, the traffic is judged as normal or abnormal by the value of the classification probability. In the second stage, the result of the first stage is regarded as an extra feature for the following multi-class classification process. However, the detection accuracy only reached 89.134% on the UNSW-NB15 dataset. In [30], Tian et al. presented a hybrid method of shallow and deep learning, using a stacked autoencoder to reduce the dimension of the features; an SVM is then combined with the artificial bee colony algorithm for classification. In their experiments, accuracy exceeded 89.62%.

Many scholars have explored the use of RNNs to solve network intrusion detection problems.
In 2012, Sheikhan et al. presented a three-layer RNN model for misuse-based intrusion detection [31]. The input features in their experiment are divided into four categories according to the feature attributes. However, the RNN model is reduced in this method, meaning that the connections between the neural layers are partial, which diminishes performance. In 2016, Kim et al. explored the possibility of applying RNNs to intrusion detection, using a variant of the RNN to build an intrusion detection model [32]. Instances from the KDD Cup99 dataset were extracted in their experiment, which focused on locating the hyper-parameters and evaluating model performance. In 2017, Yin et al. used a standard RNN to build an IDS and evaluated their approach on the benchmark dataset NSL-KDD [33]. In their work, the number of hidden nodes, the number of layers, and the learning rate were the main variables. Unfortunately, the accuracy of their proposed model is not adequate. In 2018, Xu et al. constructed a new DNN model that applied the gated recurrent unit (GRU) and the multilayer perceptron (MLP) to extract data information [34]. Their simulation results show that the GRU cell can be more effective than the long short-term memory (LSTM) cell for the intrusion detection problem. In [35], Anani et al. used the full KDD Cup99 dataset to compare the detection performance of models based on the LSTM, bidirectional long short-term memory (BiLSTM), skip-LSTM, and GRU. The results illustrate that the GRU achieves superior performance compared to the other models. In [36], Agarap et al. sought to enhance the classification ability by building a GRU model and introducing a linear SVM to replace the softmax classifier; similarly, the L2-SVM loss function was adopted to replace the cross-entropy function. In [37], Roy et al. selected samples from the UNSW-NB15 dataset and built a BiLSTM network. Five features were selected, reaching an accuracy of over 95%. However, only part of the dataset is utilized in this approach, which may bias the results. In [38], the authors used unsupervised versions of different variants of RNN cells to construct an autoencoder for intrusion detection. In [39], an end-to-end intrusion detection approach was proposed in which network packets are adopted as the input and processed sequentially. This method involves no feature engineering or domain knowledge; instead, the payloads are divided into characters that are used to train the RNN model to identify specific sequences. However, the drawback of this end-to-end approach is that it has too many parameters, which makes the model overly complex.

III. BASIC THEORY
A. GRU-BASED METHOD
The RNN is unique in that its neural units are self-connected, meaning that when the cycle unfolds, the data flow over time is preserved in the neurons [40]. The cyclic structure of the neurons enables them to preserve historical information and provides sequence modeling capabilities. The RNN calculates a mapping from the input $x = (x_1, x_2, \ldots, x_T)$ to the hidden states $h = (h_1, h_2, \ldots, h_T)$ as follows:
$h_t = \sigma(x_t W_{xh} + h_{t-1} W_{hh} + b_h)$ (1)

where $\sigma$ is a non-linear function and $t \in [1, T]$; $W_{xh}$ and $W_{hh}$ are the corresponding weight matrices, and $b_h$ is a bias term.

As is well known, the gradient descent method is often used to train deep learning models, and back propagation (BP) is the way to obtain the gradients. In particular, back propagation through time (BPTT) is the algorithm that computes the gradients of the parameters in RNN models. However, limited by its structure, the gradient in the RNN can easily explode or vanish due to the repeated product of $W$ [41].

As the traditional RNN is limited by gradient vanishing and exploding, variants of the RNN have been proposed. The gated recurrent unit (GRU) was proposed to address these issues by introducing a gating mechanism [42]; its structure is shown in Fig. 2.

FIGURE 2: Structure of the Gated Recurrent Unit.

There are two kinds of gates, the reset gate $r_t$ and the update gate $z_t$, which work together to decide the information update process. Suppose the current input is $x_t$; the new state $h_t$ at time $t$ combines two parts, the candidate state $\tilde{h}_t$ and the past state $h_{t-1}$:

$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ (2)

where $\odot$ stands for the Hadamard product. The reset gate $r_t$ works in the process of deriving the candidate state, which is obtained in a way similar to the traditional RNN except for the gate mechanism:

$\tilde{h}_t = \tanh(x_t W_{xh} + W_{hh}(r_t \odot h_{t-1}) + b_h)$ (3)

where $W$ is a weight matrix and $b$ is a bias. Here, $r_t$ helps to control how much information from the past state is added into the candidate state. It is updated as follows:

$r_t = \sigma(x_t W_{xr} + h_{t-1} W_{hr} + b_r)$ (4)

According to Eq. (2), the update gate $z_t$ balances the previous state $h_{t-1}$ and the current candidate state $\tilde{h}_t$; $z_t$ can thus be regarded as a valve that distributes the past information and the new information. The update of $z_t$ is similar to that of $r_t$:

$z_t = \sigma(x_t W_{xz} + h_{t-1} W_{hz} + b_z)$ (5)

A bidirectional GRU (BiGRU) is an enhanced version of the GRU that works in two directions. It summarizes the forward information $\overrightarrow{h}$ and the backward information $\overleftarrow{h}$ to enhance the feature extraction ability:

$\overrightarrow{h}_t = \overrightarrow{GRU}(x_t), \quad \overleftarrow{h}_t = \overleftarrow{GRU}(x_t), \quad t \in [1, T]$ (6)

Former experiments have shown that the BiGRU cell performs better than the other three cells, namely the LSTM, GRU, and BiLSTM.
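To make the gate interactions above concrete, the following is a minimal NumPy sketch of one GRU step and a bidirectional pass, written directly from Eqs. (2)-(6). It is an illustrative reading of the formulas, not the authors' implementation; the row-vector convention, the weight layout, and the random initialization are our own assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, b):
    """One GRU update, Eqs. (2)-(5). Row-vector convention:
    W[g] = (W_xg, W_hg) with shapes (n_in, n_h) and (n_h, n_h)."""
    z = sigmoid(x_t @ W["z"][0] + h_prev @ W["z"][1] + b["z"])             # Eq. (5)
    r = sigmoid(x_t @ W["r"][0] + h_prev @ W["r"][1] + b["r"])             # Eq. (4)
    h_cand = np.tanh(x_t @ W["h"][0] + (r * h_prev) @ W["h"][1] + b["h"])  # Eq. (3)
    return (1.0 - z) * h_prev + z * h_cand                                 # Eq. (2)

def init_params(n_in, n_h, rng):
    """Small random weights; the initialization scheme is assumed."""
    W = {g: (0.1 * rng.standard_normal((n_in, n_h)),
             0.1 * rng.standard_normal((n_h, n_h))) for g in "zrh"}
    b = {g: np.zeros(n_h) for g in "zrh"}
    return W, b

def bigru(x_seq, n_h, rng):
    """Bidirectional pass, Eq. (6): one GRU runs forward and one backward
    over the sequence; their states are concatenated at each timestep."""
    (Wf, bf), (Wb, bb) = init_params(x_seq.shape[1], n_h, rng), \
                         init_params(x_seq.shape[1], n_h, rng)
    T = len(x_seq)
    hf, hb = np.zeros(n_h), np.zeros(n_h)
    fwd, bwd = [], [None] * T
    for t in range(T):                 # forward direction
        hf = gru_step(x_seq[t], hf, Wf, bf)
        fwd.append(hf)
    for t in reversed(range(T)):       # backward direction
        hb = gru_step(x_seq[t], hb, Wb, bb)
        bwd[t] = hb
    return np.stack([np.concatenate([f, g]) for f, g in zip(fwd, bwd)])

# Example: 10 timesteps of 196-dim traffic features -> (10, 16) BiGRU outputs.
rng = np.random.default_rng(0)
outputs = bigru(rng.standard_normal((10, 196)), 8, rng)
```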
B. ATTENTION MECHANISM
The attention mechanism is inspired by human behavior: to some extent, human attention consists of predominantly focusing on particular local regions of an image or on special words in a sentence. The attention mechanism thus assists in fully utilizing limited resources. The regular process of the attention mechanism is illustrated in Fig. 3: the attention value is obtained from pairs of keys and a query. The attention mechanism is not a specific method but a mode of thinking, which contains the two important components of addressing and calculating.

FIGURE 3: The Regular Pipeline of the Attention Mechanism.

Using an attention model, an input can be written as $X = [x_1, x_2, \ldots, x_n]$, where $n$ can be treated as the number of timesteps for 3-D data or the number of features for a 1-D vector. Addressing, also called the alignment score function, is used to obtain the attention probability, which represents how much weight should be given to the hidden state of each input. Numerous addressing methods are available; in this paper, the location-based attention and dot-product attention methods are utilized.

Location-based attention was initially proposed in [43]. It computes the alignment from the generator state and the previous alignment alone, in a simple way:

$\alpha_t = softmax(W_a h_t)$ (7)

where $W_a$ is the weight matrix and $h_t$ is the current hidden state.

FIGURE 4: The Illustration of Location-based Attention.

Dot-product attention consists of three parts: a learned key matrix $K$, a value matrix $V$, and a query vector $q$. The process to obtain the attention vector is illustrated in Fig. 5. First, the key matrix is obtained:

$K = \tanh(V W_a)$ (8)

where $W_a$ is a randomly initialized weight matrix. After determining the current key matrix, the similarity between the query and each key value is calculated to obtain a normalized probability vector $d$, which is the weight vector:

$d = softmax(q K^T)$ (9)

Finally, the attention vector can be obtained by:

$a = dV$ (10)

FIGURE 5: The Illustration of Dot-product Attention.

After deriving the probability vector, the final attention representation, that is, the context vector, can be calculated. Depending on the range of hidden states used, the attention mechanism can be divided into global attention and local attention [44]. In this paper, we use the global attention shown in Fig. 6. The global attention model absorbs all the hidden states when deriving the context vector $c_t$; because of this way of calculating, $c_t$ can capture relevant source-side features.

FIGURE 6: Global Attention Model.
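As a concrete reading of Eqs. (7)-(10), the sketch below implements both addressing methods in NumPy. It is only an illustration of the formulas; the dimensions and the random initialization are assumed for the example and are not taken from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def location_based_attention(H, W_a):
    """Eq. (7): the weight of each hidden state depends on that state alone.
    H: (n, d) hidden states; W_a: (d,) learned projection."""
    return softmax(H @ W_a)

def dot_product_attention(V, q, W_a):
    """Eqs. (8)-(10). V: (n, d_v) value matrix; q: (d_k,) query vector;
    W_a: (d_v, d_k) randomly initialized weight matrix."""
    K = np.tanh(V @ W_a)   # Eq. (8): learned key matrix
    d = softmax(q @ K.T)   # Eq. (9): normalized weight vector over n values
    return d @ V           # Eq. (10): attention vector a = dV

# Example with n = 10 values of dimension d_v = 24 and a d_k = 16 query.
rng = np.random.default_rng(0)
V = rng.standard_normal((10, 24))
a = dot_product_attention(V, rng.standard_normal(16),
                          0.1 * rng.standard_normal((24, 16)))
```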
IV. PROPOSED MODEL
The proposed model is introduced in this section. A hierarchical attention mechanism of feature-based attention and slice-based attention is applied to the IDS. The overall architecture of the hierarchical attention intrusion detection model is shown in Fig. 7. The model consists of three main steps. First, data preprocessing is performed; the main operations at this stage include missing-value processing, feature transformation, and feature normalization. Feature-based attention is then utilized to enhance the expression ability of the traffic features, and finally slice-based attention is applied to several pieces of traffic data.

FIGURE 7: Proposed Model for Intrusion Detection.

A. FEATURE-BASED ATTENTION
Not all features have the same importance in the representation of a single piece of traffic information. Thus, to fully release the energy of some features and capture those that are truly significant for the representation of the traffic, the feature-based attention mechanism is adopted to determine which features should be the focus. Moreover, the location-based attention mechanism has no additional object of interest and is only related to each input in the data source itself, so it is very suitable for processing the input features.

Given a sample with $N$ dimensions, $X_i = [x_i^0, x_i^1, \ldots, x_i^{N-1}]$, the softmax function is adopted to obtain the probability vector, that is, the weight for each feature. The normalized weight of the $j$-th feature at time $i$ is computed as:

$\alpha_i^j = softmax(x_i^j) = \frac{e^{x_i^j}}{\sum_{k=0}^{N-1} e^{x_i^k}}$ (11)

The value of $\alpha_i^j$ shows the importance of feature $j$. Based on the above definition, the final output $h_i^j$ with location-based attention can be derived as:

$h_i^j = x_i^j \times \alpha_i^j$ (12)

In this part, a fully-connected layer with softmax activation is therefore adopted to determine the weight vector $\alpha$, and the input $X_i$ is multiplied by $\alpha$ to derive the output $h_i$.

B. SLICE-BASED ATTENTION
We believe the traffic data is time-related: traffic information from multiple adjacent moments helps significantly in judging the type of the current traffic. Thus, several pieces of traffic information are grouped together, which is called slice traffic. Dot-product attention is adopted here because the optimized matrix multiplication operations in the program reduce resource consumption during calculation.

For each timestep, the corresponding hidden state $h_i$ is fed through a single-layer perceptron to obtain $u_i$ as a hidden representation of $h_i$:

$u_i = \tanh(W_w h_i + b_w)$ (13)

The importance of each piece of traffic at moment $i$ is then evaluated by the similarity of $u_i$ with a learned context vector $u_s$. A normalized importance vector $\alpha$, also called the attention weight, is computed through a softmax function:

$\alpha_i = \frac{\exp(u_i^T u_s)}{\sum_i \exp(u_i^T u_s)}$ (14)

The output of the slice-based attention is then computed as a weighted sum. The resulting context vector $v$ can be regarded as a high-level representation of the slice traffic:

$v = \sum_i \alpha_i h_i$ (15)

A summary of the algorithmic phases of the proposed hierarchical attention intrusion detection model is provided in Algorithm 1 below.
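Putting the two levels together, the following NumPy sketch mirrors Eqs. (11)-(15) for one slice of traffic. The shapes (timestep records by N features, BiGRU outputs of width d) follow the description above, while the parameter values are assumed for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feature_attention(X):
    """Feature-based attention, Eqs. (11)-(12).
    X: (timestep, N) slice; each row holds one record's N features."""
    alpha1 = softmax(X, axis=1)   # Eq. (11): per-record feature weights
    return X * alpha1, alpha1     # Eq. (12): reweighted features

def slice_attention(H, W_w, b_w, u_s):
    """Slice-based attention, Eqs. (13)-(15).
    H: (timestep, d) BiGRU outputs; u_s: (d_u,) learned context vector."""
    U = np.tanh(H @ W_w + b_w)         # Eq. (13): hidden representation u_i
    alpha2 = softmax(U @ u_s, axis=0)  # Eq. (14): importance of each record
    return alpha2 @ H, alpha2          # Eq. (15): context vector v

# Example: a slice of 10 records with 196 features; BiGRU width d = 24.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 196))
S, a1 = feature_attention(X)
H = rng.standard_normal((10, 24))      # stand-in for the BiGRU outputs
v, a2 = slice_attention(H, 0.1 * rng.standard_normal((24, 24)),
                        np.zeros(24), rng.standard_normal(24))
```

The two probability vectors a1 and a2 are exactly the quantities visualized in the attention maps of Section V-E.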
V. EXPERIMENT
A. DATASET
A modern dataset that represents actual situations in real networks is required to build and evaluate the performance of an NIDS. The KDDCUP'99, NSL-KDD, and UNSW-NB15 datasets are compared in this paper, considering multiple factors such as dataset size, number of types, and data distribution.

TABLE 1: Comparison of several training datasets for intrusion detection

Name | Year | Normal Samples | Anomaly Samples | Proportion (anomaly/normal) | Category | Incomplete Samples
KDDCUP'99 | 1999 | 972,781 | 2,952,869 | 3.03:1 | 5 | Many
NSL-KDD | 2009 | 67,343 | 58,631 | 0.87:1 | 5 | None
UNSW-NB15 | 2015 | 56,000 | 119,341 | 2.13:1 | 10 | Few

Referring to Table 1 and Table 2, UNSW-NB15 can be determined to be an ideal candidate dataset for intrusion detection. UNSW-NB15 was created by Moustafa et al. to overcome the shortcomings of KDDCUP'99 and has gradually become one of the benchmark datasets in the field of IDS [45]. UNSW-NB15 includes rich traffic types, so it can more accurately reflect the characteristics of modern network traffic data. Ten types of traffic data exist: Normal, DoS, Fuzzers, Analysis, Exploits, Reconnaissance, Worms, Backdoors, Generic, and Shellcode. Besides, the distribution of normal and anomaly data is balanced in both the training and testing datasets.

Algorithm 1: Algorithm for the hierarchical attention intrusion detection model
Input: The training dataset X with n samples, each sample x(i), i ∈ (1, ..., n); the weight matrices W of the GRU cells, initialized along with the attention layer matrix Wa; learning rate l; number of timesteps Nt; number of epochs K.
Output: The classification category y, together with the feature-based attention probability α1 and the slice-based attention probability α2.
1: Data preprocessing: missing values are filled, the nominal features are transformed into numerical data, and the numerical data are normalized into the range of 0 to 1;
2: The current data xt is merged with the history data, where the length of the history data is determined by Nt;
3: for k = 1 : K do
4:   The feature-based attention probability is obtained: α1t = softmax(xt);
5:   st1 = α1t ⊙ xt;
6:   The BiGRU cells are fed with st1 and the output [ht, h't] is obtained;
7:   ut = tanh(Ww ht + bw);
8:   α2 = softmax(ut);
9:   v = Σi αi hi;
10:  The BPTT algorithm with learning rate l is used to train the model;
11:  The output ot of the model is obtained;
12:  if ot > 0.5 then
13:    yt = 1
14:  else
15:    yt = 0
16:  end if
17: end for
18: return yt, α1, α2

TABLE 2: Comparison of several testing datasets for intrusion detection

Name | Year | Normal Samples | Anomaly Samples | Proportion (anomaly/normal) | Category | Incomplete Samples
KDDCUP'99 | 1999 | 60,591 | 250,436 | 4.13:1 | 5 | Many
NSL-KDD | 2009 | 9,711 | 12,834 | 1.32:1 | 5 | None
UNSW-NB15 | 2015 | 37,000 | 45,332 | 1.22:1 | 10 | Few

Each sample in UNSW-NB15 contains 49 features, which can be divided into five sections of flow features; detailed descriptions of every feature are listed in Table 3. The official website provides a pair of training and testing datasets. There are 82,332 records in the testing set, where normal traffic accounts for 45% and anomalies for 55%.
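For orientation, a short pandas sketch below loads the training/testing split and reproduces the class proportions quoted here; the CSV file names follow the common public release of the dataset and may need adjusting to the local copy.

```python
import pandas as pd

# File names as commonly distributed; adjust paths to the local copy.
train = pd.read_csv("UNSW_NB15_training-set.csv")
test = pd.read_csv("UNSW_NB15_testing-set.csv")

# 'label' is 0 for normal and 1 for attack records (see Table 3).
print(train["label"].value_counts(normalize=True))  # roughly 32% / 68%
print(test["label"].value_counts(normalize=True))   # roughly 45% / 55%

# 'attack_cat' names the ten traffic categories listed above.
print(train["attack_cat"].value_counts())
```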
The training set comprises 175,341 samples, with a ratio of normal to abnormal records of 32% to 68%. In this research, the entire UNSW-NB15 dataset is adopted for model evaluation and analysis.

To meet the input format requirements of the neural network, data preprocessing is required, which mainly includes feature transformation and feature normalization. Feature transformation converts the symbolic features, such as service, state, and proto, into numerical data. This step is necessary because neural network calculations only allow numerical operations. Several feature transformation techniques exist, among which one-hot encoding is frequently adopted, especially for attributes that are not serializable and cannot be compared by value. After encoding, the dimension of the samples changes from 42 to 196.

Feature normalization is highly useful in deep learning methods and is utilized in most neural network calculations. This is related to the activation characteristics of neurons and the updating of the weights [46]: in several partitions the response of neurons is stronger than in other parts, which accelerates training. In this paper, the min-max technique is adopted:

$x^* = \frac{x - \min}{\max - \min}$ (16)
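A minimal preprocessing sketch consistent with the description above is given below: one-hot encoding of the three symbolic features followed by min-max normalization per Eq. (16). The column-alignment step and the fit-on-training-only choice are our assumptions; the paper does not spell out how categories unseen in one split are handled.

```python
import pandas as pd

SYMBOLIC = ["proto", "service", "state"]  # the symbolic features named above

def preprocess(train: pd.DataFrame, test: pd.DataFrame):
    """One-hot encode symbolic features, then min-max normalize (Eq. 16).
    Assumes the label columns were split off beforehand."""
    train_oh = pd.get_dummies(train, columns=SYMBOLIC, dtype=float)
    test_oh = pd.get_dummies(test, columns=SYMBOLIC, dtype=float)
    # Align columns so categories missing from one split do not change
    # the feature dimension (assumed handling, not specified in the paper).
    train_oh, test_oh = train_oh.align(test_oh, join="left", axis=1,
                                       fill_value=0)
    lo, hi = train_oh.min(), train_oh.max()
    span = (hi - lo).replace(0, 1)        # guard against constant columns
    return (train_oh - lo) / span, (test_oh - lo) / span
```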
B. EVALUATION
To evaluate the performance of a classifier, the confusion matrix is defined in Table 4. True Negative (TN) is the total number of normal examples correctly classified. False Negative (FN), contrary to TN, is the number of normal records wrongly judged. True Positive (TP) is the number of attacks correctly classified. False Positive (FP) is the number of attack samples that are wrongly assigned to the normal class.

TABLE 4: Confusion Matrix for Binary Classification

Actual \ Predicted | Attack | Normal
Attack | TP | FP
Normal | FN | TN

Based on the above definitions, other advanced metrics can be obtained. Accuracy is a good measure when the classes are balanced:

$Accuracy = \frac{TP + TN}{TP + FP + FN + TN}$ (17)

The FAR is a traditional metric reflecting how often records are misclassified; with the definitions above, it averages the misclassification rates of the two classes:

$FAR = \frac{1}{2}\left(\frac{FP}{FP + TP} + \frac{FN}{FN + TN}\right)$ (18)

Precision is the ratio of correctly classified attacks to all records detected as attacks, and Recall is the fraction of correctly classified attacks among all actual attacks (note that FP and FN are defined above with respect to the normal class):

$Precision = \frac{TP}{TP + FN}$ (19)

$Recall = \frac{TP}{TP + FP}$ (20)

C. MODEL CONFIGURATION AND TRAINING
In this paper, Keras with TensorFlow as the backend is used to build the model. To meet the input dimension requirement of the BiGRU, the dataset is reorganized into a 3-D shape, with all 196 features of a single piece of data arranged into a vector. Some samples at the end of the dataset are dropped to make the total number an integer multiple of the batch size. The final training dataset thus has a shape of (175340, timestep, 196), and the testing dataset has a shape of (82332, timestep, 196), where timestep is a hyper-parameter representing the length of historical events. To create such data, the TimeseriesGenerator in Keras is adopted.

In the proposed model, to build the feature-based attention mechanism, a Dense layer with softmax activation is connected to the input layer; the number of hidden units in this Dense layer equals that of the input layer. Two BiGRU layers with 32 and 12 units respectively are then stacked for processing the time-series data. Each timestep creates an output, and dot-product attention is applied to all the steps. Finally, Dense layers are connected to the output of the attention layer, and the output layer has a single unit.

In the training phase, a batch size of 1024 is used and the Adam optimizer is adopted, with parameters lr = 0.1, beta_1 = 0.9, and beta_2 = 0.999. The binary_crossentropy is adopted as the loss function.
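The configuration described in this subsection can be assembled in Keras roughly as follows. This is a hedged reconstruction from the text (a softmax Dense layer for feature attention, two stacked BiGRUs of 32 and 12 units, attention over all steps, and a single sigmoid output); details such as the size of the u_i projection are not stated in the paper and are assumed here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

TIMESTEP, N_FEAT = 10, 196  # values used in the experiments

inp = layers.Input(shape=(TIMESTEP, N_FEAT))

# Feature-based attention: Dense softmax over the features, then reweight.
alpha1 = layers.Dense(N_FEAT, activation="softmax")(inp)
s = layers.Multiply()([inp, alpha1])

# Two stacked BiGRU layers with 32 and 12 units, all timesteps returned.
h = layers.Bidirectional(layers.GRU(32, return_sequences=True))(s)
h = layers.Bidirectional(layers.GRU(12, return_sequences=True))(h)

# Slice-based attention, Eqs. (13)-(15): score each step, softmax over time.
u = layers.Dense(24, activation="tanh")(h)      # u_i = tanh(W_w h_i + b_w)
score = layers.Dense(1, use_bias=False)(u)      # similarity u_i^T u_s
alpha2 = layers.Softmax(axis=1)(score)          # weights over the timesteps
v = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, alpha2])

out = layers.Dense(1, activation="sigmoid")(v)

model = models.Model(inp, out)
# Optimizer parameters as reported in the text (lr = 0.1 is unusually large
# for Adam but is what the paper states).
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1,
                                                 beta_1=0.9, beta_2=0.999),
              loss="binary_crossentropy", metrics=["accuracy"])
```

Training would then follow the text: model.fit on sequences produced by TimeseriesGenerator, with a batch size of 1024.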
D. RESULT AND ANALYSIS
Three different structures, with no attention, single attention, and hierarchical attention, were explored individually in this research; all other components were kept the same except the attention module. The influence of the timestep on convergence was explored first: corresponding experiments were conducted on the hierarchical attention model with several randomly selected timestep values. The convergence curves plotted in Fig. 8 illustrate how the loss function changes with iterations during the training phase. It can be observed that the model finally converges regardless of the timestep. Additionally, the larger the timestep, the lower the loss value, meaning that model performance improves when the timestep is larger. Considering the speed of convergence, the value of the timestep appears to have no effect.

FIGURE 8: Convergence Curve of the Hierarchical Attention Model During the Training Phase with different timestep values.

TABLE 3: UNSW-NB15 Dataset Features

dur: Record total duration
proto: Transaction protocol
service: Service such as http, ftp, smtp, ssh, dns, and (-) if the service is not used
state: Indicates the state and its dependent protocol, e.g. ACC, CLO, CON, ECO, ECR, and (-) if the state is not used
spkts: Source to destination packet count
dpkts: Destination to source packet count
sbytes: Source to destination transaction bytes
dbytes: Destination to source transaction bytes
rate: Transaction bytes per second
sttl: Source to destination time to live value
dttl: Destination to source time to live value
sload: Source bits per second
dload: Destination bits per second
sloss: Source packets retransmitted or dropped
dloss: Destination packets retransmitted or dropped
sintpkt: Source interpacket arrival time (mSec)
dintpkt: Destination interpacket arrival time (mSec)
sjit: Source jitter (mSec)
djit: Destination jitter (mSec)
swin: Source TCP window advertisement value
stcpb: Source TCP base sequence number
dtcpb: Destination TCP base sequence number
dwin: Destination TCP window advertisement value
tcprtt: TCP connection setup round-trip time, the sum of synack and ackdat
synack: TCP connection setup time, the time between the SYN and the SYN_ACK packets
ackdat: TCP connection setup time, the time between the SYN_ACK and the ACK packets
smeansz: Mean of the flow packet size transmitted by the src
dmeansz: Mean of the flow packet size transmitted by the dst
trans_depth: Represents the pipelined depth into the connection of the http request or response transaction
response_body_len: Actual uncompressed content size of the data transferred from the server's http service
ct_srv_src: No. of connections that contain the same service (14) and source address (1) in 100 connections according to the last time (26)
ct_state_ttl: No. for each state (6) according to a specific range of values for the source/destination time to live (10) (11)
ct_dst_ltm: No. of connections of the same destination address (3) in 100 connections according to the last time (26)
ct_src_dport_ltm: No. of connections of the same source address (1) and destination port (4) in 100 connections according to the last time (26)
ct_dst_sport_ltm: No. of connections of the same destination address (3) and source port (2) in 100 connections according to the last time (26)
ct_dst_src_ltm: No. of connections of the same source (1) and destination (3) address in 100 connections according to the last time (26)
is_ftp_login: If the ftp session is accessed by user and password then 1, else 0
ct_ftp_cmd: No. of flows that have a command in the ftp session
ct_flw_http_mthd: No. of flows that have methods such as Get and Post in the http service
ct_src_ltm: No. of connections of the same source address (1) in 100 connections according to the last time (26)
ct_srv_dst: No. of connections that contain the same service (14) and destination address (3) in 100 connections according to the last time (26)
is_sm_ips_ports: If the source (1) and destination (3) IP addresses are equal and the port numbers (2)(4) are equal, this variable takes value 1, else 0
attack_cat: The name of each attack category
label: 0 for normal and 1 for attack records

The influence of the timestep on accuracy and false alarm rate was also studied, with the results provided in Fig. 9. As illustrated in Fig. 9(a), as the value of the timestep increases, the accuracy on the testing dataset also increases gradually. Impressively, the improvement due to timestep growth is clearly evident: detection accuracy reaches 91.69% when timestep = 1 and 98.88% when timestep = 11. To better characterize the impact of the timestep, a gain ratio is introduced, defined as the accuracy improvement for each additional timestep. Starting from timestep = 2, a vertical line serves as the indicator of the increase in accuracy: the longer the line, the greater the increase. Generally, the development of a model experiences a fast-ascension period and a slow-convergence period, which is evident in the proposed model. Although the length of the black line reaches its maximum when timestep equals 2, the detection rate can still improve further and the model is therefore considered to remain in the fast-ascension period. When timestep = 11, the accuracy has improved to a relatively high level, as shown by its smallest gain ratio, and there is little room for further improvement; at the same time, the black line nearly disappears. In summary, 10 is determined as an optimized candidate value for the timestep parameter. Fig. 9(b) shows a comparison of the false alarm rate under different timestep values: the FAR is retained at a relatively low level when timestep = 10, again indicating that the optimal value of the timestep is 10.

FIGURE 9: The Experiment Results on the Training and Testing Datasets: (a) comparison of accuracy and (b) comparison of FAR using different structures with different timestep values on the testing dataset.

It can also be concluded that the hierarchical attention-based model performs the best, and the single-level attention-based model performs better than the model with no attention mechanism. When the value of the timestep is small, for example 2 or 3, the attention mechanism has a significant impact on the performance improvement of the model.
As the timestep increases, the effect of the attention mechanism gradually diminishes. This is likely because the features extracted by the BiGRU with a high timestep value are sufficient to characterize the data, as illustrated by the performance of the model without the attention mechanism.

The detection process on the testing dataset using the hierarchical attention model with a timestep value of 10 is shown in Fig. 10.

FIGURE 10: The Detection Process on the Testing Dataset using BiGRU with a timestep of 10.

The blue bold dots represent the real labels of the testing samples, the orange middle dots denote correct predictions, and the red small dots are incorrectly classified samples. In the beginning, despite fluctuations in the classified data (orange dots), satisfying performance is still achieved. However, as time and samples accumulate, the performance of the model is reduced, beginning at roughly the 40,000th sample. This may be attributed to the emergence of new types of attacks as time goes on: for some features, certain values exist in the testing dataset that are not available in the training dataset. Another reason may be the fluctuation of Normal data, especially in features like proto and state. Online learning could provide a solution to these problems, and the inclusion of online operations in this model will be considered in future research.

The proposed method was also compared with other works using the UNSW-NB15 dataset, as shown in Table 5. The comparison results further illustrate the effectiveness and improvement of the proposed hierarchical attention model.

TABLE 5: Comparison Between Our Proposed Model and Other Machine Learning Algorithms

Method | Accuracy | Precision | Recall
Decision Tree (J48) [27] | 88.3% | 77.78% | 94.59%
Ramp OCSVM [28] | 97.24% | 91.33% | 98.5%
Autoencoder [29] | 89.71% | 89.74% | 89.85%
Autoencoder & SVM [30] | 89.62% | 77.93% | 94.18%
DAE-DFFNN [47] | 92.5% | 98.2% | 99%
DFEL-GBT [48] | 91.22% | 90.38% | 90.69%
BiLSTM [37] | 95.71% | 100% | 96.00%
Our Single Attention model | 98.64% | 98.54% | 98.43%
Our Hierarchical Attention model | 98.76% | 99.35% | 98.94%
E. VISUALIZATION OF ATTENTION
To validate that attention effectively helps to select informative features or pieces of traffic, two pieces of traffic were randomly selected, one normal and one anomalous, and their attention probabilities were visualized separately, using both slice-based attention and feature-based attention. To classify the current traffic, several previous traffic data points were also considered. Attention maps for a case of normal traffic and a case of anomalous traffic are illustrated in Fig. 11 and Fig. 12, respectively. The x-axis is the value of the timestep and the y-axis is the feature number or feature name. Two different illustrations of the attention probability are used: a bar chart represents the slice-based attention, and color blocks are adopted for the feature-based attention, where the darker the color, the greater the probability.

In Fig. 11(a) and Fig. 11(b), the lower section of each subgraph displays the slice-based attention probability for 10 timesteps. The values of the slice-based attention probability at the different timesteps are close, which is reasonable because the 10 traffic data points all belong to the normal class and are similar to each other. As can be seen from the dark areas of the attention map, the features extracted are similar across the timesteps. This is also reasonable, as similar features have a similar effect on the final classification. Fig. 11(a) also suggests that dload may be the most important feature for this kind of normal traffic.

An example of the attention map for anomalous traffic is provided in Fig. 12. The probability distribution is completely different from that of the normal case in Fig. 11. First, the features with strong responses at each timestep differ from each other, that is, the records are all of different data types; further evidence of this is provided by the assignment of the slice-based attention probability. Therefore, when classifying the current data, the data at the other timesteps contribute almost nothing, which is reasonable. Additionally, features including sttl, dttl, dload, and ct_srv_dst play an important role in judging this kind of attack data. The attention mechanism strengthens the presence of these features, resulting in improved model performance.

The hierarchical attention mechanism proposed in this work not only enhances the detection ability but also helps to determine which features play a substantial role in the detection process. Feature selection can thus be conducted based on the attention probability, which will be a focus of future work.
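A sketch of how such an attention map can be drawn from the two probability vectors is given below. It reproduces the bar-chart-plus-color-block layout described above with matplotlib; the array shapes are assumed from the model outputs, and the exact styling of the published figures is not reproduced.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_attention_map(alpha_feat, alpha_slice, feature_names):
    """alpha_feat: (timestep, N) feature-based attention probabilities;
    alpha_slice: (timestep,) slice-based attention probabilities."""
    fig, (ax_map, ax_bar) = plt.subplots(
        2, 1, sharex=True, gridspec_kw={"height_ratios": [4, 1]})

    # Color blocks for feature-based attention: darker = higher probability.
    ax_map.imshow(alpha_feat.T, aspect="auto", cmap="Greys")
    ax_map.set_yticks(np.arange(len(feature_names)))
    ax_map.set_yticklabels(feature_names)
    ax_map.set_ylabel("feature")

    # Bar chart for slice-based attention over the timesteps.
    ax_bar.bar(np.arange(len(alpha_slice)), alpha_slice)
    ax_bar.set_xlabel("timestep")
    ax_bar.set_ylabel("slice prob.")
    plt.tight_layout()
    plt.show()
```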
VI. CONCLUSION
This paper presented an intrusion detection model with a hierarchical attention mechanism. Several traffic records are merged in order, and the influence of the number of previous traffic records on performance was investigated. The proposed model was demonstrated to achieve satisfactory performance on the UNSW-NB15 dataset, with accuracy of more than 98.76% and a FAR lower than 1.2%. With the assistance of the attention mechanism, an attention map was presented; this visualization may assist feature selection and contribute to the understanding of the differences between traffic classes in the future. Future developments will focus on the evolution of the attention mechanism and attempts at parallel computing. Work will also be conducted on classifying specific types of attacks using the attention mechanism.

VII. ACKNOWLEDGMENT
The work was supported by the program for scientific research start-up funds of Guangdong Ocean University. The authors declare that there is no conflict of interest regarding the publication of this article. We gratefully acknowledge the very useful discussions with the reviewers.

REFERENCES
[1] C. Alcaraz, R. Roman, P. Najera, and J. Lopez, "Security of industrial sensor network-based remote substations in the context of the internet of things," Ad Hoc Networks, vol. 11, no. 3, pp. 1091–1104, 2013.
[2] H. Huang, J. Yang, H. Huang, Y. Song, and G. Gui, "Deep learning for super-resolution channel estimation and doa estimation based massive mimo system," IEEE Transactions on Vehicular Technology, vol. 67, no. 9, pp. 8549–8560, 2018.
[3] F. Salo, A. B. Nassif, and A. Essex, "Dimensionality reduction with ig-pca and ensemble classifier for network intrusion detection," Computer Networks, vol. 148, pp. 164–175, 2019.
[4] A.-R. Sadeghi, C. Wachsmann, and M. Waidner, "Security and privacy challenges in industrial internet of things," in 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6, IEEE, 2015.
[5] Y. Lin, M. Wang, X. Zhou, G. Ding, and S. Mao, "Dynamic spectrum interaction of uav flight formation communication with priority: A deep reinforcement learning approach," IEEE Transactions on Cognitive Communications and Networking, 2020.
[6] S. Agrawal and J. Agrawal, "Survey on anomaly detection using data mining techniques," Procedia Computer Science, vol. 60, pp. 708–713, 2015.
[7] P. Mishra, E. S. Pilli, V. Varadharajan, and U. Tupakula, "Intrusion detection techniques in cloud environment: A survey," Journal of Network and Computer Applications, vol. 77, pp. 18–47, 2017.
[8] T. Liu, Y. Guan, and Y. Lin, "Research on modulation recognition with ensemble learning," EURASIP Journal on Wireless Communications and Networking, vol. 2017, no. 1, p. 179, 2017.
[9] Y. Lin, C. Wang, J. Wang, and Z. Dou, "A novel dynamic spectrum access framework based on reinforcement learning for cognitive radio sensor networks," Sensors, vol. 16, no. 10, p. 1675, 2016.
[10] Z. Zhang, X. Guo, and Y. Lin, "Trust management method of d2d communication based on rf fingerprint identification," IEEE Access, vol. 6, pp. 66082–66087, 2018.
[11] H. Wang, L. Guo, Z. Dou, and Y. Lin, "A new method of cognitive signal recognition based on hybrid information entropy and ds evidence theory," Mobile Networks and Applications, vol. 23, no. 4, pp. 677–685, 2018.
[12] R. Zuech, T. M. Khoshgoftaar, and R. Wald, "Intrusion detection and big heterogeneous data: a survey," Journal of Big Data, vol. 2, no. 1, p. 3, 2015.
[13] Y. Lin, X. Zhu, Z. Zheng, Z. Dou, and R. Zhou, "The individual identification method of wireless device based on dimensionality reduction and machine learning," The Journal of Supercomputing, vol. 75, no. 6, pp. 3010–3027, 2019.
[14] C. Shi, Z. Dou, Y. Lin, and W. Li, "Dynamic threshold-setting for rf-powered cognitive radio networks in non-gaussian noise," Physical Communication, vol. 27, pp. 99–105, 2018.
[15] Y. Xiao, C. Xing, T. Zhang, and Z. Zhao, "An intrusion detection model based on feature reduction and convolutional neural networks," IEEE Access, vol. 7, pp. 42210–42219, 2019.
[16] M. Ahmed, A. N. Mahmood, and J. Hu, "A survey of network anomaly detection techniques," Journal of Network and Computer Applications, vol. 60, pp. 19–31, 2016.
[17] Y. Tu, Y. Lin, J. Wang, and J.-U. Kim, "Semi-supervised learning with generative adversarial networks on digital signal modulation classification," Computers, Materials & Continua, vol. 55, no. 2, pp. 243–254, 2018.
[18] Q. Shi, J. Kang, R. Wang, H. Yi, Y. Lin, and J. Wang, "A framework of intrusion detection system based on bayesian network in iot," International Journal of Performability Engineering, vol. 14, no. 10, 2018.
FIGURE 11: Attention Map for a Case of Normal Traffic, parts (a) and (b).

[19] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[20] D. Kwon, H. Kim, J. Kim, S. C. Suh, I. Kim, and K. J. Kim, "A survey of deep learning-based network anomaly detection," Cluster Computing, no. 5, pp. 1–13, 2017.
[21] R. Wu, X. Chen, H. Han, H. Zhao, and Y. Lin, "Abnormal information identification and elimination in cognitive networks," International Journal of Performability Engineering, vol. 14, no. 10, pp. 2271–2279, 2018.
[22] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, pp. 577–585, 2015.
[23] S. T. Ikram and A. K. Cherukuri, "Improving accuracy of intrusion detection model using pca and optimized svm," Journal of Computing and Information Technology, vol. 24, no. 2, pp. 133–148, 2016.
[24] N. Shone, T. N. Ngoc, V. D. Phai, and Q. Shi, "A deep learning approach to network intrusion detection," IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 1, pp. 41–50, 2018.
[25] F. Farahnakian and J. Heikkonen, "A deep auto-encoder based approach for intrusion detection system," in 2018 20th International Conference on Advanced Communication Technology (ICACT), pp. 178–183, IEEE, 2018.
[26] M. S. Islam, W. Khreich, and A. Hamou-Lhadj, "Anomaly detection techniques based on kappa-pruned ensembles," IEEE Transactions on Reliability, vol. 67, no. 1, pp. 212–229, 2018.
[27] H. M. Anwer, M. Farouk, and A. Abdel-Hamid, "A framework for efficient network anomaly intrusion detection with features selection," in 2018 9th International Conference on Information and Communication Systems (ICICS), pp. 157–162, IEEE, 2018.
[28] Y. Tian, M. Mirzabagheri, S. M. H. Bamakan, H. Wang, and Q. Qu, "Ramp loss one-class support vector machine; a robust and effective approach to anomaly detection problems," Neurocomputing, vol. 310, pp. 223–235, 2018.
[29] F. A. Khan, A. Gumaei, and A. Hussain, "A novel two-stage deep learning model for efficient network intrusion detection," IEEE Access, vol. 7, pp. 30373–30385, 2019.
[30] Q. Tian, J. Li, and H. Liu, "A method for guaranteeing wireless communication based on a combination of deep and shallow learning," IEEE Access, vol. 7, pp. 38688–38695, 2019.
[31] M. Sheikhan, Z. Jadidi, and A. Farrokhi, "Intrusion detection using reduced-size rnn based on feature grouping," Neural Computing and Applications, vol. 21, no. 6, pp. 1185–1190, 2012.
FIGURE 12: Attention Map for a Case of Anomaly Traffic, parts (a) and (b).

[32] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long short term memory recurrent neural network classifier for intrusion detection," in Proc. Int. Conf. Platform Technology and Service, pp. 1–5, 2016.
[33] C. L. Yin, Y. F. Zhu, J. L. Fei, and X. Z. He, "A deep learning approach for intrusion detection using recurrent neural networks," IEEE Access, vol. 5, no. 99, pp. 21954–21961, 2017.
[34] C. Xu, J. Shen, X. Du, and F. Zhang, "An intrusion detection system using a deep neural network with gated recurrent units," IEEE Access, vol. 6, pp. 48697–48707, 2018.
[35] W. Anani and J. Samarabandu, "Comparison of recurrent neural network algorithms for intrusion detection based on predicting packet sequences," in 2018 IEEE Canadian Conference on Electrical & Computer Engineering (CCECE), pp. 1–4, IEEE, 2018.
[36] A. F. M. Agarap, "A neural network architecture combining gated recurrent unit (gru) and support vector machine (svm) for intrusion detection in network traffic data," in Proceedings of the 2018 10th International Conference on Machine Learning and Computing, pp. 26–30, ACM, 2018.
[37] B. Roy and H. Cheung, "A deep learning approach for intrusion detection in internet of things using bi-directional long short-term memory recurrent neural network," in 2018 28th International Telecommunication Networks and Applications Conference (ITNAC), pp. 1–6, IEEE, 2018.
[38] A. H. Mirza and S. Cosan, "Computer network intrusion detection using sequential lstm neural networks autoencoders," in Proc. IEEE Signal Processing and Communications Applications Conference (SIU), pp. 1–4, IEEE, 2018.
[39] H. Liu, B. Lang, M. Liu, and H. Yan, "Cnn and rnn based payload classification methods for attack detection," Knowledge-Based Systems, vol. 163, pp. 332–341, 2019.
[40] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[41] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015.
[42] T. A. Tang, L. Mhamdi, D. McLernon, S. A. R. Zaidi, and M. Ghogho, "Deep recurrent neural network for intrusion detection in sdn-based networks," in Proc. IEEE NetSoft, pp. 202–206, IEEE, 2018.
[43] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.
[44] Y. Guo, J. Ji, X. Lu, H. Huo, T. Fang, and D. Li, "Global-local attention network for aerial scene classification," IEEE Access, 2019.
[45] N. Moustafa and J. Slay, "The evaluation of network anomaly detection systems: Statistical analysis of the unsw-nb15 data set and the comparison with the kdd99 data set," Information Security Journal: A Global Perspective, vol. 25, no. 1-3, pp. 18–31, 2016.
[46] B. Recht, C. Re, S. Wright, and F. Niu, "Hogwild: A lock-free approach to parallelizing stochastic gradient descent," in Advances in Neural Information Processing Systems, pp. 693–701, 2011.
[47] A.-H. Muna, N. Moustafa, and E. Sitnikova, "Identification of malicious activities in industrial internet of things based on deep learning models," Journal of Information Security and Applications, vol. 41, pp. 1–11, 2018.
[48] Y. Zhou, M. Han, L. Liu, J. S. He, and Y. Wang, "Deep learning approach for cyberattack detection," in IEEE INFOCOM 2018-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 262–267, IEEE, 2018.

JI WANG received the B.S. degree in electronics and communication technology from Liaoning University, China, in 1994, and the M.S. degree in engineering from the Guangdong University of Technology in 2010. He is currently a Professor with the Institute of Electronics and Information Engineering, Guangdong Ocean University. He is the director of the Guangdong Intelligent Ocean Sensor Network and its Equipment Engineering Technology Research Center, a senior member of the China Electronics Society, and a member of the Guangdong Electronic Information Education and Reference Committee. His main research areas are wireless sensor networks, the ocean Internet of Things, information processing, and communication systems.

CHANG LIU received the B.S. degree in computer science and technology from Kharkiv National University, Ukraine, in 2008, and the M.S. degree in computer science and technology from the same university in 2009. He received his doctorate in radio technology and television systems from the Kharkiv National University of Radio Electronics, Kharkiv, Ukraine, in 2013. He has been a teacher with the Heilongjiang Agricultural University of China since 2011 and became an Associate Professor in 2018. He is currently an Associate Professor with the Institute of Electronics and Information Engineering, Guangdong Ocean University. He is a member of IEEE. His main research areas are signal processing and artificial intelligence.

YANG LIU received the B.S. degree in electronic information engineering from the College of Electronic Information, Northwestern Polytechnical University, Xi'an, China, in 2015, and the M.S. degree from the China Aerospace Science and Technology Corporation in 2018. He is currently working at the Beijing Institute of Astronautical Systems Engineering. His research interests include command and control and information security.

YU YAN received the B.S. degree from the College of Information and Communication Engineering, Harbin Engineering University, Harbin, China, in 2019, where she is currently pursuing the master's degree. Her current research interests include network intrusion detection, machine learning, and data analysis.