A Comparative Study of Graph Neural Network and Transformer Based Approaches for Anomaly Detection in Multivariate Time Series in IoT

Halah Shehada, Baza Somda
Bradley Department of Electrical and Computer Engineering
Virginia Polytechnic Institute and State University
Blacksburg, VA 24060
shehada@vt.edu, bazarod@vt.edu

1 Introduction

The rapid expansion of the Internet of Things (IoT) has produced massive amounts of multivariate time series data. These data frequently contain anomalies that may indicate system faults, cyberattacks, or other unforeseen events, so the dependability and security of IoT systems depend on efficient and accurate anomaly detection. Graph Neural Network (GNN) and Transformer-based models are two recent techniques that show promise for this purpose. This report compares the two approaches for anomaly detection in multivariate time series in IoT, with a particular emphasis on the performance of the models proposed in [1] and [2].

A method for anomaly detection in multivariate time series based on graph neural networks is proposed in [1]. The method exploits the underlying graph structure of the data to describe interactions between variables, producing a more precise and interpretable representation of the time series. The authors demonstrate the effectiveness of their method on several real-world datasets, reporting better accuracy and scalability than existing methods.

In [2], a method for learning the graph structures found in multivariate time series data for anomaly detection in IoT is presented. This method relies on the representation learning capabilities of the Transformer model. By capturing intricate dependencies and temporal patterns in the data, the approach showed notable advantages over conventional approaches and other deep learning techniques.
In this work, we evaluate these two techniques, analyze their individual strengths and shortcomings, and provide insights into their applicability in various IoT scenarios. Our goal is to give a clear picture of the capabilities and limitations of GNN and Transformer-based models for anomaly detection in multivariate time series in IoT, and ultimately to guide practitioners and researchers in selecting the most suitable approach for their specific anomaly detection tasks.

2 Similarities

Both publications address anomaly detection in multivariate time series data. This problem is particularly important for Internet of Things (IoT) applications, where many sensors and devices produce interconnected data. Both papers model the relationships between data variables with graph-based techniques and improve anomaly detection by exploiting the underlying dependencies and structure. A Graph Neural Network (GNN) model is used for this purpose in [1], whereas a Transformer-based model is used in [2]. Both models are built to capture complex relationships and patterns in the data. Both publications also rely on unsupervised learning, which suits anomaly detection because labeled data is often difficult or expensive to obtain in real-world contexts. Finally, both evaluate their methods on several datasets to demonstrate their ability to identify anomalies in multivariate time series data.

Table 1: Description of the datasets

Name   Num. Sensors   Train     Test     Anomalies
SWaT   51             47515     44986    11.97%
WADI   127            118795    17275    5.99%

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
3 Differences

Despite these commonalities, the two articles differ in several respects, particularly in the neural network architecture and the method of graph construction. In [1], structure learning is combined with graph neural networks (GNNs), and attention weights are used to improve the explainability of the detected anomalies. In contrast, [2] presents the Graph learning with Transformer for Anomaly detection (GTA) framework, which includes a connection learning policy based on Gumbel-softmax sampling to learn the bi-directional links between sensors directly. The framework uses a Transformer-based architecture to model temporal dependencies and introduces a novel graph convolution called Influence Propagation convolution.

4 Analysis of GNN-based anomaly detection method (GDN)

In both papers, the data consist of measurements from multiple sensors over a period of time (multivariate time series data). As this is an unsupervised learning setup, the training dataset is composed only of normal, unlabeled data. The testing dataset contains both normal and attack data. The objective of the frameworks is to identify attacks through anomaly detection. In this milestone report, we focus on the description and reproduction of the Graph Deviation Network (GDN) approach introduced in [1]. The method is tested by running experiments on the datasets described in Table 1.

4.1 Description of GDN

The method proposed in [1] learns the relationships between sensors as a graph. Deviations from the learned patterns are then identified as anomalies and explained. The framework is composed of four components:

1. Sensor Embedding: Considering a system with N sensors, an embedding vector v_i is introduced for each sensor. The embeddings represent the characteristics of each sensor.
These vectors are randomly initialized and trained during the learning process.

2. Graph Structure Learning: The relationships between sensors are learned as a graph structure (a directed graph in this case). In the directed graph, sensors are represented by nodes and relationships are represented by edges. A learned adjacency matrix A represents the full graph. The similarity e_{ji} between two nodes (sensors) is computed as the cosine similarity between their respective embedding vectors, as given in Equation (1). For each node, the k nodes with the highest similarity to it are selected (TopK), where k is chosen according to the desired sparsity level of the graph.

\[ e_{ji} = \frac{v_i^{\top} v_j}{\lVert v_i \rVert \cdot \lVert v_j \rVert} \tag{1} \]

3. Graph Attention-based Forecasting: The objective of this component is to determine whether sensors are deviating from normal behavior and in what way. This is accomplished by predicting each sensor's expected behavior at every time step from historical data and comparing the predicted behavior with the actual behavior. The graph attention-based feature extractor computes attention coefficients between each node and its neighbors in the learned graph using a Leaky ReLU activation function and normalizes them using a Softmax function. The normalized coefficients are then used to aggregate each node's information with that of its neighbors, followed by a ReLU activation function. Based on the node representations obtained through the feature extractor, the sensor values at the following time step are predicted. The model is trained by minimizing the loss function, which is the Mean Squared Error between the predicted values and the observed data.

Table 2: Results of the reimplementation

                 SWaT                               WADI
Metric           GDN Paper   Ours     Difference    GDN Paper   Ours     Difference
Precision (%)    99.35       98.30    −1.05         97.50       93.05    −4.45
Recall (%)       68.72       68.46    −0.26         40.19       29.00    −11.19
F1               0.81        0.807    −0.003        0.57        0.44     −0.13

4.
Graph Deviation Scoring: The overall anomalousness score of the measurements at each time step is computed. The event is classified as an anomaly if the score exceeds a set threshold. The threshold is a user-chosen parameter and has a significant effect on the performance of the method. In [1], the threshold is set as the maximum anomalousness score obtained over the validation data.

4.2 Analysis of reimplementation results

The results of the reimplementation of the GDN method on the SWaT and WADI datasets are presented in Table 2. Our reproduction comes close to the published results on the SWaT dataset, but the gap is much larger on the WADI dataset. This can be explained by a few factors. First, as the exact threshold values used in [1] are not reported, we had to run the experiments multiple times using different thresholds. It is thus not realistic to expect identical results from our implementation. Second, as the authors mention, the performance and stability of the method are greatly reduced as the size of the system (number of sensors) increases. This may explain why our results on the WADI dataset are worse than those on the SWaT dataset. Finally, differences in the preprocessing of the datasets may contribute to the observed drop in performance. The original WADI dataset does not label which data points are the result of cyberattacks. We had to add the attack labels manually using the dates and times provided by the authors of the dataset. Mistakes made during that process can influence the experimental results.

5 Future Work

The next part of our project will focus on the reproduction of the Graph learning with Transformer for Anomaly detection (GTA) framework presented in [2]. Both methods (GDN and GTA) will then be applied to the power systems ICS attack dataset developed by Mississippi State University and Oak Ridge National Laboratory.
Finally, we will evaluate the interpretability of the models and how easily they can be adopted for cybersecurity research.

6 Contributions

Halah wrote the first three sections of this report and acquired suitable power system ICS data that can be used for the project. Baza wrote the remaining sections of the report and obtained and preprocessed the raw SWaT and WADI datasets. We both researched and implemented modifications of the dataloader and main implementation code.

References

[1] A. Deng and B. Hooi, “Graph Neural Network-Based Anomaly Detection in Multivariate Time Series.” arXiv, Jun. 13, 2021. doi: 10.48550/arXiv.2106.06947.

[2] Z. Chen, D. Chen, X. Zhang, Z. Yuan, and X. Cheng, “Learning Graph Structures with Transformer for Multivariate Time Series Anomaly Detection in IoT,” IEEE Internet Things J., vol. 9, no. 12, pp. 9179–9189, Jun. 2022, doi: 10.1109/JIOT.2021.3100509.
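Supplementary sketch. As an illustration of the steps described in Section 4.1 and the metrics reported in Table 2, the following minimal Python sketch shows how the cosine similarity of Equation (1), the TopK neighbor selection, the validation-based threshold, and the precision, recall, and F1 computations could fit together. It is not the implementation of [1] or our reimplementation. All function names are hypothetical. The sketch also makes two assumptions that are not stated in this report: a sensor is excluded from its own candidate set, and row i of the adjacency matrix marks the neighbors selected for sensor i. The anomalousness scores are taken as given inputs, since their computation is not detailed here.

```python
import math


def cosine_similarity(u, v):
    """Equation (1): dot product of two embeddings over the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_adjacency(embeddings, k):
    """Build a 0/1 adjacency matrix from sensor embeddings.

    Assumed convention: adjacency[i][j] = 1 when sensor j is one of the k
    sensors most similar to sensor i; self-connections are excluded.
    """
    n = len(embeddings)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        candidates = [j for j in range(n) if j != i]
        candidates.sort(
            key=lambda j: cosine_similarity(embeddings[i], embeddings[j]),
            reverse=True,
        )
        for j in candidates[:k]:
            adjacency[i][j] = 1
    return adjacency


def flag_anomalies(validation_scores, test_scores):
    """Flag a time step when its score exceeds the maximum validation score."""
    threshold = max(validation_scores)
    return [score > threshold for score in test_scores]


def precision_recall_f1(predicted, actual):
    """Compute precision, recall, and F1 from boolean prediction and label lists."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy 2-dimensional embeddings for four hypothetical sensors.
    toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    print(top_k_adjacency(toy_embeddings, k=1))

    # Toy anomalousness scores and labels, for illustration only.
    flags = flag_anomalies([0.2, 0.5, 0.4], [0.3, 0.9, 0.5, 0.6])
    print(flags)
    print(precision_recall_f1(flags, [False, True, True, False]))
```

Because the threshold is taken from the validation scores, a single unusually high validation score raises it and lowers recall, which is consistent with the sensitivity to the threshold discussed in Sections 4.1 and 4.2.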