Time Series Data Augmentation Improving Anomaly Detection with Generative Adversarial Networks: Application in Power Generating Plants

By Toheeb Aduramomi JIMOH (jimoh.toheeb@aims.ac.rw)
African Institute for Mathematical Sciences (AIMS), Rwanda

Supervised by: Dr Marcellin Atemkeng, Rhodes University, South Africa

June 2022

AN ESSAY PRESENTED TO AIMS RWANDA IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE AWARD OF MASTER OF SCIENCE IN MATHEMATICAL SCIENCES

DECLARATION

This work was carried out at AIMS Rwanda in partial fulfilment of the requirements for a Master of Science Degree. I hereby declare that except where due acknowledgement is made, this work has never been presented wholly or in part for the award of a degree at AIMS Rwanda or any other University.
Student: Toheeb Aduramomi JIMOH

ACKNOWLEDGEMENTS

I acknowledge the relentless efforts of my supervisor through his painstaking supervision and support.

DEDICATION

The dedication of this research work contains three intertwined parts, which are nonetheless split thus: To Allah, the uncreated creator of all that exists. To my loving and conscientious father—for always believing, trusting and investing in me continually. And to my precious mother—may your soul rest more in peace; I always wish we both knew each other more.

Abstract

Anomaly detection is a crucial task that involves investigating data points that do not conform to a specific pattern. It is mostly used for fraud detection and other related activities. Different methods have been used for anomaly detection tasks; however, recent studies have shown that the deep learning framework is more suitable, since it is capable of detecting and learning complex patterns from a dataset. As a result, this study utilises the generative adversarial framework for anomaly detection in power generation plants. The study data were obtained from the power consumption records of the TeleInfra telecom company in Cameroon, collected following observed irregularities in the fuel consumption pattern of the generating sets at its base stations. The company had to resort to using power generating sets for its operations due to the irregular power supply in the country. Splitting the data points into anomalous and normal points using some variables, 64.88% were classified as normal while 35.12% were classified as anomalous. The feature importance analysis using the random forest classifier revealed that the Running Time Per Day has the maximum relative importance in determining the output. Also, the generative adversarial network model was trained before and after carrying out data augmentation, with the goal of increasing the data size. The generator model consists of 5 dense layers with the tanh activation function. The discriminator contains 6 dense layers, each followed by a dropout layer to avoid overfitting; it uses the ReLU function throughout, with a sigmoid on its final layer. The accuracy of the model was 98.99% after data augmentation and 66.45% before augmentation. This reveals that the model classified the data points into normal and anomalous almost perfectly, and that the augmented data improved the anomaly detection performance of the GAN. Hence, it is recommended that GANs with a large dataset are suitable for carrying out anomaly detection tasks.

KEYWORDS: Generative modelling, generative adversarial networks, zero-sum game, anomaly detection, power generation plants, telecommunication.
Contents

Declaration
Acknowledgements
Dedication
Abstract
1 Introduction
  1.1 Background of the Study
  1.2 Statement of Problem
  1.3 Motivation
  1.4 Aims and Objectives
  1.5 Structure of the Research Work
2 Literature Review
  2.1 Artificial Intelligence
  2.2 Machine Learning
  2.3 Artificial Neural Networks
  2.4 Activation Functions
  2.5 Back Propagation Method
  2.6 Convolutional Neural Network
  2.7 CNN Architectures
  2.8 Concluding Section
3 Generative Adversarial Networks for Anomaly Detection
  3.1 Generative Adversarial Networks (GANs)
  3.2 Building a GAN
  3.3 Mathematical Framework of the GANs
  3.4 The GANs Algorithm
  3.5 GANs for Anomaly Detection
  3.6 Pros and Cons of the GANs Framework
  3.7 Concluding Section
4 Methodology, Results and Discussion
  4.1 Performance Evaluation
  4.2 Data Collection, Description & Augmentation
  4.3 Results and Discussion
5 Conclusion and Recommendation
  5.1 Conclusion
  5.2 Recommendation
References

1. Introduction

1.1 Background of the Study

About 3% of the world's electrical energy is utilised by Information and Communication Technologies (ICT) companies (Humar et al., 2011). The telecommunication industry is one of the dominant ICT industries that rely on a huge amount of electric power for their operations, and thus a stable power supply is indispensable in their daily dealings. However, its availability in underdeveloped countries, particularly in Africa, has been a constant source of contention. Despite the industry's growth through the creation of base stations, operators have had to turn to alternative energy sources such as gasoline or diesel generators and solar power, to name a few. The TeleInfra telecommunication company established in Cameroon is one such company caught up in these challenges due to the state of power supply in the country.
The telecommunication equipment installed in different parts of the rural and urban areas of Cameroon requires an uninterrupted supply of electricity to achieve the goal of establishing strong and seamless communication channels in the country. However, the country's electricity generation is mostly based on hydropower (73%), with perpetual power interruptions, particularly during the dry seasons when water levels are low (Muh et al., 2018). Moreover, electricity is available to only about 14 per cent of rural residents and 65–88 per cent of urban residents. The diversification to alternative sources of power, particularly the usage of generators, posed another challenge of irregularities or anomalies in fuel consumption at the base stations, due to the observed high consumption rate in the power generation plants. Previous research such as Ayang et al. (2016) shows that factors like mismanagement of both the air-conditioning and lighting systems, as well as the type of buildings, increased the power consumption rate at each of the base stations.

Anomalies are data points that do not conform to an expected pattern in a dataset and are often described as a different distribution within a distribution. They can be present in a dataset through malicious activities like the pilferage of fuel from a power generation plant and fraud in the use of credit cards, among others. Anomaly detection is the task of finding these patterns that differ from the conceived normal observations in a dataset. It is applied in various industries such as manufacturing, medical imaging, CCTV, and social networks, to mention a few, and is commonly used for fraud detection in credit cards, health care or insurance, for detecting intrusions in cyber-security, and so on. Anomaly detection is a very vital concept in data science, as it forms a potent nexus between statistics and data science (Bruce et al., 2020). Basic anomaly detection in a dataset using the regression approach, which is typically meant for data analysis and model improvement, is carried out through diagnostic plots that can help identify extreme observations or anomalies. One instance of such plots is the quantile-quantile (Q-Q) plot, which helps to assess whether a dataset comes from a theoretical distribution, typically the Gaussian or normal distribution, by showing its plot at a glance, as well as the points that could possibly constitute deviations. Another simple approach to identifying anomalies in a dataset is the interquartile range, a concept in statistics that measures dispersion or variability by dividing the dataset into quartiles. However, with the continuous emergence of large volumes of data in many dominant industries like health, finance, and so on, advanced machine learning and deep learning methods have been devised for detecting anomalies in these datasets, using both the supervised and the unsupervised learning approaches. For instance, Mulongo et al. (2020) utilised four supervised machine learning approaches—logistic regression, support vector machines, k-Nearest Neighbours and the Multilayer Perceptron—for detecting anomalies in power generation plants of a telecommunication company with the goal of comparing their performances. Furthermore, according to Pang et al.
(2020), deep learning approaches such as Generative Adversarial Networks (GANs) have demonstrated incredible competency in learning expressive representations of complicated data such as high-dimensional data, temporal data, geographical data, and graph data in recent years, thus pushing the boundaries of many learning tasks. As a result, a deep learning approach is seen as a better way of learning complex patterns in huge datasets and thus has the tendency to yield high performance in terms of accuracy.

1.2 Statement of Problem

The TeleInfra telecommunication company is faced with the challenge of unaccounted-for high fuel consumption for its operations at the base stations. Since it depends solely on generating plants as its major source of power supply, it necessarily has to refill these generators continually, and this is done manually. Such activities are known to have resulted in possible cases of fuel pilferage, given the observed anomalies in the fuel consumption. As a result, it is essential to investigate the likely factors contributing to the anomalies by collecting data on fuel consumption at each of the base stations, for the purpose of minimizing the costs of operation. Moreover, according to Goodfellow et al. (2014), GANs and their training framework have been effectively employed to model both complex and high-dimensional real-world data distributions over time, and thus their characteristics suggest that they can be used for anomaly detection.

1.3 Motivation

The overarching stimulus for carrying out this piece of research is to utilise artificial intelligence, via a deep learning approach, in investigating means of reducing the cost of operations in one of the vital industries essential for the advancement of technology in the world—the telecommunication industry. Research shows that different methods have been used in carrying out anomaly detection tasks. As stated previously, Mulongo et al. (2020) utilised four machine learning techniques in carrying out anomaly detection tasks and thereafter compared their performances. However, it is known that an advanced learning approach such as GANs is capable of identifying complex patterns in a dataset, and as such, it is desirable to explore its usage in this task and possibly generate improved accuracy.

1.4 Aims and Objectives

The fundamental aim of this research is to build a model for detecting anomalies or irregularities in the dataset by employing a supervised deep-learning technique through GANs. The other objectives of the study are as follows:
• To use feature importance analysis and the random forest classifier to analyse the primary features that determine the high fuel consumption in the base stations.
• To divide the data into anomalous and normal data points and thereafter train the model with the normal data points only.
• To generate the confusion matrix and the receiver operating characteristic (ROC) curve for model validation purposes.
• To compare the results before and after data augmentation.

1.5 Structure of the Research Work

The first chapter presents the study and specifies the problem to be addressed, the study's aims and objectives, as well as the motivation for conducting the research. The second chapter will concentrate on key topics that are crucial to this research.
Artificial intelligence, machine learning, deep learning, artificial neural networks, activation functions, convolutional neural networks, and deep learning architectures such as AlexNet (Krizhevsky et al., 2017), LeNet (LeCun et al., 1989), Inception (GoogLeNet) (Szegedy et al., 2014), and others are all included. In the third chapter, how to employ GANs to find anomalies and an explicit explanation of the generative adversarial framework are presented. The numerous methodologies used in the analysis, as well as the results and their explanations, are presented in the fourth chapter. The conclusion and recommendation are found in the fifth and final chapter.

2. Literature Review

This chapter utilises a theoretical method in exploring literature and concepts related to the given study. It broadly encompasses several concepts constituting AI, machine learning, deep learning, as well as other related constituents. Moreover, it gives an explicit explanation as required in each section with minimal mathematical foundation, since the required mathematics is explored in the succeeding chapter.

2.1 Artificial Intelligence

One of the achievements of technological advancement in the world is Artificial Intelligence (AI). With it, there has been more ease in carrying out enormous tasks by utilising machines—programming them to think and function like humans—specifically by exhibiting a high level of intelligence that was once attributed to humans only. The highly sought-after term AI was first coined at the Dartmouth Conference by John McCarthy, a professor of Computer Science at Stanford University, who defined it as the science and engineering of building intelligent systems (McCarthy, 1997). More explicitly, this implies the theory and establishment of systems that can carry out enormous tasks usually attributed to humans, since they require the high level of intelligence which humans exhibit. To be more specific, what we term "intelligence" was enunciated by the same pioneer of the term as the computational part of the ability to attain goals in the universe. Furthermore, Sutton (2019) corroborates this by explaining a goal-attaining system as one that can be usefully understood in terms of its outcomes rather than its mechanisms. Some of the tasks known to require a human level of intelligence, as posited by the AI researchers Russell and Norvig (2021), are natural language processing, machine learning, automated reasoning, brain imaging, knowledge representation, decision making, etc. Moreover, among the myriad concepts related to AI, machine learning and deep learning have gained the most prominence due to their continuous recent usage. The link between these three major notions is illustrated in Figure 2.1. The figure demonstrates that machine learning is a sub-field of AI, and that deep learning is a sub-field of machine learning; machine learning and deep learning are thus AI sub-fields. In Section 2.2, we will look at each concept's application and the literature that supports it.
Figure 2.1: Venn Diagram of AI Components. Source: (Simplilearn, 2022)

2.2 Machine Learning

As mentioned above, machine learning is an important concept related to AI and is continually gaining prominence, partly because a large volume of data is generated daily. Machine learning utilises these enormous datasets by feeding them into a "machine" and helping it learn or discover statistically significant patterns, so that predictive models can be built for determining the future outcomes of a specific phenomenon; this also results in improvement from experience. An informal, old albeit prominent definition of machine learning was given by Samuel (1959) as a field of study that gives computers the ability to learn without being explicitly programmed. His idea came from the fact that he wrote a checkers-playing program that learned over time and was able to improve from the experience of identifying the bad and better playing positions. A more modern and encompassing definition was given by Mitchell (1997): "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." By this, Tom Mitchell identifies the experience E as one derived from carrying out a task repeatedly, and the performance P as the probability of the program improving over time bearing in mind its experience.

Generally, different machine learning algorithms are utilised for solving problems depending on the particular type of learning problem. These algorithms are typically statistical models used in learning or uncovering possible patterns embedded in a dataset. There exist different categorizations of these types; however, we will use the three prominent types, which are supervised learning, unsupervised learning and reinforcement learning, as given by one of several books such as that of Russell and Norvig (2021). Figure 2.2 encapsulates the three types of machine learning, which are explained in the subsections below.

Figure 2.2: Types of Machine Learning. Source: Analytic Steps

2.2.1 Supervised Machine Learning. In supervised machine learning, there is usually a given output, or better still, we know what the output looks like. Training a model when the outcome is known, for making future predictions subsequently on data with an unknown outcome, is referred to as supervised learning (Bruce et al., 2020). Moreover, supervised machine learning problems are mainly categorised as regression and classification problems, and sometimes we might have structured prediction problems. In regression, the required output is usually continuous; that is, we are trying to map the features or independent variables to a continuous target or response variable. In classification problems, the task is to map the features or inputs to a discrete categorical variable. The usual workflow of the supervised learning approach is shown in Figure 2.3. It indicates that the original data is divided into training and test data, that the model is trained on the training data, and that the test data is used to determine whether the model performs as expected before deployment.
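For concreteness, the workflow in Figure 2.3 can be sketched in a few lines of Python. The sketch below is illustrative only: the synthetic feature matrix, the labels and the choice of a random forest classifier are assumptions made for the example, not the study's actual pipeline.

```python
# A minimal sketch of the supervised workflow in Figure 2.3: split, train, evaluate.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                # 500 synthetic examples, 4 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # synthetic binary labels

# Split the original data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model on the training data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test data before any deployment
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```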
Linear regression, logistic regression, random forest, decision trees, and support vector machines are some of the most extensively used supervised learning techniques.

2.2.2 Unsupervised Machine Learning. Contrary to the supervised learning approach, here we have little to no idea of what our output should look like. Moreover, we usually do not distinguish between training and test data. Furthermore, as noted by Chowdhury et al. (2017), the unsupervised learning approach relies only on an underlying unlabelled dataset, where the task is typically to identify complex patterns based on the logic provided in the algorithm, rather than carrying out prediction based on some known input-output pairs. Clustering and association are two problems associated with the unsupervised learning approach. In clustering, the dataset is grouped into different clusters based on similar features of the data points, whereas association tries to identify trends in the data. Some widely used unsupervised machine learning algorithms are K-Means clustering, hierarchical clustering, and so on.

Figure 2.3: Workflow of Supervised Machine Learning (Awodele et al., 2017)

2.2.3 Reinforcement Learning. Reinforcement learning differs from the other two approaches in that it entails creating a system that improves its performance as a result of its interactions with its surroundings. A sequence of feedback loops governs the learning process in reinforcement learning. Video games, resource management, and industrial simulation, to mention a few, are applications of reinforcement learning.

2.3 Artificial Neural Networks

As previously said, AI means developing intelligent systems capable of performing jobs traditionally performed by humans. To aid the processing of complex tasks, these systems require features that are closely related to the human brain; thus, Artificial Neural Networks (ANNs), commonly referred to as "neural networks", are clearly inspired by the human brain, as the brain is typically composed of billions of interconnected neurons that work together. More formally, Li (2014) noted that the inspiration for ANNs comes from the central nervous system, and as such, an ANN is usually composed of artificial neurons or processing elements that are connected to form an entire system that functions just like the biological neural network. ANNs are referred to as networks because they are made up of various functions that gather information by detecting links and patterns in data using previous experiences, which are referred to as training examples in most literature (Goodfellow et al., 2016). Generally, a neural network can be thought of as a machine that was specifically built to model how the brain carries out a particular task of interest. ANNs mainly consist of nodes (neurons) that function together in a distributed style, learning from the input for the purpose of optimising the resulting output (O'Shea and Nash, 2015). Previous research shows that the brain is composed of about $10^{11}$ neurons interconnected with each other. A neural network corresponds to the brain in the sense that the network acquires knowledge from its surroundings through a learning process; also, the strengths of the interneuron connections, that is, the associations of the computing cells of the neural network often referred to as "neurons", are utilised in storing the acquired knowledge (Haykin, 2009). One main appealing feature of ANNs, according to Sharma et al.
(2020), is the ability to modify their behaviour in response to changing system variables. A diagrammatic representation of what a neural network looks like is given in Figure 2.4. It can be seen that the diagram is composed of an input layer, whose nodes have associated weights attached to them that are usually multiplied with each of the n inputs for information processing. It is further composed of hidden layers, which perform most of the computations required by the neural network, as well as the output layer with n outputs, which reveal the results of the predictions of the system. A few of the many applications of a neural network include real-time translation, facial recognition, forecasting, and so on.

Figure 2.4: A Neural Network (Sharma et al., 2020)

2.3.1 Biological Neurons. A biological neuron consists of basic components such as the nucleus, dendrites, cell body (soma), axon and the synapse—the junction that enables transmission of a signal between the dendrites and axons. The soma or cell body is the main structural part of the neuron that carries the nucleus; the dendrites are tree-like structural networks made up of nerve fibres connected to the cell body and are used for carrying signals or information; the axon is a single but long connection that extends from the soma, also carrying signals from the neuron. The synapse is also an important part, because the interconnection of the neurons is possible via the synaptic connection, and this also facilitates the exchange of action potentials. It is present on the dendrites as well as on the axons. Figure 2.5 below is a diagrammatic representation of a typical biological neuron.

Figure 2.5: A Biological Neuron. Source: Lumen Learning

2.3.2 Artificial Neurons. All of the functional and structural components constituting the biological neuron, as highlighted in the section above, are replicated in the artificial neuron, since the latter was inspired by the former. Figure 2.6 is a diagram of an artificial neuron. In the diagram, the inputs to the neuron are $x_1$, $x_2$ and $x_3$; the lines are the links that connect the neurons; and $w_1$, $w_2$ and $w_3$ are the weights associated with each connecting link, which alter or change the input signal. Also, $x_{\text{net}}$ is the net input to the whole artificial neuron, expressed as the summation of the products of the individual inputs with the corresponding weights, as seen in Equation (2.3.1). Furthermore, $f(x_{\text{net}})$ denotes the activation function, and this is what determines the output of the node. This concept is explained more explicitly in subsequent sections.

$$x_{\text{net}} = \sum_{i=1}^{n} x_i w_i \qquad (2.3.1)$$

2.3.3 Relationship between the Biological Neuron and Artificial Neuron. Since the artificial neuron was formed bearing in mind the biological neuron, it is obvious that the different components embedded in the former carry out functions similar to a corresponding component in the latter. For instance, the soma or cell body accommodating the nucleus in the biological neuron corresponds to the net input ($x_{\text{net}}$); the axon of the biological neuron corresponds to the output of the artificial neuron; the whole cell corresponds to the neuron or node; and the dendrites or synapses of the biological neuron correspond to the weights in the artificial neuron.

Figure 2.6: An Artificial Neuron.
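As a quick numerical illustration of Equation (2.3.1), the sketch below computes the net input of a single artificial neuron and applies an activation to it; the input and weight values are arbitrary, and tanh is chosen here only as an example of an activation f.

```python
# Net input of an artificial neuron (Equation 2.3.1) followed by an activation.
import numpy as np

x = np.array([0.5, -1.2, 3.0])     # inputs x1, x2, x3 (arbitrary values)
w = np.array([0.8, 0.1, -0.4])     # weights w1, w2, w3 (arbitrary values)

x_net = np.sum(x * w)              # x_net = sum_i x_i * w_i
output = np.tanh(x_net)            # f(x_net), with f taken to be tanh here

print(x_net, output)
```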
2.4 Activation Functions

As explained briefly in Section 2.3.2, an activation function of a neural network is what determines the output of a node. It is one of the parameters utilised in neural network computation, alongside some other hyperparameters, and it typically informs us of how the input is manipulated to generate the output. Activation functions are often referred to as transfer functions in some literature. Each node or neuron has an activation function associated with it, and it is one of the important functions constituting a neural network that aids in improving generalisation and learning (Nwankpa et al., 2018). According to Sharma et al. (2020), a neural network's prediction accuracy is determined by the number of input layers and, more crucially, the type of activation function. As a result, it is clear that picking the right activation function is an excellent way to improve deep learning algorithms.

Activation functions can be linear or non-linear, depending on the mathematical function they are linked with. Albeit there is no specific rule about the type of activation function to be used, non-linear functions are commonly used as activation functions, since the errors associated with real-world inputs usually possess non-linear features. Moreover, the decision boundary associated with a linear activation function is linear, and as such the neural network's adaptation will be limited to linear changes in the input. Generally, activation functions are utilised in transforming the linear signal that results from learning in the neural network into a non-linear signal, which is then passed on to subsequent layers in the system. It is worth noting that the initial output before the application of an activation function is linear by default; hence, non-linear activation functions are usually required for the transformation. Furthermore, the position of an activation function in a deep learning architecture usually determines its function in the system: when it is placed after the hidden layers, it converts the learned mapping of the inputs and the corresponding weights into non-linear form; when placed in the output layer, it performs prediction.

There are various mathematical functions used in generating the different activation functions, and the activation functions are dependent on them. Some of the main activation functions are the sigmoid, Tanh, ReLU, Leaky ReLU, and softmax activation functions, to mention a few. The linear output of a learned mapping is given as the sum of the products of the individual inputs with their corresponding weights, plus the bias:

$$y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \sum_{i=1}^{n} w_i x_i + b \qquad (2.4.1)$$

From Equation (2.4.1), the $x_i$'s are the individual inputs, the $w_i$'s are the corresponding weights, $b$ is the associated bias, and $y$ is the linear output. When an activation function $\alpha$ is applied to this linear output, the non-linear output generated is given as

$$y = \alpha(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b) = \alpha\left(\sum_{i=1}^{n} w_i x_i + b\right) \qquad (2.4.2)$$

where $\alpha$ is the activation function.

2.4.1 Sigmoid Activation Function. This is a non-linear activation function that is used mostly in feed-forward neural networks. It is sometimes referred to as the squashing function or logistic function in some research.
It is seen as one of the most widely used activation functions because it is a non-linear type, since errors mostly exhibit non-linear features, as discussed in Sharma et al. (2020). Its position in the deep learning framework is in the output layer, and thus it is used for prediction involving binary classification, and so on. It transforms the inputs $x_i$ into a range between 0 and 1. Moreover, according to Neal (1992), major advantages of the sigmoid activation function are that it is mostly utilised in neural networks consisting of only one or two hidden layers, known as shallow networks, and that it is facile to comprehend. The main concern about this activation function is the issue of the vanishing gradient in multilayer or deep neural networks, which arises when the derivatives become smaller and smaller until they vanish to zero under repeated differentiation; it is also a computationally expensive function. The sigmoid activation can be represented mathematically as

$$\varphi(x) = \frac{1}{1 + e^{-x}} \qquad (2.4.3)$$

Nwankpa et al. (2018) noted that there are other variants of the sigmoid activation function, which are the hard sigmoid function, the Sigmoid-Weighted Linear Unit (SiLU) and the Derivative of the Sigmoid-Weighted Linear Unit (dSiLU).

2.4.2 Tanh Activation Function. The hyperbolic tangent activation function, commonly referred to as the Tanh function, is differentiable and continuous and has values between −1 and 1, unlike the sigmoid function. It can be expressed mathematically as

$$\varphi(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (2.4.4)$$

The main benefit of the Tanh activation function is that it yields a zero-centred output, and thus aids the back-propagation process. However, it is deficient in addressing the vanishing gradient concern of the sigmoid function, even though it is preferred to the latter because it gives better training performance for deep (that is, multilayer) neural networks. Furthermore, the Tanh function is used mainly in speech recognition and natural language processing, as well as in recurrent neural networks. Graphical representations of the sigmoid and Tanh activation function responses are given in Figure 2.7. Figure 2.7a shows vividly that the response of the sigmoid activation function ranges between 0 and 1, while Figure 2.7b shows clearly that the hyperbolic tangent (Tanh) activation function response ranges between −1 and +1.

Figure 2.7: Sigmoid & Tanh Activation Function Responses. (a) Sigmoid Activation Function; (b) Tanh Activation Function

A more computationally efficient and cheaper version of the Tanh activation function used in deep learning is referred to as the Hard Tanh function, and it is represented mathematically as

$$\varphi(x) = \begin{cases} 1 & \text{for } x > 1 \\ x & \text{for } -1 \le x \le 1 \\ -1 & \text{for } x < -1 \end{cases} \qquad (2.4.5)$$

2.4.3 Rectified Linear Unit (ReLU) Activation Function. The Rectified Linear Unit activation function, commonly referred to as the ReLU function, is also a non-linear activation function and has a response range between 0 and ∞. It is seen as the most widely used activation function in deep learning applications, as it appears in almost all deep learning architectures. It is also known to be more efficient in deep learning when compared with the sigmoid and Tanh activation functions, mainly because it does not have all the neurons activated at once; that is, only a certain proportion of neurons are activated at a time (Sharma et al., 2020).
It addresses the vanishing gradient problem observed in the previous two functions by forcing inputs that are less than zero to zero, as can be seen vividly in its mathematical representation given by Equation (2.4.6):

$$\varphi(x) = \max(0, x) = \begin{cases} x_i & \text{if } x_i \ge 0 \\ 0 & \text{if } x_i < 0 \end{cases} \qquad (2.4.6)$$

Furthermore, it is advantageous in that it yields faster computation, since it does not evaluate mathematical expressions like exponentiation and division, as seen in the previous functions. Nwankpa et al. (2018) also noted that it is used in almost all deep learning architectures such as AlexNet, VGGNet, GoogLeNet, ResNet, and SegNet, to mention a few, due to its reliability and simplicity. However, an issue that decreases its ability to properly fit data is the fact that it forces negative inputs to zero, and this prevents it from mapping the negative inputs as required. Thus, a variant of the ReLU activation function known as the Leaky ReLU was established to address this particular issue of the ReLU.

2.4.4 Leaky Rectified Linear Unit (LReLU) Activation Function. The Leaky Rectified Linear Unit activation function, commonly referred to as the Leaky ReLU, is an improved version of the ReLU function. It attempts to address the ReLU's problem of forcing negative inputs to zero by instead replacing them with a relatively small linear component of the input variable x, whose coefficient is usually 0.01. It encompasses a leak that helps to adjust for negative inputs and thus extends the range of the function's response to values between −∞ and +∞. It is computed as represented in Equation (2.4.7) below:

$$\varphi(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \le 0 \end{cases} \qquad (2.4.7)$$

The main improvement of the LReLU over the ReLU and Tanh activation functions is in sparsity and dispersion. In cases where the small coefficient multiplying the negative input x is not fixed at 0.01, the function is referred to as a Randomized Leaky ReLU. Other variants of the Rectified Linear Unit function are the Parametric Rectified Linear Unit (PReLU) and the S-shaped ReLU (SReLU).

2.4.5 Softmax Activation Function. The softmax activation function is obtained by compounding multiple sigmoid functions and, as such, it is normally used in multiclass classification, as compared to the sigmoid function, which is used in binary classification. As a result, it also has a response between 0 and 1. It is used for computing the probabilities of each class in a multiclass classification model and returns the highest probability for the target class. Furthermore, it is found in the output layer of almost all deep learning architectures like AlexNet, VGGNet, GoogLeNet, ResNet, SegNet, and so on, save for SENet, which uses the sigmoid function in that position. The mathematical representation of the function is given by Equation (2.4.8) as follows:

$$\varphi(x_i) = \frac{\exp(x_i)}{\sum_{j} \exp(x_j)} \qquad (2.4.8)$$

2.5 Back Propagation Method

This is an optimisation strategy used in computing the gradients of the loss of the neural network with respect to the weights associated with the individual inputs. It is used when the predicted output is significantly different from the actual value. It basically involves tracking back through the network to adjust the weights associated with each input.
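Before moving on to convolutional networks, the activation functions of Equations (2.4.3) to (2.4.8) can be summarised in a few lines of NumPy. The sketch below is purely illustrative; deep learning libraries provide their own tuned implementations.

```python
# Minimal NumPy versions of the activation functions in Equations (2.4.3)-(2.4.8).
import numpy as np

def sigmoid(x):                      # Eq. (2.4.3): squashes input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                         # Eq. (2.4.4): zero-centred output in (-1, 1)
    return np.tanh(x)

def relu(x):                         # Eq. (2.4.6): max(0, x)
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):       # Eq. (2.4.7): small slope alpha for x <= 0
    return np.where(x > 0, x, alpha * x)

def softmax(x):                      # Eq. (2.4.8): probabilities over classes
    e = np.exp(x - np.max(x))        # subtract the max for numerical stability
    return e / np.sum(e)

z = np.array([-2.0, 0.5, 3.0])
print(sigmoid(z), tanh(z), relu(z), leaky_relu(z), softmax(z), sep="\n")
```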
2.6 Convolutional Neural Network

The convolutional neural network, commonly referred to as a CNN or ConvNet, is an ANN that is used mostly in image analysis, albeit it is also used for classification, and according to Krizhevsky et al. (2017), it is the most commonly employed and famous algorithm in the field of deep learning. A CNN is known to be a regularized version of the multi-layer perceptron, composed of convolutional layers that in turn contain filters used especially for detecting patterns and making sense of them. The filters of a CNN can technically be thought of as relatively small matrices whose rows and columns are usually specified by the user. As noted by O'Shea and Nash (2015), the main significant difference between a CNN and the traditional ANN is that the former is mostly utilised in identifying and recognising patterns within images. Furthermore, CNNs are composed of three types of layers that are stacked together to form the CNN architecture: the convolutional layers, the pooling layers and the fully-connected layers. The three vital benefits of the CNN are parameter sharing, equivariant representation and sparse interactions (Goodfellow et al., 2016). Also, Alzubaidi et al. (2021) corroborated that the advantages of using a CNN include the weight sharing feature, which helps to lower the number of trainable network parameters, thus allowing the network to become more generic and avoid overfitting; the ease of large-scale network implementation; and the high organisation and reliability of the model output as a result of the simultaneous learning of the classification layer and the feature extraction layers. Among the wide range of fields in which CNNs are being utilised extensively are computer vision, face recognition, speech processing, etc. A typical architecture of a CNN for image classification is shown in Figure 2.8. The CNN elements discussed briefly in the subsections below are padding, max pooling, average pooling, and global pooling.

2.6.1 Padding. Padding is used for addressing the possible reduction in the size of the output that might arise after convolution in a CNN, thereby causing loss of information. Applying padding in a CNN makes the output (image) have the same size as the input (image).

2.6.2 Max Pooling. To shrink the dimensionality of the feature map representation and the computational complexity of a CNN through the pooling layers, max pooling is a pooling type that utilises the maximum value of each local cluster of neurons contained in the feature map. It is a constant filtering operation that broadcasts and calculates the maximum value of a certain region.

2.6.3 Average Pooling. While max pooling uses the maximum value, average pooling makes use of the average value of each local cluster of the neurons constituting the feature map.

Figure 2.8: CNN Architecture for Image Classification (Alzubaidi et al., 2021)

2.6.4 Global Pooling. This acts on all of the neurons constituting the feature map.

2.7 CNN Architectures

There have no doubt been numerous CNNs in recent years. The architectures of a CNN refer to the different set-ups made to the neural network for the purpose of arranging it in a way that best suits its specific task or kind of input data, such as images. According to Alzubaidi et al.
(2021), CNN architectures have witnessed various modifications from 1989 to date, including parameter optimization, regularization, and structural reformulation, among others. Furthermore, Szegedy et al. (2014) noted that most of the development and progress the technology world has witnessed so far is due not only to more voluminous datasets, more complex models and more powerful hardware, but also to improved network architectures, algorithms and new ideas. There are myriad CNN architectures already; however, a few prominent ones, LeNet-5 (LeCun et al., 1989), AlexNet (Krizhevsky et al., 2017), VGG-16 (Simonyan and Zisserman, 2015), Inception (GoogLeNet) (Szegedy et al., 2014), ResNet (He et al., 2016), and DenseNet, are discussed in the subsections below.

2.7.1 LeNet-5. LeNet-5 is the earliest deep learning architecture for CNNs and as such has the smallest number of layers—5 layers—consisting of 3 fully-connected layers and 2 convolutional layers, hence the "5" associated with its name (LeCun et al., 1989). It is known to be one of the simplest architectures, and it has about 60,000 parameters.

2.7.2 AlexNet. AlexNet was first proposed by Krizhevsky et al. (2017), and it improved the learning ability of the CNN by increasing its depth as well as implementing numerous strategies for optimizing parameters (Alzubaidi et al., 2021). It was the earliest improvement on LeNet, with eight learned layers (five convolutional and three fully-connected), and it is known to have considerable significance for the modern-day CNN generations. The activation functions in its hidden layers and output layer are respectively the Rectified Linear Unit (ReLU) and the softmax function.

2.7.3 Visual Geometry Group (VGG). The Visual Geometry Group (VGG) architecture is an innovative and efficient design proposed by Simonyan and Zisserman (2015), and it is an improvement on AlexNet in terms of the depth of the network, with sixteen and nineteen layers in its two main variants. VGG-16 is a variant of the VGG that is widely used as a basis for other CNNs, and it is known to have good classification performance. The activation functions in its hidden layers and output layer are respectively the ReLU and the softmax function.

2.7.4 Inception (GoogLeNet). The GoogLeNet architecture, also known as Inception-V1, is composed of twenty-two layers and was proposed by Szegedy et al. (2014). It emerged as the winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2014. Its design goal is to achieve a high level of accuracy with minimal computational cost, by proposing the novel inception block concept in the context of the CNN (Alzubaidi et al., 2021). Moreover, its motivation was to enhance the learning capacity and thereby improve the efficiency of the CNN parameters. Like the VGG and AlexNet, the activation functions in its hidden layers and output layer are respectively the ReLU and the softmax function.

2.7.5 Residual Network (ResNet). The Residual Network (ResNet) architecture was developed by He et al. (2016), and its largest variant is known to contain one hundred and fifty-two layers. However, the most common variant of this architecture is ResNet-50, which contains 49 convolutional layers and a single fully-connected layer.
ResNet emerged as the winner of both the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and the Common Objects in Context (COCO) detection challenge in 2015 with 152 layers of depth, representing 20 times the depth of AlexNet and 8 times the depth of VGG. However, it has lower computational complexity when compared with the VGG. The activation functions in its hidden layers and output layer are respectively the ReLU and the softmax function.

2.7.6 Dense Convolutional Network (DenseNet). The Dense Convolutional Network (DenseNet) was suggested for the purpose of addressing the issue of the vanishing gradient, while following the same direction as ResNet. It connects each layer to all subsequent layers in the network in a feed-forward fashion. It has some compelling advantages, among which are the strengthening of feature propagation, alleviation of the vanishing-gradient problem, encouraging the reuse of features, as well as reducing the number of parameters significantly.

2.8 Concluding Section

This chapter has thoroughly explored the different concepts related to our study, including related literature, using the theoretical approach. It placed emphasis on assessing important concepts such as artificial intelligence and its sub-fields. It also gave an exposition of the concept of ANNs and their vital components, such as the activation functions, as well as an important ANN framework, the CNN. The next chapter presents GANs and their framework, including the algorithm, their mathematical representation, and how they are used for anomaly detection tasks.

3. Generative Adversarial Networks for Anomaly Detection

3.1 Generative Adversarial Networks (GANs)

Anomalies can occur when incomplete or corrupted data points are allowed within a dataset. It is assumed that non-anomalous data can be simplified further due to recognisable patterns and correlations within the dataset: the data points can be mapped to some lower-dimensional distribution, then inversely mapped to the original distribution, with minimal loss, as in an autoencoder. This mapping leads to a new distribution in which typical data points are denser in certain areas than in others.

GANs were introduced in 2014 by Goodfellow et al. (2014); a GAN is a deep-learning architecture used to replicate dataset distributions, and it is an instance of modern generative modelling algorithms. It is comprised of two parts: the Generator and the Discriminator. The Discriminator is a deep-learning classifier that learns to distinguish between real examples taken from the domain and fake (generated) examples. The Generator is a deep-learning model that takes as input a latent vector of random numbers drawn from a normal distribution and transforms it. Its goal is to continually generate examples that fool the discriminator model into thinking they are real. Moreover, the generator and discriminator learn alongside each other simultaneously until they attain an equilibrium where the generator produces realistic examples.

3.2 Building a GAN

Figure 3.1: Building Block of a GAN

Figure 3.1 depicts the typical architecture of a GAN. Generally, a sample of noise z is passed to the generator model G, a multi-layer perceptron. The generator then produces a fake sample referred to as G(z), and this sample, alongside the real data x, is passed on to the discriminator neural network D. A minimal sketch of such a generator-discriminator pair is given below.
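The following Keras sketch is one possible concrete realisation of Figure 3.1. The layer widths, the latent dimension and the feature count are illustrative assumptions; the study's own models are described only at a high level (a dense tanh generator, and a dense ReLU discriminator with dropout and a sigmoid output).

```python
# Illustrative generator and discriminator for tabular data (not the study's exact models).
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 32    # size of the noise vector z (assumed)
N_FEATURES = 21    # number of input features (assumed, matching the 21 variables described later)

def build_generator():
    # Maps a noise vector z to a fake sample G(z) with N_FEATURES values.
    return tf.keras.Sequential([
        layers.Dense(64, activation="tanh", input_shape=(LATENT_DIM,)),
        layers.Dense(128, activation="tanh"),
        layers.Dense(N_FEATURES, activation="tanh"),
    ])

def build_discriminator():
    # Maps a sample (real x or fake G(z)) to a probability of being real.
    return tf.keras.Sequential([
        layers.Dense(128, activation="relu", input_shape=(N_FEATURES,)),
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),
    ])

generator = build_generator()
discriminator = build_discriminator()
discriminator.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```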
The task of the discriminator is to output a single value, which is the probability of the input being real. Thus, it outputs D(x) if the initial input is real; otherwise it outputs D(G(z)). Furthermore, the adversarial framework as depicted above is transformed from an unsupervised approach into a supervised approach, in the sense that the result of the process comes from the discriminator D, which acts as a classifier.

3.3 Mathematical Framework of the GANs

As originally proposed by Goodfellow et al. (2014), the adversarial framework is formulated mathematically via the minimax of a target function between the generator function $G : \mathbb{R}^d \to \mathbb{R}^n$ and the discriminator function $D : \mathbb{R}^n \to [0, 1]$. Thus, the target loss function is proposed to be

$$V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \qquad (3.3.1)$$

In (3.3.1), $\mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)]$ is the prediction of the discriminator on real data, while the second term, $\mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$, is the prediction of the discriminator on fake data. Bearing (3.3.1) in mind, it is required to solve the minimax problem

$$\min_G \max_D V(D, G) = \min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \qquad (3.3.2)$$

The generator's goal is to maximize its probability of winning, that is, having the discriminator make a mistake; hence, it is required to minimize the value function given by (3.3.2). On the other hand, the discriminator's goal is to minimize the winning probability of the generator, hence it has to maximize (3.3.2). In summary, the discriminator wants $D(G(z))$ to be as small as possible and $D(x)$ to be as large as possible.

3.3.1 Proposition. The optimal discriminator is given by a simple relation of the probability distributions of the real and fake samples. When these distributions are equal, the cost function is a constant given as $-\log 4$.

Proof.

$$V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
$$= \int_x p_{\text{data}}(x) \log[D(x)] \, dx + \int_z p_z(z) \log[1 - D(G(z))] \, dz$$
$$V(D, G) = \int_x p_{\text{data}}(x) \log[D(x)] + p_g(x) \log[1 - D(x)] \, dx \qquad (3.3.3)$$

The integrand of (3.3.3) can be written in the form

$$a \log y + b \log(1 - y) \qquad (3.3.4)$$

where

$$a = p_{\text{data}}(x) \qquad (3.3.5)$$
$$b = p_g(x) \qquad (3.3.6)$$

Thus, to find the y that maximizes (3.3.4), we differentiate partially with respect to y and equate to zero:

$$\frac{\partial}{\partial y}\big(a \log y + b \log(1 - y)\big) = 0 \;\;\Rightarrow\;\; \frac{a}{y} - \frac{b}{1 - y} = 0 \;\;\Rightarrow\;\; \frac{a}{y} = \frac{b}{1 - y} \;\;\therefore\;\; y = \frac{a}{a + b}$$

Then, by substituting (3.3.5) and (3.3.6) into y above, we have

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_g(x) + p_{\text{data}}(x)} \qquad (3.3.7)$$

Equation (3.3.7) is the final form of the optimal discriminator, that is, a relation of the probability distributions of the real and fake samples. Furthermore, if the probability distributions of the real data and the generated data are identical, that is,

$$p_{\text{data}}(x) = p_g(x) \qquad (3.3.8)$$

then (3.3.7) becomes

$$D^*(x) = \frac{p_{\text{data}}(x)}{2\, p_{\text{data}}(x)} = \frac{1}{2} \qquad (3.3.9)$$

Thus, the optimal discriminator outputs 1/2. By substituting this into the cost function via (3.3.1), we have

$$V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] = \mathbb{E}_{x \sim p_{\text{data}}}[\log(1/2)] + \mathbb{E}_{z \sim p_z}[\log(1/2)] = -\log 2 - \log 2$$
$$\therefore \; V(D^*, G) = -\log 4$$

3.3.2 Kullback-Leibler Divergence. The Kullback-Leibler divergence, also called relative entropy or KL divergence, is a statistical distance that measures how an initial probability distribution P differs from a second reference probability distribution Q.
It is the expected value of the logarithm of the ratio of the two probability distributions, represented mathematically as

$$D_{KL}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right] = \mathbb{E}_{x \sim P}\left[\log \frac{2P(x)}{2Q(x)}\right] = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)/2}\right] - \log 2 \qquad (3.3.10)$$

Recall that the cost function can be expressed as

$$V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{x \sim p_g(x)}[\log(1 - D(x))] \qquad (3.3.11)$$

By substituting Equation (3.3.7) into Equation (3.3.11), the cost function becomes

$$V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_g(x) + p_{\text{data}}(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log\left(1 - \frac{p_{\text{data}}(x)}{p_g(x) + p_{\text{data}}(x)}\right)\right]$$
$$= \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_g(x) + p_{\text{data}}(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_g(x) + p_{\text{data}}(x)}\right]$$
$$= \mathbb{E}_{x \sim p_{\text{data}}}\left[\log\left(p_{\text{data}}(x) \div \frac{p_g(x) + p_{\text{data}}(x)}{2}\right)\right] + \mathbb{E}_{x \sim p_g}\left[\log\left(p_g(x) \div \frac{p_g(x) + p_{\text{data}}(x)}{2}\right)\right] - \log 4 \qquad (3.3.12)$$

Thus, by comparing the last part of Equation (3.3.12) with Equation (3.3.10), the Kullback-Leibler divergence enables us to write

$$V(D, G) = -\log 4 + KL\left(p_{\text{data}} \,\Big\|\, \frac{p_{\text{data}} + p_g}{2}\right) + KL\left(p_g \,\Big\|\, \frac{p_{\text{data}} + p_g}{2}\right) \qquad (3.3.13)$$

3.3.3 Jensen-Shannon Divergence. The Jensen-Shannon Divergence, also called the total divergence to the average, is a concept in probability theory for measuring the similarity between two probability distributions. It is represented mathematically as

$$JSD(P \,\|\, Q) = \frac{1}{2} D_{KL}\left(P \,\Big\|\, \frac{P + Q}{2}\right) + \frac{1}{2} D_{KL}\left(Q \,\Big\|\, \frac{P + Q}{2}\right) \qquad (3.3.14)$$

The Jensen-Shannon Divergence is similar to the Kullback-Leibler divergence, the only difference being that the former is symmetric. This implies that the JSD from P to Q is the same as that from Q to P—which is not true for the KL divergence. It is also used as a distance measure between two probability distributions. By using the last part of Equation (3.3.12), we have

$$\min_G V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log\left(p_{\text{data}} \div \frac{p_g + p_{\text{data}}}{2}\right)\right] + \mathbb{E}_{x \sim p_g}\left[\log\left(p_g \div \frac{p_g + p_{\text{data}}}{2}\right)\right] - \log 4$$
$$= 2 \times \frac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\left[\log\left(p_{\text{data}} \div \frac{p_g + p_{\text{data}}}{2}\right)\right] + 2 \times \frac{1}{2}\,\mathbb{E}_{x \sim p_g}\left[\log\left(p_g \div \frac{p_g + p_{\text{data}}}{2}\right)\right] - \log 4 \qquad (3.3.15)$$

Thus, the minimum of the cost function for the generator is given as

$$\min_G V(D, G) = -\log 4 + 2\, JSD(p_{\text{data}} \,\|\, p_g) \qquad (3.3.16)$$

That is, this minimum cost function is the Jensen-Shannon Divergence between the real and fake data distributions, up to the constant $-\log 4$ incurred for having the optimal discriminator when the data distributions match. Thus, for an optimal discriminator, the generator aims to minimize the cost as given in Equation (3.3.15). In addition, the minimum of any Jensen-Shannon Divergence is 0, and this occurs if and only if the two probability distributions are equal, that is, if Equation (3.3.8) holds. Thus, the cost function has only one minimum, which is achieved when the generator perfectly maps the data distribution. In this case, the only minimum of the cost function is the constant

$$\min_G V(D, G) = -\log 4 \qquad (3.3.17)$$

Equation (3.3.17) is referred to as the global minimum of the loss function.

3.4 The GANs Algorithm

The algorithm of the generative adversarial framework is presented below. It briefly summarizes the whole set of steps involved in the generative adversarial framework. To learn the discriminator, it is treated as a classification problem wherein the inputs are the real data x and the fake data G(z), and the goal is to predict the probability of the input being real.
In the algorithm, we use the positive direction of the gradient (gradient ascent) of the loss function to update the discriminator in an attempt to maximize the cost, and likewise the negative direction of the gradient (gradient descent) to update the generator to minimize the cost. The algorithm also corroborates the picture presented in Figure 3.1.

Begin
for each training iteration do
    for k steps do
        Sample m noise samples {z^(1), ..., z^(m)} and transform them with the Generator;
        Sample m real samples {x^(1), ..., x^(m)} from the real data;
        Update the Discriminator by ascending its gradient:
            ∇_{θ_d} (1/m) Σ_{i=1}^{m} [ log D(x^(i)) + log(1 − D(G(z^(i)))) ];
    end for
    Sample m noise samples {z^(1), ..., z^(m)} and transform them with the Generator;
    Update the Generator by descending its gradient:
            ∇_{θ_g} (1/m) Σ_{i=1}^{m} log(1 − D(G(z^(i))));
end for
End

3.5 GANs for Anomaly Detection

Schlegl et al. (2017) noted that the task of anomaly detection using GANs involves modelling the normal behaviour using an adversarial training process and thereby detecting anomalies by measuring an anomaly score. It is formulated such that the framework learns a generative model, referred to as the generator, alongside another model called the discriminator, where the generator maps samples from an arbitrary latent distribution to data, and the discriminator tries to detect which samples are generated and which are real. A BiGAN architecture was proposed by Donahue et al. (2016) as an extension of the initial framework by including an inverse mapping, which maps data back to their latent representation. Consequently, the usage of GANs for anomaly detection is usually comprised of a function which maps the input data to the latent space, as well as a function, known as the generator, that acts entirely in the opposite direction (Mattia et al., 2019).

3.6 Pros and Cons of the GANs Framework

The fundamental advantage of the generative adversarial framework is that it updates the two models—the generator model and the discriminator model—via the back-propagation paradigm and, as such, it does not require Markov chains. In addition, no inference is necessary during learning, and a wide range of interactions and factors can be simply integrated into the model. However, a drawback of using the generative adversarial network for anomaly detection is that the anomaly scores are not easy to interpret.

3.7 Concluding Section

This chapter explicitly explored the GANs framework, its mathematics, as well as the algorithm that depicts how the framework can be implemented with programming software. Moreover, it briefly explained what exactly is meant by using GANs for anomaly detection. The next chapter presents the methods of data analysis, including data augmentation, and the results achieved, alongside their comprehensive discussion.

4. Methodology, Results and Discussion

4.1 Performance Evaluation

This helps to examine the performance of our model, and the various measures are as follows:

• Confusion Matrix: This is a matrix that shows the number of correct and incorrect predictions by the model. It is usually composed of the True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). The structure of a typical confusion matrix is given in Table 4.1.
3.5 GANs for Anomaly Detection

Schlegl et al. (2017) noted that the task of anomaly detection using GANs involves modelling the normal behaviour through an adversarial training process and then detecting anomalies by measuring an anomaly score. It is formulated such that the framework learns a generative model, referred to as the generator, alongside another called the discriminator, where the generator maps samples from an arbitrary latent distribution to data, and the discriminator tries to detect which samples are generated and which are real. A BiGAN architecture was proposed by Donahue et al. (2016) as an extension of the initial framework by including an inverse mapping, which maps data back to their latent representation. Consequently, the use of GANs for anomaly detection usually comprises a function that maps the input data to the latent space, together with the generator, which acts in the opposite direction (Mattia et al., 2019).

3.6 Pros and Cons of the GANs Framework

The fundamental advantage of the generative adversarial framework is that it updates the two models, the generator model and the discriminator model, via the back-propagation paradigm and, as such, does not require a Markov chain. In addition, no inference is necessary during learning, and a wide range of interactions and factors can be simply integrated into the model. However, a drawback of using the generative adversarial network for anomaly detection is that the anomaly scores are not easy to interpret.

3.7 Concluding Section

This chapter explored the GANs framework, its mathematics, and the algorithm that shows how the framework can be implemented in software. It also briefly explained what is meant by using GANs for anomaly detection. The next chapter presents the methods of data analysis, including data augmentation, together with the results achieved and their discussion.

4. Methodology, Results and Discussion

4.1 Performance Evaluation

This section examines the performance of our model using the following measures; a short sketch of how they can be computed in code is given after the list.

• Confusion Matrix: This is a matrix that shows the numbers of correct and incorrect predictions by the model. It is composed of the True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) counts. The structure of a typical confusion matrix is given in Table 4.1.

Table 4.1: Confusion Matrix

                          Prediction \hat{f}(x)
  Actual (y)       Positive                 Negative
  Positive         True Positive (TP)       False Negative (FN)
  Negative         False Positive (FP)      True Negative (TN)

• Accuracy (PCC_{te}(\hat{f})): This is the estimated probability of correct predictions, given as the total number of correct predictions divided by the total number of observations in the test set. The best accuracy is 1, and the worst is 0. It is expressed as

  Accuracy = PCC_{te}(\hat{f}) = \frac{TP + TN}{TP + FP + TN + FN} = \frac{TP + TN}{|D_{te}|}   (4.1.1)

• Sensitivity (Sn): This is the number of correct positive predictions divided by the total number of positives. It is also called Recall or the True Positive Rate. The best sensitivity is 1, and the worst is 0. It can be quantified as

  Sn = \frac{TP}{TP + FN}   (4.1.2)

• Specificity (Sp): This is the number of correct negative predictions divided by the total number of negatives. It is also called the True Negative Rate, and it shows the proportion of correctly predicted negative examples. The best specificity is 1, and the worst is 0. It is quantified as

  Sp = \frac{TN}{TN + FP}   (4.1.3)

• Precision (P): This is the number of correct positive predictions divided by the total number of positive predictions, and it is also called the Positive Predictive Value. It also gives information about how effectively our model avoids false positives. The best precision is 1, and the worst is 0. It can be expressed as

  P = \frac{TP}{TP + FP}   (4.1.4)

• False Positive Rate (FPR): This is the number of incorrect positive predictions divided by the total number of negatives. The best FPR is 0, and the worst is 1.

  FPR = \frac{FP}{TN + FP} = 1 - Specificity   (4.1.5)

• F_1-Score: This is the harmonic mean of precision and recall. It is represented as

  F_1\text{-Score} = \frac{2}{\frac{1}{Recall} + \frac{1}{Precision}} = 2 \times \frac{Recall \times Precision}{Recall + Precision}   (4.1.6)

• Area under ROC (AUC): The AUC measures the entire area under the Receiver Operating Characteristic (ROC) curve. It graphically summarises the overall ranking performance of the classifier.
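These measures need not be computed by hand; assuming scikit-learn is available, a sketch such as the following (with made-up labels, predictions and scores) reproduces Equations (4.1.1) to (4.1.6) and the AUC:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # 1 = anomaly, 0 = normal (toy labels)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                    # hypothetical model predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # hypothetical predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()   # Table 4.1 entries
specificity = tn / (tn + fp)                                # Eq. (4.1.3)
fpr = fp / (tn + fp)                                        # Eq. (4.1.5)

print(accuracy_score(y_true, y_pred))    # Eq. (4.1.1)
print(recall_score(y_true, y_pred))      # sensitivity, Eq. (4.1.2)
print(precision_score(y_true, y_pred))   # Eq. (4.1.4)
print(f1_score(y_true, y_pred))          # Eq. (4.1.6)
print(specificity, fpr)
print(roc_auc_score(y_true, y_score))    # area under the ROC curve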
4.2 Data Collection, Description & Augmentation

4.2.1 Data Collection and Description. The dataset utilised in this study was obtained from the base stations of the TeleInfra telecommunication company in Cameroon, which use generators as a major source of power, and it spans September 2017 to August 2018. The dataset initially consisted of 6010 records, which reduced to 5902 records after the necessary data cleaning. It comprises variables such as Running Time, which accounts for the running time of the generating plants; Power Type, denoting the source of power; Consumption Rate; and GENERATOR CAPACITY (kVA), measuring the capacity of the generator in kilovolt-amperes (kVA), among others. Furthermore, to adequately measure the irregularities or anomalies in the dataset, additional variables were generated, including Daily Consumption within a Period, obtained by dividing the Consumption HIS by the Number of Days; Running Time per Day, derived by dividing the Running Time by the Number of Days; and the Daily Consumed Quantity between Visits. Thus, prior to feature selection, there were 21 variables, excluding the class labels. The variables and brief descriptions are given in Table 4.2.

Table 4.2: Variables and Descriptions

NUMBER OF DAYS: The number of days before the next generator refuelling.
RUNNING TIME: Total number of working hours of the generator before the next refuelling.
RUNNING TIME PER DAY: The number of working hours per day of the generator.
CONSUMPTION HIS: Total fuel consumed within a specific period before the next refuelling.
CONSUMPTION RATE: The number of litres consumed per hour by the generator.
CURRENT HOUR METER GE1: Hourly meter reading of the generator.
MAXIMUM CONSUMPTION PER DAY: Maximum fuel the generator can consume in a day.
PREVIOUS HOUR METER G1: Previous meter reading of the generator.
DAILY CONSUMPTION QTE: Quantity of fuel consumed daily, based on the generator's hourly consumption rate and daily working hours.
FUEL FOUND QTE: Quantity of fuel inside the generator tank before refuelling.
CONSUMED BTN VISIT PER DAY: (QTE CONSUMED BTN VISIT) / (NUMBER OF DAYS).

4.2.2 Data Augmentation. It is well known that deep learning models such as GANs rely to a great extent on large volumes of data; hence, data augmentation was carried out on the 5902 observations available after cleaning. The augmentation procedure uses the standard deviation of each feature to generate additional records: for each row, a uniformly distributed random number between 0 and the standard deviation of a feature is added to that feature's value. The procedure stores these perturbed rows as dictionaries while iterating over the rows of the dataset with a for loop, and then appends the newly generated records to the existing ones, producing a much larger dataset. The augmented data contain a total of 187281 records with the existing features. The study first examines the initial dataset of 5902 records, and then the dataset after augmentation is performed to increase its size, in order to see whether an improvement in the trained model can be observed.
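The augmentation step just described can be sketched in Python roughly as follows. This is an illustrative reconstruction of the procedure rather than the study's exact code; the use of pandas, the column names and the number of augmented copies are assumptions:

import numpy as np
import pandas as pd

def augment(df, copies=5, seed=0):
    # df: DataFrame of numeric features (the cleaned records).
    # For every existing row, create `copies` new rows by adding Uniform(0, std)
    # noise to each feature, mirroring the procedure described above.
    rng = np.random.default_rng(seed)
    stds = df.std()
    new_rows = []
    for _ in range(copies):
        for _, row in df.iterrows():
            new_rows.append({col: row[col] + rng.uniform(0, stds[col]) for col in df.columns})
    return pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)

# Tiny example with two hypothetical features:
small = pd.DataFrame({"RUNNING_TIME_PER_DAY": [10.0, 26.5, 8.2],
                      "CONSUMPTION_RATE": [3.1, 7.4, 2.9]})
print(augment(small).shape)   # (3 + 5 * 3, 2) = (18, 2)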
4.3 Results and Discussion

4.3.1 Feature Importance. The feature importance analysis is used to show the order of relevance of the various features constituting the dataset. Generally, it is necessary to determine which feature has the maximum influence on the output; in this case, which variable is of utmost importance in classifying the data points as either anomalous or normal. The random forest classifier was utilised for this task, and it can be seen from Figure 4.1 that the Running Time Per Day shows the maximum importance among all the features, followed by the Daily Consumption within a Period, Running Time, Consumption HIS, Maximum Consumption per Day and Daily Consumed Quantity between Visits, in that order. Furthermore, the least important of all the features is the Total Quantity (QTE) Left.

Figure 4.1: Feature Importance

4.3.2 Anomaly Plots. Figure 4.2b is a scatter plot showing the distribution of the running time per day of the generating plants at the base stations of the TeleInfra telecommunication company. It can be seen clearly that certain data points lie above the threshold of 24 hours and can thus be referred to as anomalies. This indicates that there are anomalies in the fuel consumption with respect to the running time per day. Also, Figure 4.2a shows that some data points are clearly outlying when compared to other observations. Both plots are used to confirm that there exist anomalies or irregularities in the dataset.

Figure 4.2: Anomaly Plots of the Running Time Per Day at the Base Stations ((a) and (b): scatter plots of the running time)

4.3.3 Label Classification. After the feature selection, the required target variable was established by classifying the data points into anomalous and normal cases using variables such as the Running Time per Day and the Maximum Consumption Per Day. The normal records number 3829 (64.88%), while the anomalous records number 2073 (35.12%). To further provide a clear demarcation between the normal and the anomalous data points observed in the running time per day, a pie chart is given in Figure 4.3, showing the percentages of anomalous and normal data points: about 64.88% of the data points are suggested as normal, while the other 35.12% are classified as anomalous. This also indicates that there is a class imbalance in the data labels.

Figure 4.3: Anomaly vs Normal Running Time Per Day

4.3.4 Correlation Analysis. Next, we show the extent of linear association between each pair of features in the dataset with the correlation plot (also known as the correlation matrix) in Figure 4.4. The correlation matrix has values between -1 and +1, where -1 and +1 represent a perfect negative and a perfect positive correlation respectively. Examining the correlation matrix shows, for instance, that the Consumption HIS and the Running Time have a high correlation of 0.83, a strong positive correlation indicating that the higher the Running Time, the greater the consumption. Thus, these features are relevant in studying the fuel consumption pattern and hence detecting anomalies at the base stations.

Figure 4.4: Correlation Matrix of the Association between Features

4.3.5 Training Losses of Initial Dataset. The training losses shown in Figure 4.5 are used to show how the discriminator and generator models converge. As recommended in previous research, the Adam optimizer was used for the generator network, while the stochastic gradient descent optimizer was used for the discriminator network. Furthermore, the generator network contains five dense layers, all with the tanh activation function, whereas the discriminator network is composed of six dense layers, each accompanied by a dropout layer that helps to avoid overfitting; the discriminator uses the relu activation function throughout, with a sigmoid activation on its final layer (an illustrative code sketch of this kind of architecture is given at the end of this discussion). It can be seen from Figure 4.5a that the losses of both the generator and the discriminator fluctuate, and the convergence pattern is not really appreciable. The final loss for the discriminator was 0.110683, while that of the generator was 0.037645.

4.3.6 Training Losses of Augmented Dataset. Figure 4.5b shows the training losses after augmentation. The losses for both the generator and the discriminator converge towards zero by the end of the training, implying that the GANs model approached optimality. Thus, it can be said that the data augmentation improved the model training process.
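For illustration only, a generator and discriminator of the kind described in Section 4.3.5 could be specified in Keras as sketched below; the layer widths, dropout rate, latent dimension and number of input features are assumptions, since the study does not report them:

import tensorflow as tf

latent_dim, n_features = 32, 10   # illustrative sizes only

# Generator: five dense layers with the tanh activation function
generator = tf.keras.Sequential([
    tf.keras.Input(shape=(latent_dim,)),
    tf.keras.layers.Dense(64, activation="tanh"),
    tf.keras.layers.Dense(64, activation="tanh"),
    tf.keras.layers.Dense(64, activation="tanh"),
    tf.keras.layers.Dense(64, activation="tanh"),
    tf.keras.layers.Dense(n_features, activation="tanh"),
])

# Discriminator: six dense layers, relu activations with dropout after each
# hidden layer, and a sigmoid on the final layer
def dense_dropout(units, rate=0.3):
    return [tf.keras.layers.Dense(units, activation="relu"),
            tf.keras.layers.Dropout(rate)]

discriminator = tf.keras.Sequential(
    [tf.keras.Input(shape=(n_features,))]
    + dense_dropout(128) + dense_dropout(64) + dense_dropout(64)
    + dense_dropout(32) + dense_dropout(16)
    + [tf.keras.layers.Dense(1, activation="sigmoid")]
)

g_optimizer = tf.keras.optimizers.Adam()   # Adam for the generator
d_optimizer = tf.keras.optimizers.SGD()    # stochastic gradient descent for the discriminator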
4.3.7 Model Evaluation. Model evaluation further determines the performance of our model using evaluation metrics such as the accuracy score, F1 score, recall and precision. The accuracy score is about 66.45% for the initial dataset and about 98.99% for the augmented dataset, as shown in Tables 4.3 and 4.4 respectively.

Table 4.3: Model Evaluation of Initial Dataset

  Measure          Value
  Accuracy Score   0.6645
  Precision        0.5455
  Recall           0.0151
  F1 Score         0.0294

Table 4.4: Model Evaluation of Augmented Dataset

  Measure          Value
  Accuracy Score   0.9899
  Precision        0.78538
  Recall           0.7966
  F1 Score         0.6457

Figure 4.5: Training Losses of Datasets ((a) Initial Dataset, (b) Augmented Dataset)

4.3.8 ROC Curves of Dataset. The ROC curve of the initial dataset, shown in Figure 4.6a, is essentially a straight line, showing that the model is not well fitted, whereas Figure 4.6b shows that the area covered for the augmented dataset is about 0.74.

Figure 4.6: ROC Curves of Dataset ((a) Initial Dataset, (b) Augmented Dataset)

4.3.9 Confusion Matrix. The confusion matrix shows the numbers of True Positives, False Positives, True Negatives and False Negatives produced by our model. It is key to determining the accuracy and the other metrics associated with the model. Figure 4.7a shows the confusion matrix of the initial dataset, while Figure 4.7b shows that of the augmented dataset.

Figure 4.7: Confusion Matrices of Dataset ((a) Initial Dataset, (b) Augmented Dataset)

5. Conclusion and Recommendation

5.1 Conclusion

The irregularities in power supply in Cameroon have prompted TeleInfra, a telecommunication company, to resort to alternative sources of power such as solar panels and generators. In the process of generating power for its operations, particularly using generators, the company observed anomalies in the fuel consumption patterns, identified by examining variables such as the running time per day and the consumption rate per day. The dataset covers roughly one year and was collected on the different features described earlier. Mulongo et al. (2020) studied it using machine learning approaches, namely K-Nearest Neighbours, Logistic Regression, the Multilayer Perceptron and Support Vector Machines. The purpose of this study was to apply a deep learning framework with the objective of examining whether the accuracy of the model could be improved, and so the study used the Generative Adversarial Network framework to model the dataset. It was initially observed that the model did not perform well, which was conjectured to be due to the small size of the dataset. Hence, data augmentation was performed on the initial dataset to generate more data. Following this, the generator and discriminator models were trained on the larger dataset and an accuracy of 0.9899 was achieved. The loss convergence in the two cases also suggested that the augmented dataset improved the accuracy of the model. However, exceptional values were not achieved for the other metrics; this could be a result of the implementation not being fully refined, for instance through further hyperparameter tuning, which was limited by the time constraint.
5.2 Recommendation

The Generative Adversarial Network framework can be used for carrying out anomaly detection tasks, as it yielded improved accuracy compared with that of Mulongo et al. (2020), which was about 96.1%. In addition, further research on anomaly detection can be carried out with GANs, for instance by adjusting the hyperparameters and other features; another interesting direction is the use of GAN ensembles for anomaly detection.

References

Alan Adolphson, Steven Sperber, and Marvin Tretkoff, editors. p-adic Methods in Number Theory and Algebraic Geometry. Number 133 in Contemporary Mathematics. American Mathematical Society, Providence, RI, 1992.

Laith Alzubaidi, Jinglan Zhang, Amjad J. Humaidi, Ayad Al-Dujaili, Ye Duan, Omran Al-Shamma, J. Santamaría, Mohammed A. Fadhel, Muthana Al-Amidie, and Laith Farhan. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data, 8(53):84–90, March 2021. doi: 10.1186/s40537-021-00444-8. URL https://doi.org/10.1186/s40537-021-00444-8.

O. Awodele, J. Akinjobi, and J. E. T. Akinsola. A framework for web based detection of journal entries frauds using data mining algorithm. International Journal of Computer Trends and Technology (IJCTT), 51(1), 2017.

Albert Ayang, Paul Ekam, Bossou Videme, and Jean Temga. Power consumption: Base stations of telecommunication in sahel zone of cameroon: Typology based on the power consumption—model and energy savings. Journal of Energy, 2016:1–15, 01 2016. doi: 10.1155/2016/3161060.

Alan Beardon. From problem solving to research, 2006. Unpublished manuscript.

Peter Bruce, Andrew Bruce, and Peter Gedeck. Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python. O'Reilly Media, Sebastopol, California, 2nd edition, 2020. ISBN 978-038093-3-4.

Mashrur Chowdhury, Amy Apon, and Kakan Dey. Data Analytics for Intelligent Transportation Systems. Elsevier, Amsterdam, Netherlands, 2017. ISBN 978-038093-3-4.

Matthew Davey. Error-correction using Low-Density Parity-Check Codes. PhD thesis, University of Cambridge, 1999.

Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. CoRR, abs/1605.09782, 2016. URL http://arxiv.org/abs/1605.09782.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. URL http://www.deeplearningbook.org.

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014. URL https://arxiv.org/abs/1406.2661.

Simon Haykin. Neural Networks and Learning Machines. Pearson Education, Upper Saddle River, New Jersey, 3rd edition, 2009. ISBN 978-0-13-147139-9.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. doi: 10.1109/CVPR.2016.90.

Iztok Humar, Xiaohu Ge, Lin Xiang, Minho Jo, Min Chen, and Jing Zhang. Rethinking energy efficiency models of cellular networks with embodied energy. IEEE Network, 25(2):40–49, 2011.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, May 2017. ISSN 0001-0782. doi: 10.1145/3065386. URL https://doi.org/10.1145/3065386.

Leslie Lamport. LaTeX: A Document Preparation System. Addison-Wesley, 1986.
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 12 1989. ISSN 0899-7667. doi: 10.1162/neco.1989.1.4.541. URL https://doi.org/10.1162/neco.1989.1.4.541.

Xiaoguang Li. Research on the development and applications of artificial neural networks. Applied Mechanics and Materials, 556–562:6011–6014, 2014.

D. J. C. MacKay and R. M. Neal. Good codes based on very sparse matrices. Available from www.inference.phy.cam.ac.uk, 1995.

David MacKay. Statistical testing of high precision digitisers. Technical Report 3971, Royal Signals and Radar Establishment, Malvern, Worcester, WR14 3PS, 1986a.

David MacKay. A free energy minimization framework for inference problems in modulo 2 arithmetic. In B. Preneel, editor, Fast Software Encryption (Proceedings of 1994 K.U. Leuven Workshop on Cryptographic Algorithms), number 1008 in Lecture Notes in Computer Science, pages 179–195. Springer, 1995b.

Federico Di Mattia, Paolo Galeone, Michele De Simoni, and Emanuele Ghelfi. A survey on GANs for anomaly detection. CoRR, abs/1906.11632, 2019. URL http://arxiv.org/abs/1906.11632.

John McCarthy. What is artificial intelligence?, 1997. URL http://www-formal.stanford.edu/jmc/whatisai/whatisai.html. Last accessed: 28 April 2022.

Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997. ISBN 978-0-07-042807-2.

Erasmus Muh, Sofiane Amara, and Fouzi Tabet. Sustainable energy policies in cameroon: A holistic overview. Renewable and Sustainable Energy Reviews, 82:3420–3429, 2018. ISSN 1364-0321. doi: 10.1016/j.rser.2017.10.049. URL https://www.sciencedirect.com/science/article/pii/S1364032117314168.

Jecinta Mulongo, Marcellin Atemkeng, Theophilus Ansah-Narh, Rockefeller Rockefeller, Gabin Maxime Nguegnang, and Marco Andrea Garuti. Anomaly detection in power generation plants using machine learning and neural networks. Applied Artificial Intelligence, 34(1):64–79, 2020. doi: 10.1080/08839514.2019.1691839. URL https://doi.org/10.1080/08839514.2019.1691839.

R. M. Neal. Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113, 1992. doi: 10.1016/0004-3702(92)90065-6.

Chigozie Enyinna Nwankpa, Winifred Ijomah, Anthony Gachagan, and Stephen Marshall. Activation functions: Comparison of trends in practice and research for deep learning. CoRR, abs/1811.03378, 2018. URL http://arxiv.org/abs/1811.03378.

Keiron O'Shea and Ryan Nash. An introduction to convolutional neural networks. CoRR, abs/1511.08458, 2015. URL http://arxiv.org/abs/1511.08458.

Guansong Pang, Chunhua Shen, Longbing Cao, and Anton van den Hengel. Deep learning for anomaly detection: A review. CoRR, abs/2007.02500, 2020. URL https://arxiv.org/abs/2007.02500.

S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson Series, New York City, New York, 4th edition, 2021. ISBN 978-038093-3-4.

Arthur L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210–229, 1959. doi: 10.1147/rd.33.0210.

Thomas Schlegl, Philipp Seeböck, Sebastian M. Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. CoRR, abs/1703.05921, 2017. URL http://arxiv.org/abs/1703.05921.
Claude Shannon. A mathematical theory of communication. Bell Sys. Tech. J., 27:379–423, 623–656, 1948.

Claude Shannon. The best detection of pulses. In N. J. A. Sloane and A. D. Wyner, editors, Collected Papers of Claude Shannon, pages 148–150. IEEE Press, New York, 1993.

Siddharth Sharma, Simone Sharma, and Anidhya Athaiya. Activation functions in neural networks. International Journal of Engineering, Applied Sciences and Technology, 4(2):310–316, 2020.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2015. URL https://arxiv.org/pdf/1409.1556.

Simplilearn. Discover the differences between AI vs. machine learning vs. deep learning, 2022. URL https://www.simplilearn.com/tutorials/artificial-intelligence-tutorial/ai-vs-machine-learning-vs-deep-learning. [Online; accessed April 28, 2022].

Richard S. Sutton. John McCarthy's definition of intelligence. Journal of Artificial General Intelligence, 11(2):66–67, 2019.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions, 2014. URL https://arxiv.org/abs/1409.4842.

TeleInfra. The company, TeleInfra, 2022. URL http://www.art.cm/en/node/3111. [Online; accessed May 23, 2022].

Web12. Commercial mobile robot simulation software. Webots, www.cyberbotics.com. Accessed April 2013.

Wik12. Black Scholes. Wikipedia, the Free Encyclopedia, http://en.wikipedia.org/wiki/Black%E2%80%93Scholes. Accessed April 2012.