Time Series Data Augmentation Improving Anomaly
Detection with Generative Adversarial Networks:
Application in Power Generating Plants
By
Toheeb Aduramomi JIMOH (jimoh.toheeb@aims.ac.rw)
African Institute for Mathematical Sciences (AIMS), Rwanda
Supervised by: Dr Marcellin Atemkeng
Rhodes University, South Africa
June 2022
AN ESSAY PRESENTED TO AIMS RWANDA IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE AWARD OF
MASTER OF SCIENCE IN MATHEMATICAL SCIENCES
DECLARATION
This work was carried out at AIMS Rwanda in partial fulfilment of the requirements for a Master
of Science Degree.
I hereby declare that except where due acknowledgement is made, this work has never been
presented wholly or in part for the award of a degree at AIMS Rwanda or any other University.
Student: Toheeb Aduramomi JIMOH
ACKNOWLEDGEMENTS
I acknowledge the relentless efforts of my supervisor through his painstaking supervision and
support.
DEDICATION
The dedication of this research work contains three intertwined parts that will, however, be split
thus:
To Allah, the uncreated Creator of all that exists.
To my loving and conscientious father—for always believing, trusting and investing in me
continually, and
To my precious mother—may your soul rest more in peace; I always wish we both knew each
other more.
Abstract
Anomaly detection is a crucial task that involves investigating data points that do not conform
to a specific pattern. It is mostly used for fraud detection and other related activities. Different
methods have been used for anomaly detection tasks, however, recent studies have shown that
the deep learning framework would be more suitable since it is capable of detecting and learning
complex patterns from a dataset. As a result, this study utilises the generative adversarial framework for anomaly detection in power generation plants. The study data were obtained from the
power consumption record of TeleInfra Telecom Company in Cameroon, which was collected as
a result of observed irregularities in the fuel consumption pattern of the generating sets at their
base stations. The Telecom company had to resort to using power generating sets for operations
due to irregular power supply in the country.
Splitting the data points into anomalous and normal points using some variables, 64.88% were
classified as normal while 35.12% were classified as anomalous. The feature importance analysis
using the random forest classifier revealed that the Running Time Per Day has maximum relative
importance in determining our output. Also, the generative adversarial network model was trained
before and after carrying out data augmentation with the goal of increasing the data size. The
generator model consists of 5 dense layers with the tanh activation function. The discriminator
contains 6 dense layers, each followed by a dropout layer to avoid overfitting; it uses the ReLU
activation function in every layer except the final one, which uses a sigmoid. The accuracy of the model was 98.99%
after data augmentation and 66.45% before augmentation. This shows that the model classified
the data points into normal and anomalous almost perfectly, and that the augmented data
improved the anomaly detection performance of the GAN. Hence, a GAN trained on a large
dataset is recommended for carrying out anomaly detection tasks.
KEYWORDS: Generative modelling, generative adversarial networks, zero-sum game, anomaly
detection, power generation plants, telecommunication.
Contents

Declaration
Acknowledgements
Dedication
Abstract
1 Introduction
  1.1 Background of the Study
  1.2 Statement of Problem
  1.3 Motivation
  1.4 Aims and Objectives
  1.5 Structure of the Research Work
2 Literature Review
  2.1 Artificial Intelligence
  2.2 Machine Learning
  2.3 Artificial Neural Networks
  2.4 Activation Functions
  2.5 Back Propagation Method
  2.6 Convolutional Neural Network
  2.7 CNN Architectures
  2.8 Concluding Section
3 Generative Adversarial Networks for Anomaly Detection
  3.1 Generative Adversarial Networks (GANs)
  3.2 Building a GAN
  3.3 Mathematical Framework of the GANs
  3.4 The GANs Algorithm
  3.5 GANs for Anomaly Detection
  3.6 Pros and Cons of the GANs Framework
  3.7 Concluding Section
4 Methodology, Results and Discussion
  4.1 Performance Evaluation
  4.2 Data Collection, Description & Augmentation
  4.3 Results and Discussion
5 Conclusion and Recommendation
  5.1 Conclusion
  5.2 Recommendation
References
1. Introduction
1.1 Background of the Study
About 3% of the world’s electrical energy is utilised by the Information Communication Technologies (ICT) companies (Humar et al., 2011). The telecommunication industry is one of the
dominant ICT industries that rely on a huge amount of electric power supply for their operations,
and thus it is indispensable in their daily dealings. However, its availability in underdeveloped
countries, particularly in Africa, has been a constant source of contention. Despite the industry’s
rise through the creation of base stations, they have had to turn to alternative energy sources
such as the use of gasoline or diesel with generators, and the use of solar power, to name a few.
TeleInfra, a telecommunication company established in Cameroon, is one such company grappling
with these challenges due to the state of power supply in the country. The telecommunication
equipment that is fixed in different parts of the rural and urban areas in Cameroon requires
an uninterrupted supply of electricity to achieve the goal of establishing strong and seamless
communication channels in the country, however, the country’s electrical generation is mostly
based on hydropower (73%), with perpetual power interruptions, particularly during the dry
seasons when water levels are low (Muh et al., 2018). Moreover, it is also noted that electricity
is available to only about 14 per cent of rural residents and 65–88 per cent of urban residents.
The consequence of diversifying to alternative sources of power, particularly the use of generators,
was another challenge: irregularities or anomalies in fuel consumption at the base stations,
reflected in the observed high consumption rate of the power generation plants. Previous research such as Ayang et al. (2016) shows that factors like mismanagement of both
the air-conditioning and lighting systems, as well as the type of buildings, increased the power
consumption rate at each of the base stations.
Anomalies are data points that do not conform to an expected pattern in a dataset
and are often described as a different distribution within a distribution. They can be present in a
dataset through malicious activities like the pilferage of fuels from a power generation plant, and
frauds in the utilisation of credit cards, among others. Anomaly detection involves the task of
finding these patterns that are different from the conceived normal observation in a dataset. It is
applied in various industries such as manufacturing, medical imaging, CCTV, and social networks,
to mention a few, and is commonly used for fraud detection in either credit cards, health care or
insurance; detecting intrusion for cyber-security, and so on.
Anomaly detection is a very vital concept in data science as it forms a potent nexus between
statistics and data science (Bruce et al., 2020). Basic anomaly detection in a dataset using the
regression approach, which is typically meant for data analysis and model improvement, is carried out
through diagnostic plots that help identify extreme observations or anomalies. One such
plot is the quantile-quantile (Q-Q) plot, which assesses at a glance whether a dataset follows a
theoretical distribution, typically the Gaussian or normal distribution, and highlights
points that could constitute deviations. Another simple approach
to identifying anomalies in a dataset is by using the interquartile range, which is a concept in
statistics that measures dispersion or variability by dividing the dataset into quartiles.
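As a small illustration of the interquartile-range rule, a minimal Python sketch is given below; the fuel readings are hypothetical and are not taken from the study data.

import numpy as np

# Hypothetical daily fuel-consumption readings (litres) at one base station.
fuel = np.array([110, 120, 115, 118, 122, 119, 320, 117, 121, 116])

q1, q3 = np.percentile(fuel, [25, 75])          # first and third quartiles
iqr = q3 - q1                                   # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the usual 1.5 * IQR fences

anomalies = fuel[(fuel < lower) | (fuel > upper)]
print(anomalies)                                # the 320-litre reading is flagged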
However, with the continuous emergence of large volumes of data in many dominant industries like
health, finance, and so on, advanced methods of machine learning and deep learning have been
devised for detecting anomalies in these datasets, using different types of approaches,
that is, the supervised and the unsupervised learning approaches. For instance, Mulongo et al.
(2020) utilised four supervised machine learning approaches —logistic regression, support vector
machines, k-Nearest Neighbours and the Multilayer Perceptron —for detecting anomalies in power
generation plants of a telecommunication company with the goal of comparing their performances.
Furthermore, according to Pang et al. (2020), deep learning approaches such as Generative
Adversarial Networks (GANs) have demonstrated incredible competency in learning expressive representations of complicated data such as high-dimensional data, temporal data, geographical
data, and graph data in recent years, thus pushing the boundaries of many learning tasks. As
a result, a deep learning approach is seen as a better way of learning complex patterns in huge
datasets and thus has the tendency of generating high performance in terms of its accuracy.
1.2 Statement of Problem
TeleInfra telecommunication company is faced with the challenge of unaccounted high fuel consumption for their operations at the base stations. Since they solely depend on generating plants
as their major source of power supply, they necessarily have to continually refill these generators
and this is done manually. Such activities are believed to have resulted in possible cases of
fuel pilferage, given the observed anomalies in fuel consumption. As a result, it is essential
to investigate the likely factors contributing to the anomalies by collecting data on fuel consumption at each of the base stations for the purpose of minimizing the costs of operation. Moreover,
according to Goodfellow et al. (2014), GANs and their training framework have been effectively
employed to model both complex and high-dimensional real-world data distributions across time,
and thus, their characteristics suggest that they can be used for anomaly detection.
1.3 Motivation
The overarching motivation for carrying out this research work is to utilise artificial intelligence, via a deep learning approach, in investigating means of reducing the cost of operations
in one of the vital industries that is essential for the advancement of technology in the world
—the telecommunication industry. Research shows that different methods have been used in
carrying out anomaly detection tasks. As stated previously, Mulongo et al. (2020) utilised four
machine learning techniques in carrying out anomaly detection tasks and thereafter compared
their performances. However, it is known that an advanced learning approach such as GANs
is capable of identifying complex patterns in a dataset, and as such, it is desired to explore its
usage in this task and possibly generate improved accuracy.
1.4 Aims and Objectives
The fundamental aim of this research is to build a model for detecting anomalies or irregularities
in the dataset by employing a supervised deep-learning technique through GANs. The other
objectives of the study are as follows:
• To use feature importance analysis and the random forest classifier to analyse the primary
features that determine the high fuel consumption in the base station.
• To divide the data into anomalous and normal data points and thereafter train the model
with the normal datapoints only.
• To generate the confusion matrix and the receiver operating characteristic (ROC) curve for
model validation purpose.
• To compare the results before and after data augmentation.
1.5 Structure of the Research Work
The first chapter introduces the study and specifies the problem to be addressed, the
study's aims and objectives, and the motivation for conducting the research. The second
chapter will concentrate on key topics that are crucial to this research. Artificial intelligence,
machine learning, deep learning, artificial neural networks, activation functions, convolutional
neural networks, and deep learning architectures such as the AlexNet (Krizhevsky et al., 2017),
LeNet (LeCun et al., 1989), Inception (GoogLeNet) (Szegedy et al., 2014), and others are all
included.
The third chapter presents how to employ GANs to find anomalies, together with an explicit explanation of the
generative adversarial framework. Moreover, the numerous methodologies used in
the analysis, as well as the results and their explanations, are presented in the fourth chapter. The
conclusion and recommendation will be found in the fifth chapter, which is the final chapter.
2. Literature Review
This chapter attempts to utilise the theoretical method in exploring related literature and concepts
in respect of our given study. It vastly encompasses several concepts constituting AI, machine
learning, deep learning, as well as other related constituents. Moreover, it gives an explicit
explanation as required in each section with the minimal mathematical foundation since the
required ones would be explored in the succeeding chapter.
2.1 Artificial Intelligence
One of the achievements of technological advancement in the world is Artificial Intelligence
(AI). With it, there has been more ease of carrying out enormous tasks by utilising machines
—programming them to think and function like humans—specifically by exhibiting a high level
of intelligence that was known to be attributed to humans only.
The highly sought-after term AI was first coined at the Dartmouth Conference by John McCarthy,
a professor of Computer Science at Stanford University, who defined it as the science and
engineering of building intelligent systems (McCarthy, 1997). More explicitly, this implies the
theory and establishment of systems that can carry out enormous tasks that are usually attributed
to humans since they require a high level of intelligence which humans exhibit. And to be more
specific, what we term ”intelligence” was enunciated by the same pioneer of the term as the
computational part of the ability to attain goals in the universe. Furthermore, Sutton (2019)
corroborates it by explaining a goal-attaining system as one that can be usefully understood in
respect of its outcomes rather than in respect of its mechanisms.
Some of the tasks that are known to require the human level of intelligence as posited by AI
researchers, Russell and Norvig (2021), are natural language processing, machine learning, automated reasoning, brain imaging, knowledge representation, decision making, etc. Moreover,
among the myriads of concepts related to AI, machine learning and deep learning tend to have
gained the most prominence due to their continuous usage in recent times. The link between these
three major notions is illustrated in Figure 2.1. The figure clearly demonstrates that machine
learning is a sub-field of AI, and that deep learning is a sub-field of machine learning. Machine
learning and deep learning are thus AI sub-fields. In Section 2.2, we will look at each concept’s
application and the literature that supports it.
Figure 2.1: Venn Diagram of AI Components
Source: (Simplilearn, 2022)
2.2 Machine Learning
As mentioned above, machine learning is an important concept that relates to AI and is continually gaining prominence, partly because large volumes of data are generated daily.
Machine learning feeds these enormous datasets into a “machine” and helps it
learn or discover statistically significant patterns, so that predictive models can be built
to determine future outcomes of a specific phenomenon; this also results
in improvement from experience. An informal, old albeit prominent definition of machine
learning was given by Samuel (1959) as a field of study that gives computers the ability to learn
without being explicitly programmed. His idea was from the fact that he wrote a checkers playing
program where the program learned over time and was able to improve from the experience of
identifying the bad and better playing positions. A more modern and encompassing definition was
given by Mitchell (1997) “A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at tasks in T, as measured
by P, improves with experience E.” By this, Tom Mitchell identifies the Experience E as one
derived from carrying out a task repeatedly, and the performance P is regarded as the probability
of the program improving over time bearing in mind its experience.
Generally, different machine learning algorithms are utilised for solving problems depending on
the particular type of learning problem. These algorithms are typically statistical models that
are used in learning or uncovering possible patterns that are embedded in a dataset. There exist
different categorizations of these types, however, we would utilise the three prominent types which
are supervised learning, unsupervised learning and reinforcement learning as given by one
of several books such as that of Russell and Norvig (2021). Also, Figure 2.2 encapsulates the
three types of machine learning. The types of machine learning will thereafter be explained in
subsequent subsections below.
Figure 2.2: Types of Machine Learning
Source: Analytic Steps
2.2.1 Supervised Machine Learning. In supervised machine learning, there is usually a given
output, or better still, we know what the output looks like. Training a model when the outcome is
known for making future predictions subsequently on data with an unknown outcome is referred
to as supervised learning (Bruce et al., 2020). Moreover, supervised machine learning problems
are mainly categorised as regression and classification problems, and sometimes, we might have
structured prediction problems. In regression, the required output is usually continuous; that is,
we are trying to map the features or independent variables to a continuous target or response
variable. Whereas, in classification problems, the task is to map the features or inputs into a
discrete categorical variable. The usual workflow of the supervised learning approach is shown in
Figure 2.3. It indicates that the original data is divided into training and test data, that the model
is trained on the training data, and that the test data is used to determine whether the model
performs as expected before deployment. Linear regression, logistic regression, random forest,
decision trees, and support vector machines are some of the most extensively used supervised
learning techniques.
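To make the workflow in Figure 2.3 concrete, the following minimal scikit-learn sketch splits a synthetic labelled dataset into training and test data, trains a random forest classifier, and evaluates it before any deployment; the dataset and the settings used here are illustrative assumptions, not the study data.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# A synthetic stand-in for a labelled (normal vs anomalous) dataset.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# Divide the original data into training and test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train on the training data, then check performance on the unseen test data.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))

# The same classifier also exposes relative feature importances.
print(clf.feature_importances_)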
2.2.2 Unsupervised Machine Learning. Contrary to the supervised learning approach, here,
we have little to no idea of what our output would look like. Moreover, we usually do not
distinguish between training and test data. Furthermore, as noted by Chowdhury et al. (2017), the
unsupervised learning approach is dependent only on the underlying unlabeled dataset where the
task is typically to identify complex patterns based on the logic provided in the algorithm, rather
than carrying out prediction based on some known input-output pairs. Clustering and association
are two problems associated with the unsupervised learning approach. In clustering, the dataset
is grouped into different clusters based on similar features of the data points. Whereas, the
association tries to identify trends in the data. Some widely used unsupervised machine learning
algorithms are K-Means Clustering, hierarchical clustering, and so on.
Figure 2.3: Workflow of Supervised Machine Learning (Awodele et al., 2017)
2.2.3 Reinforcement Learning. Reinforcement learning differs from the other two approaches
in that it entails creating a system that improves its performance as a result of its interactions
with its surroundings. A sequence of feedback loops governs the learning process in reinforcement
learning. Video games, resource management, and industrial simulation, to mention a few, are
applications of reinforcement learning.
2.3 Artificial Neural Networks
As previously said, AI means developing intelligent systems capable of performing jobs traditionally
performed by humans. To aid in the processing of complex tasks, these systems require features
that are closely related to the human brain; thus, Artificial Neural Networks (ANNs), commonly
referred to as ”neural networks,” are clearly inspired by the human brain, as the brain is typically
composed of billions of interconnected neurons that work together. More formally, Li (2014)
noted that the inspiration for ANNs comes from the central nervous system, and as such, it is
usually composed of artificial neurons or processing elements that are connected to generate an
entire system that would function just like the biological neural network. ANNs are referred to
as networks because they are made up of various functions that gather information by detecting
links and patterns in data using previous experiences, which are referred to as training examples
in most literature (Goodfellow et al., 2016).
Generally, a neural network can be thought of as a machine that was specifically established to
model how the brain carries out a particular task of interest. ANNs mainly consist of nodes
(neurons) that function together in a distributed style, learning from the input for the purpose of
optimising the resulting output (O’Shea and Nash, 2015). Previous research shows that the brain
is composed of about 10^11 neurons interconnected with each other. A neural network corresponds
to the brain in the sense that it is through a learning process that the network acquires knowledge
from its surroundings; also, the strength of interneuron connections, that is the association of the
computing cells of the neural network often referred to as “neurons”, are utilized in storing the
acquired knowledge (Haykin, 2009). One main appealing feature of ANNs according to Sharma
et al. (2020), is the ability to modify their behaviour in response to changing system variables. A
diagrammatic representation of what a neural network looks like is given in Figure 2.4. It can be
seen that the diagram is composed of an input layer, and the layer is composed of nodes, which
in turn have associated weights attached to them and are usually multiplied with each of the n
inputs for information processing. It is further composed of hidden layers that perform most of
the computations required by the neural network, as well as the output layer with n outputs, that
reveal the result of the predictions of the system. A few of the many applications of a neural
network include real-time translation, facial recognition, forecasting, and so on.
Figure 2.4: A Neural Network (Sharma et al., 2020)
2.3.1 Biological Neurons. A biological neuron consists of basic components such as the nucleus,
dendrites, cell body (soma), axon and the synapse—the junction that enables transmission of a
signal between the dendrites and axons. The soma or cell body is the main structural part of
the neuron that carries the nucleus; the dendrites are tree-like structural networks that are made
up of nerve fibres connected to the cell body, and are used for carrying signals or information;
also, the axon is a single but long connection that extends from the soma body, also carrying
signals from the neuron. The synapse is also an important part because the interconnection
of the neurons is possible via the synaptic connection, and this also facilitates the exchange of
action potential. It is present on the dendrites as well as on the axons. Figure 2.5 below is a
diagrammatic representation of a typical biological neuron.
Figure 2.5: A Biological Neuron
Source: Lumen Learning
2.3.2 Artificial Neurons. All of the functional and structural components constituting the
biological neurons as highlighted in the section above are replicated in the artificial neuron since
the latter was inspired by the former. Figure 2.6 is a diagram of an artificial neuron. In the
diagram, the inputs to the neuron are x_1, x_2 and x_3, the lines are the links that connect the
neurons, and w_1, w_2 and w_3 are the weights associated with each connecting link, which alter
or change the input signal. Also, x_net is the net input to the whole artificial neuron, expressed
as the summation of the products of the individual inputs with the corresponding weights as
seen in Equation (2.3.1). Furthermore, f(x_net) denotes the activation function, and this is what
determines the output of the node. This concept would be explained more explicitly in subsequent
sections.
x_{net} = \sum_{i=1}^{n} x_i w_i    (2.3.1)
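As a small numerical illustration of Equation (2.3.1), the sketch below computes the net input of an artificial neuron with three inputs and applies an example activation function; the numbers are arbitrary.

import numpy as np

x = np.array([0.5, -1.2, 3.0])    # inputs x_1, x_2, x_3
w = np.array([0.8, 0.1, -0.4])    # weights w_1, w_2, w_3

x_net = np.dot(x, w)              # x_net = sum_i x_i * w_i, Equation (2.3.1)
output = np.tanh(x_net)           # f(x_net): an example activation function
print(x_net, output)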
2.3.3 Relationship between the Biological Neuron and Artificial Neuron. Since the artificial neuron was formed bearing in mind the biological neuron, it is obvious that the different
components embedded in the former will be carrying out similar functions with a corresponding
component in the latter. For instance, the soma or the cell body accommodating the nucleus
in the biological neuron is the same as the net input (xnet ); the axon is the same as the output
in the biological neuron and the artificial neuron respectively; the whole cell is the same as the
neuron or node, and the dendrites or synapse of the biological neuron is the same as the weights
in the artificial neuron.
Figure 2.6: An Artificial Neuron.
2.4 Activation Functions
As explained briefly in Section 2.3.2, an activation function of a neural network is what determines the output of the node. It is one of the parameters that is utilised in neural network
computation, alongside some other hyperparameters, and it typically informs us of how the input
is manipulated to generate the output. They are often referred to as transfer functions in some
literature. Each node or neuron has an activation function associated with it, and it is one of
the important functions constituting a neural network that aids in improving generalisation and
learning (Nwankpa et al., 2018). According to Sharma et al. (2020), a neural network’s prediction
accuracy is determined by the number of input layers and, more crucially, the type of activation
function. As a result, it’s clear that picking the right activation function is an excellent way to
improve deep learning algorithms. Activation functions can be linear or non-linear, depending
on the mathematical function they are linked with. Albeit there is no specific statement about
the type of activation function to be used, non-linear functions are commonly used as activation
functions since the errors associated with the feeding of input in the real world usually possess
non-linear features. Moreover, the boundary associated with a linear activation function is linear
and as such, the neural network’s adaptation will only be limited to linear changes in the input.
Generally, activation functions are utilised in transforming the linear signal that results from
learning in the neural network into a non-linear signal, which is then passed on to subsequent
layers in the system. It is worth noting that the initial output before the application of
activation functions is usually linear by default; hence, non-linear activation functions are usually
required for the transformation. Furthermore, the position of an activation function in a deep
learning architecture usually determines its function in the system. When it is placed after the
hidden layers, it functions by converting the learned mapping of the inputs and the corresponding
weights into non-linear form; when placed in the output layer, it performs prediction. There
are various mathematical functions used in generating the different activation functions, and the
activation functions are dependent on them. Some of the main activation functions are the
Sigmoid activation function, Tanh activation function, ReLU activation function, Leaky ReLU
activation function, and Softmax activation function, to mention a few. The model of a linear
output of a learned mapping is given as the sum of the products of the individual inputs with
their corresponding weights plus the bias as follows
y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \sum_{i=1}^{n} w_i x_i + b    (2.4.1)
From Equation (2.4.1), the x_i's are the individual inputs, the w_i's are the corresponding weights, b is
the associated bias, and y is the linear output.
When an activation function α is applied to this linear output, the non-linear output generated
is thus given as

y = \alpha(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b) = \alpha\left(\sum_{i=1}^{n} w_i x_i + b\right)    (2.4.2)

where α is the activation function.
2.4.1 Sigmoid Activation Function. This is a non-linear activation function that is used
mostly in feed-forward neural networks. It is sometimes referred to as squashing function or
logistic function in some research. It is seen as the most widely-used activation function because
it is a non-linear type since errors mostly exhibit non-linear features as discussed in Sharma et al.
(2020). Its position in the deep learning framework is in the output layer and thus, it is used
for prediction involving binary classification, and so on. It transforms the inputs x_i into having a
range between 0 and 1. Moreover, according to Neal (1992), major advantages of the sigmoid
activation function are that it is mostly utilised in neural networks consisting of only one or two
hidden layers, known as shallow networks, as well as the fact that it is facile to comprehend. Also,
the main concern about this activation function is the issue of vanishing gradient in multilayer
or deep neural networks, which arises when the derivatives become progressively smaller until they
vanish to zero due to repeated differentiation; also, it is a computationally expensive function.
The sigmoid activation can be represented mathematically as
\phi(x) = \frac{1}{1 + e^{-x}}    (2.4.3)
Nwankpa et al. (2018) noted that there are other variants of the sigmoid activation functions
which are the hard sigmoid function, Sigmoid-Weighted Linear Units (SiLU) and the Derivative
of Sigmoid-Weighted Linear Units (dSiLU).
2.4.2 Tanh Activation Function. The hyperbolic tangent activation function commonly referred to as the Tanh function is differentiable and continuous and has values between −1 and
1, unlike that of the sigmoid function. It can be expressed mathematically as
\phi(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}    (2.4.4)
The main benefit of the Tanh activation function is that it yields a zero-centred output, and thus
aids the back-propagation process. However, it has been seen as deficient in addressing
the vanishing gradient concern of the sigmoid function, even though it is preferred
to the latter because it gives better training performance for deep, that is multilayer, neural networks. Furthermore, the Tanh function is used mainly in speech recognition,
natural language processing, as well as for recurrent neural networks. Graphical representations
of the sigmoid and Tanh activation function responses are given in Figure 2.7. Figure 2.7a shows
vividly that the response of the sigmoid activation function ranges between 0 and 1. Figure
2.7b also shows clearly that the hyperbolic tangent (Tanh) activation function response ranges
between −1 and +1.
(a) Sigmoid Activation Function
(b) Tanh Activation Function
Figure 2.7: Sigmoid & Tanh Activation Function Responses
A more computationally efficient and cheaper version of the Tanh activation function used in
deep learning is referred to as the Hard Tanh function and it is represented mathematically by
Equation (2.4.5) as follows

\phi(x) = \begin{cases} 1 & \text{for } x > 1 \\ x & \text{for } -1 \le x \le 1 \\ -1 & \text{for } x < -1 \end{cases}    (2.4.5)
2.4.3 Rectified Linear Unit (ReLU) Activation Function. The Rectified Linear Unit activation function commonly referred to as the ReLu function is also a non-linear activation function
and has a response range between 0 and ∞. It is seen as the most widely used activation function in deep learning applications as it is used in almost all deep learning architectures. It is also
known to be more efficient in deep learning when compared with the sigmoid and Tanh activation
functions, mainly because it does not have all the neurons activated at once. That is, only a certain
subset of neurons is activated at a time (Sharma et al., 2020). It addresses the vanishing
gradient problem observed in the previous two functions by forcing inputs that are less than zero
to zero as can be seen vividly in its mathematical representation given by Equation (2.4.6).
\phi(x) = \max(0, x) = \begin{cases} x_i & \text{if } x_i \ge 0 \\ 0 & \text{if } x_i < 0 \end{cases}    (2.4.6)
Furthermore, it is advantageous in that it yields faster computation since it does not evaluate mathematical expressions like exponentiation and division as seen in the previous functions.
Nwankpa et al. (2018) also noted that it is used in almost all the deep learning architectures such
as AlexNet, VGGNet, GoogleNet, ResNet, and SegNet, to mention a few, due to its reliability
and simplicity. However, an issue that decreases its ability to properly fit data is the fact that
it forces negative inputs to zero, which means the negative inputs are not mapped as required.
Thus, a variant of the ReLu Activation Function known as the Leaky ReLu was established to
address this particular issue of the ReLu.
2.4.4 Leaky Rectified Linear Unit (LReLu) Activation Function. The Leaky Rectified Linear
Unit Activation Function is commonly referred to as the Leaky ReLu, and it is an improvised
version of the ReLu function. It attempts to address its problem of vanishing negative inputs to
zero by rather replacing them with a relatively small linear component of the input variable x,
which is usually 0.01. It encompasses a leak that helps to adjust for negative inputs and thus
increases the range of the function’s response to having values between −∞ and +∞, and it is
computed as represented in Equation (2.4.7) below:
\phi(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \le 0 \end{cases}    (2.4.7)
The main improvement of the LReLu over the ReLu and the Tanh activation functions is in
sparsity and dispersion. In cases where the small linear component of the input variable x that is
usually multiplied with it is not 0.01, then it is referred to as a Randomized Leaky ReLu. Other
variants of the Rectified Linear Unit function are the Parametric Rectified Linear Units (PReLU)
and S-shaped ReLU (SReLU).
2.4.5 Softmax Activation Function. The softmax activation function is obtained by compounding multiple sigmoid functions and as such, it is normally used in multivariate classification
as compared to the sigmoid function that is used in binary classification. As a result, it also has a
response between 0 and 1. It is used for computing the probabilities of each class in a multivariate
classification model and returns that of the target class as the highest probability. Furthermore,
it is found in the output layer of almost all the deep learning architectures like AlexNet, VGGNet,
GoogleNet, ResNet, SegNet, and so on, save for SeNet, which uses the sigmoid function in that position.
The mathematical representation of the function is given with Equation (2.4.8) as follows:
\phi(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}    (2.4.8)
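For concreteness, the activation functions discussed above can be written directly from Equations (2.4.3)-(2.4.8); the following NumPy sketch is a minimal illustration rather than a production implementation.

import numpy as np

def sigmoid(x):                  # Equation (2.4.3); response in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                     # Equation (2.4.4); response in (-1, 1)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):                     # Equation (2.4.6); response in [0, inf)
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):   # Equation (2.4.7); small slope for x <= 0
    return np.where(x > 0, x, alpha * x)

def softmax(x):                  # Equation (2.4.8); outputs sum to 1
    e = np.exp(x - np.max(x))    # subtracting the max improves numerical stability
    return e / e.sum()

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(sigmoid(z), tanh(z), relu(z), leaky_relu(z), softmax(z), sep="\n")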
2.5 Back Propagation Method
This is an optimisation strategy used in computing the gradient of the loss of a neural
network with respect to the weights associated with the individual inputs. It is used when the
predicted output is significantly different from the actual value. It basically involves tracking
back to adjust the weights associated with each input in the neural network.
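A minimal sketch of this idea for a single sigmoid neuron with a squared-error loss is given below; the inputs, target and learning rate are arbitrary assumptions made purely for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.5, -1.2, 3.0])    # inputs
w = np.array([0.8, 0.1, -0.4])    # weights to be adjusted
b, target, lr = 0.0, 1.0, 0.1     # bias, desired output, learning rate

for step in range(50):
    y = sigmoid(np.dot(w, x) + b)          # forward pass
    loss = 0.5 * (y - target) ** 2         # squared-error loss
    # Backward pass (chain rule): d(loss)/dw = (y - target) * y * (1 - y) * x
    grad_pre = (y - target) * y * (1 - y)
    w -= lr * grad_pre * x                 # adjust the weights against the gradient
    b -= lr * grad_pre

print(loss)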
2.6 Convolutional Neural Network
Convolutional neural network commonly referred to as CNN or ConvNet is an ANN that is used
mostly in image analysis albeit it is also used for classification, and according to Krizhevsky et al.
(2017), it is the most commonly employed and famous algorithm in the field of deep learning.
A CNN is known to be a regularized version of a multi-layer perceptron that is composed of
convolutional layers that in turn contain filters used for especially detecting patterns and making
sense of them. The filters of a CNN can be technically thought of as a relatively small matrix
whose rows and columns are usually specified by the user. As noted by O’Shea and Nash (2015),
the main significant difference between CNN and the traditional ANNs is the fact that the former
is mostly utilised in identifying and recognising patterns within images.
Furthermore, CNNs are composed of three types of layers that are stacked together to form
the CNN architecture. These are the convolutional layers, pooling layers and the fully-connected
layers. The three vital benefits of the CNN are parameter sharing, equivariant representations and
sparse interactions (Goodfellow et al., 2016). Also, Alzubaidi et al. (2021) corroborated that
the advantages of using CNN include the weight sharing features that help to lower the number
of trainable network parameters thus allowing the network to become more generic, avoiding
overfitting; ease of large-scale network implementation; and high organisation and reliance of the
model output as a result of the simultaneous learning of the classification layer and the feature
extraction layers. Among the wide range of fields in which CNN is being utilised extensively are
computer vision, face recognition, speech processing, etc. A typical architecture for a CNN for
image classification is shown in Figure 2.8. The CNN elements that will be discussed briefly in
subsequent subsections are padding, max pooling, average pooling, and global pooling.
2.6.1 Padding. Padding is used for addressing possible problems of reduction in the size of
the output that might arise after convolution in CNN, thereby causing loss of information. After
applying padding in a CNN, it makes the input (image) have the same size as the output (image).
2.6.2 Max Pooling. To shrink the feature map, the dimensionality of the representation and the
computational complexity of a CNN through the pooling layers, maximum pooling is a pooling
type that utilises the maximum value of each local cluster of neurons that is contained in the
feature map. It is a constant filtering operation that broadcasts and calculates the maximum
value of a certain region.
2.6.3 Average Pooling. While the maximum pooling uses the maximum value, the average
pooling makes use of the average value of each local cluster of the neurons constituting the
feature map.
Figure 2.8: CNN Architecture for Image Classification (Alzubaidi et al., 2021)
2.6.4 Global Pooling. This acts on all the neurons constituting the feature map.
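The layer types above can be assembled into a small image-classification network; the following Keras sketch, with arbitrary input size and filter counts, shows 'same' padding, max pooling, average pooling, global pooling and a fully-connected softmax output.

import tensorflow as tf

model = tf.keras.Sequential([
    # 'same' padding keeps the spatial size of the output equal to the input.
    tf.keras.layers.Conv2D(16, (3, 3), padding="same", activation="relu",
                           input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),         # max pooling over 2x2 clusters
    tf.keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.AveragePooling2D((2, 2)),     # average pooling
    tf.keras.layers.GlobalAveragePooling2D(),     # global pooling over the feature map
    tf.keras.layers.Dense(10, activation="softmax"),  # fully-connected output layer
])
model.summary()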
2.7 CNN Architectures
There have no doubt been numerous CNN architectures in recent years. The architectures of CNN refer to
the different set-ups made to the neural network for the purpose of having them in a way that
best suits their specific task or kind of input data such as being comprised of images. According
to Alzubaidi et al. (2021), CNN architectures have witnessed various modifications since 1989
till date, and this includes parameter optimizations, regularization, and structural reformulation,
among others. Furthermore, Szegedy et al. (2014) noted that most of the development and
progress the technology world has witnessed so far are not limited to more voluminous datasets
and more complex models, as well as powerful hardware, but also as a result of improved network
architectures, algorithms and new ideas. There are myriads of CNN architectures already, however,
a few prominent ones, namely LeNet-5 (LeCun et al., 1989), AlexNet (Krizhevsky et al., 2017), VGG
16 (Simonya and Zissermanh, 2015), Inception (GoogLeNet) (Szegedy et al., 2014), ResNet (He
et al., 2016), and DenseNet will be discussed in subsequent subsections.
2.7.1 LeNet-5. LeNet-5 is the earliest deep learning architecture for CNN and as such has the
least number of layers—5 layers—consisting of 3 fully-connected layers and 2 convolutional layers,
hence the “5” associated with its name (LeCun et al., 1989). It is known to be one of the simplest
architectures, and it has about 60,000 parameters.
2.7.2 AlexNet. AlexNet was first proposed by Krizhevsky et al. (2017), and thereafter improved
the learning ability of the CNN by upgrading its depth as well as implementing numerous strategies
for optimizing parameters (Alzubaidi et al., 2021). It was the earliest improvement on the LeNet
with twelve layers. It is known to have a considerable significance in the modern-day CNN
generations. The activation functions in the hidden layers and output layers are respectively the
Rectified Linear Unit (ReLu) and the Softmax function.
2.7.3 Visual Geometry Group (VGG). Visual Geometry Group (VGG) architecture is an
innovative and efficient design principle that was proposed by Simonya and Zissermanh (2015),
and it is an improvement on the AlexNet in terms of increase in depth of the network, with sixteen
and nineteen layers in its two different variants. VGG-16 is a variant of the VGG that is widely
used for other CNNs, and it is known to have good classification performance. The activation
functions in its hidden layers and output layers are respectively the Rectified Linear Unit (ReLu)
and the Softmax function.
2.7.4 Inception (GoogLeNet). The GoogLeNet architecture, also known as Inception-V1, is
composed of twenty-two layers and was proposed by Szegedy et al. (2014). It is known to have
emerged as the winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
competition in 2014. Its sole purpose of design is to achieve a high level of accuracy with
minimal computational cost by proposing a novel inception block concept in the context of the
CNN (Alzubaidi et al., 2021). Moreover, its motivation was to enhance the learning capacity and
thereby improve the efficiency of CNN parameters. Like the VGG and AlexNet, the activation
functions in the hidden layers and output layers are respectively the Rectified Linear Unit (ReLu)
and the Softmax function.
2.7.5 Residual Network (ResNet). The Residual Network (ResNet) architecture was developed
by He et al. (2016) and its largest architecture is known to contain one hundred and fifty-two
layers. However, the most common type of this architecture is ResNet50, and it contains 49
convolutional layers and a single fully-connected layer. ResNet emerged as the winner of both the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and Common Objects in Context
(COCO) Detection Challenge in 2016 with 152 layers of depth, representing 20 times the depth
of AlexNet and 8 times the depth of VGG. However, it has lower computational complexity when
compared with the VGG. The activation functions in its hidden layers and output layers are
respectively the Rectified Linear Unit (ReLu) and the Softmax function.
2.7.6 Dense Convolutional Network (DenseNet). Dense Convolutional Network (DenseNet)
was suggested solely for the purpose of addressing the issue of vanishing gradient but following
the same direction as ResNet. It connects the individual layers to all layers in the network through
a feed-forward medium. It has some compelling advantages, among which are the strengthening
of feature propagation, elimination of the vanishing-gradient problem, encouraging the reuse of
features, as well as reducing the number of parameters significantly.
2.8 Concluding Section
This chapter has thoroughly explored the different concepts related to our study including related
literature using the theoretical approach. It basically placed emphasis on assessing important
concepts such as artificial intelligence and its sub-fields. It also explicitly gave an exposition
of the concept of ANNs and their vital components, such as the activation functions, as well as an
important ANN framework, the CNN. The next chapter presents GANs and their framework,
including the algorithm, their mathematical representations, as well as how they are used for
anomaly detection tasks.
3. Generative Adversarial Networks for Anomaly Detection
3.1 Generative Adversarial Networks (GANs)
Anomalies can occur when incomplete or corrupted data points are allowed within a dataset. It
is assumed that non-anomalous data can be simplified further due to recognisable patterns and
correlations within the dataset – the data points can be mapped to some lower-dimensional distribution, then inversely mapped to the original distribution, with minimal loss, e.g., autoencoder.
This mapping will lead to a new distribution where typical data points are denser in certain areas
than the others.
The GAN was introduced in 2014 by Goodfellow et al. (2014), and it is a deep-learning architecture used to replicate dataset distributions. It is an instance of modern generative modelling
algorithms. It is comprised of two parts: the Generator and the Discriminator. The Discriminator
is a deep-learning classifier that learns to classify between real and fake (generated) examples
that are taken from the domain as input. The Generator is a deep learning model that takes a
latent vector of random numbers from a normal distribution as input and transforms it. Its goal
is to continually generate examples that fool the discriminator model into thinking they are real.
Moreover, the generator and discriminator learn alongside each other simultaneously until they
attain an equilibrium where the generator produces realistic examples.
3.2 Building a GAN
Figure 3.1: Building Block of a GAN
Figure 3.1 is a structure that typically depicts the architecture of a GAN. Generally, a sample of
noise is passed to the generator model G, a multi-layer perceptron. Afterwards, the generator
generates a fake sample referred to as G(z), and this sample, alongside the real data x, is passed
on to the discriminator neural network, D. The task of the discriminator is to output a single
value which is the probability of having the input as real. Thus, it outputs D(x) if the initial
input is real, otherwise it outputs D(G(z)).
Furthermore, the adversarial framework as depicted above is transformed from an unsupervised
approach to a supervised approach in the sense that the resultant of the process comes from the
discriminator D which acts as a classifier.
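The flow in Figure 3.1 can be sketched with two small Keras models; the layer sizes below are illustrative assumptions and not the architecture used later in the study.

import tensorflow as tf

latent_dim, data_dim = 8, 4   # hypothetical sizes of the noise vector and data sample

# Generator G: latent noise z -> fake sample G(z)
G = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="tanh", input_shape=(latent_dim,)),
    tf.keras.layers.Dense(data_dim, activation="tanh"),
])

# Discriminator D: sample -> probability that the sample is real
D = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(data_dim,)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

z = tf.random.normal((16, latent_dim))   # a batch of noise
x = tf.random.normal((16, data_dim))     # a batch standing in for real data
d_real, d_fake = D(x), D(G(z))           # D(x) and D(G(z)) as in Figure 3.1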
3.3 Mathematical Framework of the GANs
As originally proposed by Goodfellow et al. (2014), the adversarial framework is formulated mathematically via the minimax of a target function between the generator function G : \mathbb{R}^d \to \mathbb{R}^n
and the discriminator function D : \mathbb{R}^n \to [0, 1]. Thus, the target loss function is proposed to
be

V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]    (3.3.1)
From (3.3.1), \mathbb{E}_{x \sim p_{data}}[\log D(x)] is referred to as the prediction of the discriminator on real
data, while the second term, \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))], is the prediction of the discriminator on
fake data.
Bearing in mind (3.3.1), it is required to solve the minimax problem below
\min_G \max_D V(D, G) = \min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]    (3.3.2)
The generator’s goal is to maximize its probability of winning, that is, having the discriminator
make a mistake; hence, it needs to minimize the value function given by (3.3.2). On the other
hand, the discriminator’s goal is to minimize the winning probability of the generator, hence it
has to maximize (3.3.2). In summary, the discriminator wants D(G(z)) to be as small as possible
and D(x) to be as large as possible.
3.3.1 Proposition. The optimal discriminator is given by a simple relation of the probability
distributions of the real and fake samples. When these distributions are equal, the cost function
is a constant given as -\log 4.
Proof.

\begin{aligned}
V(D, G) &= \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \\
&= \int_x p_{data}(x) \log[D(x)] \, dx + \int_z p_z(z) \log[1 - D(G(z))] \, dz \\
&= \int_x \big( p_{data}(x) \log[D(x)] + p_g(x) \log[1 - D(x)] \big) \, dx
\end{aligned}    (3.3.3)
The integrand in the final part of (3.3.3) can be represented in the form

V(D, G) = a \log y + b \log(1 - y)    (3.3.4)
Section 3.3. Mathematical Framework of the GANS
Page 19
where

a = p_{data}(x)    (3.3.5)
b = p_g(x)    (3.3.6)
Thus, to find y that maximizes (3.3.4), we differentiate partially with respect to y and equate to
zero thus:
\begin{aligned}
\frac{\partial}{\partial y}\left[a \log(y) + b \log(1 - y)\right] &= 0 \\
\frac{a}{y} + \frac{b}{1 - y}(-1) &= 0 \\
\frac{a}{y} &= \frac{b}{1 - y} \\
\therefore y &= \frac{a}{a + b}
\end{aligned}
Then, by substituting (3.3.5) and (3.3.6) into y above, we would have
D^*(x) = \frac{p_{data}(x)}{p_g(x) + p_{data}(x)}    (3.3.7)
Equation (3.3.7) is the final form for the optimal discriminator, that is, a relation of the probability
distributions of the real and fake samples. Furthermore, if the probability distribution of the real
data and the generated data are identical, that is
p_{data}(x) = p_g(x)    (3.3.8)

then (3.3.7) becomes

D^*(x) = \frac{p_{data}(x)}{2 \times p_{data}(x)} = \frac{1}{2}    (3.3.9)

Thus, the optimal discriminator outputs 1/2.
By substituting this into the cost function via (3.3.1), then we would have
\begin{aligned}
V(D, G) &= \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \\
&= \mathbb{E}_{x \sim p_{data}}[\log D^*(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D^*(x))] \\
&= \mathbb{E}_{x \sim p_{data}}[\log(1/2)] + \mathbb{E}_{x \sim p_g}[\log(1/2)] \\
&= -\log 2 - \log 2 \\
\therefore V(D^*, G) &= -\log 4
\end{aligned}
3.3.2 Kullback-Leibler Divergence. The Kullback-Leibler divergence, also called relative entropy or KL divergence is a statistical distance that measures how an initial probability distribution
P differs from a second reference probability distribution, Q. It is the expected value of the logarithm of the ratio of the two probability distributions represented mathematically as follows
\begin{aligned}
D_{KL}(P \,\|\, Q) &= \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right] \\
&= \mathbb{E}_{x \sim P}\left[\log \frac{2P(x)}{2Q(x)}\right] \\
&= \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)/2} - \log 2\right]
\end{aligned}    (3.3.10)
Recall the cost function can be expressed as follows
\begin{aligned}
V(D, G) &= \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \\
&= \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{x \sim p_g(x)}[\log(1 - D(x))]
\end{aligned}    (3.3.11)
By substituting Equation (3.3.7) into Equation (3.3.11), the cost function becomes

\begin{aligned}
V(D, G) &= \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{x \sim p_g(x)}[\log(1 - D(x))] \\
&= \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_g(x) + p_{data}(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log\left(1 - \frac{p_{data}(x)}{p_g(x) + p_{data}(x)}\right)\right] \\
&= \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_g(x) + p_{data}(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_g(x) + p_{data}(x)}\right] \\
&= \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{(p_g(x) + p_{data}(x))/2}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{(p_g(x) + p_{data}(x))/2}\right] - 2\log 2 \\
&= \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{(p_g(x) + p_{data}(x))/2}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{(p_g(x) + p_{data}(x))/2}\right] - \log 4
\end{aligned}    (3.3.12)
Thus, by comparing the last part of Equation (3.3.12) with Equation (3.3.10), the
Kullback-Leibler divergence enables us to write

V(D, G) = -\log 4 + KL\left(p_{data} \,\Big\|\, \frac{p_{data} + p_g}{2}\right) + KL\left(p_g \,\Big\|\, \frac{p_{data} + p_g}{2}\right)    (3.3.13)
3.3.3 Jensen-Shannon Divergence. The Jensen-Shannon Divergence, also called total divergence to the average is a concept in probability theory for measuring the similarity between two
probability distributions. It is represented mathematically as
JSD(P \,\|\, Q) = \frac{1}{2} D_{KL}\left(P \,\Big\|\, \frac{P + Q}{2}\right) + \frac{1}{2} D_{KL}\left(Q \,\Big\|\, \frac{P + Q}{2}\right)    (3.3.14)
The Jensen-Shannon Divergence is similar to the Kullback-Leibler divergence, with the only difference
being that the former is symmetric. This implies that the JSD from P to Q is the same as that
from Q to P —which is not true for the KL divergence. It is also used as a distance measure
between two probability distributions. Also, by using the last part of Equation (3.3.12), we would
have
\begin{aligned}
\min_G V(D, G) &= \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{(p_g(x) + p_{data}(x))/2}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{(p_g(x) + p_{data}(x))/2}\right] - \log 4 \\
&= 2 \times \frac{1}{2}\,\mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}}{(p_g + p_{data})/2}\right] + 2 \times \frac{1}{2}\,\mathbb{E}_{x \sim p_g}\left[\log \frac{p_g}{(p_g + p_{data})/2}\right] - \log 4
\end{aligned}    (3.3.15)
Thus, the minimum of the cost function for the generator is given as
\min_G V(D, G) = -\log 4 + 2 \cdot JSD(p_{data} \,\|\, p_g)    (3.3.16)
That is, this minimum cost function is the Jensen-Shannon Divergence between the real and
fake data distributions offset by a constant, where -\log 4 is the constant incurred at the optimal
discriminator when the data distributions match. Thus, for an
optimal discriminator, the generator aims to minimize the cost as given in Equation (3.3.15).
In addition, the minimum for any Jensen-Shannon Divergence is 0, and this occurs if and only
if the two probability distributions are equal, that is, if Equation (3.3.8) holds. Thus, the cost
function has only one minimum which is achieved when the generator perfectly maps the data
distribution. In this case, the only minimum of the cost function is a constant given as
\min_G V(D, G) = -\log 4    (3.3.17)
Thus, Equation (3.3.17) is referred to as the global minimum of the loss function.
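The divergences above can be verified numerically for small discrete distributions; the sketch below uses arbitrary example distributions to check the symmetry of the JSD and Equations (3.3.16)-(3.3.17).

import numpy as np

def kl(p, q):                       # Kullback-Leibler divergence, Equation (3.3.10)
    return np.sum(p * np.log(p / q))

def jsd(p, q):                      # Jensen-Shannon divergence, Equation (3.3.14)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = np.array([0.2, 0.5, 0.3])  # arbitrary example distributions
p_g    = np.array([0.3, 0.4, 0.3])

print(np.isclose(jsd(p_data, p_g), jsd(p_g, p_data)))   # True: JSD is symmetric
print(-np.log(4) + 2 * jsd(p_data, p_g))                # generator cost, Equation (3.3.16)
print(-np.log(4) + 2 * jsd(p_data, p_data))             # equals -log 4, Equation (3.3.17)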
3.4 The GANs Algorithm
The algorithm of the generative adversarial framework is presented below. It briefly summarizes
all the steps involved in the generative adversarial framework. To learn the discriminator, it
is treated as a classification problem wherein the inputs are the real data x and the fake data G(z),
and the goal is to predict the probability of the output being real.
In the algorithm, we use the positive direction of the gradient (gradient ascent) of the loss function
to update the discriminator in an attempt to maximize the cost; likewise the negative direction
of the gradient (gradient descent) to update the generator to minimize the cost. The algorithm
also corroborates the image presented in Figure 3.1.
Begin
for each training iteration do
    for k steps do
        Sample m noise samples {z_1, z_2, . . . , z_m} and transform with the Generator;
        Sample m real samples {x_1, x_2, . . . , x_m} from the real data;
        Update the Discriminator by ascending its gradient:
            \nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x^{(i)}) + \log\left(1 - D(G(z^{(i)}))\right) \right];
    end for
    Sample m noise samples {z_1, z_2, . . . , z_m} and transform with the Generator;
    Update the Generator by descending its gradient:
        \nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D(G(z^{(i)}))\right);
end for
End
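The algorithm can be translated almost line by line into a training loop; the TensorFlow sketch below uses toy data and arbitrary model sizes, and is only a minimal illustration of the gradient-ascent and gradient-descent updates, not the study's implementation.

import numpy as np
import tensorflow as tf

latent_dim, data_dim, k_steps, batch = 8, 4, 1, 64   # arbitrary settings
eps = 1e-7                                           # numerical safety inside the logs

# Toy "real" data standing in for the scaled consumption features.
real_data = np.random.normal(0.0, 1.0, size=(1024, data_dim)).astype("float32")

G = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="tanh", input_shape=(latent_dim,)),
    tf.keras.layers.Dense(data_dim, activation="tanh"),
])
D = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(data_dim,)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
opt_d, opt_g = tf.keras.optimizers.Adam(1e-3), tf.keras.optimizers.Adam(1e-3)

for iteration in range(200):                 # training iterations
    for _ in range(k_steps):                 # k discriminator steps
        z = tf.random.normal((batch, latent_dim))
        x = real_data[np.random.randint(0, len(real_data), batch)]
        with tf.GradientTape() as tape:
            v = tf.reduce_mean(tf.math.log(D(x) + eps)
                               + tf.math.log(1.0 - D(G(z)) + eps))
            d_loss = -v                      # ascending V is the same as descending -V
        grads = tape.gradient(d_loss, D.trainable_variables)
        opt_d.apply_gradients(zip(grads, D.trainable_variables))

    z = tf.random.normal((batch, latent_dim))
    with tf.GradientTape() as tape:          # generator step: descend log(1 - D(G(z)))
        g_loss = tf.reduce_mean(tf.math.log(1.0 - D(G(z)) + eps))
    grads = tape.gradient(g_loss, G.trainable_variables)
    opt_g.apply_gradients(zip(grads, G.trainable_variables))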
3.5 GANs for Anomaly Detection
Schlegl et al. (2017) noted that the task of anomaly detection using GANs involves modelling
the normal behaviour using an adversarial training process and thereby detecting anomalies by
measuring the anomaly score. It is formulated such that the framework learns a generative model
referred to as the generator alongside another called the discriminator, where the generator maps
samples from an arbitrary latent distribution to data, and the discriminator tries to detect which
is the generated or real samples. A BiGAN architecture was proposed by Donahue et al. (2016)
as an extension of the initial framework by including inverse mapping. The inverse mapping works
such that it maps data back to their latent representation. Consequently, using GANs for anomaly detection
usually involves a function that maps the input data to the latent space, together with the generator, which
acts in the opposite direction by mapping latent codes back to data space (Mattia et al., 2019).
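To make this concrete, one commonly used scoring scheme in this family, following the general idea in Schlegl et al. (2017), combines a residual term with a discrimination term; the exact weighting and norms vary between implementations, so the form below is a sketch rather than the precise score used in any particular system:
$$
A(x) = (1-\lambda)\,\underbrace{\big\lVert x - G(z^{*}) \big\rVert_{1}}_{\text{residual score}}
\; + \; \lambda\,\underbrace{\big\lVert f_{D}(x) - f_{D}\big(G(z^{*})\big) \big\rVert_{1}}_{\text{discrimination score}}
$$
Here $z^{*}$ is the latent code recovered for the query point $x$ (for instance, by the inverse mapping described above), $f_{D}$ denotes an intermediate feature layer of the discriminator, and $\lambda \in [0,1]$ is a weighting hyperparameter; a large value of $A(x)$ flags $x$ as anomalous.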
3.6 Pros and Cons of the GANs Framework
The fundamental advantage of the generative adversarial framework is that both models, the generator and the
discriminator, are updated via the back-propagation paradigm, and as such it does not require a Markov chain.
In addition, no inference is necessary during learning, and a wide range of interactions and factors can easily
be integrated into the model. The main drawback of using the generative adversarial network for anomaly
detection, however, is that the anomaly scores are not easy to interpret.
3.7 Concluding Section
This chapter explored the GANs framework, the underlying mathematics, and the algorithm showing how the
framework can be implemented in software. It also explained briefly what is meant by using GANs for anomaly
detection. The next chapter presents the methods of data analysis, including data augmentation, and the results
achieved, together with a comprehensive discussion.
4. Methodology, Results and Discussion
4.1 Performance Evaluation
This section examines the performance of our model; the various measures are as follows.
• Confusion Matrix: This is a matrix that shows the number of correct and incorrect predictions made by the
  model. It is usually composed of the True Positives (TP), True Negatives (TN), False Positives (FP) and
  False Negatives (FN). The structure of a typical confusion matrix is given in Table 4.1.

                            Prediction f̂(x)
    Actual (y)         Positive                 Negative
    Positive           True Positive (TP)       False Negative (FN)
    Negative           False Positive (FP)      True Negative (TN)

    Table 4.1: Confusion Matrix
• Accuracy $\widehat{PCC}_{te}(\hat{f})$: This is the estimated probability of correct predictions, given as the
  total number of correct predictions divided by the total number of observations in the test set. The best
  accuracy is 1, and the worst is 0. It is expressed as
$$
\widehat{PCC}_{te}(\hat{f}) = \frac{\widehat{TP} + \widehat{TN}}{|D_{te}|}
 = \frac{\widehat{TP} + \widehat{TN}}{\widehat{TP} + \widehat{FP} + \widehat{TN} + \widehat{FN}}
\tag{4.1.1}
$$
• Sensitivity (Sn): This is the number of correct positive predictions divided by the total number of positives.
  It is also called Recall or the True Positive Rate. The best sensitivity is 1, and the worst is 0. It can be
  quantified as
$$
Sn = \frac{\widehat{TP}}{\widehat{TP} + \widehat{FN}}
\tag{4.1.2}
$$
• Specificity (Sp): This is the number of correct negative predictions divided by the total number of negatives.
  It is also called the True Negative Rate, and it shows the proportion of correctly predicted negative examples.
  The best specificity is 1, and the worst is 0. It is quantified with
$$
Sp = \frac{\widehat{TN}}{\widehat{TN} + \widehat{FP}}
\tag{4.1.3}
$$
• Precision (P): This is the number of correct positive predictions divided by the total number of positive
  predictions, and it is also called the Positive Predictive Value. It also gives information about how
  effectively our model avoids false positives. The best precision is 1, and the worst is 0. It can be
  expressed as
$$
P = \frac{\widehat{TP}}{\widehat{TP} + \widehat{FP}}
\tag{4.1.4}
$$
• False Positive Rate (FPR): This is the number of incorrect positive predictions divided by the total number
  of negatives. The best FPR is 0, and the worst is 1.
$$
FPR = \frac{\widehat{FP}}{\widehat{TN} + \widehat{FP}} = 1 - \text{Specificity}
\tag{4.1.5}
$$
• F1-Score: It is defined by the harmonic mean of precision and recall. It is represented as
$$
F_1\text{-Score} = \frac{2}{\frac{1}{Recall} + \frac{1}{Precision}}
 = 2 \times \frac{Recall \times Precision}{Recall + Precision}
\tag{4.1.6}
$$
• Area under ROC (AUC): The AUC measures the entire area under the Receiver Operating Characteristic (ROC)
  curve. It graphically summarises the overall ranking performance of the classifier.
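As an illustration, the metrics above can be computed directly with scikit-learn; the arrays y_true and y_pred below are placeholder labels and predictions (1 = anomaly, 0 = normal), not outputs of this study.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # illustrative ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # illustrative model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy   :", accuracy_score(y_true, y_pred))    # (TP + TN) / total
print("Precision  :", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall     :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1 score   :", f1_score(y_true, y_pred))
print("Specificity:", tn / (tn + fp))                     # TN / (TN + FP)
print("FPR        :", fp / (fp + tn))                     # 1 - specificity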
4.2 Data Collection, Description & Augmentation
4.2.1 Data Collection and Description. The dataset used in this study was obtained from the base stations of the
TeleInfra Telecommunication Company in Cameroon, which use generators as a major source of power, and it spans
September 2017 to August 2018. The dataset initially comprised 6010 rows or records, which reduced to 5905
records after the necessary data cleaning. It initially contained variables such as Running Time, which accounts
for the running time of the generating plants; Power Type, denoting the source of power; Consumption Rate;
GENERATOR CAPACITY (kVA), measuring the total amount of power of the generator in kilovolt-amperes (kVA), and
so on. Furthermore, to adequately measure the irregularities or anomalies in the dataset, additional variables
were generated; these include Daily Consumption within a Period, obtained by dividing the Consumption HIS by
the Number of Days; Running Time per Day, derived by dividing the Running Time by the Number of Days; and the
Daily Consumed Quantity between Visits. Thus, prior to feature selection, there were 21 feature columns,
excluding the class labels. The variables and brief descriptions are given in Table 4.2.
Variable                            Description
NUMBER OF DAYS                      The number of days before the next generator refuelling.
RUNNING TIME                        Total number of working hours of the generator before the next refuelling.
RUNNING TIME PER DAY                The number of working hours per day of a generator.
CONSUMPTION HIS                     Total fuel consumed within a specific period before the next refuelling.
CONSUMPTION RATE                    The number of litres consumed per hour by the generator.
CURRENT HOUR METER GE1              Hourly meter reading of the generator.
MAXIMUM CONSUMPTION PER DAY         Maximum fuel the generator can consume in a day.
PREVIOUS HOUR METER G1              Previous meter reading of the generator.
DAILY CONSUMPTION                   Quantity of fuel consumed daily by the generator, based on its hourly consumption rate and daily working hours.
QTE FUEL FOUND                      Quantity of fuel inside the generator tank before refuelling.
QTE CONSUMED BTN VISIT PER DAY      (QTE CONSUMED BTN VISIT) / (NUMBER OF DAYS)

Table 4.2: Variables and Descriptions
4.2.2 Data Augmentation. It is well known that deep learning models such as GANs rely heavily on large volumes
of data; hence, data augmentation was carried out on the 5902 observations available after cleaning. The
augmentation procedure uses the standard deviation of each feature to generate additional records: for each
feature value in a row, a uniformly distributed random number between 0 and that feature's standard deviation
is added to the existing value. While iterating over the rows of the dataset with a for loop, each perturbed
row is stored as a dictionary, and the newly generated records are then appended to the existing records to
produce a much larger dataset. The augmented data contains a total of 187281 records with the existing features.
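A minimal sketch of this augmentation procedure is given below, assuming the cleaned data are held in a pandas DataFrame of numeric features; the number of augmented copies per pass and the random seed are assumptions chosen only for illustration.

import numpy as np
import pandas as pd

def augment(df: pd.DataFrame, copies: int = 30, seed: int = 0) -> pd.DataFrame:
    """Add uniform noise in [0, std(feature)] to every value of every row,
    repeated `copies` times, and append the new rows to the original data."""
    rng = np.random.default_rng(seed)
    stds = df.std()                          # per-feature standard deviation
    new_rows = []
    for _ in range(copies):
        for _, row in df.iterrows():         # iterate over the existing records
            noise = rng.uniform(0, stds.values)          # U(0, std) per feature
            new_rows.append(dict(zip(df.columns, row.values + noise)))
    # Append the generated records to the existing records
    return pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)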
The study is therefore examined first with the initial dataset of 5902 records, and then again after augmentation
is performed on the existing dataset to increase its size, in order to observe whether there is any improvement
in the trained model.
4.3 Results and Discussion
4.3.1 Feature Importance. The feature importance analysis is used to show the order of relevance of the various
features constituting the dataset. Generally, it is necessary to determine which feature has the greatest
influence on the output; in this case, which variable is of utmost importance in classifying the data points as
either anomalous or normal. The random forest classifier was utilised for this task, and it can be seen from
Figure 4.1 that the Running Time Per Day has the maximum importance among all the features, followed by the
Daily Consumption within a Period, Running Time, Consumption HIS, Maximum Consumption per Day, and Daily
Consumed Quantity between Visits, in that order. Furthermore, the least important of all the features is the
Total Quantity (QTE) Left.
Figure 4.1: Feature Importance
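A minimal sketch of this feature-importance step with a random forest is shown below; X (the feature matrix) and y (the anomaly/normal labels) are assumed to have been prepared as described in the preceding sections, and the number of trees is an illustrative assumption.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def rank_features(X: pd.DataFrame, y) -> pd.Series:
    """Fit a random forest and return impurity-based importances,
    sorted from most to least relevant."""
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X, y)
    return pd.Series(forest.feature_importances_,
                     index=X.columns).sort_values(ascending=False)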
4.3.2 Anomaly Plots. Figure 4.2b below is a scatter plot showing the distribution of the running time per day
of the generating plants at the base stations of the TeleInfra Telecommunication company. It can be seen that
certain data points lie above the 24-hour threshold in the plot and can thus be referred to as anomalies. This
clearly indicates that there are anomalies in the fuel consumption with respect to the running time per day.
Also, Figure 4.2a shows that some data points are clearly outlying when compared to other observations. Both
plots confirm that anomalies or irregularities exist in the dataset.
Figure 4.2: Anomaly Plots of the Running Time Per Day at the Base Stations (panels (a) and (b): scatter plots of the running time)
4.3.3 Label Classification. After the feature selection, the target variable was established by classifying the
data points into anomaly and normal cases using variables such as the Running Time per Day, Maximum Consumption
Per Day, etc. The normal records number 3829 (64.88%) and the anomalous records 2073 (35.12%). To further
provide a clear demarcation between the normal and the anomalous data points observed in the running time per
day, a pie chart is given in Figure 4.3, showing the percentage of anomalous and normal data points. About
64.88% of the data points are suggested as normal, while the other 35.12% are classified as anomalous. This
also indicates that there is a class imbalance in the data labels.

Figure 4.3: Anomaly vs Normal Running Time Per Day
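As a purely illustrative sketch of this labelling idea, a record can be flagged as anomalous when its behaviour is physically implausible, for example a running time per day above 24 hours or a daily consumption above the stated daily maximum; the exact rule set used in the study may combine further variables, so the rule below is an assumption, not the study's definition.

import pandas as pd

def label_anomalies(df: pd.DataFrame) -> pd.Series:
    """Return 1 for anomalous records and 0 for normal ones, using two
    illustrative plausibility rules on columns from Table 4.2."""
    is_anomaly = (df["RUNNING TIME PER DAY"] > 24) | \
                 (df["DAILY CONSUMPTION"] > df["MAXIMUM CONSUMPTION PER DAY"])
    return is_anomaly.astype(int)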
4.3.4 Correlation Analysis. Now, we show the extent of linear association between each pair of features in the
dataset with the correlation plot (also known as the correlation matrix), as seen in Figure 4.4. The correlation
matrix has values between −1 and +1, where a correlation of −1 and +1 represents a perfect negative and a
perfect positive correlation, respectively. Inspecting the correlation matrix shows, for instance, that the
Consumption HIS and the Running Time have a high correlation with a value of 0.83, a strong positive correlation
indicating that the higher the Running Time, the higher the consumption. Thus, it can be ascertained that these
variables are relevant in studying the fuel consumption pattern and hence in detecting anomalies at the base
stations.

Figure 4.4: Correlation Matrix of the Association between Features
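A minimal sketch of producing such a correlation matrix is given below; it assumes the DataFrame contains only the numeric features, and the plotting choices (seaborn heatmap, colour map) are illustrative rather than those used to produce Figure 4.4.

import matplotlib.pyplot as plt
import seaborn as sns

def plot_correlation(df):
    """Compute pairwise Pearson correlations and display them as a heatmap."""
    corr = df.corr()                        # values lie in [-1, +1]
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
    plt.title("Correlation Matrix of the Association between Features")
    plt.tight_layout()
    plt.show()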
4.3.5 Training Losses of Initial Dataset. The training losses graph shown in Figure 4.5 below shows how the
discriminator and generator models converge. As recommended in previous research, the Adam optimizer was used
for the generator network, while the stochastic gradient descent optimizer was used for the discriminator
network. Furthermore, the generator network contains five dense layers, all with the tanh activation function,
whereas the discriminator network is composed of six dense layers, each accompanied by a dropout layer that
helps to avoid overfitting, and a sigmoid activation function was used on its final layer.
It can be seen from Figure 4.5a that the losses of both the generator and the discriminator fluctuate and the
convergence pattern is not really appreciable. The final loss for the discriminator was 0.110683, while that of
the generator was 0.037645.
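A sketch of an architecture along these lines is shown below: a five-dense-layer generator with tanh activations and a six-dense-layer discriminator with dropout after each hidden layer and a sigmoid output. The layer widths, hidden-layer activation (ReLU here), dropout rate and latent dimension are assumptions, not the exact values used in this study; the optimisers follow the text (SGD for the discriminator, Adam for the generator, applied through the stacked model in the training-loop sketch shown earlier).

from tensorflow import keras
from tensorflow.keras import layers

latent_dim, n_features = 32, 21   # assumed sizes

generator = keras.Sequential([
    keras.Input(shape=(latent_dim,)),
    layers.Dense(128, activation="tanh"),
    layers.Dense(64, activation="tanh"),
    layers.Dense(64, activation="tanh"),
    layers.Dense(32, activation="tanh"),
    layers.Dense(n_features, activation="tanh"),   # five dense layers in total
])

discriminator = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(128, activation="relu"), layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),  layers.Dropout(0.3),
    layers.Dense(32, activation="relu"),  layers.Dropout(0.3),
    layers.Dense(16, activation="relu"),  layers.Dropout(0.3),
    layers.Dense(8, activation="relu"),   layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),           # sixth (output) dense layer
])

discriminator.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
                      loss="binary_crossentropy", metrics=["accuracy"])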
4.3.6 Training Losses of Augmented Dataset. Figure 4.5b shows the training losses after data augmentation. It
can be seen that the losses of both the generator and the discriminator converge towards zero by the end of
training, implying that the GANs model approached optimality. Thus, it can be said that the data augmentation
improved the model training process.
4.3.7 Model Evaluation (Initial Dataset). Model evaluation further determines the performance of our model using
evaluation metrics such as the accuracy score, F1 score, recall and precision. The accuracy score is about
66.45% for the initial dataset and about 98.99% for the augmented dataset, as shown in Tables 4.3 and 4.4
respectively.
Table 4.3: Model Evaluation of Initial Dataset

Measure           Value
Accuracy Score    0.6645
Precision         0.5455
Recall            0.0151
F1 Score          0.0294
Table 4.4: Model Evaluation of Augmented Dataset

Measure           Value
Accuracy Score    0.9899
Precision         0.78538
Recall            0.7966
F1 Score          0.6457
Figure 4.5: Training Losses of Datasets ((a) Training Losses of Initial Dataset; (b) Training Losses of Augmented Dataset)
4.3.8 ROC Curves of the Datasets. The ROC curve shown in Figure 4.6a depicts a straight line, showing that the
model trained on the initial dataset is not a good fit, whereas Figure 4.6b shows that the area covered for the
augmented dataset is about 0.74.
Figure 4.6: ROC Curves of the Datasets ((a) ROC Curve of the Initial Dataset; (b) ROC Curve of the Augmented Dataset)
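A minimal sketch of producing such an ROC curve and its AUC with scikit-learn is given below; y_true are the true labels and y_score the model's anomaly scores or predicted probabilities, both placeholders rather than outputs of this study.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

def plot_roc(y_true, y_score):
    """Plot the ROC curve and report the area under it."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_true, y_score):.2f}")
    plt.plot([0, 1], [0, 1], linestyle="--", label="Chance level")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.show()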
4.3.9 Confusion Matrix. The confusion matrix shows the numbers of True Positives, False Positives, True
Negatives and False Negatives produced by our model. It is key to determining the accuracy and the other metrics
associated with the model. Figure 4.7a shows the confusion matrix of the initial dataset, while Figure 4.7b
shows that of the augmented dataset.
Figure 4.7: Confusion Matrices of the Datasets ((a) Confusion Matrix of the Initial Dataset; (b) Confusion Matrix of the Augmented Dataset)
5. Conclusion and Recommendation
5.1 Conclusion
The irregularities in power supply in Cameroon have prompted TeleInfra, a telecommunication company, to resort
to alternative sources of power such as solar panels and generators. In the process of generating power for its
operations, particularly using generators, the company observed anomalies in the fuel consumption patterns,
detected by examining variables such as the running time per day and the consumption rate per day. The dataset
covers roughly one year and was collected across the different features. Mulongo et al. (2020) applied machine
learning approaches, namely K-Nearest Neighbours, Logistic Regression, the Multilayer Perceptron and Support
Vector Machines, to this problem. The purpose of this study is to use a deep learning framework with the
objective of examining any improvement in model accuracy. As such, the study used the Generative Adversarial
Network framework in modelling the dataset. However, it was initially observed that the model did not perform
well, which was conjectured to be due to the small size of the dataset. Hence, data augmentation was performed
on the initial dataset to generate more data.

Following this, the generator and discriminator models were trained on the larger dataset and an accuracy of
0.9899 was achieved. The convergence of the losses in the two cases was also observed, suggesting that the
augmented dataset improved the accuracy of the model. However, exceptional values were not achieved for the
other metrics; this could be a result of incomplete tuning of the algorithm, for instance further hyperparameter
tuning, which was limited by time constraints.
5.2 Recommendation
The Generative Adversarial Network framework can be used effectively for anomaly detection tasks, as it yielded
improved accuracy compared to that of Mulongo et al. (2020), which was about 96.1%. In addition, further research
on anomaly detection can be carried out with GANs, for example by adjusting the hyperparameters and other
settings; another interesting direction is GAN ensembles for anomaly detection.
References
Alan Adolphson, Steven Sperber, and Marvin Tretkoff, editors. p-adic Methods in Number Theory
and Algebraic Geometry. Number 133 in Contemporary Mathematics. American Mathematical
Society, Providence, RI, 1992.
Laith Alzubaidi, Jinglan Zhang, Amjad J. Humaidi, Ayad Al-Dujaili, Ye Duan, Omran Al-Shamma, J. Santamaría,
Mohammed A. Fadhel, Muthana Al-Amidie, and Laith Farhan. Review of deep learning: concepts, CNN architectures,
challenges, applications, future directions. Journal of Big Data, 8(53):84–90, March 2021. doi:
10.1186/s40537-021-00444-8. URL https://doi.org/10.1186/s40537-021-00444-8.
O. Awodele, J. Akinjobi, and J. E. T. Akinsola. A framework for web based detection of journal
entries frauds using data mining algorithm. International Journal of Computer Trends and
Technology (IJCTT), 51(1), 2017. doi: http://dx.doi.org/10.1155/2020/9424725.
Albert Ayang, Paul Ekam, Bossou Videme, and Jean Temga. Power consumption: Base
stations of telecommunication in sahel zone of cameroon: Typology based on the power
consumption—model and energy savings. Journal of Energy, 2016:1–15, 01 2016. doi:
10.1155/2016/3161060.
Alan Beardon. From problem solving to research, 2006. Unpublished manuscript.
Peter Bruce, Andrew Bruce, and Peter Gedeck. Practical Statistics for Data Scientists 50+
Essential Concepts Using R and Python. O’Reilly Media, Sebastopol, California, 2nd edition,
2020. ISBN 978-038093-3-4.
Mashrur Chowdhury, Amy Apon, and Kakan Dey. Data Analytics for Intelligent Transportation
Systems. Elsevier, Amsterdam, Netherlands, 2017. ISBN 978-038093-3-4.
Matthew Davey. Error-correction using Low-Density Parity-Check Codes. Phd, University of
Cambridge, 1999.
Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. CoRR,
abs/1605.09782, 2016. URL http://arxiv.org/abs/1605.09782.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http:
//www.deeplearningbook.org.
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014. URL
https://arxiv.org/abs/1406.2661.
Simon Haykin. Neural Networks and Learning Machines. Pearson Education, Upper Saddle River,
New Jersey, 3rd edition, 2009. ISBN 978-0-13-147139-9.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pages 770–778, 2016. doi: 10.1109/CVPR.2016.90.
Iztok Humar, Xiaohu Ge, Lin Xiang, Minho Jo, Min Chen, and Jing Zhang. Rethinking energy
efficiency models of cellular networks with embodied energy. IEEE network, 25(2):40–49, 2011.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. Commun. ACM, 60(6):84–90, may 2017. ISSN 0001-0782. doi:
10.1145/3065386. URL https://doi.org/10.1145/3065386.
Leslie Lamport. LATEX: A Document Preparation System. Addison-Wesley, 1986.
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel.
Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4):
541–551, 12 1989. ISSN 0899-7667. doi: 10.1162/neco.1989.1.4.541. URL https://doi.org/
10.1162/neco.1989.1.4.541.
Xiaoguang Li. Research on the development and applications of artificial neural networks. Applied
Mechanics and Materials, 556 – 562:6011–6014, 2014. doi: http://dx.doi.org/10.1155/2020/
9424725.
D. J. C. MacKay and R. M. Neal. Good codes based on very sparse matrices. Available from
www.inference.phy.cam.ac.uk, 1995.
David MacKay. Statistical testing of high precision digitisers. Technical Report 3971, Royal
Signals and Radar Establishment, Malvern, Worcester. WR14 3PS, 1986a.
David MacKay. A free energy minimization framework for inference problems in modulo 2 arithmetic. In B. Preneel, editor, Fast Software Encryption (Proceedings of 1994 K.U. Leuven
Workshop on Cryptographic Algorithms), number 1008 in Lecture Notes in Computer Science
Series, pages 179–195. Springer, 1995b.
Federico Di Mattia, Paolo Galeone, Michele De Simoni, and Emanuele Ghelfi. A survey on gans
for anomaly detection. CoRR, abs/1906.11632, 2019. URL http://arxiv.org/abs/1906.11632.
John McCarthy. What is artificial intelligence?, 1997. URL http://www-formal.stanford.edu/
jmc/whatisai/whatisai.html. Last accessed: 28 April 2022.
Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997. ISBN 978-0-07-042807-2.
Erasmus Muh, Sofiane Amara, and Fouzi Tabet. Sustainable energy policies in cameroon: A
holistic overview. Renewable and Sustainable Energy Reviews, 82:3420–3429, 2018. ISSN 1364-0321. doi: https://doi.org/10.1016/j.rser.2017.10.049. URL https://www.sciencedirect.com/
science/article/pii/S1364032117314168.
Jecinta Mulongo, Marcellin Atemkeng, Theophilus Ansah-Narh, Rockefeller Rockefeller,
Gabin Maxime Nguegnang, and Marco Andrea Garuti. Anomaly detection in power generation plants using machine learning and neural networks. Applied Artificial Intelligence, 34(1):
64–79, 2020. doi: 10.1080/08839514.2019.1691839. URL https://doi.org/10.1080/08839514.
2019.1691839.
R. M. Neal. Connectionist learning of belief networks. Artificial Intelligence, 56(1):71 – 113,
1992. doi: https://doi.org/10.1016/0004-3702(92)90065-6.
Chigozie Enyinna Nwankpa, Winifred Ijomah, Anthony Gachagan, and Stephen Marshall. Activation functions: Comparison of trends in practice and research for deep learning. CoRR,
abs/1811.03378, 2018. URL http://arxiv.org/abs/1811.03378.
Keiron O’Shea and Ryan Nash. An introduction to convolutional neural networks. CoRR,
abs/1511.08458, 2015. URL http://arxiv.org/abs/1511.08458.
Guansong Pang, Chunhua Shen, Longbing Cao, and Anton van den Hengel. Deep learning for
anomaly detection: A review. CoRR, abs/2007.02500, 2020. URL https://arxiv.org/abs/2007.
02500.
S. J. Russell and P Norvig. Artificial Intelligence: A Modern Approach. Pearson Series, New York
City, New York, 4th edition, 2021. ISBN 978-038093-3-4.
Arthur L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal
of Research and Development, 3(3):210–229, 1959. doi: 10.1147/rd.33.0210.
Thomas Schlegl, Philipp Seeböck, Sebastian M. Waldstein, Ursula Schmidt-Erfurth, and Georg
Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker
discovery. CoRR, abs/1703.05921, 2017. URL http://arxiv.org/abs/1703.05921.
Claude Shannon. A mathematical theory of communication. Bell Sys. Tech. J., 27:379–423,
623–656, 1948.
Claude Shannon. The best detection of pulses. In N. J. A. Sloane and A. D. Wyner, editors,
Collected Papers of Claude Shannon, pages 148–150. IEEE Press, New York, 1993.
Siddharth Sharma, Simone Sharma, and Anidhya Athaiya. Activation functions in neural networks.
International Journal of Engineering, Applied Sciences and Technology, 4(2):310 – 316, 2020.
doi: http://dx.doi.org/10.1155/2020/9424725.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR,
abs/1409.1556, 2015. URL https://arxiv.org/pdf/1409.1556.
Simplilearn. Discover the differences between AI vs. machine learning vs. deep learning, 2022. URL
https://www.simplilearn.com/tutorials/artificial-intelligence-tutorial/ai-vs-machine-learning-vs-deep-learning.
[Online; accessed April 28, 2022].
Richard S. Sutton. John mccarthy’s definition of intelligence. Journal of Artificial General Intelligence, 11(2):66–67, 2019. doi: http://dx.doi.org/10.1155/2020/9424725.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions,
2014. URL https://arxiv.org/abs/1409.4842.
TeleInfra. The company, teleinfra, 2022. URL http://www.art.cm/en/node/3111. [Online;
accessed May 23, 2022].
Web12. Commercial mobile robot simulation software. Webots, www.cyberbotics.com, Accessed
April 2013.
Wik12. Black scholes. Wikipedia, the Free Encyclopedia, http://en.wikipedia.org/wiki/Black%
E2%80%93Scholes, Accessed April 2012.