UNIVERSITY OF CYPRUS
DEPARTMENT OF COMPUTER SCIENCE
Computational Intelligence
Author: Maria Papaioannou
Supervisor: Prof. Christos N. Schizas
Date: 6/7/2011
Table of Contents
Chapter 1 - Introduction
Chapter 2 - Neural Networks
  2.1 - Neurons
    2.1.1 - Biological Neurons
    2.1.2 - Artificial Neuron
  2.2 - Artificial Neural Networks and Architecture
  2.3 - Artificial Neural Networks and Learning
  2.4 - Feed-Forward Neural Network
    2.4.1 - The perceptron algorithm
    2.4.2 - Functions used by Feed-forward Network
    2.4.3 - Back propagation algorithm
  2.5 - Self Organizing Maps
    2.5.1 - Biology
    2.5.2 - Self Organizing Maps Introduction
    2.5.3 - Algorithm
  2.6 - Kernel Functions
  2.7 - Support Vector Machines
    2.7.1 - SVM for two-class classification
    2.7.2 - Multi-class problems
Chapter 3 - Fuzzy Logic
  3.1 - Basic fuzzy set operators
  3.2 - Fuzzy Relations
  3.3 - Imprecise Reasoning
  3.4 - Generalized Modus Ponens
  3.5 - Fuzzy System Development
Chapter 4 - Evolutionary Algorithms
  4.1 - Biology
  4.2 - EC Terminology and History
    4.2.1 - Genetic Algorithms (GA)
    4.2.2 - Genetic Programming (GP)
    4.2.3 - Evolutionary Strategies (ES)
    4.2.4 - Evolutionary Programming (EP)
  4.3 - Genetic Algorithms
    4.3.1 - Search space
    4.3.2 - Simple Genetic Algorithm
    4.3.3 - Overall
Chapter 5 - Fuzzy Cognitive Maps
  5.1 - Introduction
  5.2 - FCM technical background
    5.2.1 - FCM structure
    5.2.3 - Static Analysis
    5.2.4 - Dynamical Analysis of FCM (Inference)
    5.2.5 - Building a FCM
    5.2.6 - Learning and Optimization of FCM
    5.2.7 - Time in FCM
    5.2.8 - Modified FCM
  5.3 - Issues about FCMs
Chapter 6 - Discussion
References
Chapter 1 - Introduction
During the last century the world arrived at an important finding: organized, structured thinking arises from a collection of individual processing units which co-function and interact in harmony, without the organized supervision of any super unit. In other words, humanity shed light on how the brain works. The appearance of computers and the outstanding evolution of their computational capabilities further strengthened the human drive to explore how the brain works, by modeling single neurons or groups of interconnected neurons and running simulations over them. This gave rise to the first generation of the well-known Artificial Neural Networks (ANN). Over time, scientists have made further observations on how humans and nature work. Behavioral concepts of individuals or organizations of individuals, psychological aspects, and biological and genetic findings have contributed to computerized methods which at first fell under the class of "Artificial Intelligence" algorithms but later formed a novel category of algorithms, "Computational Intelligence" (CI).
Computer Science is a field where algorithmic processes, methods and techniques for describing, transforming and generally utilizing information are created to satisfy different human needs. Humans have to deal with a variety of complex problems in different domains such as Medicine, Engineering, Ecology, Biology and so on. In parallel, humans have aggregated over the years vast databases holding many kinds of information. The need to search this information, to identify structures and distributions within it, and overall to take advantage of it, together with the simultaneous development of computing machines (at the hardware level), led humans to invent different ways of handling information for their own benefit. Computational Intelligence lies under the general umbrella of Computer Science and hence encloses algorithms, methods and techniques for utilizing and manipulating data and information. However, as the name of this area indicates, CI methods aim to incorporate features of human and natural intelligence in handling information through computations.
Although there is no strict definition of intelligence, one could say approximately that intelligence is the ability of an individual to adapt, efficiently and effectively, to a new environment, altering from its present state to something new; in other words, being able under given circumstances to take the correct decisions about the next action. To accomplish something like that, certain intelligence features must exist, such as learning through experience, using language in communication, spatial and environmental cognition, invention, generalization, inference, applying calculations based on common logic and so on. It is clear that humans are the only beings in this world possessing all of these intelligent features.
However, computers are better able to devote a large amount of effort to solving a specific problem explicitly defined by humans, taking advantage of a vast amount of available data. CI technologies lend themselves to mimicking intelligence properties found in humans and nature, using these properties effectively and efficiently within the body of a method or algorithm, and eventually bringing better results faster than other conventional mathematical or CS methods in the same problem domain.
The main objective of CI research is to understand the fashion in which natural systems function. CI combines different elements of adaptation, learning, evolution, fuzzy logic, and distributed and parallel processing; generally, CI offers models which employ intelligent (to a degree) functions. CI research does not compete with the classic mathematical methods which act in the same domains; rather, it comprises an alternative technique for problem solving where the employment of conventional methods is not feasible.
The three main sub-areas of CI are Artificial Neural Networks, Fuzzy Logic and Evolutionary Computation. Some of the application areas of CI are medical diagnostics, economics, physics, time scheduling, robotics, control systems, natural language processing, and many others. The growing needs in all these areas and their application in humans' everyday life intensified CI research. This need guided researchers in the direction of investigating further the behavior of nature, which led to the definition of evolutionary systems.
The second chapter is about Artificial Neural Network models, which are inspired by brain processes and structures. A first description of how ANN were initially inspired and modeled is given, followed by a basic description of three ANN models. Among the various configurations of ANN topologies and training algorithms, the Multilayer Perceptron (MLP), Kohonen networks and Support Vector Machines are described. ANN are applicable in a plethora of areas like general classification, regression, pattern recognition, optimisation, control, function approximation, time series prediction, data mining and so on.
The third chapter addresses Fuzzy Logic elements and aspects. According to the inventor of Fuzzy Logic himself, Dr. Zadeh, the word "fuzzy" indicates blurred, unclear or confused states or situations. He even stated that the choice of the word "fuzzy" for describing his theory could have been wrong. However, what Fuzzy Logic essentially offers is a theoretical schema of how humans think and reason. Fuzzy Logic allows computers to handle information as humans do: described in linguistic terms, not precisely defined, yet still amenable to inference on data leading to correct conclusions. Fuzzy Logic has been applied in a wide range of fields such as control, decision support and information systems, and in many industrial products such as cameras, washing machines, air conditioners, optimized timetables etc.
The fourth chapter is about Evolutionary Computation, which essentially draws ideas from natural evolution as first stated by Darwin. Evolutionary Computation methods focus on optimization problems. Potential solutions to such problems are created and evolved through discrete iterations (named generations) under selection, crossover and mutation operators. The solutions are encoded into strings (or trees) which act analogously to biological chromosomes. In these algorithms, the parameters comprising a problem are represented either as genotypes (which are evolved based on inheritance) or phenotypes (which are evolved based on environment interaction aspects). The EC area can be analyzed into four directions: Evolutionary Strategies, Evolutionary Programming, Genetic Programming and finally Genetic Algorithms. These EC subcategories differ in the solution representation and in the genetic operators they use. Fields where EC has been successfully applied are telecommunications, scheduling, engineering design, ANN architecture and structural optimisation, control optimization, classification and clustering, function approximation and time series modeling, regression and so on.
The fifth chapter describes the model of Fuzzy Cognitive Maps (FCM), a soft computing methodology for modeling systems which contain causal relationships among their parameters. FCMs manage to represent human knowledge and experience in a certain system's domain as a weighted directed graph model and apply inference, presenting potential behaviours of the model under specific circumstances.
FCMs can be used to adequately describe and handle fuzzy and not precisely defined knowledge. Their construction is based on knowledge extracted from experts on the modeled system. The experts must define the number of concepts and linguistically describe their type. Additionally, they have to define the causal relationships between the concepts and describe their type (direct or inverse) as well as their strength. The main features of FCMs are their simplicity of use and their flexibility in designing, modeling and inferring about complex and large systems. The basic structure of FCMs and their inference process are described in this chapter. Some proposed algorithms for FCM weight adaptation and optimization are also presented. Such algorithms are needed in cases where data describing the system behavior is available but the experts are not in a position to describe the system adequately or completely. Moreover, some attempts at introducing time dependencies are presented, along with some hybridized FCMs (mostly with ANN) which are mainly used in classification problems. Finally, some open issues about FCMs are discussed at the end of this chapter.
The last chapter is an open discussion on the CI area and, more specifically, on Evolutionary Computation and Fuzzy Cognitive Maps, which attracted the greatest interest on behalf of the writer of this study.
Chapter 2 - Neural Networks
Neural Networks comprise an efficient mathematical/computational tool which is widely used for pattern recognition. As indicated by the term «neural», an ANN is constituted by a set of interconnected artificial neurons, resembling the architecture and the functionality of the biological neurons of the brain.
2.1 - Neurons
2.1.1 - Biological Neurons
At the biological level, neurons are the structural units of the brain and of the nervous system, and they are characterized by a fundamental property: their cell membrane allows the exchange of signals with interconnected neurons. A biological neuron is constituted by three basic elements: the body, the dendrites and the axon. A neuron accepts input signals from neighbouring neurons through its dendrites and sends output signals to other neurons through its axon. Each axon ends at a synapse. A synapse is the connection between an axon and the dendrites. Chemical processes take place in synapses which define the excitation or inhibition of the electrical signal travelling to the neuron's body. The arrival of an electrical pulse at a synapse causes the release of neurotransmitting chemicals which travel across to the postsynaptic area. The postsynaptic area is a part of the receiving neuron's dendrite. Since a neuron has many dendrites, it is clear that it may also receive several electrical pulses (input signals) at the same time. The neuron is then able to integrate these input signals and, if the strength of the total input overcomes a threshold, transform the input signals into an output signal sent to its neighbour neurons through its axon.
Synapses are believed to play a significant role in learning, since the strength of a synaptic connection between neurons is adjusted by chemical processes reflecting favourable and unfavourable incoming stimuli, helping the organism to optimize its response to a particular task and environment.
The processing power of the brain depends greatly on the parallel, interconnected function of neurons. Neurons comprise a very important piece in the puzzle of processing complex behaviours and features of a living organism such as learning, language, memory, attention, problem solving, consciousness, etc. The very first neural networks were developed to mimic the biological information processing mechanism (McCulloch and Pitts in 1943, and later others: Widrow and Hoff with Adaline in 1960, Rosenblatt with the Perceptron in 1962).
This study, however, is focused on presenting the ANN as a computational model for regression and classification problems. More specifically, the feed-forward neural network will be introduced, since it appears to have the greatest practical value.
2.1.2 - Artificial Neuron
Following the functionality and the structure of a biological neuron, an artificial neuron:
1. Receives multiple inputs, in the form of real valued variables xn = [x1, …, xD]. A neuron with D inputs defines a D-dimensional input space. The artificial neuron will create a (D-1)-dimensional hyperplane (decision boundary) which divides the input space into two areas.
2. Processes the inputs by integrating the input values with a set of synaptic parameters w = [w1, …, wD] to produce a single numerical output signal. The output signal is then transformed using a differentiable, nonlinear function called the activation function.
3. Transmits the output signal to its interconnected artificial neurons.
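To make these three steps concrete, the following minimal Python sketch implements a single artificial neuron; the function name and the sample numbers are illustrative assumptions, and the logistic sigmoid stands in for the generic activation function:

    import math

    def artificial_neuron(x, w, w0=0.0):
        # Step 2: integrate the D inputs with the D synaptic weights (plus a bias w0)
        s = sum(wi * xi for wi, xi in zip(w, x)) + w0
        # Transform the integrated signal with a differentiable, nonlinear
        # activation function (here the logistic sigmoid h(s) = 1 / (1 + e^-s))
        return 1.0 / (1.0 + math.exp(-s))

    # Step 1: a neuron with D = 3 inputs; Step 3 would pass the returned value
    # on as input to the interconnected neurons of the next layer.
    y = artificial_neuron(x=[0.5, -1.2, 3.0], w=[0.4, 0.1, -0.7], w0=0.2)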
2.2 - Artificial Neural Networks and Architecture
Most ANNs are divided into layers. Each layer is a set of artificial neurons called nodes. The first layer of the network is the input layer; each node of the input layer corresponds to a feature variable of the input vector. The last layer of the network is the output layer; each node of the output layer denotes the final decision/answer of the network to the problem being modelled. The remaining layers between the input and output layers are called hidden layers and their neurons hidden units. Each connection between nodes is characterized by a real value called a synaptic weight, whose role is in analogy with the biological synapses: a connection with a positive weight can be seen as an excitatory synapse, while a negative weight resembles an inhibitory synapse. There is a variety of ANN categories, depending on their architectures and connection topologies. In this study feed-forward networks will be presented, which constitute one of the simplest forms of ANN.
2.3 - Artificial Neural Networks and Learning
Artificial Neural Networks are data processing systems. The main goal of an ANN is to predict/classify correctly given a specific input instance. Similarly to biological neurons, learning in ANN is achieved by adapting the synaptic weights to the input data. There is a variety of learning methods for ANNs, drawn from the fields of Supervised Learning, Unsupervised Learning and Reinforcement Learning. In any case, the general goal of an ANN is to recover the mapping function between the input data and the correct/wanted/"good" classification/clustering/prediction of the output data. A basic description of the supervised and unsupervised learning classes is given in Table 2.2.
Table 2.1: Correspondence between the biological neuron and the artificial neuron (AN).

Stimuli
- Biological neuron: Action potentials expressed as electrical impulses (electrical signals).
- Artificial neuron (AN): A vector of real valued variables.

Synapse
- Biological neuron: Electrical signals are "filtered" by the strength of the synapse.
- Artificial neuron (AN): Each input variable is weighted by the corresponding synaptic weight.

Dendrites
- Biological neuron: The dendrites receive the electrical signals through their synapse.
- Artificial neuron (AN): The weighted input signal is received in the AN.

Soma
- Biological neuron: Input signals integration.
- Artificial neuron (AN): Integration of the input variables; the result is transformed using an activation function h(∙).

Axon
- Biological neuron: Transmits output signals to neighbouring neurons.
- Artificial neuron (AN): A single real variable which is available to pass as input into the neighboring artificial neurons.

[Figure: a biological neuron, graphically.]
[Figure: an artificial neuron, graphically: inputs x1, …, xD are weighted by w1, …, wD, summed (Σ) into s, and passed through h(s) to give the output y.]
Supervised learning:
- Dataset: Input vector: a vector containing D input variables/features, Xn = x1, x2, …, xD, for 1 ≤ n ≤ N. Target vector: a vector containing the values of the output variables that the ANN should answer, given an input sample, Tn = t1, …, tP.
- Network response: Output vector: the answer of the ANN given the instance Xn, in a vector containing the values for each output variable, On = o1, …, oP.
- Goal: Adjust the parameters w so that the difference (Tn − On), called the training error, is minimized.
- Methods: Least-mean-square algorithm (for a single neuron) and its generalization, widely known as the back-propagation algorithm (for multilayer interconnections of neurons).

Unsupervised learning:
- Dataset: Input vector: a vector containing D input variables/features, Xn = x1, x2, …, xD.
- Network response: Assign a neuron or a set of neurons to respond to specific input instances, recovering their underlying relations and projecting them into clusters/groups.
- Goal: Adjust the synaptic parameters w in such a way that for each input class/group a specific artificial neuron (or set of neurons) responds intensively while the rest of the neurons do not respond.
- Methods: Unsupervised learning using Hebbian learning (e.g. Self Organizing Maps).

Table 2.2: Supervised and Unsupervised learning overview.
The choice of which learning method to use depends on many parameters such as the structure of the dataset, the ANN architecture, the type of the problem, etc. An ANN learns by running, for a number of iterations, the steps defined by the chosen learning algorithm. The process of applying learning to an ANN is called the training period/phase. The stopping criterion of the training period may be one of the following (or a combination of them):
1. The maximum number of iterations is reached.
2. The cost/error function is minimized/maximized adequately.
3. The difference of all synaptic weight values between 2 sequential runs is substantially small.
Generalization is one of the most considerable properties of a trained computational model. Generalization is achieved if the ANN (or any other computational model) is able to categorize/predict/decide correctly given input instances that were not presented to the network during the training phase (the test dataset). A measure of the generalization of the network is the testing error (an evaluation of the difference between the output of the network given an input instance taken from the test dataset, and its target value). If the ANN fails to generalize (meaning the testing error is large, considerably bigger than the training error), then phenomena such as overfitting or underfitting are likely to have appeared. Overfitting happens when the ANN is overtrained, so that the model captures the exact relationship between the specific input-output pairs used during the training phase. Eventually the network becomes just a "look-up table", memorizing every input instance (even noise points) and connecting it with the corresponding output. As a result, the network is incapable of recognizing a new testing input instance, responding fairly randomly. Underfitting appears when the network is not able to capture the underlying function mapping input to output data, either due to the small size of the training dataset or the poor architecture of the model. That is why designing an ANN model requires special attention in terms of making the appropriate choice of an ANN model suited to the problem to be solved/modeled.
Such considerations are:
1. If the training dataset is a sample of the actual dataset, then the chosen input instances should be representative, covering all the possible classes/clusters/categories of the actual dataset. For example, if the problem is diagnosing whether a patient carries a particular disease or not, then the sample dataset should include instances of both diseased and non-diseased patients.
2. The number of input instances is also important. If the number of input data is fairly small, then underfitting is likely to happen, since the network will not be given the right amount of information needed to shape the input-output mapping function. Note that adequate instances should be given for all the classes of the input data. Additionally, the size of the training dataset should be defined with respect to the dimension of the input data space: the larger the data dimensionality, the larger the dataset size should be, allowing the network to adapt the synaptic weights (whose number increases/decreases according to the increase/decrease of the input feature variables).
3. The network architecture, meaning the number of hidden layers and the number of their hidden units, the network topology, etc.
4. If the stopping criterion is defined exclusively by the maximum number of epochs and the network is left to run for a long time, there is an increased risk of overfitting. On the other side, if the network is trained only for a small number of epochs, underfitting is possible.
Figure 2.1: How an ANN responds to a regression problem when underfitting, good fitting and overfitting occur.
2.4 - Feed-Forward Neural Network
Artificial neural networks have become, through the years, a big family of different models combined with a variety of learning methods. One of the most popular and prevalent forms of neural network is the feed-forward network combined with backpropagation learning. A feed-forward network consists of an input layer, an output layer and optionally one or more hidden layers. Their name, "feed-forward", denotes the direction of the information processing, which is guided by the connectivity between the network layers. The nodes (artificial neurons) of each layer are connected with all the nodes of the next layer, whereas the neurons belonging to the same layer are independent.
Feed-forward networks are also called "multilayer perceptrons", as they follow the basic idea of the elementary perceptron model in a multilayered form. For that reason, the elementary perceptron algorithm is described first, paving the way for a better understanding of the multilayer perceptron neural network.
2.4.1 - The perceptron algorithm
The perceptron algorithm (Rosenblatt, 1962) belongs to the family of linear discriminant models. The model is divided in two phases:
1. The nonlinear transformation of the input vector x to a feature vector φ(x) (e.g. by using fixed basis functions).
2. Building the generalized linear model as:

y(xn) = f(wT ∙ φ(xn))   Eq. 2.1

where for N input instances 1 ≤ n ≤ N, w is the weight vector and f(∙) is a step function of the form:

f(a) = +1 if a ≥ 0, −1 otherwise   Eq. 2.2
The perceptron is a two-class model, meaning that if the output y(xn) is +1, the input pattern xn belongs to class C1, whereas if y(xn) is −1 then xn belongs to class C2. This model can be rendered by a single artificial neuron, as described previously in this study. The output of the perceptron depends on the weighted feature vector φ(xn). Therefore the perceptron algorithm's intention is to find a weight vector w* such that patterns xn which are members of class C1 satisfy wTφ(xn) > 0, while patterns xn of class C2 satisfy wTφ(xn) < 0. Setting the target output tn ∈ {+1, −1} for class C1 and class C2 respectively, all the correctly classified patterns should satisfy:

wTφ(xn) tn > 0   Eq. 2.3
The error function (named the perceptron criterion) is zero for correctly classified patterns (y(xn) equal to tn), while for the set of misclassified patterns, denoted by M, it gives the quantity:

E(w) = −Σ_{m∈M} (wTφ(xm) tm)   Eq. 2.4

The goal is the minimization of the error function, and to do so the weight vector should be updated every time the error function is not zero. The weight update uses the gradient descent algorithm, so that for a misclassified pattern xm:

w(i+1) = w(i) − η∇E(w) = w(i) + ηφ(xm) tm   Eq. 2.5

where i is the counter of the algorithm iterations and η is the learning rate of the algorithm, defining the rate of change of the weight vector.
The overall perceptron algorithm is divided into the following steps. For each input instance xn at step i:
1. Compute y(xn) = f(w(i)T ∙ φ(xn)).
2. If y(xn) ≠ tn:
   if tn = +1, set w(i+1) = w(i) + ηφ(xn);
   else set w(i+1) = w(i) − ηφ(xn);
   otherwise move forward.
3. Increase the step counter i and move to the next input instance.
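These steps translate directly into code. The following short Python sketch is a minimal, illustrative implementation under simplifying assumptions (the identity is used as the feature map φ, and the dataset is made up for the example):

    import random

    def train_perceptron(data, eta=0.1, max_iters=1000):
        # data: list of (x, t) pairs with x a feature vector phi(x) and t in {+1, -1}
        D = len(data[0][0])
        w = [random.uniform(-0.5, 0.5) for _ in range(D)]
        for i in range(max_iters):
            errors = 0
            for x, t in data:
                y = 1 if sum(wd * xd for wd, xd in zip(w, x)) >= 0 else -1
                if y != t:  # misclassified: w <- w + eta * phi(x) * t (Eq. 2.5)
                    w = [wd + eta * t * xd for wd, xd in zip(w, x)]
                    errors += 1
            if errors == 0:  # perceptron criterion is zero: stop
                break
        return w

    # Illustrative, linearly separable data: (x, target), with a bias feature of 1
    data = [([1, 2.0, 1.0], +1), ([1, -1.5, -0.5], -1), ([1, 1.0, 2.5], +1)]
    w_star = train_perceptron(data)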
2.4.2 - Functions used by Feed-forward Network
A feed-forward network is also named a multilayer perceptron, since it can be analyzed into HL + 1 stages of processing, each one containing multiple neurons behaving like elementary perceptrons, where HL is the number of hidden layers of the network. However, attention should be given to the fact that ANN hidden neurons use continuous sigmoidal nonlinearities, whereas the perceptron uses step-function nonlinearities.
Figure 2.2: A feed-forward network with 2 hidden layers
An example of a feed-forward network is shown in Figure 2.2. The particular network is constituted of one input layer, two hidden layers and an output layer. The features of an input pattern are assigned to the nodes of the input layer. In Figure 2.2 the input pattern has two features, x1 and x2, hence the input layer consists of two nodes. The data is then processed by the hidden layer nodes, ending in one output node which gives the answer to the problem modelled, given the input pattern xn and the synaptic weight vector w.
As can be seen, there is an extra node in every layer except the output layer, notated by a zero marker. It is called the bias parameter, and the synaptic weights beginning from the bias nodes are represented by thinner lines on purpose, since adding a bias node is optional for every layer. The bias parameter is generally added to a layer to allow a fixed offset in the data. For that reason the bias nodes' value is always 1, so that their weight values are simply added to the rest of the weighted node values.
The connectivity of the network resembles the direction of the information propagation. It all begins from the input layer and sequentially ends at the output layer. In the first place, an input pattern Xn = x1, …, xD is introduced to the network (that means the input layer consists of D nodes¹). M linear combinations between the input nodes and the hidden units of the first hidden layer are created as:

p_j = h( Σ_{i=1..D} w_ij x_i + w_0j )   Eq. 2.6

where j = 1, …, M (M is the number of hidden units of the first hidden layer, excluding the bias node) and h(∙) is a differentiable, nonlinear activation function, generally chosen among sigmoidal functions such as the logistic sigmoid or the tanh. The output values of the first hidden layer neurons are linearly combined in the form of:

q_k = h( Σ_{j=1..M} w_jk p_j + w_0k )   Eq. 2.7

to give the output of the second hidden layer units. The same process is repeated for every extra hidden layer. The last step is the evaluation of the output units (belonging to the output layer), where again there are linear combinations of the last hidden layer variables:

o_r = f( Σ_{k=1..K} w_kr q_k + w_0r )   Eq. 2.8

where r = 1, …, R and R is the total number of outputs, K is the number of units of the last hidden layer, and f might be an activation function (e.g. the same as h(∙)) or the identity.
Taking the network presented in Figure 2.2, the whole model functioning is described as:

y_r(x, w) = f( Σ_{k=1..K} w_kr h( Σ_{j=1..M} w_jk h( Σ_{i=1..D} w_ij x_i + w_0j ) + w_0k ) + w_0r )   Eq. 2.9
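As an illustration of Eqs. 2.6-2.9, the following Python sketch computes the forward pass of a feed-forward network layer by layer. It is a minimal example under stated assumptions (sigmoid activation throughout, made-up toy weights), not a full implementation:

    import math

    def sigmoid(s):
        return 1.0 / (1.0 + math.exp(-s))

    def layer_forward(inputs, weights, biases, h=sigmoid):
        # Eqs. 2.6/2.7: each unit j computes h(sum_i w[i][j] * inputs[i] + bias[j])
        return [h(sum(w_row[j] * x for w_row, x in zip(weights, inputs)) + biases[j])
                for j in range(len(biases))]

    def feed_forward(x, layers):
        # layers: list of (weights, biases) per layer; the last layer plays the
        # role of Eq. 2.8 (here using the same activation h for simplicity)
        activation = x
        for weights, biases in layers:
            activation = layer_forward(activation, weights, biases)
        return activation  # the network output vector On

    # Toy network: 2 inputs -> 3 hidden units -> 1 output (made-up weights)
    layers = [([[0.1, -0.2, 0.4], [0.3, 0.5, -0.1]], [0.0, 0.1, -0.1]),
              ([[0.2], [-0.3], [0.5]], [0.05])]
    output = feed_forward([1.0, -0.5], layers)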
The final output of the network, given an input pattern Xn, is the vector On = o1, …, oR. Each element of the vector is the network's answer/decision for the corresponding element of the target vector tn. As with the perceptron model, if the target vector differs from the network output then there is an error (given by an error function). What is left is to train the network by adapting the synaptic interconnections, guided by the minimization of an error function. One of the most well-known learning algorithms for feed-forward neural networks is back propagation learning.

¹ The word "nodes" is used, but for the input layer it is preferable to think of them as incoming signals.
2.4.3 - Back propagation algorithm
The back propagation algorithm tries to compute the weight vector values so that, after the training phase, the network will be able to produce the right output values given a new input pattern taken from the test dataset. Initially, all the synaptic weights are randomized. Then the algorithm repeats the following steps:
Firstly, an input pattern Xn is presented to the network and the feed-forward network functions are calculated in every layer, as described in the section above, to get the output values at the output layer.
The next step is to measure the distance (error) between the network output y(Xn, w) and the target output tn. To do so, an error function must be defined. The most common error function applied in the backpropagation algorithm for feed-forward networks is the sum-of-squares, defined by:
E_n = (1/2) Σ_{r=1..R} (t_r^n − o_r^n)²   Eq. 2.10
where R is the number of output variables (output units), t is the target vector, o is the actual network output vector, and n is the index of the input pattern (instance/observation).
Since the target is to find the weight values which minimize the error function, the change of the error function with respect to the change of the weight vector must be calculated. However, there might be several output nodes (each one corresponding to an element of the target vector), so the gradient of the error function for each output node with respect to its synaptic weights should be calculated as:

∇E_n = dE_n/dw_r = (∂E_n/∂w_kr) for k = 1, …, K and r = 1, …, R   Eq. 2.11

where:
E_n is the error function for output node r, given the input pattern n;
w_r is the synaptic weight vector including all the weight values of the connections between the output node r and the last hidden layer;
K is the number of units of the last hidden layer and R is the number of output units.
Using Equation 2.10, the gradient for each weight value is:

∂E_n/∂w_kr = ∂[ (1/2) Σ_{r=1..R} (t_r^n − o_r^n)² ]/∂w_kr   Eq. 2.12

By using the chain rule:

∂E_n/∂w_kr = (∂E_n/∂o_r) ∙ (∂o_r/∂w_kr)   Eq. 2.13

where:

∂E_n/∂o_r = −(t_r − o_r)   Eq. 2.14

and (recall Eq. 2.8):

∂o_r/∂w_kr = f′(inp_r) q_k   Eq. 2.15

where f(∙) is the output activation function and inp_r is the linear combination of the weights and corresponding hidden unit values (the input of the function f in Eq. 2.8).

Therefore,

∂E_n/∂w_kr = −(t_r − o_r) f′(inp_r) q_k   Eq. 2.16
The general goal is to find the combination of weights which leads to the global minimum of the error function. Hence, gradient descent is applied to every synaptic weight as:

w_kr = w_kr − η ∂E_n/∂w_kr = w_kr + η (t_r − o_r) f′(inp_r) q_k   Eq. 2.17
where η is the learning rate.
The process so far resembles perceptron learning (the perceptron criterion). However, backpropagation differs in that the next step is to calculate each hidden unit's contribution to the final output error and update the corresponding weights accordingly. The idea is that every hidden unit affects the final output in an indirect way (since each hidden unit is fully connected with all the units of the next layer). Therefore, every hidden unit's interconnection weights should be adjusted so as to minimize the output error. The difficulty is that there is no target vector for the hidden unit values, which means that the difference (error) between the calculated value and the correct one cannot be computed directly, as it was in the previous step.
So, the next synaptic weights to be adjusted are the ones connecting the last hidden layer (with K nodes) with the layer before it (with M nodes): w_jk for j = 1, …, M and k = 1, …, K. The change of the error function with respect to these weights is given by:

∂E_n/∂w_jk = ∂[ (1/2) Σ_{r=1..R} (t_r^n − o_r^n)² ]/∂w_jk   Eq. 2.18

= −Σ_{r=1..R} (t_r − o_r) ∂o_r/∂w_jk   Eq. 2.19

= −Σ_{r=1..R} (t_r − o_r) ∂f(inp_r)/∂w_jk   Eq. 2.20

= −Σ_{r=1..R} (t_r − o_r) f′(inp_r) ∂inp_r/∂w_jk   Eq. 2.21

At this point let

δ_r = (t_r − o_r) f′(inp_r)   Eq. 2.22

So that:

∂E_n/∂w_jk = −Σ_{r=1..R} δ_r ∂inp_r/∂w_jk   Eq. 2.23

= −Σ_{r=1..R} δ_r ∂( Σ_{c=1..K} w_cr q_c )/∂w_jk   Eq. 2.24

= −Σ_{r=1..R} δ_r w_kr ∂q_k/∂w_jk   Eq. 2.25

= −Σ_{r=1..R} δ_r w_kr ∂h(inp_k)/∂w_jk   Eq. 2.26

where inp_k is the input of the function h as given in Eq. 2.7, and hence:

∂E_n/∂w_jk = −Σ_{r=1..R} δ_r w_kr h′(inp_k) ∂inp_k/∂w_jk   Eq. 2.27

= −Σ_{r=1..R} δ_r w_kr h′(inp_k) ∂( Σ_{c=1..M} w_ck p_c )/∂w_jk   Eq. 2.28

= −p_j Σ_{r=1..R} δ_r w_kr h′(inp_k)   Eq. 2.29

Let

δ_k = h′(inp_k) Σ_{r=1..R} δ_r w_kr   Eq. 2.30

Then

∂E_n/∂w_jk = −p_j δ_k   Eq. 2.31
Given the change of the error function with respect to the weights of the last hidden layer, the corresponding weight update will be:

w_jk = w_jk − η ∂E_n/∂w_jk = w_jk + η p_j δ_k   Eq. 2.32

Following the same derivation, the updates for the remaining hidden layer synaptic weights are obtained. For example, for the next hidden layer back, the update would be:

w_ij = w_ij − η ∂E_n/∂w_ij = w_ij + η x_i δ_j, where δ_j = h′(inp_j) Σ_{k=1..K} δ_k w_jk   Eq. 2.33
Regarding the amount of the dataset needed to evaluate the gradient of the error function, learning methods can be split into two categories: batch and incremental. Batch methods require the whole training dataset to be processed before evaluating the gradient and updating the weights. On the other side, incremental methods update the weights using the gradient of the error function after each input pattern is processed.
Consequently, backpropagation can be applied in batch or incremental mode, as follows:

Batch Backpropagation
- Initialize the weights W randomly.
- Repeat:
  - For every input pattern Xn:
    a. Present Xn to the network and apply the feed-forward propagation functions to all layers.
    b. Find the gradient of the error function with respect to every synaptic weight (∂E_n/∂w_ij).
    c. If Xn was the last input pattern of the training dataset move forward, otherwise go back to step a.
  - For every synaptic weight:
    a. Sum up the gradients of the error function with respect to that synaptic weight over all patterns:
       ∂E/∂w_ij = Σ_{n=1..N} ∂E_n/∂w_ij
    b. Update the weight using w_ij = w_ij − η ∂E/∂w_ij
- Until the stopping criterion is true.

Incremental Backpropagation
- Initialize the weights W randomly.
- Repeat:
  - For every input pattern Xn:
    a. Present Xn to the network and apply the feed-forward propagation functions to all layers.
    b. Find the gradient of the error function with respect to every synaptic weight (∂E_n/∂w_ij).
    c. Update the weights using w_ij = w_ij − η ∂E_n/∂w_ij
- Until the stopping criterion is true.
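To make the incremental procedure concrete, the sketch below implements incremental backpropagation for a network with one hidden layer, following Eqs. 2.22, 2.30 and 2.32-2.33. It is a minimal illustration under simplifying assumptions (sigmoid activations everywhere, a single hidden layer, a made-up XOR dataset), not a general-purpose implementation:

    import math, random

    def sigmoid(s):
        return 1.0 / (1.0 + math.exp(-s))

    def train_incremental(data, D, M, R, eta=0.5, epochs=1000):
        # w1[i][j]: input i -> hidden j (row D holds the biases);
        # w2[j][r]: hidden j -> output r (row M holds the biases)
        w1 = [[random.uniform(-0.5, 0.5) for _ in range(M)] for _ in range(D + 1)]
        w2 = [[random.uniform(-0.5, 0.5) for _ in range(R)] for _ in range(M + 1)]
        for _ in range(epochs):
            for x, t in data:
                # forward pass (Eqs. 2.6 and 2.8, with sigmoid for both h and f)
                p = [sigmoid(sum(w1[i][j] * xi for i, xi in enumerate(x)) + w1[D][j])
                     for j in range(M)]
                o = [sigmoid(sum(w2[j][r] * pj for j, pj in enumerate(p)) + w2[M][r])
                     for r in range(R)]
                # output deltas, Eq. 2.22 (for the sigmoid, f'(inp) = o(1 - o))
                d_out = [(t[r] - o[r]) * o[r] * (1 - o[r]) for r in range(R)]
                # hidden deltas, Eq. 2.30
                d_hid = [p[j] * (1 - p[j]) * sum(d_out[r] * w2[j][r] for r in range(R))
                         for j in range(M)]
                # weight updates, Eqs. 2.32-2.33 (bias inputs are fixed to 1)
                for j in range(M):
                    for r in range(R):
                        w2[j][r] += eta * p[j] * d_out[r]
                for r in range(R):
                    w2[M][r] += eta * d_out[r]
                for i in range(D):
                    for j in range(M):
                        w1[i][j] += eta * x[i] * d_hid[j]
                for j in range(M):
                    w1[D][j] += eta * d_hid[j]
        return w1, w2

    # Illustrative run: train on the XOR patterns with 2 inputs, 3 hidden units, 1 output
    xor_data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
    w1, w2 = train_incremental(xor_data, D=2, M=3, R=1)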
As stated previously, a trained network should be able to generalize over new input patterns that were not presented during the training period. The most common generalization assessment method is to split the total available dataset into a training and a testing dataset. The two sets have no common patterns, but each of them contains an adequate number of patterns from all classes. Reasonably, the training dataset will be larger than the testing one. The training dataset is used to train the network and update the weights. After all training input patterns have been presented to the network, and its weights have been adjusted using backpropagation (either batch or incremental), the network applies feed-forward propagation to the testing input instances. For each testing input pattern, the error between the network output and the testing target output is calculated. Finally, the average testing error can be interpreted as a simple measure of the network's generalization: the smaller the testing error, the better the generalization of the network. If the testing error meets the standards defined by the ANN user, the network training stops and the final weights are used to give any further answers to the modelled problem. On the contrary, if the testing error is larger than a predefined constant, then the training of the network must continue.
A method called k-fold cross validation is an alternative generalization evaluation method, and it appeals especially to cases with a scarce dataset. In such cases, where the total dataset has few input instances, sacrificing a number of them to create a testing dataset might lead to deficient learning, since there is a high risk that the synaptic interconnections of the network will not get adequate information about the input space to adjust to. Applying k-fold cross validation requires splitting the initial dataset randomly into k even, disjoint datasets. The training period is also divided into k phases. During each training phase, the network is trained using k − 1 of the datasets, leaving one out, called the validation dataset. After each training phase, the validation dataset is presented to the network, applying feed-forward propagation and calculating the error between the network output and the validation target vector. The process is repeated k times, each time leaving a different sample of the total dataset out as the validation set. At the end, the average of the k validation scores is an indicator of the network's generalization ability. The goal of this method is to exploit the information given by the whole dataset to train the network while at the same time having a generalization estimator for the network's learning progress.
[Figure 2.3 shows the initial total dataset, the dataset divided into 10 folds, and the 10 learning phases: the 1st validation set gives E1, the 2nd gives E2, …, the 10th gives E10.]

Figure 2.3: 10-fold cross validation, where the orange parts represent the validation set at each learning phase and the purple ones the training set. E_n, where n = 1, …, 10, is the corresponding validation error. By the completion of all 10 learning phases, the total validation (or testing) error is the average (Σ_{c=1..10} E_c) / 10, or some other combination of the validation errors (e.g. simply their sum).
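The fold construction and the validation loop can be sketched as follows in Python; train_fn and error_fn are hypothetical placeholders (assumptions for illustration), since any trainable model can be plugged in:

    import random

    def k_fold_split(dataset, k=10, seed=0):
        # Shuffle once, then cut the dataset into k (nearly) even disjoint folds
        data = list(dataset)
        random.Random(seed).shuffle(data)
        return [data[i::k] for i in range(k)]

    def cross_validate(dataset, train_fn, error_fn, k=10):
        folds = k_fold_split(dataset, k)
        errors = []
        for i in range(k):
            validation = folds[i]                      # leave fold i out
            training = [p for j, f in enumerate(folds) if j != i for p in f]
            model = train_fn(training)                 # one learning phase
            errors.append(error_fn(model, validation)) # E_i for this phase
        return sum(errors) / k                         # average validation error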
2.5 - Self Organizing Maps
There are some cases of systems for which the existing available data is not classified/clustered/categorized in any way, and thus the correct output/response of the model to an input data pattern is not provided. However, many times certain underlying relations can be extracted, in the sense of clustering the data. The identification of such relations comprises the task of unsupervised learning methods. As mentioned earlier in this chapter, neural networks can be trained using unsupervised learning. The Self Organizing Map (SOM) is an unsupervised learning neural network paradigm.
2.5.1 - Biology
The SOM model is neurobiologically inspired by the way the cerebral cortex of the brain handles signals from various sources of input (motor, somatosensory, visual, auditory, etc). The cerebral cortex is organised in regions, and each physical region is assigned to the processing of a different sensory input.

Figure 2.4: Mapping of the visual field (right side) on the cortex (left side)
The visual cortex comprises the part of the cerebral cortex which processes visual information. More specifically, the primary visual cortex, symbolized as V1, is mainly responsible for processing the spatial information in vision. The left part of Figure 2.4 shows the left V1. The right side represents the visual field of the right eye, where the inner circle of the visual field contains everything captured by the fovea. Figure 2.4 shows the correlation between the subsets of the visual field and the corresponding visual processing regions on V1. Observing Figure 2.4 makes it obvious that neighbouring regions of the visual field are mapped to neighbouring regions of the cortex. Additionally, an interesting observation is that the visual cortex surface which is responsible for processing the information coming from the fovea (labelled as "1") is disproportionately large relative to the inner circle surface of the visual field (the fovea). This phenomenon leads to the conclusion that the signals arriving from the foveal visual field need more processing power than the signals from the surrounding regions of the visual field. That is because signals coming from the fovea have high resolution and hence must be processed in more detail. Consequently, it seems that the brain cortex is topologically organized in such a way that it resembles a topologically ordered representation of the visual field.
2.5.2 - Self Organizing Maps Introduction
A Self Organizing Map, as introduced by Kohonen in 1982, is a neural network constituted by an output layer fully connected with synaptic weights to the feature variables of an input instance (Figure 2.5). The name of this model concisely explains the data processing mechanism. The neurons of the output layer are most often organized in a two dimensional lattice. During the training phase, output neurons adapt their synaptic weights based on the features of a presented input pattern, without being supervised, by applying competitive learning. Eventually, when an adequate number of input patterns has been shown to the model, the synaptic weights will be organized in such a way that each output neuron responds to a cluster/class of the input space, and the final topological ordering of the synaptic weights resembles the input distribution.
Analogising Kohonen's network to the biological model, the input layer can be thought of as the input signals and the output layer as the cerebral cortex. Hence, this kind of network must be trained in such a way that eventually each neuron, or group of neighbouring neurons, will respond to different input signals.
Kohonen's model learning can be divided into two processes: competition and cooperation. Competition takes place right after an input pattern is introduced to the network. A discriminant function is calculated by every artificial neuron, and the resulting value provides the basis of the competition; the neuron with the highest discriminant function score is considered the winner.
Consequently, the synaptic weights between the presented input vector (stimuli) and the winning neuron will be strengthened, with the aim of minimizing the Euclidean distance between them. Hence, the goal is to move the winning neuron physically towards the corresponding input stimulus. The strengthening of the winning neuron's synaptic weights alone, while the losing neurons are left out, can also be thought of as a form of lateral inhibition.
Figure 2.5: A SOM example. The dataset includes N input instances X = X1, …, XN. Each synaptic weight Wj is a vector of D elements (corresponding to the D features of the input instance vector Xn) connecting each input feature with the jth neuron. The output layer is a 4 x 4 lattice of neurons Oj. Each neuron is indexed according to its position on the lattice (row, column).
The cooperation phase follows, where the indexing position of the winning neuron on the lattice gives the topological information about the neighbourhood of surrounding neurons to be excited. So, eventually, the neuron which was declared the winner through competition cooperates with some of its neighbouring neurons by acting as an excitation stimulus for them. Essentially this means that the winning neuron allows its neighbours to increase their discriminant function in response to the specific input pattern, to a certain degree which is assuredly lower than the strengthening of its own synaptic weights.
Ultimately, the trained network is able to transform a multidimensional input signal into a discrete map of one or two dimensions (the output layer). The transformation is driven by the synaptic weight vector of each discrete output neuron, which basically resembles the topological order of some input vectors (forming a cluster) in the input space.
2.5.3 - Algorithm
The whole training process is based on adapting, primarily, the synaptic weight vector of the winning neuron, and thereafter of the neighbouring ones, to the features of the introduced input vector. Towards that direction, a sufficient number of input patterns should be presented to the network, iteratively, yet one at a time, initiating this way the competition among the output neurons, followed by the cooperation phase.
For the examples given in the rest of the algorithm description, refer to the network given in Figure 2.5. In the first place, a D-dimensional input vector Xn is presented at each time step to the network:

Xn = x1, …, xD where n ∈ [1, …, N] and N, D ∈ ℕ1

The output layer is discrete, since each neuron of the output layer is characterized by a discrete value j which can be thought of as its unique identity or name. Output neurons are indexed, meaning that their position is defined by the coordinates of the lattice. So for a 2-dimensional lattice, the index of each neuron is given by a two dimensional vector pj containing the row and the column number of the neuron's position on the lattice. For example, the index of neuron o9 is p9 = (0,2) on the map. This can be thought of as the address of the neuron.
Each output neuron, symbolized oj, is assigned a D dimensional synaptic weight vector Wj so that:

Wj = w1j, …, wDj where j ∈ [1, …, L] and L ∈ ℕ1
L is the number of output neurons, so that L = K x Q for K, Q ∈ ℕ1, defining the number of rows and the number of columns of the map respectively.
Output neurons compete with each other by comparing their synaptic weight vectors Wj with the input pattern Xn, trying to find the best match between the two vectors. Hence, the dot product WjXn is calculated for each neuron j = 1, …, L and the highest value is chosen. The long-term objective of the algorithm is to maximize the quantity WjXn for n = 1, …, N (the whole dataset), which can be interpreted as minimizing the distance given by the Euclidean norm ||Xn − Wj||. Therefore,

i(Xn) = arg min_j ||Xn − Wj||, so that ||Xn − W_i(Xn)|| = min{ ||Xn − W1||, …, ||Xn − WL|| }   Eq. 2.34

where i(Xn) is the index of the winning output neuron, i ∈ [1, L], whose synaptic weight vector Wi is the closest to the input vector Xn among all neurons j = 1, …, L. Note that the input pattern is a D dimensional vector of continuous elements which is mapped into a two dimensional discrete space of L outputs using a competition process. So, in the example of Figure 2.5, the winning neuron of the competition initiated by the input vector Xn is the output neuron j = 15, and i(Xn) corresponds to lattice position (2,3).
At the end of the competition process, the winning neuron "excites" the neurons which lie in its neighbourhood. Two concepts must be clarified at this point: how a neuron's neighbourhood is defined, and how a neuron is excited by another neuron. A neighbourhood of neurons refers, basically, to the topological arrangement of neurons surrounding the winning neuron. Obviously, the neurons positioned close to the winning neuron will be more excited than others positioned at a bigger lateral distance. Thus, the excitation effect is graduated with respect to the lateral distance dij between the winning neuron i and any other neuron j, for j = 1, …, L. Under that constraint, the neighbourhood is given by a function (called the neighbourhood function) hij which should satisfy two basic properties:
(1) It has to be symmetric about dij = 0 (the maximum is attained at the winning neuron itself, for which dii = 0).
(2) It should be a monotonically decaying function with respect to the distance dij.
A Gaussian function exhibits both properties, so the Gaussian neighbourhood function is given as:

h_ij = exp( − d_ij² / (2σ²) )   Eq. 2.35

where:

d_ij = ||p_i − p_j||   Eq. 2.36
Note that, besides the Gaussian, other functions can be chosen to implement the neighbourhood function. The size of the neighbourhood, moreover, is not fixed, since it decreases with time (discrete time, iterations). When the training begins, the neighbourhood range is quite large, so the function h_ij returns large values for a big set of neighbouring neurons. Through the iterations, fewer and fewer neurons get large values from h_ij, meaning that fewer neurons are excited by the winning neuron. Eventually the whole neighbourhood may consist of only the winning neuron itself. Given the neighbourhood function of Equation 2.35, the neighbourhood decrease can be embedded by decreasing σ over time using the function:

σ(c) = σ0 exp( − c / T1 ) for c = 0, 1, 2, 3, …   Eq. 2.37

where c is the iteration counter, σ0 is the initial value of σ and T1 is a time constant. Therefore Eq. 2.35 can be reformed as:

h_ij(c) = exp( − d_ij² / (2 σ(c)²) ) = exp( − d_ij² / (2 (σ0 exp(−c/T1))²) )   Eq. 2.38
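As a small numerical illustration of Eqs. 2.36-2.38 in Python (the lattice positions reuse the 4 x 4 example of Figure 2.5; σ0 and T1 are made-up values for the example):

    import math

    def neighbourhood(p_i, p_j, c, sigma0=2.0, T1=100.0):
        # Eq. 2.36: lateral distance between lattice positions (row, column)
        d_ij = math.dist(p_i, p_j)
        # Eq. 2.37: the neighbourhood radius sigma shrinks over iterations c
        sigma = sigma0 * math.exp(-c / T1)
        # Eq. 2.38: Gaussian neighbourhood value
        return math.exp(-d_ij**2 / (2 * sigma**2))

    # Winner at (2,3); a neighbour at (2,2) is excited strongly early on ...
    print(neighbourhood((2, 3), (2, 2), c=0))    # ~0.88
    # ... but barely at all later, as the neighbourhood shrinks
    print(neighbourhood((2, 3), (2, 2), c=200))  # ~0.001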
So, when two similar, but not totally identical, input vectors are presented to the network, there will be two different winning neurons. However, besides the winning neuron's synaptic weights, the neighbours will cooperatively update their own weights to a smaller degree. Hence, clusters of input vectors that are close to each other, but still different, will be projected onto topologically close neurons on the lattice. That is why cooperation is a necessary process for the correct training of this model. A form of how this process is biologically motivated is shown in Figure 2.4.
The final step is the weight adaptation which employs Hebbian learning. Hebbian learning defines that when
simultaneous activation in presynaptic and postsynaptic areas then the synaptic connection between them is
strengthened, otherwise it is weakened. In the case of Kohonen’s model, the presynaptic area is the input
signal Xn and the postsynaptic area is the output activity of the neuron noted as yj. Output activity is
maximized when the neuron wins the competition, is zero when it does not belong to the winner’s
26
neighbourhood and somewhere in between otherwise. So yj could be expressed as the neighbourhood
function hij. Yet:
ΔWj ∝ yjXn
∝ hijXn
Eq. 2.39
However, by using the Hebbian mechanism of learning, there is a risk that the synaptic weights might reach saturation. This happens because the weight adaptation occurs only in one direction (towards the input pattern), causing all weights to saturate. To overcome this difficulty, a forgetting term −g(yj)Wj is introduced, giving:
$$\Delta W_j = \eta y_j X_n - g(y_j) W_j$$
Eq. 2.40
where η is the learning rate parameter. The term ηyjXn drives the weight update towards the input signal direction, making the particular jth neuron more sensitive when this input pattern is presented to the network. On the other side, the forgetting term −g(yj)Wj keeps the new synaptic weight within reasonable levels. The function g(yj) must return a positive scalar value, so for simplification purposes it can be defined through yj as g(yj) = ηyj. Hence the weight update can be expressed as:
$$\Delta W_j = \eta h_{ij} X_n - \eta h_{ij} W_j = \eta h_{ij} (X_n - W_j)$$
Eq. 2.41
In discrete time simulation, the updated weight vector is given by:
$$W_j(c+1) = W_j(c) + \eta(c)\, h_{ij}(c)\, (X_n(c) - W_j(c)) \quad \text{for } c = 0, 1, 2, 3, \dots$$
Eq. 2.42
where
$$\eta(c) = \eta_0 \exp\left(-\frac{c}{T_2}\right)$$
Eq. 2.43
and η0 is the initial value of η and T2 is a time constant.
Essentially, the term (Xn(c) − Wj(c)) drives the alignment of Wj(c+1) towards the vector Xn at each iteration c. The winning weight vector makes a full step (hij(c) = 1) towards the input signal, while the neighbouring excited synaptic weights move to a smaller degree defined by the neighbourhood function (hij(c) < 1). Additionally, the update of all weights is restricted by the learning rate η(c), which changes with time. In the first iterations it is quite large, allowing fast learning: the weight vectors move aggressively towards the input patterns and form the rough boundaries of the potential input clusters. As the iteration counter increases, the learning rate decreases, forcing the weight vectors to move more carefully through the space in order to capture all the necessary information carried by the input patterns and gently converge to the correct topology. In the end, the topological arrangement of the synaptic weights in the weight space will resemble the statistical distribution of the input.
The algorithm in steps (a minimal code sketch follows the list):
1. Initialize randomly the weight vectors for all output neurons j = 1, …, L, where L is the number of output neurons lying on the lattice.
2. Present the input vector Xn = (x1, …, xD) to the network, where n = 1, …, N and N is the number of input patterns included in the dataset.
3. Calculate WjXn for all output neurons and find the winner using Eq. 2.34.
4. Update all synaptic weights using Equation 2.42.
5. Return to step 2 until there are no significant changes to the weight vectors or the maximum number of iterations is reached.
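The following Python sketch puts steps 1-5 together for a small rectangular lattice. It is a minimal illustration, not an optimized implementation; the lattice size, the decay constants, and the use of the minimum Euclidean distance as the competition rule of step 3 are assumptions made for the example:

    import numpy as np

    def train_som(X, rows=10, cols=10, iters=5000,
                  sigma0=5.0, eta0=0.1, T1=1000.0, T2=1000.0):
        rng = np.random.default_rng(0)
        N, D = X.shape
        W = rng.random((rows * cols, D))          # step 1: random initial weights
        # fixed lattice positions p_j of the L = rows*cols output neurons
        P = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
        for it in range(iters):
            x = X[rng.integers(N)]                # step 2: present an input pattern
            win = np.argmin(np.linalg.norm(W - x, axis=1))   # step 3: competition
            d = np.linalg.norm(P - P[win], axis=1)           # lattice distances d_ij
            s = sigma0 * np.exp(-it / T1)         # Eq. 2.37
            eta = eta0 * np.exp(-it / T2)         # Eq. 2.43
            h = np.exp(-d**2 / (2.0 * s**2))      # Eq. 2.38
            W += eta * h[:, None] * (x - W)       # step 4: the update of Eq. 2.42
        return W

    W = train_som(np.random.default_rng(1).random((500, 2)))   # step 5: fixed budget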
The learning mechanism can be divided into two phases:
(A) The self-organizing topological ordering phase
(B) The convergence phase
During phase (A), the spatial locations of the output neurons are roughly defined (in the weight space). Each neuron corresponds to a specific domain of input patterns. The winning neuron's neighbourhood is large during this phase, allowing neurons which respond to similar input signals to be brought together. As a result, input clusters which have common or similar features are projected to neurons which lie next to each other on the weight map. However, the neighbourhood shrinks with time, and eventually it may be the case that the neighbourhood comprises only the winner itself. The convergence phase begins when the neighbourhood has become very small. Having a small neighbourhood allows the winning neurons to fine-tune the positions which were obtained in the topological ordering phase. During this phase the learning rate is also fairly small, so the weights move very slowly towards the input vectors.
2.6 - Kernel Functions
A challenge for the computational intelligence area is to provide good estimators for non-linearly separable data. As indicated by their name, linear functions (e.g. the perceptron algorithm) are incapable of processing non-linear data successfully and so they are eliminated from the pool of possible solutions to non-linear problems. However, non-linear data often become linearly separable if they are transformed to a higher dimensional space (called the feature space), and thus a linear model can be used to solve the problem there. So the previous statement should be rephrased as: linear functions are incapable of processing successfully non-linear data in the input space (the space where the input training examples are located).
Hence, given a set of data [X, T]:
X = x^1, x^2, …, x^N where x ∈ ℝ^D
T = t^1, t^2, …, t^N where t ∈ ℝ
so that:
$$X = \begin{bmatrix} x^1 \\ \vdots \\ x^N \end{bmatrix} = \begin{bmatrix} x_1^1, \dots, x_d^1, \dots, x_D^1 \\ \vdots \\ x_1^N, \dots, x_d^N, \dots, x_D^N \end{bmatrix}, \qquad T = \begin{bmatrix} t^1 \\ \vdots \\ t^N \end{bmatrix}$$
Eq. 2.44
and a dataset mapping via:
$$\Phi: x \to \Phi(x) \text{ where } \Phi(x) := \left(\Phi_1(x), \dots, \Phi_k(x), \dots, \Phi_K(x)\right)$$
Eq. 2.45
the resulting feature data set is:
$$\Phi = \begin{bmatrix} \Phi(x^1) \\ \vdots \\ \Phi(x^N) \end{bmatrix} = \begin{bmatrix} \Phi_1(x^1), \dots, \Phi_k(x^1), \dots, \Phi_K(x^1) \\ \vdots \\ \Phi_1(x^N), \dots, \Phi_k(x^N), \dots, \Phi_K(x^N) \end{bmatrix}, \qquad T = \begin{bmatrix} t^1 \\ \vdots \\ t^N \end{bmatrix}$$
Eq. 2.46
As shown above, the raw dataset is represented by an N × D matrix while the mapped dataset is an N × K matrix, where K ≥ D. The transformed dataset is now separable in the ℝK feature space by a linear hyperplane. The key to solving the inseparability problem lies in the appropriate choice of a feature space of sufficient dimensionality. This task must be accomplished with great care, since a wrong feature mapping will give misleading results. In addition, the whole data transformation process can be computationally expensive, since the cost of the data mapping, in terms of time and memory, depends strongly on the feature space dimensionality. That is why constructing features from a given dataset is a rather difficult task which often demands expert knowledge in the domain of the problem.
For example, consider constructing quadratic features for x ∈ ℝ2:
$$\Phi(x) := \left(x_1^2, \sqrt{2}\, x_1 x_2, x_2^2\right)$$
Eq. 2.47
In the case that the data point x is a two dimensional vector, the new feature vector is three dimensional, so the mapping does not seem very demanding in terms of time or memory. However, imagine creating quadratic features for a hundred-dimensional dataset. Since the map of Eq. 2.47 produces one feature per unordered pair of coordinates, i.e. D(D+1)/2 features, each new feature data point would be of 5050 dimensions. Hence for every datapoint of the raw dataset, 5050 values would have to be computed and stored in the transformed data matrix.
However, most linear learning algorithms depend only on the dot product between two input training vectors. Therefore, to apply such algorithms in the feature space, the dot product between the transformed data points (features) must be calculated and must substitute every dot product between the raw input vectors in the algorithm's formulation:
$$\langle x, x' \rangle \to \langle \Phi(x), \Phi(x') \rangle$$
Eq. 2.48
Back to the example presented in Eq. 2.47, the dot product between the quadratic features yields the quantity:
$$\langle \Phi(x), \Phi(x') \rangle = x_1^2 {x_1'}^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 {x_2'}^2 = \langle x, x' \rangle^2$$
Eq. 2.49
As observed in Eq. 2.49, the dot product of two quadratic features in ℝ2 is equal to the square of the dot product between the corresponding raw data patterns. So an alternative to explicit feature construction and computation is to implicitly compute the dot product between two features without defining the exact feature mapping routine in any way. A function can be defined instead, which accepts two input vectors and returns the dot product of their corresponding features in the feature space. Such functions are called kernels and were first introduced by Aizerman et al. (1964).
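A quick numerical check of Eq. 2.49 in Python (the two test vectors are arbitrary):

    import numpy as np

    def phi(x):
        # the explicit quadratic feature map of Eq. 2.47
        return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

    x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(phi(x) @ phi(xp))   # dot product computed explicitly in feature space: 1.0
    print((x @ xp)**2)        # the same value via the kernel <x, x'>^2: 1.0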
Generally, considering a function Φ(x), which maps the input data into a feature space, a kernel is a
symmetric function which is defined as:
𝐾: 𝑋 × 𝑋 → ℝ
Eq. 2.50
and it satisfies the following property:
K(𝑥, 𝑥 ′ ) = ⟨Φ(𝑥), Φ(𝑥 ′ )⟩
Eq. 2.51
Furthermore, for a given dataset of size N, the kernel function can be represented by a kernel matrix (also referred to as the Gram matrix), which is defined in terms of dot products in the feature space as:
$$K(i, j) = \langle \Phi(x_i), \Phi(x_j) \rangle \quad \text{where } i, j = 1, \dots, N$$
Eq. 2.52
In simple words, a kernel function directly transfers the dot product relation to a higher dimensional feature space.
Obviously, not every function makes a good kernel function. A good kernel function should satisfy some properties, among which these two are the most important:
1) The function must be symmetric, so that K(x, x′) = K(x′, x), since the dot product satisfies symmetry: ⟨Φ(x), Φ(x′)⟩ = ⟨Φ(x′), Φ(x)⟩.
2) The corresponding kernel matrix must be a positive semi-definite matrix for all choices of the dataset X = x1, x2, …, xN (Shawe-Taylor and Cristianini, 2004), which means that:
$$a^T K a \geq 0 \text{ for all } a \in \mathbb{R}^N \text{ and a kernel matrix } K \in \mathbb{R}^{N \times N}$$
Eq. 2.53
or, written element-wise:
$$\sum_{i=1}^{N} \sum_{j=1}^{N} K(x_i, x_j)\, a_i a_j \geq 0 \text{ for all coefficients } a_i, a_j \in \mathbb{R}$$
Eq. 2.54
(equivalently, all eigenvalues of K are non-negative).
This can be proven as:
$$\sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j \langle \Phi(x_i), \Phi(x_j) \rangle = \left\langle \sum_{i=1}^{N} a_i \Phi(x_i), \sum_{j=1}^{N} a_j \Phi(x_j) \right\rangle = \left|\left| \sum_{j=1}^{N} a_j \Phi(x_j) \right|\right|^2 \geq 0$$
Eq. 2.55
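Both properties can be checked numerically. The sketch below builds the Gram matrix of the kernel K(x, x′) = ⟨x, x′⟩² on a random toy dataset and verifies symmetry and positive semi-definiteness through the eigenvalues (the dataset itself is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 5))              # toy dataset: N = 20 patterns in R^5
    K = (X @ X.T)**2                          # Gram matrix of Eq. 2.52 for <x,x'>^2
    print(np.allclose(K, K.T))                # property 1: symmetry -> True
    print(np.linalg.eigvalsh(K).min() >= -1e-9)   # property 2: eigenvalues >= 0 -> True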
The squared distance between two features lying in the feature space can also be expressed in terms of the corresponding kernel, as shown in Eq. 2.56:
$$d(x, x')^2 = ||\Phi(x) - \Phi(x')||^2 = \langle \Phi(x), \Phi(x) \rangle - 2\langle \Phi(x), \Phi(x') \rangle + \langle \Phi(x'), \Phi(x') \rangle = K(x, x) - 2K(x, x') + K(x', x')$$
Eq. 2.56
Additionally, many linear models expressed in dual form can be reformulated in terms of a kernel function, thus eliminating the presence of the feature mapping function from the primal formulation of the model. The perceptron algorithm is revisited here and used to show how dual representations may be expressed using kernels. The perceptron algorithm optimizes the perceptron criterion function given in Eq. 2.4. Let the initial weight vector w0 be equal to the zero vector. Then, by minimization of the perceptron criterion, the optimized weight vector will be equal to:
$$w^* = \sum_{i=1}^{M} a_i^* t^i x^i$$
Eq. 2.57
while if the input data are mapped to a feature space:
$$w^* = \sum_{i=1}^{M} a_i^* t^i \Phi(x^i)$$
Eq. 2.58
where in both Eq. 2.57 and Eq. 2.58:
- ai > 0 only for the misclassified patterns and zero for the rest. Each ai represents the contribution of xi to the weight updates. It is often called the embedding strength of xi and is proportional to the number of times the particular pattern was misclassified (causing the weights to change).
- M is the total number of misclassified training patterns.
By substituting the optimal weights into the initial perceptron formulation, the final decision boundary becomes:
$$y = \mathrm{sign}(\langle w^*, \Phi(x) \rangle) = \mathrm{sign}\left(\left\langle \sum_{i=1}^{M} a_i^* t^i \Phi(x^i), \Phi(x) \right\rangle\right) = \mathrm{sign}\left(\sum_{i=1}^{M} a_i^* t^i \langle \Phi(x^i), \Phi(x) \rangle\right)$$
Eq. 2.59
There are two noticeable things in Eq. 2.59. Firstly, the dot product between the features appears, allowing its substitution by the kernel matrix K(i, j). Secondly, the weight term w is absent from the decision boundary function, so the term which must be optimized is the vector a = {a1, …, aN}. Since each element of this vector represents the contribution of the corresponding pattern to the weight updates, the elements are initialized to zero and the update step of the perceptron algorithm becomes:
if
$$t^j \sum_{i=1}^{M} a_i t^i \langle \Phi(x^i), \Phi(x^j) \rangle \leq 0$$
then
$$a_j = a_j + 1 \quad (\text{or } a_j = a_j + \eta \text{ if a learning rate is used})$$
Therefore, given a fixed input training set, the decision boundary can be formulated using either w or a. However, if the perceptron is to be used for non-linear data classification, then the perceptron dual form representation allows convergence to a solution expressed explicitly in terms of kernels, avoiding any necessity of defining the feature function Φ(.).
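A minimal Python sketch of this dual (kernel) perceptron is given below, trained on an XOR-like dataset that is not linearly separable in the input space but becomes separable under the quadratic kernel of Eq. 2.49. The data, the number of epochs and the kernel choice are assumptions made for the illustration:

    import numpy as np

    def kernel_perceptron(X, t, kernel, epochs=20):
        N = len(t)
        K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
        a = np.zeros(N)                      # embedding strengths, initialized to zero
        for _ in range(epochs):
            for j in range(N):
                # pattern j is misclassified if t^j sum_i a_i t^i K(x^i, x^j) <= 0
                if t[j] * np.sum(a * t * K[:, j]) <= 0:
                    a[j] += 1.0              # the dual update step
        return a

    X = np.array([[-1.0, -1.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]])
    t = np.array([1.0, 1.0, -1.0, -1.0])     # XOR-like labels
    quad = lambda u, v: (u @ v)**2           # the kernel of Eq. 2.49
    a = kernel_perceptron(X, t, quad)
    for xn, tn in zip(X, t):
        y = np.sign(np.sum(a * t * np.array([quad(xi, xn) for xi in X])))
        print(y == tn)                       # True for all four patterns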
In summary, a linear function can be applied to non-linear data simply by substituting the correspondingly indexed kernel matrix entry for every dot product of the input data vectors, while defining the exact feature mapping on the input data is of no interest. The only necessity is the ability to handle the dot product of features in some feature space (of any dimensionality), and a good kernel satisfies this need.
2.7 - Support Vector Machines
Support Vector Machines (SVM) comprise a learning machine. Like the perceptron model, an SVM learns over a labeled training dataset, localizes a hyperplane which separates the input space into two classes, and is then able to generalise over new input patterns. It is categorized as a linear machine learning model and therefore fails to separate non-linear data correctly in the input space. However, this problem is easily solved by using kernels, in which case the SVM separates the data with a hyperplane in the feature space. As a result, having chosen an appropriate feature space (through kernel function selection) of sufficient dimensionality, any consistent training set can be made separable.
2.7.1 - SVM for two-class classification
Finding a hyperplane which successfully separates two classes of data is only one part of the SVM's principal goal. Of greater interest is finding the one hyperplane which minimizes the generalization error. In some cases there are multiple possible hyperplanes in the input space which can separate the input data. Hence for each classification problem there is a set of solutions separating the input training data exactly into the right classes, but only one will achieve the smallest generalization error (testing error). This solution is the one offered by the SVM learning machine.
Figure 2.6: Binary classification problem: There are a lot of hyperplane solutions to one problem.
Which one should be chosen?
The SVM approach is motivated by the statistical learning theory of generalization (error) bounds, which states that among all possible linear functions, the one which minimizes the generalization error is the one which maximizes the margin of the training set. The margin is defined as the minimal distance between the decision boundary and the closest data points of each class (Fig. 2.7). Therefore, what the SVM essentially does is provide the one hyperplane with the maximum distance from the closest class points on both sides.
Figure 2.7: The decision boundary chosen by SVM.
Consider the dataset of Eq. 2.44. The Support Vector Machine model separates the input data vectors into two classes in the D dimensional input space. The location of the separating hyperplane in the input space is defined by the positions of the input datapoints (vectors) of both classes which maximize the margin. These vectors are thus said to support the position of the classifier, as they keep its position fixed in the input space. That is why these vectors are called support vectors, while all the remaining input vectors are non-support vectors. In Fig. 2.7 the support vectors of the particular model are highlighted with a light glow around them. The separating hyperplane behaves as if it were connected to the support vectors by an underlying relationship: if the support vectors shift, then the separating classifier shifts as well. On the contrary, if the non-support vectors shift, the hyperplane position remains invariant. Consequently, the Support Vector Machine is a learning machine which achieves learning from training input data by identifying the support vectors among the training data vectors and, based on their position, defining the decision boundary.
Recall that a hyperplane in an arbitrary dimensional space is given by:
$$\langle w, x \rangle + b = 0$$
Eq. 2.60
where b is the bias and w the weight vector. The SVM decision function for a binary classification problem is:
$$y = \mathrm{sign}(\langle w, x \rangle + b)$$
Eq. 2.61
for labeled outputs y ∈ {+1, −1}, and it is identical to the perceptron decision function. Apparently this decision function does not depend on the magnitude of the quantity ⟨w, x⟩ + b, since the sign determines the final decision. Therefore, any positive rescaling of the terms w → λw and b → λb has no effect on the model's decision. Essentially the idea is to implicitly fix the scale of the input space in such a way that the distance between the hyperplane and the support vectors is equal to one:
$$\gamma = \frac{margin}{2} = 1 \quad \text{(shown in Fig. 2.8)}$$
Eq. 2.62
Hence, the correct decision value for the positive labeled support vectors will be exactly +1, whereas for the negative labeled support vectors it will be exactly −1. Suppose now there is a positive and a negative labeled support vector, denoted x1 and x2 respectively (Fig. 2.8):
$$\langle w^*, x_1 \rangle + b^* = +1$$
Eq. 2.63
$$\langle w^*, x_2 \rangle + b^* = -1$$
Eq. 2.64
where the terms w* and b* describe the canonical form of the separating hyperplane, meaning that they are suitably normalized to give the desired result for support vectors of both classes. Provided w* and b*, the margin is equal to the length of the vector resulting from the projection of (x1 − x2) onto the normal vector to the hyperplane. Since the weight vector is always perpendicular to the decision boundary, the unit normal vector is given by w/||w||. Let c be the vector resulting from the projection of (x1 − x2) onto the direction of the normal vector. Subtracting the canonical hyperplane equations for the support vectors x1 and x2:
$$\langle w^*, x_1 \rangle + b^* - \langle w^*, x_2 \rangle - b^* = 1 + 1 = 2 \;\Rightarrow\; \langle w^*, (x_1 - x_2) \rangle = 2$$
Eq. 2.65
then the margin is given as:
$$margin = ||c|| = \frac{\langle w, (x_1 - x_2) \rangle}{||w||} = \frac{2}{||w||}$$
Eq. 2.66
and the distance between any support vector and the decision surface is:
$$\gamma = \frac{margin}{2} = \frac{1}{||w||}$$
Eq. 2.67
The SVM finds the maximum margin by maximizing the quantity γ = 1/||w||, which is equivalent to minimizing ||w||2, formalizing a convex minimization problem:
$$\Phi(w) = \frac{1}{2} ||w||^2$$
Eq. 2.68
subject to the constraint functions:
$$g_i(w) = t_i(\langle w, x_i \rangle + b) \geq 1 \quad \text{for } i = 1, \dots, N$$
Eq. 2.69
Since this is a constrained optimization problem, the primal Lagrangian function can be used to find the optimal w* and b*:
$$L(w, b) = \frac{1}{2} ||w||^2 - \sum_{i=1}^{N} a_i \left[ t_i (\langle w, x_i \rangle + b) - 1 \right]$$
Eq. 2.70
where a is the vector of the Lagrange multipliers. Setting the gradients with respect to w and b to zero:
$$\nabla_w L = w - \sum_{i=1}^{N} a_i t_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} a_i t_i x_i$$
Eq. 2.71
$$\nabla_b L = -\sum_{i=1}^{N} a_i t_i = 0 \;\Rightarrow\; \sum_{i=1}^{N} a_i t_i = 0$$
Eq. 2.72
Therefore, by substituting Eq. 2.71 and 2.72 back into the primal objective function Eq. 2.70, the Wolfe dual problem is formalized as:
$$W(a) = \sum_{i=1}^{N} a_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j t_i t_j \langle x_i, x_j \rangle$$
Eq. 2.73
subject to the constraints:
$$\sum_{i=1}^{N} a_i t_i = 0 \quad \text{and} \quad a_i \geq 0 \text{ for } i = 1, \dots, N$$
Eq. 2.74
It is important that the dual form optimization function W(a) is quadratic and that the constraints are linear functions defining a convex region. This implies that any local minimum is also a global minimum. This is an example of a quadratic programming problem, and there exist many techniques offering solutions to such problems. Since all the training patterns are used during the "training phase" (the determination of a), efficient algorithms for solving quadratic programming problems should be chosen.
By applying quadratic programming to the problem, the optimal vector of Lagrange multipliers a, which maximizes the dual form optimization problem W(a), is derived. Since w is given as a linear combination weighted by the Lagrange multipliers ai (Eq. 2.71) for i = 1, …, N, the decision function of the SVM can now be rewritten as:
$$y = \mathrm{sign}\left(\sum_{i=1}^{N} a_i t_i \langle x_i, x \rangle + b\right)$$
Eq. 2.75
and thus any new data pattern can be classified by this function.
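As an illustration, the sketch below solves the dual problem of Eq. 2.73-2.74 on a small linearly separable toy dataset, using scipy's general-purpose constrained minimizer in place of a dedicated quadratic programming routine, and then recovers w (Eq. 2.71) and b (in the manner of Eq. 2.86, derived below). The dataset and tolerances are assumptions of the example:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 0.5, (10, 2)), rng.normal(2, 0.5, (10, 2))])
    t = np.hstack([-np.ones(10), np.ones(10)])

    G = (t[:, None] * t[None, :]) * (X @ X.T)           # t_i t_j <x_i, x_j>
    neg_dual = lambda a: -(a.sum() - 0.5 * a @ G @ a)   # minimize -W(a), Eq. 2.73
    cons = ({'type': 'eq', 'fun': lambda a: a @ t},)    # sum_i a_i t_i = 0, Eq. 2.74
    res = minimize(neg_dual, np.zeros(len(t)),
                   bounds=[(0.0, None)] * len(t),       # a_i >= 0, Eq. 2.74
                   constraints=cons)

    a = res.x
    sv = a > 1e-6                                       # support vectors: a_i > 0
    w = (a * t) @ X                                     # Eq. 2.71
    b = np.mean(t[sv] - X[sv] @ w)                      # averaged over support vectors
    print(np.all(np.sign(X @ w + b) == t))              # training set separated: True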
In addition, according to the Karush-Kuhn-Tucker (KKT) conditions for this kind of constrained optimization problem, the following hold:
$$a_i \geq 0 \quad \text{for } i = 1, \dots, N$$
Eq. 2.76
$$t_i y_i - 1 \geq 0 \quad \text{for } i = 1, \dots, N$$
Eq. 2.77
$$a_i (t_i y_i - 1) = 0 \quad \text{for } i = 1, \dots, N$$
Eq. 2.78
where yi is the result of the decision function for the training data pattern xi. By inference, for any training data point in the input space, either the corresponding ai = 0 or ti yi = 1. If the latter case applies, then the particular input data pattern xi is a support vector (Eq. 2.63 and 2.64). This leads to the conclusion that the first case (ai = 0) is valid only for the non-support vectors. Thus, the non-support vectors play no role in decision making for new datapoints. So, after the SVM is trained, only the support vectors are retained (those for which ai > 0), while all the rest are excluded from the decision making function, reducing the computational effort of the model.
Furthermore, having ti yi = 1 for all the support vectors allows the determination of the b parameter. Let S and Ns be the set of support vectors and their total number respectively. For a support vector xj it holds that:
$$y_j = \sum_{i \in S} a_i t_i \langle x_i, x_j \rangle + b \;\Rightarrow$$
Eq. 2.79
$$t_j y_j = t_j \left( \sum_{i \in S} a_i t_i \langle x_i, x_j \rangle + b \right) \;\Rightarrow$$
Eq. 2.80
$$t_j \left( \sum_{i \in S} a_i t_i \langle x_i, x_j \rangle + b \right) = 1 \;\Rightarrow$$
Eq. 2.81
Multiplying both sides by tj and using tj tj = 1:
$$t_j t_j \left( \sum_{i \in S} a_i t_i \langle x_i, x_j \rangle + b \right) = t_j \;\Rightarrow$$
Eq. 2.82
$$\sum_{i \in S} a_i t_i \langle x_i, x_j \rangle + b = t_j$$
Eq. 2.83
Summing over all support vectors:
$$\sum_{j \in S} \left[ \sum_{i \in S} a_i t_i \langle x_i, x_j \rangle + b \right] = \sum_{j \in S} t_j \;\Rightarrow$$
Eq. 2.84
$$\sum_{j \in S} \sum_{i \in S} a_i t_i \langle x_i, x_j \rangle + N_s b = \sum_{j \in S} t_j$$
Eq. 2.85
Then, averaging over all support vectors:
$$b = \frac{1}{N_s} \sum_{j \in S} \left[ t_j - \sum_{i \in S} a_i t_i \langle x_i, x_j \rangle \right]$$
Eq. 2.86
The SVM model, as presented until now, works only for linearly separable problems, but it can easily be extended to non-linear problems by solving the problem in a feature space defined by a kernel function. Hence the Wolfe dual optimization function becomes:
$$W(a) = \sum_{i=1}^{N} a_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j t_i t_j K(i, j)$$
Eq. 2.87
The decision function would use the corresponding kernel function to classify any new datapoint as:
$$y = \mathrm{sign}\left(\sum_{i=1}^{N} a_i t_i K(x_i, x) + \frac{1}{N_s} \sum_{j \in S} \left[ t_j - \sum_{i \in S} a_i t_i K(i, j) \right]\right)$$
Eq. 2.88
Thus, for a non-linear input space, a proper kernel should be chosen and the kernel matrix created based on the training data. Then all the steps described above are followed, with one variation from the classic SVM approach: every dot product between two input data vectors must be replaced by the correspondingly indexed element of the kernel matrix. After all, an element of the kernel matrix is the result of the dot product of the correspondingly indexed features. Once the training of the SVM is done, by solving the constrained optimization problem of Eq. 2.87 with quadratic programming, the Lagrange multipliers ai, i = 1, …, N, are optimized. At this point the kernel matrix can be deallocated, keeping only the indexed elements which correspond to the support vectors. Finally, any new data point can be classified with the decision function presented in Eq. 2.88 and decided whether it belongs to class +1 or −1.
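In practice this whole pipeline is available off the shelf; for instance, scikit-learn's SVC runs the quadratic programming step and retains only the support vectors. A short sketch on a circular (non-linearly separable) toy problem, with an RBF kernel and illustrative parameter values:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    t = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)   # circular class boundary

    clf = SVC(kernel='rbf', C=10.0).fit(X, t)   # kernelized dual problem of Eq. 2.87
    print(clf.n_support_)                       # only the support vectors are kept
    print(clf.score(X, t))                      # training accuracy close to 1.0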
So far the SVM model seems to respond to both linear and non-linear problems successfully. The success recipe is clear and understandable: apply statistical learning theory together with particular mathematical optimization methods, combine all these with a pinch of the kernel trick, and there it is, a powerful tool for binary classification. However, there are some aspects to be considered further. For example, how does this tool handle an outlier? Moving one step forward, can this model be adjusted to multi-class problems?
Figure 2.8: Outlier's effect on the localisation of the decision boundary during the training phase. The left plot shows the normal case, while the right one adds an outlier to the input space.
Datasets are mostly built by humans. There are often a large number of parameters to be filled in for each case (data pattern), measurements have to be made, and lastly each case must be assigned a label defining the class to which it belongs. It is also a fact that poor communication between collaborators, tiredness and many other factors increase the chances of human error. As a result, a pattern may be labeled with the wrong class, a measured parameter value may be altered or mixed up with another one, and so on. Since SVMs (like all other learning machines) learn from data, such incorrectly defined datasets drive the model to a large generalization error. An outlier is a datapoint which is isolated from the rest of the datapoints of its class in the input space. Although there is a small chance that it is a valid data point, most of the time it indicates an error in the dataset. The SVM as described above would be strongly influenced by an outlier and would try to adjust the separating hyperplane based on the outlier's position. Figure 2.8 presents two cases of the same dataset. In the first case, the dataset is clean of outliers; the SVM is trained on this data and the optimal separating hyperplane is found. In the second case there exists a single outlier. After the SVM is trained with the outlier included in the dataset, the optimal decision boundary is oriented and localized totally differently. If a new input pattern needs to be classified (black dot in Figure 2.8), in the first case it will be classified as a red class object (a correct classification), while in the second case, in which the hyperplane was greatly influenced by the outlier, it will be assigned to the green class, which is wrong. If this is the effect of one outlier, imagine how harmful a noisier dataset can be for a classification model. Essentially, what needs to be done is to eliminate the participation of the outliers in the training process of the SVM.
Recall that the margin (for SVM) is the minimum distance between the patterns of the two classes. The model is said to have a hard margin if, after training, no datapoint lies in the region of the input space defined by the margin. The effect of outliers may be decreased by allowing the model to have a soft margin instead. This means that a number of data points are allowed to remain in the margin space, or even in the opposite class's space (while still retaining their label). These data points should be penalized based on the distance of their position from the boundary of their class. Let ξi ≥ 0 for i = 1, …, N be the slack variables. For datapoints which lie in the interior area of their class or on its canonical hyperplane, the corresponding slack variables will be ξi = 0, whereas for the others ξi = |ti − yi|. There are two main approaches for introducing a soft margin to the SVM, named the L1 error norm and the L2 error norm, and both use these slack variables. Here only the L1 error norm will be presented.
L1 error norm
Having the slack variables, the classification constraints are replaced by:
$$g_i(w) = t_i(\langle w, x_i \rangle + b)$$
Eq. 2.89
$$g_i(w) \geq 1 - \xi_i$$
Eq. 2.90
$$\xi_i \geq 0 \quad \text{for } i = 1, \dots, N$$
Eq. 2.91
The correctly positioned data patterns are eventually assigned a slack variable ξi = 0. The rest are divided into two categories: data points lying within the space between the canonical hyperplane of their class and the decision boundary, which get 0 < ξi ≤ 1, and those which are on the wrong side of the decision boundary, whose slack variables are ξi > 1. The final goal is to maximize the margin while at the same time minimizing the slack variables, which means minimizing the distance of the outliers from their class area; this is why the slack variables can also be characterized as error measures.
Both of these goals can be combined into one function which must be minimized:
$$\Phi(w) = \frac{1}{2} ||w||^2 + C \sum_{i=1}^{N} \xi_i$$
Eq. 2.92
where the term C > 0 controls the trade-off between maximizing the margin and minimizing the misclassified patterns (errors). Hence, the primal Lagrangian optimization function becomes:
$$L(w, b, \xi) = \frac{1}{2} ||w||^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} a_i \left[ t_i (\langle w, x_i \rangle + b) - 1 + \xi_i \right] - \sum_{i=1}^{N} r_i \xi_i$$
Eq. 2.93
where ai ≥ 0 and ri ≥ 0 are the Lagrange multipliers. Setting the gradients with respect to w, b and ξi to zero:
$$\nabla_w L = w - \sum_{i=1}^{N} a_i t_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} a_i t_i x_i$$
Eq. 2.94
$$\nabla_b L = -\sum_{i=1}^{N} a_i t_i = 0 \;\Rightarrow\; \sum_{i=1}^{N} a_i t_i = 0$$
Eq. 2.95
$$\nabla_{\xi_i} L = C - a_i - r_i = 0 \;\Rightarrow\; a_i = C - r_i$$
Eq. 2.96
Since ai ≥ 0 and ri ≥ 0, the multiplier ai reaches its maximum value ai = C when ri = 0, and so:
$$0 \leq a_i \leq C$$
Eq. 2.97
Therefore, by substituting Eq. 2.94 and 2.95 back into the primal objective function Eq. 2.93, the Wolfe dual problem is:
$$W(a) = \sum_{i=1}^{N} a_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j t_i t_j \langle x_i, x_j \rangle$$
Eq. 2.98
subject to the constraints:
$$\sum_{i=1}^{N} a_i t_i = 0 \quad \text{and} \quad 0 \leq a_i \leq C$$
Eq. 2.99
Again, following exactly the same steps as previously, the optimal a is calculated through quadratic programming. The decision function remains the same as Eq. 2.75 (for a feature space problem; for a linear problem simply use the linear kernel, which defines K(xi, xj) = ⟨xi, xj⟩). As before, the non-support vectors will have ai = 0. The support vectors will have ai > 0 and furthermore they must satisfy the constraint defined in Eq. 2.90:
$$t_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i$$
Eq. 2.100
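In the dual solver sketched after Eq. 2.75, the L1 soft margin changes nothing except the bounds on the multipliers, which become the box constraint of Eq. 2.97/2.99 (C = 1.0 is an arbitrary illustrative choice):

    # replace the hard-margin bounds a_i >= 0 with the box 0 <= a_i <= C (Eq. 2.99)
    C = 1.0
    bounds = [(0.0, C)] * len(t)   # was [(0.0, None)] * len(t)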
The KKT conditions of the optimization problem in Eq. 2.100 are:
$$a_i \geq 0 \quad \text{for } i = 1, \dots, N$$
Eq. 2.101
$$r_i \geq 0 \quad \text{for } i = 1, \dots, N$$
Eq. 2.102
$$\xi_i \geq 0 \quad \text{for } i = 1, \dots, N$$
Eq. 2.103
$$r_i \xi_i = 0 \quad \text{for } i = 1, \dots, N$$
Eq. 2.104
$$t_i y_i - 1 + \xi_i \geq 0 \quad \text{for } i = 1, \dots, N$$
Eq. 2.105
$$a_i (t_i y_i - 1 + \xi_i) = 0 \quad \text{for } i = 1, \dots, N$$
Eq. 2.106
For a support vector which is positioned on the canonical hyperplane of its class, it holds that 0 < ai < C. Then the corresponding ri > 0 (Eq. 2.96), which implies that ξi = 0 (Eq. 2.104). If a support vector lies inside the margin, then ai = C, and either it is correctly classified, with ξi ≤ 1, or it is misclassified, with ξi > 1.
Figure 2.9: Same problem as in Fig. 2.8. The SVM penalizes the outlier with a large-valued ξi, and so the particular datapoint has no influence on the classifier's position.
2.7.2 - Multi-class problems
Until this point only binary classification problems were considered for SVM. In practice, however, many problems contain more than two classes of data points, and therefore the binary SVM can't offer a solution. Several approaches have been suggested for extending the SVM algorithm to multi-class problems. In this study, two of the most well-known heuristic approaches to multi-class classification will be discussed. The first approach, named one-versus-the-rest, demands the training of K classifiers, where K is the number of classes. Each classifier is trained using the binary SVM model, giving positive results for a particular class of data points Ck and negative results for all the other K−1 classes of data. Finally, the trained classifiers return their decisions for a novel data point x, and the new point is assigned to the class whose classifier returned the largest-valued decision y. However, although the training dataset is the same, each classifier learns over a totally different data setup in the input space. As a result, there is no guarantee that the results coming from each classifier's decision function are properly scaled so as to be comparable when choosing the largest value. Not only that, but there is a risk that an object might be assigned to more than one class at the same time. Added to this, even if the initial dataset is balanced, with an equal number of objects in each class, the one-versus-the-rest method makes it greatly imbalanced: each classifier is trained on one class of positive examples against K−1 classes of negative examples.
Figure 2.10: A three class problem. Following the one-versus-the-rest approach, three classifiers must be trained.
The second approach is the one-versus-one, and it comprises one of the simplest schemes for solving multi-class problems. This method creates all possible pairs of the dataset classes, and a classifier is trained over each pair of classes. Thus for K classes there will be K(K−1)/2 classifiers. For classifying a new data pattern, the decision process is best described by a directed acyclic graph (DAG), as shown in Figure 2.11 (for a three-class problem). At each node of the graph, the new input pattern is classified over two classes, and according to the class decision it moves to the next level of the graph. The leaves of the graph (last level) are the final class decisions of the model. For each new data point, the final path through the graph has length equal to K−1.
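A minimal Python sketch of the one-versus-one scheme is given below. For simplicity it resolves the pairwise decisions by majority vote rather than by explicitly walking the DAG of Figure 2.11; the dataset and the linear kernel are assumptions of the example:

    import numpy as np
    from itertools import combinations
    from sklearn.svm import SVC

    def one_vs_one_fit(X, y):
        # one binary SVM per pair of classes: K(K-1)/2 classifiers in total
        return {(a, b): SVC(kernel='linear').fit(X[(y == a) | (y == b)],
                                                 y[(y == a) | (y == b)])
                for a, b in combinations(np.unique(y), 2)}

    def one_vs_one_predict(models, x):
        votes = [m.predict(x.reshape(1, -1))[0] for m in models.values()]
        return max(set(votes), key=votes.count)   # majority vote over the pairs

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 0.4, (20, 2)) for m in ((0, 0), (3, 0), (0, 3))])
    y = np.repeat([0, 1, 2], 20)                  # three well-separated classes
    models = one_vs_one_fit(X, y)
    print(one_vs_one_predict(models, np.array([2.9, 0.1])))   # -> 1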
Figure 2.11: One-versus-one scheme for solving multi-class problems. In this example there are three classes of data, and so two pairwise comparisons take place for each novel input pattern to be classified.
It is obvious that the tree's size increases with the number of classes, and as a result this approach becomes expensive in terms of time and space for problems with a large number of classes. Overall, it appears that the SVM is a machine learning tool which is very well set up for binary classification. For a separable dataset, the SVM is guaranteed to identify the best classifier according to the statistical learning theory of margins. The introduction of the kernel trick enhances the power of the model, extending its application to non-linearly separable datasets without any computational cost incurred by the dimensionality of the input space. Although the SVM is so well set up for binary class problems, it has no such impact on multi-class problems, since the proposed heuristic approaches may exhibit inconsistent results or may be extremely computationally demanding. It is also important to mention that SVMs are also used for regression problems. SVM regression is an extension of linear regression, applying the same steps as for classification to solve the dual optimization problem for the Lagrange multipliers a.
Chapter 3 - Fuzzy Logic
"Fuzzy logic is a tool for embedding structured human knowledge into workable algorithms" (Kecman, 2001). The way that humans think is approximate rather than exact. As a result, expressions like "a little", "not so much", "too many", "so and so", etc., are used in all languages to express the fuzziness which characterizes human reasoning and thinking. Such expressions don't give precise information or quantities about what they refer to, but they still comprise an adequate description of it, giving the necessary information to "get the feeling". Real life concepts are not binary; rather, their description is graduated from absolutely false to absolutely true. Therefore, there are not just beautiful or ugly people, good or bad; the weather is not just absolutely sunny or absolutely rainy; the food is not simply tasteful or tasteless, etc. Most of the time there is an intermediate state for every concept in human understanding, which classic set theory is unable to handle. This observation led the scientist Lotfi A. Zadeh to the emergence of Fuzzy Logic theory, back in 1965. Fuzzy logic (FL) is about encoding human reasoning by providing a mathematical model representation of linguistic knowledge. FL achieves that by introducing fuzzy sets of classes which are not independent or disjoint and whose boundaries are fuzzy and indistinct, diverging from classical set theory. In contrast to crisp sets, there are no strict, precise or exact criteria defining whether an object belongs to a fuzzy set or not. Consequently, an object belongs to a class (represented by a fuzzy set) to some grade.
In notation, a classical set can be defined by:
1) A list of elements
Scolours = {red, orange, green, blue, yellow, purple}
2) Specifying some property
SA = {x ∈ Scolours | x == red}
SB = {x ∈ Scolours | x == blue}
SC = {x ∈ Scolours | x == yellow}
3) A membership function
$$\mu_{S_A}(x) = \begin{cases} 1, & \text{if } x \in S_A \\ 0, & \text{if } x \notin S_A \end{cases}$$
The membership function of a classical set satisfies the principle of bivalence, since it returns only true (1) or false (0), excluding any other truth value. Contrarily, fuzzy membership functions may return real values other than zero and one, escaping the bivalent constraint. Hence, if an object x ∈ X, then essentially the membership function d = μS(x) is a mapping between X and the interval [0, 1]. Membership functions of fuzzy sets assign an object to a fuzzy set with a truth value in [0, 1] describing the degree of belonging. Zero degree totally excludes the object from the set, and full degree (one) means that the object is outright a member of the specific set.
Considering the example given in Figure 3.1, there are three sets, each describing objects of a particular colour. More specifically, the classes of red, blue and yellow comprise the universe of discourse. The task is to assign an orange, a purple and a green object to a colour class. Using classical set theory (the representation on the left), the three new objects are rejected, as their membership function returns zero for all three fixed colour sets. However, in reality, orange is produced by mixing red and yellow, green by blue and yellow, and purple by red and blue. So orange contains a little red and a little yellow, and so on. If the colour classes are modelled with fuzzy sets, then orange, green and purple belong to a combination of the colour sets, to a specific degree, given by the membership function of each colour set. Note that none of the three membership functions will ever return 1 for any of the three new coloured objects, since they are not totally red, nor totally blue, nor yellow.
Figure 3.1: A, B and C represent three colour sets, the red, blue and yellow classes respectively. Using crisp set logic, the orange, purple and green objects cannot become members of any of these sets, whereas using fuzzy logic all three colours can become members of one or more sets to a degree.
As in crisp sets, the universe of discourse is a set X. This set may contain either discrete or continuous valued elements which participate in the modelled problem/system. In Fuzzy Logic a linguistic variable, which is just a label, is also used to describe the general domain of the members included in the set X (e.g. age, weight, distance, etc). The linguistic variable is then quantified into a collection of linguistic values. A linguistic value is a label describing a state/situation/characteristic of the concept defined by the linguistic variable (e.g. young, heavy, long, etc). Each linguistic value is defined by a fuzzy set. A fuzzy set is a collection of ordered pairs whose first element is an object of the universe of discourse and whose second element is the object's degree of belonging to the particular fuzzy set, given by the predefined membership function. Therefore a basic notation of a fuzzy set is:
$$S = \{(x, \mu_S(x)) \mid \mu_S(x) \text{ definition}\}$$
Eq. 3.1
where x is an object of the universe of discourse and μS(x) is the membership function which characterizes the set S. An example of the aforementioned concepts is presented on the right side of Figure 3.2 (weight example).
As stated already, fuzzy logic is used to express human reasoning in a formal way. In real life, human reasoning reflects the content of a specific problem/discussion/model, etc. However, vague terms are often used to describe an aspect of any concept concerning either the physical or the spiritual world. That is why each statement should be interpreted in the frame of the problem/system which is discussed. For example, consider a man P whose height is 1.80 meters. The statement "P is short" sounds incorrect, since an ordinary man who is 1.80 meters tall is generally considered tall. However, if the discussion is about professional basketball players, then the exact same statement sounds reasonable and valid. Therefore each statement of human reasoning strongly depends on the context of a discussion. Consequently, linguistic values are innately context dependent. Similarly, they strongly depend on the system and on whom or what the system refers to; they depend on the system's character. Even using the same universe of discourse and the same linguistic variable, the linguistic values might take a totally different scaling if the main subject of the system is different.
Furthermore, Fuzzy Logic allows an element to belong to two or even more fuzzy sets (to different degrees), whereas in classical set theory an element can be a member of only one set. Hence, an object might be a little more a member of fuzzy set A and at the same time a little less a member of fuzzy set B. Going back to the example given in Figure 3.1, if the colour of a new object is deep purple, then it is a little more blue and a little less red, but it is still considered a member of both colour sets!
Assuredly, the membership function is the identity characteristic of each fuzzy set. Each linguistic value of a linguistic variable is assigned a membership function. Human reasoning, as stated above, is subjective, and thus different membership functions may be produced by different persons for the same linguistic value of a common linguistic variable.
Figure 3.2: The left part describes the relationship between some concepts of Fuzzy Logic. The right part of the figure builds a simple example based on the theory of the left side.
Hence the criteria for choosing a function to be a membership function are problem dependent. The shape of each function changes along with the criteria set for a problem, and therefore there is a big range of functions which can be used to define a linguistic value. However, it is difficult to build your own membership function from scratch, since there is a big risk of modelling arbitrary functions with arbitrary shapes. That is why, most of the time, some specific types of functions which have predominated in the bibliography of fuzzy modelling are commonly chosen and adjusted to new datasets. Some classical examples of such membership functions are the trapezoid, triangular, Gaussian bell and singleton functions.
As shown in Figure 3.3, the trapezoid, triangular and Gaussian bell functions can be used to model membership functions of continuous valued elements. Contrarily, the singleton function is more about modelling discrete data, where each case must be individually assigned a certain membership grade (hence the parameter a of the singleton function represents an element of the data).
Figure 3.3: Trapezoid, triangular, Gaussian bell and singleton membership functions.
Building a trapezoid function requires the definition of four parameters [a, b, c, d]. The elements of the universe of discourse which fall in the intervals (a, b) and (c, d) are partial members of the defined fuzzy set, while the elements included in the range [b, c] are full members of the set. The triangular function can be regarded as a special case of the trapezoid function, taking parameters [a, b, b, c]. The substantial difference between the two is that the trapezoid allows a range of elements to be assigned full membership grade, while the triangular allows only one. The Gaussian bell function differs from the other two continuous functions in that the parameters needed to define it are a pair [σ, m]. The first parameter is the standard deviation, describing the width of the function. The function is symmetrical about the second parameter m, which is its mean, and thus the element equal to the mean is the only one assigned a membership grade of one. Another difference between the Gaussian and the other functions is that the derivative of the former exists at each point of its domain, while this does not apply for the other two (at the elements a, b, c or d). Hence for applications which demand the use of mathematical optimization algorithms (e.g. gradient descent) over fuzzy variables, the Gaussian is preferable to the other membership functions. (A code sketch of all four functions follows.)
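The four classical membership functions can be written compactly in Python as below (a sketch assuming a < b ≤ c < d for the trapezoid and a < b < c for the triangular function):

    import numpy as np

    def trapezoid(x, a, b, c, d):
        # partial membership on (a,b) and (c,d); full membership on [b,c]
        return np.clip(np.minimum((x - a) / (b - a), (d - x) / (d - c)), 0.0, 1.0)

    def triangular(x, a, b, c):
        # trapezoid with parameters [a, b, b, c]: full membership only at x = b
        return np.clip(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0, 1.0)

    def gaussian_bell(x, sigma, m):
        # symmetric about the mean m, which is the only point of full membership
        return np.exp(-(x - m)**2 / (2.0 * sigma**2))

    def singleton(x, a):
        # full membership only for the single element a
        return np.where(np.isclose(x, a), 1.0, 0.0)

    print(trapezoid(np.array([1.0, 2.5, 4.0]), 1, 2, 3, 4))   # [0. 1. 0.]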
Nevertheless, there are some common concepts which are used to describe all the classical membership functions, presented in Figure 3.4. The function's support is the set which contains all those elements which are assigned non-zero membership degrees, while the members of the core set have a degree of membership equal to one. The variable a denotes a specific membership degree. All the elements which have a membership degree bigger than or equal to a are included in the a-cut set. Therefore, when a is equal to zero, the resulting a-cut set is the whole domain of the specific membership function. Finally, the variable height is the maximum membership degree over the specific function's domain. In Figure 3.4 these basic concepts are presented for a trapezoid function, but they can easily be adjusted to any other type of classical fuzzy membership function.
Figure 3.4: The concepts of support, core, a-cut and height describing a fuzzy membership function.
Additionally, a fuzzy set A is considered convex if and only if the inequality in Equation 3.2 is satisfied for every pair xi and xj, where xi, xj ∈ A, and every λ ∈ [0, 1]:
$$\mu_A(\lambda x_i + (1 - \lambda) x_j) \geq \min\{\mu_A(x_i), \mu_A(x_j)\}$$
Eq. 3.2
Furthermore, the cardinality of a fuzzy set A is defined as:
$$|A| = \int \mu_A(x)\, dx$$
Eq. 3.3
If the support set of the fuzzy set is discrete, including N elements, then the cardinality can be calculated as:
$$|A| = \sum_{i=1}^{N} \mu_A(x_i)$$
Eq. 3.4
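Both forms of the cardinality are straightforward to compute numerically; the triangular set used in the continuous case is an arbitrary illustrative choice:

    import numpy as np

    # discrete case, Eq. 3.4: the cardinality is the sum of the membership degrees
    mu_A = np.array([0.0, 0.5, 1.0, 1.0, 0.5])
    print(mu_A.sum())                 # |A| = 3.0

    # continuous case, Eq. 3.3: integrate numerically a triangular set on [0, 4]
    x = np.linspace(0.0, 4.0, 10001)
    mu = np.clip(np.minimum(x / 2.0, (4.0 - x) / 2.0), 0.0, 1.0)
    print(mu.sum() * (x[1] - x[0]))   # ~ 2.0, the area under the triangle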
3.1 - Basic fuzzy set operators
The basic classical set operations of conjunction, disjunction and negation are defined for fuzzy sets as well. Actually, there are multiple operators which model the conjunction and disjunction of fuzzy sets. The simplest, though, are the min-max operators, as well as the algebraic product and sum, which are presented in Table 3.1 and in Figures 3.5 and 3.6 respectively. Any fuzzy set operation results in a new fuzzy set. So, since a fuzzy set is defined by its membership function, the operations are applied on membership functions, and the result is a new membership function on the universe of discourse.
            | Min/Max Operators       | Algebraic Product/Sum Operators
Conjunction | μA∧B = min{μA, μB}      | μA∧B = μA · μB
Disjunction | μA∨B = max{μA, μB}      | μA∨B = μA + μB − μA · μB
Negation    | μAc = 1 − μA            | μAc = 1 − μA

Table 3.1: Two different operator families for conjunction and disjunction on two fuzzy sets A and B. The negation operation has the same form under both.
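Since the operators of Table 3.1 act element-wise on membership degrees, they reduce to one line each in Python (the two membership vectors are arbitrary illustrative values):

    import numpy as np

    mu_A = np.array([0.2, 0.7, 1.0, 0.4])
    mu_B = np.array([0.5, 0.3, 0.6, 0.9])

    print(np.minimum(mu_A, mu_B))        # conjunction, min T-norm
    print(np.maximum(mu_A, mu_B))        # disjunction, max S-norm
    print(mu_A * mu_B)                   # conjunction, algebraic product
    print(mu_A + mu_B - mu_A * mu_B)     # disjunction, algebraic sum
    print(1.0 - mu_A)                    # negation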
There is a plethora of other operators which can be used to model conjunction and disjunction on fuzzy sets. In Fuzzy Logic, conjunction operators are labelled T-norms and disjunction operators S-norms. Each one of them gives a new membership function with a different shape, but still reflecting the concepts of intersection and union of two fuzzy sets. For example, the algebraic product gives a smoother conjunction membership function (Figure 3.6) than the min operator (Figure 3.5 (b)). The choice of an operation norm depends on the problem and generally on the model that the user wishes to produce.
Finally, there are two substantial differences between the crisp set logical operators and the fuzzy set ones. Starting with crisp logic, it is well known that the intersection between a set and its complement results in an empty set. Contrariwise, the union of a set and its complement gives the universe of discourse. The corresponding operators on fuzzy sets will result in a fuzzy set other than the empty set or the universe of discourse respectively. Hence for a fuzzy set A:
$$\mu_{A \wedge A^c} \neq 0$$
Eq. 3.5
$$\mu_{A \vee A^c} \neq 1$$
Eq. 3.6
These two interesting side effects are presented in Figure 3.7.
Figure 3.5: (a) The membership functions of two fuzzy sets named A and B (b) The membership function (m.
f.) of the conjunction of A and B using min norm (c) The m. f. of the disjunction of A and B using max norm
(d) the m. f. describing the negation of A (complement of A)
Figure 3.6: The membership functions of the conjunction and the disjunction of A and B using algebraic
product and algebraic sum respectively.
Figure 3.7: (a) The m. f. of a fuzzy set A and its complement (b) Their conjunction m. f. (c) Their disjunction m. f.
3.2 - Fuzzy Relations
Consider two crisp sets X and Y. Applying a relation over these sets provides a new set R containing ordered pairs (x, y), where x ∈ X and y ∈ Y. Since the Cartesian product of two crisp sets, denoted X × Y, comprises a crisp set including all ordered pairs between X and Y, such that:
$$X \times Y = \{(x, y) \mid x \in X \text{ and } y \in Y\}$$
Eq. 3.7
it follows that the relation R comprises a subset of the Cartesian product X × Y. This subset might be chosen totally randomly, or it might be chosen under a specified condition over the two variables (of the ordered pairs). In fact, any subset of the Cartesian product is considered a relation.
Any relation defined over a Cartesian product (or product set) which is built on discrete sets can be expressed as a relational matrix. Each element of the matrix represents the membership degree of the particular ordered pair (row, column) in the relation, which is why it is also called a membership array. The condition defining a crisp set is very precise: the boundaries between the ordered pairs which are members of the relation and those which are not are very sharp. Therefore, the relational matrix of such a relation will contain only ones and zeros. The ordered pairs which satisfy the condition are assigned one, and the rest zero.
However, there are some cases where the condition of a relation is not so clear and precise, for example "find the objects which have approximately equal size". The condition of the relation, "approximately equal size", does not define accurately how the word "approximately" should be interpreted. As a result, the sizes of some objects will be more equal than others; some ordered pairs will belong to the relation to a bigger or smaller degree than others. Hence this is a fuzzy relation. As with fuzzy sets, the membership function of a fuzzy relation varies in the interval [0, 1], as do the values of the fuzzy relational matrix, since it represents the membership degrees of the ordered pairs. In Figure 3.8, two examples of relations are presented: a crisp relation (x is equal to y) and a fuzzy relation (x is approximately equal to y).
Figure 3.8: A crisp relation (on the left) and a fuzzy relation (on the right) over two sets X and Y.
When the universes of discourse X and Y are discrete, the membership function characterizing the set of a relation, μR(x, y), defines discrete points in a three dimensional space (X, Y, μR(x, y)). On the other side, if the sets which participate in the relation are continuous, then the relational membership function defines a surface over the Cartesian product X × Y. The relational matrix is difficult to define over continuous universes, and therefore discretization of their domains may be applied by choosing an appropriate discretization step.
The universe of discourse can also be described by a linguistic variable. As stated before, a linguistic variable is assigned a set of specific linguistic values. The relation can then be defined over different linguistic variables, and more particularly over the ordered pairs of their linguistic values. Hence the relation describes the way that these variables might be associated. In Figure 3.9 the universe of discourse is characterized by the linguistic variable "profession", which takes human professions as linguistic values. The set X is defined over the linguistic values basketball player and jockey. Similarly, the second set Y is defined over the linguistic values short, medium and tall of the second linguistic variable (height). The relation describes the association between a human's profession and a human's height.
Figure 3.9: Crisp and fuzzy relations over linguistic variables.
This kind of relation comprises a very helpful tool for modelling IF-THEN rules. For example, consider the resulting crisp relational matrix as presented in Figure 3.9 (left part). In this case, where the universes of discourse are represented by linguistic variables, the relational matrix may be expressed as a model of IF-THEN rules:
R1: IF (a human's profession is) basketball player, THEN (he is) tall
R2: IF (a human's profession is) jockey, THEN (he is) short
Furthermore, the crisp relational matrix can be generalized to a fuzzy matrix (right part of Figure 3.9), giving some kind of freedom to the association between a human's profession and his height. For example, sometimes it is desirable for a basketball team to have one not so tall player (medium). Thus, while it is certain that most basketball players are tall, there is a smaller possibility that one of them will be medium.
Until this point, fuzzy relations over discrete, continuous and linguistic variables have been considered. The last part concerns fuzzy relations over fuzzy sets. A fuzzy relation over two fuzzy sets analyzes the meaning of linguistic expressions in a mathematical direction. Take for example the expression "Today we have a humid cold day". Assuredly, the phrase "humid cold day" actually implies that "the day is humid AND cold". The latter form of the phrase is given by the intersection between the fuzzy sets "humid" and "cold". Therefore, the meaning of this expression is given by applying a T-norm operator (e.g. minimum) to the fuzzy sets which describe two different linguistic variables, the relative humidity and the temperature. The linguistic values for both linguistic variables are low, medium and high. Therefore, a humid day is assigned to high relative humidity and a cold day means low temperature. Their corresponding membership functions are graphically presented in Figure 3.10. Note that the membership functions were defined based on the Cyprus temperatures and humidity percentages.
Figure 3.10: The membership functions of the fuzzy sets “cold temperature” and “humid weather”
Having defined the membership functions of the linguistic variables participating in the expression, the membership function of the relation which applies over them must also be computed. Since the expression refers to a conjunction over the linguistic values "high humidity" and "low temperature", any T-norm may be chosen. Applying the minimum norm, the relational membership function is:
$$\mu_R(relative\_humidity, temperature) = \min\{\mu_{high}(relative\_humidity), \mu_{low}(temperature)\}$$
Eq. 3.8
The two fuzzy sets of humidity and temperature are defined by their membership functions, whose domains are continuous. Therefore the relational matrix is difficult to define unless discretization is applied. In Figure 3.10 both membership functions are discretized with a certain step, so that:
high_humidity = {60, 70, 80, 90, 100}
low_temperature = {0, 5, 10, 15, 20}
and their corresponding membership degrees are:
$$\mu_{high} = \begin{bmatrix} 0 \\ 0.5 \\ 1 \\ 1 \\ 1 \end{bmatrix}, \qquad \mu_{low} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 0.5 \\ 0 \end{bmatrix}$$
The form of the resulting relational matrix is given by:
$$R = \mu_{high} \times \mu_{low}^T = \begin{bmatrix} 0 \\ 0.5 \\ 1 \\ 1 \\ 1 \end{bmatrix} \times \begin{bmatrix} 1 & 1 & 1 & 0.5 & 0 \end{bmatrix}$$
Eq. 3.9
By applying the min operator, the final relational matrix is:

R   |  0  |  5  | 10  | 15  | 20
60  |  0  |  0  |  0  |  0  |  0
70  | 0.5 | 0.5 | 0.5 | 0.5 |  0
80  |  1  |  1  |  1  | 0.5 |  0
90  |  1  |  1  |  1  | 0.5 |  0
100 |  1  |  1  |  1  | 0.5 |  0
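The relational matrix above can be reproduced in a few lines of Python, applying Eq. 3.8 to every ordered pair of the discretized domains:

    import numpy as np

    mu_high = np.array([0.0, 0.5, 1.0, 1.0, 1.0])   # humidity 60, 70, ..., 100
    mu_low = np.array([1.0, 1.0, 1.0, 0.5, 0.0])    # temperature 0, 5, ..., 20

    # Eq. 3.8 applied pairwise: R[i, j] = min(mu_high[i], mu_low[j])
    R = np.minimum.outer(mu_high, mu_low)
    print(R)                                        # matches the matrix above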
A fuzzy relation over fuzzy sets is a new set containing ordered pairs of the elements included in the fuzzy sets together with their degree of belonging to the specific relation. Therefore, the elements of the relational matrix define a surface over the Cartesian product of the fuzzy sets (for the above example, humidity × temperature), representing the resulting relational membership function.
As shown in the preceding examples, a fuzzy relation can be applied on sets which are defined on different universes. Likewise, two fuzzy relations which are subsets of different product spaces can be combined through composition. Since fuzzy relations comprise fuzzy sets, composition may be regarded as a new fuzzy relation associating these fuzzy sets (relations). It follows that the composition of fuzzy relations introduces a new fuzzy set. Let:
$$R_1 \subseteq X \times Y \text{ and } R_2 \subseteq Y \times Z$$
Eq. 3.10
Then the fuzzy set produced by the composition of R1 and R2 is:
$$T \subseteq X \times Z$$
Eq. 3.11
There is a variety of compositional operators which can be applied on fuzzy relations. MAX-MIN and MAX-PRODUCT are two common composition operators, where the MAX-MIN operator is defined as:
$$T = R_1 \circ R_2$$
$$T(x, z) = \left\{ \left[ (x, z), \max_y \{\min\{\mu_{R_1}(x, y), \mu_{R_2}(y, z)\}\} \right] \mid x \in X, y \in Y, z \in Z \right\}$$
Eq. 3.12
Accordingly, the membership function of the fuzzy set T is:
$$\mu_{R_1 \circ R_2}(x, z) = \left\{ \max_y \{\min\{\mu_{R_1}(x, y), \mu_{R_2}(y, z)\}\} \mid x \in X, y \in Y, z \in Z \right\}$$
Eq. 3.13
In the following composition example, the fuzzy relation presented in Figure 3.9 is used. The relation R1 describes the interconnection between the subset of professions {basketball player, jockey} and the fuzzy set describing the height scaling. Another fuzzy relation, R2, is defined between the fuzzy sets of height and weight. The linguistic values for height are {short, medium, tall}, while for weight they are {low, medium, high}.
Figure 3.10: The composition of two relations R1 and R2 leads to the creation of a fuzzy set T. In this figure
their relational matrices are presented.
The composition operator MAX-MIN is applied to these relations, and as a result the new fuzzy set T is produced. For example, the membership degree of the ordered pair (basketball player, low_weight) in the new fuzzy set T is given by:
$$\mu_T(basketball\ player, low) = \max\{\min\{0, 1\}, \min\{0.1, 0.5\}, \min\{1, 0\}\} = \max\{0, 0.1, 0\} = 0.1$$
Eq. 3.14
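The full composition can be carried out numerically. In the sketch below, the basketball row of R1 and the low-weight column of R2 are the values used in Eq. 3.14; the remaining entries of both matrices are assumptions made only to complete the illustration:

    import numpy as np

    R1 = np.array([[0.0, 0.1, 1.0],    # basketball player over (short, medium, tall)
                   [1.0, 0.1, 0.0]])   # jockey (assumed values)
    R2 = np.array([[1.0, 0.5, 0.0],    # short over (low, medium, high) weight
                   [0.5, 1.0, 0.5],    # medium (assumed values)
                   [0.0, 0.5, 1.0]])   # tall (assumed values)

    # MAX-MIN composition, Eq. 3.13: T(x, z) = max_y min(R1(x, y), R2(y, z))
    T = np.max(np.minimum(R1[:, :, None], R2[None, :, :]), axis=1)
    print(T[0, 0])   # 0.1, as computed in Eq. 3.14 for (basketball player, low)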
The final form of the relational matrix describing the fuzzy set T is presented in Figure 3.10. The relational matrix of the compositional fuzzy set T can be linguistically interpreted using IF-THEN rules as:
R1: IF (a human's profession is) basketball player, THEN (he has) high weight, he possibly has medium weight, and he is unlikely to have low weight.
R2: IF (a human's profession is) jockey, THEN (he has) low weight, he is less likely to have medium weight, and unlikely to have high weight.
Composition allows the transition from one universe to a different universe through another universe of discourse, which is associated with the former two by two relations. Thus, it is clear that the composition operation on fuzzy relations allows the formulation of fuzzy implications.
3.3 - Imprecise Reasoning
In classical propositional calculus, inference is achieved by using two basic inference rules, modus ponens and modus tollens. Given a specific fact A and the implication "A implies B", modus ponens can be expressed in argument form as:
Premise 1: A
Premise 2: A → B
Conclusion: B
or
$$A \wedge (A \to B) = B$$
Eq. 3.15
The implication rule 𝐴 → 𝐵 can be interpreted as an IF-THEN rule: "If x is A then y is B". Therefore B is derived if and only if the variable x has the exact value of A. If x is only approximately A, or close to A, the implication does not apply and B cannot be inferred. Fuzzy logic, on the other hand, is flexible, allowing a variable x whose value is close to A to be equal to A to some degree. In other words, apart from the actual A, other objects can be regarded as A (to some membership grade), but classical modus ponens excludes them from the inferential process, treating them as totally non-A objects. Instead, the generalized modus ponens should be used, which allows the inference to take place even when the first premise is only partially true. In that case the conclusion will be partially true as well: if x is A to a degree, then y will be B to a degree. The premises leading to the conclusion are not sharply bounded, so the conclusion is not precisely defined; this is why the reasoning is considered imprecise.
Hence the implication operator must be generalised over fuzzy variables. In classical logic the implication rule "IF A THEN B", denoted by 𝐴 → 𝐵, is set-theoretically equivalent to 𝐴̅ ⋁ 𝐵 and is given by the truth table shown in Table 3.2.
A    B    𝐴 → 𝐵
T    T      T
T    F      F
F    T      T
F    F      T
Table 3.2
The only case where the implication is false is when a false consequent follows a true antecedent. This is obvious, since true premises cannot imply a false conclusion. However, classical implication needs some further explanation for the cases where the antecedent proposition is false. The paradox in these cases is that the implication is true independently of the conclusion B: the implication is valid for all cases where the hypothesis (antecedent) is false. Humans, however, understand linguistic implication differently. They mostly use it to indicate causality between the condition and the conclusion: the antecedent proposition is the cause, and the consequent is the effect. In other words, the antecedent is related to the conclusion, whereas in classical logic relatedness between the two is not required for a true implication. Fuzzy logic was inspired in the first place by the way humans think and reason. Thus fuzzy implication is defined to serve this cause-effect relation between the antecedent and the consequent propositions.
Given that in crisp logic the implication is equivalent to 𝐴̅⋁𝐵 and that fuzzy negation and disjunction are
given by 𝐴̅ = 1 − 𝜇𝛢 (𝑥) and 𝐴⋁𝐵 = max{𝜇𝛢 (𝑥), 𝜇𝐵 (𝑦)} respectively, it is derived that the fuzzy
implication would be 𝐴 → 𝐵 = max{1 − 𝜇𝛢 (𝑥), 𝜇𝐵 (𝑦)}.
However, this definition of fuzzy implication does not reflect human reasoning, since if the "cause" (the antecedent) is not fulfilled, with 𝜇𝛢(𝑥) = 0, the implication is still true, allowing B to be inferred from A even though the premise A is not true. To describe human reasoning adequately, the effect should only follow the true presence of the cause: no cause implies no effect. Therefore if 𝜇𝛢(𝑥) = 0 then 𝜇𝐵(𝑦) = 0. Fuzzy logic transfers human implication to fuzzy implication by restricting the truth value of the consequent not to be larger than the truth value of the premise. This can be achieved by using fuzzy implication operators like minimum (Mamdani, 1977) and product (Larsen, 1980), which are defined in Equations 3.16 and 3.17 respectively.
𝜇𝐴→𝐵 (𝑥, 𝑦) = min{𝜇𝐴 (𝑥), 𝜇𝐵 (𝑦)}
Eq. 3.16
𝜇𝐴→𝐵 (𝑥, 𝑦) = 𝜇𝐴 (𝑥) ∙ 𝜇𝐵 (𝑦)
Eq. 3.17
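As a quick illustration, the following Python sketch builds an implication relation from two discretized membership vectors using both operators; the membership vectors are arbitrary examples (assumptions), not values taken from the text.

def mamdani(mu_a, mu_b):
    # Eq. 3.16: mu_{A->B}(x, y) = min{mu_A(x), mu_B(y)}
    return [[min(a, b) for b in mu_b] for a in mu_a]

def larsen(mu_a, mu_b):
    # Eq. 3.17: mu_{A->B}(x, y) = mu_A(x) * mu_B(y)
    return [[a * b for b in mu_b] for a in mu_a]

mu_a = [0.0, 0.5, 1.0]   # example membership degrees of A (assumed)
mu_b = [1.0, 0.5, 0.0]   # example membership degrees of B (assumed)
print(mamdani(mu_a, mu_b))   # [[0, 0, 0], [0.5, 0.5, 0], [1, 0.5, 0]]
print(larsen(mu_a, mu_b))    # [[0, 0, 0], [0.5, 0.25, 0], [1, 0.5, 0]]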
3.4 - Generalized Modus Ponens
Generalized Modus Ponens (GMP) allows the transformation from one fuzzy set to another through fuzzy implication. Therefore, the fuzzy relation R representing the implication between two fuzzy sets A and B must first be defined. Having the relation R along with a value describing the state of A (e.g. low, tall, or a crisp number taken from the universe of discourse), GMP is used to infer the state of B (which again can be either fuzzy or crisp). Fuzzy rules (recall that implication is a fuzzy rule) are built on fuzzy sets. So if the premise A is given in crisp form, its value must first be fuzzified (fuzzification step). Similarly, if the conclusion B is desired to be crisp, it must be defuzzified (defuzzification step).
As aforementioned, an implication relation is defined over two universes, so 𝑅𝐴→𝐵 ⊆ 𝑋 × 𝑌, where the two fuzzy sets A and B are defined over the universes X and Y respectively. Then, given a particular element x belonging to the fuzzy set A, or the fuzzy set itself, the fuzzy value of the "implied" y ∈ B can be produced as:
𝑦 = 𝑥 ∘ 𝑅𝐴→𝐵
Eq. 3.18
which is the composition between the fuzzy value of x and the implication relation between A and B. The steps followed to accomplish GMP inference are presented below, along with an example of a fuzzy system which is gradually built from the preprocessing step to inference. The example fuzzy system adjusts the motor speed of an air conditioner to the room temperature.
Steps of building a fuzzy system:
Step 1: Identify and assign input/output linguistic variables.
Two fuzzy variables compose the "air condition" problem, namely temperature, which is the input variable, and air condition motor speed (ac speed), which is the output variable. The universe of the variable temperature is X, defined as the Celsius degrees between 0 and 45, while for ac speed it is Y, the interval [0, 100] representing the percentage of the rpm (revolutions per minute) at a given time instance relative to the maximum rpm the specific motor can reach.
Step 2: Describe each linguistic variable with a linguistic value (fuzzy set).
The variable temperature is quantified into the linguistic values cold, normal and hot, while ac speed is quantified into slow, medium and fast. Each linguistic value defines a fuzzy set, and the corresponding membership functions are shown in Figure 3.11. Note that each linguistic value must be defined over the universe of the linguistic variable. Therefore the membership functions of temperature and ac speed describe the fuzziness of the universes X and Y respectively.
Figure 3.11: The membership functions of the linguistic variables temperature and air condition motor speed.
Step 3: Create the rule base. Each rule should assign the input variables to the output variable using
implication relations.
R1 (𝑅𝑐𝑜𝑙𝑑→𝑠𝑙𝑜𝑤 ): If temperature is cold then ac speed is slow.
R2 (𝑅𝑛𝑜𝑟𝑚𝑎𝑙→𝑚𝑒𝑑𝑖𝑢𝑚 ): If temperature is normal then ac speed is medium.
R3 (𝑅ℎ𝑜𝑡→𝑓𝑎𝑠𝑡 ): If temperature is hot then ac speed is fast.
It is important to mention that during the inference step all rules are enabled. They fire in parallel, and therefore two or more rules can apply inference on the system at the same time. The rule base (RB) comprises a union of rules such that 𝑅𝐵 = 𝑅1 ∨ 𝑅2 ∨ … ∨ 𝑅𝑁, where N is the total number of rules.
Additionally, in most cases there is more than one input variable. For example, besides temperature, humidity could also be considered an input variable for the air condition system. Each linguistic variable is described by several fuzzy sets, and the rule base should assign every possible combination of input fuzzy linguistic values to the output variable. If Ni is the number of fuzzy sets describing the linguistic variable xi and there are M linguistic variables, then the total number of rules that should be coded in the rule base is 𝑁 = 𝑁1 × 𝑁2 × … × 𝑁𝑀.
Step 4: Introduce the states of the input variables to the system.
The only input variable in the air condition problem is the temperature. Temperature is measured in Celsius
degrees but it is also quantified into linguistic values. So the state of the temperature could be either a Celsius
degree drawn directly from the universe of discourse X or a linguistic value picked from cold, normal and
hot.
• If the input variable x is a fuzzy value:
Then x ∈ {cold, normal, hot}. Each of these defines a fuzzy set, and therefore there is no need for further fuzzification. If the relational matrix of implication must be defined, then the universe of discourse must be discretized. In Figure 3.11 the discretization step for temperature is 5; after discretization the discrete elements of the chosen fuzzy set are known:
X = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]
Thus the input variable will be assigned the membership degrees to which the discretized elements belong to the chosen fuzzy set. Let x = hot; then the input variable would be:
𝜇ℎ𝑜𝑡 = [0, 0, 0, 0, 0, 0, 0, 0.5, 1, 1]        Eq. 3.19
• Else if it is assigned a crisp value x′:
This means it is a specific Celsius degree in the interval 0 ≤ 𝑥 ≤ 45. Since it is a crisp value it must be fuzzified, i.e. expressed as a fuzzy set. Introducing a crisp value states that the temperature is exactly x′ Celsius degrees, which means it is fully true that the temperature is x′ and nothing else. Hence the state of the input variable can be fuzzified by a singleton membership function, which returns 1 (true) for x′ and 0 (false) for all other elements of the universe X:

𝜇𝑥′(𝑥) = { 1, if 𝑥 = 𝑥′
           0, otherwise        Eq. 3.20

Let x = 35℃; then 𝜇35 = [0 0 0 0 0 0 0 1 0 0]        Eq. 3.21

For the following steps it is assumed that x = 35℃, with 𝜇𝑥′ = 𝜇35 = [0 0 0 0 0 0 0 1 0 0] as the input variable of the system.
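The fuzzification of Step 4 is straightforward to code. The Python sketch below reproduces the singleton fuzzifier of Eq. 3.20 over the discretized universe X; the function name is of course only illustrative.

def singleton(universe, x_prime):
    # Eq. 3.20: full membership at x', zero everywhere else
    return [1.0 if x == x_prime else 0.0 for x in universe]

X = list(range(0, 50, 5))        # X = [0, 5, ..., 45]
mu_35 = singleton(X, 35)
print(mu_35)                     # [0, 0, 0, 0, 0, 0, 0, 1, 0, 0], as in Eq. 3.21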
Step 5: Construct the implication relational matrices for the firing rules.
The crisp value temperature = 35℃ falls into two fuzzy sets: normal (fully, with membership grade 1) and hot (partially, with membership grade 0.5). As a result two rules fire, R2 and R3. To construct the relational matrices, discretization must be applied to the output universe as well, so:
Y = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
• R2 represents 𝑅𝑛𝑜𝑟𝑚𝑎𝑙→𝑚𝑒𝑑𝑖𝑢𝑚, and therefore the interest here is to infer how strongly each normal temperature value induces each medium ac speed value. The conclusion of this rule is y = medium (M), given by the fuzzy set 𝜇𝑀 = [0, 0, 0, 0, 0.5, 1, 0.5, 0, 0, 0, 0]. The fuzzy set of x = normal (N) is 𝜇𝑁 = [0, 0, 0, 0, 0.4, 1, 1, 1, 0, 0]. As observed, the elements which do not belong to the sets normal (for temperature) and medium (for ac speed) have zero membership degree. Therefore they can be excluded from the sets (since their implication relational membership degree is certainly zero). As a result:
𝜇𝑁 = [0, 0.4, 1, 1, 1, 0]
𝜇𝑀 = [0, 0.5, 1, 0.5, 0]
By applying the Mamdani implication operator, the relational membership function would be given
by:
𝜇𝑅2 (𝑥, 𝑦) = min{ 𝜇𝑁 (𝑥), 𝜇𝑀 (𝑦)} where x ∈ X and y ∈ Y
Eq. 3.22
and the final relational matrix R2:

                      y ∈ M
R2          0/30   0.5/40   1/50   0.5/60   0/70
x ∈ N
0/15         0       0       0       0       0
0.4/20       0      0.4     0.4     0.4      0
1/25         0      0.5      1      0.5      0
1/30         0      0.5      1      0.5      0
1/35         0      0.5      1      0.5      0
0/40         0       0       0       0       0

For example, 𝜇𝑅2(20, 60) = min{0.4, 0.5} = 0.4.
• R3 (𝑅ℎ𝑜𝑡→𝑓𝑎𝑠𝑡):
The relational matrix of this rule is constructed in the same way as the matrix of R2. The R3 relational matrix is:
                      y ∈ F
R3          0/50   0.2/60   0.4/70   0.6/80   0.8/90   1/100
x ∈ H
0/30         0       0        0        0        0       0
0.5/35       0      0.2      0.4      0.5      0.5     0.5
1/40         0      0.2      0.4      0.6      0.8      1
1/45         0      0.2      0.4      0.6      0.8      1
Step 6-A: Apply generalized modus ponens
This step describes the actual inference based on a fact x′, which is the given state of the temperature (Equation 3.21), and the implication rules (R2 and R3), resulting in the new state y′ of the output variable ac speed through the composition operation.
• Firing by R2:
𝜇𝑥′∘𝑅2(𝑦) = 𝜇𝑀′(𝑦) = max_𝑥 min{𝜇𝑥′(𝑥), 𝜇𝑅2(𝑥, 𝑦)}        Eq. 3.23
This results in a new fuzzy set of ac speed: 𝜇𝑀′ = [0, 0, 0, 0, 0.5, 1, 0.5, 0, 0, 0, 0].
• Firing by R3:
𝜇𝑥′∘𝑅3(𝑦) = 𝜇𝐹′(𝑦) = max_𝑥 min{𝜇𝑥′(𝑥), 𝜇𝑅3(𝑥, 𝑦)}        Eq. 3.24
This results in a new fuzzy set of ac speed: 𝜇𝐹′ = [0, 0, 0, 0, 0, 0, 0.2, 0.4, 0.5, 0.5, 0.5].
• Both rules fire in parallel, and therefore they are implicitly connected by a disjunction operator (here the max norm is applied):
𝜇𝑦′(𝑦) = 𝜇𝑀′(𝑦) ⋁ 𝜇𝐹′(𝑦) = max{𝜇𝑀′(𝑦), 𝜇𝐹′(𝑦)}        Eq. 3.25
The inferred ac speed fuzzy set is: 𝜇𝑦′(𝑦) = [0, 0, 0, 0, 0.5, 1, 0.5, 0.4, 0.5, 0.5, 0.5]
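The whole of Steps 5 and 6-A can be reproduced numerically. The Python sketch below builds the Mamdani relations R2 and R3 over the full discretized universes, composes them with the singleton input of Eq. 3.21 and aggregates with max (Eq. 3.25); the printed result matches the inferred set above.

def mamdani_relation(mu_x, mu_y):
    # Eq. 3.22: R(x, y) = min{mu_x(x), mu_y(y)}
    return [[min(a, b) for b in mu_y] for a in mu_x]

def compose(mu_in, r):
    # Eqs. 3.23/3.24: mu_out(y) = max over x of min{mu_in(x), R(x, y)}
    return [max(min(mu_in[x], r[x][y]) for x in range(len(mu_in)))
            for y in range(len(r[0]))]

# Fuzzy sets over X = [0, 5, ..., 45] and Y = [0, 10, ..., 100]
mu_normal = [0, 0, 0, 0, 0.4, 1, 1, 1, 0, 0]
mu_hot    = [0, 0, 0, 0, 0, 0, 0, 0.5, 1, 1]
mu_medium = [0, 0, 0, 0, 0.5, 1, 0.5, 0, 0, 0, 0]
mu_fast   = [0, 0, 0, 0, 0, 0, 0.2, 0.4, 0.6, 0.8, 1]
mu_x      = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]      # singleton at 35 C (Eq. 3.21)

mu_m = compose(mu_x, mamdani_relation(mu_normal, mu_medium))   # firing by R2
mu_f = compose(mu_x, mamdani_relation(mu_hot, mu_fast))        # firing by R3
mu_out = [max(m, f) for m, f in zip(mu_m, mu_f)]               # Eq. 3.25
print(mu_out)   # [0, 0, 0, 0, 0.5, 1, 0.5, 0.4, 0.5, 0.5, 0.5]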
Step 6-B: Apply generalized modus ponens
An alternative way of applying GMP inference is described in this sub-step. This method is simpler and doesn't require any discretization. The whole inference process is graphically represented in Figure 3.12. First, the input crisp value x′ is presented to the system and its membership degree to each of the fuzzy sets describing the input variable is calculated (𝜇𝑎𝑛𝑡𝑒𝑐𝑒𝑑𝑒𝑛𝑡(𝑥′)). The antecedent and the consequent of a fuzzy rule comprise linguistic values (fuzzy sets) of the input and output variables. Therefore the firing rules are those whose antecedent fuzzy set includes the input element x′ to a degree larger than zero. For each enabled rule the consequent fuzzy set is formed as:
𝜇𝑐𝑜𝑛𝑠𝑒𝑞𝑢𝑒𝑛𝑡′(𝑦) = min{𝜇𝑎𝑛𝑡𝑒𝑐𝑒𝑑𝑒𝑛𝑡(𝑥′), 𝜇𝑐𝑜𝑛𝑠𝑒𝑞𝑢𝑒𝑛𝑡(𝑦)}        Eq. 3.26
Eventually the consequents of all the firing rules are unified to produce the final set describing the output variable.
Returning to the air condition example:
The input crisp value x′ = 35 belongs to cold with degree zero, to normal (N) with degree one, and to hot (H) with degree 0.5. Therefore the firing rules are R2 and R3, since their antecedents are normal and hot respectively.
• Firing by R2:
𝜇𝑀′(𝑦) = min{𝜇𝑁(𝑥′), 𝜇𝑀(𝑦)}        Eq. 3.27
• Firing by R3:
𝜇𝐹′(𝑦) = min{𝜇𝐻(𝑥′), 𝜇𝐹(𝑦)}        Eq. 3.28
The results are shown in Figure 3.12.
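A sketch of this simpler clipping scheme follows. The continuous consequent membership functions below are rough approximations of the shapes in Figure 3.11 (the exact breakpoints are assumptions), but the clipping logic is exactly that of Eq. 3.26.

def clip(strength, mu):
    # Eq. 3.26: cut the consequent set at the rule's firing strength
    return lambda y: min(strength, mu(y))

# Approximate continuous consequent sets (shapes assumed from Figure 3.11)
def mu_medium(y):
    return max(0.0, 1.0 - abs(y - 50.0) / 20.0)    # triangle centred at 50

def mu_fast(y):
    return min(1.0, max(0.0, (y - 50.0) / 50.0))   # ramp rising from 50 to 100

mu_m = clip(1.0, mu_medium)   # R2 fires with strength mu_normal(35) = 1
mu_f = clip(0.5, mu_fast)     # R3 fires with strength mu_hot(35) = 0.5
mu_out = lambda y: max(mu_m(y), mu_f(y))   # union of the clipped consequents
print(mu_out(60.0))           # 0.5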
These are the basic steps one should follow to design and implement fuzzy inference with GMP. One thing is still missing, however: in most realistic fuzzy systems there is more than one input variable. In that case the antecedents must be combined using fuzzy operators (conjunction, disjunction, etc.), and the resulting fuzzy set is then used to infer the fuzzy set of the consequent. An example of a fuzzy system with two input variables is presented in Figure 3.13. The rules firing in this particular example are:
R1: If x = small AND y = low then z = strong.
R2: If x = small OR y = low then z = medium.
R3: If x = big AND y = high then z = weak.
The output fuzzy set shown in Figure 3.13 results after introducing the input crisp values x′ and y′.
Figure 3.12: Construction of the membership function describing the fuzzy set of the consequent inferred by the two firing rules R2 and R3 and a single input variable. The purple graph is the input variable temperature and the orange graph is the output variable ac speed. The final consequent membership function is the union of the two fuzzy sets constructed by the composition of the input variable and the fuzzy rules.
Figure 3.13: Inference over a fuzzy system with two input variables.
Step 7: Defuzzification
Fuzzy inference results in a fuzzy consequent. In many cases, however, a crisp output value is required. The process which transforms a fuzzy value into a crisp value is called defuzzification. Defuzzification methods comprise a large family, and only the best-known subset of them is presented in this study.
• Maxima methods
The methods in this category make use of the elements which belong to the consequent's fuzzy set with the maximum membership grade (let them comprise the maximum set and be called the maximum elements). These methods are first-of-maxima, middle-of-maxima and last-of-maxima. The first returns the maximum element with the smallest domain value; accordingly, last-of-maxima returns the maximum element with the largest domain value, and middle-of-maxima returns the average of the maximum elements.
• Center-of-area / gravity
The center of gravity xcog of a surface given by a function f(x) between the domain points x1 and x2 is:

𝑥𝑐𝑜𝑔 = ∫_{𝑥1}^{𝑥2} 𝑥 𝑓(𝑥) 𝑑𝑥 / ∫_{𝑥1}^{𝑥2} 𝑓(𝑥) 𝑑𝑥        Eq. 3.29

Therefore the center-of-gravity defuzzification method returns, as the crisp output value, the center of gravity of the output fuzzy set's membership function 𝜇𝑐𝑜𝑛𝑠𝑒𝑞𝑢𝑒𝑛𝑡:

𝑥𝑐𝑜𝑔 = ∫_{𝑥1}^{𝑥2} 𝑥 𝜇𝑐𝑜𝑛𝑠𝑒𝑞𝑢𝑒𝑛𝑡(𝑥) 𝑑𝑥 / ∫_{𝑥1}^{𝑥2} 𝜇𝑐𝑜𝑛𝑠𝑒𝑞𝑢𝑒𝑛𝑡(𝑥) 𝑑𝑥        Eq. 3.30

where x1 and x2 are the minimum and maximum domain elements of the membership function, respectively.
If the elements of the fuzzy output set are discrete, the center of gravity can be calculated by substituting the integrals with discrete summations:

𝑥𝑐𝑜𝑔 = ∑_{𝑖=1}^{𝑁} 𝑥𝑖 𝜇𝑐𝑜𝑛𝑠𝑒𝑞𝑢𝑒𝑛𝑡(𝑥𝑖) / ∑_{𝑖=1}^{𝑁} 𝜇𝑐𝑜𝑛𝑠𝑒𝑞𝑢𝑒𝑛𝑡(𝑥𝑖)        Eq. 3.31

A little attention is required in cases where the fuzzy set of the consequent is totally empty, so that all domain elements belong to it with zero degree. The center of gravity should then be assigned directly to zero, avoiding division by zero.
In Figure 3.14 the output fuzzy set of the inference example presented in Figure 3.13 is defuzzified in three different ways. For example, the crisp value using the center-of-gravity method for this set is calculated as:

𝑧𝑐𝑜𝑔 = (10·0.4 + 20·0.4 + 30·0.4 + 40·0.4 + 50·0.75 + 60·0.25 + 70·0.25 + 80·0.25 + 90·0.25) / (0 + 0.4 + 0.4 + 0.4 + 0.4 + 0.75 + 0.25 + 0.25 + 0.25 + 0.25 + 0) = 45.5
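The defuzzifiers described above fit in a few lines of Python. The sketch below returns first-, middle- and last-of-maxima together with the discrete centre of gravity of Eq. 3.31, and reproduces the 45.5 computed above.

def defuzzify(domain, mu):
    peak = max(mu)
    maxima = [x for x, m in zip(domain, mu) if m == peak]
    fom, lom = maxima[0], maxima[-1]         # first / last of maxima
    mom = sum(maxima) / len(maxima)          # middle of maxima
    total = sum(mu)                          # guard against an empty set
    cog = sum(x * m for x, m in zip(domain, mu)) / total if total else 0.0
    return fom, mom, lom, cog

Z = list(range(0, 101, 10))
mu = [0, 0.4, 0.4, 0.4, 0.4, 0.75, 0.25, 0.25, 0.25, 0.25, 0]
print(defuzzify(Z, mu))      # (50, 50.0, 50, 45.52...)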
Figure 3.14: Defuzzification methods: (a) Middle-of-maxima, (b) First-of-maxima, (c) Center-of-gravity
Figure 3.15: Basic preprocessing and inference steps for a fuzzy system
Until this point only generalized modus ponens (GMP) has been presented for fuzzy reasoning.
Generalized modus tollens (GMT) is also defined for fuzzy logic systems. Recall that classical
modus tollens can be expressed as:
(¬𝐵 ∧ (𝐴 → 𝐵)) = ¬𝐴
Eq. 3.32
Therefore if A “implies” B and it is given that B is not true then A is certainly not true as well.
Considering that A and B are fuzzy sets defined over the universes X and Y respectively, Equation 3.32 can be expressed as a rule:
R: If y is not B then x is not A.
In the fuzzy logic context, A and B comprise fuzzy sets describing linguistic variables. Therefore, similarly to the modus ponens equation (Equation 3.18), the consequent of GMT is given by:
𝑥 = 𝑅𝐴→𝐵 ∘ 𝑦        Eq. 3.33
After deriving the consequent fuzzy set using GMT, the rest of the inference process is the same as for GMP, presented concisely in Figure 3.15.
3.5 - Fuzzy System Development
Zadeh, who first introduced the concept of a fuzzy set, referred to Fuzzy Logic as "computing with words" (Zadeh, 1996). Humans have the ability to infer conclusions from imprecise, non-numerical input data. Not only that, but they can also transfer these thoughts and reasoning to each other using vague expressions and still understand each other. This is in fact the main form of human communication in most real-life situations, using statements of the type: "If you decrease the air conditioning temperature a little more, the room will become cold"!
Essentially, both humans and Fuzzy Logic manipulate "words" as labels. These labels characterize a set of objects, behaviors, actions, etc. However, there is a great and important difference between humans and Fuzzy Logic. Humans learn the meaning of each word during their lives and through their experiences; the function which assigns various input values to a word-label is an automatic process of the brain and happens beyond human awareness. Fuzzy systems, on the other hand, are designed and implemented on computers. Computers are not human brains, and therefore, despite the imprecise/fuzzy behavior of a fuzzy system, its parameters must be precisely defined. More precisely, there are two categories of fuzzy model identification: structure identification and parameter identification (Sugeno and Yasukawa, 1993). Structure identification concerns finding the input variables (the Type I problem) and the relations connecting them to the output variables (the Type II problem).
• Type I: Find input variables
A system might accept a large number of input variables. However, high input dimensionality increases the complexity of the system, and in many cases only a small subset of the whole input variable set drastically affects the system. As a result, the initial set of input features must be reduced in such a way that it retains all the necessary information the system needs to achieve qualitative reasoning. The problem of reducing the initial pool of candidate input variables to the minimum number that still allows the system to accomplish correct inference falls under Structure Identification Type I.
• Type II: Find input-output relations
This problem requires that the Type I problem is already solved, so that the input variables which will model the input feature space of the system are known. The next question concerns the structure and the number of the relations between the input variables and the output. In the context of fuzzy logic, the relations which transform the system's inputs to its output are called rules. Hence Type II concerns rule identification, and it is subdivided into two categories:
o Subtype IIa: Find the number of fuzzy rules
Every system presents a certain behavior (expressed as the output variable) which depends upon a particular combination of values of the input variables. Therefore a specific number of rules must be built in order to define every possible behavior of the system (every possible combination of causes and effect).
o Subtype IIb: Find the structure of premises
Each rule is composed of a set of premises and a consequent. The maximum number of premises is equal to the total number of input variables (let it be N), yet some input variables may not affect a certain system behavior (consequent). Each variable is quantified into a number of linguistic values, and therefore for each rule there are 𝑁 × 𝑀 different combinations of premises. Each rule defines a specific partition of the input space (if the introduced input features fall into this partition, the rule fires). The position of the partition is defined by the membership functions describing the linguistic variables which participate in the rule. The problem is thus to define the premise values for each rule so that the input space is partitioned into the minimum possible number of subspaces without losing any information about the system's output behavior.
Parameter identification comprises an optimization problem for the parameters of the membership functions describing each linguistic variable. The number of parameters which must be optimized per membership function depends on the type of the function (trapezoidal, Gaussian, etc.).
If no prior knowledge is available for the fuzzy system to be constructed, the structure and parameter identification problems cannot be dealt with separately. Several methods have been proposed for solving the above-mentioned problems. Genetic algorithms are very popular for solving parameter identification and Type I problems; for the former, other models have also been used, such as Artificial Neural Networks, Particle Swarm Optimization and gradient descent methods. For the Type II problem, clustering methods have proved able to separate the input space adequately into a certain number of clusters which represent the partitions. Each cluster is then transformed into a rule, assigning the corresponding input values to the rule's premises.
Chapter 4 - Evolutionary Algorithms
Evolutionary Algorithms (EA) constitute the field of Computer Science which studies programs that evolve solutions to problems by incorporating nature's evolutionary principles. Evolutionary algorithms are biologically inspired, and many of their terms and parameters are identical to the ones used in genetics. For a better understanding of their meaning and use in the context of EC, some biological terms of genetics are explained in this section.
4.1 - Biology
Every living organism is composed of a certain number of cells. The cell is the structural and functional unit of life. Each cell contains the same set of chromosomes, which is called the genome.
For example, the human genome is constituted by forty-six chromosomes. The chromosomes lie in the nucleus of the human cell and are divided into twenty-two pairs of homologous chromosomes and one pair of sex chromosomes. A chromosome is basically a long chain of biochemical material, namely deoxyribonucleic acid (DNA), which encodes the necessary information about the development of a specific set of characteristics describing the organism. The "ribbon" of the DNA is further divided into segments, called genes, which serve as "blueprints" for a part, or the whole, of a particular protein. Recall that proteins are considered the building blocks of an organism, and therefore the whole genome may be naively thought of as nature's "recipe book", including all the instructions for the construction of the organism's physical and mental features.
The genes of homologous chromosomes encode the same protein functionality and are located at corresponding positions on the chromosomes. As the genome is inherited from the organism's parents, one part of the homologous chromosomes is a copy of the mother's corresponding chromosome and the other part of the father's.
Each gene is positioned at a specific location on a chromosome, called its locus. The DNA sequence enclosed in a gene locus represents the encoded information about the expression of a characteristic. The variants of the DNA sequence which can appear at a locus are called alleles. For example, consider a gene responsible for the production of the eye colour protein, with possible alleles {Black, bRown, bLue, Green}. Since each chromosome is always accompanied by its homologous one, the eye colour will be set based on the alleles of the two homologous genes. The two alleles may differ from each other, and therefore if N is the number of possible alleles of a gene, there are 𝑁 × 𝑁 different allele combinations. Now, consider an organism whose eye colour homologous genes present the allele combination {R, L} (Figure 4.1). That means that one of the genes defines brown eye colour while the other defines blue. The combination of the alleles {R, L} is labeled the eye colour genotype of the organism. Brown is dominant over blue, and thus the eyes of this particular organism will be brown. The actual colour of the eyes is the expression of the genotype and is called the phenotype.
A cell can make a copy of itself through the process of mitosis. During mitosis the DNA of the nucleus is replicated, creating two identical sets of chromosomes. At the end of the process the cell is divided into two duplicate cells which have the same number of chromosomes as the maternal one. The first of an organism's cells which undergoes mitosis is the zygote, and after consecutive divisions billions of cells are created, each of them containing the same number of chromosomes as the zygote.
The zygote is the result of the union of two gamete cells (one from each parent) which are joined during reproduction, and from the zygote a new organism can be developed. By reproducing, organisms manage to pass on their genes to the next generation and therefore gain in evolution. Reproduction is the driving vehicle of evolution and natural selection is the wheel.

Figure 4.1: A homologous pair of chromosomes appears on one side of the figure. The alleles of the two homologous eye colour genes define that the genotype of eye colour is RL (brown and blue), while the phenotype, two brown eyes, is shown on the other part of the figure.
Natural selection is a term first introduced by Charles Darwin in his work "The Origin of Species" back in 1859. The concept of natural selection can be analyzed into two other concepts:
• The survival of the fittest
The fittest organism is considered the one best adapted to its environment. As a result, it has more potential to survive the dangers and threats hidden in its environment, whereas less adapted organisms do not.
• Differential reproduction
This concept emanates from "the survival of the fittest". If an organism survives better than others, then it is more likely to reach reproductive age, and consequently it has a higher chance of reproducing and leaving offspring that carry part of its genome to the next generations.
Hence natural selection doesn't favor the organisms which simply live longer than others, but rather the ones which live long enough to have more offspring surviving to adulthood. An organism is selected by nature to survive based on its interaction with its environment, which is highly dependent on the organism's phenotype. As a result, the genes which are passed through generations are the ones which express the "selected" phenotype. Over a number of generations the phenotype of the fittest organisms dominates the population.
Besides reproduction, new information and new characteristics might appear in a population through mutation. Mutation happens when a DNA sequence changes spontaneously and randomly, altering the expression of the corresponding gene. Mutation may have an impact on the phenotype of the mutated organism, and totally new characteristics can emerge in the particular population. Although reproduction is the main process of evolving organisms, mutation may lead to the appearance of a new characteristic which could be the key to the better adaptation of the next generation to a changing environment. Summing up, evolution could be given as:
Evolution = reproduction with inheritance + variability + selection
Eq. 4.1
4.2 - EC Terminology and History
Biological terms are used to describe common parameters and concepts of the EC algorithms
and they are presented in Table 4.1.
Biology          EC algorithms

Chromosome: A specific data structure which represents the solution to the modeled problem. A problem might be affected by a number of variables; each of these variables is somehow encoded into the chromosome.
Population: The total number of different chromosomes participating in an EC algorithm.
Genotype: The values of the encoded data of a chromosome.
Phenotype: The decoded values of a chromosome, which are assigned to the corresponding problem variables.
Environment: The target of the problem being modeled.
Fitness Value: An evaluation of how well a specific chromosome, based on its phenotype, interacts with the environment; in other words, how well the solution suggested by a chromosome "fits" the problem.
Crossover (Reproduction): The process where two artificial chromosomes exchange genotype information.
Mutation: The process where the genotype of a chromosome is altered in a specific way and position.
Selection: The process where a certain number of chromosomes are chosen, usually based on their fitness values.

Table 4.1: How biological terms are used in EC algorithms
The EA area comprises four basic techniques:
(a) Genetic Algorithms (Holland, 1975)
(b) Genetic Programming (Koza, 1992)
(c) Evolutionary Strategies (Rechenberg, 1973)
(d) Evolutionary Programming (Fogel, 1966)
They all work on populations of artificial chromosomes (also called individuals). Each chromosome suggests a solution to the problem being modeled. The initial population of chromosomes is evolved through a number of generations and eventually converges to a "good" solution. The main difference between the four EC paradigms is the chromosome representation scheme. A short description of each follows, while a more comprehensive description is given for Genetic Algorithms, which appear to be the most widely used of them all.
4.2.1 - Genetic Algorithms (GA)
Representation:
The basic representation for a GA chromosome is a fixed-length bit string. Each position of the string represents a gene. A set of one or more sequential genes expresses the phenotype (value) of one specific variable of the problem.
Basic Operators:
Reproduction: Bit-string crossover. Two chromosomes are selected as parents and swap a specific sequence of their genes, producing two new offspring which eventually replace their parents.
Mutation: Bit-string mutation. A number of genes (possibly zero) of a chromosome are chosen and flipped.
Selection: A number of chromosomes are selected using probabilistic methods (described in greater detail later) which are based on the chromosome fitness values.
4.2.2 - Genetic Programming (GP)
Representation:
GP is mainly used to evolve programs. A program consists of a predefined terminal set, which includes inputs, constants or any other value the variables of the program can take, and a function set which includes the different statements, operators and functions used by the program.
Terminal Set                        Function Set
{-4, -3, -2, -1, 0, 1, 2, 3, 4}     {+, -, *, /}
{a, b, c}                           {and, or, not}
{true, false}                       {if, do, set}
Table 4.2: Terminal and Function set examples
The basic representation for a GP chromosome is then a tree whose inner nodes take elements
from the function set and the terminal nodes (leaves) from the terminal set (Figure 4.2).
Basic Operators:
Reproduction: Sub-tree crossover. Two chromosomes (program trees) are selected as parents per the selection policy. In each parent a sub-tree is selected randomly (the sub-trees may have unequal sizes), and then the sub-trees are swapped. Usually there is a maximum tree-depth constraint which restricts the choice of a sub-tree.
Mutation: Sub-tree mutation. A node is randomly chosen and replaced by a randomly generated sub-tree. Again the maximum depth constraint applies.
Selection: As in Genetic Algorithms.
Figure 4.2: Genetic tree expressing the program sin(x) + exp(x² + y)
4.2.3 - Evolutionary Strategies (ES)
Representation:
A fixed-length vector of real values is the ES representation. As opposed to GA, the basic representation of ES captures the phenotype of a chromosome rather than the genotype: each element of the vector holds the actual value of a problem variable.
Basic Operators:
Reproduction: Intermediate recombination. Two (or more) parental chromosomes are averaged element by element to form a new offspring.
Mutation: Gaussian mutation. A random value is drawn from a Gaussian distribution and added to each element of a chromosome.
Selection: The best-fitted chromosomes, either only amongst the offspring or among both parents and children, are passed on to the next generation.
4.2.4 - Evolutionary Programming (EP)
Representation:
Similar to ES, the representation of EP chromosomes is a fixed-length real valued vector.
Basic Operators:
Reproduction: None. There is no exchange of information between the chromosomes of a population; evolution is driven only by mutation.
Mutation: Distribution mutation (e.g. Gaussian).
Selection: Each chromosome replicates itself, producing an identical offspring. Then mutation is applied to that offspring. Finally, using a stochastic method, the best-fitted chromosomes are selected from among the parents and the offspring.
4.3 - Genetic Algorithms
GA is a method which evolves a population of chromosomes (candidate solutions), aiming at finding a good solution to a problem after a number of generations. GAs apply a search over a problem's set of potential solutions until a specified termination criterion is satisfied and an approximately good solution is found. Essentially Holland, who introduced GAs in their canonical form, incorporated nature's principle of "survival of the fittest" into an optimization algorithm.
4.3.1 - Search space
GAs are mostly used to solve optimization problems. Optimization is about finding a good/approximate solution to a problem (e.g. the minimum/maximum of a function). To achieve optimization, a set of the problem's potential solutions 𝑥 ∈ 𝑋 is needed, along with a method of evaluating each potential solution, 𝑓: 𝑋 → ℜ.
The search space of a problem is defined as the collection of all feasible solutions to the problem. Every single point of the search space comprises a set of values of the problem's variables which give a specific solution; the optimum solution is therefore a point of the search space. A point in a problem's search space is also assigned a fitness value, depending on how well it solves the particular problem. An optimization algorithm is then defined to explore the search space so as eventually to converge to a solution which will be regarded as the optimum one. Hence a GA, being an optimization algorithm, applies a search over the search space of a problem.
4.3.2 - Simple Genetic Algorithm
An algorithm is a sequence of steps which one must follow to solve a problem. GA is an iterative algorithm which, at each step, handles a population of feasible solutions of a problem encoded in chromosomes. GA mimics the process of evolution: at every iteration a new population of chromosomes is generated from their ancestors using certain genetic operators. GAs are heuristic search algorithms aimed at finding optimum solutions to a problem by mimicking the process of natural evolution. Their great advantage is that they can easily handle multi-dimensional problems without having to use differential equations. The principle of GAs is based on the survival of the fittest chromosomes-solutions among an initial random population of chromosomes. Once the population is generated the algorithm is applied and the evolution process begins. This principle can be used for all kinds of problems and applications. For each application, the process of chromosome evaluation is defined to guide the selection process toward the optimum solution of the modeled problem. Chromosomes with better evaluation scores have a higher probability of reproducing compared to the ones with worse scores.
The first step of a GA is the generation of the initial population of chromosomes, X. A defined number N of fixed-length bit strings is created, 𝑋 = (𝑥1, … , 𝑥𝑁), randomly initialized, to represent a subset of all the candidate solutions of the problem. A fitness function is used to calculate each chromosome's fitness value fi, a numerical value showing how close the particular solution is to the optimum one. Then the following steps are iterated until the termination criterion is satisfied. Until N offspring are generated, repeat sub-steps 1-4:
(1) A pair of chromosomes (𝑥𝑖 , 𝑥𝑗 ), is selected to function as parents, using a probabilistic
selection method based on their fitness values.
(2) With a predefined probability PC the two parental chromosomes swap genetic
information using a crossover operator producing two offspring (𝑜1 , 𝑜2 ).
(3) Each offspring is then mutated under the probability of PM.
(4) The fitness value of each offspring off_fi is evaluated.
After the generation of N offspring, the current population of chromosomes is replaced by the
new population of offspring and a new generation cycle begins.
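A minimal Python sketch of this generational loop is given below. The operator functions are deliberately simple stand-ins, a size-two tournament, one-point crossover and bit-flip mutation, chosen from the methods described in the following sections; they are illustrative assumptions, not prescribed by the text above.

import random

def select(pop, scores):
    # tournament of size 2 (see section 4.3.2.3.1)
    i, j = random.randrange(len(pop)), random.randrange(len(pop))
    return pop[i] if scores[i] >= scores[j] else pop[j]

def crossover(p1, p2):
    # one-point crossover (see section 4.3.2.3.2)
    p = random.randint(1, len(p1) - 1)
    return p1[:p] + p2[p:], p2[:p] + p1[p:]

def mutate(x, pm):
    # bit-flip mutation (see section 4.3.2.3.3)
    return [1 - b if random.random() < pm else b for b in x]

def genetic_algorithm(fitness, n, length, pc, pm, max_gen):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(n)]
    for _ in range(max_gen):
        scores = [fitness(x) for x in pop]                 # evaluate (step 4)
        offspring = []
        while len(offspring) < n:
            p1, p2 = select(pop, scores), select(pop, scores)                   # (1)
            o1, o2 = crossover(p1, p2) if random.random() < pc else (p1[:], p2[:])  # (2)
            offspring += [mutate(o1, pm), mutate(o2, pm)]                       # (3)
        pop = offspring[:n]     # replace the old population with the offspring
    return max(pop, key=fitness)

# one-max toy problem: the fitness is simply the number of ones
print(genetic_algorithm(sum, n=20, length=10, pc=0.9, pm=0.1, max_gen=50))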
Using a genetic algorithm paradigm for solving an optimization problem requires a
preprocessing phase including the specification of the algorithm’s parameters and methods.
Therefore for any specific problem the following must be defined:
1. The solution representation into a chromosome.
2. The fitness function.
3. The genetic operators (crossover, mutation and selection)
4. Other parameters, such as the termination conditions and the probabilities PC and PM.
4.3.2.1 - Representation
The most common GA representation is binary, which means each chromosome is a series of zeros and ones which can be decoded to numerical values. A problem is affected by a number of variables. Let v be such a variable. Then if

𝑣 ∈ {0, 1, … , 2^𝑙 − 1}, where l ∈ ℕ        Eq. 4.2

the binary representation of v is [𝑏1 𝑏2 … 𝑏𝑙], where 𝑏𝑖 ∈ {0, 1}        Eq. 4.3

and the integer value of v is decoded according to:

𝑣 = ∑_{𝑖=1}^{𝑙} 𝑏𝑖 2^{𝑖−1}        Eq. 4.4
However, there are cases where a variable's domain cannot be represented by Equation 4.2. A more general definition is therefore given as:

𝑣 ∈ {𝑀, 𝑀 + 1, … , 𝑀 + 2^𝑙 − 1}, where l, M ∈ ℕ        Eq. 4.5

where M is the minimum value of the variable's domain (v_min) and 𝑀 + 2^𝑙 − 1 is the maximum (v_max). The binary representation should therefore represent the distance v_max − v_min = 𝑀 + 2^𝑙 − 1 − 𝑀 = 2^𝑙 − 1, which is the same as in Equation 4.4, and the decoded value is:

𝑣 = 𝑣_𝑚𝑖𝑛 + ∑_{𝑖=1}^{𝑙} 𝑏𝑖 2^{𝑖−1}        Eq. 4.6
For real-valued domains, the distance between the maximum and minimum values of the variable's domain is divided into 2^𝑙 − 1 equal sub-distances, which means the domain is restricted to 2^𝑙 real values with a precision of q decimal points. The decoded value is retrieved by:

𝑣 = 𝑣_𝑚𝑖𝑛 + ((𝑣_𝑚𝑎𝑥 − 𝑣_𝑚𝑖𝑛) / (2^𝑙 − 1)) ∑_{𝑖=1}^{𝑙} 𝑏𝑖 2^{𝑖−1}        Eq. 4.8
If more than one variable defines the problem, the binary representation is calculated for each variable based on its type of domain and its boundaries; the total problem solution representation is then given by concatenating the individual representations into one.
Figure 4.4: Example of decoding the binary representation of a problem with two variables (v1
and v2).
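Decoding per Eq. 4.8 can be written compactly. The sketch below treats b1 as the least significant bit, following Eq. 4.4 (the bit ordering is an assumption of this sketch).

def decode(bits, v_min, v_max):
    # Eq. 4.8: map the integer value of the bit string onto [v_min, v_max]
    l = len(bits)
    integer = sum(b * 2 ** i for i, b in enumerate(bits))   # Eq. 4.4, b1 = LSB
    return v_min + (v_max - v_min) / (2 ** l - 1) * integer

print(decode([1, 0, 1, 1], 0.0, 15.0))   # the bits encode 13 -> 13.0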
Figure 4.3: Logic Diagram of GA using max generations (MAX_GEN) termination criterion
4.3.2.2 - Fitness Function
The fitness function usually "translates" the natural objective/cost function of the optimization problem; when the problem's objective function is known, it is used directly as the fitness function. Generally the fitness function should return adequate information on how close the solution suggested by each chromosome is to a good solution. The fitness function is applied to the decoded values of a chromosome. Fitness values form the basis of the selection operator, since in general the fittest chromosomes have a higher probability of being selected as parents.
4.3.2.3 - Genetic Operators
4.3.2.3.1 - Selection
Selection takes place during the "breeding" phase of the algorithm, where new chromosomes must be generated to replace the present population. Following natural selection, GA selection favors the recombination of the fittest chromosomes, so that their genome has a higher probability of passing to the next generation. However, less fit individuals are not totally excluded from the breeding phase, thereby retaining the population's diversity and taking advantage of potentially "good" partial solutions.
Three selection methods are presented; for all of them fi denotes the fitness of the ith chromosome, where 1 ≤ 𝑖 ≤ 𝑁.
4.3.2.3.1.1 - Roulette Wheel Selection
As its name suggests, this selection method resembles the function of a casino roulette wheel. First, the share of each chromosome's fitness in the total population fitness is calculated:

𝑃𝑖 = 𝑓𝑖 / ∑_{𝑖=1}^{𝑁} 𝑓𝑖        Eq. 4.9
A roulette wheel represents the total population fitness. Each chromosome occupies a slice of the roulette wheel, sequentially; the size of the slice is proportional to the chromosome's contribution to the total fitness, 𝑃𝑖. Hence a fitter chromosome will be assigned a larger slice of the roulette wheel than a less fit one. The exact position of each chromosome on the roulette wheel is defined by its cumulative fitness value:

𝐶𝑗 = ∑_{𝑖=1}^{𝑗} 𝑃𝑖        Eq. 4.10

The roulette wheel ball is implemented as a randomly generated number r in the range [0, 1]. The ball "falls" within the boundaries of a specific chromosome's slice, and that chromosome is the selected one. Therefore:

𝐶𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑−1 ≤ 𝑟 < 𝐶𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑        Eq. 4.11
Figure 4.5: Imagine the roulette wheel "unfolded" onto a ruler, for a population of three chromosomes with fitness values f = {15, 25, 10}. The chromosomes are positioned on the roulette wheel and a random number r = 0.83 is generated; the selected chromosome is the third one.
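The roulette wheel is easy to implement with cumulative shares; the following Python sketch reproduces the example of Figure 4.5 (the optional r argument exists only so the example is deterministic).

import random, bisect

def roulette_select(pop, fitnesses, r=None):
    total = sum(fitnesses)
    cum, acc = [], 0.0
    for f in fitnesses:
        acc += f / total          # P_i, Eq. 4.9
        cum.append(acc)           # C_j, Eq. 4.10
    if r is None:
        r = random.random()       # the "ball"
    return pop[bisect.bisect_right(cum, r)]   # the slice containing r, Eq. 4.11

# Figure 4.5: f = {15, 25, 10} gives C = {0.3, 0.8, 1.0}; r = 0.83 selects the third
print(roulette_select(["x1", "x2", "x3"], [15, 25, 10], r=0.83))   # x3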
Although roulette wheel selection is one of the traditional GA selection methods, it exhibits some problems. Premature convergence is the first one. Suppose a chromosome has a fitness value much higher than the average population fitness, yet considerably lower than the optimum. Roulette wheel selection will allow this chromosome to occupy a very big slice of the roulette pie (e.g. 90%), and the remaining chromosomes will have too few chances to be selected. As a result, within a small number of generations all the chromosomes of the population will be fairly identical to this chromosome, unable to escape from its genotype and therefore unable to reach the optimum or any better solution. In other words, the population may converge to a local optimum. Another scenario with which roulette wheel selection fails to deal effectively is when all chromosomes of a population have approximately equal fitness values close to the optimum value (but none of them is the optimum). In this case each chromosome possesses an almost equal slice of the roulette wheel, so all have equal chances of being selected, eliminating any selection pressure (i.e. the degree to which fitter chromosomes have a higher probability of producing offspring). The roulette wheel then behaves just like random selection, unable to use the fitness information given for each chromosome.
Fitness scaling is used to address premature convergence as described above. The idea is to rescale the fitness of each chromosome so as to decrease the selection pressure towards a chromosome whose fitness value is much larger than the rest; through fitness scaling such a chromosome is restricted from dominating the population prematurely. Linear scaling is one of the available scaling methods, under which the new scaled fitness values are given by:

𝑓𝑖′ = 𝑎𝑓𝑖 + 𝑏        Eq. 4.11

The basic idea of linear scaling is that the chromosome with the average fitness is expected to be selected once. Therefore the average fitness value after scaling remains equal to the average fitness value before scaling:

𝑓′𝑎𝑣𝑔 = 𝑓𝑎𝑣𝑔        Eq. 4.12

Given this, the user defines how many times the best-fit chromosome is expected to be selected (expressed by a number C ∈ ℜ). Then the linearly mapped best fitness value is given by:

𝑓′𝑚𝑎𝑥 = 𝐶𝑓𝑎𝑣𝑔        Eq. 4.13

Clearly, increasing the parameter C increases the selection pressure in favor of the fittest chromosome, while decreasing it relaxes the selection pressure. Based on the two points (𝑓𝑎𝑣𝑔, 𝑓′𝑎𝑣𝑔) and (𝑓𝑚𝑎𝑥, 𝑓′𝑚𝑎𝑥), the coefficients a and b are calculated, and using Eq. 4.11 the remaining rescaled fitness values are computed.
Figure 4.6: Fitness rescaling function where C = 2
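Solving the two conditions of Eqs. 4.12 and 4.13 for a and b gives a small helper; a quick sketch (the example fitness values are arbitrary):

def linear_scaling(fitnesses, c):
    # Solve a*f_avg + b = f_avg (Eq. 4.12) and a*f_max + b = C*f_avg (Eq. 4.13);
    # assumes f_max > f_avg, i.e. not all fitnesses are equal
    f_avg = sum(fitnesses) / len(fitnesses)
    f_max = max(fitnesses)
    a = (c - 1.0) * f_avg / (f_max - f_avg)
    b = f_avg * (1.0 - a)
    return [a * f + b for f in fitnesses]      # Eq. 4.11

print(linear_scaling([5.0, 10.0, 15.0, 50.0], c=2.0))
# -> [10.0, 13.33..., 16.66..., 40.0]: the average stays 20, the max becomes 2 * 20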
4.3.2.3.1.2 - Rank Selection
Rank selection ranks individuals according to their fitness values. The chromosome with the best fitness is assigned the highest rank (r = N, where N is the population size), the worst-fit chromosome gets the lowest rank (r = 1), and all the others fall in between. The ranking values are then treated as the new fitness values.
With rank selection all chromosomes have a chance of being selected, even if one of them has a substantially larger fitness value. However, rank selection is unable to take advantage of the information contained in the fitness differences among the chromosomes, often leading to much slower convergence.
4.3.2.3.1.3 - Tournament Selection
With tournament selection, a group of m chromosomes is randomly selected and the fittest chromosome among them is chosen as a potential parent. Tournament selection is very easy to implement, efficient and, most importantly, it holds a balance between selective pressure and population diversity. The only parameter the user must tune is the number of chromosomes participating in each "tournament": as the number of competitors increases, the selection pressure increases, and vice versa. Therefore the user must define the tournament size m carefully, so as to keep the selection pressure and population diversity at the desired levels.
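A sketch of tournament selection with a configurable tournament size m (the example population and fitnesses are arbitrary):

import random

def tournament_select(pop, fitnesses, m=3):
    # larger m -> higher selection pressure, and vice versa
    contenders = random.sample(range(len(pop)), m)
    return pop[max(contenders, key=lambda i: fitnesses[i])]

pop = [[0, 1, 1], [1, 1, 1], [0, 0, 1], [1, 0, 1]]
print(tournament_select(pop, [2, 3, 1, 2], m=2))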
4.3.2.3.1.4 - Elitism
When elitism is used, a number of the best-fitted chromosomes are copied directly to the next generation. The fittest chromosomes are passed to the new population automatically in order to keep the best solutions found so far as a backup. The rest of the operators (selection, crossover, mutation) are applied as in the canonical GA. In this way, any good genetic code achieved up to that point cannot be destroyed by mutation, which improves GA performance.
4.3.2.3.2 - Crossover
The crossover operator can be considered the main genetic operator of evolution in the context of GAs. Crossover allows the combination of partial solutions: two chromosomes (first selected among the others) exchange genetic information and two offspring are created. The genome of each child is a combination of its parents' genomes. This hopefully leads to offspring which are better fitted to the "environment", i.e. which suggest better solutions to the optimization problem.
• One-point crossover
This is the simplest way of applying crossover. One crossover point is randomly selected such that p ∈ [1, l], where l is the length of the chromosome bit-string. Then the two parental chromosomes swap the genetic code defined by the crossover point p.
Figure 4.7: One-point crossover for two chromosomes of size = 10. The position 6 is
randomly selected and two offspring (on the right of the figure) are created.
• Two-point crossover
Two-point crossover differs from the former only in that two positions, p1 and p2, are randomly selected along the chromosome's length. The genetic code inside the range defined by these crossover points is then swapped between the two parents.
Figure 4.8: Two-point crossover for two chromosomes. The genetic sequence between
the randomly selected points p1 and p2 is swapped.
The addition of extra crossover points allows the GA to combine parts of the chromosome's genome other than the head and the tail; one-point crossover, on the contrary, always swaps both head and tail. Therefore by using two-point crossover more potential solutions can be visited by the GA and the search space of the problem is covered more thoroughly.
Both one-point and two-point crossover, however, present a common behavior. Genes which lie close to each other are implicitly correlated, in the sense that they are more likely to be selected together for swapping. A problem with interdependent variables could benefit from this phenomenon by positioning the genotype of these variables next to each other in the chromosome representation. In cases where this phenomenon is to be avoided, uniform crossover can be used instead.
• Uniform crossover
For each pair of parents, a random bit string of the same length as the chromosome, called the mask, is generated. The first offspring is created following these rules: for every gene, if the mask's corresponding bit is one, the offspring copies the corresponding gene value from parent 1; if the mask's bit is zero, the offspring copies the gene value from parent 2. The same applies to the second offspring, substituting parent 1 with parent 2 and vice versa.
Figure 4.9: Uniform crossover
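The two variants just described can be sketched as follows (one-point crossover is analogous, with a single cut point; the all-zero/all-one parents are only a demonstration):

import random

def two_point(p1, p2):
    # swap the segment between two random cut points (Figure 4.8)
    a, b = sorted(random.sample(range(1, len(p1)), 2))
    return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]

def uniform(p1, p2):
    # a random mask decides which parent each gene is copied from (Figure 4.9)
    mask = [random.randint(0, 1) for _ in p1]
    o1 = [x if m else y for m, x, y in zip(mask, p1, p2)]
    o2 = [y if m else x for m, x, y in zip(mask, p1, p2)]
    return o1, o2

p1, p2 = [0] * 10, [1] * 10
print(two_point(p1, p2))
print(uniform(p1, p2))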
4.3.2.3.3 - Mutation
As aforementioned, crossover is the main tool of evolution for creating new chromosomes with better fitness. Mutation, however, can introduce new genotypes which are not present in the population, helping the algorithm to escape from local optima. Therefore even if a "good" genotype is lost through the generations, it might be recovered by applying mutation to the population.
The most common mutation operator for binary representations is to create a vector of random real values 𝑟 ∈ [0, 1]^𝑙, where l is the size of the chromosome. Each element of this vector corresponds to one gene of the chromosome. If the random value is less than or equal to a predefined probability constant Pm, the particular gene is flipped (zero to one and one to zero):

If 𝑟𝑖 ≤ 𝑃𝑚 then 𝑚𝑢𝑡𝑎𝑡𝑒𝑑_𝑥𝑖 = 1 − 𝑥𝑖        Eq. 4.14
Figure 4.10: Mutation of a chromosome under the probability Pm = 0.2
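Eq. 4.14 in code, with one random draw per gene (the example chromosome is arbitrary):

import random

def bit_string_mutation(x, pm=0.2):
    r = [random.random() for _ in x]                            # r in [0, 1]^l
    return [1 - g if ri <= pm else g for g, ri in zip(x, r)]    # Eq. 4.14

print(bit_string_mutation([1, 0, 1, 1, 0, 0, 1, 0, 1, 0]))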
4.3.2.4 - Setting parameters
GA is a stochastic, heuristic method. One should fine-tune the various GA parameters based on the particular problem to be modeled. Since GAs are applied to a wide range of problem classes, there are no general, precise rules for setting each parameter. However, knowing the effect of each parameter helps to adjust them properly for a given problem.
First of all, GA convergence depends on the initial positioning of the chromosomes in the search space. As aforementioned, a GA uses chromosomes to apply parallel searching in the problem space, driven by a selection policy. Therefore, if the population size is too small (with respect to the problem's search space), the GA might be unable to search the space adequately for the optimum solution in reasonable time. On the other hand, if the population size is too large, convergence becomes much slower and the GA's performance is significantly decreased. The "right" population size is problem dependent; the dimensionality of the problem space (and hence the chromosome's length) must be considered when defining it.
Crossover and mutation are the basic genetic operators applied in canonical GA. Adjusting
properly their parameters enhances GA’s effectiveness and efficiency.
The crossover phase strongly depends on the only crossover parameter, the crossover probability PC. If PC is zero, no crossover takes place and the newly generated population is just a copy of the previous one. If PC is one, all selected pairs of parents are recombined to create two offspring with their genetic code mixed. Basically, the crossover probability defines the frequency of chromosome recombination. Crossover is applied to create new chromosomes which combine the genetic information of their parents; this means they are able to combine partial solutions (which may be good) into a new solution (which may be better). It is also desirable to pass some chromosomes to the new population untouched, in the hope that chromosomes suggesting a very good solution help evolution more if they are not disrupted.
The probability Pm defines how often the genes of a chromosome are mutated. If Pm is extremely small, the algorithm cannot benefit from mutation, since very few genes will be altered very rarely; either the population will converge very slowly to an optimum solution, degrading GA performance, or it may not converge at all. On the other hand, if Pm is large, the genome of the population will be highly disrupted with every generation. Eventually the chromosomes won't evolve; rather, they will change their genotypes totally randomly, and the GA won't perform any directed exploration of the problem's search space. Besides, the basic idea of the GA is to incorporate the effects of the natural selection principles into an optimization algorithm; by increasing the mutation probability substantially, these principles cannot be applied at all. Typically the mutation probability is advised to be approximately equal to 1⁄(chromosome length).
Fine-tuning the crossover and mutation probabilities helps the GA balance exploration and exploitation in the search for an optimum solution. Exploration is mainly "powered by" the crossover operator, and much less by mutation, while exploitation is driven by selection. Crossover allows the GA to explore the search space of the problem guided by the information in the present chromosomes: chromosomes are combined to produce new chromosomes, built from a mixture of their ancestors' genotypes yet suggesting totally new solutions. Mutation also produces new genetic code, in a more random fashion, helping the algorithm escape potential local optima; however, since mutation is generally preferred to happen rarely, it doesn't strongly affect the exploration of the algorithm.
As mentioned above, crossover is an exploration operator producing new solutions based on the present chromosomes, and to some degree in favor of the fittest ones. This essentially means that the new chromosomes, after a number of generations, will introduce solutions close to the best-fit chromosomes. This is exploitation, and it is implemented by the selection operator. After a certain number of generations, a substantial percentage of the population will have fairly good and approximately similar fitness values. Since fitter chromosomes have a higher probability of being recombined, the resulting offspring will give new solutions which are fairly close to their ancestors' solutions; in this phase, the GA searches for the optimum around the area of the "good" present solutions. By choosing a fitness function that gives proper information about the quality of a solution, along with fine-tuned crossover and mutation parameters and a suitable selection method, the GA substantially increases the probability of converging to the best possible solution.
4.3.3 - Overall
In dealing with optimization problems, a designer is faced first of all with the multidimensionality
of the problem and the difficulty of visualizing the search space. In addition, such problems are
complicated to analyze, and random search is often incapable of producing any solution in a
reasonable time period, mainly because the process falls into local minima. Additionally, there are
cases where very little is known about the problem function (fitness landscape), and it may be
the case that this fitness landscape is extremely difficult in terms of discontinuities, rough
boundaries, etc., where conventional optimization methods (e.g. gradient methods) cannot
perform well.
On the other hand, an implementation of the problem with GAs is straightforward, practical and
needs no special technical expertise. An additional advantage of GAs in reaching a solution
is the fact that calculations are done in parallel, with a number of processes searching for
the solution at the same time. Also, processes exchange information through crossover, allowing
"good" solutions to be combined to create a new "better" solution, arriving at the optimum
one faster.
Like every other heuristic method, GAs have drawbacks. More specifically, GAs do not always
find the optimum solution to a problem; it is better said that GAs always give a "suitable"
solution. Therefore a GA is an ideal tool for retrieving a number of approximate solutions to a
problem, but even in cases where an approximate solution is not adequate, the GA can be combined
with other local search algorithms, greatly increasing the probability of reaching the optimum.
Chapter 5 - Fuzzy Cognitive Maps
5.1 - Introduction
A wide range of real-life dynamical systems are characterized by causality in their
interrelations. Most of the time, experienced and knowledgeable users can identify the
parameters and the conditions of their system. However, the complexity of these systems acts as a
prohibitive factor in the prognosis of their future states. Nevertheless, it is crucial for the users
to make a prognosis of the cost of changing the state of a concept when deciding whether or not
to take such a risk.
Kosko presented FCMs as an extension of Cognitive Maps (Kosko, 1992). CMs were first
introduced as a decision-making modelling tool for political and social systems (Axelrod, 1976).
Such systems can be analyzed into two fundamental elements: the set of concepts (parameters of
a system) and the interrelationships or causalities amongst them. These are graphically represented
as the nodes and signed directed edges, respectively, of a signed digraph. The concepts which
behave as causal variables are positioned at the origin of the arrows, which always point to the
effect variables. Each edge can be either positive (+1), representing positive causality, or negative
(-1), representing negative causality. However, according to Kosko, causality tends to be fuzzy and
is usually expressed or described in vague degrees (a little, highly, always, etc.), whereas in CMs
each node simply makes its decision based on the number of positive impacts and the number of
negative impacts.
FCM modelling constitutes an alternative way of building intelligent systems that provides exactly
this opportunity to the users: predicting the final states of the system caused by a change of the
initial concept states to some degree. Fuzzy Cognitive Maps (FCMs) constitute a powerful soft
computing modelling method that emerged from the combination of Fuzzy Logic and Neural
Networks. FCMs were first introduced by Kosko (Kosko, 1986), and since then a wide range of
applications in modelling complex dynamic systems have been reported, such as medical,
environmental, supervisory and political systems. Essentially, a Fuzzy Cognitive Map is developed
by integrating the existing experience and knowledge regarding a cause-effect system.
5.2 - FCM technical background
5.2.1 - FCM structure
In a graphical form, the FCMs are represented by a signed fuzzy weighted graph, usually
involving feedbacks, consisting of nodes and directed links connecting them. The nodes represent
descriptive behavioural concepts of the system and the links represent cause-effect relations
between the concepts.
5.2.1.1 - Concepts
A system is characterized and defined by a set of parameters associated with it. These
parameters might be facts, actions, trends, restrictions, measurements, etc., depending on the
system's utility, goals and nature. Since the type of FCM parameters varies, they are all described
as different concepts of the system.
Figure 5.1: A simple FCM modelling freeway conditions at rush hour (Shamim Khan, Chong, 2003)
More precisely, each concept is a fuzzy set and hence is described by a membership function.
The membership function returns zero if the concept is disabled and one if the concept is fully
active. In other words, the membership function (as described in Chapter 3) defines the extent of a
concept's existence in a system. For example, the fuzzy set describing the concept "Freeway
congestion" of Figure 5.1 (Shamim Khan, Chong, 2003) can be given by the membership
function of Figure 5.2. The domain of the function is the average speed of all cars on the freeway
over a fifteen-minute interval. The membership degree is one if there is surely heavy congestion
(so the concept "Freeway congestion" is totally true), zero if there is no congestion at all (so each
car can drive at the maximum allowed speed if desired), and a value in between if there is "some"
congestion (the cars still move at a speed above the lower limit).
Figure 5.2: Freeway Congestion fuzzy set
Every concept is represented by a node in an FCM. A node is characterized by an activation state
value A_i, where usually A_i ∈ ℝ. This value is the membership degree of the concept, at a specific
time instance, to its own fuzzy set as described above. So if the node of "Freeway congestion"
possesses a value of 0.9, it is interpreted as heavy congestion on the freeway (close to full
congestion).
The credibility or truthfulness of an activation state value is strongly subjective, since it is
defined by a group of humans with knowledge, experience and expertise in the system's domain
area. Therefore it is probable that their estimations may diverge considerably from each other
while still expressing the same thing. For example, consider the evaluations of the activation state
values in Figure 5.3. The experts' estimations of each concept's activation value differ greatly;
however, the difference between the two concept values is quite similar for both experts. Hence it
could be the case that each expert defined the system's parameters on a different scale. To avoid
such tricky situations, the activation value of a concept is most of the time bounded in the
interval [0, 1].
            Concept 1    Concept 2
Expert 1:      41           50
Expert 2:      3.5          10
Figure 5.3: The definition of the activation state values (of an imaginary FCM) by two experts.
An FCM which describes a system of N variables manipulates the set of the activation state values
of all the concepts with a 1 × N vector A. The "Freeway conditions" FCM example includes five
concepts (C1-C5). In this case the experts should define the activation state values (or activation
levels) of each concept to reflect the real conditions of a specific freeway. Therefore, if they
suggest the initial activation level vector A0 to be:

A0 = [0.2, 0.9, 0.7, 0, 1]

then we can infer that the conditions of the freeway are:
1. The weather is fairly good. (C1 = 0.2)
2. There is heavy congestion on the freeway; the cars have been moving at a relatively slow speed for some time. (C2 = 0.9)
3. The drivers are quite frustrated. (C3 = 0.7)
4. There is no accident on the freeway. (C4 = 0)
5. The drivers are fully aware of the risk they run on the freeway. (C5 = 1)
5.2.1.2 - Weights (Sensitivities)
In the FCM context, weights or sensitivities are the links connecting two nodes. They are directional
and are characterized by a numerical value and a sign. Generally, a weight describes the
cause-effect relationship between two concepts. The causal relations (weights) can be positive or
negative: a positive weight expresses a direct influence relation, whereas a negative one defines
an inverse relation between two concepts. Hence, if two concepts Ci and Cj are related to a
degree defined by the weight Wij, then:
1. If Wij > 0: if the activation level (AL) of the concept Ci, A_i, increases, then the AL of the concept Cj, A_j, will also increase. If the AL decreases, on the other hand, the AL of the effect concept will decrease.
2. If Wij < 0: if the AL of Ci, A_i, is increased, then the AL of Cj, A_j, will be decreased, and vice versa.
3. If Wij = 0: there is no causal relation between the concepts.
FCM models graded causality: a concept can be designed to affect another concept to a degree.
Thus FCM escapes the drawback of bivalent causality from which CMs suffer. FCM causal
relations are fuzzy and hence belong to a fuzzy set. Each of them is assigned a real numerical
value expressing the degree of influence a concept receives from another concept. In other words,
the weight shows how strong and intensive the causal relation is. The value of a weight lies in the
interval [-1, 1]. If the weight is -1 then the causal relation is totally negative (or inverse), and if it
is +1 then it is totally positive (or direct). Anything between zero and -1 or zero and +1
corresponds to various fuzzy degrees of causality. The closer a weight is to zero, the weaker the
relation. As the weight increases towards one, the causal concept affects the effect concept more
intensively, whereas as the weight approaches -1 the causal concept affects the effect concept
more intensively but inversely.
An FCM which models a system with N variables represents the relationships amongst them with
an N × N matrix W (weight matrix). The element Wij represents the weight of the relation between
the concepts Ci and Cj.
Therefore the weights of the example in Figure 5.1 would be represented in a matrix:

W     C1    C2    C3    C4    C5
C1    0.0   0.5   0.0   0.2   0.0
C2    0.0   0.0   1.0   0.0   0.0
C3    0.0   0.0   0.0   1.0   0.0
C4    0.0   1.0   0.0   0.0   0.0
C5    0.0   0.0   0.0  -0.6   0.0
Figure 5.4: The weight matrix of the “Freeway Congestion” example as given in Figure 5.1
which can be analyzed as:
1. W12 = 0.5: Bad weather conditions (C1) contribute to the cause of congestion on the freeway (C2).
2. W14 = 0.2: Bad weather conditions (C1) contribute to a small degree to the creation of an accident on the road (C4).
3. W23 = 1: When there is congestion on the freeway (C2), the drivers will surely become frustrated (C3).
4. W34 = 1: When the drivers are frustrated (C3), there will certainly be an accident (C4).
5. W42 = 1: If an accident happens on the freeway (C4), then congestion follows for sure (C2).
6. W54 = -0.6: The fact that drivers have risk awareness (C5) decreases to an adequate degree the chances of an accident on the freeway (C4).
7. All the remaining weights are zero, indicating the absence of any causal relation between the corresponding concepts.
Consequently, an FCM can be represented by an activation levels vector A and a weight matrix W.
The beauty of this model is that, using only one vector and one matrix, one can describe in
linguistic terms and analyze the relations of the system, due to the fact that the weights can be
represented by fuzzy sets.
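As an illustration, the vector A and the matrix W of the freeway example can be written directly in Python with NumPy (a minimal sketch; the variable names are arbitrary):

import numpy as np

# Initial activation levels A0 of the five freeway concepts C1..C5.
A0 = np.array([0.2, 0.9, 0.7, 0.0, 1.0])

# Weight matrix W of Figure 5.4: W[i, j] is the influence of concept i+1 on concept j+1.
W = np.array([
    [0.0, 0.5, 0.0, 0.2, 0.0],   # C1: bad weather
    [0.0, 0.0, 1.0, 0.0, 0.0],   # C2: freeway congestion
    [0.0, 0.0, 0.0, 1.0, 0.0],   # C3: driver frustration
    [0.0, 1.0, 0.0, 0.0, 0.0],   # C4: accident
    [0.0, 0.0, 0.0, -0.6, 0.0],  # C5: risk awareness
])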
A fuzzy set is defined by a membership function. Membership functions were described
thoroughly in Chapter 3 of this study. FCMs use membership functions as a communication
channel between the linguistic description of the influence level between two concepts and the
numerical value of their relation (weight). The name of the linguistic variable describing a weight
is "Influence". The linguistic values express different levels of "Influence" (e.g. low, medium,
high) and are defined by the designers of the system. Each linguistic value comprises a fuzzy
set whose membership function is defined over the domain [-1, 1].
For example, the designers of the system presented in Figure 5.5 (Papageorgiou, 2004) defined
nine linguistic values for the linguistic variable "Influence". Therefore a weight can be {negative
very strong, negative strong, negative medium, negative weak, zero, positive weak, positive
medium, positive strong, positive very strong}, and the corresponding membership functions are:
μnvs, μns, μnm, μnw, μz, μpw, μpm, μps, and μpvs.
Figure 5.5: Membership functions describing the linguistic variable “Influence”
After defining the fuzzy sets of the "Influence" amongst the concepts, each weight of the system
must be linguistically described (by a certain linguistic value). Thereafter, each weight is
defuzzified (following a defuzzification method, as mentioned in Chapter 3), so that each weight
is characterized by a real value in the interval [-1, 1].
The weights of the example in Figure 5.1 can be described by the linguistic values as shown in
Figure 5.5 and thus:
1. W12 = 0.5
This weight belongs totally (degree of one) to the set positive medium (μpm). Therefore
C1 affects C2 directly (positively) to a medium degree.
2. W14 = 0.2
This weight belongs to two "Influence" fuzzy sets, zero influence (μz) and positive weak
influence (μpw). Therefore C1 generally doesn't really affect C4, but when it does, only
low levels of direct influence are applied to C4.
3. W23 = 1, W34 = 1, W42 = 1
All of them belong to the fuzzy set positive very strong influence (μpvs = 1). Therefore all
the cause concepts (2, 3, 4) will cause the maximum direct effect on the effect concepts
(3, 4, 2 respectively).
4. W54 = -0.6
It belongs to the fuzzy sets of negative strong influence (μns) and negative medium
influence (μnm). Therefore C5 affects C4 inversely to an adequate degree.
5.2.3 - Static Analysis
An FCM can be statically analyzed through four elements: density, relative centrality, indirect and
total effects, and balance degree.
Density
Density is an FCM characteristic which can provide useful information about the dynamic
behaviour of the network. It is calculated from the number of concepts (N) and the number of
causal relationships among the concepts (R) of an FCM, and can be regarded as a metric
describing the network connectivity:

D = R / N²,     Eq. 5.1

Density is a measure of the network's complexity. A larger density indicates a more complex
network, and therefore it is recommended that it be bounded in the range [0.05, 0.3] (Miao, Liu,
2000).
Relative centrality of a concept
Centrality is a measure defining the importance of a concept in the whole FCM system (Kosko,
1986). It is calculated by aggregating the weights pointing to the concept and the weights
leaving the concept:

concept_centrality(Ci) = in(Ci) + out(Ci),     Eq. 5.2

in(Ci) = Σ_{j=1}^{n} |W_ji|,     Eq. 5.3

out(Ci) = Σ_{j=1}^{n} |W_ij|,     Eq. 5.4

where 1 ≤ i, j ≤ N.
Concepts with large centrality play a vital role in distributing causality in the network, and so,
most of the time, they should be analyzed with greater care and detail.
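The two static measures translate directly into a few lines of code; a minimal sketch, assuming the NumPy weight matrix W defined earlier:

import numpy as np

def density(W):
    # Eq. 5.1: ratio of the existing causal relations R to the N^2 possible ones.
    return np.count_nonzero(W) / W.shape[0] ** 2

def centrality(W, i):
    # Eq. 5.2-5.4: sum of the absolute incoming and outgoing weights of Ci.
    return np.abs(W[:, i]).sum() + np.abs(W[i, :]).sum()

For the freeway example, density(W) returns 6/25 = 0.24, which falls inside the recommended range [0.05, 0.3].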
Indirect and total effects
Consider two concepts of an FCM, Ci and Cj, which are connected through M different causal
paths. Every path may be given as a vector (i, k_l1, ..., k_lP, j), where 1 ≤ l ≤ M and P is the
number of concepts participating in the particular path l. Let I_l(Ci, Cj) denote the indirect
effect on Cj caused by Ci through the l-th causal path. Then T(Ci, Cj) represents the total effect
between Ci and Cj:

I_l(Ci, Cj) = min{ W(z, z+1) | z ∈ (i, k_l1, ..., k_lP, j) },     Eq. 5.5

T(Ci, Cj) = max_l ( I_l(Ci, Cj) ),     Eq. 5.6

where z and z+1 are sequential concepts in a specific path.

Therefore the indirect effect of a path between Ci and Cj is the minimum weight along the path,
while the total effect is the maximum indirect effect among all paths between the two concepts.
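Eqs. 5.5 and 5.6 can be transcribed directly into Python (a sketch; the causal paths between the two concepts are assumed to be enumerated beforehand, each as a sequence of concept indices):

import numpy as np

def indirect_effect(W, path):
    # Eq. 5.5: the indirect effect of a path is its weakest link, i.e. the
    # minimum weight between consecutive concepts along the path.
    return min(W[a, b] for a, b in zip(path, path[1:]))

def total_effect(W, paths):
    # Eq. 5.6: the total effect is the maximum indirect effect over all paths.
    return max(indirect_effect(W, p) for p in paths)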
Balance Degree of an FCM
An FCM is considered unbalanced if two concepts are connected through two paths whose
causality is characterized by different signs; in any other case the FCM is balanced. In order to
find the balance degree of an FCM, the numbers of positive and negative semi-cycles in the
FCM must be counted. A semi-cycle of a graph is any cycle which is created by ignoring the
direction of the edges. A positive semi-cycle is a cycle constituted by a number of positive
weighted edges and an even number of negative weighted edges. On the other hand, a negative
semi-cycle is constituted by a number of positive edges and an odd number of negative ones.
A method of calculating the balance degree (Tsadiras, Margaritis, 1999) is:

â = Σ_m p_m f(m) / Σ_m t_m f(m),     Eq. 5.7

or

â = Σ_m p_m f(m) / Σ_m n_m f(m),     Eq. 5.8

where:
p_m is the number of positive semi-cycles of size m
n_m is the number of negative semi-cycles of size m
t_m is the total number of semi-cycles of size m
f(m) is a monotonically decreasing function, e.g. f(m) = 1/m, f(m) = 1/m² or f(m) = 1/2^m
Considering the example of Figure 5.1, there are three semi-cycles and all three of them are
positive:

1. C1 → C2 → C4 → C1 (length m = 3)
2. C2 → C3 → C4 → C2 (length m = 3)
3. C1 → C2 → C3 → C4 → C1 (length m = 4)

Therefore p_3 = 2, p_4 = 1, t_3 = 2, t_4 = 1, and by using f(m) = 1/m the balance degree is:

â = (2 · 1/3 + 1 · 1/4) / (2 · 1/3 + 1 · 1/4) = 1
The closer the balance degree is to one the more balanced the FCM is considered and vice versa.
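Given the semi-cycle counts, Eq. 5.7 is a one-line computation; the sketch below reproduces the example above (the dictionary-based interface is illustrative):

def balance_degree(p, t, f=lambda m: 1.0 / m):
    # Eq. 5.7: p[m] and t[m] are the numbers of positive and total
    # semi-cycles of size m; f is a monotonically decreasing function.
    return sum(pm * f(m) for m, pm in p.items()) / sum(tm * f(m) for m, tm in t.items())

print(balance_degree(p={3: 2, 4: 1}, t={3: 2, 4: 1}))  # prints 1.0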
5.2.4 - Dynamical Analysis of FCM (Inference)
FCMs are widely used as a tool for the dynamic analysis of certain scenarios (involving the
parameters of the system) through time (discrete iterations). Inference in an FCM is expressed by
allowing feedback through the weighted causal edges. Given the weight matrix along with the
initial activation levels vector A0, the FCM "runs" through a number of steps until it converges to
the final activation levels vector A_final, which contains the final stable states of the concepts.
Finally, by analyzing the final state vector, one can retrieve important information about potential
effects caused by a change in the system's parameters.
The FCM inference algorithm is:
Begin
Step 1: Read the initial state vector A0, t = 0
Step 2: Read the weight matrix W
Step 3: t = t + 1
Step 4: For every concept Ci calculate its new activation level A_i^t by:

A_i^t = Σ_{j=1}^{n} (A_j^{t-1} W_ji)     Eq. 5.9

or any other activation function
Step 5: Apply a transformation function to the present activation levels vector:

A^t = f(A^t)     Eq. 5.10

Step 6: If (A^t == A^{t-1}) or (t == maximum iterations)
    A_final = A^t
    STOP
Otherwise GO TO Step 3
End
In the above algorithm, the variable t enumerates the iterations needed until the FCM model
reaches a certain behaviour, which could be either a steady point attractor, a limit cycle or a
chaotic attractor. The vector A^t contains the activation levels of the concepts at the t-th iteration.
FCM allows the concepts to interact with each other to a certain degree specified by the weight
connecting them. The new state value of each concept strongly depends on the state values of the
concepts affecting it. The whole update process of the activation levels can be described as a
simple vector-matrix multiplication:

A^t = f(A^{t-1} W)     Eq. 5.11
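The whole inference loop therefore reduces to a few lines. The sketch below iterates this update with the sigmoid of Eq. 5.14 as transformation function; the choice of λ, the iteration budget and the convergence tolerance are assumptions:

import numpy as np

def fcm_inference(A0, W, lam=1.0, max_iters=100, eps=1e-5):
    # Iterate A^t = f(A^{t-1} W) until a steady point attractor is
    # reached (vector unchanged) or the iteration budget is exhausted.
    A = np.asarray(A0, dtype=float)
    for _ in range(max_iters):
        A_next = 1.0 / (1.0 + np.exp(-lam * (A @ W)))   # sigmoid transform
        if np.allclose(A_next, A, atol=eps):
            return A_next
        A = A_next
    return A

With the freeway example, A_final = fcm_inference(A0, W) returns the steady states of the five concepts.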
The algorithm terminates when one of the following three FCM behaviours is identified:

1. Steady point attractor: The FCM concepts stabilize their states to specific values, which
implies that the activation levels vector is no longer altered through iterations.

Figure 5.6: Steady point attractor. The X-axis represents the iteration number and the Y-axis the value of each concept state.

2. Limit cycle: In this case the FCM model falls into a loop of activation level vectors which
repeatedly appear, after a certain number of iterations, in the same sequence.

Figure 5.7: Limit cycle behaviour

3. Chaotic attractor: The activation levels keep changing through the iterations in a totally
random, non-deterministic way, leading the FCM model to exhibit unsteadiness.

Figure 5.8: Chaotic attractor
The FCM model can be used to answer "IF-THEN" questions about how a change to a subset of
parameters can affect the whole system. Therefore the FCM modelling tool provides a scenario
development framework for retrieving hidden relations between the concepts of a system. To run
a scenario, a particular sequence of steps must be taken. The first step is to define the initial
activation levels of the concepts and the weight matrix, and then introduce them to the FCM
model. The FCM model runs through a number of iterations, allowing the interaction among the
concepts and thereby changing their activation levels according to the FCM inference algorithm.
The inference process ends when the FCM converges to a steady point attractor (any other
behaviour at this point implies that the modelled system suffers from design problems). The
final activation levels vector exposes the present dynamics of the system, based on the experience
and knowledge of the system's designers. More precisely, the resulting activation levels vector can
be used as the base vector for the development of various scenarios examining the system's
responses to different changes.
To build a scenario, a "WHAT-IF" question is defined and the activation levels of the parameters
which compose the premise of the question are locked to the desired values. The inference
process is then initiated again, only this time the activation values of the "locked" concepts are
not updated through the iterations. The FCM inference algorithm terminates when the
FCM model presents one of the three aforementioned system behaviours. However, in this case
the final activation levels vector presents the future dynamics of the system after applying the
changes to its parameters. Further analysis of the results may lead to various conclusions about
the modelled system's future progress under the given conditions (defined by the "locked"
concept values).
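A "WHAT-IF" scenario then only adds the clamping of the locked concepts at every iteration. A minimal sketch, reusing the inference step above (the dictionary-based locking interface is an assumption):

import numpy as np

def run_scenario(A0, W, locked, lam=1.0, max_iters=100, eps=1e-5):
    # `locked` maps a concept index to its clamped scenario value;
    # locked concepts are reset after every update so they never change.
    A = np.asarray(A0, dtype=float)
    for idx, val in locked.items():
        A[idx] = val
    for _ in range(max_iters):
        A_next = 1.0 / (1.0 + np.exp(-lam * (A @ W)))
        for idx, val in locked.items():
            A_next[idx] = val
        if np.allclose(A_next, A, atol=eps):
            break
        A = A_next
    return A

# e.g. "WHAT-IF the weather stays bad?": lock C1 (index 0) to 0.9.
# A_future = run_scenario(A0, W, locked={0: 0.9})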
Activation Functions
Every concept updates its activation level based on the activation levels of the concepts which
affect it and the weights of their relationships. This process is defined by an activation function.
Several activation functions exist:
A_i^t = f( Σ_{j=1, j≠i}^{n} A_j^{t-1} W_ji )     Eq. 5.11

A_i^t = f( Σ_{j=1, j≠i}^{n} A_j^{t-1} W_ji + A_i^{t-1} )     Eq. 5.12

A_i^t = f( k1 Σ_{j=1, j≠i}^{n} A_j^{t-1} W_ji + k2 A_i^{t-1} )     Eq. 5.13
where:
- A_i^t represents the activation level of the concept Ci at iteration t
- n is the number of concepts which affect Ci
- W_ji is the weight value between the concepts Cj and Ci
- k1 is a coefficient which defines to what degree the new state value of a concept is affected by the activation levels of its neighbouring concepts (those defined to affect it)
- k2 is a coefficient which defines to what degree the new state value of a concept is affected by its own present activation level
- f(.) is a transformation function, described further down.
Transformation functions
The activation levels of a concept are defined over the interval [0, 1] or [-1, 1]. However, during
the inference phase, the state value of a concept might exceed this interval (e.g. if a concept
continuously accepts positive effects which are aggregated using the activation function in
Eq. 5.12, then at some point its new activation level might be larger than one). Transformation
functions are used to map the real state value of a concept back to the interval [0, 1] (or [-1, 1]).
Some functions which are widely used to transform activation levels are:

f(x) = 1 / (1 + e^{-λx})     Eq. 5.14

f(x) = tanh(x)     Eq. 5.15

f(x) = { 0 if x ≤ 0 ; 1 if x > 0 }     Eq. 5.16

f(x) = { -1 if x ≤ -0.5 ; 0 if -0.5 < x < 0.5 ; 1 if x ≥ 0.5 }     Eq. 5.17
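The four transformation functions translate directly into vectorized Python (a sketch; the exact placement of the boundary cases in Eqs. 5.16 and 5.17 is an assumption, here x > 0 and |x| ≥ 0.5 respectively):

import numpy as np

def sigmoid(x, lam=1.0):       # Eq. 5.14, maps into (0, 1)
    return 1.0 / (1.0 + np.exp(-lam * x))

def hyperbolic(x):             # Eq. 5.15, maps into (-1, 1)
    return np.tanh(x)

def bivalent(x):               # Eq. 5.16, binary step
    return np.where(x > 0, 1.0, 0.0)

def trivalent(x):              # Eq. 5.17, three-level step
    return np.where(x >= 0.5, 1.0, np.where(x <= -0.5, -1.0, 0.0))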
5.2.5 - Building a FCM
FCM development is traditionally based on human experience and knowledge. Therefore it is
very important for an FCM designer to extract qualitative information from experts in the system's
domain and transform this information into an FCM structure; otherwise the system may not be
adequately modelled and the results might be invalid. Hence a group of humans with proper
expertise in the scientific or industrial area of the system to be modelled is required for building
an FCM. The group of experts is responsible for defining the number and the type of the concepts
which will be represented by the FCM model. An expert is able to identify and distinguish the
concepts which compose the environment of a system and which affect its behaviour. Due to
their experience and interaction with similar kinds of systems, each expert has a specific map of
the relations between the system's concepts. Each of them has a notion about which concepts
affect other concepts and whether the influence relationship amongst them is positive or negative.
They can also describe linguistically the degree of influence between the concepts. FCM
development methodologies exploit this expertise about the system's function and representation
to create an FCM.
Linguistic Description of weights
This FCM development methodology is based on the linguistic description of the graded levels of
the causal system's interconnections (Stylios 1999; 2000). According to this methodology, the
expert or experts, along with the FCM designer, must define the desired linguistic values of the
linguistic variable "Influence" which best describe the causal relations of the system. More
precisely, they must define the linguistic terms expressing the grading of the influence (e.g. weak,
medium, strong) and then assign to each of them a membership function on the domain [-1, 1].
relationships between every pair of concepts. As a first step they define the sign of the causeeffect connection (positive or negative). Then they must linguistically evaluate the influence
degree of the relation using one predefined linguistic value. Each causal relation is expressed by
an IF-THEN rule and therefore every expert must define N number of IF-THEN rules where N is
the number of the existing system’s interconnections. An IF-THEN rule describing a causal
relation has the form:
IF the concept Ci takes the value of A THEN the value of the concept Cj changes to the value of B.
Conclusion (about the weight value)  The influence of Ci upon Cj is D.
The terms A, B and D are linguistic values (A and B describing the state of a concept and D
describing linguistically the influence level). Each expert works individually. Hence, if the
number of experts is E, then there are E linguistic terms describing a specific weight, which are
aggregated using the SUM method (or any other S-norm method). Finally, the real value of each
weight is computed by applying a defuzzification method (e.g. the centre of area method) to the
resulting aggregated membership function.
Group decision support using fuzzy cognitive maps
This method of FCM development strongly requires a group of several experts because, as
mentioned, more than one expert increases the credibility of the resulting FCM model (Kosko,
1988; Khan, Quaddus, 2004; Salmeron 2009). Following this methodology, the experts do not
predefine the number of the participating concepts; rather, each expert works on his own from the
very beginning of the development process, defining the number and type of the concepts and
describing their causal interconnections with real numbers in the range [-1, 1] (grading from
strongly negative influence (-1) to strongly positive influence (+1)). Therefore each expert
"builds" his own FCM based entirely on his own knowledge and experience. Then all the FCM
models proposed by the experts are merged to create the preliminary version of the final FCM.
The merging process is relatively easy, since the proposed weight matrices are averaged to give
the final resulting matrix:

W_ij = ( Σ_{e=1}^{E} W_ij^e ) / E     Eq. 5.18
A variation of this method requires each matrix to be weighted by a coefficient p representing the
credibility of the expert who built the matrix:

W_ij = ( Σ_{e=1}^{E} p^e W_ij^e ) / ( Σ_{e=1}^{E} p^e ), where p ∈ [0, 1]     Eq. 5.19
If the credibility factor is one, then the expert is considered totally credible. The credibility factors
are initialized to one for all the experts. Then a credibility adjustment policy has to be determined
by the FCM designer. For example, in every step, if a weight value W_ij^{e'} differs from all the
rest W_ij^e for 1 ≤ e ≤ E and e ≠ e' (or if a weight value diverges from the mean value of the
whole set of weights), then the corresponding expert is penalized by multiplying his credibility
factor by a penalty factor μ, so that p^{e'} = μ p^{e'} (Stylios, Groumpos, 2004).
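Both merging rules reduce to a (weighted) average over the experts' matrices. A minimal sketch (the stacking of the E matrices into one array and the function name are implementation assumptions):

import numpy as np

def merge_expert_fcms(matrices, credibility=None):
    # matrices: list of E weight matrices of shape (N, N), one per expert.
    Ws = np.stack(matrices)
    if credibility is None:
        return Ws.mean(axis=0)                 # plain average, Eq. 5.18
    p = np.asarray(credibility, dtype=float)   # one factor p^e per expert
    return (p[:, None, None] * Ws).sum(axis=0) / p.sum()   # Eq. 5.19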
Finally, the group of experts examines this preliminary version of the FCM to find groups of
concepts which are identical (in their meaning) yet have different names. For each of these
identified groups, all but one concept are eliminated. Clearly, all the links entering or leaving the
removed concepts are also deleted.
5.2.6 - Learning and Optimization of FCM
In the FCM context, learning and optimization mainly refer to methods which are used to adapt
the weights (of the FCM interconnections) in order to achieve a more precise and credible
representation of the modelled system. All FCM learning algorithms take for granted that the
number and type of the concepts are known and given. The general motivation for developing
such methods is to overcome potential conflicts among a group of experts about one or more
weight evaluations. Furthermore, some (Stach, Kurgan, Pedrycz, 2005; Stach, Kurgan, Pedrycz,
Reformat 2005) argue that human-based knowledge is subjective by nature and therefore
data-driven FCM methodologies should be used instead, to enhance the accuracy and the
operation of FCMs.
5.2.6.1 - Unsupervised Learning
Unsupervised training of FCMs comprises Hebbian-based methodologies. All of them involve the
definition of an initial weight matrix (as suggested by the experts). Starting from the initial weight
matrix and applying iterative methods, which are variations of the Hebbian law, they converge to a
final weight matrix which is considered optimally adjusted to the modelled decision-making or
prediction problem.
Differential Hebbian Learning (DHL) was the first proposed learning method for FCMs
(Dickerson and Kosko, 1994). DHL tries to identify pairs of concepts whose values (activation
levels) are positively or negatively correlated (e.g. if the value of the cause concept increases,
then the value of the effect concept increases as well). The algorithm increases the weight value
of the edge connecting such pairs of concepts. If the changes of two connected concepts are not
correlated, then the algorithm decreases their weight. For every discrete iteration t, the algorithm
assigns the following value to each weight connecting the concept Ci to Cj:

W_ij^{t+1} = { W_ij^t + c^t (ΔA_i^t ΔA_j^t − W_ij^t) if ΔA_i^t ≠ 0 ; W_ij^t if ΔA_i^t = 0 }     Eq. 5.20

ΔA_i^t = A_i^t − A_i^{t-1}     Eq. 5.21

c^t = 0.1 [1 − t / (1.1 q)], where q ∈ ℕ     Eq. 5.22
where A_i^t represents the activation level of the concept Ci at the t-th iteration and ΔA_i^t is the
difference between its activation levels at the present iteration and the previous one (t-1). The
learning coefficient c^t is slowly decreased through the iterations and should ensure that the
updated weights lie in the interval [-1, 1]. The term ΔA_i^t ΔA_j^t simply indicates whether the
changes in the activation levels of the two concepts move in the same or opposite directions. The
weights are updated in every iteration until convergence.
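One DHL iteration can be sketched as follows (the value of q and the loop structure are illustrative; only weights whose cause concept changed are updated, as Eq. 5.20 prescribes):

import numpy as np

def dhl_step(W, A_prev, A_curr, t, q=100):
    # Eq. 5.21: activation changes of all concepts in this iteration.
    dA = A_curr - A_prev
    # Eq. 5.22: slowly decaying learning coefficient.
    c = 0.1 * (1.0 - t / (1.1 * q))
    W_new = W.copy()
    for i in range(len(dA)):
        if dA[i] != 0:                          # Eq. 5.20, first branch
            for j in range(len(dA)):
                W_new[i, j] = W[i, j] + c * (dA[i] * dA[j] - W[i, j])
    return W_new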
Under this learning approach, a weight is updated based only on the changes of the two concepts
participating in that causal relationship. However, the effect concept (of the specific relationship)
might also be affected by other concepts, thus accepting aggregated influence from many sources.
This serious drawback was pointed out by Huerga (2002), who presented the Balanced
Differential Algorithm (BDA), an extension of DHL. BDA defines that each weight update also
depends on the changes occurring in the other concepts which affect the particular effect concept
at that time instance. Therefore the weight connecting concepts Ci and Cj is updated as the change
in their activation levels normalized by the changes (in the same direction) occurring in all
concepts which affect Cj at the same iteration. Compared to DHL this approach presented better
results (Huerga, 2002), but it was only applied to binary FCM models (using a bivalent activation
function).
Another FCM optimization method which utilizes a modified version of the Hebbian law is
Nonlinear Hebbian Learning (NHL) (Papageorgiou et al., 2003; 2005). Four requirements must be
fulfilled in order for this method to be applied: the initial weight matrix (which the algorithm
optimizes) should be defined by the experts; the concepts should be divided into two categories,
input and output; the intervals in which each concept's state value is supposed to lie should be
specified; and so should the desired sign of each weight. Obviously, only the enabled initial
weights (W ≠ 0) are updated during the run of this algorithm, while the rest remain disabled. The
algorithm updates the weights iteratively until a termination criterion is satisfied, which can be
either reaching the maximum number of iterations or the minimization of the difference of the
output concepts' state values (after reaching steady state) between two sequential iterations. Note
that the concepts characterized as output are more important for the system than the others
(according to the experts) and hence deserve more attention on behalf of the designer.
Another unsupervised learning algorithm proposed for FCMs is the Adaptive Random FCM
(ARFCM) (Aguilar, 2001; 2002), which is based on the same learning principle as Random Neural
Networks (Gelenbe, 1989). RNNs generally use the probabilities that a signal travelling from one
neuron to another is positive or negative in order to calculate whether the neuron which accepts
the signals will "fire" or not; to fire, the aggregated signals it accepts must be positive. Based on
this idea, ARFCM divides every FCM weight Wij into two components, a positive component
W+ij and a negative one W-ij. Therefore, whenever the weight between two concepts is positive,
W+ij > 0 and W-ij = 0, while if the weight is negative, W+ij = 0 and W-ij > 0; otherwise
W+ij = W-ij = 0. Each effect concept is characterized by a probability of activation qj, which is
calculated based on the positive and negative influence it accepts from the cause concepts and
their activation probabilities. At each learning step the weights are updated as:

W_ij^t = W_ij^{t-1} + η (Δq_i^t Δq_j^t)     Eq. 5.23

The algorithm terminates when the system converges to the activation levels of the concepts as
defined by the experts.
5.2.6.2 - Supervised Learning
Data-Driven NHL (DDNHL) extends the basic NHL algorithm by allowing weight adaptation on
historical data (Stach, Kurgan, Pedrycz, 2008). The historical data include the initial state values
of the concepts and their corresponding "caused" steady state values. The algorithm updates the
weights, lets the FCM run with the new weights, drawing the initial activation levels from the
database, and then compares the system's final output state values with the correct ones.
5.2.6.3 - Population-based Algorithms
A plethora of population-based algorithms has been proposed for optimizing FCMs, especially in
the last ten years. All of them but one share the same learning goal of finding the optimum weight
matrix which best represents the causal paths of the modelled system. Additionally, the
representation structure of the individuals (either chromosomes or particles) is approximately
identical, with very small variations (e.g. some do not include the zero weights in the
representation). The basic steps of encoding, decoding and evaluating individuals which represent
FCMs are presented in Figure 5.9.
The first population-based algorithm used for FCM optimization was Evolution Strategies
(Koulouriotis et al., 2001). Each chromosome suggests a different connectivity of the modelled
FCM system. The algorithm requires that the domain experts of the system have defined the
output concepts, whose evolution under any scenario is of main interest. Additionally, the
algorithm uses historical paired input/output data points: input data are the initial states of the
concepts and output data are the final steady-state concept values. The fitness function of the
algorithm simply calculates the distance between each output concept's actual steady-state value
and the desired one for every pair of data points. The algorithm terminates if the maximum
number of generations is reached or if the fitness of the best individual is less than a very small
positive number.
The Genetically Evolved FCM comprises an alternative evolutionary approach for multi-objective
decision making, in the sense of satisfying more than one desired concept value at the same time
(Mateou, Andreou, 2008). In the context of a specific desired scenario, the concepts of main
interest are characterized as output and their desired activation levels are defined. Then a
population of randomly initialized chromosomes is evolved in order to find the weight matrix
which generates, through FCM dynamic analysis, the desired output values. The distance between
the generated activation levels of the output concepts and the corresponding desired values is
calculated and aggregated; the fitness value of a chromosome is then inversely proportional to the
sum of these distances. The authors also state that by using this algorithm they manage to handle
the limit cycle phenomenon. The FCM suggested by a certain chromosome, generated by
crossover or mutation, runs, and if the same activation levels (among the output concepts) keep
repeating for a number of iterations, it is categorized as a limit cycle case. After the identification
of a limit cycle case, the weights of the specific FCM are examined one by one until the detection
of the one(s) responsible for the limit cycle phenomenon. As soon as they are identified, these
weights are iteratively changed at random until the FCM converges to steady-state behaviour.
Figure 5.9: General FCM solution representation for individuals participating in population-based
algorithms
Another genetic algorithm, named Real-Coded GA (RCGA), was proposed which also utilizes
historical data to achieve optimized weight adaptation (Stach et al, 2005; 2007). The dataset is
built as follows: for each system example, the initial values and the final (steady state) values of
the concepts are given, as well as all the intermediate state values per iteration (of an FCM
model). The form of a dataset describing the desired evolution of the concept values through
FCM inference is shown in Figure 5.10 (left part). In order to evaluate each chromosome, the
dataset is divided into pairs of cause and effect concept values. Therefore, if Ĉ denotes the set of
the correct activation levels of a system per iteration, the new dataset will be
{Ĉ^t, Ĉ^{t+1}} = {initial concept values, final concept values}, so that Ĉ^t → Ĉ^{t+1}. Hence, for
every candidate FCM (represented by a chromosome), a set of initial values Ĉ^t is presented and
the resulting concept values are compared with the correct final concept values Ĉ^{t+1}. In other
words, if the dataset includes the concept values for K iterations, the K-1 initial values will be
presented to the candidate FCM and K-1 comparisons between the generated final values and the
correct ones will occur. The fitness of a chromosome is then evaluated by aggregating the K-1
distances between the candidate model's steady-state values and the corresponding desired ones.
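The chromosome evaluation can be sketched as below. Assumptions: the chromosome is the flattened weight matrix, C_hat is the K × N array of recorded concept values per iteration, and the fitness is taken as inversely proportional to the accumulated error; the exact normalization in (Stach et al, 2005) may differ:

import numpy as np

def rcga_fitness(chromosome, C_hat, transform):
    # Decode the flattened chromosome into a candidate N x N weight matrix.
    N = C_hat.shape[1]
    W = np.asarray(chromosome).reshape(N, N)
    error = 0.0
    for t in range(len(C_hat) - 1):
        # One inference step from the recorded state C^t ...
        predicted = transform(C_hat[t] @ W)
        # ... compared against the recorded next state C^{t+1}.
        error += np.abs(predicted - C_hat[t + 1]).sum()
    return 1.0 / (1.0 + error)   # higher fitness for smaller error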
Figure 5.10: The activation levels per iteration of a system consisting of seven nodes are
presented in the first part of the figure. The RCGA is then called to optimize a randomly
initialized weight matrix. The optimum FCM, suggested by the fittest chromosome, "runs" and,
in every iteration, the concept state values should be approximately identical to the corresponding
concept values of the historical data.
The algorithm terminates if the fitness of the best chromosome reaches a specific threshold or if
the maximum number of generations is reached.
Besides evolutionary algorithms, Particle Swarm Optimization (PSO) has also been proposed for
optimizing FCMs (Parsopoulos et al., 2003; 2005; Papageorgiou et al, 2005; Petalas et al, 2005;
2009). PSO is also a stochastic heuristic algorithm, and it uses a swarm (population) of particles,
each of which represents a potential solution to the problem being optimized (in this case, the
weight matrix of an FCM model). The particles move in the problem space with a certain velocity
trying to find the optimum position; the position of each particle suggests a solution to the
modelled problem. The idea is that the particle with the best position tends to attract other
particles close to it, thereby forming a bigger neighbourhood of particles around the best position
(best solution). The velocity of each particle depends on various parameters (e.g. the best position
ever visited by any particle up to a specific time point and the best position that the particle itself
has visited). Therefore each particle must "remember" its best position, and each neighbourhood
must keep track of the best position among the particles comprising it.
The authors suggest that the weights should be bounded in intervals defined by experts. Therefore,
for each weight between two concepts, the experts define a minimum and a maximum limit, and
the particles are then left to search within these domains. Once more, the experts must also define
the output concepts, which are more significant than the others. In every iteration of the
algorithm, the velocity of each particle is updated based on its present position, its own best
scored position and the best particle's position in its neighbourhood. Having calculated the
velocity, the new position of the particle can be found (and therefore the particle moves to a new
solution of the problem). The position of each particle is evaluated by a fitness function which
essentially examines whether the final activation levels of the candidate FCM (suggested by the
particle) fall into the bounding regions of each concept, as predefined by the experts. The
termination criterion is the minimization of the fitness function after a certain number of
iterations.
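One canonical PSO update step, applied here to flattened weight matrices, can be sketched as follows. The inertia w and the acceleration coefficients c1, c2 are illustrative defaults, and the weights are clipped to [-1, 1] rather than to the expert-defined intervals used in the cited works:

import numpy as np

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    # Velocity: inertia plus attraction towards the particle's own best
    # position (pbest) and the neighbourhood's best position (gbest).
    r1, r2 = np.random.rand(*x.shape), np.random.rand(*x.shape)
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x_new = np.clip(x + v_new, -1.0, 1.0)   # keep candidate weights bounded
    return x_new, v_new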
A GA for FCMs was introduced by Khan and Chong (2003) which diverges from the other
algorithms of the same family in its learning goal. This is the only case of an FCM
optimization/learning algorithm where the learning objective is not the adaptation of the weight
matrix but the detection of the initial activation levels which caused a certain effect (on the
concepts). Therefore, given the final steady-state values of the concepts of a system, the algorithm
applies GA operators to find the initial conditions A0 which caused the specific effects.
5.2.7 - Time in FCM
The FCM comprises a dynamic modelling tool which can capture future states of a system under
certain circumstances; FCMs therefore exhibit a temporal aspect, due to the presence of feedback
in their dynamic operation. However, an FCM model "runs" as a discrete simulation. After being
presented with an initial set of activation levels, the model is free to iterate through several
concept states until it converges to the final steady states which represent the behavioural
response of the system to the initial conditions. The set of the activation levels at each iteration
represents a snapshot of an intermediate system state between the initial conditions and the final
converged state. Unfortunately, the FCM model is unable to define the time period in which all
these changes will occur. As a result, the model does not give any information about how long it
will take the system to reach the final steady-state behaviour given that the specific initial
conditions apply. The model is only in a position to answer "WHAT-IF" questions, while it would
be of the greatest importance if it could also answer "WHAT-and-WHEN-IF" questions for
scenario building and decision making in several domains (e.g. political/economic sciences).
Although introducing time dependencies into the FCM model is a very challenging extension of
the classic FCM, there is no equivalent interest in the research community; only a few researchers
have been involved in this field. The first approaches to this subject proposed that modelling time
for FCMs should be done in discrete space, stating that introducing time in continuous space is a
very difficult task (Hagiwara, 1992; Park, Kim, 1995). The proposed models require experts to
define a discrete time unit (e.g. one month, one year, and so on) with which they must afterwards
describe every causal relationship for every pair of concepts. If a relationship is assigned more
than one time unit (e.g. 6 months) then dummy nodes are added between the corresponding pair
of concepts, and the weight of each newly added edge is the m-th root of the former weight,
W_ij^{1/m}, where m is the number of causal links between the two original concepts and W_ij is
their former weight (before adding the dummy nodes). They assume that initially a concept Ci is
stimulated. At the next iteration, the indirect and total effects on its neighbours (effect concepts)
Cj are calculated and they become stimulated. Therefore, at each iteration n, the indirect and total
effects are calculated for the concepts which are n edges away from the very first stimulated
concept (following causality paths). Finally, the total effects are given for each concept, and
therefore one can infer the type of causality on a certain concept. The process is quite
complicated, and the authors do not clarify how the final states of the concepts could be obtained
using this kind of dynamic analysis.
The next attempt at representing time relations in FCMs was the Rule-Based FCM (RBFCM)
(Carvalho, 2001). According to the authors, the RBFCM is more flexible than classic FCMs in
encoding time relations between the concepts, by using fuzzy rules describing each causal
relationship. Once more, the experts must define the time unit which will be represented by an
iteration (e.g. one iteration could be one month, ten days, etc.), which means that the RBFCM
models the time dimension in discrete space as well. Then the experts use fuzzy rules to describe
the causal relationships amongst the concepts. A fuzzy rule should define the type of the effect on
a concept (positive or negative) and its strength, which will appear within the time period defined
by the predefined time unit. During the FCM inference process, an iteration represents one time
unit. Therefore, during inference, partial effects (as defined in the fuzzy rules) will be applied on
the concepts, and these will be aggregated into the total effect as soon as the model converges.
Experts should keep in mind the time unit represented by an iteration when defining the strength
of the effect in a causal relation. For example, consider two concepts, C1 and C2, describing the
greenhouse effect and the ice melting percentage respectively. If the defined time unit is one day,
then the strength of the effect on C2 (caused by C1) should be much smaller than when the time
unit is 10 years! The author refers to the time period represented by an iteration as Base-Time.
All the causal relations of the model are defined based on the B-Time. The B-Time variable is
decided by the designer and the experts of the system, depending on the type of the system being
modelled and the potential scenarios they would like to examine on it. If the designer wishes to
examine the system's behaviour over a long-term period, then a large B-Time (e.g. 1 year) should
be defined, while if a short-term reaction of the system to a change is desired, then a small B-Time
(e.g. 1 day) is suggested. Additionally, the authors state that if the system exhibits chaotic
behaviour, then the B-Time should be decreased, the experts should reconsider the fuzzy rules,
and the model should be allowed to "rerun".
The proposed model also tries to deal with the time delay often observed in cause-effect systems.
Many times, although a change happens to a cause concept, the effect concept reacts only after a
certain time period (not immediately). This phenomenon is called dead-time; it begins at the exact
moment a change happens to a cause concept and ends at the moment the state of the
corresponding effect concept changes (from its previous state value). A very well known real-life
example of this phenomenon is the time delay observed in the changing of the oil price at gas
stations after the price per barrel changes. To encode this phenomenon, the author suggests that a
number of iterations should pass before the effect concept changes its state value for the first
time. The number of "delay" iterations is equal to the time delay (given as a natural number)
divided by the B-Time (time delay / B-Time). The experts are also responsible for defining the
time delay (if any) for every causal relationship.
Temporal FCMs (tFCM) also address time dependencies in FCMs (Zhong et al, 2008). In this
model, each causal relationship is described by two different linguistic variables. The first one is
the set of time phases characterizing the particular relationship, and the other one is the type of
the effect. The time phases of a causal relationship, S, essentially describe the time length of an
effect's duration, and they are given by linguistic values; for example, a set of linguistic values
could be S^n ∈ {short term, medium term, long term}. Each one of them comprises a fuzzy set,
and therefore its membership function μ_ij^n(t) must be defined (Figure 5.11). The type of the
effect is defined as a number γ_ij ∈ [-1, 1], which in turn comprises a singleton membership
function with membership degree one. Then fuzzy rules are defined by the experts for every
causal relation between two concepts Ci and Cj as:

{ R^n | R^n : S^n → γ_ij^n }     Eq. 5.24
where S^n denotes the n-th time set (defined for this relation) and γ_ij^n is a singleton number
defining the strength of the effect, assigned to the n-th time phase. Hence, considering the
example of Figure 5.11, three singletons should be assigned, one to every time phase: γ_ij^ST,
γ_ij^MT and γ_ij^LT.
Figure 5.11: Three membership functions describing the three time phases of a causal
relationship (μ_ij^ST(t), μ_ij^MT(t), μ_ij^LT(t)). The time unit in this model is one month.
Each iteration is considered to count as one time unit (in Figure 5.11 it is one month). Therefore,
at each iteration t, the causal relation belongs to a degree to one or more time phases (given by
μ_ij^n(t)), and so modus ponens is applied to move from the firing premises (of the fuzzy rules)
to the effect degree (the conclusions of the rules). Then, since a concept might accept effects from
many other concepts, fuzzy logical operators (AND or OR) are used to aggregate the partial
effects.
Finally, the Center of Gravity defuzzification method is used to calculate the total effect h_ij(t)
which the effect concept Cj accepts from the cause concept Ci at that certain iteration t:

h_ij(t) = Σ_n γ_ij^n μ_ij^n(t) / Σ_n γ_ij^n     Eq. 5.25

(At this point I would like to note that there could be a typo in Equation 5.25 as written in the
paper. Since the Center of Gravity method is used, the form of Equation 5.25 should be:
h_ij(t) = Σ_n γ_ij^n μ_ij^n(t) / Σ_n μ_ij^n(t).)
5.2.8 - Modified FCM
There are also some proposed FCM models whose structure, function or goal diverges from the
canonical FCMs. For example, there is lately a trend of using FCMs as classification machines,
and two modified FCMs have been proposed under the umbrella of this goal. The first FCM
classification model, named FCMper, consists of two layers, an input and an output layer
(Papakostas et al, 2008). The input concepts represent the set of features affecting the object to be
classified, which is represented by the nodes of the output layer. The input concepts are not
allowed to affect each other, while the output concepts (which represent the potential
classes/labels of the dataset) are connected to each other with negative weights, allowing
competition between the classes.
Two hybrid FCMpers were also proposed, in cooperation with trained ANN classifiers (e.g. an
MLP). The first hybrid FCMper comprises a trained ANN and a typical FCMper: the features of
the problem are presented to the trained neural network, and the state values of the neural
network's output nodes are then inserted into the FCMper as input nodes. Each output value of
the neural network can be regarded as a degree of belief for a specific class (of the problem). The
second hybrid FCMper works just like the previous one, only the features of the problem are also
inserted into the FCMper (not only the degrees of belief for each class). All FCMper structures
(typical and hybridized) are presented in Figure 5.12. The efficiency of all three models, along
with a classic neural network (the specifications of the NN are not mentioned), was tested
(Papakostas, Koulouriotis, 2010) on the Two Spiral problem. The hybrid FCMper 3 achieved
better results than all the other FCMpers and the neural network, but still not to a satisfying
degree.
Figure 5.12: (a) Typical FCMper, (b) Hybrid FCMper 1, (c) Hybrid FCMper 2
The second model fully adopts the structure of the feedforward network, as described in Chapter
2, and also employs the backpropagation learning algorithm; it is called the Fuzzy Rules
Incorporated FCM (FRI-FCM) (Song et al, 2010; 2011). Actually, this model diverges to a large
degree from the basic structure and inference process of FCM models; it can be thought of more
as a fuzzified MLP than as a modified FCM. The model assumes that there are certain fuzzy rules
which describe the specific classification problem, in the form:

If feature_1 = A and feature_2 = B then the causality with yclass = C

where A, B and C are linguistic values which must be defined for each feature. The network
comprises five layers. The input layer nodes are the features of the problem. The second layer is
called the antecedent layer; its nodes represent the linguistic values describing the input features,
and therefore each node symbolizes a fuzzy set. They accept the crisp values of the input layer
nodes and output a fuzzified version of them through a Gaussian function, which is the
membership function characterizing the corresponding fuzzy set. The next layer is the Rule
Layer, where each node represents a rule which accepts the truth degree of each antecedent
member and returns the degree to which the whole antecedent part of the rule is true (using a
Gaussian function). The Consequent layer nodes are the linguistic values of the output variables
(classes), represented again by Gaussian functions. The output of each Consequent layer concept
is the defuzzified crisp value of the fuzzy set as formed by the firing of the rules. Finally, the
output nodes (of the last layer) aggregate the weighted defuzzified crisp values which describe
their states. The network is trained over several examples describing the problem being
modelled, using the backpropagation method. The model was tested on the IRIS data and its
success rate was a bit lower than that of the FCMpers (hybrid version 3).
Figure 5.13: Basic structure of the FRI-FCM
The next model incorporates the experts' hesitation about their own weight estimations into the
FCM inference process. Besides the influence weight of each causal relationship, the expert must
also define a hesitation degree, expressing how "insecure" he feels about the specific estimation
(Iakovidis et al, 2011). The FCM model therefore remains the same except for the activation
function, which is altered to:

A_i = f( A_i + Σ_{j≠i} ( w_ji^{influence} A_j − w_ji^{hesitation} A_j ) )     Eq. 5.26
5.3 - Issues about FCMs
Fuzzy Cognitive Maps were proposed as a model for incorporating, representing and adapting
human knowledge and experience, in order to uncover hidden causal relationships and influence
paths between the concepts of a system, and thereby to be able to examine the system's dynamics
under certain circumstances.
An FCM represents a dynamic system that evolves over time. Time in an FCM is modelled by
discrete iterations. In every iteration the state value of every concept is updated according to
the aggregated effects it receives from other concepts. The concepts are connected with directed
links, and so virtual causal paths are created, transferring weighted effects from one concept
to another even if they are not directly connected, allowing feedback to occur. After a number
of iterations the system will converge, meaning the states of the concepts will stabilize to
certain values which remain unchanged through time.
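As an illustration, the following minimal Python sketch runs this discrete inference on an invented three-concept map until the state vector stabilizes. The sigmoid activation function and all weight values are assumptions made only for demonstration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_fcm(A, W, tol=1e-5, max_iter=200):
    # W[j, i] is the causal weight from concept j to concept i.
    for it in range(max_iter):
        A_next = sigmoid(A + W.T @ A)
        if np.max(np.abs(A_next - A)) < tol:   # steady state reached
            return A_next, it + 1
        A = A_next
    return A, max_iter                          # no convergence (possible limit cycle)

W = np.array([[0.0,  0.7, -0.3],
              [0.0,  0.0,  0.5],
              [0.6,  0.0,  0.0]])
A0 = np.array([0.8, 0.1, 0.1])
steady, iters = run_fcm(A0, W)
print(steady, "after", iters, "iterations")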
Their biggest advantage is the simplicity of their operation, combined with the great benefit of
using FCMs to represent, utilize and infer from human knowledge and experience in systems which
are complex and complicated in their relations. Furthermore, they offer the potential of
modelling the evolution of real systems' behaviour. Given an FCM system, the user is able to
observe, examine and test different scenarios by changing and "locking" specific concept values
and allowing the network to "run". The steady states of the concepts can then be analyzed to
infer or reach certain decisions related to the system and the initial conditions presented to
it. Not only that, but the weights between the concepts have essential meaning; they can be
linguistically defined and explained, and therefore the causality flow can be more easily
understood and used to strengthen final conclusions. These advantages make the FCM an
appropriate tool for analyzing systems which are characterized by cause-effect relationships.
Such systems are constituted by a large number of concepts and relations. Experts might know
both the concepts and the relationships amongst them but, due to the size of the system, they
are not able to use their own knowledge and experience to "predict" or infer the potential
resulting system behaviour after one or more changes in the system's parameters. Instead, an FCM
can be used to exploit the experts' cumulative knowledge and experience, by allowing a dynamic
representation of the system which in turn can be used to investigate, examine and test
potential future behaviours of the system states by tuning the initial system's conditions.
Consequently, the FCM can be utilized as a decision support and system designer's consultant
tool. Such a tool is very useful in domains where a change in a system's parameter might cause
an unwanted and harmful effect on other system parameters or on the overall system's behaviour.
Many times, changing a system's variable involves the risk of a negative impact on the system. A
change to a system means stepping from known conditions into the unknown. Therefore, in order to
take the decision to apply a change, the possible resulting "unknown" conditions must be
evaluated by weighing the potential positive or negative effects of the change on the system.
Unfortunately, neither an "undo" button for real-life actions has been invented (yet!), nor is
travelling back in time, to fix the initial "move" which caused the unwanted result, possible
(yet!). FCMs comprise a software modelling tool which can be used in that direction, avoiding
driving the parameters of the system to undesirable, non-reversible states. In other words, FCMs
provide a "safety pillow" for decision making, since the users may simulate the system following
their scenario and examine how it responds to changes in the parameters through dynamic analysis
of the FCM system.
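A similarly minimal sketch of this scenario testing, again with invented weights and a sigmoid activation, "locks" one concept to a chosen value and lets the rest of the map settle to a steady state:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_scenario(A, W, locked, tol=1e-5, max_iter=200):
    # `locked` maps concept index -> clamped activation value.
    for idx, val in locked.items():
        A[idx] = val
    for _ in range(max_iter):
        A_next = sigmoid(A + W.T @ A)
        for idx, val in locked.items():   # re-apply the lock every iteration
            A_next[idx] = val
        if np.max(np.abs(A_next - A)) < tol:
            break
        A = A_next
    return A_next

W = np.array([[0.0, 0.5], [-0.4, 0.0]])   # toy two-concept map
print(run_scenario(np.array([0.5, 0.5]), W, locked={0: 1.0}))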
However, as mentioned in a previous section of this chapter, FCMs in their canonical form are
deficient in modelling real time. In the dynamic analysis of an FCM model there is no indication
of the time period over which the whole process occurs, how much time an effect will take to
appear, or how much time the total system needs to stabilize after one or more changes to its
parameters. In a real system (social, political, economic, etc.), when the state of a parameter
is changed, all the parameters affected by it react in different ways and at different times,
and each effect takes a different time period to appear in full. Therefore, introducing the time
dimension into FCMs is of vital importance. The time needed for the total effect to be expressed
in a concept (system parameter) after a change to another concept can be analyzed into two
elements (time parameters):
1. Dead Time (DT)
2. Reaction Time (RT)
DT is defined as the time period needed for a concept of the system to begin changing its state
after the change of the state of another parameter. This time period is initiated by the change
of the cause concept's state and is terminated as soon as the effect concept's state is altered.
For example, consider the concept of oil exploration by a country (C1), the concept describing
the standard of living of the people in that country (C2), and the causal relation C1 → C2.
Suppose C1 is zero, meaning the country is not active in oil exploration, and one day the
country starts exploring its oil reserves; then C1 will change to one. Obviously the standard of
living of the inhabitants of that country will also change, but not right after the change in
the cause concept: for a certain time period after the country's first attempt to explore oil
reserves, the standard of living will remain the same. Additionally, the change to the standard
of living won't appear in full at once, but rather gradually through time. This is the reaction
time, which describes the time needed for a parameter to present the total effect derived from a
change to another concept-parameter. One concept may change its value slower or faster than
another. Reaction time follows dead time: it begins when the concept changes its value for the
first time and ends when the concept stabilizes again to a certain value. Therefore, both DT and
RT must be incorporated into the FCM model to represent more realistically the evolution of a
system through time. As described in a previous section of this chapter, there have been some
attempts at modelling time in FCMs, but the first three proposed methods assume that the partial
effect has equal strength in every iteration (which counts as a time unit) and act in a discrete
time space, which is also used by (Zhong et al, 2008) to model the strength of an effect in
different predefined time periods, ignoring dead time. The latter is a more realistic approach
but is still inflexible in adequately representing the change of a concept's state through
continuous time for different kinds of systems, and it fails to incorporate dead time.
Additionally, none of these methods has been tested on real systems; rather, each is applied to
a single virtual system (different in each case) with few concepts, and only one scenario is
tested.
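As a purely illustrative sketch (not an established FCM extension), one possible way to encode these two time parameters for a single causal edge is to suppress the effect during the dead time DT and ramp it up gradually over the reaction time RT:

def effect_fraction(t, dt, rt):
    # Fraction of the total effect expressed t time units after the
    # cause concept changed: zero inside the dead time, a linear ramp
    # during the reaction time, and the full effect afterwards.
    if t <= dt:
        return 0.0
    if t >= dt + rt:
        return 1.0
    return (t - dt) / rt

# Oil-exploration example: DT = 3 years before living standards react,
# RT = 5 more years until the full effect appears (numbers invented).
for t in range(0, 10):
    print(t, round(effect_fraction(t, dt=3, rt=5), 2))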
Another FCM aspect is weight adaptation, in terms of applying learning or optimization. In most
FCM applications, experts are used to identify and describe the concepts constituting a system
and their interconnections. However, there exists a view that causal relationships should not be
defined by humans because, although they might have expertise in the domain, they are still
subjective and can make mistakes (under tiredness, pressure, etc.), and that therefore automatic
methods should be used to develop FCM models (Stach, Kurgan, Pedrycz, 2005). However, the FCM
model was first proposed, and has been widely used ever since, because it can represent and
manipulate human knowledge and experience. Although there has been fast and impressive growth in
machine learning algorithms, introducing some features of learning intelligence to machines,
humans can still learn, remember and accumulate knowledge, and, based on that, they can also
infer much better. There are cases where some humans are considered very important, one might
even say almost indispensable, due to their extensive devotion to researching or working in a
certain area. Their knowledge contribution to a system's domain is very strong and vital. The
FCM model offers the ability to represent information coming from experts in a systematic way,
creating a knowledge and experience "backup" which can be used either by the experts themselves,
to reveal seemingly unknown causal paths and relations amongst the concepts of the system, or by
any other people who would like to take advantage of external expert knowledge in inferring
about the particular system.
Therefore, extracting information about the system from the experts appropriately is crucial.
The experts must define the set of concepts (out of many) which describe the system properly,
together with all their causal interconnections. Some methodologies have been suggested, as
described in a previous section, but further work must be done on using insights from psychology
(in building questionnaires or properly framing the questions during an interview).
However, learning and optimization algorithms might be used to enhance FCM modeling as
methodologies supplementary to expert-based knowledge. Furthermore, there are cases where no
experts are available. In such cases the FCM can be built by adapting its weights with a
data-driven machine learning algorithm, as described above. However, only two machine learning
algorithms proposed for training FCMs have so far been applied to real systems: the NHL, applied
in (Papageorgiou et al, 2008; Kannappan et al, 2011), and the RCGA, applied in (Stach et al,
2005), which both optimize the weight matrix of an FCM, in a different manner. NHL applies a
variant of the Hebbian rule, while RCGA comprises a genetic algorithm for FCMs whose fitness
function is defined over historical data describing the expected state of every concept in every
iteration. General Hebbian rule-based learning methods have been compared to RCGA in (Stach et
al, 2010), where the experimental results show that RCGA achieves better modeling of the FCM
weight matrix than the others. However, RCGA assumes the existence of historical data to work
properly; not only that, but for each example the activation values of each concept at each
iteration must be known. Creating this kind of dataset for complex cause-effect systems is, most
of the time, not feasible. Therefore, further research must be done in this area in order to
provide efficient and practical learning algorithms for FCMs.
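Purely as an illustration of the Hebbian flavour of such methods, the sketch below applies a generic Hebbian-style update with weight decay to a toy weight matrix. It is not the exact NHL rule, whose formulation is given in (Papageorgiou et al, 2003); the learning rate eta and decay factor gamma are invented.

import numpy as np

def hebbian_step(A, W, eta=0.05, gamma=0.98):
    # Strengthen w_ji when the source activation A_j and the target
    # activation A_i co-occur; gamma slowly decays unused weights.
    W_new = gamma * W + eta * np.outer(A, A)
    np.fill_diagonal(W_new, 0.0)       # no self-influence
    return np.clip(W_new, -1.0, 1.0)   # keep weights in [-1, 1]

W = np.array([[0.0, 0.4], [0.2, 0.0]])
A = np.array([0.9, 0.3])
print(hebbian_step(A, W))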
Further study must be done in the FCM research field on identifying and handling different kinds
of synergies between the concepts. The strength of an effect on a specific concept C1 might
change if, at the same time, another concept C2 receives a certain type of effect (so the first
effect could be enhanced or weakened). Methods of defining and manipulating synergies amongst
concepts during the dynamic analysis of an FCM would therefore allow FCMs to offer more
realistic modeling of various systems. It would be even more interesting to investigate how
these synergies vary in the context of an FCM with real-time causal relations.
Finally, more attention must be paid to defining methods of detecting, analyzing and resolving
structural and functional faults of an FCM model which can lead the modeled system to unstable
behaviours through inference (limit cycles and chaotic attractors).
Chapter 6 - Discussion
In this study the main focus areas comprising Computational Intelligence (Artificial Neural
Networks, Fuzzy Logic, and Evolutionary Computation) have been presented, along with some of
their features, classic paradigms and algorithms. It has been described how CI models encompass
knowledge extraction, learning, adaptation, optimization and generalization based on knowledge.
All these attributes are present in nature, in individuals as well as in populations of
individuals. That is why the development of CI methods is mainly driven by natural processes and
behaviors.
The success of a novel technology is measured by the range and quantity of the applications
which employ it. CI techniques find application in many areas; some of them are medical,
biomedical and bioinformatics problems, decision support, optimization, stock market and other
time-series prediction, generic data clustering and classification, engineering, economics and
finance, as well as project scheduling.
Additionally, another great advantage of CI methods is the ease with which they can be
hybridized with other methods, drawn either from the CI area or from more conventional
mathematics. Therefore many applications employ hybrid CI models combining different CI areas
(e.g. ANN with GAs, or Fuzzy Logic with ANN, etc.).
Overall, CI techniques have turned out to be very popular, since their stochastic and emergent
properties allow them to find approximate or exact solutions to problems whose solution
landscape is not well understood, or is very complex and difficult to deal with using classic
mathematical methods. As life becomes more complex and demanding at both the private and the
professional level, it becomes more crucial to solve the computational issues which already
exist or arise in humans' everyday lives. CI research studies heuristic approaches to solving
such problems, which cannot be easily solved by other algorithms. Hence, it seems that the CI
area evolves along with human problems and questions, creating room for further research into
providing more efficient, stable and theoretically established algorithms in the CI field.
Two of the topics presented in this study attract much interest in their applications. GAs are
one of the most popular Evolutionary Computation techniques, while Fuzzy Cognitive Maps gain
growing attention due to their practicality and broad domain applicability.
Some of the GA application areas are chemistry, telecommunications, economics, engineering
design, manufacturing systems, medicine, physics, scheduling, bioinformatics and other related
disciplines. GAs have become very popular for approximating solutions in problem domains whose
fitness landscapes are complex, discontinuous and noisy, with many local optima, under specific
constraints. The fact that they evolve potential solutions of a problem in parallel, saving time
and avoiding any bother with derivatives, makes them practical, easy to develop, and adjustable
to different problems with very little effort. GAs can explore a problem's solution space in
more than one direction at the same time, through a population of chromosomes which behave as
artificial agents searching all over the space for the optimum combination of the problem's
parameters giving the "best" fitted solution. Additionally, each chromosome in the context of
GAs can be regarded as a sample of a larger set of spaces. For example, the chromosome 100010
can be considered as a sample from the space 1**0** (where * is a wild card character, either
one or zero), the space **001*, the space 1*0*1*, and so on. Therefore, through the generations,
a GA manages to evaluate the average fitness of a larger set of solution spaces than the actual
number of chromosomes. This advantage was first stated by Holland, who introduced GAs, and is
called the "Schema Theorem" (Holland, 1992; Mitchell, 1996). Combining the Schema Theorem with
their ability to function in parallel, GAs are capable of effectively and efficiently searching
for solutions in large, vast problem spaces in a relatively reasonable amount of time.
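A tiny Python sketch makes the schema idea concrete, using the example chromosome and schemata from the text ('*' is the wild card):

def matches(chromosome, schema):
    # A chromosome is a sample of a schema when every fixed position agrees.
    return all(s == '*' or s == c for c, s in zip(chromosome, schema))

chromosome = "100010"
for schema in ("1**0**", "**001*", "1*0*1*"):
    print(schema, matches(chromosome, schema))   # all True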
Genetics is a rapidly evolving area, creating great potential for evolving GAs by incorporating
new biologically inspired extensions (e.g. new genetic operators, employing different types of
sexes among chromosomes, etc.). Any improvement in the GA area towards better optimization is
very important, given that GAs have already been employed in so many applications across a large
set of domains. Furthermore, policies for selecting the fitness function, population size, and
mutation and crossover probabilities must be established, so as to balance exploration against
exploitation and avoid premature convergence of the GA.
Fuzzy Cognitive Maps are also positioned in the CI area, since they combine attributes of ANN
structure and information processing with Fuzzy Logic knowledge representation. Using the FCM
model, one may perform a qualitative simulation of a system and experiment with the model by
running different scenarios, changing one or more parameters of the system. Successful analysis
of complex systems which are characterized by causality in the relationships between the
system's parameters depends highly on correct knowledge representation. FCMs allow users to
reveal influence paths hiding behind the already known causal relationships of the concepts, and
to take advantage of the causal dynamics of the system in predicting potential effects under
specific initial circumstances. The fact that the concepts and their interconnections can be
represented and explained in linguistic terms helps users to tune, analyze and better understand
the causal paths, as well as the results given after an FCM inference simulation. That is why
FCMs comprise an attractive system modeling methodology: they can appropriately utilize human
knowledge, in human language, through fuzzified concepts and weights. The modeling and
simulation power of FCMs is reflected in the plethora of application fields, such as social and
political systems, distributed group-decision support, control systems, medical diagnostic
systems, ecology, agriculture, engineering, etc. Essentially, FCMs are able to represent many
different domain variables (causal events, objectives, trends, behaviours, situations, etc.) in
one model, unravel their cause-and-effect relations, and show how each of them is affected by
all the others.
Further research in FCMs should improve their ability to help humans make wiser and more
accurate decisions, by presenting them with the evolution of the modeled system after a set of
changes and the final future consequences of their possible choices (on the system's
parameters). The research directions in this area were described in Chapter 5. Some of the open
issues concerning FCMs are the introduction of time dependencies in the causal relationships of
the model, the development of efficient learning algorithms whenever data is available, FCM
structure and stability analysis, and the establishment of expert knowledge extraction methods
or methods of extracting causal weights from an existing fuzzy rule base.
In conclusion, CI is an area which gives the researcher the opportunity for a deeper
understanding of natural and human developmental, cognitive and evolutionary processes, as well
as the excitement of utilizing these features as puzzle pieces in building a new world of
technology based on observations, natural language, human reasoning and knowledge extraction.
Essentially, CI is an area that incorporates partial features of human and natural intelligence
into machines for solving difficult real-life problems.
References
AGUILAR, J., 2005. A survey about fuzzy cognitive maps papers (Invited paper). International
Journal of Computational Cognition, 3(2), pp. 27-33.
AGUILAR, J., 2002. Adaptive Random Fuzzy Cognitive Maps, Proceedings of the 8th Ibero-American Conference on AI: Advances in Artificial Intelligence 2002, Springer-Verlag, pp. 402-410.
AGUILAR, J., 2001. A Fuzzy Cognitive Map Based on the Random Neural Model. 2070, pp.
333-338.
AIZERMAN, A., BRAVERMAN, E.M. and ROZONER, L.I., 1964. Theoretical foundations of the
potential function method in pattern recognition learning. Automation and Remote Control,
25, pp. 821-837.
ALPAYDIN, E., 2004. Introduction to Machine Learning (Adaptive Computation and Machine
Learning). The MIT Press.
AXELROD, R.M., 1976. Structure of decision: the cognitive maps of political elites. Princeton
University Press.
BÄCK, T. and SCHWEFEL, H., 1993. An overview of evolutionary algorithms for parameter
optimization. Evol.Comput., 1(1), pp. 1-23.
BEN WANG, XINGSHE ZHOU, GANG YANG and YALEI YANG, 2010. Web Services
Trustworthiness Evaluation Based on Fuzzy Cognitive Maps, Intelligence Information
Processing and Trusted Computing (IPTC), 2010 International Symposium on 2010, pp. 230-233.
BENTLEY, P.J., 2007. Digital Biology: How Nature Is Transforming Our Technology and Our
Lives. Simon & Schuster.
BERTOLINI, M. and BEVILACQUA, M., 2010. Fuzzy Cognitive Maps for Human Reliability
Analysis in Production Systems. 252, pp. 381-415.
BEYER, H. and SCHWEFEL, H., 2002. Evolution strategies - A comprehensive introduction.
Natural Computing, 1(1), pp. 3-52.
BISHOP, C.M., 2006. Pattern Recognition and Machine Learning (Information Science and
Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc.
BOUTALIS, Y., KOTTAS, T.L. and CHRISTODOULOU, M., 2009. Adaptive Estimation of Fuzzy
Cognitive Maps With Proven Stability and Parameter Convergence. Fuzzy Systems, IEEE
Transactions on, 17(4), pp. 874-889.
BUENO, S. and SALMERON, J.L., 2009. Benchmarking main activation functions in fuzzy
cognitive maps. Expert Systems with Applications, 36(3, Part 1), pp. 5221-5229.
BURGES, C.J.C., 1998. A Tutorial on Support Vector Machines for Pattern Recognition. Data
Mining and Knowledge Discovery, 2, pp. 121-167.
CARVALHO, J.P., 2010. On the semantics and the use of Fuzzy Cognitive Maps in social
sciences, Fuzzy Systems (FUZZ), 2010 IEEE International Conference on 2010, pp. 1-6.
CARVALHO, J.P. and TOME, J.A.B., 2004. Qualitative modelling of an economic system using
rule-based fuzzy cognitive maps, Fuzzy Systems, 2004. Proceedings. 2004 IEEE International
Conference on 2004, pp. 659-664 vol.2.
CARVALHO, J.P. and TOMÉ, J.A.B., 2001. Rule based fuzzy cognitive maps - Expressing time in
qualitative system dynamics, 2001, pp. 280-283.
CARVALHO, J.P. and TOME, J.A.B., 1999. Rule based fuzzy cognitive maps and fuzzy cognitive
maps - a comparative study. Annual Conference of the North American Fuzzy Information
Processing Society - NAFIPS, pp. 115-119.
CHERKASSKY, V. and MULIER, F., 1998. Learning from Data: Concepts, Theory, and Methods
(Adaptive and Learning Systems for Signal Processing, Communications and Control Series).
Wiley-Interscience.
CHO, S., 1999. Pattern recognition with neural networks combined by genetic algorithm. Fuzzy
Sets and Systems, 103(2), pp. 339-347.
CHUNMEI LIN and YUE HE, 2010. The study of inference and inference chain of fuzzy cognitive
map, Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International
Conference on 2010, pp. 358-362.
CLARK, T., LARSON, J., MORDESON, J., POTTER, J., WIERMAN, M., CLARK, T., LARSON, J.,
MORDESON, J., POTTER, J. and WIERMAN, M., 2008. Fuzzy Set Theory. 225, pp. 29-63.
CORDÓN, O., GOMIDE, F., HERRERA, F., HOFFMANN, F. and MAGDALENA, L., 2004. Ten years
of genetic fuzzy systems: current framework and new trends. Fuzzy Sets and Systems,
141(1), pp. 5-31.
CORTES, C. and VAPNIK, V., 1995. Support-Vector Networks, Machine Learning 1995, pp.
273-297.
DAVIS, L.D. and MITCHELL, M., 1991. Handbook of Genetic Algorithms. Van Nostrand
Reinhold.
DAWKINS, R., 2006. The selfish gene. Oxford University Press.
DE JONG, K., 2002. Evolutionary Computation: A Unified Approach. The MIT Press.
DICKERSON, J.A. and KOSKO, B., 1994. Virtual Worlds as Fuzzy Cognitive Maps. Presence,
3(2), pp. 73-89.
DUBOIS, D. and PRADE, H.M., 1980. Fuzzy sets and systems: theory and applications.
Academic Press.
DUDA, R., HART, P. and STORK, D., 2001. Pattern Classification (2nd Edition). Wiley-Interscience.
EIBEN, A.E. and SMITH, J.E., 2003. Introduction to Evolutionary Computing. Springer-Verlag.
FLOREANO, D. and MATTIUSSI, C., 2008. Bio-Inspired Artificial Intelligence: Theories,
Methods, and Technologies. The MIT Press.
FOGEL, L.J., OWENS, A.J. and WALSH, M.J., 1966. Artificial Intelligence through Simulated
Evolution. New York, USA: John Wiley.
FORBES, N., 2005. Imitation of Life: How Biology Is Inspiring Computing. The MIT Press.
GABRYS, B. and RUTA, D., 2006. Genetic algorithms in classifier fusion. Applied Soft
Computing, 6(4), pp. 337-347.
GEORGOPOULOS, V.C., MALANDRAKI, G.A. and STYLIOS, C.D., 2003. A fuzzy cognitive map
approach to differential diagnosis of specific language impairment. Artificial Intelligence in
Medicine, 29(3), pp. 261-278.
GEORGOPOULOS, V.C. and STYLIOS, C.D., 2009. Diagnosis support using fuzzy cognitive
maps combined with genetic algorithms, 2009, pp. 6226-6229.
GLYKAS, M. and XIROGIANNIS, G., 2005. A soft knowledge modeling approach for
geographically dispersed financial organizations. Soft Computing - A Fusion of Foundations,
Methodologies and Applications, 9(8), pp. 579-593.
GOLDBERG, D.E., 1989. Genetic Algorithms in Search, Optimization and Machine Learning. 1st
edn. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
HAGIWARA, M., 1992. Extended fuzzy cognitive maps, Fuzzy Systems, 1992., IEEE
International Conference on 1992, pp. 795-801.
HANAFIZADEH, P. and ALIEHYAEI, R., 2011. The Application of Fuzzy Cognitive Map in Soft
System Methodology. Systemic Practice and Action Research, pp. 1-30.
HAOMING ZHONG, CHUNYAN MIAO, ZHIQI SHEN and YUHONG FENG, 2008. Temporal fuzzy
cognitive maps, Fuzzy Systems, 2008. FUZZ-IEEE 2008. (IEEE World Congress on
Computational Intelligence). IEEE International Conference on 2008, pp. 1831-1840.
HAYKIN, S.S., 2009. Neural networks and learning machines. Prentice Hall.
HENGJIE, S., CHUNYAN, M. and ZHIQI, S., 2007. Fuzzy cognitive map learning based on
multi-objective particle swarm optimization, Proceedings of the 9th annual conference on
Genetic and evolutionary computation 2007, ACM, pp. 339-339.
HOLLAND, J., 1975. Adaptation in natural and artificial systems. University of Michigan Press.
HOLLAND, J.H., 1992. Adaptation in Natural and Artificial Systems: An Introductory Analysis
with Applications to Biology, Control and Artificial Intelligence. Cambridge, MA, USA: MIT
Press.
HUERGA, A.V., 2002. A Balanced Differential Learning algorithm in Fuzzy Cognitive Maps, In:
Proceedings of the 16th International Workshop on Qualitative Reasoning 2002 2002.
IAKOVIDIS, D.K. and PAPAGEORGIOU, E., 2011. Intuitionistic Fuzzy Cognitive Maps for
Medical Decision Making. Information Technology in Biomedicine, IEEE Transactions on, 15(1),
pp. 100-107.
JONES, S., 2000. Almost like a whale: the origin of species updated. Black Swan.
JONG, K.A.D., 2006. Evolutionary computation: a unified approach. MIT Press.
KANNAPPAN, A., TAMILARASI, A. and PAPAGEORGIOU, E.I., 2011. Analyzing the performance
of fuzzy cognitive maps with non-linear hebbian learning algorithm in predicting autistic
disorder. Expert Syst.Appl., 38(3), pp. 1282-1292.
KARRAY, F. and SILVA, C.D., 2004. Soft Computing and Tools of Intelligent Systems Design:
Theory and Applications. Pearson Addison Wesley.
KEARNS, M.J. and VAZIRANI, U.V., 1994. An introduction to computational learning theory.
Cambridge, MA, USA: MIT Press.
KECMAN, V., 2001. Learning and soft computing: support vector machines, neural networks,
and fuzzy logic models. MIT Press.
KHAN, M.S. and QUADDUS, M., 2004. Group decision support using fuzzy cognitive maps for
causal reasoning. Group Decision and Negotiation, 13(5), pp. 463-480.
KHAN, M.S. and CHONG, A., 2003. Fuzzy Cognitive Map Analysis with Genetic Algorithm, IICAI
2003, pp. 1196-1205.
KLIR, G.J., and YUAN, B., 1995. Fuzzy sets and fuzzy logic : theory and applications. Upper
Saddle River, N.J.: Prentice Hall PTR.
KOHONEN, T., 1982. Self-organized formation of topologically correct feature maps. Biological
cybernetics, 43(1), pp. 59-69.
KOK, K., 2009. The potential of Fuzzy Cognitive Maps for semi-quantitative scenario
development, with an example from Brazil. Global Environmental Change, 19(1), pp. 122-133.
KOSKO, B., 1993/1995. Chapter 12: Adaptive Fuzzy Systems. Fuzzy Thinking.
KOSKO, B., 1992. Neural networks and fuzzy systems: a dynamical systems approach to
machine intelligence. Prentice Hall.
KOSKO, B., 1986. Fuzzy cognitive maps. International Journal of Man-Machine Studies, 24(1),
pp. 65-75.
KOSKO, B., 1988. Hidden patterns in combined and adaptive knowledge networks.
Int.J.Approx.Reasoning, 2(4), pp. 377-393.
KOTTAS, T.L., BOUTALIS, Y.S., DEVEDZIC, G. and MERTZIOS, B.G., 2004. A new method for
reaching equilibrium points in Fuzzy Cognitive Maps, 2004, pp. 53-60.
KOULOURIOTIS, D.E., DIAKOULAKIS, I.E., EMIRIS, D.M., ANTONIDAKIS, E.N. and
KALIAKATSOS, I.A., 2003. Efficiently modeling and controlling complex dynamic systems using
evolutionary fuzzy cognitive maps (invited paper). International Journal of Computational
Cognition, 1(2), pp. 41-65.
KOULOURIOTIS, D.E., DIAKOULAKIS, I.E. and EMIRIS, D.M., 2001. A fuzzy cognitive map-based stock market model: synthesis, analysis and experimental results, Fuzzy Systems,
2001. The 10th IEEE International Conference on 2001, pp. 465-468.
KOULOURIOTIS, D.E., DIAKOULAKIS, I.E. and EMIRIS, D.M., 2001. Learning fuzzy cognitive
maps using evolution strategies: A novel schema for modeling and simulating high-level
behavior, 2001, pp. 364-371.
KOZA, J.R., 1996. On the programming of computers by means of natural selection. MIT
Press.
KOZA, J.R., 1992. Genetic programming: on the programming of computers by means of
natural selection. Cambridge, MA, USA: MIT Press.
KUDO, M. and SKLANSKY, J., 2000. Comparison of algorithms that select features for pattern
classifiers. Pattern Recognition, 33(1), pp. 25-41.
LAUREANO-CRUCES, A., RAMÍREZ-RODRÍGUEZ, J. and TERÁN-GILMORE, A., 2004.
Evaluation of the Teaching-Learning Process with Fuzzy Cognitive Maps. 3315, pp. 922-931.
MAMDANI, E.H., 1976. Application of fuzzy logic to approximate reasoning using linguistic
synthesis, Proceedings of the sixth international symposium on Multiple-valued logic 1976,
IEEE Computer Society Press, pp. 196-202.
MAMDANI, E.H. and GAINES, B.R., 1981. Fuzzy reasoning and its applications. Academic
Press.
MARSLAND, S., 2009. Machine Learning: An Algorithmic Perspective (Chapman & Hall/Crc
Machine Learning & Pattern Recognition). Chapman and Hall/CRC.
MARTIN LARSEN, P., 1980. Industrial applications of fuzzy logic control. International Journal
of Man-Machine Studies, 12(1), pp. 3-10.
MATEOU, N.H. and ANDREOU, A.S., 2008. A framework for developing intelligent decision
support systems using evolutionary fuzzy cognitive maps. Journal of Intelligent and Fuzzy
Systems, 19(2), pp. 151-170.
MATEOU, N.H., MOISEOS, M. and ANDREOU, A.S., 2005. Multi-objective evolutionary fuzzy
cognitive maps for decision support, Evolutionary Computation, 2005. The 2005 IEEE
Congress on 2005, pp. 824-830 Vol.1.
MCDERMOTT, A., 1975. Cytogenetics of man and other animals / A. McDermott; consulting editor for this volume, H. Harris. Chapman and Hall; Wiley, London: New York.
MIAO, Y. and LIU, Z.-., 2000. On causal inference in fuzzy cognitive maps. IEEE Transactions
on Fuzzy Systems, 8(1), pp. 107-119.
MICHALEWICZ, Z. and FOGEL, D.B., 2004. How to solve it: modern heuristics. Springer.
MITCHELL, T.M., 1997. Machine Learning. 1 edn. New York, NY, USA: McGraw-Hill, Inc.
NEGNEVITSKY, M., 2005. Artificial intelligence: a guide to intelligent systems. Addison-Wesley.
NEOCLEOUS, C. and SCHIZAS, C., 2004. Application of Fuzzy Cognitive Maps to the Political-Economic Problem of Cyprus, June 17-20 2004, pp. 340-349.
NOVÁK, V. and LEHMKE, S., 2006. Logical structure of fuzzy IF-THEN rules. Fuzzy Sets and
Systems, 157(15), pp. 2003-2029.
PAPAGEORGIOU, E., STYLIOS, C. and GROUMPOS, P., 2003. Fuzzy Cognitive Map Learning
Based on Nonlinear Hebbian Rule. Lecture Notes in Computer Science, (2903), pp. 256-268.
PAPAGEORGIOU, E.I., 2011. A new methodology for Decisions in Medical Informatics using
fuzzy cognitive maps based on fuzzy rule-extraction techniques. Applied Soft Computing
Journal, 11(1), pp. 500-513.
PAPAGEORGIOU, E.I. and GROUMPOS, P.P., 2005. A new hybrid method using evolutionary
algorithms to train Fuzzy Cognitive Maps. Applied Soft Computing Journal, 5(4), pp. 409-431.
PAPAGEORGIOU, E.I. and GROUMPOS, P.P., 2004. Two-stage learning algorithm for fuzzy
cognitive maps, Intelligent Systems, 2004. Proceedings. 2004 2nd International IEEE
Conference 2004, pp. 82-87 Vol.1.
PAPAGEORGIOU, E.I., MARKINOS, A.T. and GEMTOS, T.A., 2011. Fuzzy cognitive map based
approach for predicting yield in cotton crop production as a basis for decision support system
in precision agriculture application. Applied Soft Computing, 11(4), pp. 3643.
PAPAGEORGIOU, E.I., PAPANDRIANOS, N.I., APOSTOLOPOULOS, D.J. and VASSILAKOS, P.J.,
2008. Fuzzy Cognitive Map based decision support system for thyroid diagnosis management,
Fuzzy Systems, 2008. FUZZ-IEEE 2008. (IEEE World Congress on Computational Intelligence).
IEEE International Conference on 2008, pp. 1204-1211.
PAPAGEORGIOU, E.I., PARSOPOULOS, K.E., STYLIOS, C.S., GROUMPOS, P.P. and VRAHATIS,
M.N., 2005. Fuzzy cognitive maps learning using particle swarm optimization. Journal of
Intelligent Information Systems, 25(1), pp. 95-121.
PAPAGEORGIOU, E.I., SPYRIDONOS, P.P., GLOTSOS, D.T., STYLIOS, C.D., RAVAZOULA, P.,
NIKIFORIDIS, G.N. and GROUMPOS, P.P., 2008. Brain tumor characterization using the soft
computing technique of fuzzy cognitive maps. Appl.Soft Comput., 8(1), pp. 820-828.
PAPAGEORGIOU, E.I., STYLIOS, C. and GROUMPOS, P.P., 2006. Unsupervised learning
techniques for fine-tuning fuzzy cognitive map causal links. International Journal of Human
Computer Studies, 64(8), pp. 727-743.
PAPAGEORGIOU, E., GEORGOULAS, G., STYLIOS, C., NIKIFORIDIS, G. and GROUMPOS, P.,
2006. Combining Fuzzy Cognitive Maps with Support Vector Machines for Bladder Tumor
Grading. 4251, pp. 515-523.
PAPAGEORGIOU, I. and GROUMPOS, P., 2005. A weight adaptation method for fuzzy cognitive
map learning. Soft Comput., 9(11), pp. 846-857.
PAPAKOSTAS, G.A., KOULOURIOTIS,D.E., 2010. Classifying patterns using Fuzzy Cognitive
Maps. Studies in Fuzziness and Soft Computing, 247, pp. 291-306.
PARK, K.S. and KIM, S.H., 1995. Fuzzy cognitive maps considering time relationships.
Int.J.Hum.-Comput.Stud., 42(2), pp. 157-168.
PARSOPOULOS, K.E., PAPAGEORGIOU, E.I., GROUMPOS, P.P. and VRAHATIS, M.N., 2003. A
first study of fuzzy cognitive maps learning using particle swarm optimization, Evolutionary
Computation, 2003. CEC '03. The 2003 Congress on 2003, pp. 1440-1447 Vol.2.
PETALAS, Y.G., PARSOPOULOS, K.E. and VRAHATIS, M.N., 2008. Improving fuzzy cognitive
maps learning through memetic particle swarm optimization. Soft Comput., 13(1), pp. 77-94.
POLI, R., LANGDON, W.B., CLERC, M. and STEPHENS, C.R., 2007. Continuous optimisation
theory made easy? Finite-element models of evolutionary strategies, genetic algorithms and
particle swarm optimizers, Proceedings of the 9th international conference on Foundations of
genetic algorithms 2007, Springer-Verlag, pp. 165-193.
RIDLEY, M., 2004. Evolution. Blackwell Pub.
ROSS, T.J., 2004. Fuzzy logic with engineering applications. John Wiley.
RUSSELL, S.J. and NORVIG, P., 2003. Artificial Intelligence: A Modern Approach (Second Edition).
Prentice Hall.
SALMERON, J.L., 2009. Augmented fuzzy cognitive maps for modelling LMS critical success
factors. Know.-Based Syst., 22(4), pp. 275-278.
SCHÖLKOPF, B., MIKA, S., BURGES, C.J.C., KNIRSCH, P., MÜLLER, K., RÄTSCH, G. and
SMOLA, A.J., 1999. Input Space Versus Feature Space in Kernel-Based Methods. IEEE
Transactions on Neural Networks, 10, pp. 1000-1017.
SHAWE-TAYLOR, J. and CRISTIANINI, N., 2004. Kernel Methods for Pattern Analysis. New
York, NY, USA: Cambridge University Press.
SHENG-JUN LI and RUI-MIN SHEN, 2004. Fuzzy cognitive map learning based on improved
nonlinear Hebbian rule, Machine Learning and Cybernetics, 2004. Proceedings of 2004
International Conference on 2004, pp. 2301-2306 vol.4.
SIRAJ, A., BRIDGES, S.M. and VAUGHN, R.B., 2001. Fuzzy cognitive maps for decision support
in an intelligent intrusion detection system, IFSA World Congress and 20th NAFIPS
International Conference, 2001. Joint 9th 2001, pp. 2165-2170 vol.4.
SIVANANDAM, S.N. and DEEPA, S.N., 2008. Introduction to Genetic Algorithms. New York:
Springer-Verlag Berlin Heidelberg.
SONG, H.J., MIAO, C.Y., WUYTS, R., SHEN, Z.Q., D'HONDT, M. and CATTHOOR, F., 2011. An
Extension to Fuzzy Cognitive Maps for Classification and Prediction. Fuzzy Systems, IEEE
Transactions on, 19(1), pp. 116-135.
STACH, W., KURGAN, L. and PEDRYCZ, W., 2008. Data-driven Nonlinear Hebbian Learning
method for Fuzzy Cognitive Maps, Fuzzy Systems, 2008. FUZZ-IEEE 2008. (IEEE World
Congress on Computational Intelligence). IEEE International Conference on 2008, pp. 1975-1981.
STACH, W., KURGAN, L. and PEDRYCZ, W., 2007. Parallel Learning of Large Fuzzy Cognitive
Maps, Neural Networks, 2007. IJCNN 2007. International Joint Conference on 2007, pp. 1584-1589.
STACH, W., KURGAN, L., PEDRYCZ, W. and REFORMAT, M., 2005. Evolutionary Development
of Fuzzy Cognitive Maps, Fuzzy Systems, 2005. FUZZ '05. The 14th IEEE International
Conference on 2005, pp. 619-624.
STACH, W., KURGAN, L., PEDRYCZ, W. and REFORMAT, M., 2005. Genetic learning of fuzzy
cognitive maps. Fuzzy Sets and Systems, 153(3), pp. 371-401.
STACH, W., KURGAN, L.A. and PEDRYCZ, W., 2008. Numerical and Linguistic Prediction of
Time Series With the Use of Fuzzy Cognitive Maps. Fuzzy Systems, IEEE Transactions on,
16(1), pp. 61-72.
STACH, W., KURGAN, L. and PEDRYCZ, W., 2010. Expert-Based and Computational Methods
for Developing Fuzzy Cognitive Maps. Fuzzy Cognitive Maps Advances in Theory,
Methodologies, Tools and Applications, 247, pp. 23-41.
STYLIOS, C.D. and GEORGOPOULOS, V.C., 2008. Fuzzy cognitive maps structure for medical
decision support systems.
STYLIOS, C.D. and GROUMPOS, P.P., 2004. Modeling Complex Systems Using Fuzzy Cognitive
Maps. IEEE Transactions on Systems, Man, and Cybernetics Part A:Systems and Humans.,
34(1), pp. 155-162.
STYLIOS, C.D. and GROUMPOS, P.P., 1999. Soft computing approach for modelling the
supervisor of manufacturing systems. Journal of Intelligent and Robotic Systems: Theory and
Applications, 26(3), pp. 389-403.
STYLIOS, C.D. and GROUMPOS, P.P., 2000. Fuzzy Cognitive Maps in Modeling Supervisory
Control Systems. Journal of Intelligent & Fuzzy Systems, 8, pp. 83-98.
SUGENO, M. and YASUKAWA, T., 1993. A fuzzy-logic-based approach to qualitative modeling.
Fuzzy Systems, IEEE Transactions on, 1(1), pp. 7.
TSADIRAS, A.K., 2008. Comparing the inference capabilities of binary, trivalent and sigmoid
fuzzy cognitive maps. Information Sciences, 178(20), pp. 3880-3894.
TSADIRAS, A.K. and MARGARITIS, K.G., 1999. A New Balance Degree for Fuzzy Cognitive
Maps. Department of Applied Informatics, University of Macedonia.
TSADIRAS, A. and KOUSKOUVELIS, I., 2005. Using Fuzzy Cognitive Maps as a Decision
Support System for Political Decisions: The Case of Turkey’s Integration into the European
Union. 3746, pp. 371-381.
VAPNIK, V., GOLOWICH, S.E. and SMOLA, A., 1996. Support Vector Method for Function
Approximation, Regression Estimation, and Signal Processing, Advances in Neural Information
Processing Systems 9 1996, MIT Press, pp. 281-287.
XIANG-FENG LUO and ER-LIN YAO, 2005. The Reasoning Mechanism of Fuzzy Cognitive Maps,
Semantics, Knowledge and Grid, 2005. SKG '05. First International Conference on 2005, pp.
104-104.
XIAODONG ZHU, ZHIQIU HUANG, SHUQUN YANG and GUOHUA SHEN, 2007. Fuzzy
Implication Methods in Fuzzy Logic, Fuzzy Systems and Knowledge Discovery, 2007. FSKD
2007. Fourth International Conference on 2007, pp. 154-158.
XIROGIANNIS, G., GLYKAS, M., STAIKOURAS,C., 2010. Fuzzy Cognitive Maps in banking
business process performance measurement. Studies in Fuzziness and Soft Computing, 247,
pp. 161-200.
XIROGIANNIS, G. and GLYKAS, M., 2004. Fuzzy cognitive maps in business analysis and
performance-driven change. Engineering Management, IEEE Transactions on, 51(3), pp. 334-351.
YANLI ZHANG and HUIYAN LIU, 2010. Classification Systems Based on Fuzzy Cognitive Maps,
Genetic and Evolutionary Computing (ICGEC), 2010 Fourth International Conference on 2010,
pp. 538-541.
YOU-WEI TENG and WANG, W.-., 2004. Constructing a user-friendly GA-based fuzzy system
directly from numerical data. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE
Transactions on, 34(5), pp. 2060-2070.
ZADEH, L.A., 1996. Fuzzy logic = computing with words. Fuzzy Systems, IEEE Transactions
on, 4(2), pp. 103-111.
ZADEH, L.A., 1983. The role of fuzzy logic in the management of uncertainty in expert
systems. Fuzzy Sets Syst., 11(1-3), pp. 197-198.
ZADEH, L.A., 1973. Outline of a New Approach to the Analysis of Complex Systems and
Decision Processes. Systems, Man and Cybernetics, IEEE Transactions on, SMC-3(1), pp. 28-44.
ZADEH, L.A., 1965. Fuzzy Sets. Information and Control, 8, pp. 338-353.
ZHEN PENG, BINGRU YANG and WEIWEI FANG, 2008. A Learning Algorithm of Fuzzy Cognitive
Map in Document Classification, Fuzzy Systems and Knowledge Discovery, 2008. FSKD '08.
Fifth International Conference on 2008, pp. 501-504.
ZHI-QIANG LIU and YUAN MIAO, 1999. Fuzzy cognitive map and its causal inferences, Fuzzy
Systems Conference Proceedings, 1999. FUZZ-IEEE '99. 1999 IEEE International 1999, pp.
1540-1545 vol.3.
ZIMMERMANN, H.J., 2001. Fuzzy set theory--and its applications. Kluwer Academic Publishers.