UNIVERSITY OF CYPRUS
DEPARTMENT OF COMPUTER SCIENCE

Computational Intelligence

Author: Maria Papaioannou
Supervisor: Prof. Christos N. Schizas
Date: 6/7/2011

Table of Contents

Chapter 1 - Introduction
Chapter 2 - Neural Networks
  2.1 - Neurons
    2.1.1 - Biological Neurons
    2.1.2 - Artificial Neuron
  2.2 - Artificial Neural Networks and Architecture
  2.3 - Artificial Neural Networks and Learning
  2.4 - Feed-Forward Neural Network
    2.4.1 - The perceptron algorithm
    2.4.2 - Functions used by Feed-forward Network
    2.4.3 - Back propagation algorithm
  2.5 - Self Organizing Maps
    2.5.1 - Biology
    2.5.2 - Self Organizing Maps Introduction
    2.5.3 - Algorithm
  2.6 - Kernel Functions
  2.7 - Support Vector Machines
    2.7.1 - SVM for two-class classification
    2.7.2 - Multi-class problems
Chapter 3 - Fuzzy Logic
  3.1 - Basic fuzzy set operators
  3.2 - Fuzzy Relations
  3.3 - Imprecise Reasoning
  3.4 - Generalized Modus Ponens
  3.5 - Fuzzy System Development
Chapter 4 - Evolutionary Algorithms
  4.1 - Biology
  4.2 - EC Terminology and History
    4.2.1 - Genetic Algorithms (GA)
    4.2.2 - Genetic Programming (GP)
    4.2.3 - Evolutionary Strategies (ES)
    4.2.4 - Evolutionary Programming (EP)
  4.3 - Genetic Algorithms
    4.3.1 - Search space
    4.3.2 - Simple Genetic Algorithm
    4.3.3 - Overall
Chapter 5 - Fuzzy Cognitive Maps
  5.1 - Introduction
  5.2 - FCM technical background
    5.2.1 - FCM structure
    5.2.3 - Static Analysis
    5.2.4 - Dynamical Analysis of FCM (Inference)
    5.2.5 - Building a FCM
    5.2.6 - Learning and Optimization of FCM
    5.2.7 - Time in FCM
    5.2.8 - Modified FCM
  5.3 - Issues about FCMs
Chapter 6 - Discussion
References

Chapter 1 - Introduction

During the last century the world came to an important realization: organized, structured thinking can arise from a collection of individual processing units which co-function and interact in harmony, without the organized supervision of any super-unit. In other words, humanity shed light on how the brain works.
The appearance of computers and the remarkable growth of their computational capabilities further fueled the human drive to explore how the brain works, by modeling single neurons or groups of interconnected neurons and running simulations on them. This led to the first generation of the well-known Artificial Neural Networks (ANN). Over time, scientists have made further observations on how humans and nature work. Behavioural concepts of individuals and of organizations of individuals, psychological aspects, and biological and genetic findings have been incorporated into computerized methods which at first fell under the class of "Artificial Intelligence" algorithms, but later formed a novel category of their own: "Computational Intelligence" (CI). Computer Science is a field in which algorithmic processes, methods and techniques for describing, transforming and generally utilizing information are created to satisfy different human needs. Humans have to deal with a variety of complex problems in domains such as Medicine, Engineering, Ecology, Biology and so on. In parallel, humans have accumulated over the years vast databases holding a plethora of information. The need to search this information, to identify structures and distributions within it, and generally to take advantage of it, together with the simultaneous development of computing machines (at the hardware level), led humans to invent different ways of handling information for their own benefit. Computational Intelligence lies under the general umbrella of Computer Science and hence encompasses algorithms, methods and techniques for utilizing and manipulating data and information. However, as the name of the area indicates, CI methods aim to incorporate features of human and natural intelligence into the computational handling of information.
Although there is no strict definition of intelligence, one could roughly say that intelligence is the ability of an individual to adapt, efficiently and effectively, to a new environment, altering its present state into something new; in other words, the ability to take, under given circumstances, the correct decisions about the next action. To accomplish this, certain features of intelligence must be present, such as learning through experience, using language in communication, spatial and environmental cognition, invention, generalization, inference, and carrying out calculations based on common logic. Humans are the only beings in this world that combine all of these intelligent features. Computers, however, are better able to devote a great amount of effort to solving a specific problem explicitly defined by humans, taking advantage of a vast amount of available data. CI technologies lend themselves to mimicking intelligence properties found in humans and in nature, employing these properties effectively and efficiently in the body of a method or algorithm, and eventually producing better results faster than conventional mathematical or computer science methods in the same problem domain. The main objective of CI research is to understand the fashion in which natural systems function and work. CI combines different elements of adaptation, learning, evolution, fuzzy logic, and distributed and parallel processing; in general, CI offers models which exhibit (to a degree) intelligent functions. CI research does not compete with the classic mathematical methods that act in the same domains; rather, it comprises an alternative problem-solving technique for cases where the employment of conventional methods is not feasible. The three main sub-areas of CI are Artificial Neural Networks, Fuzzy Logic and Evolutionary Computation.
Some of the application areas of CI are medical diagnostics, economics, physics, time scheduling, robotics, control systems, natural language processing, and many others. The growing needs in all these areas and their presence in humans' everyday life have intensified CI research. This need guided researchers towards investigating further the behaviour of nature, which led to the definition of evolutionary systems. The second chapter is about Artificial Neural Network models, which are inspired by brain processes and structures. A first description of how ANN were originally inspired and modeled is given, followed by a basic description of three ANN models. Among the various configurations of ANN topologies and training algorithms, the Multilayer Perceptron (MLP), Kohonen networks and Support Vector Machines are described. ANN are applicable in a plethora of areas such as general classification, regression, pattern recognition, optimisation, control, function approximation, time series prediction, data mining and so on. The third chapter addresses the elements and aspects of Fuzzy Logic. According to the inventor of Fuzzy Logic himself, Dr. Zadeh, the word "fuzzy" indicates blurred, unclear or confused states or situations; he even stated that the choice of the word "fuzzy" to describe his theory may have been wrong. What Fuzzy Logic essentially offers is a theoretical scheme for how humans think and reason. Fuzzy Logic allows computers to handle information as humans do: described in linguistic terms and not precisely defined, yet still amenable to inference and to drawing correct conclusions. Fuzzy Logic has been applied in a wide range of fields such as control, decision support and information systems, and in many industrial products such as cameras, washing machines, air conditioners, optimized timetables, etc. The fourth chapter is about Evolutionary Computation, which essentially draws its ideas from natural evolution as first stated by Darwin.
Evolutionary Computation methods focus on optimization problems. Potential solutions to such problems are created and evolved through discrete iterations (called generations) under selection, crossover and mutation operators. The solutions are encoded into strings (or trees) which act analogously to biological chromosomes. In these algorithms, the parameters comprising a problem are represented either as genotypes (which are evolved on the basis of inheritance) or as phenotypes (which are evolved on the basis of interaction with the environment). The EC area can be analyzed into four directions: Evolutionary Strategies, Evolutionary Programming, Genetic Programming and, finally, Genetic Algorithms. These EC subcategories differ in the solution representation and in the genetic operators they use. Fields where EC has been successfully applied include telecommunications, scheduling, engineering design, ANN architecture and structural optimisation, control optimization, classification and clustering, function approximation and time series modeling, regression and so on. The fifth chapter describes the model of Fuzzy Cognitive Maps (FCM), which comprise a soft computing methodology for modeling systems that contain causal relationships among their parameters. They manage to represent human knowledge and experience in a certain system's domain as a weighted directed graph model and to apply inference, presenting potential behaviours of the model under specific circumstances. FCMs can adequately describe and handle fuzzy, not precisely defined knowledge. Their construction is based on knowledge extracted from experts on the modeled system. The experts must define the number of concepts and describe their type linguistically. Additionally, they have to define the causal relationships between the concepts and describe their type (direct or inverse) as well as their strength.
The main features of FCMs are their simplicity of use and their flexibility in designing, modeling and inferring over complex, large systems. The basic structure of FCMs and their inference process are described in this chapter. Some proposed algorithms for FCM weight adaptation and optimization are also presented. Such algorithms are needed in cases where data describing the system behaviour is available but the experts are not in a position to describe the system adequately or completely. Moreover, some attempts at introducing time dependencies are presented, along with some hybridized FCMs (mostly with ANN) which are mainly used in classification problems. Finally, some open issues about FCMs are discussed at the end of the chapter. The last chapter is an open discussion on the CI area and, more specifically, on Evolutionary Computation and Fuzzy Cognitive Maps, which attracted the greatest interest on the part of the writer of this study.

Chapter 2 - Neural Networks

Neural Networks comprise an efficient mathematical/computational tool which is widely used for pattern recognition. As indicated by the term "neural", an ANN is constituted by a set of interconnected artificial neurons, resembling the architecture and the functionality of the biological neurons of the brain.

2.1 - Neurons

2.1.1 - Biological Neurons

At the biological level, neurons are the structural units of the brain and of the nervous system, and they are characterized by a fundamental property: their cell membrane allows the exchange of signals with interconnected neurons. A biological neuron consists of three basic elements: the body, the dendrites and the axon. A neuron accepts input signals from neighbouring neurons through its dendrites and sends output signals to other neurons through its axon. Each axon ends at a synapse; a synapse is the connection between an axon and the dendrites of another neuron.
Chemical processes take place in the synapses which determine the excitation or inhibition of the electrical signal travelling to the neuron's body. The arrival of an electrical pulse at a synapse causes the release of neurotransmitter chemicals which travel across to the postsynaptic area, a part of the receiving neuron's dendrite. Since a neuron has many dendrites, it may receive several electrical pulses (input signals) at the same time. The neuron is then able to integrate these input signals and, if the strength of the total input exceeds a threshold, transform them into an output signal sent to its neighbouring neurons through its axon. Synapses are believed to play a significant role in learning, since the strength of a synaptic connection between neurons is adjusted by chemical processes reflecting favourable and unfavourable incoming stimuli, helping the organism to optimize its response to a particular task and environment. The processing power of the brain depends greatly on the parallel, interconnected operation of its neurons. Neurons comprise a very important piece in the puzzle of complex behaviours and features of a living organism such as learning, language, memory, attention, problem solving, consciousness, etc. The very first neural networks were developed to mimic this biological information-processing mechanism (McCulloch and Pitts in 1943, and later Widrow and Hoff with Adaline in 1960 and Rosenblatt with the Perceptron in 1962). This study, however, focuses on presenting ANN as computational models for regression and classification problems. More specifically, the feed-forward neural network will be introduced, since it appears to have the greatest practical value.

2.1.2 - Artificial Neuron

Following the functionality and the structure of a biological neuron, an artificial neuron: 1. Receives multiple inputs, in the form of real-valued variables xn = [x1, …, xD].
A neuron with D inputs defines a D-dimensional input space; the artificial neuron creates a (D-1)-dimensional hyperplane (a decision boundary) which divides the input space into two regions. 2. Processes the inputs by integrating the input values with a set of synaptic parameters w = [w1, …, wD] to produce a single numerical output signal. The output signal is then transformed using a differentiable, nonlinear function called the activation function. 3. Transmits the output signal to its interconnected artificial neurons.

2.2 - Artificial Neural Networks and Architecture

Most ANNs are divided into layers. Each layer is a set of artificial neurons called nodes. The first layer of the network is the input layer; each of its nodes corresponds to a feature variable of the input vector. The last layer is the output layer; each of its nodes denotes the final decision/answer of the network for the problem being modelled. The layers between the input and output layers are called hidden layers, and their neurons are called hidden units. Each connection between nodes is characterized by a real value called a synaptic weight, whose role is analogous to that of the biological synapses: a connection with a positive weight can be seen as an excitatory synapse, while one with a negative weight resembles an inhibitory synapse. There is a variety of ANN categories, depending on their architectures and connection topologies. In this study feed-forward networks will be presented, which constitute one of the simplest forms of ANN.

2.3 - Artificial Neural Networks and Learning

Artificial Neural Networks are data-processing systems. The main goal of an ANN is to predict/classify correctly given a specific input instance. Similarly to biological neurons, learning in ANN is achieved by adapting the synaptic weights to the input data. There is a variety of learning methods for ANNs, drawn from the fields of Supervised Learning, Unsupervised Learning and Reinforcement Learning.
In any case, the general goal of an ANN is to recover the mapping between the input data and the correct/desired classification, clustering or prediction at the output. A basic description of the supervised and unsupervised learning classes is given in Table 2.2.

Table 2.1: Correspondence between a biological neuron and an artificial neuron (AN).

  Stimuli:
    Biological neuron: action potentials expressed as electrical impulses.
    AN: a vector of real-valued variables.
  Synapse:
    Biological neuron: electrical signals are "filtered" by the strength of the synapse.
    AN: each input variable is weighted by the corresponding synaptic weight.
  Dendrites:
    Biological neuron: the dendrites receive electrical signals from neighbouring neurons through their synapses.
    AN: the weighted input signals are received into the AN.
  Soma:
    Biological neuron: input signal integration.
    AN: integration of the weighted input variables; the result is transformed using an activation function h(∙).
  Axon:
    Biological neuron: transmits output signals to neighbouring neurons.
    AN: a single real variable which is made available as input to the neighbouring artificial neurons.

[Figure: a biological neuron, and an artificial neuron with inputs x1, …, xD weighted by w1, …, wD, summed (S = Σ wi·xi) and passed through the activation h(s) to produce the output y.]

Table 2.2: Supervised and Unsupervised learning overview.

  Supervised learning:
    Dataset: input vectors Xn = (x1, x2, …, xD) containing D features, together with target vectors Tn = (t1, …, tP) containing the values the ANN should answer for each output variable, for 1 ≤ n ≤ N.
    Network response: an output vector On = (o1, …, oP), the answer of the ANN given the instance Xn.
    Goal: adjust the parameters w so that the difference (tn - on), called the training error, is minimized.
    Methods: the least-mean-square algorithm (for a single neuron) and its generalization, widely known as the back-propagation algorithm (for multilayer interconnections of neurons).

  Unsupervised learning:
    Dataset: input vectors Xn = (x1, x2, …, xD) containing D features, with no targets.
    Network response: a specific neuron or set of neurons is assigned to respond to specific input instances, recovering their class/group and underlying relations and projecting them into clusters/groups.
    Goal: adjust the synaptic parameters w in such a way that for each input a specific artificial neuron (or set of neurons) responds intensely while the rest do not respond.
    Methods: unsupervised learning using Hebbian learning (e.g. Self-Organizing Maps).

The choice of learning method depends on many parameters such as the structure of the dataset, the ANN architecture, the type of the problem, etc. An ANN learns by running the steps defined by the chosen learning algorithm for a number of iterations. The process of applying learning to an ANN is called the training period/phase. The stopping criterion of the training period may be one of the following (or a combination of them): 1. The maximum number of iterations is reached. 2. The cost/error function is adequately minimized (or maximized). 3. The difference between all synaptic weight values in two sequential runs is sufficiently small.

Generalization is one of the most important properties of a trained computational model. Generalization is achieved if the ANN (or any other computational model) is able to categorize/predict/decide correctly for input instances that were not presented to the network during the training phase (the test dataset). A measure of the generalization of the network is the testing error (an evaluation of the difference between the output of the network, given an input instance taken from the test dataset, and its target value). If the ANN fails to generalize (meaning the testing error is large, considerably bigger than the training error), phenomena such as overfitting or underfitting may be present.
Overfitting happens when the ANN is overtrained, so that the model captures the exact relationship between the specific inputs and outputs used during the training phase. Eventually the network becomes just a "look-up table", memorizing every input instance (even noise points) and connecting it with the corresponding output. As a result, the network is incapable of recognizing a new testing input instance and responds fairly randomly. Underfitting appears when the network is not able to capture the underlying function mapping inputs to outputs, either due to the small size of the training dataset or due to the poor architecture of the model. That is why designing an ANN model requires special attention in choosing a model suited to the problem being solved/modeled. Such considerations are: 1. If the training dataset is a sample of the actual dataset, the chosen input instances should be representative, covering all the possible classes/clusters/categories of the actual dataset. For example, if the problem is to diagnose whether a patient has a particular disease or not, then the sample dataset should include instances of both diseased and non-diseased patients. 2. The number of input instances is also important. If the number of input instances is fairly small, underfitting is likely to occur, since the network will not be given the amount of information needed to shape the input-output mapping function. Note that adequate instances should be given for all the classes of the input data. Additionally, the size of the training dataset should be defined with respect to the dimension of the input data space: the larger the data dimensionality, the larger the dataset should be, allowing the network to adapt the synaptic weights (whose number increases or decreases according to the number of input feature variables). 3.
The network architecture, meaning the number of hidden layers, the number of their hidden units, the network topology, etc. 4. If the stopping criterion is defined exclusively by the maximum number of epochs and the network is left to run for a long time, there is an increased risk of overfitting. On the other hand, if the network is trained for only a small number of epochs, underfitting is possible.

Figure 2.1: How an ANN responds to a regression problem when underfitting, good fitting and overfitting occur.

2.4 - Feed-Forward Neural Network

Artificial neural networks have become over the years a big family of different models combined with a variety of learning methods. One of the most popular and prevalent forms of neural network is the feed-forward network combined with backpropagation learning. A feed-forward network consists of an input layer, an output layer and, optionally, one or more hidden layers. The name "feed-forward" denotes the direction of the information processing, which is dictated by the connectivity between the network layers. The nodes (artificial neurons) of each layer are connected to all the nodes of the next layer, whereas neurons belonging to the same layer are not connected to each other. Feed-forward networks are also called "multilayer perceptrons", as they follow the basic idea of the elementary perceptron model in a multilayered form. For that reason, the elementary perceptron algorithm is described first, paving the way for a better understanding of the multilayer perceptron neural network.

2.4.1 - The perceptron algorithm

The perceptron algorithm (Rosenblatt, 1962) belongs to the family of linear discriminant models. The model is divided into two phases: 1. The nonlinear transformation of the input vector x to a feature vector φ(x) (e.g. by using fixed basis functions). 2. Building the generalized linear model:

y(xn) = f(wT φ(xn))   (Eq. 2.1)

where, for N input instances, 1 ≤ n ≤ N, w is the weight vector and f(∙) is a step function of the form:

f(a) = +1 if a ≥ 0, and f(a) = -1 otherwise.   (Eq. 2.2)

The perceptron is a two-class model: if the output y(xn) is +1, the input pattern xn is assigned to class C1, whereas if y(xn) is -1 then xn is assigned to class C2. This model can be rendered by a single artificial neuron, as described previously in this study. The output of the perceptron depends on the weighted feature vector φ(xn). The intention of the perceptron algorithm is therefore to find a weight vector w* such that patterns xn belonging to class C1 satisfy wTφ(xn) > 0, while patterns xn of class C2 satisfy wTφ(xn) < 0. Setting the target output tn ∈ {+1, -1} for classes C1 and C2 respectively, all correctly classified patterns should satisfy:

wTφ(xn) tn > 0   (Eq. 2.3)

The error function (named the perceptron criterion) is zero for correctly classified patterns (y(xn) equals tn), while for the set of misclassified patterns, denoted by M, it gives the quantity:

E(w) = -Σm∈M wTφ(xm) tm   (Eq. 2.4)

The goal is the minimization of this error function; to achieve it, the weight vector is updated every time the error function is not zero. The weight update uses gradient descent, so that:

w(i+1) = w(i) - η∇E(w) = w(i) + ηφ(xm) tm   (Eq. 2.5)

where i is the iteration counter and η is the learning rate, defining the rate of change of the weight vector. The overall perceptron algorithm consists of the following steps. For each input instance xn at step i: 1. Compute y(xn) = f(w(i)T φ(xn)). 2. If y(xn) ≠ tn: if tn = +1, set w(i+1) = w(i) + ηφ(xn); otherwise set w(i+1) = w(i) - ηφ(xn). If y(xn) = tn, move forward. 3.
Increase the step counter i and move to the next input instance.

2.4.2 - Functions used by the Feed-forward Network

A feed-forward network is also named a multilayer perceptron, since it can be analyzed into HL + 1 stages of processing, each one containing multiple neurons behaving like elementary perceptrons, where HL is the number of hidden layers of the network. Attention should be given, however, to the fact that ANN hidden neurons use continuous sigmoidal nonlinearities, whereas the perceptron uses step-function nonlinearities.

Figure 2.2: A feed-forward network with 2 hidden layers.

An example of a feed-forward network is shown in Figure 2.2. This particular network consists of one input layer, two hidden layers and an output layer. The features of an input pattern are assigned to the nodes of the input layer; in Figure 2.2 the input pattern has two features, x1 and x2, hence the input layer consists of two nodes. The data is then processed by the hidden-layer nodes, ending in one output node which gives the answer to the problem modelled, given the input pattern xn and the synaptic weight vector w. As can be seen, there is an extra node in every layer except the output layer, denoted by a zero marker. It is called the bias parameter, and the synaptic weights originating from the bias nodes are drawn with thinner lines, on purpose, since adding a bias node is optional for every layer. The bias parameter is added to a layer to allow a fixed offset in the data; for that reason its value is always 1, so that its weight value is simply added to the other weighted node values. The connectivity of the network determines the direction of information propagation: it begins at the input layer and ends, sequentially, at the output layer. First, an input pattern Xn = (x1, …, xD) is introduced to the network (meaning that the input layer consists of D nodes¹).
M linear combinations between the input nodes and the hidden units of the first hidden layer are created as:

pj = h(Σi=1..D wij xi + w0j)   Eq. 2.6

where j = 1,…,M (M is the number of hidden units of the first hidden layer, excluding the bias node) and h(∙) is a differentiable, nonlinear activation function, generally chosen from the sigmoidal functions, such as the logistic sigmoid or tanh. The output values of the first hidden layer neurons are linearly combined in the form of:

qk = h(Σj=1..M wjk pj + w0k)   Eq. 2.7

to give the output of the second hidden layer units. The same process is repeated for every extra hidden layer. The last step is the evaluation of the output units (belonging to the output layer), where again there are linear combinations of the K last-hidden-layer variables:

or = f(Σk=1..K wkr qk + w0r)   Eq. 2.8

where r = 1,…,R, R is the total number of outputs, and f might be an activation function (e.g. the same as h(∙)) or the identity. Taking the network presented in Figure 2.2, the functioning of the whole model is described as:

yr(x,w) = f(Σk=1..K wkr h(Σj=1..M wjk h(Σi=1..D wij xi + w0j) + w0k) + w0r)   Eq. 2.9

The final output of the network, given an input pattern Xn, is the vector On = o1,…,oR. Each element of the vector is the network's answer/decision for the corresponding element of the target vector tn. Like the perceptron model, if the target vector differs from the network output then there is an error (given by an error function). What is left is to train the network by adapting the synaptic interconnections, guided by the minimization of an error function. One of the most well-known learning algorithms for feed-forward neural networks is back propagation learning.

1 The word "nodes" is used, but for the input layer case it is preferable to understand them as incoming signals.
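As a concrete illustration, the layered evaluation of Eqs. 2.6–2.8 can be sketched in Python. The 2-3-3-1 topology and all weight values below are hypothetical, chosen only to make the example runnable, with tanh as h(∙) and the identity as f(∙):

```python
import math

def layer(values, weights, bias, act):
    """One stage of Eqs. 2.6-2.8: unit j outputs act(sum_i w[i][j]*v_i + w0_j)."""
    return [act(sum(weights[i][j] * values[i] for i in range(len(values))) + bias[j])
            for j in range(len(bias))]

h = math.tanh                 # hidden activation h(.)
f = lambda a: a               # identity output activation f(.)

# Hypothetical weights for a small 2-3-3-1 network shaped like Figure 2.2
w1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]];                   b1 = [0.1, 0.0, -0.1]
w2 = [[0.2, 0.4, -0.1], [-0.3, 0.1, 0.5], [0.6, -0.2, 0.3]]; b2 = [0.0, 0.1, 0.2]
w3 = [[0.7], [-0.5], [0.4]];                                 b3 = [0.05]

x = [1.0, -1.0]               # input pattern (x1, x2)
p = layer(x, w1, b1, h)       # first hidden layer (Eq. 2.6)
q = layer(p, w2, b2, h)       # second hidden layer (Eq. 2.7)
o = layer(q, w3, b3, f)       # output layer (Eq. 2.8)
print(len(p), len(q), len(o)) # 3 3 1
```

Note that the tanh hidden units keep every hidden value inside (−1, 1), while the identity output leaves or unbounded, as Eq. 2.8 allows.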
2.4.3 - Back propagation algorithm

The back propagation algorithm tries to compute the weight vector values so that, after the training phase, the network will be able to produce the right output values given a new input pattern taken from the test dataset. Initially, all the synaptic weights are randomized. Then the algorithm repeats the following steps. Firstly, an input pattern Xn is presented to the network and the feed-forward network functions are calculated in every layer, as described in the section above, to get the output values in the output layer. The next step is to measure the distance (error) between the network output y(Xn,w) and the target output tn. To do so, an error function must be defined. The most common error function applied in the backpropagation algorithm for feed-forward networks is the sum-of-squares, defined by:

En = ½ Σr=1..R (trn − orn)²   Eq. 2.10

where R is the number of output variables (output units), t is the target vector, o is the actual network output vector and n is the index of the input pattern (instance/observation). Since the target is to find the weight values which minimize the error function, the change of the error function with respect to the change of the weight vector must be calculated. However, there might be several output nodes (each one corresponding to an element of the target vector), so the gradient of the error function for each output node with respect to its synaptic weights should be calculated as:

∇En = dEn/dwr = (∂En/∂wkr) for k = 1,…,K and r = 1,…,R   Eq. 2.11

where: En is the error function for the output node r given the input pattern n; wr is the synaptic weight vector including all the weight values of the connections between the output node r and the last hidden layer; K is the number of units of the last hidden layer and R is the number of output units.
Using Equation 2.10, the gradient with respect to each weight value is:

∂En/∂wkr = ∂[½ Σr=1..R (tnr − onr)²]/∂wkr   Eq. 2.12

By using the chain rule:

∂En/∂wkr = (∂En/∂or)∙(∂or/∂wkr)   Eq. 2.13

where:

∂En/∂or = −(tr − or)   Eq. 2.14

and (recall Eq. 2.8):

∂or/∂wkr = f′(inpr) qk   Eq. 2.15

where f(∙) is the output activation function and inpr is the linear combination of the weights and the corresponding hidden unit values (the input of the function f in Eq. 2.8). Therefore:

∂En/∂wkr = −(tr − or) f′(inpr) qk   Eq. 2.16

The general goal is to find the combination of weights which leads to the global minimum of the error function. Hence, gradient descent is applied to every synaptic weight as:

wkr = wkr − η ∂En/∂wkr = wkr + η (tr − or) f′(inpr) qk   Eq. 2.17

where η is the learning rate. The process so far resembles perceptron learning (the perceptron criterion). However, backprop differs, since the next step is about calculating each hidden unit's contribution to the final output error and updating the corresponding weights accordingly. The idea is that every hidden unit affects the final output in an indirect way (since each hidden unit is fully connected with all the units of the next layer). Therefore, every hidden unit's interconnection weights should be adjusted to minimize the output error. The difficulty is that there is no target vector for the hidden unit values, which means that the difference (error) between the calculated value and the correct one is not available as it was at the previous step. So, the next synaptic weights to be adjusted are the ones connecting the last hidden layer (with K nodes) with the layer previous to that (with M nodes): wjk for j = 1,…,M and k = 1,…,K. The change of the error function with respect to the change of these weights is given by:

∂En/∂wjk = ∂[½ Σr=1..R (tnr − onr)²]/∂wjk   Eq. 2.18

= −Σr=1..R (tr − or) ∂or/∂wjk   Eq. 2.19

= −Σr=1..R (tr − or) ∂f(inpr)/∂wjk   Eq.
2.20

= −Σr=1..R (tr − or) f′(inpr) ∂inpr/∂wjk

At this point let:

δr = (tr − or) f′(inpr)   Eq. 2.21

so that:

∂En/∂wjk = −Σr=1..R δr ∂inpr/∂wjk   Eq. 2.22

= −Σr=1..R δr ∂(Σc=1..K wcr qc)/∂wjk   Eq. 2.23

= −Σr=1..R δr wkr ∂qk/∂wjk   Eq. 2.24

= −Σr=1..R δr wkr ∂h(inpk)/∂wjk   Eq. 2.25

where inpk is the input of the function h as given in Eq. 2.7, and hence:

∂En/∂wjk = −Σr=1..R δr wkr h′(inpk) ∂inpk/∂wjk   Eq. 2.26

= −Σr=1..R δr wkr h′(inpk) ∂(Σc=1..M wck pc)/∂wjk   Eq. 2.27

= −Σr=1..R δr wkr h′(inpk) pj   Eq. 2.28

= −pj Σr=1..R δr wkr h′(inpk)   Eq. 2.29

Let:

δk = h′(inpk) Σr=1..R δr wkr   Eq. 2.30

Then:

∂En/∂wjk = −pj δk   Eq. 2.31

Given the change of the error function with respect to the change of the weights feeding the last hidden layer, the corresponding weight update is:

wjk = wjk − η ∂En/∂wjk = wjk + η pj δk   Eq. 2.32

Following the same derivation, the updates for the remaining hidden layer synaptic weights are obtained. For example, for the next hidden layer back, the update is:

wij = wij − η ∂En/∂wij = wij + η xi δj, where δj = h′(inpj) Σk=1..K δk wjk   Eq. 2.33

Regarding the amount of the dataset needed to evaluate the gradient of the error function, learning methods can be split into two categories: batch and incremental. Batch methods require the whole training dataset to be processed before evaluating the gradient and updating the weights. On the other side, incremental methods update the weights using the gradient of the error function after each input pattern is processed.
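The analytic gradients derived above (Eqs. 2.16 and 2.31) can be checked numerically by central finite differences. The following sketch assumes a minimal 2-2-1 network with tanh hidden units and an identity output (so f′ = 1); all weight values are made up for illustration:

```python
import math

# Tiny 2-2-1 network: tanh hidden layer h, identity output f (so f' = 1)
x, t = [0.5, -1.0], 1.0
W1 = [[0.1, -0.3], [0.4, 0.2]]   # W1[i][k]: input i -> hidden unit k
b1 = [0.0, 0.1]
W2 = [0.6, -0.2]                 # W2[k]: hidden unit k -> output
b2 = 0.05

def forward(W1_, W2_):
    p = [math.tanh(W1_[0][k]*x[0] + W1_[1][k]*x[1] + b1[k]) for k in range(2)]
    return p, W2_[0]*p[0] + W2_[1]*p[1] + b2

p, o = forward(W1, W2)

# Analytic gradients from the derivation above
d_out = t - o                                   # delta_r (Eq. 2.21, with f' = 1)
grad_w20 = -d_out * p[0]                        # dEn/dw_kr (Eq. 2.16)
d_hid0 = (1 - p[0]**2) * d_out * W2[0]          # delta_k; tanh' = 1 - tanh^2 (Eq. 2.30)
grad_w100 = -x[0] * d_hid0                      # dEn/dw_ij = -x_i * delta_j (cf. Eq. 2.31)

def E(W1_, W2_):
    _, oo = forward(W1_, W2_)
    return 0.5 * (t - oo)**2                    # Eq. 2.10 for a single output

eps = 1e-6
num_w20 = (E(W1, [W2[0]+eps, W2[1]]) - E(W1, [W2[0]-eps, W2[1]])) / (2*eps)
num_w100 = (E([[W1[0][0]+eps, W1[0][1]], W1[1]], W2)
            - E([[W1[0][0]-eps, W1[0][1]], W1[1]], W2)) / (2*eps)

print(abs(num_w20 - grad_w20) < 1e-8)    # True
print(abs(num_w100 - grad_w100) < 1e-6)  # True
```

Agreement between the numerical and analytic values confirms that the chain-rule steps above were carried through consistently.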
Consequently, either batch backpropagation or incremental backpropagation can be applied, as follows:

Batch Backpropagation
Initialize the weights W randomly
Repeat:
  o For every input pattern Xn:
     a. Present Xn to the network and apply the feed-forward propagation functions to all layers.
     b. Find the gradient of the error function with respect to every synaptic weight (∂En/∂wij).
     c. If Xn was the last input pattern of the training dataset move forward, otherwise go back to step a.
  o For every synaptic weight:
     a. Sum up the gradients of the error function with respect to that weight over all patterns: ∂E/∂wij = Σn=1..N ∂En/∂wij
     b. Update the weight using wij = wij − η ∂E/∂wij
Until the stopping criterion is true

Incremental Backpropagation
Initialize the weights W randomly
Repeat:
  o For every input pattern Xn:
     a. Present Xn to the network and apply the feed-forward propagation functions to all layers.
     b. Find the gradient of the error function with respect to every synaptic weight (∂En/∂wij).
     c. Update the weights using wij = wij − η ∂En/∂wij
Until the stopping criterion is true

As stated previously, a trained network should be able to generalize over new input patterns that were not presented during the training period. The most common generalization assessment method involves splitting the total available dataset into a training and a testing dataset. The two sets have no common patterns, but each of them contains an adequate number of patterns from all classes. Reasonably, the training dataset will be larger than the testing one. The training dataset is used to train the network and update the weights. After all training input patterns are presented to the network and its weights are adjusted using backpropagation (either batch or incremental mode), the network applies feed-forward propagation to the testing input instances. For each testing input pattern, the error between the network output and the testing target output is calculated.
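A minimal runnable sketch of the incremental variant above, assuming a small 2-2-1 network with tanh hidden units, an identity output unit, and a made-up toy regression target (none of these choices come from the text itself):

```python
import math, random

def train_incremental(data, eta=0.1, epochs=500, seed=1):
    """Incremental backpropagation for a 2-2-1 network (tanh hidden units,
    identity output): one weight update per presented pattern."""
    rnd = random.Random(seed)
    W1 = [[rnd.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]
    b1 = [0.0, 0.0]
    W2 = [rnd.uniform(-0.5, 0.5) for _ in range(2)]
    b2 = 0.0

    def net(x):
        p = [math.tanh(W1[0][k]*x[0] + W1[1][k]*x[1] + b1[k]) for k in range(2)]
        return p, W2[0]*p[0] + W2[1]*p[1] + b2

    for _ in range(epochs):
        for x, t in data:
            p, o = net(x)
            d_out = t - o                                  # output delta (f' = 1)
            d_hid = [(1 - p[k]**2) * d_out * W2[k] for k in range(2)]
            for k in range(2):
                W2[k] += eta * d_out * p[k]                # w_kr update (Eq. 2.17)
                b1[k] += eta * d_hid[k]
                for i in range(2):
                    W1[i][k] += eta * d_hid[k] * x[i]      # w_ij update (Eq. 2.33)
            b2 += eta * d_out
    return net

# Toy regression target t = 0.5*(x1 - x2); training should reduce the error
data = [([0, 0], 0.0), ([1, 0], 0.5), ([0, 1], -0.5), ([1, 1], 0.0)]
sse = lambda net: sum(0.5*(t - net(x)[1])**2 for x, t in data)
before = sse(train_incremental(data, epochs=0))
after = sse(train_incremental(data, epochs=500))
print(after < before)
```

The batch variant would differ only in accumulating the per-pattern gradients over the whole dataset before applying a single update per epoch.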
Finally, the average testing error can be interpreted as a simple measure of the network's generalization: the smaller the testing error, the better the generalization of the network. If the testing error meets the standards defined by the ANN user, the network training stops and the final weights are used to give any further answers to the modelled problem. On the contrary, if the testing error is larger than a predefined constant, then the training of the network must continue. A method called k-fold cross validation is an alternative generalization evaluation method, and it appeals especially to cases with a scarce dataset. In such cases, where the total dataset has few input instances, sacrificing a number of them to create a testing dataset might lead to deficient learning, since there is a high risk that the synaptic interconnections of the network will not get adequate information about the input space to adjust to. Applying k-fold cross validation requires randomly splitting the initial dataset into k even, disjoint datasets. The training period is also divided into k phases. During each training phase, the network is trained using k − 1 of the datasets, leaving one out, called the validation dataset. After each training phase, the validation dataset is presented to the network, applying feed-forward propagation and calculating the error between the network output and the validation target vector. The process is repeated k times, each time leaving a different sample of the total dataset out as the validation set. At the end, the average of the k validation scores is an indicator of the network's generalization ability. The goal of this method is to exploit the information given by the whole dataset to train the network while at the same time having a generalization estimator for the network's learning progress.
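The fold construction described above can be sketched as follows; the function name and the index-based representation are illustrative choices, not from the text:

```python
import random

def k_fold_splits(n_patterns, k=10, seed=0):
    """Randomly partition pattern indices into k (nearly) even, disjoint folds;
    phase f uses fold f for validation and the remaining k-1 folds for training."""
    idx = list(range(n_patterns))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    for f in range(k):
        val = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, val

# 100 patterns, 10 folds: every phase trains on 90 patterns and validates on 10,
# and each pattern appears in the validation set exactly once across all phases
seen = []
for train, val in k_fold_splits(100, k=10):
    assert len(train) == 90 and len(val) == 10
    assert not set(train) & set(val)            # training and validation are disjoint
    seen += val
print(sorted(seen) == list(range(100)))         # True
```

Averaging the k validation errors obtained from these splits gives the generalization estimate described above.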
Figure 2.3: 10-fold cross validation, where the orange parts represent the validation set at each learning phase and the purple ones the training set. En, where n = 1,…,10, is the corresponding validation error.

By the completion of all 10 learning phases, the total validation (or testing) error will be the average (Σc=1..10 Ec)/10, or any other combination of the validation errors (e.g. simply their sum).

2.5 - Self Organizing Maps

There are some cases of systems for which the existing available data is not classified/clustered/categorized in any way, and thus the correct output/response of the model to an input data pattern is not provided. However, many times certain underlying relations can be extracted, in the sense of clustering the data. The identification of such relations comprises the task of unsupervised learning methods. As mentioned before in this chapter, neural networks can be trained using unsupervised learning. The Self Organizing Map (SOM) is an unsupervised learning neural network paradigm.

2.5.1 - Biology

The SOM model is neurobiologically inspired by the way the cerebral cortex of the brain handles signals from various sources of input (motor, somatosensory, visual, auditory, etc.). The cerebral cortex is organised in regions, and each physical region is assigned to the processing of a different sensory input.

Figure 2.4: Mapping of the visual field (right side) on the cortex (left side)

The visual cortex comprises the part of the cerebral cortex which processes visual information. More specifically, the primary visual cortex, symbolized as V1, is mainly responsible for processing the spatial information in vision. The left part of Figure 2.4 shows the left V1. The right side represents the visual field of the right eye, where the inner circle of the visual field represents all the content captured by the fovea.
Figure 2.4 shows the correlation between the subsets of the visual field and the corresponding visual processing regions on V1. Observing Figure 2.4 makes it obvious that neighbouring regions of the visual field are mapped to neighbouring regions of the cortex. Additionally, an interesting observation is that the visual cortex surface responsible for processing the information coming from the fovea (labelled as "1") is disproportionately large relative to the inner circle surface of the visual field (the fovea). This phenomenon leads to the conclusion that the signals arriving from the foveal visual field need more processing power than the signals from the surrounding regions of the visual field. That is because signals coming from the fovea have high resolution and hence must be processed in more detail. Consequently, it seems that the brain cortex is topologically organized in such a way that it resembles a topologically ordered representation of the visual field.

2.5.2 - Self Organizing Maps Introduction

A Self Organizing Map, as introduced by Kohonen in 1982, is a neural network constituted by an output layer fully connected, through synaptic weights, to the feature variables of an input instance (Figure 2.5). The name of this model concisely explains its data processing mechanism. The neurons of the output layer are most of the time organized in a two-dimensional lattice. During the training phase, output neurons adapt their synaptic weights based on the features of a presented input pattern, without being supervised, by applying competitive learning. Eventually, when an adequate number of input patterns has been shown to the model, the synaptic weights will be organized in such a way that each output neuron responds to a cluster/class of the input space, and the final topological ordering of the synaptic weights resembles the input distribution.
Analogising Kohonen's network to the biological model, the input layer can be thought of as the input signals and the output layer as the cerebral cortex. Hence, this kind of network must be trained in such a way that eventually each neuron, or group of neighbouring neurons, will respond to different input signals. Kohonen's model learning can be divided into two processes: competition and cooperation. Competition takes place right after an input pattern is introduced to the network. A discriminant function is calculated by every artificial neuron and the resulting value provides the basis of the competition; the neuron with the highest discriminant function score is considered the winner. Consequently, the synaptic weights between the presented input vector (stimulus) and the winning neuron will be strengthened, with the scope of minimizing the Euclidean distance between them. Hence, the goal is to move the winning neuron physically towards the corresponding input stimulus. The increase of the winning neuron's synaptic weights alone can also be thought of as a form of inhibition of the rest.

Figure 2.5: A SOM example. The dataset includes N input instances X = X1, ..., XN. Each synaptic weight Wj is a vector of D elements (corresponding to the D features of the Xn input instance vector) connecting each input feature with the jth neuron. The output layer is a 4 x 4 lattice of neurons Oj. Each neuron is indexed according to its position on the lattice (row, column).

The cooperation phase follows, where the indexing position of the winning neuron on the lattice gives the topological information about the neighbourhood of surrounding neurons to be excited. So, eventually, the neuron which was declared the winner through competition cooperates with some of its neighbouring neurons by acting as an excitation stimulus for them.
Essentially, that means that the winning neuron allows its neighbours to increase their discriminant function in response to the specific input pattern, to a certain degree which is assuredly lower than its own synaptic weight strengthening. Ultimately, the trained network is able to transform a multidimensional input signal into a discrete map of one or two dimensions (the output layer). The transformation is driven by the synaptic weight vector of each discrete output neuron, which basically resembles the topological order of some input vectors (forming a cluster) in the input space.

2.5.3 - Algorithm

The whole training process is based on adapting, primarily, the synaptic weight vector of the winning neuron, and thereafter the neighbouring ones, to the features of the introduced input vector. Towards that direction, a sufficient number of input patterns should be presented to the network, iteratively, yet one at a time, initiating this way the competition among the output neurons, followed by the cooperation phase. For the examples given in the rest of the description of the algorithm, refer to the network given in Figure 2.5. In the first place, a D-dimensional input vector Xn is presented at each time to the network:

Xn = x1, ..., xD where n ∈ [1,...,N] and N, D ∈ ℕ1

The output layer is discrete, since each neuron of the output layer is characterized by a discrete value j, which can be thought of as its unique identity or name. Output neurons are indexed, meaning that their position is defined by the coordinates of the lattice. So, for a 2-dimensional lattice, the index of each neuron is given by a two-dimensional vector pj including the row and the column number of the neuron's position on the lattice. For example, the index of neuron o9 is p9 = (0,2) on the map. This can be thought of as the address of the neuron. Each output neuron, symbolized oj, is assigned a D-dimensional synaptic weight vector Wj so that: Wj = w1j, ...
, wDj where j ∈ [1,...,L] and L ∈ ℕ1

L is the number of output neurons, so that L = K x Q for K and Q ∈ ℕ1, defining the number of rows and the number of columns of the map respectively. Output neurons compete with each other by comparing their synaptic weight vectors Wj with the input pattern Xn, trying to find the best match between the two vectors. Hence, the dot product WjXn is calculated for each neuron j = 1, ..., L and the highest value is chosen. The long-term objective of the algorithm is to maximize the quantity WjXn for n = 1,...,N (the whole dataset), which can be interpreted as minimizing the distance given by the Euclidean norm ||Xn − Wj||. Therefore:

i(Xn) = arg minj ||Xn − Wj||, j = 1, ..., L   Eq. 2.34

where i(Xn) is the index of the winning output neuron, i ∈ [1,L], whose synaptic weight vector Wi is the closest to the input vector Xn among all neurons j = 1, ..., L. Note that the input pattern is a D-dimensional vector of continuous elements which is mapped into a two-dimensional discrete space of L outputs using a competition process. So, for the example of Figure 2.5, if the winning neuron of the competition initiated by the input vector Xn is the output neuron j = 15, then i(Xn) is (2,3). By the end of the competition process, the winning neuron "excites" the neurons which lie in its neighbourhood. Two concepts must be clarified at this point: how a neuron's neighbourhood is defined, and how a neuron is excited by another neuron. A neighbourhood of neurons refers, basically, to the topological arrangement of neurons surrounding the winning neuron. Obviously, the neurons positioned close to the winning neuron will be more excited than others positioned at a greater lateral distance. Thus, the excitation effect is graduated with respect to the lateral distance dij between the winning neuron i and any other neuron j, for j = 1, ..., L.
Given that constraint, the neighbourhood is given by a function (called the neighbourhood function) hij which should satisfy two basic properties: (1) it has to be symmetric about dij = 0 (where, of course, dii = 0), and (2) it should be a monotonically decaying function with respect to the distance dij. A Gaussian function exhibits both properties, so the Gaussian neighbourhood function is given as:

hij = exp(−dij² / (2σ²))   Eq. 2.35

where:

dij = ||pi − pj||   Eq. 2.36

Note that, besides the Gaussian, there are other functions which can be chosen to implement the neighbourhood function. The size of the neighbourhood, however, is not fixed, since it decreases with time (discrete time, iterations). When the training begins, the neighbourhood range is quite large, so the function hij returns large values for a big set of neighbouring neurons. As iterations proceed, fewer and fewer neurons get large values from hij, meaning that fewer neurons are excited by the winning neuron. Eventually, the whole neighbourhood may consist of only the winning neuron itself. Given the neighbourhood function of Equation 2.35, the neighbourhood decrease can be embedded by decreasing σ over time, using the function:

σ(c) = σ0 exp(−c/T1) for c = 0, 1, 2, 3, …   Eq. 2.37

where c is the iteration counter, σ0 is the initial value of σ and T1 is a time constant. Therefore Eq. 2.35 can be reformed as:

hij(c) = exp(−dij² / (2σ²(c)))   Eq. 2.38

So, when two similar, but not totally identical, input vectors are presented to the network, there will be two different winning neurons. However, besides the winning neuron's synaptic weights, the neighbours will cooperatively update their own weights to a smaller degree. Hence, clusters of input vectors that might be close to each other, but still different, will be projected onto topologically close neurons on the lattice. That is why cooperation is a necessary process for the correct training of this model.
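A small numeric illustration of Eqs. 2.35–2.38, with hypothetical values for σ0 and T1, showing how the excitation at a fixed lattice distance shrinks as the iterations pass:

```python
import math

def h(d2, c, sigma0=2.0, T1=1000.0):
    """Gaussian neighbourhood value (Eq. 2.38) for squared lattice distance d2
    at iteration c, with sigma decaying as in Eq. 2.37."""
    sigma = sigma0 * math.exp(-c / T1)        # Eq. 2.37
    return math.exp(-d2 / (2 * sigma**2))     # Eq. 2.35 evaluated with sigma(c)

# The excitation of a neuron at lattice distance 2 from the winner shrinks over time
early, late = h(4.0, c=0), h(4.0, c=3000)
print(early > late)          # True: the neighbourhood contracts
print(round(h(0.0, 0), 3))   # 1.0: the winner itself always gets full excitation
```

With these (assumed) constants, a neuron two lattice steps away starts at hij ≈ 0.61 and ends up effectively excluded from the neighbourhood, matching the shrinking behaviour described above.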
A form of how this process is biologically motivated is shown in Figure 2.5. The final step is the weight adaptation, which employs Hebbian learning. Hebbian learning states that when there is simultaneous activation in the presynaptic and postsynaptic areas, the synaptic connection between them is strengthened; otherwise it is weakened. In the case of Kohonen's model, the presynaptic area is the input signal Xn and the postsynaptic area is the output activity of the neuron, noted as yj. Output activity is maximal when the neuron wins the competition, zero when it does not belong to the winner's neighbourhood, and somewhere in between otherwise. So yj can be expressed through the neighbourhood function hij. Thus:

ΔWj ∝ yjXn ∝ hijXn   Eq. 2.39

However, by using the Hebbian mechanism of learning, there is a risk that the synaptic weights might reach saturation. This happens because the weight adaptation occurs in only one direction (towards the input pattern direction), causing all weights to saturate. To overcome this difficulty, a forgetting term −g(yj)Wj is introduced, giving:

ΔWj = ηyjXn − g(yj)Wj   Eq. 2.40

where η is the learning rate parameter. The term ηyjXn drives the weight update towards the input signal direction, making the particular jth neuron more sensitive when this input pattern is presented to the network. On the other side, the forgetting term −g(yj)Wj binds the new synaptic weight within reasonable levels. The function g(yj) must return a positive scalar value, so for simplification purposes it can be chosen so that g(yj) = ηyj. Hence, taking yj = hij, the weight update can be expressed as:

ΔWj = ηhijXn − ηhijWj = ηhij(Xn − Wj)   Eq. 2.41

In discrete-time simulation, the updated weight vector is given by:

Wj(c+1) = Wj(c) + η(c)hij(c)(Xn(c) − Wj(c)) for c = 0, 1, 2, 3, …   Eq. 2.42

where:

η(c) = η0 exp(−c/T2)   Eq. 2.43

where η0 is the initial value of η and T2 is a time constant.
Essentially, the term (Xn(c) − Wj(c)) drives the alignment of Wj(c+1) to the Xn vector at each iteration c. The winning weight vector makes a full step (hij(c) = 1) towards the input signal, while the neighbouring excited synaptic weights move to a smaller degree, defined by the neighbourhood function (hij(c) < 1). Additionally, the update of all weights is restricted by the learning rate η(c). The learning rate changes with respect to time. In the first iterations it is quite large, allowing fast learning, which means the weight vectors move more aggressively towards the input patterns and hence form the rough boundaries of the potential input clusters. As the iteration counter increases, the learning rate decreases, forcing the weight vectors to move more carefully through the space, in order to retrieve all the necessary information given by the input patterns and gently converge to the correct topology. In the end, the topological arrangement of the synaptic weights in the weight space resembles the statistical distribution of the input.

The algorithm in steps:
1. Initialize randomly the weight vectors of all output neurons j = 1,…,L, where L is the number of output neurons lying on the lattice.
2. Present the input vector Xn = (x1,…,xD) to the network, where n = 1,…,N and N is the number of input patterns in the dataset.
3. Calculate WjXn for all output neurons and find the winner using Eq. 2.34.
4. Update all synaptic weights using Equation 2.42.
5. Move to step 2, until there are no significant changes to the weight vectors or the maximum number of iterations is reached.

The learning mechanism can be divided in two phases: (A) the self-organizing topological ordering phase, and (B) the convergence phase. During phase (A), the spatial locations of the output neurons are roughly defined (in the weight space). Each neuron corresponds to a specific domain of input patterns.
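Steps 1–5 can be sketched as follows. The lattice size, decay constants and two-cluster toy data are illustrative assumptions, and, as in Eq. 2.34, the winner is found by minimizing the Euclidean distance rather than literally maximizing the dot product:

```python
import math, random

def train_som(data, rows=4, cols=4, epochs=50, eta0=0.5, sigma0=2.0, seed=0):
    """SOM training sketch (steps 1-5): rows x cols lattice, Gaussian
    neighbourhood (Eq. 2.38) and decaying learning rate (Eq. 2.43)."""
    rnd = random.Random(seed)
    T1 = T2 = epochs
    # Step 1: random weight vectors, keyed by lattice position (row, col)
    W = {(r, c): [rnd.random(), rnd.random()] for r in range(rows) for c in range(cols)}
    for c_iter in range(epochs):
        sigma = sigma0 * math.exp(-c_iter / T1)           # Eq. 2.37
        eta = eta0 * math.exp(-c_iter / T2)               # Eq. 2.43
        for x in data:                                    # Step 2: present Xn
            # Step 3: competition -- winner minimises ||Xn - Wj|| (Eq. 2.34)
            win = min(W, key=lambda j: sum((xi - wi)**2 for xi, wi in zip(x, W[j])))
            # Step 4: cooperation -- update winner and its neighbours (Eq. 2.42)
            for j, w in W.items():
                d2 = (j[0] - win[0])**2 + (j[1] - win[1])**2   # lattice distance^2
                hij = math.exp(-d2 / (2 * sigma**2))           # Eq. 2.38
                for i in range(len(w)):
                    w[i] += eta * hij * (x[i] - w[i])
    return W

# Two well-separated 2-D clusters: after training, their winners should differ
data = [[0.1, 0.1], [0.12, 0.08], [0.9, 0.9], [0.88, 0.92]]
W = train_som(data)
winner = lambda x: min(W, key=lambda j: sum((a - b)**2 for a, b in zip(x, W[j])))
print(winner([0.1, 0.1]) != winner([0.9, 0.9]))
```

After training, inputs from the two clusters excite different lattice positions, which is exactly the topological specialization the two phases below are meant to produce.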
The winning neuron's neighbourhood is large during this phase, allowing neurons which respond to similar input signals to be brought together. As a result, input clusters which have common or similar features are projected onto neurons which lie next to each other on the weight map. However, the neighbourhood shrinks with time, and eventually it could be the case that the neighbourhood comprises only the winner itself. The convergence phase begins when the neighbourhood has become very small. Having a small neighbourhood allows the winning neurons to fine-tune their positions, which were obtained in the topological ordering phase. During this phase the learning rate is also fairly small, so the weights move very slowly towards the input vectors.

2.6 - Kernel Functions

A challenge for the computational intelligence area is to provide good estimators for non-linearly separable data. As indicated by their name, linear functions (e.g. the perceptron algorithm) are incapable of successfully processing non-linear data, and so they are eliminated from the pool of possible solutions to non-linear problems. However, non-linear data become linearly separable if they are transformed to a higher dimensional space (called the feature space), and thus a linear model can be used to solve the problem there. So the previous statement should be rephrased as: linear functions are incapable of successfully processing non-linear data in the input space (the space where the input training examples are located). Hence, given a set of data [X,T]:

X = x1, x2,…, xN where x ∈ ℝD
T = t1, t2,…, tN where t ∈ ℝ

so that:

X = [x1; …; xN] = [x11,…,xd1,…,xD1; …; x1N,…,xdN,…,xDN], T = [t1; …; tN]   Eq. 2.44

a feature mapping:

Φ: x → Φ(x) where Φ(x) := (Φ1(x),…,Φk(x),…,ΦK(x))   Eq. 2.45

and a dataset mapping via Φ, the resulting feature dataset is:

Φ = [Φ(x1); …; Φ(xN)] = [Φ1(x1),…,Φk(x1),…,ΦK(x1); …; Φ1(xN),…,Φk(xN),…,ΦK(xN)], T = [t1; …; tN]   Eq.
2.46

As shown above, the raw dataset is represented by an N x D matrix, while the mapped dataset is an N x K matrix, where K ≥ D. The transformed dataset is now separable in the ℝK feature space by a linear hyperplane. The key to the solution of the inseparability problem lies in the appropriate choice of a feature space of sufficient dimensionality. This is a task that must be accomplished with great care, since a wrong feature mapping will give misleading results. In addition, the whole data transformation process can be computationally very expensive, since data mapping depends highly on the feature space dimensionality in terms of time and memory. That is why constructing features from a given dataset comprises a rather difficult task, which many times demands expert knowledge in the domain of the problem. For example, consider constructing quadratic features for x ∈ ℝ2:

Φ(x) := (x1², √2 x1x2, x2²)   Eq. 2.47

In the case that the data point x is a two-dimensional vector, the new feature vector is three-dimensional, so it does not seem very demanding in terms of time or memory. However, imagine creating quadratic features for a hundred-dimensional dataset. This would lead each new feature data point to be of 5050 dimensions; hence, for every data point of the raw dataset, 5050 values would have to be computed and saved in the transformed data matrix. However, most linear learning algorithms depend only on the dot product between two input training vectors. Therefore, to apply such algorithms in the feature space, the dot product between the transformed data points (features) must be calculated, substituting every dot product between raw input vectors in the algorithm's formulation:

⟨x, x′⟩ → ⟨Φ(x), Φ(x′)⟩   Eq. 2.48

Back to the example presented in Eq. 2.47, the dot product between the quadratic features results in the quantity:

⟨Φ(x), Φ(x′)⟩ = x1²x1′² + 2x1x2x1′x2′ + x2²x2′² = ⟨x, x′⟩²   Eq. 2.49

As observed in Eq.
2.49, the dot product of two quadratic features in ℝ2 is equal to the square of the dot product between the corresponding raw data patterns. So, an alternative to explicit feature construction and computation is to implicitly compute the dot product between two features, without defining in any way the exact feature mapping routine. A function can be defined instead which accepts two input vectors and returns the dot product of their corresponding features in the feature space. Such functions are called kernels and were first introduced by (Aizerman et al, 1964). Generally, considering a function Φ(x) which maps the input data into a feature space, a kernel is a symmetric function defined as:

K: X × X → ℝ   Eq. 2.50

which satisfies the following property:

K(x, x′) = ⟨Φ(x), Φ(x′)⟩   Eq. 2.51

Furthermore, the kernel function can be represented by a kernel matrix (also referred to as the Gram matrix) for a given dataset of size N, defined in terms of the dot product in the feature space as:

K(i, j) = ⟨Φ(xi), Φ(xj)⟩ where i, j = 1,…,N   Eq. 2.52

In simple words, a kernel function applies a direct transformation of the dot product relation to a higher dimensionality feature space. Obviously, not every function comprises a good kernel function. A good kernel function should satisfy some properties, among which these two are the most important:

1) The function must be symmetric, so that K(x, x′) = K(x′, x), since the dot product satisfies symmetry: ⟨Φ(x), Φ(x′)⟩ = ⟨Φ(x′), Φ(x)⟩.

2) The corresponding kernel matrix must be a positive semi-definite matrix for all choices of the dataset X = X1, X2,…, XN (Shawe-Taylor and Cristianini, 2004), which means that:

aT K a ≥ 0 for all a ∈ ℝN and a kernel matrix K ∈ ℝN×N   Eq. 2.53

so:

Σi=1..N Σj=1..N K(xi, xj) ai aj ≥ 0 for any real coefficients ai, aj   Eq. 2.54

This can be proven as:

Σi Σj ai aj ⟨Φ(xi), Φ(xj)⟩ = ⟨Σi=1..N ai Φ(xi), Σj=1..N aj Φ(xj)⟩ = || Σj=1..N aj Φ(xj) ||² ≥ 0   Eq.
2.55

The distance between two features lying in the feature space can also be expressed in terms of the corresponding kernel, as shown in Eq. 2.56:

d(x, x′)² = ||Φ(x) − Φ(x′)||² = ⟨Φ(x), Φ(x)⟩ − 2⟨Φ(x), Φ(x′)⟩ + ⟨Φ(x′), Φ(x′)⟩ = K(x, x) − 2K(x, x′) + K(x′, x′) Eq. 2.56

Additionally, many linear models, expressed in dual form, can be reformulated in terms of a kernel function, thereby eliminating the feature mapping function from the formulation of the model. The perceptron algorithm is revisited here to show how dual representations may be expressed using kernels. The perceptron algorithm optimizes the perceptron criterion function given in Eq. 2.4. Let the initial weight vector w⁰ be the zero vector. By minimization of the perceptron criterion, the optimized weight vector will be equal to:

w* = Σ_{i=1}^M a*_i t_i x_i Eq. 2.57

while if the input data are mapped to a feature space:

w* = Σ_{i=1}^M a*_i t_i Φ(x_i) Eq. 2.58

where in both Eq. 2.57 and Eq. 2.58:

- a_i > 0 only for the misclassified patterns and zero for the rest. Each a_i represents the contribution of x_i to the weight updates. It is often called the embedding strength of x_i and is proportional to the number of times the particular pattern was misclassified (causing the weights to change).
- M is the total number of misclassified training patterns.

By replacing the optimum weight vector in the initial perceptron formulation, the final decision boundary becomes:

y = sign(⟨w*, Φ(x)⟩) = sign(⟨Σ_{i=1}^M a*_i t_i Φ(x_i), Φ(x)⟩) = sign(Σ_{i=1}^M a*_i t_i ⟨Φ(x_i), Φ(x)⟩) Eq. 2.59

Two things are noticeable in Eq. 2.59. Firstly, the dot product between the features appears, allowing its substitution by the kernel matrix K(i, j). Secondly, the weight term w is absent from the decision boundary function, so the term which must be optimized is the vector a = {a₁, …, a_N}.
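A minimal sketch of this dual-form (kernel) perceptron, updating the embedding strengths a directly (the dataset and kernel are illustrative; only kernel evaluations appear, never Φ):

```python
import numpy as np

def kernel_perceptron(X, t, kernel, epochs=100):
    # dual-form perceptron: learn embedding strengths a instead of w
    N = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    a = np.zeros(N)
    for _ in range(epochs):
        mistakes = 0
        for j in range(N):
            s = np.sum(a * t * K[:, j])   # sum_i a_i t_i K(x_i, x_j)
            if t[j] * s <= 0:             # misclassified (or undecided)
                a[j] += 1.0               # embedding strength update
                mistakes += 1
        if mistakes == 0:                 # converged on the training set
            break
    return a

def predict(X, t, a, kernel, x_new):
    s = sum(a[i] * t[i] * kernel(X[i], x_new) for i in range(len(X)))
    return 1 if s >= 0 else -1

# XOR-like data: not linearly separable in the input space,
# but separable with a quadratic kernel
X = [np.array(p) for p in [(1., 1.), (-1., -1.), (1., -1.), (-1., 1.)]]
t = np.array([1., 1., -1., -1.])
quad = lambda u, v: np.dot(u, v) ** 2
a = kernel_perceptron(X, t, quad)
preds = [predict(X, t, a, quad, x) for x in X]
```

The XOR-like labels here cannot be produced by any linear boundary in ℝ², yet the dual perceptron separates them because the quadratic kernel implicitly works in the three-dimensional feature space of Eq. 2.47.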
Since each element of this vector represents the contribution of the corresponding pattern to the weight updates, the elements are initialized to zero and the update step of the perceptron algorithm becomes:

if t_j (Σ_i a_i t_i ⟨Φ(x_i), Φ(x_j)⟩) ≤ 0 then a_j = a_j + 1 (or a_j = a_j + η if a learning rate is used)

Therefore, for a fixed input training set, the decision boundary can be formulated using either w or a. However, if the perceptron is to be used for non-linear data classification, the dual-form representation allows convergence to a solution expressed explicitly in terms of kernels, avoiding any need to define the feature function Φ(·). In summary, a linear function can be applied to non-linear data just by substituting the corresponding kernel matrix entry for every dot product of input data vectors, while defining the exact feature mapping on the input data is of no interest. The only requirement is the ability to evaluate the dot product of features in some feature space (of any dimensionality), and a good kernel satisfies this need.

2.7 - Support Vector Machines

The Support Vector Machine (SVM) is a learning machine. Like the perceptron model, the SVM learns over a labeled training dataset, locates a hyperplane which separates the input space into two classes, and is then able to generalise over new input patterns. It is categorized as a linear machine learning model and therefore fails to separate correctly non-linear data in the input space. However, this problem is easily solved by using kernels, so that the SVM separates the data with a hyperplane in the feature space. As a result, having chosen an appropriate feature space (through kernel function selection) of sufficient dimensionality, any consistent training set can be made separable.

2.7.1 - SVM for two-class classification

Finding a hyperplane which will successfully separate two classes of data is only one part of the SVM's principal goal.
What is of more interest is finding the one hyperplane which minimizes the generalization error. In some cases there are multiple possible hyperplanes in the input space which can separate the input data. Hence, for each classification problem there is a set of solutions separating the input training data exactly into the right classes, but only one will achieve the smallest generalization (testing) error. This solution is the one offered by the SVM learning machine.

Figure 2.6: Binary classification problem: there are many hyperplane solutions to one problem. Which one should be chosen?

The SVM approach is motivated by statistical learning theory about generalization (error) bounds, which states that among all possible linear functions, the one which minimizes the generalization error is the same one which maximizes the margin of the training set. The margin is defined as the minimal distance between the decision boundary and the closest data points of each class (Fig. 2.7). Therefore, what the SVM essentially does is provide the one hyperplane with the maximum distance from the closest class points on both sides.

Figure 2.7: The decision boundary chosen by SVM.

Consider the dataset of Eq. 2.44. The Support Vector Machine model separates the input data vectors into two classes in the D-dimensional input space. The location of the separating hyperplane in the input space is defined by the positions of the input data points (vectors) of both classes which maximize the margin. These vectors are said to support the position of the classifier, as they keep its position fixed in the input space; that is why they are called support vectors, while all the remaining input vectors are non-support vectors. In Fig. 2.7 the support vectors of the particular model are highlighted with a light glow around them.
The separating hyperplane behaves as if it were connected to the support vectors by an underlying relationship, so that if the support vectors shift then the separating classifier shifts as well! On the contrary, if the non-support vectors shift, the hyperplane position remains invariant. Consequently, the Support Vector Machine is a learning machine which achieves learning from training input data by identifying the support vectors among the training data vectors and, based on their position, defining the decision boundary. Recall that a hyperplane in an arbitrary dimensional space is given by:

w·x + b = 0 Eq. 2.60

where b is the bias and w the weight vector. The SVM decision function for a binary classification problem is:

y = sign(w·x + b) Eq. 2.61

for labeled outputs y ∈ {+1, −1}, and it is identical to the perceptron decision function. Apparently this decision function does not depend on the magnitude of the quantity w·x + b, since the sign defines the final decision. Therefore, any positive rescaling of the terms w → λw and b → λb has no effect on the model's decision. Essentially, the idea is to implicitly fix the scale of the input space in such a way that the (functional) distance between the hyperplane and the support vectors equals one:

γ = margin/2 = 1 (shown in Fig. 2.8) Eq. 2.62

Hence, the decision value for the positive labeled support vectors will be exactly +1, whereas for the negative labeled support vectors it will be exactly −1. Suppose now there is a positive and a negative labeled support vector, denoted x₁ and x₂ respectively (Fig. 2.8):

w*·x₁ + b* = +1 Eq. 2.63

w*·x₂ + b* = −1 Eq. 2.64

where the terms w* and b* describe the canonical form of the separating hyperplane, meaning they are suitably normalized to give the desired result for the support vectors of both classes. Given w* and b*, the margin is equal to the length of the vector resulting from the projection of (x₁ − x₂) onto the normal vector to the hyperplane.
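The insensitivity of the decision function Eq. 2.61 to positive rescaling can be verified directly (all values here are arbitrary illustrations):

```python
import numpy as np

rng = np.random.RandomState(1)
w, b = np.array([2.0, -1.0]), 0.5   # an arbitrary hyperplane
X = rng.randn(8, 2)                 # arbitrary input patterns
lam = 3.7                           # any positive scale factor

y_original = np.sign(X @ w + b)
y_rescaled = np.sign(X @ (lam * w) + lam * b)   # w -> lam*w, b -> lam*b
unchanged = np.array_equal(y_original, y_rescaled)
```

Since only the sign matters, the scale λ can be chosen freely, and the canonical form uses that freedom to pin the support vectors at ±1.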
Since the weight vector is always perpendicular to the decision boundary, the unit normal vector is w/||w||. Let c be the vector resulting from the projection of (x₁ − x₂) onto the direction of the normal vector. Subtracting the canonical hyperplane equations for the support vectors x₁ and x₂:

w*·x₁ + b* − w*·x₂ − b* = 1 + 1 = 2 ⇒ w*·(x₁ − x₂) = 2 Eq. 2.65

the margin is then given as:

margin = ||c|| = w·(x₁ − x₂)/||w|| = 2/||w|| Eq. 2.66

and the distance between any support vector and the decision surface is:

γ = margin/2 = 1/||w|| Eq. 2.67

The SVM searches for the maximum margin by maximizing the quantity γ = 1/||w||, which is equivalent to minimizing ||w||², giving a convex minimization problem:

Φ(w) = ½||w||² Eq. 2.68

subject to the constraint functions:

g_i(w) = t_i(w·x_i + b) ≥ 1 for i = 1, …, N Eq. 2.69

Since this is a constrained optimization problem, the primal Lagrangian function can be used to find the optimum w* and b*:

L(w, b) = ½||w||² − Σ_{i=1}^N a_i [t_i(w·x_i + b) − 1] Eq. 2.70

where a is the vector of Lagrange multipliers. Setting the gradients with respect to w and b to zero:

∇_w L = w − Σ_{i=1}^N a_i t_i x_i = 0 ⇒ w = Σ_{i=1}^N a_i t_i x_i Eq. 2.71

∇_b L = − Σ_{i=1}^N a_i t_i = 0 ⇒ Σ_{i=1}^N a_i t_i = 0 Eq. 2.72

Therefore, by substituting Eq. 2.71 and Eq. 2.72 back into the primal objective function Eq. 2.70, the Wolfe dual problem is formalized as:

W(a) = Σ_{i=1}^N a_i − ½ Σ_{i=1}^N Σ_{j=1}^N a_i a_j t_i t_j ⟨x_i, x_j⟩ Eq. 2.73

subject to the constraints:

Σ_{i=1}^N a_i t_i = 0 and a_i ≥ 0 for i = 1, …, N Eq. 2.74

It is important that the dual optimization function W(a) is quadratic and the constraints are linear functions defining a convex region, from which it follows that any local minimum will be a global minimum. This is an example of a quadratic programming problem, and there exist many techniques offering solutions to quadratic programming problems.
Since all the training patterns are used during the "training phase" (the determination of a), efficient algorithms for solving quadratic programming problems should be chosen. By applying quadratic programming to the problem, the optimal vector of Lagrange multipliers a, which maximizes the dual optimization problem W(a), is derived. Since w is given as a linear combination weighted by the Lagrange multipliers a_i (Eq. 2.71) for i = 1, …, N, the decision function of the SVM can now be rewritten as:

y = sign(Σ_{i=1}^N a_i t_i ⟨x_i, x⟩ + b) Eq. 2.75

and thus any new data pattern can be classified by this function. In addition, according to the Karush-Kuhn-Tucker (KKT) conditions for this kind of constrained optimization problem, it holds that:

a_i ≥ 0 for i = 1, …, N Eq. 2.76

t_i y_i − 1 ≥ 0 for i = 1, …, N Eq. 2.77

a_i (t_i y_i − 1) = 0 for i = 1, …, N Eq. 2.78

where y_i is the result of the decision function for the training data pattern x_i. By inference, for any training data point in the input space either a_i = 0 or t_i y_i = 1. If the latter case applies, then the particular input data pattern x_i is a support vector (Eq. 2.63 and 2.64). This leads to the conclusion that the first case (a_i = 0) is valid only for the non-support vectors. Thus, the non-support vectors play no role in decision making for new data points! So, after the SVM is trained, only the support vectors (for which a_i > 0) are retained, while all the rest are excluded from the decision function, reducing the computational effort of the model. Furthermore, having t_i y_i = 1 for all the support vectors allows the determination of the parameter b. Let S and N_s be the set of support vectors and their total number respectively. For a support vector x_j it holds that:

y_j = Σ_{i∈S} a_i t_i ⟨x_i, x_j⟩ + b ⇒ Eq. 2.79

t_j y_j = t_j (Σ_{i∈S} a_i t_i ⟨x_i, x_j⟩ + b) ⇒ Eq. 2.80

t_j (Σ_{i∈S} a_i t_i ⟨x_i, x_j⟩ + b) = 1 ⇒ Eq. 2.81

t_j t_j (Σ_{i∈S} a_i t_i ⟨x_i, x_j⟩ + b) = t_j ⇒ Eq.
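As an illustration of the quadratic programming step (a sketch using a general-purpose SciPy solver, not a production SVM solver; the dataset and helper names are made up), the dual W(a) of Eq. 2.73 can be maximised under the constraints of Eq. 2.74, with b recovered from the support-vector condition t_i(w·x_i + b) = 1:

```python
import numpy as np
from scipy.optimize import minimize

def svm_train_hard(X, t):
    # maximise W(a) of Eq. 2.73 subject to Eq. 2.74 (via SLSQP)
    N = len(X)
    G = (t[:, None] * t[None, :]) * (X @ X.T)   # t_i t_j <x_i, x_j>
    neg_W = lambda a: -(a.sum() - 0.5 * a @ G @ a)
    res = minimize(neg_W, np.zeros(N),
                   bounds=[(0.0, None)] * N,                 # a_i >= 0
                   constraints=({'type': 'eq',
                                 'fun': lambda a: a @ t},))  # sum a_i t_i = 0
    a = res.x
    w = (a * t) @ X                    # Eq. 2.71
    sv = a > 1e-5                      # support vectors: a_i > 0
    b = np.mean(t[sv] - X[sv] @ w)     # from t_i (w.x_i + b) = 1
    return w, b, a

# a tiny linearly separable problem
X = np.array([[2., 0.], [3., 1.], [0., 0.], [-1., 0.]])
t = np.array([1., 1., -1., -1.])
w, b, a = svm_train_hard(X, t)
preds = np.sign(X @ w + b)
```

For this toy set the closest opposing points are (2, 0) and (0, 0), so the maximum-margin boundary sits at x₁ = 1 with margin 2/||w|| = 2; only those two points end up with a_i > 0.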
2.82

Σ_{i∈S} a_i t_i ⟨x_i, x_j⟩ + b = t_j Eq. 2.83

Summing over all support vectors:

Σ_{j∈S} [Σ_{i∈S} a_i t_i ⟨x_i, x_j⟩ + b] = Σ_{j∈S} t_j Eq. 2.84

Σ_{j∈S} Σ_{i∈S} a_i t_i ⟨x_i, x_j⟩ + N_s b = Σ_{j∈S} t_j Eq. 2.85

Then, averaging over all support vectors:

b = (1/N_s) Σ_{j∈S} [t_j − Σ_{i∈S} a_i t_i ⟨x_i, x_j⟩] Eq. 2.86

The SVM model, as presented so far, works only for linearly separable problems, but it can easily be extended to non-linear problems by solving the problem in a feature space defined by a kernel function. The Wolfe dual optimization function then becomes:

W(a) = Σ_{i=1}^N a_i − ½ Σ_{i=1}^N Σ_{j=1}^N a_i a_j t_i t_j K(i, j) Eq. 2.87

and the decision function uses the corresponding kernel function to classify any new data point as:

y = sign(Σ_{i=1}^N a_i t_i K(x_i, x) + (1/N_s) Σ_{j∈S} [t_j − Σ_{i∈S} a_i t_i K(i, j)]) Eq. 2.88

Thus, for a non-linear input space, a proper kernel should be chosen and the kernel matrix created from the training data. Then all the steps described above are followed, with one variation from the classic SVM approach: every dot product between two input data vectors must be replaced by the corresponding entry of the kernel matrix. After all, an element of the kernel matrix is the result of the dot product of the correspondingly indexed features. Once the training of the SVM is done by solving the constrained optimization problem of Eq. 2.87 with quadratic programming, the Lagrange multipliers a_i, i = 1, …, N, are optimized. At this point the kernel matrix can be deallocated, keeping only the entries which correspond to the support vectors. Finally, any new data point can be classified with the decision function of Eq. 2.88, deciding whether it belongs to class +1 or −1. So far the SVM model seems to respond successfully to both linear and non-linear problems. The success recipe is clear and understandable: apply some statistical learning theory with particular mathematical optimization methods.
Combine all these with a pinch of the kernel trick, and there it is: one powerful tool for binary classification. However, there are some further aspects to be considered. For example, how does this tool handle an outlier? Moving one step forward, can this model be adjusted to multi-class problems?

Figure 2.8: The outlier's effect on the localisation of the decision boundary during the training phase. The left plot shows the normal case, while the right one adds an outlier to the input space.

Datasets are mostly built by humans. There is often a large number of parameters to be filled in for each case (data pattern), measurements have to be made, and lastly each case must be assigned a label defining the class it belongs to. It is also a fact that bad communication between collaborators, tiredness and many other factors increase the chances of human error. As a result, a pattern may be labeled with the wrong class, a measurement of a parameter may be altered or mixed up with another one, and so on. Since SVMs (like all other learning machines) learn from data, such incorrectly defined datasets drive the model to exhibit a large generalization error. An outlier is a data point which is isolated from the rest of the data points of its class in the input space. Although there is a small chance that such a point is valid, most of the time it indicates an error in the dataset. The SVM as described above would be strongly influenced by an outlier and would try to adjust the separating hyperplane based on the outlier's position. Figure 2.8 presents two cases of the same dataset. In the first case, the dataset is clean of outliers; the SVM is trained on this data and the optimum separating hyperplane is found. In the second case there exists a single outlier. After the SVM is trained with the outlier included in the dataset, the optimum decision boundary is totally differently oriented and localized.
If a new input pattern needs to be classified (the black dot in Figure 2.8), in the first case it will be classified as a red-class object (a correct classification), while in the second case, in which the hyperplane was greatly influenced by the outlier, it will be assigned to the green class, which is wrong. If this is the effect of a single outlier, imagine how harmful a noisier dataset can be for a classification model. Essentially, what is needed is to eliminate the participation of the outliers during the training process of the SVM. Recall that the margin (for SVM) is the minimum distance between the patterns of the two classes. The model is said to have a hard margin if, after training, no data point lies in the region of the input space defined by the margin. The effect of the outliers may be decreased by allowing the model to have a soft margin instead. This means that a number of data points are allowed to remain inside the margin region, or even in the opposite class's region (while still retaining their label). These data points are penalized according to the distance of their position from the boundary of their class. Let ξ_i ≥ 0 for i = 1, …, N be the slack variables. For data points which lie in the interior region of their class or on its canonical hyperplane, the corresponding slack variables will be ξ_i = 0, whereas for the others ξ_i = |t_i − y_i|. There are two main approaches for introducing a soft margin to the SVM, named the L1 error norm and the L2 error norm, and both use these slack variables. Here only the L1 error norm will be presented.

L1 error norm

With the slack variables, the classification constraints are replaced by:

g_i(w) = t_i(w·x_i + b) where Eq. 2.89

g_i(w) ≥ 1 − ξ_i Eq. 2.90

ξ_i ≥ 0 for i = 1, …, N Eq. 2.91

The correctly positioned data patterns will eventually be assigned a slack variable ξ_i = 0.
The rest are divided into two categories: first, data points lying within the space between the canonical hyperplane of their class and the decision boundary, which get 0 < ξ_i ≤ 1, and second, those on the wrong side of the decision boundary, whose slack variables will be ξ_i > 1. The final goal is to maximize the margin while at the same time minimizing the slack variables, i.e. minimizing the distance of the outliers from their class region; this is why the slack variables can also be characterized as error measures. Both of these goals can be combined into one function which must be minimized:

Φ(w) = ½||w||² + C Σ_{i=1}^N ξ_i Eq. 2.92

where the term C > 0 controls the trade-off between maximizing the margin and minimizing the misclassified patterns (errors). Hence, the primal Lagrangian optimization function becomes:

L(w, b, ξ) = ½||w||² + C Σ_{i=1}^N ξ_i − Σ_{i=1}^N a_i [t_i(w·x_i + b) − 1 + ξ_i] − Σ_{i=1}^N r_i ξ_i Eq. 2.93

where a_i ≥ 0 and r_i ≥ 0 are the Lagrange multipliers. Setting the gradients with respect to w, b and ξ_i to zero:

∇_w L = w − Σ_{i=1}^N a_i t_i x_i = 0 ⇒ w = Σ_{i=1}^N a_i t_i x_i Eq. 2.94

∇_b L = − Σ_{i=1}^N a_i t_i = 0 ⇒ Σ_{i=1}^N a_i t_i = 0 Eq. 2.95

∇_{ξ_i} L = C − a_i − r_i = 0 ⇒ a_i = C − r_i Eq. 2.96

Since a_i ≥ 0 and r_i ≥ 0, a_i is largest when r_i = 0, where a_i = C, and so:

0 ≤ a_i ≤ C Eq. 2.97

Therefore, by substituting Eq. 2.94 and Eq. 2.95 back into the primal objective function Eq. 2.93, the Wolfe dual problem is:

W(a) = Σ_{i=1}^N a_i − ½ Σ_{i=1}^N Σ_{j=1}^N a_i a_j t_i t_j ⟨x_i, x_j⟩ Eq. 2.98

subject to the constraints:

Σ_{i=1}^N a_i t_i = 0 and 0 ≤ a_i ≤ C Eq. 2.99

Again, following exactly the same steps as previously, the optimum a is calculated through quadratic programming. The decision function remains the same as Eq. 2.75 (or its feature-space version; for a linear problem simply use the linear kernel, which defines K(x_i, x_j) = ⟨x_i, x_j⟩). As before, the non-support vectors will have a_i = 0.
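The only algorithmic change relative to the hard-margin dual is the box constraint of Eq. 2.99. A sketch using the same general-purpose SciPy solver as before (dataset and helper names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def svm_train_soft(X, t, C=10.0):
    # Wolfe dual of Eq. 2.98 with the box constraint 0 <= a_i <= C (Eq. 2.99)
    N = len(X)
    G = (t[:, None] * t[None, :]) * (X @ X.T)
    neg_W = lambda a: -(a.sum() - 0.5 * a @ G @ a)
    res = minimize(neg_W, np.zeros(N),
                   bounds=[(0.0, C)] * N,                   # the only change
                   constraints=({'type': 'eq',
                                 'fun': lambda a: a @ t},))
    a = res.x
    w = (a * t) @ X
    sv = (a > 1e-5) & (a < C - 1e-5)    # margin support vectors (xi_i = 0)
    b = np.mean(t[sv] - X[sv] @ w)
    return w, b, a

X = np.array([[2., 0.], [3., 1.], [0., 0.], [-1., 0.]])
t = np.array([1., 1., -1., -1.])
w, b, a = svm_train_soft(X, t, C=10.0)
```

On separable data with a large C, the soft-margin solution coincides with the hard-margin one; lowering C caps the influence any single (possibly outlying) point can exert, since its multiplier can never exceed C.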
The support vectors will have a_i > 0 and furthermore must satisfy the constraint defined in Eq. 2.90:

t_i(w·x_i + b) ≥ 1 − ξ_i Eq. 2.100

The KKT conditions of this optimization problem are:

a_i ≥ 0 for i = 1, …, N Eq. 2.101

r_i ≥ 0 for i = 1, …, N Eq. 2.102

ξ_i ≥ 0 for i = 1, …, N Eq. 2.103

r_i ξ_i = 0 for i = 1, …, N Eq. 2.104

t_i y_i − 1 + ξ_i ≥ 0 for i = 1, …, N Eq. 2.105

a_i (t_i y_i − 1 + ξ_i) = 0 for i = 1, …, N Eq. 2.106

For a support vector which is positioned on the canonical hyperplane of its class, 0 < a_i < C; then the corresponding r_i > 0 (Eq. 2.96), which implies that ξ_i = 0 (Eq. 2.104). If a support vector lies inside the margin then a_i = C, which implies that it is either correctly classified, with ξ_i ≤ 1, or misclassified, with ξ_i > 1.

Figure 2.9: The same problem as in Fig. 2.8. The SVM penalizes the outlier with a large-valued ξ_i, so that the particular data point has no influence on the classifier's position.

2.7.2 - Multi-class problems

Up to this point, only binary classification problems were considered for the SVM. In practice, however, many problems contain more than two classes of data points, and the binary SVM alone offers no solution. Several approaches have been suggested for extending the SVM algorithm to multi-class problems. In this study, two of the most well-known multi-class classification heuristics are discussed. The first approach, named one-versus-the-rest, demands the training of K classifiers, where K is the number of classes. Each classifier is trained using the binary SVM model, giving positive results for a particular class C_k of data points and negative results for all the other K−1 classes. Finally, the trained classifiers return their decisions for a novel data point x, and the new point is assigned to the class of the classifier with the largest-valued decision y.
However, although the training dataset is the same, each classifier learns over a totally different data setup in the input space. As a result, there is no guarantee that the results coming from each classifier's decision function are properly scaled so as to be comparable when choosing the largest value. Not only that, but there is a risk that an object might be assigned to more than one class at the same time. Added to this, even if the initial dataset is balanced, with an equal number of objects per class, the one-versus-the-rest method makes it greatly imbalanced: each classifier is trained on one class of positive examples against K−1 classes of negative ones.

Figure 2.10: A three-class problem. Following the one-versus-the-rest approach, three classifiers must be trained.

The second approach is one-versus-one, and it is one of the simplest schemes for solving multi-class problems. This method creates all possible pairs of the dataset's classes, and a classifier is trained on each pair of classes; thus, for K classes there will be K(K−1)/2 classifiers. For classifying a new data pattern, the decision process is best described by a directed acyclic graph (DAG), as shown in Figure 2.11 (for a three-class problem). At each node of the graph, the new input pattern is classified between two classes, and according to the class decision it moves to the next level of the graph. The leaves of the graph (last level) are the final class decisions of the model. For each new data point, the final path through the graph has length equal to K−1.

Figure 2.11: One-versus-one scheme for solving multi-class problems. In this example there are three classes of data, so two pairwise comparisons will take place for each novel input pattern to be classified.
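A sketch of the one-versus-one scheme (here with majority voting over the K(K−1)/2 pairwise decisions rather than the DAG traversal, and with a simple nearest-centroid rule standing in for each trained binary SVM; all names and data are illustrative):

```python
import numpy as np
from itertools import combinations

class PairwiseClassifier:
    # stand-in for one trained binary SVM over a pair of classes
    def __init__(self, Xa, Xb, label_a, label_b):
        self.ma, self.mb = Xa.mean(axis=0), Xb.mean(axis=0)
        self.la, self.lb = label_a, label_b

    def predict(self, x):
        da = np.linalg.norm(x - self.ma)
        db = np.linalg.norm(x - self.mb)
        return self.la if da <= db else self.lb

def train_one_vs_one(data):   # data: {class_label: array of patterns}
    # one classifier per class pair -> K(K-1)/2 classifiers
    return [PairwiseClassifier(data[a], data[b], a, b)
            for a, b in combinations(sorted(data), 2)]

def classify(classifiers, x):
    votes = [c.predict(x) for c in classifiers]
    return max(set(votes), key=votes.count)   # majority vote

# three well-separated clusters (K = 3 -> 3 pairwise classifiers)
rng = np.random.RandomState(0)
data = {0: rng.randn(10, 2) + [0, 0],
        1: rng.randn(10, 2) + [8, 0],
        2: rng.randn(10, 2) + [0, 8]}
clfs = train_one_vs_one(data)
label = classify(clfs, np.array([7.5, 0.5]))
```

Swapping the voting step for the DAG traversal of Figure 2.11 would reduce the number of comparisons per query from K(K−1)/2 to K−1.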
Obviously the tree's size increases with the number of classes, and as a result this approach becomes expensive in terms of time and space for problems with a large number of classes. Overall, the SVM is a machine learning tool which is very well suited to binary classification. For a separable dataset, the SVM is guaranteed to identify the best classifier according to the statistical learning theory of margins. The introduction of the kernel trick enhances the power of the model, extending its application to non-linearly separable datasets without incurring any computational cost from the dimensionality of the feature space. Although the SVM is so well set up for binary class problems, it has no such impact on multi-class problems, since the proposed heuristic approaches may exhibit inconsistent results or may be extremely computationally demanding. It is also important to mention that SVMs are also used for regression problems. SVM regression is an extension of linear regression applying the same steps as for classification, solving the dual optimization problem for the Lagrange multipliers a.

Chapter 3 - Fuzzy Logic

"Fuzzy logic is a tool for embedding structured human knowledge into workable algorithms" (Kecman, 2001). The way humans think is approximate rather than exact. As a result, expressions like "a little", "not so much", "too many", "so and so", etc., are used in all languages to express the fuzziness which characterizes human reasoning and thinking. Such expressions do not give precise information or quantities about what they refer to, but they still comprise an adequate description of it, giving the necessary information to "get the feeling". Real-life concepts are not binary; rather, their description is graduated from absolutely false to absolutely true. Therefore, people are not just beautiful or ugly, good or bad; the weather is not just absolutely sunny or absolutely rainy; food is not simply tasteful or tasteless; and so on.
Most of the time there is an intermediate state for every concept in human understanding, which classical set theory is unable to handle. This observation led to the emergence of Fuzzy Logic theory by the scientist Lotfi A. Zadeh, back in 1965. Fuzzy logic (FL) is about encoding human reasoning by providing a mathematical representation of linguistic knowledge. FL achieves this by introducing fuzzy sets of classes which are not independent or disjoint; their boundaries are fuzzy and indistinct, diverging from classical set theory. In contrast to crisp sets, there are no strict, precise or exact criteria defining whether an object belongs to a fuzzy set or not. Consequently, an object belongs to a class (represented by a fuzzy set) to some grade. In notation, a classical set can be defined by:

1) A list of elements
S_colours = {red, orange, green, blue, yellow, purple}

2) Specifying some property
S_A = {x ∈ S_colours | x == red}
S_B = {x ∈ S_colours | x == blue}
S_C = {x ∈ S_colours | x == yellow}

3) A membership function
μ_SA(x) = 1 if x ∈ S_A, 0 if x ∉ S_A

The membership function of a classical set satisfies the principle of bivalence, since it returns only true (1) or false (0), excluding any other truth value. Contrarily, fuzzy membership functions may return real values other than zero and one, escaping the bivalent constraint. Hence, if an object x ∈ X, then the membership function d = μ_S(x) is essentially a mapping between X and the interval [0, 1]. Membership functions of fuzzy sets assign an object to a fuzzy set with a truth value in [0, 1] describing the degree of belonging. A zero degree totally excludes the object from the set, and a full degree (one) means that the object is outright a member of the specific set. Considering the example given in Figure 3.1, there are three sets, each describing objects of a particular colour. More specifically, the classes of red, blue and yellow comprise the universe of discourse.
The task is to assign an orange, a purple and a green object to a colour class. Using classical set theory (the representation on the left), the three new objects are rejected, as their membership function returns zero for all three fixed colour sets. However, in reality, orange is produced by mixing red and yellow, green by blue and yellow, and purple by red and blue. So orange contains a little red and a little yellow, and so on. If the colour classes are modelled with fuzzy sets, then orange, green and purple belong to a combination of the colour sets, to a specific degree given by the membership function of each colour set. Note that none of the three membership functions will ever return 1 for any of the three new coloured objects, since they are not totally red, totally blue, nor totally yellow.

Figure 3.1: A, B and C represent three colour sets: the red, blue and yellow class respectively. Using crisp set logic, the orange, purple and green objects cannot become members of any of these sets, whereas using fuzzy logic all three different colours can become members of one or more sets to a degree.

As in crisp sets, the universe of discourse is a set X. This set may contain either discrete- or continuous-valued elements which participate in the modelled problem/system. In Fuzzy Logic a linguistic variable, which is just a label, is also used to describe the general domain of the members included in the set X (e.g. age, weight, distance, etc.). The linguistic variable is then quantified into a collection of linguistic values. A linguistic value is a label describing a state/situation/characteristic of the concept defined by the linguistic variable (e.g. young, heavy, long, etc.). Each linguistic value is defined by a fuzzy set. A fuzzy set is a collection of ordered pairs whose first element is an object of the universe of discourse and whose second element is the object's degree of belonging to the particular fuzzy set, given by the predefined membership function.
Therefore a basic notation of a fuzzy set is:

S = {(x, μ_S(x)) | μ_S(x) definition} Eq. 3.1

where x is an object of the universe of discourse and μ_S(x) is the membership function which characterizes the set S. An example of the aforementioned concepts is presented on the right side of Figure 3.2 (the weight example). As stated already, fuzzy logic is used to express human reasoning in a formal way. In real life, human reasoning reflects the content of a specific problem/discussion/model, etc. However, vague terms are often used to describe an aspect of some concept concerning either the physical or the spiritual world. That is why each statement should be interpreted within the frame of the problem/system under discussion. For example, consider a man P whose height is 1.80 meters. The statement "P is short" sounds incorrect, since an ordinary man who is 1.80 meters tall is generally considered tall. However, if the discussion is about professional basketball players, then the exact same statement sounds reasonable and valid. Therefore, each statement of human reasoning strongly depends on the context of a discussion. Consequently, linguistic values are innately context dependent. Similarly, they strongly depend on the system and on whom or what the system refers to; they depend on the system's character. Even using the same universe of discourse and the same linguistic variable, the linguistic values might take a totally different scaling if the main subject of the system is different. Furthermore, Fuzzy Logic allows an element to belong to two or even more fuzzy sets (to different degrees), whereas in classical set theory an element can be a member of only one set. Hence, an object might be a little more a member of fuzzy set A and at the same time a little less a member of fuzzy set B.
Going back to the example given in Figure 3.1, if the colour of a new object is deep purple, then it is a little more blue and a little less red, but it is still considered a member of both colour sets! Assuredly, the membership function is the identity characteristic of each fuzzy set. Each linguistic value of a linguistic variable is assigned a membership function. Human reasoning, as stated above, is subjective, and thus different membership functions may be produced by different persons for the same linguistic value of a common linguistic variable.

Figure 3.2: The left part describes the relationship between some concepts of Fuzzy Logic. The right part of the figure builds a simple example based on the theory of the left side.

Hence, the criteria for choosing a function to be a membership function are problem dependent. The shape of each function changes along with the criteria set for a problem, and therefore there is a big range of functions which can be used to define a linguistic value. However, it is difficult to build your own membership function from scratch, since there is a big risk of modelling arbitrary functions with arbitrary shapes. That is why, most of the time, some specific types of functions, which have predominated in the fuzzy modelling literature, are commonly chosen and adjusted to new datasets. Some classical examples of such membership functions are the trapezoid, triangular, Gaussian bell and singleton functions. As shown in Figure 3.3, the trapezoid, triangular and Gaussian bell functions can be used to model membership functions of continuous-valued elements. Contrarily, the singleton function is more suited to modelling discrete data, where each case must be individually assigned a certain membership grade (hence the parameter a of the singleton function represents an element of the data).

Figure 3.3: Trapezoid, triangular, Gaussian bell and singleton membership functions.
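The continuous membership functions of Figure 3.3 can be sketched directly from their parameter descriptions (a minimal implementation; the parameter values used below are illustrative):

```python
import numpy as np

def trapezoid(x, a, b, c, d):
    # rises on (a, b), full membership on [b, c], falls on (c, d)
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0

def triangular(x, a, b, c):
    # special case of the trapezoid with parameters [a, b, b, c]
    return trapezoid(x, a, b, b, c)

def gaussian_bell(x, sigma, m):
    # symmetric about the mean m; only x = m reaches membership 1
    return float(np.exp(-((x - m) ** 2) / (2.0 * sigma ** 2)))
```

For example, trapezoid(x, 1, 2, 3, 4) gives full membership on [2, 3] and partial membership on (1, 2) and (3, 4), exactly as described for the four-parameter form.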
Building a trapezoid function requires the definition of four parameters [a, b, c, d]. The elements of the universe of discourse which fall in the intervals (a, b) and (c, d) are partial members of the defined fuzzy set, while the elements included in the range [b, c] are full members of the set. The triangular function can be regarded as a special case of the trapezoid function with parameters [a, b, b, c]. The substantial difference between the two is that the trapezoid allows a range of elements to be assigned full membership grade while the triangular allows only one. The Gaussian bell function differs from the other two continuous functions in that the parameters needed to define it are the pair [m, σ]. The parameter σ is the standard deviation, describing the width of the function. The function is symmetrical about the parameter m, its mean, and thus the element equal to the mean is the only one assigned a membership grade of one. Another difference between the Gaussian and the other functions is that the derivative of the former exists at every point of its domain, while this does not hold for the other two (at the elements a, b, c or d). Hence for applications which demand the usage of mathematical optimization algorithms (e.g. gradient descent) over fuzzy variables, the Gaussian is a preferable membership function. Nevertheless, there are some common concepts which are used to describe all classical membership functions, presented in Figure 3.4. The function's support is the set which contains all those elements assigned non-zero membership degrees, while the members of the core set have a degree of membership equal to one. The variable α is a specified membership degree; all the elements which have a membership degree greater than or equal to α are included in the α-cut set.
Therefore when α is equal to zero the resulting α-cut set is the whole domain of the specific membership function. Finally, the variable height is the maximum membership degree over the specific function's domain. In Figure 3.4 these basic concepts are presented for a trapezoid function, but they can easily be adjusted to any other type of classical fuzzy membership function.

Figure 3.4: The concepts of support, core, α-cut and height describing a fuzzy membership function.

Additionally, a fuzzy set A is considered convex if and only if the inequality in Equation 3.2 is satisfied for every pair x_i and x_j, where x_i, x_j ∈ A, and every λ ∈ [0, 1].

μ_A(λx_i + (1 − λ)x_j) ≥ min{μ_A(x_i), μ_A(x_j)}   Eq. 3.2

Furthermore, the cardinality of a fuzzy set A is defined as:

|A| = ∫ μ_A(x) dx   Eq. 3.3

If the support set of the fuzzy set is discrete, including N elements, then the cardinality can be calculated as:

|A| = Σ_{i=1}^{N} μ_A(x_i)   Eq. 3.4

3.1 - Basic fuzzy set operators

The basic classical set operations of conjunction, disjunction and negation are defined for fuzzy sets as well. Actually, there are multiple operators which model the conjunction and disjunction of fuzzy sets. The simplest, though, are the min-max operators and the algebraic product and sum, presented in Table 3.1 and illustrated in Figure 3.5 and Figure 3.6 respectively. Any fuzzy set operation results in a new fuzzy set. So, since a fuzzy set is defined by its membership function, the operations are applied on membership functions and the result is a new membership function on the universe of discourse.

              Min/Max Operators              Algebraic Product/Sum Operators
Conjunction   μ_{A∧B} = min{μ_A, μ_B}        μ_{A∧B} = μ_A · μ_B
Disjunction   μ_{A∨B} = max{μ_A, μ_B}        μ_{A∨B} = μ_A + μ_B − μ_A · μ_B
Negation      μ_{A^c} = 1 − μ_A              μ_{A^c} = 1 − μ_A

Table 3.1: Two different operators for conjunction and disjunction on two fuzzy sets A and B. The negation operation has the same form for both operators.
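The operators of Table 3.1 can be applied pointwise on discretized membership vectors. A minimal sketch with illustrative membership values (the vectors below are assumptions, not taken from a figure):

```python
# Two fuzzy sets over a shared discretized universe (illustrative degrees).
mu_A = [0.0, 0.2, 0.6, 1.0, 0.6, 0.2, 0.0]
mu_B = [0.0, 0.0, 0.3, 0.5, 0.8, 1.0, 0.4]

conj_min  = [min(a, b) for a, b in zip(mu_A, mu_B)]        # T-norm: min
disj_max  = [max(a, b) for a, b in zip(mu_A, mu_B)]        # S-norm: max
conj_prod = [a * b for a, b in zip(mu_A, mu_B)]            # T-norm: algebraic product
disj_sum  = [a + b - a * b for a, b in zip(mu_A, mu_B)]    # S-norm: algebraic sum
neg_A     = [1.0 - a for a in mu_A]                        # negation

# Unlike crisp sets, the conjunction of A with its complement is not empty (Eq. 3.5):
a_and_not_a = [min(a, n) for a, n in zip(mu_A, neg_A)]
```

The last line illustrates the side effect discussed next: wherever μ_A is strictly between 0 and 1, both A and its complement have non-zero degree, so their min is non-zero.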
There is a plethora of other operators which can be used to model conjunction and disjunction on fuzzy sets. In Fuzzy Logic, conjunction operators are labelled T-norms and disjunction operators S-norms. Each one of them gives a new membership function with a different shape, but still reflecting the concepts of intersection and union of two fuzzy sets. For example, the algebraic product gives a smoother conjunction membership function (Figure 3.6) than the min operator (Figure 3.5 (b)). The choice of an operator depends on the problem and generally on the model that the user wishes to produce. Finally, there are two substantial differences between the logical operators on crisp sets and on fuzzy sets. Starting with crisp logic, it is well known that the intersection between a set and its complement results in the empty set. Contrariwise, the union of a set and its complement gives the universe of discourse. The corresponding operators on fuzzy sets result in fuzzy sets other than the empty set and the universe of discourse respectively. Hence for a fuzzy set A:

μ_{A∧A^c} ≠ 0   Eq. 3.5

μ_{A∨A^c} ≠ 1   Eq. 3.6

These two interesting side effects are presented in Figure 3.7.

Figure 3.5: (a) The membership functions of two fuzzy sets named A and B (b) The membership function (m. f.) of the conjunction of A and B using the min norm (c) The m. f. of the disjunction of A and B using the max norm (d) The m. f. describing the negation of A (complement of A)

Figure 3.6: The membership functions of the conjunction and the disjunction of A and B using the algebraic product and algebraic sum respectively.

Figure 3.7: (a) The m. f. of a fuzzy set A and its complement (b) Their conjunction m. f. (c) Their disjunction m. f.

3.2 - Fuzzy Relations

Consider two crisp sets X and Y. Applying a relation over these sets provides a new set R containing ordered pairs (x, y) where x ∈ X and y ∈ Y.
Since the Cartesian product of two crisp sets, denoted X × Y, comprises a crisp set including all ordered pairs between X and Y such that:

X × Y = {(x, y) | x ∈ X and y ∈ Y}   Eq. 3.7

it follows that the relation R comprises a subset of the Cartesian product X × Y. This subset might be chosen totally randomly, or it might be chosen under a specified condition over the two variables (of the ordered pairs). In fact any subset of the Cartesian product is considered a relation. Any relation defined over a Cartesian product (or product set) built on discrete sets can be expressed as a relational matrix. Each element of the matrix represents the membership degree of the particular ordered pair (row, column) to the relation, and that is why it is also called a membership array. The condition defining a crisp set is very precise. The boundaries between the ordered pairs which are members of the relation and those which are not are very sharp. Therefore the relational matrix of such a relation contains only ones and zeros: the ordered pairs which satisfy the condition are assigned one and the rest zero. However, there are cases where the condition of a relation is not so clear and precise. For example, "find the objects which have approximately equal size". The condition of the relation "approximately equal size" does not define precisely how the word "approximately" should be interpreted. As a result, some objects will be more nearly equal in size than others; some ordered pairs will belong to the relation to a larger or smaller degree than others. Hence this is a fuzzy relation. As with fuzzy sets, the membership function of a fuzzy relation varies in the interval [0, 1], as do the values of the fuzzy relational matrix, since it represents the membership degree of each ordered pair. In Figure 3.8, two examples of relations, a crisp relation (x is equal to y) and a fuzzy relation (x is approximately equal to y), are presented.
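The contrast between the crisp relation "x equals y" and the fuzzy relation "x is approximately equal to y" can be sketched as relational matrices. The decay rate 0.5 in approx_equal below is an assumption chosen for illustration, not a value from Figure 3.8:

```python
# Crisp and fuzzy relational matrices over X = Y = {1, ..., 5}.
X = Y = [1, 2, 3, 4, 5]

# Crisp relation: sharp boundary, only ones and zeros.
crisp_R = [[1.0 if x == y else 0.0 for y in Y] for x in X]

def approx_equal(x, y):
    # Membership degree decays with |x - y|; the 0.5 slope is illustrative.
    return max(0.0, 1.0 - 0.5 * abs(x - y))

# Fuzzy relation: graded membership per ordered pair.
fuzzy_R = [[approx_equal(x, y) for y in Y] for x in X]
```

Each row of fuzzy_R assigns full degree on the diagonal, degree 0.5 to neighbours, and zero beyond, so "approximately equal" pairs belong to the relation to intermediate degrees.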
Figure 3.8: A crisp relation (on the left) and a fuzzy relation (on the right) over two sets X and Y.

When the universes of discourse X and Y are discrete, the membership function characterizing the set of a relation, μ_R(x, y), defines discrete points in a three dimensional space (X, Y, μ_R(x, y)). On the other hand, if the sets which participate in the relation are continuous, then the relational membership function defines a surface over the Cartesian product X × Y. The relational matrix is difficult to define over continuous universes, and therefore discretization of their domain may be applied by choosing an appropriate discretization step. The universe of discourse could also be described by a linguistic variable. As stated before, a linguistic variable is assigned a set of specific linguistic values. The relation can then be defined over different linguistic variables, and more particularly over the ordered pairs of their linguistic values. Hence the relation describes the way these variables might be associated. In Figure 3.9 the universe of discourse is characterized by the linguistic variable "profession", which takes human professions as linguistic values. The set X is defined over the linguistic values basketball player and jockey. Similarly, the second set Y is defined over the linguistic values short, medium and tall of the second linguistic variable (height). The relation describes the association between a human's profession and a human's height.

Figure 3.9: Crisp and Fuzzy relations over linguistic variables

Such relations comprise a very helpful tool for modelling IF-THEN rules. For example, consider the resulting crisp relational matrix as presented in Figure 3.9 (left part).
In this case, where the universes of discourse are represented by linguistic variables, the relational matrix may be expressed as a model of IF-THEN rules:

R1: If (a human's profession is) basketball player, THEN (he is) tall
R2: If (a human's profession is) jockey, THEN (he is) short

Furthermore, the crisp relational matrix can be generalized to a fuzzy matrix (right part of Figure 3.9), giving some freedom to the association between a human's profession and his height. For example, sometimes it is desirable for a basketball team to have one not so tall (medium) player. Thus, while it is certain that most basketball players are tall, there is a smaller possibility that one of them will be medium. Up to this point, fuzzy relations over discrete, continuous and linguistic variables have been considered. The last part is about fuzzy relations over fuzzy sets. A fuzzy relation over two fuzzy sets gives the meaning of linguistic expressions a mathematical form. Consider, for example, the expression "Today we have a humid cold day". Assuredly, the phrase "humid cold day" actually implies that "the day is humid AND cold". The latter form of the phrase is given by the intersection between the fuzzy sets "humid" and "cold". Therefore, the meaning of this expression is given by applying a T-norm operator (e.g. minimum) to the fuzzy sets which describe two different linguistic variables, the relative humidity and the temperature. The linguistic values for both linguistic variables are low, medium and high. Therefore, a humid day is assigned to high relative humidity and a cold day means low temperature. Their corresponding membership functions are graphically presented in Figure 3.10. Note that the membership functions were defined based on the Cyprus temperatures and humidity percentages.
Figure 3.10: The membership functions of the fuzzy sets "cold temperature" and "humid weather"

Having defined the membership functions of the linguistic variables participating in the expression, the membership function of the relation which applies over them must also be computed. Since the expression refers to a conjunction over the linguistic values "high humidity" and "low temperature", any T-norm might be chosen. Applying the minimum norm, the relational membership function would be:

μ_R(relative_humidity, temperature) = min{μ_high(relative_humidity), μ_low(temperature)}   Eq. 3.8

The two fuzzy sets of humidity and temperature are defined by their membership functions. The domain of both membership functions is continuous. Therefore the relational matrix is difficult to define unless discretization is applied. In Figure 3.10 both membership functions are discretized by a certain step so that:

high_humidity = {60, 70, 80, 90, 100}
low_temperature = {0, 5, 10, 15, 20}

and their corresponding membership degrees are:

μ_high = [0, 0.5, 1, 1, 1]^T,   μ_low = [1, 1, 1, 0.5, 0]^T

The form of the resulting relational matrix is given by:

R = μ_high × μ_low^T   Eq. 3.9

By applying the min operator, the final relational matrix is:

R      0     5     10    15    20
60     0     0     0     0     0
70     0.5   0.5   0.5   0.5   0
80     1     1     1     0.5   0
90     1     1     1     0.5   0
100    1     1     1     0.5   0

A fuzzy relation over fuzzy sets is a new set containing ordered pairs of the elements included in the fuzzy sets and their degree of belonging to the specific relation. Therefore, the elements of the relational matrix define a surface over the Cartesian product of the fuzzy sets (for the above example, humidity × temperature), representing the resulting relational membership function. As shown in the preceding examples, a fuzzy relation can be applied on sets which are defined on different universes. Likewise, two fuzzy relations which are subsets of different product spaces can be combined through composition.
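The relational matrix above can be reproduced by taking the pointwise minimum over the Cartesian product of the two discretized membership vectors:

```python
# Discretized membership vectors from the humid/cold example.
mu_high = [0.0, 0.5, 1.0, 1.0, 1.0]   # relative humidity over {60, 70, 80, 90, 100}
mu_low  = [1.0, 1.0, 1.0, 0.5, 0.0]   # temperature over {0, 5, 10, 15, 20}

# Eq. 3.8/3.9: relational matrix via min over every ordered pair (humidity, temperature).
R = [[min(h, t) for t in mu_low] for h in mu_high]
```

Row by row, R matches the matrix in the text: the 60% humidity row is all zeros (μ_high = 0), while the 80-100% rows repeat μ_low capped at 1.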
Since fuzzy relations comprise fuzzy sets, composition may be regarded as a new fuzzy relation associating these fuzzy sets (relations). Therefore the composition of fuzzy relations introduces a new fuzzy set. Let:

R1 ⊆ X × Y and R2 ⊆ Y × Z   Eq. 3.10

Then the fuzzy set produced by the composition of R1 and R2 is:

T ⊆ X × Z   Eq. 3.11

There is a variety of compositional operators which can be applied on fuzzy relations. MAX-MIN and MAX-PRODUCT are two common composition operators, where the MAX-MIN operator is defined as:

T = R1 ∘ R2 = {[(x, z), max_y{min{μ_R1(x, y), μ_R2(y, z)}}] | x ∈ X, y ∈ Y, z ∈ Z}   Eq. 3.12

Accordingly, the membership function of the fuzzy set T is:

μ_{R1∘R2}(x, z) = max_y{min{μ_R1(x, y), μ_R2(y, z)}}   Eq. 3.13

In the following composition example, the fuzzy relation presented in Figure 3.9 will be used. The relation R1 describes the interconnection between the subset of professions {basketball player, jockey} and the fuzzy set describing height scaling. Another fuzzy relation R2 is defined between the fuzzy sets of height and weight. The linguistic values for height are {short, medium, tall} while for weight they are {low, medium, high}.

Figure 3.10: The composition of two relations R1 and R2 leads to the creation of a fuzzy set T. In this figure their relational matrices are presented.

The composition operator MAX-MIN is applied to these relations and as a result a new fuzzy set T is produced. For example, the membership degree of the ordered pair (basketball player, low_weight) to the new fuzzy set T is given by:

μ_T(basketball player, low) = max{min{0, 1}, min{0.1, 0.5}, min{1, 0}} = max{0, 0.1, 0} = 0.1   Eq. 3.14

The final form of the relational matrix describing the fuzzy set T is presented in Figure 3.10.
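The MAX-MIN composition of Eq. 3.12 can be sketched directly. The basketball-player row of R1 and the low-weight column of R2 are fixed by Eq. 3.14; the remaining entries below are illustrative assumptions, since the full matrices live in the figure:

```python
# R1: profession -> height, over {short, medium, tall}.
R1 = [[0.0, 0.1, 1.0],   # basketball player (degrees fixed by Eq. 3.14)
      [1.0, 0.5, 0.0]]   # jockey (illustrative)

# R2: height -> weight, over {low, medium, high}; the low column is fixed by Eq. 3.14.
R2 = [[1.0, 0.5, 0.0],   # short
      [0.5, 1.0, 0.5],   # medium
      [0.0, 0.5, 1.0]]   # tall

def max_min_compose(A, B):
    """T(x, z) = max over y of min(A[x][y], B[y][z])  (Eq. 3.12)."""
    return [[max(min(A[i][k], B[k][j]) for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

T = max_min_compose(R1, R2)
```

The entry T[0][0] reproduces Eq. 3.14: max{min{0, 1}, min{0.1, 0.5}, min{1, 0}} = 0.1.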
The relational matrix of the compositional fuzzy set T can be linguistically interpreted using IF-THEN rules as:

R1: If (a human's profession is) basketball player, THEN (he has) high weight, possibly medium weight and unlikely low weight.
R2: If (a human's profession is) jockey, THEN (he has) low weight, less likely medium weight and unlikely high weight.

Composition allows the transition from one universe to a different universe through a third universe of discourse, which is associated with the former two by two relations. Thus, it is clear that the composition operation on fuzzy relations allows the formulation of fuzzy implications.

3.3 - Imprecise Reasoning

In classical propositional calculus, inference is achieved by using two basic inference rules, modus ponens and modus tollens. Given a specific fact A and the implication "A implies B", modus ponens infers B, which can be expressed in argument form as:

Premise 1: A
Premise 2: A → B
Conclusion: B

or

A ∧ (A → B) ⇒ B   Eq. 3.15

The implication rule A → B can be interpreted as an IF-THEN rule: "If x is A then y is B". Therefore B is derived if and only if the variable x has the exact value of A. If x is instead approximately A or close to A, then the implication is not applied and B cannot be inferred. On the other hand, fuzzy logic is flexible, allowing a variable x whose value is close to A to be equal to A to some degree. In other words, besides the actual A, other objects can be regarded as A (to some membership grade), but classical modus ponens will exclude them from the inferential process, treating them as totally non-A objects. Instead, the generalized modus ponens should be used, which allows the inference to take place even when the first premise is considered partially true. In that case the conclusion will be partially true as well. Therefore if x is A to a degree, then y will be B to a degree.
The premises leading to the conclusion are not sharply bounded, leading to a conclusion which is not precisely defined; that is why the reasoning is considered imprecise. It follows that the implication operator must be generalized over fuzzy variables. In classic logic the implication rule "IF A THEN B", denoted by A → B, is set-theoretically equivalent to Ā ∨ B and is given by the truth table shown in Table 3.2.

A   B   A → B
T   T   T
T   F   F
F   T   T
F   F   T

Table 3.2

The only case where the implication is false is when a false consequent follows a true antecedent. This is obvious, since true premises cannot imply a false conclusion. However, classical implication needs some further explanation for the cases where the antecedent proposition is false. The paradox in these cases is that the implication is true independently of the conclusion B. Therefore the implication is valid for all the cases where the hypothesis (antecedent) is false. However, humans understand linguistic implication differently. They mostly use it to indicate the causality between the condition and the conclusion. The antecedent proposition is equivalent to the cause, and the consequent is thought to be the effect. In other words the antecedent is related to the conclusion, whereas in classical logic relatedness between the two is not required for a true implication. Fuzzy logic was inspired in the first place by the way humans think and reason. Thus fuzzy implication is defined to serve this cause-effect relation between the antecedent and the consequent propositions. Given that in crisp logic the implication is equivalent to Ā ∨ B, and that fuzzy negation and disjunction are given by Ā = 1 − μ_A(x) and A ∨ B = max{μ_A(x), μ_B(y)} respectively, it is derived that the fuzzy implication would be A → B = max{1 − μ_A(x), μ_B(y)}.
However, this definition of fuzzy implication does not reflect human reasoning, since if the "cause" (the antecedent) is not fulfilled, having μ_A(x) = 0, the implication is still true, allowing B to be inferred from A even if the premise A is not true. Within the scope of describing human reasoning adequately, the effect should only follow the true presence of the cause: no cause implies no effect. Therefore if μ_A(x) = 0 then μ_B(y) = 0. Fuzzy logic transfers human implication to fuzzy implication by restricting the truth value of the consequent not to be larger than the truth value of the premise. This can be achieved by using fuzzy implication operators like the minimum (Mamdani, 1977) and the product (Larsen, 1980), which are defined in Equations 3.16 and 3.17 respectively.

μ_{A→B}(x, y) = min{μ_A(x), μ_B(y)}   Eq. 3.16

μ_{A→B}(x, y) = μ_A(x) · μ_B(y)   Eq. 3.17

3.4 - Generalized Modus Ponens

Generalized Modus Ponens (GMP) allows the transformation from one fuzzy set to another through fuzzy implication. Therefore, firstly the fuzzy relation R, representing the implication between two fuzzy sets A and B, must be defined. Having the relation R along with a value describing the state of A (e.g. low, tall, or a crisp number taken from the universe of discourse), GMP is used to infer the state of B (which again can be either fuzzy or crisp). Fuzzy rules (recall that an implication is a fuzzy rule) are built on fuzzy sets. So in the case that the premise A is given in crisp form, its value must be fuzzified (fuzzification step). Similarly, if the conclusion B is desired to be crisp, then it must be defuzzified (defuzzification step). As aforementioned, an implication relation is defined over two universes and therefore R_{A→B} ⊆ X × Y, where the two fuzzy sets A and B are defined over the universes X and Y respectively. Then, given a particular element x belonging to the fuzzy set A, or the fuzzy set itself, the fuzzy value of the "implied" y ∈ B can be produced as:

y = x ∘ R_{A→B}   Eq.
3.18, which is the composition between the fuzzy value of x and the implication relation between A and B. The steps followed to accomplish GMP inference are presented along with an example of a fuzzy system which is gradually built, from the preprocessing step to inference. The fuzzy system used as an example adjusts the air condition motor speed to the room temperature.

Steps of building a fuzzy system:

Step 1: Identify and assign input/output linguistic variables.
Two fuzzy variables compose the "air condition" problem, namely temperature, which is the input variable, and air condition motor speed (ac speed), which is the output variable. The universe of the variable temperature is X, defined as the Celsius degrees between 0 and 45, while for the ac speed it is Y, representing the interval [0, 100]: the percentage of the rpm (revolutions per minute) at a given time instance to the maximum rpm that the specific motor can reach.

Step 2: Describe each linguistic variable with linguistic values (fuzzy sets).
The variable temperature is quantified into the linguistic values cold, normal and hot, while ac speed into slow, medium and fast. Each linguistic value defines a fuzzy set, and the corresponding membership functions are shown in Figure 3.11. Note that each linguistic value must be defined over the universe of the linguistic variable. Therefore the membership functions of temperature and ac speed describe the fuzziness of the universes X and Y respectively.

Figure 3.11: The membership functions of the linguistic variables temperature and air condition motor speed.

Step 3: Create the rule base.
Each rule should assign the input variables to the output variable using implication relations.

R1 (R_{cold→slow}): If temperature is cold then ac speed is slow.
R2 (R_{normal→medium}): If temperature is normal then ac speed is medium.
R3 (R_{hot→fast}): If temperature is hot then ac speed is fast.
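Under the discretization used in the following steps, the linguistic values and the rule base can be encoded as follows. The vectors for normal, hot, medium and fast are taken from the worked example later in the text; cold and slow are omitted because they do not fire in that example, and their shapes would have to be read off Figure 3.11:

```python
# Discretized universes (step 5 for temperature, step 10 for ac speed).
X = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]          # temperature, Celsius
Y = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]    # ac speed, % of max rpm

# Discretized linguistic values (fuzzy sets) over X and Y.
mu = {
    "normal": [0, 0, 0, 0, 0.4, 1, 1, 1, 0, 0],
    "hot":    [0, 0, 0, 0, 0, 0, 0, 0.5, 1, 1],
    "medium": [0, 0, 0, 0, 0.5, 1, 0.5, 0, 0, 0, 0],
    "fast":   [0, 0, 0, 0, 0, 0, 0.2, 0.4, 0.6, 0.8, 1],
}

# Rule base: each rule maps an input linguistic value to an output one.
rules = [("normal", "medium"), ("hot", "fast")]
```

With a single input variable carrying three linguistic values, the full rule base has N = 3 rules; with M input variables the count multiplies as N = N1 × N2 × … × NM.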
It is important to mention that during the inference step all the rules are enabled. They fire in parallel, and therefore two or even more rules can apply inference on a system at the same time. The rule base (RB) comprises a union of rules such that RB = R1 ∨ R2 ∨ … ∨ RN, where N is the total number of rules. Additionally, most of the time there is more than one input variable. For example, besides temperature, humidity could also be considered an input variable for the air condition system. Each linguistic variable is described by several fuzzy sets, and the rule base should assign every possible combination of input fuzzy linguistic values to the output variable. In the case where Ni is the number of fuzzy sets describing the linguistic variable xi and there are M linguistic variables, the total number of rules that should be coded in a rule base is N = N1 × N2 × … × NM.

Step 4: Introduce the states of the input variables to the system.
The only input variable in the air condition problem is the temperature. Temperature is measured in Celsius degrees but it is also quantified into linguistic values. So the state of the temperature could be either a Celsius degree drawn directly from the universe of discourse X or a linguistic value picked from cold, normal and hot.

If the input variable x is a fuzzy value:
Then x ∈ {cold, normal, hot}. Each one of them defines a fuzzy set and therefore there is no need for further fuzzification. In case the relational matrix of implication must be defined, the universe of discourse must be discretized. In Figure 3.11 the discretization step for temperature is 5. After discretization the discrete elements of the chosen fuzzy set are known:

X = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]

Thus the input variable is assigned the membership degrees with which the discretized elements belong to the chosen fuzzy set.
Let x = hot; then the input variable would be:

μ_hot = [0, 0, 0, 0, 0, 0, 0, 0.5, 1, 1]   Eq. 3.19

Else, if it is assigned a crisp value x′:
This means it is a specific Celsius degree in the interval 0 ≤ x ≤ 45. Since it is a crisp value it must be fuzzified, which means it must be expressed by a fuzzy set. Introducing a crisp value states that the temperature is exactly x′ Celsius degrees, which means it is fully true that the temperature is x′ and nothing else. Hence the state of the input variable can be fuzzified by a singleton membership function, which essentially returns 1 (true) for x′ and zero (false) for all the other elements of the universe X:

μ_{x′}(x) = { 1, if x = x′ ; 0, otherwise }   Eq. 3.20

Let x = 35℃; then μ_35 = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]. For the following steps it is assumed that x = 35℃, with

μ_{x′} = μ_35 = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]   Eq. 3.21

as the input variable of the system.

Step 5: Construct the implication relational matrices for the firing rules.
The crisp value temperature = 35℃ falls into two fuzzy sets, normal (full member) and hot (partial member with 0.5 grade of membership). As a result two rules fire, R2 and R3. To construct the relational matrices, discretization must be applied to the output universe as well, so:

Y = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

R2 actually represents R_{normal→medium}, and therefore the interest here is to infer how strongly each normal temperature value induces each medium ac speed value. The conclusion of this rule is y = medium (M), given by the fuzzy set μ_M = [0, 0, 0, 0, 0.5, 1, 0.5, 0, 0, 0, 0]. The fuzzy set of x = normal (N) is μ_N = [0, 0, 0, 0, 0.4, 1, 1, 1, 0, 0]. As observed, the elements which don't belong to the sets normal (for temperature) and medium (for ac speed) have zero membership degree. Therefore they can be excluded from the sets (since their implication relational membership degrees will certainly be zero).
As a result:

μ_N = [0, 0.4, 1, 1, 1, 0] (over the temperatures [15, 20, 25, 30, 35, 40])
μ_M = [0, 0.5, 1, 0.5, 0] (over the speeds [30, 40, 50, 60, 70])

By applying the Mamdani implication operator, the relational membership function is given by:

μ_{R2}(x, y) = min{μ_N(x), μ_M(y)}, where x ∈ X and y ∈ Y   Eq. 3.22

and the final relational matrix R2 is:

          y ∈ M
          0/30   0.5/40   1/50   0.5/60   0/70
x ∈ N
0/15      0      0        0      0        0
0.4/20    0      0.4      0.4    0.4      0
1/25      0      0.5      1      0.5      0
1/30      0      0.5      1      0.5      0
1/35      0      0.5      1      0.5      0
0/40      0      0        0      0        0

For example, μ_{R2}(20, 60) = min{0.4, 0.5} = 0.4.

R3 (R_{hot→fast}): The relational matrix of this rule is constructed in the same way as the matrix of R2. The R3 relational matrix is:

          y ∈ F
          0/50   0.2/60   0.4/70   0.6/80   0.8/90   1/100
x ∈ H
0/30      0      0        0        0        0        0
0.5/35    0      0.2      0.4      0.5      0.5      0.5
1/40      0      0.2      0.4      0.6      0.8      1
1/45      0      0.2      0.4      0.6      0.8      1

Step 6-A: Apply generalized modus ponens.
This step describes the actual inference, based on a fact x′, which is the given state of the temperature (Equation 3.21), and an implication rule (R2 and R3), resulting in the new state y′ of the output variable ac speed through the composition operation.

Firing by R2:

μ_{x′∘R2}(y) = μ_{M′}(y) = max_x{min{μ_{x′}(x), μ_{R2}(x, y)}}   Eq. 3.23

results in a new fuzzy set of ac speed: μ_{M′} = [0, 0, 0, 0, 0.5, 1, 0.5, 0, 0, 0, 0].

Firing by R3:

μ_{x′∘R3}(y) = μ_{F′}(y) = max_x{min{μ_{x′}(x), μ_{R3}(x, y)}}   Eq. 3.24

results in a new fuzzy set of ac speed: μ_{F′} = [0, 0, 0, 0, 0, 0, 0.2, 0.4, 0.5, 0.5, 0.5].

Both rules fire in parallel, and therefore they are implicitly connected by a disjunction operator (in this case the max norm is applied):

μ_{y′}(y) = μ_{M′}(y) ∨ μ_{F′}(y) = max{μ_{M′}(y), μ_{F′}(y)}   Eq. 3.25

The inferred ac speed fuzzy set is:

μ_{y′}(y) = [0, 0, 0, 0, 0.5, 1, 0.5, 0.4, 0.5, 0.5, 0.5]

Step 6-B: Apply generalized modus ponens.
An alternative way of applying GMP inference is described in this sub-step. This method is simpler and doesn't require any discretization. The whole inference process is graphically represented in Figure 3.12.
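Step 6-A can be reproduced end to end: build each rule's Mamdani relation, compose it with the singleton input, and join the results with max:

```python
# Singleton-fuzzified input x' = 35 C over X = [0, 5, ..., 45].
mu_x = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

mu_normal = [0, 0, 0, 0, 0.4, 1, 1, 1, 0, 0]
mu_hot    = [0, 0, 0, 0, 0, 0, 0, 0.5, 1, 1]
mu_medium = [0, 0, 0, 0, 0.5, 1, 0.5, 0, 0, 0, 0]     # over Y = [0, 10, ..., 100]
mu_fast   = [0, 0, 0, 0, 0, 0, 0.2, 0.4, 0.6, 0.8, 1]

def mamdani_relation(mu_a, mu_b):
    """Implication relational matrix (Eq. 3.22): min over the product space."""
    return [[min(a, b) for b in mu_b] for a in mu_a]

def compose(mu_in, R):
    """Max-min composition of the input fuzzy set with a rule relation (Eq. 3.23)."""
    return [max(min(mu_in[i], R[i][j]) for i in range(len(mu_in)))
            for j in range(len(R[0]))]

mu_M2 = compose(mu_x, mamdani_relation(mu_normal, mu_medium))  # firing R2
mu_F2 = compose(mu_x, mamdani_relation(mu_hot, mu_fast))       # firing R3
mu_y  = [max(m, f) for m, f in zip(mu_M2, mu_F2)]              # rules joined by max
```

The three resulting vectors match those in the text: mu_M2 equals μ_{M′}, mu_F2 equals μ_{F′}, and mu_y equals the inferred ac speed set of Eq. 3.25.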
In the first place, the input crisp value x′ is presented to the system and its membership degree in each of the fuzzy sets describing the input variable is calculated (μ_antecedent(x′)). The antecedent and the consequent of a fuzzy rule comprise linguistic values (fuzzy sets) of the input and output variables. Therefore the firing rules are those whose antecedent fuzzy set includes the input element x′ to a degree larger than zero. For each enabled rule the consequent fuzzy set is formed as:

μ_{consequent′}(y) = min{μ_antecedent(x′), μ_consequent(y)}   Eq. 3.26

Eventually the consequents of all the firing rules are unified to give the final set describing the output variable. Returning to the air condition example: the input crisp value x′ = 35 belongs to cold with zero degree, to normal (N) with degree one and to hot (H) with degree 0.5. Therefore the firing rules are R2 and R3, since their antecedents are normal and hot respectively.

Firing by R2:

μ_{M′}(y) = min{μ_N(x′), μ_M(y)}   Eq. 3.27

Firing by R3:

μ_{F′}(y) = min{μ_H(x′), μ_F(y)}   Eq. 3.28

The results are shown in Figure 3.12. These are the basic steps one should follow to design and implement fuzzy inference with GMP. There is, however, something missing. In most cases, realistic fuzzy systems have more than one input variable. In that case the antecedents must be combined using fuzzy operators (conjunction, disjunction, etc.). The resulting firing strength is then used to infer the fuzzy set of the consequent. An example of a fuzzy system with two input variables is presented in Figure 3.13. The rules firing in this particular example are:

R1: If x = small AND y = low then z = strong.
R2: If x = small OR y = low then z = medium.
R3: If x = big AND y = high then z = weak.

The output fuzzy set shown in Figure 3.13 results after introducing the input crisp values x′ and y′.
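For the two-input case, the combine-then-clip scheme can be sketched as follows. The membership degrees of x′ and y′ below are illustrative assumptions, not values read off Figure 3.13:

```python
# Degrees of the crisp inputs x' and y' in their linguistic values (assumed).
mu_small_x, mu_big_x  = 0.7, 0.2
mu_low_y,   mu_high_y = 0.4, 0.5

# Antecedents combined with min (AND) or max (OR) give each rule's firing strength.
fire_R1 = min(mu_small_x, mu_low_y)   # If x = small AND y = low  then z = strong
fire_R2 = max(mu_small_x, mu_low_y)   # If x = small OR  y = low  then z = medium
fire_R3 = min(mu_big_x, mu_high_y)    # If x = big   AND y = high then z = weak

def clip(strength, mu_consequent):
    """Eq. 3.26: cut the consequent fuzzy set at the rule's firing strength."""
    return [min(strength, m) for m in mu_consequent]
```

Each clipped consequent is then unified with the others by max, exactly as in the single-input case, before any defuzzification.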
Figure 3.12: Construction of the membership function describing the fuzzy set of the consequent inferred by the two firing rules R2 and R3 and a single input variable. The purple graph is the input variable temperature, while the orange is the output variable ac speed. The final consequent membership function is the union of the two fuzzy sets constructed by the composition of the input variable and the fuzzy rules.

Figure 3.13: Inference over a fuzzy system with two input variables.

Step 7: Defuzzification.
Fuzzy inference results in a fuzzy consequent. However, in many cases a crisp output value is required. The process which transforms a fuzzy value to a crisp value is called defuzzification. Defuzzification methods comprise a big family, and only the most well known subset of them is presented in this study.

Maxima methods: The methods which fall into this category make use of the elements which belong to the consequent's fuzzy set with the maximum membership grade (let them comprise the maximum set and be named the maximum elements). These methods are first-of-maxima, middle-of-maxima and last-of-maxima. The first returns the maximum element with the minimum domain value. Accordingly, last-of-maxima returns the maximum element with the maximum domain value. Middle-of-maxima returns the average of the maximum elements.

Center-of-area / gravity: The center of gravity x_cog of a surface given by a function f(x) between the domain points x1 and x2 is:

x_cog = ∫_{x1}^{x2} x f(x) dx / ∫_{x1}^{x2} f(x) dx   Eq. 3.29

Therefore the center-of-gravity defuzzification method returns the center of gravity of the output fuzzy set's membership function μ_consequent as the crisp output value:

x_cog = ∫_{x1}^{x2} x μ_consequent(x) dx / ∫_{x1}^{x2} μ_consequent(x) dx   Eq. 3.30

where x1 and x2 are the minimum and maximum domain elements of the membership function, respectively.
If the elements of the output fuzzy set are discrete, the center of gravity can be calculated by substituting the integrals with discrete summations:

x_cog = Σ_{i=1}^{N} x_i μ_consequent(x_i) / Σ_{i=1}^{N} μ_consequent(x_i)    Eq. 3.31

A little attention is required in cases where the fuzzy set of the consequent is totally empty and therefore all the domain elements belong to it with zero degree. The center of gravity should then be directly assigned to zero, avoiding division by zero. In Figure 3.14 the output fuzzy set of the inference example of Figure 3.13 is defuzzified in three different ways. For example, the crisp value using the center-of-gravity method for this set would be calculated as:

z_cog = (10·0.4 + 20·0.4 + 30·0.4 + 40·0.4 + 50·0.75 + 60·0.25 + 70·0.25 + 80·0.25 + 90·0.25) / (0 + 0.4 + 0.4 + 0.4 + 0.4 + 0.75 + 0.25 + 0.25 + 0.25 + 0.25 + 0) = 45.5

Figure 3.14: Defuzzification methods: (a) Middle-of-maxima (b) First-of-maxima (c) Center-of-Gravity

Figure 3.15: Basic preprocessing and inference steps for a fuzzy system

Until this point only generalized modus ponens (GMP) has been presented for fuzzy reasoning. Generalized modus tollens (GMT) is also defined for fuzzy logic systems. Recall that classical modus tollens can be expressed as:

(¬B ∧ (A → B)) ⇒ ¬A    Eq. 3.32

Therefore if A "implies" B and it is given that B is not true, then A is certainly not true as well. Considering that A and B are fuzzy sets defined over the universes X and Y respectively, Equation 3.32 can be expressed as a rule:

R: If y is not B then x is not A.

In the Fuzzy Logic context, B and A comprise fuzzy sets describing a linguistic variable. Therefore, similarly to the modus ponens equation (Equation 3.18), the consequent of GMT is given by:

x = R_{A→B} ∘ y    Eq. 3.33

After deriving the consequent fuzzy set using GMT, the rest of the inference process is the same as for GMP, presented concisely in Figure 3.15.
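The defuzzifiers above can be sketched for a discrete output set. A minimal sketch: the sample points reuse the worked example above, and the maxima variants are checked on the same set.

```python
# Discrete defuzzification: first/middle/last-of-maxima and the
# center of gravity of Eq. 3.31, checked against the worked example above.

def maxima(xs, mus, which="middle"):
    peak = max(mus)
    elems = [x for x, m in zip(xs, mus) if m == peak]   # the maximum set
    if which == "first":
        return elems[0]                 # smallest domain value at the peak
    if which == "last":
        return elems[-1]                # largest domain value at the peak
    return sum(elems) / len(elems)      # middle: average of the peak elements

def cog(xs, mus):
    total = sum(mus)
    if total == 0:                      # empty consequent: avoid division by zero
        return 0.0
    return sum(x * m for x, m in zip(xs, mus)) / total

zs  = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
mus = [0, 0.4, 0.4, 0.4, 0.4, 0.75, 0.25, 0.25, 0.25, 0.25, 0]
print(round(cog(zs, mus), 1))   # → 45.5, as in the calculation above
```

For this set the maximum membership grade 0.75 occurs only at z = 50, so all three maxima methods coincide there.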
3.5 - Fuzzy System Development

The man who first introduced the concept of a fuzzy set referred to Fuzzy Logic as "computing with words" (Zadeh, 1996). Humans have the ability to infer conclusions from imprecise, non-numerical input data. Not only that, but they can also transfer these thoughts and reasoning ideas to each other using vague expressions and still understand each other. Actually, this is the main form of human communication in most real-life situations, using statements of the type: "If you decrease the air-conditioning temperature a little more, the room will become cold"! Essentially, humans and Fuzzy Logic both manipulate "words" as labels. These labels characterize a set of objects, behaviors, actions, etc. However, there is a great and important difference between humans and Fuzzy Logic. Humans learn the meaning of each word during their lives and through their experiences. The function which assigns various input values to a word-label is an automatic process of the brain and it happens beyond human awareness. On the other side, fuzzy systems are designed and implemented on computers. Computers do not possess a human brain and therefore, despite the imprecise/fuzzy behavior of a fuzzy system, its parameters must be precisely defined. More specifically, there are two categories of fuzzy model identification: structure identification and parameter identification (Sugeno and Yasukawa, 1993). Structure identification concerns finding the input variables (the Type I problem) and the relations connecting them to the output variables (the Type II problem).

Type I: Find input variables

A system might accept a large number of input variables. However, high input dimensionality increases the complexity of the system and, furthermore, in many cases only a small subset of the whole set of input variables drastically affects the system.
As a result, the initial set of input features must be reduced in such a way that it retains all the necessary information the system needs to achieve qualitative reasoning. The problem of reducing the initial pool of candidate input variables to the minimum number while still allowing the system to accomplish correct inference falls under Structure Identification Type I.

Type II: Find input-output relations

This problem requires that the Type I problem is already solved, so that the input variables which are going to model the input feature space of the system are known. The next question concerns the structure and the number of the relations between the input variables and the output. In the context of fuzzy logic, the relations which transform the system's input to its output are expressed as rules. Hence Type II is about rule identification and it is subdivided into two categories:

o Subtype IIa: Find the number of fuzzy rules

Every system presents a certain behavior (expressed as the output variable) which depends upon a particular combination of values of the input variables. Therefore a specific number of rules must be built in order to define every possible behavior of the system (every possible combination of causes and effect).

o Subtype IIb: Find the structure of premises

Each rule is composed of a set of premises and a consequent. The maximum number of premises is equal to the total number of input variables (let it be N), yet some input variables may not affect a certain system behavior (consequent). If each variable is quantified into M linguistic values, then there are up to M^N different combinations of premises per rule. Each rule defines a specific partition of the input space (if the introduced input features fall into this partition, the rule fires). The position of the partition is defined by the membership functions describing the linguistic values which participate in the rule.
The problem thus is to define the premise values for each rule so that the input space is partitioned into the minimum possible number of subspaces without losing any information about the system's output behavior.

Parameter Identification comprises an optimization problem over the parameters of the membership functions describing each linguistic value. The number of parameters which must be optimized per membership function depends on the type of the function (trapezoidal, Gaussian, etc.). If no prior knowledge is available for the fuzzy system to be constructed, the structure and parameter identification problems cannot be dealt with separately. There are several proposed methods for solving the above-mentioned problems. Genetic algorithms are very popular for solving the Parameter Identification and Type I problems. For Parameter Identification, other models have also been used, such as Artificial Neural Networks, Particle Swarm Optimization and gradient descent methods. For the Type II problem, clustering methods have proved able to separate the input space adequately into a certain number of clusters which represent the partitions. Each cluster is then transformed into a rule by assigning the corresponding input values to the rule's premises.

Chapter 4 - Evolutionary Algorithms

Evolutionary Algorithms (EA) is the field of Computer Science which studies programs that evolve solutions to problems by incorporating nature's evolutionary principles. Evolutionary algorithms are biologically inspired and many of their terms and parameters are identical to the ones used to describe genetics. For a better understanding of their meaning and their use in the context of EC, some biological terms of genetics are explained in this section.

4.1 - Biology

Every living organism is composed of a certain number of cells. The cell is the structural and functional unit of life. Each cell contains the same set of chromosomes, which is called the genome.
For example, the human genome consists of forty-six chromosomes. The chromosomes lie in the nucleus of the human cell and they are divided into twenty-two pairs of homologous chromosomes and one pair of sex chromosomes. A chromosome is basically a long chain of biochemical material, namely deoxyribonucleic acid (DNA), which encodes the necessary information about the development of a specific set of characteristics describing the organism. The "ribbon" of the DNA is further divided into segments, called genes, which serve as "blueprints" for a part, or the whole, of a particular protein. Recall that proteins are considered the building blocks of an organism and therefore the whole genome may be naively thought of as nature's "recipe book", including all the instructions about the construction of the organism's physical and mental features. The genes of homologous chromosomes encode the same protein functionality and they are located at corresponding positions on the chromosomes. As the genome is inherited from the organism's parents, one part of the homologous chromosomes is a copy of the mother's corresponding chromosome and the other part of the father's. Each gene is positioned at a specific location on a chromosome which is called a locus. The DNA sequence which is enclosed in a gene locus represents the encoded information about the expression of a characteristic. The variants of the DNA sequence which can appear at a locus are called alleles. For example, consider a gene which is responsible for the production of the eye colour protein. The possible alleles of this gene are {Black, bRown, bLue, Green}. Since each chromosome is always accompanied by its homologous one, the eye colour will be set based on the alleles of the two homologous genes. The two alleles may be different from each other and therefore, if N is the number of possible alleles of a gene, there are N × N different combinations of alleles.
Now, consider an organism whose eye colour homologous genes present the following allele combination: {R, L} (Figure 4.1). That means that one of the genes defines brown colour for the eyes while the other blue. The combination of the alleles {R, L} is labeled the eye colour genotype of the organism. Brown colour is dominant over blue and thus the colour of the eyes of this particular organism will be brown. The actual colour of the eyes is the expression of the genotype and it is called the phenotype.

A cell can make a copy of itself through the process of mitosis. During mitosis the DNA of the nucleus is replicated, creating two identical sets of chromosomes. At the end of the process the cell is divided into two duplicate cells which have the exact number of chromosomes as the maternal one. The first of an organism's cells which undergoes mitosis is the zygote and, after consecutive divisions, billions of cells are created, each of them containing the same number of chromosomes as the zygote. The zygote is the result of the union between two gamete cells (two parents) which are joined during reproduction and, based on the zygote, a new organism can be developed.

Figure 4.1: A homologous pair of chromosomes appears on the left side of the figure. The alleles of the two homologous eye colour genes define that the genotype of eye colour is RL (brown and blue) whereas the phenotype is two brown eyes, shown on the right part of the figure.

By reproducing, the organisms manage to pass on their genes to the next generation and therefore gain in evolution. Reproduction is the driving vehicle of evolution and natural selection is the wheel. Natural selection is a term first introduced by Charles Darwin in his work "The Origin of Species" back in 1859. The concept of natural selection can be analyzed into two other concepts:

The survival of the fittest

The fittest organism is considered the one which has been best adapted to its environment.
As a result it has more potential to survive the dangers or threats hidden in its environment, whereas the less adapted organisms will not.

Differential reproduction

This concept emanates from "the survival of the fittest". If an organism survives better than others, then it is more likely to reach reproductive age and consequently has a better chance of reproducing, leaving offspring that carry part of its genome to the next generations. Hence natural selection does not favor the organisms which simply live longer than others, but rather the ones which live long enough to have more offspring surviving to adulthood than others. An organism is selected by nature to survive based on its interaction with its environment, which is highly dependent on the organism's phenotype. As a result, the genes which are passed through the generations are the ones which express the "selected" phenotype. Over a number of generations the phenotype of the fittest organisms dominates a population.

Besides reproduction, new information and new characteristics may appear in a population through mutation. Mutation happens when a DNA sequence changes spontaneously and randomly, altering the expression of the corresponding gene. Mutation may have an impact on the phenotype of the mutated organism and it is possible for totally new characteristics to emerge in the particular population. Although reproduction is the main process of evolving organisms, mutation may lead to the appearance of a new characteristic which could be the key to the better adaptation of the next generation to a changing environment. Summing up, evolution could be expressed as:

Evolution = reproduction with inheritance + variability + selection    Eq. 4.1

4.2 - EC Terminology and History

Biological terms are used to describe common parameters and concepts of EC algorithms and they are presented in Table 4.1.
Biology — meaning in EC algorithms:

Chromosome: A specific data structure which represents a solution to the modeled problem. A problem might be affected by a number of variables; each of these variables is somehow encoded into the chromosome.

Population: The set of different chromosomes participating in an EC algorithm.

Genotype: The values of the encoded data of a chromosome.

Phenotype: The decoded values of a chromosome, which are assigned to the corresponding problem variables.

Environment: The target of the problem being modeled.

Fitness Value: An evaluation of how well a specific chromosome, based on its phenotype, interacts with the environment; in other words, how well the solution suggested by the chromosome "fits" the problem.

Crossover (Reproduction): The process where two artificial chromosomes exchange genotype information.

Mutation: The process where the genotype of a chromosome is altered in a specific way and position.

Selection: The process where a certain number of chromosomes are selected, usually based on their fitness values.

Table 4.1: How biological terms are used in EC algorithms

The EA area comprises four basic techniques:

(a) Genetic Algorithms (Holland, 1975)
(b) Genetic Programming (Koza, 1992)
(c) Evolutionary Strategies (Rechenberg, 1973)
(d) Evolutionary Programming (Fogel, 1966)

They all work on populations of artificial chromosomes (also called individuals). Each chromosome suggests a solution to the problem being modeled. The initial population of chromosomes is evolved through a number of generations and eventually converges to a "good" solution. The main difference between the four EC paradigms is the chromosome representation scheme. A short description of each follows, with a more comprehensive description given for Genetic Algorithms, which seem to be the most widely used of them all.

4.2.1 - Genetic Algorithms (GA)

Representation: The basic representation for a GA chromosome is a fixed-length bit string.
Each position of the string represents a gene. A set of one or more sequential genes expresses the phenotype (value) of one specific variable of the problem.

Basic Operators:

Reproduction: Bit-string crossover. Two chromosomes are selected as parents and swap a specific sequence of their genes, producing two new offspring which eventually replace their parents.

Mutation: Bit-string mutation. A number of genes (possibly zero) of a chromosome are chosen and flipped.

Selection: A number of chromosomes are selected using probabilistic methods (described in greater detail later) which are based on the chromosome fitness values.

4.2.2 - Genetic Programming (GP)

Representation: GP is mainly used to evolve programs. A program consists of a predefined terminal set, which includes inputs, constants or any other value the variables of the program can take, and a function set which includes the different statements, operators and functions used by the program.

Terminal Set: {-4, -3, -2, -1, 0, 1, 2, 3, 4}, {a, b, c}, {true, false}
Function Set: {+, -, *, /}, {and, or, not}, {if, do, set}

Table 4.2: Terminal and Function set examples

The basic representation of a GP chromosome is then a tree whose inner nodes take elements from the function set and whose terminal nodes (leaves) take elements from the terminal set (Figure 4.2).

Basic Operators:

Reproduction: Sub-tree crossover. Two chromosomes (program trees) are selected as parents according to the selection policy. In each parent a sub-tree is selected randomly (the sub-trees may have unequal sizes) and then the sub-trees are swapped. Usually there is a maximum tree-depth constraint which restricts the choice of a sub-tree.

Mutation: Sub-tree mutation. A node is randomly chosen and replaced by a randomly generated sub-tree. Again the maximum depth constraint applies.

Selection: As in Genetic Algorithms.
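The tree representation can be illustrated with a small interpreter. This is a hedged sketch, not part of any GP library: trees are nested tuples whose heads come from a hypothetical function set, and the example encodes the program sin(x) + exp(x·x + y) of Figure 4.2.

```python
import math

# A sketch of the GP tree representation: inner nodes come from the
# function set, leaves from the terminal set. The tree below encodes the
# program sin(x) + exp(x*x + y) as nested tuples (an illustrative choice).

def evaluate(node, env):
    """Recursively evaluate a program tree against variable bindings."""
    if isinstance(node, str):          # terminal: a variable name
        return env[node]
    if not isinstance(node, tuple):    # terminal: a constant
        return node
    op, *args = node
    vals = [evaluate(a, env) for a in args]
    ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b,
           "sin": lambda a: math.sin(a), "exp": lambda a: math.exp(a)}
    return ops[op](*vals)

tree = ("+", ("sin", "x"), ("exp", ("+", ("*", "x", "x"), "y")))
```

Sub-tree crossover and mutation then amount to replacing one nested tuple with another, which is why the tree encoding suits those operators so naturally.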
Figure 4.2: Genetic tree expressing the program sin(x) + exp(x² + y)

4.2.3 - Evolutionary Strategies (ES)

Representation: A fixed-length vector of real values is the ES representation. As opposed to GA, the basic representation of ES expresses the phenotype of a chromosome rather than the genotype. Each element of the vector holds the actual value of a problem variable.

Basic Operators:

Reproduction: Intermediate recombination. Two (or more) parental chromosomes are averaged element by element to form a new offspring.

Mutation: Gaussian mutation. A random value is drawn from a Gaussian distribution and added to each element of a chromosome.

Selection: The best-fitted chromosomes, either only amongst the offspring or from both parents and children, are passed on to the next generation.

4.2.4 - Evolutionary Programming (EP)

Representation: Similar to ES, the representation of EP chromosomes is a fixed-length real-valued vector.

Basic Operators:

Reproduction: None. There is no exchange of information between the chromosomes of a population; evolution is driven only by mutation.

Mutation: Distribution mutation (e.g. Gaussian).

Selection: Each chromosome replicates itself, producing an identical offspring. Then mutation is applied to that offspring. Finally, using a stochastic method, the best-fitted chromosomes are selected from the parents and the offspring.

4.3 - Genetic Algorithms

GA is a method which evolves a population of chromosomes (candidate solutions), aiming at finding a good solution to a problem after a number of generations. GAs search a problem's set of potential solutions until a specified termination criterion is satisfied and an approximately good solution is found. Essentially Holland, who introduced GAs in their canonical form, incorporated nature's principle of the "survival of the fittest" into an optimization algorithm.

4.3.1 - Search space

GAs are mostly used to solve optimization problems.
Optimization is about finding a good/approximate solution to a problem (e.g. the minimum/maximum of a function). To perform optimization, a set X of the problem's potential solutions x ∈ X is needed, together with a method of evaluating each potential solution, f: X → ℜ. The search space of a problem is defined as the collection of all feasible solutions to the problem. Every single point of the search space comprises a set of values of the problem's variables which gives a specific solution to the problem. Therefore the optimum solution is a point of the search space. A point lying in a problem's search space is also assigned a fitness value, depending on how well it solves the particular problem. An optimization algorithm is then defined to explore the search space of a problem so as eventually to converge to a solution which will be regarded as the optimum one. Hence a GA, being an optimization algorithm, applies its search to the search space of a problem.

4.3.2 - Simple Genetic Algorithm

An algorithm is a sequence of steps which one must follow to solve a problem. GA is an iterative algorithm which, at each step, handles a population of feasible solutions to a problem, encoded in chromosomes. GA mimics the process of evolution: at every iteration a new population of chromosomes is generated from its ancestors using certain genetic operators. GAs are heuristic search algorithms which aim at finding optimum solutions to a problem by mimicking the process of natural evolution. Their great advantage is that they can easily handle multi-dimensional problems without having to use differential equations. The principle of GAs is based on the survival of the fittest chromosomes-solutions among an initial random population of chromosomes. Once the population is generated the algorithm is applied and thus the evolution process begins. This principle can be used for all kinds of problems and applications.
For each application, the process of chromosome evaluation is defined to guide the selection process towards the optimum solution of the modeled problem. Chromosomes with better evaluation scores have a higher probability of reproducing compared to the ones with worse evaluation scores.

The first step of a GA is the generation of the initial population of chromosomes, X. A defined number N of fixed-length bit strings are created, X = (x_1, ..., x_N), to represent a subset of all the candidate solutions of a problem; they are randomly initialized. A fitness function is used to calculate each chromosome's fitness value f_i, which is a numerical value showing how close the particular solution is to the optimum one. Then the following steps are iterated until the termination criterion is satisfied. Until N offspring are generated, repeat the sub-steps 1-4:

(1) A pair of chromosomes (x_i, x_j) is selected to function as parents, using a probabilistic selection method based on their fitness values.
(2) With a predefined probability P_C the two parental chromosomes swap genetic information using a crossover operator, producing two offspring (o_1, o_2).
(3) Each offspring is then mutated with probability P_M.
(4) The fitness value off_f_i of each offspring is evaluated.

After the generation of N offspring, the current population of chromosomes is replaced by the new population of offspring and a new generation cycle begins. Using a genetic algorithm paradigm for solving an optimization problem requires a preprocessing phase, including the specification of the algorithm's parameters and methods. Therefore for any specific problem the following must be defined:

1. The representation of a solution as a chromosome.
2. The fitness function.
3. The genetic operators (crossover, mutation and selection).
4. Other parameters such as the termination conditions and the probabilities P_C and P_M.
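The generation cycle above can be sketched end-to-end. This is a minimal illustration under assumed settings: the "one-max" fitness (count of ones) stands in for a real problem, and the operators are simple versions of those described in the sections that follow.

```python
import random

# Minimal sketch of the GA generation cycle (sub-steps 1-4 above).
# The "one-max" fitness (count of ones) is a stand-in problem.

L, N, PC, PM, MAX_GEN = 20, 30, 0.7, 0.01, 50

def fitness(x):
    return sum(x)                        # hypothetical fitness: count of ones

def select(pop, fits):                   # fitness-proportionate selection
    r = random.uniform(0, sum(fits))
    cumulative = 0.0
    for chrom, f in zip(pop, fits):
        cumulative += f
        if r <= cumulative:
            return chrom
    return pop[-1]

def crossover(p1, p2):                   # one-point crossover
    p = random.randint(1, L - 1)
    return p1[:p] + p2[p:], p2[:p] + p1[p:]

def mutate(x):                           # bit-flip mutation with probability PM
    return [1 - b if random.random() < PM else b for b in x]

pop = [[random.randint(0, 1) for _ in range(L)] for _ in range(N)]
for generation in range(MAX_GEN):
    fits = [fitness(x) for x in pop]     # evaluation (sub-step 4)
    offspring = []
    while len(offspring) < N:
        p1, p2 = select(pop, fits), select(pop, fits)    # sub-step 1
        if random.random() < PC:                         # sub-step 2
            p1, p2 = crossover(p1, p2)
        offspring += [mutate(p1), mutate(p2)]            # sub-step 3
    pop = offspring[:N]                  # generational replacement

best = max(pop, key=fitness)
```

With these settings the best chromosome typically approaches the all-ones string within the 50 generations; the point of the sketch is the loop structure, not the toy problem.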
4.3.2.1 - Representation

The most common representation in GAs is binary, which means each chromosome is a series of zeros and ones which can be decoded into numerical values. A problem is affected by a number of variables. Let v be such a variable. Then:

If v ∈ {0, 1, ..., 2^l − 1} where l ∈ ℕ    Eq. 4.2

then the binary representation of v is [b_1 b_2 ... b_l], where b_i ∈ {0, 1}    Eq. 4.3

and the integer value of v will be decoded according to:

v = Σ_{i=1}^{l} b_i 2^{i−1}    Eq. 4.4

However there are some cases where a variable's domain cannot be represented by Equation 4.2. Therefore a more generalized definition is given as:

v ∈ {M, M + 1, ..., M + 2^l − 1} where l, M ∈ ℕ    Eq. 4.5

where M is the minimum value of the variable's domain (v_min) and M + 2^l − 1 is the maximum (v_max). The binary representation should therefore cover the distance v_max − v_min = M + 2^l − 1 − M = 2^l − 1, which is the same range as in Equation 4.4, and the decoded value would be:

v = v_min + Σ_{i=1}^{l} b_i 2^{i−1}    Eq. 4.6

For a real-valued domain, the distance between the maximum and minimum values of the variable's domain is divided into 2^l − 1 equal sub-distances, which means the domain is restricted to 2^l real values with a precision of q decimal points. The decoded value is retrieved by:

v = v_min + ((v_max − v_min) / (2^l − 1)) Σ_{i=1}^{l} b_i 2^{i−1}    Eq. 4.8

If more than one variable defines the problem, then the binary representation is calculated for each one based on its type of domain and its boundaries. The total problem solution representation is then given by concatenating them into one.

Figure 4.4: Example of decoding the binary representation of a problem with two variables (v1 and v2).

Figure 4.3: Logic Diagram of GA using a maximum-generations (MAX_GEN) termination criterion

4.3.2.2 - Fitness Function

The fitness function most of the time "translates" the natural objective/cost function of the optimization problem.
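Since the fitness function operates on decoded values, a decoding helper in the sense of Equation 4.8 is useful at this point. A sketch, following the text's convention that b_1 is the least-significant bit; the domains of the two-variable example are illustrative assumptions.

```python
# Decode a bit-string chromosome segment into a real value (Eq. 4.8).
# Following the text's convention, b_1 (the first list element) is the
# least-significant bit, contributing b_1 * 2^0.

def decode(bits, v_min, v_max):
    l = len(bits)
    integer = sum(b * 2**i for i, b in enumerate(bits))   # Eq. 4.4
    return v_min + (v_max - v_min) * integer / (2**l - 1)

# Two variables concatenated into one chromosome, as in Figure 4.4
# (the domains here are hypothetical):
chrom = [1, 0, 1, 1, 0, 1, 0, 0]
v1 = decode(chrom[:4], 0.0, 15.0)    # first 4 genes
v2 = decode(chrom[4:], -1.0, 1.0)    # last 4 genes
```

With l = 4 and the domain [0, 15] the decoding is exact integer decoding; for the domain [-1, 1] the 16 representable values are spaced 2/15 apart.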
In cases where the problem's objective function is known, it is used as the fitness function as well. Generally the fitness function should return adequate information on how close the solution suggested by each chromosome is to a good solution. Hence the fitness function is applied to the decoded values of a chromosome. Fitness values form the basis of the selection operator since, generally, the fittest chromosomes have a higher probability of being selected as parents.

4.3.2.3 - Genetic Operators

4.3.2.3.1 - Selection

Selection takes place during the "breeding" phase of the algorithm, where new chromosomes must be generated to replace the present population. Following natural selection, GA selection favors the recombination of the fittest chromosomes so that their genome has a higher probability of passing to the next generation. However, less fit individuals are not totally excluded from the breeding phase, retaining in that way the population's diversity and taking advantage of potentially "good" partial solutions. Three selection methods are presented below and for all of them f_i denotes the fitness of the i-th chromosome, where 1 ≤ i ≤ N.

4.3.2.3.1.1 - Roulette Wheel Selection

As its name suggests, this selection method resembles the function of a casino roulette wheel. First the percentage of each chromosome's fitness relative to the total population fitness must be calculated:

P_i = f_i / Σ_{i=1}^{N} f_i    Eq. 4.9

A roulette wheel represents the total population fitness. Each chromosome occupies a slice of the roulette wheel, sequentially. The size of the slice is proportional to the chromosome's contribution P_i to the total fitness. Hence a fitter chromosome will be assigned more slots of the roulette wheel than a less fit one. The exact position of each chromosome on the roulette wheel is defined by its cumulative fitness value, defined as:

C_j = Σ_{i=1}^{j} P_i    Eq. 4.10

The roulette-wheel ball is implemented as a randomly generated number r in the range [0, 1].
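The whole mechanism can be sketched as follows; the example values anticipate Figure 4.5, where f = {15, 25, 10} and r = 0.83 select the third chromosome.

```python
# Roulette-wheel selection: P_i (Eq. 4.9) accumulated into C_j (Eq. 4.10);
# the "ball" r lands in the slice of the selected chromosome.

def roulette_select(fits, r):
    total = sum(fits)
    cumulative = 0.0
    for i, f in enumerate(fits):
        cumulative += f / total      # C_i
        if r < cumulative:
            return i                 # the ball fell inside slice i
    return len(fits) - 1             # guard against rounding when r is near 1

print(roulette_select([15, 25, 10], 0.83))   # → 2: the third chromosome
```

In a running GA, r would be drawn afresh with random.random() for each selection.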
The ball "falls" within the boundaries of a specific chromosome's slice; that chromosome is the selected one. Therefore:

C_{selected−1} ≤ r < C_{selected}    Eq. 4.11

Figure 4.5: Imagine the roulette wheel "unfolded" onto a ruler. A population of three chromosomes with fitness values f = {15, 25, 10}. The chromosomes are positioned on the roulette wheel and a random number is generated, r = 0.83. The selected chromosome is the third one.

Although roulette wheel selection is one of the traditional GA selection methods, it exhibits some problems. Premature convergence is the first one. Let a chromosome have a fitness value much higher than the average population fitness value, yet considerably lower than the optimum. Roulette wheel selection will allow this chromosome to occupy a very big slice of the roulette pie (e.g. 90%) and therefore the rest of the chromosomes will have too few chances to be selected. As a result, in a small number of generations all the chromosomes of the population will be nearly identical to this chromosome, unable to escape from its genotype and therefore unable to reach the optimum or any better solution. In other words, the population may converge to a local optimum.

Another scenario which roulette wheel selection fails to deal with effectively is when all chromosomes of a population have approximately equal fitness values close to the optimum value (but none of them is the optimum). In this case, each chromosome possesses an almost equal slice of the roulette wheel and therefore they all have equal chances of being selected, eliminating any selection pressure (i.e. the degree to which fitter chromosomes have a higher probability of producing offspring). Roulette wheel selection then behaves just like random selection, unable to use the fitness information given for each chromosome.

Fitness scaling is used to address premature convergence as described above.
The idea is to rescale the fitness of each chromosome so as to decrease the selection pressure towards a chromosome which might have a fitness value much larger than the rest. By fitness scaling this chromosome is prevented from dominating the population prematurely. Linear scaling is one of the available scaling methods, under which the new scaled fitness values are evaluated as:

f_i' = a·f_i + b    Eq. 4.11

The basic idea of linear scaling is that the chromosome with the average fitness is expected to be selected once. Therefore the average fitness value after scaling remains equal to the average fitness value before scaling:

f_avg' = f_avg    Eq. 4.12

Given this, the user defines how many times the best-fit chromosome is expected to be selected (expressed by a number C ∈ ℜ). Then the linearly mapped best fitness value is given by:

f_max' = C·f_avg    Eq. 4.13

Clearly, increasing the parameter C increases the selection pressure in favor of the fittest chromosome, while decreasing it decreases the selection pressure. Based on the two points (f_avg, f_avg') and (f_max, f_max') the coefficients a and b are calculated, and then, using Equation 4.11, the rest of the rescaled fitness values are computed.

Figure 4.6: Fitness rescaling function where C = 2

4.3.2.3.1.2 - Rank Selection

Rank selection ranks the individuals according to the fitness value of each chromosome. The chromosome with the best fitness is assigned the highest rank (r = N, where N is the population size), the worst-fit chromosome gets the lowest (r = 1), and all the others fall in between. The ranking values are then treated as the new fitness values. By using rank selection all chromosomes have a chance of being selected, even if one of them has a substantially larger fitness value. However, rank selection cannot take advantage of the information contained in the fitness differences among the chromosomes, often leading to much slower convergence.
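The two fitness transformations above can be sketched together. The coefficients a and b follow from the two constraints f_avg' = f_avg and f_max' = C·f_avg; the sample fitness values are illustrative.

```python
# Linear fitness scaling (Eqs. 4.11-4.13) and rank-based fitness.

def linear_scale(fits, C=2.0):
    f_avg, f_max = sum(fits) / len(fits), max(fits)
    if f_max == f_avg:                        # flat population: nothing to scale
        return list(fits)
    a = (C - 1.0) * f_avg / (f_max - f_avg)   # from f_max' = C * f_avg
    b = f_avg * (1.0 - a)                     # from f_avg' = f_avg
    return [a * f + b for f in fits]

def rank_fitness(fits):
    """Replace each fitness by its rank: worst -> 1, best -> N."""
    order = sorted(range(len(fits)), key=lambda i: fits[i])
    ranks = [0] * len(fits)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

print(linear_scale([10, 20, 30], C=2.0))   # → [0.0, 20.0, 40.0]
print(rank_fitness([10, 30, 20]))          # → [1, 3, 2]
```

Note that for C > 1 chromosomes far below the average may receive negative scaled values; practical implementations usually truncate these at zero.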
4.3.2.3.1.3 - Tournament Selection

With tournament selection, a group of m chromosomes is randomly selected and the fittest chromosome among them is chosen as a potential parent. Tournament selection is very easy to implement, efficient and, most importantly, it holds a balance between selective pressure and population diversity. The only parameter the user must tune is the number of chromosomes participating in each "tournament". As the number of competitors increases, the selection pressure also increases, and vice versa. Therefore the user must carefully define the tournament size m so as to keep the selection pressure and population diversity at the desired levels.

4.3.2.3.1.4 - Elitism

When elitism is used, a number of the best-fitted chromosomes are copied directly to the next generation. The fittest chromosomes are passed to the new population automatically in order to keep the best solutions (up to that point) as a backup. The rest of the operators (selection, crossover, mutation) are applied as in the canonical GA. In this way, any good genetic code achieved up to that point cannot be destroyed by mutation, which improves GA performance.

4.3.2.3.2 - Crossover

The crossover operator can be considered the main genetic operator of evolution in the context of GAs. Crossover allows the combination of partial solutions. Two chromosomes (which were first selected among others) exchange genetic information and two offspring are created. The genome of each child is a combination of its parents' genomes. This hopefully leads to offspring which are better fitted to the "environment", meaning that they suggest better solutions to the optimization problem.

One-point Crossover

This is the simplest way of applying crossover. One crossover point is randomly selected so that p ∈ [1, l], where l is the length of the chromosome bit-string. Then the two parental chromosomes swap the genetic code defined by the crossover point p.
Figure 4.7: One-point crossover for two chromosomes of length 10. Position 6 is randomly selected and two offspring (on the right of the figure) are created.

Two-point Crossover
Two-point crossover differs from the former only in that two positions, p1 and p2, are randomly selected along the chromosome’s length. The genetic code included in the range defined by these crossover points is then swapped between the two parents.

Figure 4.8: Two-point crossover for two chromosomes. The genetic sequence between the randomly selected points p1 and p2 is swapped.

The addition of an extra crossover point allows the GA to combine parts of the chromosome’s genome other than the head and the tail. In contrast, one-point crossover always swaps both head and tail. Therefore, by using two-point crossover, more potential solutions can be visited by the GA and the search space of the problem is covered more thoroughly. However, both one-point and two-point crossover share a common behaviour: genes which lie close to each other are implicitly correlated, in the sense that they are more likely to be swapped together. A problem whose variables depend on each other can benefit from this phenomenon by positioning the genotypes of these variables next to each other in the chromosome representation. In cases where this phenomenon should be avoided, however, uniform crossover can be used instead.

Uniform Crossover
For each pair of parents, a bit string of the same length as the chromosome, called the mask, is randomly generated. The first offspring is created following these rules: for every gene, if the mask’s corresponding bit is one, the offspring copies the corresponding gene value from parent 1; if the mask’s bit is zero, the offspring copies the gene value from parent 2. The same applies for the second offspring, substituting parent 1 with parent 2 and vice versa.
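The mask rule just described can be sketched directly; a minimal illustration (function name and sample parents are mine):

```python
import random

def uniform_crossover(parent1, parent2):
    """For each gene draw a mask bit: 1 -> child1 inherits from parent1
    and child2 from parent2; 0 -> the roles are reversed."""
    mask = [random.randint(0, 1) for _ in parent1]
    child1 = ''.join(a if m else b for m, a, b in zip(mask, parent1, parent2))
    child2 = ''.join(b if m else a for m, a, b in zip(mask, parent1, parent2))
    return child1, child2

c1, c2 = uniform_crossover('00000', '11111')
```

Because the mask is drawn independently per gene, no positional correlation between nearby genes remains.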
Figure 4.9: Uniform crossover

4.3.2.3.3 - Mutation
As mentioned above, crossover is the main tool of evolution for creating new chromosomes with better fitness. Mutation, however, can introduce new genotypes which are not present in the population, helping the algorithm to escape from local optima. Therefore, even if a “good” genotype is lost through the generations, it might be recovered by applying mutation to the population. The most common mutation operator for binary representations is to create a vector of random real values r ∈ [0, 1]^l, where l is the size of the chromosome. Each element of this vector is a random value corresponding to one gene of the chromosome. If this random value is less than a predefined probability constant Pm, then the particular gene is flipped (zero to one and one to zero):

If r_i ≤ Pm then mutated_x_i = 1 − x_i    Eq. 4.14

Figure 4.10: Mutation of a chromosome under the probability Pm = 0.2

4.3.2.4 - Setting parameters
The GA is a stochastic and heuristic method, and one should fine tune the various GA parameters based on the particular problem to be modeled. Since GAs are applied to a wide range of different problem classes, there are no general, precise rules for setting each parameter. However, knowing the effect of each parameter helps to adjust them properly for a given problem. First of all, a GA's convergence depends on the initial positioning of the chromosomes in the search space. As mentioned above, the GA uses chromosomes to search the problem space in parallel, driven by a selection policy. Therefore, if the population size is too small (with respect to the problem’s search space), the GA might be unable to search the space adequately for the optimum solution in reasonable time. On the other hand, if the population size is too large, convergence of the population will become much slower and the GA’s performance will be significantly decreased. The “right” population size is problem dependent.
The dimensions of the problem space must also be considered when defining the population size and the chromosome’s length. Crossover and mutation are the basic genetic operators applied in the canonical GA, and properly adjusting their parameters enhances the GA’s effectiveness and efficiency.

The crossover phase depends strongly on the only crossover parameter, the crossover probability PC. If PC is zero then no crossover takes place, and the newly generated population is just a copy of the previous one. On the other hand, if PC is one then every selected pair of parents will eventually be recombined to create two offspring with their genetic code mixed. Basically, the crossover probability defines the frequency of chromosome recombination. Crossover is applied to create new chromosomes which combine the genetic information of their parents. This means that they are able to combine partial solutions (which may be good solutions) to create a new solution (which may be better). It is also desirable to pass some chromosomes to the new population untouched, in the hope that chromosomes suggesting very good solutions will serve evolution better if they are not disrupted.

The probability Pm defines how often the genes of a chromosome are mutated. If Pm is extremely small, the algorithm cannot benefit from the mutation operator, since very few genes will be altered, and very rarely. Therefore either the population will converge very slowly to an optimum solution, degrading GA performance, or it may not converge at all. On the other hand, if Pm is large, the genome of the population will be highly disrupted with every generation. Eventually the chromosomes will not evolve; rather, they will change their genotypes totally at random, and the GA will degenerate into an unguided, purely random search of the problem space. Besides, the basic idea of the GA is to incorporate the effects of the natural selection principles into an optimization algorithm.
By increasing the mutation probability substantially, these principles cannot be applied at all. Typically the mutation probability is advised to be approximately equal to 1/l, where l is the chromosome length. Fine tuning the crossover and mutation probabilities helps the GA balance between exploration and exploitation in the search for an optimum solution. Exploration is mainly “powered” by the crossover operator and much less by mutation, while exploitation is driven by selection. Crossover allows the GA to explore the search space of the problem guided by the information present in the current chromosomes. Chromosomes are combined to produce new chromosomes suggesting new solutions, built from a mixture of their ancestors’ genotypes, yet still totally new. Mutation also introduces new genetic code, in a more random fashion, helping the algorithm to escape potential local optima. However, since mutation is generally preferred to happen rarely, it does not strongly affect the exploration performed by the algorithm.

As mentioned above, crossover is an exploration operator producing new solutions based on the present chromosomes, and to some degree in favor of the fittest ones. This essentially means that, after a number of generations, the new chromosomes will introduce solutions close to the best-fit chromosomes. This is exploitation, and it is implemented by the selection operator. After a certain number of generations, a significant percentage of the population will have fairly good and mutually similar fitness values. Since fitter chromosomes are more likely to be recombined, the resulting offspring will give new solutions, only these solutions will be fairly close to the ones suggested by their ancestors. Therefore, in this phase, the GA searches for optima around the area of the “good” present solutions.
By choosing a fitness function that gives proper information about the quality of a solution, along with fine tuning of the crossover and mutation parameters and a suitable selection method, the probability that the GA converges to the best possible solution increases substantially.

4.3.3 - Overall
In dealing with optimization problems, a designer is faced, to start with, with the multidimensionality of the problem and the visualization of the search space. In addition, such problems are complicated to analyze, and random search is often incapable of giving any solution in a reasonable time period, mainly because the process gets trapped in local minima. Additionally, there are cases where very little is known about the problem function (fitness landscape), and it may be the case that this fitness landscape is extremely difficult in terms of discontinuities, rough boundaries, etc., where conventional optimization methods (e.g. gradient methods) cannot perform. On the other hand, an implementation of the problem with GAs is straightforward, practical and needs no special technical expertise. An additional advantage of GAs in reaching a solution is the fact that calculations are done in parallel, with a number of processes searching for the solution at the same time. Also, processes exchange information through crossover, allowing “good” solutions to be combined into new, “better” solutions and arriving at the optimum one faster.

As every other heuristic method, GAs have drawbacks. More specifically, GAs do not always find the optimum solution to a problem; it is better said that GAs always give a “suitable” solution. A GA is therefore an ideal tool for retrieving a number of approximate solutions to a problem, but even in cases where an approximate solution is not adequate, a GA can be combined with other local search algorithms, greatly increasing the probability of reaching the optimum.
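The operators discussed in this chapter (tournament selection, one-point crossover, bit-flip mutation per Eq. 4.14, and elitism) can be combined into a minimal generational GA. The sketch below runs on the OneMax toy problem (maximize the number of 1-bits); the objective, all parameter values and all names are illustrative choices of this sketch, not taken from the text:

```python
import random

def evolve(pop_size=30, length=20, generations=60, p_c=0.9, p_m=None, m=3):
    """Minimal generational GA on OneMax: tournament selection,
    one-point crossover applied with probability p_c, bit-flip
    mutation at the rule-of-thumb rate p_m = 1 / chromosome length,
    and elitism of one chromosome."""
    if p_m is None:
        p_m = 1.0 / length
    fitness = lambda c: c.count('1')
    pop = [''.join(random.choice('01') for _ in range(length))
           for _ in range(pop_size)]
    for _ in range(generations):
        new_pop = [max(pop, key=fitness)]                  # elitism
        while len(new_pop) < pop_size:
            p1 = max(random.sample(pop, m), key=fitness)   # tournament
            p2 = max(random.sample(pop, m), key=fitness)
            if random.random() < p_c:                      # one-point crossover
                cut = random.randint(1, length - 1)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):                         # mutation (Eq. 4.14)
                new_pop.append(''.join(str(1 - int(x))
                                       if random.random() <= p_m else x
                                       for x in child))
        pop = new_pop[:pop_size]                           # trim odd extra child
    return max(pop, key=fitness)

random.seed(0)        # reproducible demo run
best = evolve()
```

With elitism, the best fitness is monotonically non-decreasing over generations, so even this toy configuration reliably approaches the all-ones optimum.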
Chapter 5 - Fuzzy Cognitive Maps

5.1 - Introduction
A wide range of real-life dynamical systems are characterized by causality through their interrelations. Most of the time, experienced and knowledgeable users can identify the parameters and the conditions of their system. However, the complexity of these systems acts as a prohibitive factor in the prognosis of their future states. Nevertheless, it is crucial for users to make a prognosis of the cost of changing the state of a concept when deciding whether or not to take the risk of such a change.

Kosko presented FCMs as an extension of Cognitive Maps (Kosko, 1992). CMs were first introduced as a decision-making modelling tool for political and social systems (Axelrod, 1976). Such systems can be analyzed into two fundamental elements: the set of concepts (the parameters of the system) and the interrelationships or causalities amongst them. They are graphically represented as the nodes and signed directed edges, respectively, of a signed digraph. The concepts which behave as causal variables are positioned at the origins of the arrows, which always point to the effect variables. Each edge can be either positive (+1), representing positive causality, or negative (-1), representing negative causality. However, according to Kosko, causality tends to be fuzzy and is usually expressed or described in vague degrees (a little, highly, always, etc.), whereas in CMs each node simply makes its decision based on the number of positive and negative impacts. FCM modelling constitutes an alternative way of building intelligent systems that provides users exactly this opportunity: predicting the final states of the system caused by a change of the initial concept states to a degree. Fuzzy Cognitive Maps (FCMs) constitute a powerful soft computing modelling method that emerged from the combination of Fuzzy Logic and Neural Networks.
FCMs were first introduced by Kosko (Kosko, 1986) and, since then, a wide range of applications in modelling complex dynamic systems have been reported, such as medical, environmental, supervisory and political systems. Essentially, a Fuzzy Cognitive Map is developed by integrating the existing experience and knowledge regarding a cause-effect system.

5.2 - FCM technical background
5.2.1 - FCM structure
In graphical form, an FCM is represented by a signed fuzzy weighted graph, usually involving feedback, consisting of nodes and directed links connecting them. The nodes represent descriptive behavioural concepts of the system and the links represent the cause-effect relations between the concepts.

5.2.1.1 - Concepts
A system is characterized and defined by a set of parameters associated with it. These parameters might be facts, actions, trends, restrictions, measurements, etc., depending on the system’s utility, goals and nature. Since the types of FCM parameters vary, they are all described as different concepts of the system.

Figure 5.1: A simple FCM modelling freeway conditions at rush hour (Shamim Khan, Chong, 2003)

More precisely, each concept is a fuzzy set and hence it is described by a membership function. The membership function returns zero if the concept is disabled and one if the concept is fully active. In other words, the membership function (as described in Chapter 3) defines the extent of a concept’s existence in a system. For example, the fuzzy set describing the concept “Freeway congestion” of Figure 5.1 (Shamim Khan, Chong, 2003) can be given by the membership function of Figure 5.2. The domain of the function is the average speed of all cars on the freeway over a fifteen-minute interval.
The membership degree is one if there surely is heavy congestion (the concept “Freeway congestion” is totally true), zero if there is no congestion at all (each car can drive at the maximum allowed speed if desired), and a value in between if there is “some” congestion (the cars still move at speeds above the lower limit).

Figure 5.2: Freeway Congestion fuzzy set

Every concept is represented by a node in an FCM. A node is characterized by an activation state value Ai, where usually Ai ∈ ℝ. This value is the membership degree of the concept, at a specific time instance, to its own fuzzy set as described above. So if the node of “Freeway congestion” possesses a value of 0.9, it is interpreted as heavy congestion in the freeway (close to full congestion). The credibility, or truthfulness, of an activation state value is strongly subjective, since it is defined by a group of humans with knowledge, experience and expertise in the system’s domain area. It is therefore probable that their estimations diverge considerably from each other while still saying the same thing. For example, consider the evaluations of the activation state values in Figure 5.3. The experts' estimations of each concept's activation value differ very much; however, the difference between the two concept values is quite similar for both experts. Hence it could be the case that each expert defined the system’s parameters on a different scale. To avoid such tricky situations, the activation value of a concept is most of the time bounded in the interval [0, 1].

            Concept 1    Concept 2
Expert 1:      41           50
Expert 2:      3.5          10

Figure 5.3: The definition of the activation state values (of an imaginary FCM) by two experts.

An FCM which describes a system of N variables manipulates the set of activation state values of all the concepts with a 1 × N vector A. The “Freeway conditions” FCM example includes five concepts (C1 – C5).
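A membership function of the kind just described maps a measurable quantity (average speed) to a degree in [0, 1]. The sketch below is purely hypothetical: the threshold speeds and the linear shape are my assumptions for illustration, not values given in the source:

```python
def freeway_congestion(avg_speed_kmh, free_speed=100.0, jam_speed=20.0):
    """Hypothetical membership function for "Freeway congestion":
    1.0 at or below jam_speed (full congestion), 0.0 at or above
    free_speed (no congestion), linear in between. The two threshold
    speeds are illustrative assumptions, not from the text."""
    if avg_speed_kmh <= jam_speed:
        return 1.0
    if avg_speed_kmh >= free_speed:
        return 0.0
    return (free_speed - avg_speed_kmh) / (free_speed - jam_speed)
```

For example, an average speed halfway between the two thresholds yields a membership degree of 0.5, i.e. “some” congestion.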
In this case the experts should define the activation state values (or activation levels) of each concept to reflect the real conditions of a specific freeway. Therefore, if they suggest the initial activation level vector A0 to be:

A0 = [0.2, 0.9, 0.7, 0, 1]

then we can infer that the conditions of the freeway are:
1. The weather is fairly good. (C1 = 0.2)
2. There is heavy congestion in the freeway; the cars have been moving relatively slowly for some time. (C2 = 0.9)
3. The drivers are quite frustrated. (C3 = 0.7)
4. There is no accident in the freeway. (C4 = 0)
5. The drivers are fully aware of the risk they run in the freeway. (C5 = 1)

5.2.1.2 - Weights (Sensitivities)
In the FCM context, weights or sensitivities are the links connecting two nodes. They are directional and they are characterized by a numerical value and a sign. Generally, a weight describes the cause-effect relationship between two concepts. The causal relations (weights) can be positive or negative: a positive weight expresses a direct influence relation, whereas a negative one defines an inverse relation between two concepts. Hence, if two concepts Ci and Cj are related to a degree defined by the weight Wij:
1. Wij > 0: if the activation level (AL) of the concept Ci, Ai, increases, then the AL of the concept Cj, Aj, will also increase. If the AL decreases, on the other hand, the AL of the effect concept will decrease.
2. Wij < 0: if the AL of Ci, Ai, is increased, then the AL of Cj, Aj, will be decreased, and vice versa.
3. Wij = 0: there is no causal relation between the concepts.

The FCM models graded causality: a concept can be designed to affect another concept to a degree. FCM thus escapes the drawback of bivalent causality from which CMs suffer. FCM causal relations are fuzzy and hence they belong to a fuzzy set. Each of them is assigned a real numerical value expressing the degree of the influence a concept receives from another concept.
In other words, the weight shows how strong and intensive the causal relation is. The value of a weight lies in the interval [-1, 1]. If the weight is -1 then the causal relation is totally negative (or inverse), and if it is +1 then it is totally positive (or direct). Anything between zero and -1, or zero and +1, corresponds to various fuzzy degrees of causality. The closer a weight is to zero, the weaker the relation. As the weight increases towards one, the causal concept affects the effect concept more intensively, whereas as the weight approaches -1 the causal concept affects the effect concept more intensively but inversely. An FCM which models a system with N variables captures the relationships amongst them with an N × N matrix W (the weight matrix). The element Wij represents the weight of the relation between the concepts Ci and Cj. Therefore the weights of the example in Figure 5.1 would be represented in a matrix:

W     C1    C2    C3    C4    C5
C1    0.0   0.5   0.0   0.2   0.0
C2    0.0   0.0   1.0   0.0   0.0
C3    0.0   0.0   0.0   1.0   0.0
C4    0.0   1.0   0.0   0.0   0.0
C5    0.0   0.0   0.0  -0.6   0.0

Figure 5.4: The weight matrix of the “Freeway Congestion” example as given in Figure 5.1, which can be analyzed as:
1. W12 = 0.5: Bad weather conditions (C1) contribute to the creation of congestion in the freeway (C2).
2. W14 = 0.2: Bad weather conditions (C1) contribute to a small degree to the creation of an accident on the road (C4).
3. W23 = 1: When there is congestion in the freeway (C2), the drivers will surely become frustrated (C3).
4. W34 = 1: When the drivers are frustrated (C3), there will certainly be an accident (C4).
5. W42 = 1: If an accident happens in the freeway (C4), congestion follows for sure (C2).
6. W54 = -0.6: The fact that drivers have risk awareness (C5) decreases to an adequate degree the chances of an accident in the freeway (C4).
7. All the remaining weights are zero, indicating the absence of any kind of causal relation between the corresponding concepts.
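The example's activation vector and weight matrix translate directly into code. The sketch below (plain Python; the helper name is mine) also shows one raw, unthresholded propagation step, where each concept sums the activations of its cause concepts weighted by the entering edges:

```python
# Activation vector A0 and weight matrix W of the freeway example
# (Figures 5.1 and 5.4); W[i][j] is the influence of concept C(i+1)
# on concept C(j+1).
A0 = [0.2, 0.9, 0.7, 0.0, 1.0]
W = [
    [0.0, 0.5, 0.0, 0.2, 0.0],   # C1: weather conditions
    [0.0, 0.0, 1.0, 0.0, 0.0],   # C2: freeway congestion
    [0.0, 0.0, 0.0, 1.0, 0.0],   # C3: driver frustration
    [0.0, 1.0, 0.0, 0.0, 0.0],   # C4: accident
    [0.0, 0.0, 0.0, -0.6, 0.0],  # C5: risk awareness
]

def raw_update(A, W):
    """One unthresholded update step: each concept receives the sum of
    its cause concepts' activations weighted by the incoming edges."""
    n = len(A)
    return [sum(A[j] * W[j][i] for j in range(n)) for i in range(n)]

A1 = raw_update(A0, W)
```

For instance, the new raw value of C4 combines bad weather (0.2·0.2), frustration (0.7·1.0) and risk awareness (1.0·(-0.6)).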
Consequently, an FCM can be represented by an activation levels vector A and a weight matrix W. The beauty of this model is that, by using only one vector and one matrix, one can describe in linguistic terms and analyze the relations of the system, due to the fact that the weights can be represented by fuzzy sets. A fuzzy set is defined by a membership function. Membership functions were described thoroughly in Chapter 3 of this study. FCMs use membership functions as a communication channel between the linguistic description of the influence level between two concepts and the numerical value of their relation (the weight). The name of the linguistic variable describing a weight is “Influence”. The linguistic values express different levels of “Influence” (e.g. low, medium, high) and they are defined by the designers of the system. Each linguistic value comprises a fuzzy set whose membership function is defined over the domain [-1, 1]. For example, the designers of the system presented in Figure 5.5 (Papageorgiou, 2004) defined nine linguistic values for the linguistic variable “Influence”. Therefore a weight can be {negative very strong, negative strong, negative medium, negative weak, zero, positive weak, positive medium, positive strong, positive very strong}, and the corresponding membership functions are: μnvs, μns, μnm, μnw, μz, μpw, μpm, μps and μpvs.

Figure 5.5: Membership functions describing the linguistic variable “Influence”

After defining the fuzzy sets of the “Influence” amongst the concepts, each weight of the system must be linguistically described (by a certain linguistic value). Thereafter, each weight is defuzzified (following a defuzzification method as mentioned in Chapter 3), so that each weight is characterized by a real value in the interval [-1, 1]. The weights of the example in Figure 5.1 can be described by the linguistic values shown in Figure 5.5, and thus:
1.
W12 = 0.5: This weight belongs totally (to degree one) to the set positive medium (μpm). Therefore C1 affects C2 directly (positively) to a medium degree.
2. W14 = 0.2: This weight belongs to two “Influence” fuzzy sets, zero influence (μz) and positive weak influence (μpw). Therefore C1 generally does not really affect C4, and when it does, only low levels of direct influence are applied to C4.
3. W23 = 1, W34 = 1, W42 = 1: All of them belong to the fuzzy set positive very strong influence (μpvs = 1). Therefore all the cause concepts (2, 3, 4) will cause the maximum direct effect on the effect concepts (3, 4, 2 respectively).
4. W54 = -0.6: It belongs to the fuzzy sets of negative strong influence (μns) and negative medium influence (μnm). Therefore C5 affects C4 inversely, to an adequate degree.

5.2.3 - Static Analysis
An FCM can be statically analyzed in terms of its density, the centrality of its concepts, the indirect and total effects between concepts, and its balance degree.

Density
Density is an FCM characteristic which can provide useful information about the dynamic behaviour of the network. Density is calculated from the number of concepts (N) and the number of causal relationships among the concepts (R) of an FCM. Density can be regarded as a metric describing the network connectivity, given by:

D = R / N²    Eq. 5.1

Density is a measure of the network’s complexity. A larger density indicates a more complex network, and it is therefore recommended that density be bounded in the range [0.05, 0.3] (Miao, Liu, 2000).

Relative centrality of a concept
Centrality is a measure defining the importance of a concept in the whole FCM system (Kosko, 1986), and it is calculated by aggregating the weights pointing to the concept and the weights originating from the concept:

concept_centrality(Ci) = in(Ci) + out(Ci)    Eq. 5.2
in(Ci) = Σ_{j=1..n} |W_ji|    Eq. 5.3
out(Ci) = Σ_{j=1..n} |W_ij|    Eq. 5.4

where 1 ≤ i, j ≤ N.
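Density (Eq. 5.1) and centrality (Eqs. 5.2–5.4) can be computed directly from the weight matrix. A minimal sketch, using the matrix of Figure 5.4 (function names are mine):

```python
def density(W):
    """Network density D = R / N^2 (Eq. 5.1), where R is the number of
    nonzero causal links and N the number of concepts."""
    n = len(W)
    r = sum(1 for row in W for w in row if w != 0.0)
    return r / n ** 2

def centrality(W, i):
    """Centrality of concept Ci (Eqs. 5.2-5.4): the sum of absolute
    weights entering the concept plus those leaving it."""
    incoming = sum(abs(row[i]) for row in W)
    outgoing = sum(abs(w) for w in W[i])
    return incoming + outgoing

W = [
    [0.0, 0.5, 0.0, 0.2, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, -0.6, 0.0],
]
```

The freeway example has R = 6 links over N = 5 concepts, so D = 6/25 = 0.24, near the upper end of the recommended [0.05, 0.3] range; the accident concept C4 has the largest centrality (1.8 in + 1.0 out).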
Concepts with large centrality play a vital role in distributing causality in the network and so, most of the time, they should be analyzed with greater care and detail.

Indirect and total effects
Consider two concepts of an FCM, Ci and Cj, which are connected through M different causal paths. Every path may be given as a vector (i, k_l1, …, k_lP, j), where 1 ≤ l ≤ M and P is the number of concepts participating in the particular path l. Let I_l(Ci, Cj) denote the indirect effect on Cj caused by Ci through the l-th causal path, and let T(Ci, Cj) represent the total effect between Ci and Cj:

I_l(Ci, Cj) = min{ W(z, z+1) | z ∈ (i, k_l1, …, k_lP, j) }    Eq. 5.5
T(Ci, Cj) = max_l I_l(Ci, Cj)    Eq. 5.6

where z and z+1 are sequential concepts on a specific path. Therefore the indirect effect of a path between Ci and Cj is the minimum weight along the path, while the total effect is the maximum indirect effect among all paths between the two concepts.

Balance Degree of an FCM
An FCM is considered unbalanced if two concepts are connected through two paths whose causalities carry different signs. In any other case the FCM is balanced. In order to find the balance degree of an FCM, the number of positive and negative semi-cycles in the FCM must be counted. A semi-cycle of a graph is any cycle obtained by ignoring the direction of the edges. A positive semi-cycle is constituted by any number of positive weighted edges and an even number of negative weighted edges, while a negative semi-cycle is constituted by any number of positive edges and an odd number of negative ones. A method of calculating the balance degree (Tsadiras, Margaritis, 1999) is:

â = ( Σ_m p_m f(m) ) / ( Σ_m t_m f(m) )    Eq. 5.7

or

â = ( Σ_m p_m f(m) ) / ( Σ_m n_m f(m) )    Eq. 5.8

where:
p_m is the number of positive semi-cycles of size m
n_m is the number of negative semi-cycles of size m
t_m is the total number of semi-cycles of size m
f(m) is a monotonically decreasing function, e.g. f(m) = 1/m, f(m) = 1/m² or f(m) = 1/2^m

Considering the example of Figure 5.1, there are three semi-cycles and all three of them are positive:
1. C1 → C2 → C4 → C1 (length m = 3)
2. C2 → C3 → C4 → C2 (length m = 3)
3. C1 → C2 → C3 → C4 → C1 (length m = 4)

Therefore p_3 = 2, p_4 = 1, t_3 = 2, t_4 = 1, and using f(m) = 1/m the balance degree is:

â = (2·1/3 + 1·1/4) / (2·1/3 + 1·1/4) = 1

The closer the balance degree is to one, the more balanced the FCM is considered, and vice versa.

5.2.4 - Dynamical Analysis of FCM (Inference)
FCMs are widely used for the dynamical analysis of certain scenarios (involving the parameters of the system) through time (discrete iterations). Inference in an FCM is expressed by allowing feedback through the weighted causal edges. Given the weight matrix along with the initial activation levels vector A0, the FCM “runs” through a number of steps until it converges to the final activation levels vector A_final containing the final stable states of the concepts. Finally, by analyzing the final state vector, one can retrieve important information about the potential effects caused by a change in the system’s parameters. The FCM inference algorithm is:

Begin
Step 1: Read the initial state vector A0, t = 0
Step 2: Read the weight matrix W
Step 3: t = t + 1
Step 4: For every concept Ci calculate its new activation level A_i^t by:
    A_i^t = Σ_{j=1..n} A_j^{t-1} W_ji    Eq. 5.9
    or any other activation function
Step 5: Apply a transformation function to the present activation levels vector
    A^t = f(A^t)    Eq.
5.10
Step 6: If (A^t == A^{t-1}) or (t == maximum iterations)
    A_final = A^t
    STOP
Otherwise GO TO Step 3
End

In the above algorithm, the variable t counts the iterations needed until the FCM model reaches a certain behaviour, which can be a steady point attractor, a limit cycle or a chaotic attractor. The vector A^t contains the activation levels of the concepts at the t-th iteration. The FCM allows the concepts to interact with each other to the degree specified by the weights connecting them. The new state value of each concept strongly depends on the state values of the concepts affecting it. The whole update process of the activation levels can be described as a simple vector-matrix multiplication:

A^t = f(A^{t-1} W)    Eq. 5.11

The algorithm terminates when one of the following three FCM behaviours is identified:
1. Steady point attractor: the FCM concepts stabilize their states at specific values, which implies that the activation levels vector is no longer altered through the iterations.
Figure 5.6: Steady point attractor. The X-axis represents the iteration number and the Y-axis the value of each concept state.
2. Limit cycle: the FCM model falls into a loop of activation level vectors which repeatedly appear, in the same sequence, every certain number of iterations.
Figure 5.7: Limit cycle behaviour
3. Chaotic attractor: the activation levels change continuously through the iterations in a totally random, non-deterministic way, leading the FCM model to unsteadiness.
Figure 5.8: Chaotic attractor

The FCM model can be used to answer “IF-THEN” questions about how a change to a subset of parameters can affect the whole system. The FCM modelling tool therefore provides a scenario development framework for retrieving hidden relations between the concepts of a system. To run a scenario, a particular sequence of steps must be taken.
The first step is to define the initial activation levels of the concepts and the weight matrix, and to introduce them to the FCM model. The FCM model will run through a number of iterations, allowing the interaction among the concepts and changing their activation levels according to the FCM inference algorithm. The inference process ends when the FCM converges to a steady point attractor (any other behaviour at this point implies that the modelled system suffers from design problems). The final activation levels vector exposes the present dynamics of the system, based on the experience and knowledge of the system’s designers. More precisely, the resulting activation levels vector can be used as the base vector for the development of various scenarios examining the system’s responses to different changes. To build a scenario, a “WHAT-IF” question is defined and the activation levels of the parameters which compose the premise of the question are locked to the desired values. The inference process is then initiated again, yet this time the activation values of the “locked” concepts are not updated through the iterations. The FCM inference algorithm terminates when the FCM model presents one of the three aforementioned system behaviours. However, in this case the final activation levels vector presents the future dynamics of the system after applying the changes to its parameters. Further analysis of the results may lead to various conclusions about the modelled system’s future progress under certain conditions (defined by the “locked” concepts).

Activation Functions
Every concept updates its activation level based on the activation levels of the concepts which affect it and the weights of their relationships. This process is defined by an activation function. There are several activation functions:

A_i^t = f( Σ_{j=1, j≠i}^{n} A_j^{t-1} W_ji )    Eq. 5.11
A_i^t = f( Σ_{j=1, j≠i}^{n} A_j^{t-1} W_ji + A_i^{t-1} )    Eq. 5.12
A_i^t = f( k1 Σ_{j=1, j≠i}^{n} A_j^{t-1} W_ji + k2 A_i^{t-1} )    Eq.
5.13

where:
A_i^t represents the activation level of the concept Ci at iteration t
n is the number of concepts which affect Ci
W_ji is the weight of the relation between the concepts Cj and Ci
k1 is a coefficient which defines to what degree the new state value of a concept is affected by the activation levels of its neighbouring concepts (those defined to affect it)
k2 is a coefficient which defines to what degree the new state value of a concept is affected by its own present activation level
f(.) is a transformation function, described below.

Transformation functions
The activation levels of a concept are defined over the interval [0, 1] or [-1, 1]. However, during the inference phase, the state value of a concept might exceed this interval (e.g. if a concept continuously receives positive effects aggregated with the activation function of Eq. 5.12, then at some point its new activation level might become larger than one). Transformation functions are used to map the real state value of a concept back to the interval [0, 1] (or [-1, 1]). Some functions widely used to transform activation levels are:

f(x) = 1 / (1 + e^(-λx))    Eq. 5.14

f(x) = tanh(x)    Eq. 5.15

f(x) = { 0,  x ≤ 0
         1,  x > 0 }    Eq. 5.16

f(x) = { -1,  x ≤ -0.5
          0,  -0.5 < x < 0.5
          1,  x ≥ 0.5 }    Eq. 5.17

5.2.5 - Building a FCM
FCM development is traditionally based on human experience and knowledge. It is therefore very important for an FCM designer to extract qualitative information from experts in the system’s domain and to transform this information into an FCM structure; otherwise the system may not be adequately modelled and the results might be invalid. Hence, a group of humans with proper expertise in the scientific or industrial area of the system to be modelled is required for building an FCM. The experts group is responsible for defining the number and the types of the concepts which will be represented by the FCM model.
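The inference loop of Section 5.2.4, combined with the sigmoid transformation of Eq. 5.14, can be sketched as follows. This is a minimal illustration using the freeway example; the values of λ, the convergence threshold and the iteration limit are my own assumptions:

```python
import math

def fcm_inference(A0, W, lam=1.0, max_iter=100, eps=1e-5, clamped=None):
    """FCM inference sketch: iterate A_i^t = f(sum_j A_j^(t-1) * W_ji)
    with the sigmoid transformation f(x) = 1 / (1 + exp(-lam*x))
    (Eq. 5.14), stopping at a steady point attractor (change < eps) or
    after max_iter iterations. `clamped` maps concept indices to values
    that stay locked, as in "WHAT-IF" scenario runs."""
    clamped = clamped or {}
    A = list(A0)
    n = len(A)
    for _ in range(max_iter):
        new_A = [1.0 / (1.0 + math.exp(-lam * sum(A[j] * W[j][i]
                                                  for j in range(n))))
                 for i in range(n)]
        for i, v in clamped.items():     # locked concepts keep their value
            new_A[i] = v
        if max(abs(a - b) for a, b in zip(A, new_A)) < eps:
            return new_A                 # steady point attractor reached
        A = new_A
    return A                             # no convergence within max_iter

# Freeway example (Figures 5.1 and 5.4)
A0 = [0.2, 0.9, 0.7, 0.0, 1.0]
W = [
    [0.0, 0.5, 0.0, 0.2, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, -0.6, 0.0],
]
A_final = fcm_inference(A0, W)
A_scenario = fcm_inference(A0, W, clamped={1: 1.0})  # lock C2 (congestion) at 1
```

Note that concepts with no incoming edges (C1 and C5 here) settle at f(0) = 0.5 under this transformation, a known side effect of the sigmoid mapping rather than a modelling statement.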
An expert is able to identify and distinguish the concepts which compose the environment of a system and which affect its behaviour. Due to their experience and interaction with similar kinds of systems, each expert has a specific map of the relations between the system's concepts. Each of them has a notion about which concepts affect other concepts and whether this influence relationship is positive or negative. They can also describe linguistically the degree of influence between the concepts. FCM development methodologies exploit expertise about the system's function and representation to create an FCM.

Linguistic Description of weights

This FCM development methodology is based on the linguistic description of the graded levels of the causal system's interconnections (Stylios 1999; 2000). According to this methodology, the expert(s), along with the FCM designer, must define the desired linguistic values of the linguistic variable "Influence" which best describe the causal relations of the system. More precisely, they must define the linguistic terms expressing the grading of the influence (e.g. weak, medium, strong) and then assign to each of them a membership function on the domain [-1, 1]. Right afterwards, the experts must spend some time thinking about and describing the present causal relationships between every pair of concepts. As a first step they define the sign of the cause-effect connection (positive or negative). Then they must linguistically evaluate the influence degree of the relation using one predefined linguistic value. Each causal relation is expressed by an IF-THEN rule, and therefore every expert must define N IF-THEN rules, where N is the number of the existing system interconnections. An IF-THEN rule describing a causal relation has the form:

IF the concept Ci takes the value of A THEN the value of the concept Cj changes to the value of B.
Conclusion (about the weight value): The influence of Ci upon Cj is D.
The terms A, B, and D are linguistic values (A and B describing the state of a concept and D describing linguistically the influence level). Each expert works individually. Hence, if the number of the experts is E, then there are E linguistic terms describing a specific weight, which are aggregated using the SUM method (or any other S-norm method). Finally, the real value of each weight is computed by applying a defuzzification method (e.g. the centre of area method) to the resulting aggregated membership function.

Group decision support using fuzzy cognitive maps

This method of FCM development strongly requires a group of several experts because, as mentioned, more than one expert increases the credibility of the resulting FCM model (Kosko, 1988; Khan, Quaddus, 2004; Salmeron, 2009). Following this methodology, the experts do not predefine the number of the participating concepts; rather, each expert works on his own from the very beginning of the development process, defining the number and the type of the concepts and describing their causal interconnections with real numbers in the range [-1, 1] (grading from strongly negative influence (-1) to strongly positive influence (+1)). Therefore each expert "builds" his own FCM based entirely on his own knowledge and experience. Then, all the FCM models proposed by the experts are merged to create the preliminary version of the final FCM. The merging process is relatively easy, since the proposed weight matrices are averaged to give the final resulting matrix:

W_{ij} = \frac{\sum_{e=1}^{E} W_{ij}^{e}}{E}    Eq. 5.18

A variation of this method requires each matrix to be weighted by a coefficient p representing the credibility of the expert who built the matrix:

W_{ij} = \frac{\sum_{e=1}^{E} p^{e} W_{ij}^{e}}{\sum_{e=1}^{E} p^{e}}, \quad p^{e} \in [0, 1]    Eq. 5.19

If the credibility factor is one then the expert is considered totally credible. The credibility factors are initialized to one for all the experts. Then a credibility adjustment policy has to be determined by the FCM designer.
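Eq. 5.18 and Eq. 5.19 amount to a (possibly credibility-weighted) element-wise average of the experts' matrices. A small Python sketch, with illustrative names:

```python
def aggregate_weights(matrices, credibility=None):
    """Merge the weight matrices proposed by E experts (Eq. 5.18 / Eq. 5.19).

    matrices:    list of E weight matrices of identical shape
    credibility: optional list of E factors p^e in [0, 1]; if omitted,
                 all experts count equally (the plain average of Eq. 5.18)
    """
    E = len(matrices)
    p = credibility if credibility is not None else [1.0] * E
    total = sum(p)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    # credibility-weighted average per interconnection (i, j)
    return [[sum(p[e] * matrices[e][i][j] for e in range(E)) / total
             for j in range(cols)] for i in range(rows)]
```

With all credibility factors at one this reduces exactly to Eq. 5.18.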
For example, in every step, if a weight value W_{ij}^{e'} differs from all the rest W_{ij}^{e}, for 1 \leq e \leq E and e \neq e' (or if a weight value diverges from the mean value of the whole set of weights), then the corresponding expert is penalized by multiplying his credibility factor with a penalty factor \mu, so that p^{e'} = \mu p^{e'} (Stylios, Groumpos, 2004). Finally, the group of experts examines this preliminary version of the FCM to find groups of concepts which are identical in their meaning yet have different names. For each of these identified groups, all but one concept are eliminated. Clearly, all the links arriving at or leaving from the removed concepts are also deleted.

5.2.6 - Learning and Optimization of FCM

In the FCM context, learning and optimization mainly refer to methods which are used to adapt the weights of the FCM interconnections in order to achieve a more precise and credible representation of the modelled system. All FCM learning algorithms take for granted that the number and the type of the concepts are known and given. The general motivation for developing such methods is to overcome potential conflicts among a group of experts about one or more weight evaluations. Furthermore, some (Stach, Kurgan, Pedrycz, 2005; Stach, Kurgan, Pedrycz, Reformat, 2005) argue that human-based knowledge is subjective by nature and therefore data-driven FCM methodologies should be used instead, to enhance the accuracy and the operation of the FCMs.

5.2.6.1 - Unsupervised Learning

Unsupervised training of FCMs comprises Hebbian-based methodologies. All of them involve the definition of an initial weight matrix (as suggested by the experts). Based on the initial weight matrix and by applying iterative methods, which are variations of the Hebbian law, they converge to the final weight matrix, which is considered optimally adjusted to the modelled decision making and prediction problem. Differential Hebbian Learning (DHL) was the first proposed learning method for FCM (Dickerson and Kosko, 1994).
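One step of the credibility adjustment policy sketched above could look as follows. The divergence test (distance from the mean proposal) and the threshold `tol` are hypothetical choices, since the source leaves the exact policy to the FCM designer; only the penalty rule p^{e'} = \mu p^{e'} comes from the text.

```python
def penalize_outliers(weights, credibility, mu=0.9, tol=0.3):
    """For one interconnection: penalize experts whose proposed weight
    diverges from the mean of all proposals by more than `tol`,
    multiplying their credibility factor by the penalty factor mu.

    weights:     the E weight values proposed for this interconnection
    credibility: the current E credibility factors p^e
    """
    mean_w = sum(weights) / len(weights)
    return [p * mu if abs(w - mean_w) > tol else p
            for w, p in zip(weights, credibility)]
```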
DHL tries to identify pairs of concepts whose values (activation levels) are positively or negatively correlated (e.g. if the value of the cause concept increases then the value of the effect concept increases as well). The algorithm increases the weight value of the edge connecting such pairs of concepts. If the modifications of two connected concepts are not correlated then the algorithm decreases their weight. For every discrete iteration t the algorithm assigns the following value to each weight connecting the concept Ci to Cj:

W_{ij}^{t+1} = \begin{cases} W_{ij}^{t} + c^{t} \left( \Delta A_i^t \Delta A_j^t - W_{ij}^{t} \right), & \Delta A_i^t \neq 0 \\ W_{ij}^{t}, & \Delta A_i^t = 0 \end{cases}    Eq. 5.20

\Delta A_i^t = A_i^t - A_i^{t-1}    Eq. 5.21

c^{t} = 0.1 \left[ 1 - \frac{t}{1.1 q} \right], \quad q \in \mathbb{N}    Eq. 5.22

where A_i^t represents the activation levels of the concept Ci in the tth iteration and \Delta A_i^t is the difference between its activation levels in the present iteration and the previous one (t-1). The learning coefficient c^t is slowly decreased through the iterations and should ensure that the updated weights lie in the interval [-1, 1]. The term \Delta A_i^t \Delta A_j^t simply indicates whether the changes in the activation levels of the two concepts move in the same or opposite directions. The weights are updated in every iteration until convergence. Under this learning approach, a weight is updated based only on the changes of the two concepts participating in that causal relationship. However, the effect concept of the specific relationship might also be affected by other concepts, thus accepting aggregated influence from many sources. This serious drawback was pointed out in (Huerga, 2002), which presented the Balanced Differential Algorithm (BDA), an extension of DHL. BDA specifies that each weight update also depends on the changes occurring to the other concepts which affect the particular effect concept at that time instance.
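A single DHL update step (Eq. 5.20 to Eq. 5.22) might look as follows in Python. This is a sketch, assuming the reconstructed learning-coefficient schedule c^t = 0.1[1 - t/(1.1q)]; the function signature and the default q are illustrative.

```python
def dhl_step(W, A_prev, A_curr, t, q=10):
    """One Differential Hebbian Learning update.

    W:      current weight matrix, W[i][j] = weight of edge Ci -> Cj
    A_prev: activation levels at iteration t-1
    A_curr: activation levels at iteration t
    """
    n = len(A_curr)
    c_t = 0.1 * (1 - t / (1.1 * q))                 # decaying coefficient, Eq. 5.22
    dA = [a - b for a, b in zip(A_curr, A_prev)]    # Eq. 5.21
    new_W = [row[:] for row in W]
    for i in range(n):
        if dA[i] == 0:
            continue                                # Eq. 5.20, second branch: no change
        for j in range(n):
            if i != j:
                # Eq. 5.20, first branch: move the weight toward the
                # correlation of the two concepts' changes
                new_W[i][j] = W[i][j] + c_t * (dA[i] * dA[j] - W[i][j])
    return new_W
```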
Therefore the weight connecting concepts Ci and Cj is updated as the change in their activation levels normalized by the changes (in the same direction) occurring in all concepts which affect Cj at the same iteration. Compared to DHL this approach presented better results (Huerga, 2002), but it was only applied to binary FCM models (using a bivalent activation function). Another FCM optimization method which utilizes a modified version of the Hebbian law is Nonlinear Hebbian Learning (NHL) (Papageorgiou et al., 2003; 2005). Four requirements must be fulfilled for this method to be applied: the initial weight matrix (which the algorithm optimizes) should be defined by the experts; the concepts should be divided into two categories, input and output; the intervals in which each concept's state value is supposed to lie should be specified; and so should the desired sign of each weight. Obviously, only the enabled initial weights (W \neq 0) are updated while the algorithm runs; the rest remain disabled. The algorithm updates the weights iteratively until a termination criterion is satisfied, which can be either reaching the maximum number of iterations or the minimization of the difference in the output concepts' state values (after presenting a steady state) between two sequential iterations. Note that the concepts characterized as output are more important for the system than the others (according to the experts) and hence they deserve more attention on behalf of the designer. Another unsupervised learning algorithm proposed for FCM is the Adaptive Random FCM (ARFCM) (Aguilar, 2001; 2002), which is based on the same learning principle as Random Neural Networks (Gelenbe, 1989). RNNs generally use the probabilities that a signal travelling from one neuron to another is positive or negative in order to calculate whether the neuron which accepts the signals will "fire" or not. To fire, it needs the aggregated signals it accepts to be positive.
Based on this idea, ARFCM divides every FCM weight Wij into two components, a positive component W+ij and a negative component W-ij. Whenever the weight between two concepts is positive, W+ij > 0 and W-ij = 0, while if the weight is negative, W+ij = 0 and W-ij > 0. Otherwise, W+ij = W-ij = 0. Each effect concept is characterized by a probability of activation qj, which is calculated based on the positive and negative influence it accepts from the cause concepts and their activation probabilities. At each learning step the weights are updated as:

W_{ij}^{t} = W_{ij}^{t-1} + \eta \left( \Delta q_i^t \Delta q_j^t \right)    Eq. 5.23

The algorithm terminates whenever the system converges to the activation levels of the concepts as defined by the experts.

5.2.6.2 - Supervised Learning

Data-Driven NHL (DDNHL) extends the basic NHL algorithm by allowing weight adaptation on historical data (Stach, Kurgan, Pedrycz, 2008). Historical data include the initial state values of the concepts and their corresponding "caused" steady state values. The algorithm updates the weights, allows the FCM to run with the new weights, drawing the initial activation levels from the database, and then compares the system's final output state values with the correct ones.

5.2.6.3 - Population-based Algorithms

There is a plethora of population-based algorithms proposed for optimizing FCMs, especially in the last ten years. All of them but one share the same learning goal of finding the optimum weight matrix which best represents the causal paths of the modelled system. Additionally, the representation structure of the individuals (either chromosomes or particles) is approximately identical, with very small variations (e.g. some do not include the zero weights in the representation). The basic steps of encoding, decoding and evaluating individuals which represent FCMs are presented in Figure 5.9. The first population-based algorithm used for FCM optimization was Evolution Strategies (Koulouriotis et al., 2001).
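The sign-splitting of the weights and the update of Eq. 5.23 can be sketched as follows. The names are illustrative, and the computation of the activation probabilities q from the positive/negative influences is omitted here, since the source does not give its explicit form.

```python
def split_weight(w):
    """Split an FCM weight into its positive and negative components,
    as ARFCM does: at most one of (w_plus, w_minus) is nonzero."""
    if w > 0:
        return w, 0.0
    if w < 0:
        return 0.0, -w
    return 0.0, 0.0

def arfcm_step(w, dq_i, dq_j, eta=0.1):
    # Eq. 5.23: correlate the changes in the activation probabilities
    return w + eta * (dq_i * dq_j)
```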
Each chromosome suggests a different connectivity of the modelled FCM system. The algorithm requires that the domain experts of the system have defined the output concepts, whose evolution in the system under any scenario is of main interest. Additionally, the algorithm uses historical paired input/output data points. Input data are the initial states of the concepts and output data are the final steady state concept values. The fitness function of the algorithm simply calculates the distance between each output concept's real steady state value and the desired one for every pair of data points. The algorithm terminates if the maximum number of generations is reached or if the fitness of the fittest individual is less than a very small positive number. The Genetically Evolved FCM comprises an alternative evolutionary approach for multi-objective decision making, in terms of satisfying more than one desired concept value at the same time (Mateou, Andreou, 2008). In the context of a specific desired scenario, the concepts of main interest are characterized as output and their desired activation levels are defined. Then a population of randomly initialized chromosomes is evolved in order to find the weight matrix which generates, through FCM dynamic analysis, the desired output values. The distance between the generated activation levels of the output concepts and the corresponding desired values is calculated and aggregated. The fitness value of a chromosome is then inversely proportional to the sum of these distances. The authors also state that by using this algorithm they manage to handle the limit cycle phenomenon. The FCM suggested by a certain chromosome, generated by crossover or mutation, runs, and if the same activation levels (among the output concepts) keep repeating over a number of iterations, it is categorized as a limit cycle case.
After the identification of a limit cycle case, the weights of the specific FCM are examined one by one until the detection of the one(s) responsible for the limit cycle phenomenon. As soon as they are identified, the weight(s) are iteratively changed at random until the FCM converges to steady state behaviour.

Figure 5.9: General FCM solution representation for individuals participating in population-based algorithms

Another genetic algorithm, named Real Coded GA (RCGA), was proposed which also utilizes historical data to achieve optimized weight adaptation (Stach et al., 2005; 2007). The dataset is built as follows: for each system example, the initial values and the final (steady state) values of the concepts are given, as well as all the intermediate state values per iteration (of an FCM model). The form of a dataset describing the desired evolution of the concept values through FCM inference is shown in Figure 5.10 (left part). In order to evaluate each chromosome, the dataset is divided into pairs of cause and effect concept values. Therefore, if \hat{C} denotes the set with the correct activation levels of a system per iteration, the new dataset will be \{\hat{C}^t, \hat{C}^{t+1}\} = \{initial concept values, final concept values\}, so that \hat{C}^t \rightarrow \hat{C}^{t+1}. Hence, for every candidate FCM (represented by a chromosome), a set of initial values \hat{C}^t will be presented and the resulting concept values will be compared with the correct final concept values (\hat{C}^{t+1}). In other words, if the dataset includes the concept values for K iterations, the K-1 initial values will be presented to the candidate FCM and K-1 comparisons between the generated final values and the correct ones will occur. The fitness of a chromosome is then evaluated by aggregating the K-1 distances between the candidate model's steady state values and the corresponding desired ones.
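The chromosome evaluation described above (K-1 one-step comparisons against the recorded state sequence) can be sketched as follows. Here `candidate_step` stands for one iteration of the candidate FCM and is an assumed interface, not the authors' code; the absolute-difference distance is an illustrative choice.

```python
def rcga_fitness(candidate_step, states):
    """Aggregate the K-1 distances between a candidate FCM's one-step
    predictions and the recorded concept values.

    candidate_step: function mapping a concept-value vector C^t to the
                    candidate model's next vector (one FCM iteration)
    states:         list of K correct concept-value vectors

    Returns the aggregated error; smaller means fitter.
    """
    error = 0.0
    for c_t, c_next in zip(states, states[1:]):
        predicted = candidate_step(c_t)
        error += sum(abs(p - c) for p, c in zip(predicted, c_next))
    return error
```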
Figure 5.10: The activation levels per iteration, of a system consisting of seven nodes, are presented in the first part of the figure. The RCGA is then called to optimize a random weight matrix.

The optimum FCM, suggested by the fittest chromosome, "runs" and, in every iteration, the concept state values should be approximately identical to the corresponding concept values of the historical data. The algorithm terminates if the fitness of the best chromosome reaches a specific threshold or if the maximum number of generations is reached. Besides evolutionary algorithms, Particle Swarm Optimization (PSO) has also been proposed for optimizing FCMs (Parsopoulos et al., 2003; 2005; Papageorgiou et al., 2005; Petalas et al., 2005; 2009). PSO is also a stochastic heuristic algorithm, and it uses a swarm (population) of particles, each of which represents a potential solution to the problem being optimized (in this case, the weight matrix of an FCM model). The particles move in the problem space with a certain velocity, trying to find the optimum position. The position of each particle suggests a solution to the modelled problem. The idea is that the particle with the best position tends to attract other particles close to it, therefore forming a bigger neighbourhood of particles around the best position (best solution). The velocity of each particle depends on various parameters (e.g. the best position ever visited by any particle up to a specific time point and the best position that the particle itself has visited). Therefore each particle must "remember" its best position and each neighbourhood must keep track of the best position among the particles comprising it. The authors suggest that the weights should be bounded in intervals defined by experts. Therefore, for each weight between two concepts, the experts define a minimum and a maximum limit. The particles are then let to search within these domains.
Once more, the experts must also define the output concepts, which are more significant than the others. In every iteration of the algorithm, the velocity of each particle is updated based on its present position, its own best scored position, and the best particle's position from its neighbourhood. Having calculated the velocity, the new position of the particle can be found (and therefore the particle moves to a new solution of the problem). The position of each particle is evaluated by a fitness function which essentially examines whether the final activation levels of the candidate FCM (suggested by the particle) fall into the bounding regions of each concept, as predefined by the experts. The termination criterion is the minimization of the fitness function after a certain number of iterations. A GA for FCMs was introduced by (Khan and Chong, 2003) which diverges from the other algorithms of the same family in its learning goal. This is the only case of an FCM optimization/learning algorithm where the learning objective is not weight matrix adaptation but the detection of the initial activation levels which caused a certain effect on the concepts. Therefore, given the final steady state values of the concepts of a system, the algorithm applies GA operators to find the initial conditions A0 which caused the specific effects.

5.2.7 - Time in FCM

The FCM comprises a dynamic modelling tool which can capture future states of a system under certain circumstances. Therefore FCMs exhibit a temporal aspect due to the presence of feedback in their dynamic operation. However, an FCM model "runs" as a discrete simulation. After presenting an initial set of activation levels, the model is free to iterate through several concept states until it converges to the final steady states, which represent the behavioural response of the system to the initial conditions.
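A single PSO move for a particle encoding a flattened FCM weight matrix might be sketched as follows. The inertia and attraction coefficients are common textbook defaults, not values from the cited papers, and a single common bound of [-1, 1] stands in for the per-weight intervals the experts would define.

```python
import random

def pso_step(position, velocity, personal_best, neighbourhood_best,
             w=0.7, c1=1.5, c2=1.5, bounds=(-1.0, 1.0)):
    """One PSO move of a particle encoding flattened FCM weights.

    Standard velocity rule: inertia, plus random attraction toward the
    particle's own best position and toward the best position in its
    neighbourhood; the new weights are clamped to the allowed interval.
    """
    lo, hi = bounds
    new_v, new_x = [], []
    for x, v, pb, nb in zip(position, velocity, personal_best, neighbourhood_best):
        v_next = (w * v
                  + c1 * random.random() * (pb - x)
                  + c2 * random.random() * (nb - x))
        x_next = min(hi, max(lo, x + v_next))   # keep the weight inside its interval
        new_v.append(v_next)
        new_x.append(x_next)
    return new_x, new_v
```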
The set of the activation levels at each iteration represents a snapshot of an intermediate system state between the initial conditions and the final converged state. Unfortunately, the FCM model is unable to define the time period in which all these changes will occur. As a result, the model does not give any information about how long it will take the system to present the final steady state behaviour given that the specific initial conditions apply. The model is only in a position to answer "WHAT-IF" questions, while it would be of the greatest importance if it could also answer "WHAT-and-WHEN-IF" for scenario building and decision making in several domains (e.g. political/economic sciences). Although introducing time dependencies into the FCM model is a very challenging extension to the classic FCM, there is no equivalent interest in the research area; only a few researchers have been involved in this field. The first approaches to this subject proposed that modelling time for FCM should be done in discrete space, stating that introducing time in continuous space is a very difficult task (Hagiwara, 1992; Park, Kim, 1995). The proposed models require experts to define a discrete time unit (e.g. one month, one year, and so on) with which they must afterwards describe every causal relationship for every pair of concepts. If a relationship is assigned more than one time unit (e.g. 6 months), then dummy nodes are added in between the corresponding pair of concepts, and the weight of each newly added edge is \sqrt[m]{W_{ij}}, where m is the number of the causal links between the two original concepts and W_{ij} is their former weight (before adding the dummy nodes in between). They assume that initially a concept Ci is stimulated. At the next iteration, the indirect and total effects on its neighbours (effect concepts) Cj are calculated and they become stimulated.
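The dummy-node construction keeps the end-to-end influence unchanged, because the m link weights multiply back to the original weight. A tiny sketch; it uses the weight magnitude, since a fractional power of a negative number is not real, and handling the relation's sign separately is my simplifying assumption.

```python
def dummy_chain_weight(w, m):
    """Weight carried by each of the m links of a dummy-node chain
    replacing a relation that spans m time units: the m-th root of the
    original weight's magnitude, so the product over the chain is |w|."""
    return abs(w) ** (1.0 / m)
```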
Therefore, at each iteration n, the indirect and total effects are calculated for the concepts which are distanced by n edges from the very first stimulated concept (following the causality paths). Finally, the total effects are given for each concept and therefore one can infer the type of causality on a certain concept. The process is quite complicated and the authors do not clarify how the final states of the concepts could be obtained using this kind of dynamic analysis. The next attempt at representing time relations in FCMs was the Rule Based FCM (RBFCM) (Carvalho, 2001). According to the authors, the RBFCM is more flexible in encoding time relations between the concepts than classic FCMs, by using fuzzy rules to describe each causal relationship. Once more, the experts must define the time unit which will be represented by an iteration (e.g. one iteration could be one month, ten days, etc.), which means the RBFCM models the time dimension in discrete space as well. Then the experts use fuzzy rules to describe the causal relationships amongst the concepts. A fuzzy rule should define the type of the effect on a concept (positive or negative) and its strength, which will appear in the time period defined by the predefined time unit. During the FCM inference process, an iteration represents one time unit. Therefore, during inference, partial effects (as defined in the fuzzy rules) will be applied on the concepts, and these will be aggregated into the total effect as soon as the model converges. Experts should keep in mind the time unit represented by an iteration when defining the strength of the effect in a causal relation. For example, consider two concepts, C1 and C2, describing the greenhouse effect and the ice melting percentage respectively. If the defined time unit is one day, then the strength of the effect on C2 (caused by C1) should be much smaller than when the time unit is 10 years! The author refers to the time period represented by an iteration as Base-Time (B-Time).
All the causal relations of the model are defined based on the B-Time. The B-Time variable is decided by the designer and the experts of the system, depending on the type of the system being modelled and the potential scenarios they would like to examine on it. If the designer wishes to examine the system's behaviour over a long-term period, then a big B-Time (e.g. 1 year) should be defined, while if they would like a short-term reaction of the system to a change, then a small B-Time (e.g. 1 day) is suggested. Additionally, the authors state that if the system exhibits chaotic behaviour, then the B-Time should be decreased, the experts should reconsider the fuzzy rules, and the model should be allowed to "rerun". The proposed model also tries to deal with the time delay often observed in cause-effect systems. Many times, although a change might happen to a cause concept, the effect concept reacts after a certain time period (not immediately). This phenomenon is called dead-time; it begins at the exact moment a change happens to a cause concept and ends at the moment the state of the corresponding effect concept changes (from its previous state value). A very well known real life example of this phenomenon is the time delay observed in changing the oil price at gas stations after the price per barrel changes. To encode this phenomenon, the author suggests that a number of iterations should pass before the effect concept changes its state value for the first time. The number of the "delay" iterations is equal to the time delay (given as a natural number) divided by the B-Time (time delay / B-Time). The experts are also responsible for defining the time delay (if any) for every causal relationship. Temporal FCMs (tFCM) also address time dependencies in FCMs (Zhong et al., 2008). Following this model, each causal relationship is described by two different linguistic variables.
The first one is the time phases characterizing the particular relationship and the other one is the type of the effect. The time phases of a causal relationship, S, essentially describe the time length of an effect's duration, and they are given by linguistic values. For example, a set of linguistic values could be S^n \in {short term, medium term, long term}. Each one of them comprises a fuzzy set and therefore its membership function \mu_{ij}^{n}(t) must be defined (Figure 5.11). The type of the effect is defined as a number \gamma_{ij} \in [-1, 1], which in turn comprises a singleton membership function with membership degree one. Then, fuzzy rules are defined by the experts for every causal relation between two concepts Ci and Cj as:

\{ R^n \mid R^n : S^n \rightarrow \gamma_{ij}^{n} \}    Eq. 5.24

where S^n denotes the nth time set (defined for this relation) and \gamma_{ij}^{n} is a singleton number defining the strength of the effect assigned to the nth time phase. Hence, considering the example of Figure 5.11, three singletons should be assigned, one to every time phase: \gamma_{ij}^{ST}, \gamma_{ij}^{MT} and \gamma_{ij}^{LT}.

Figure 5.11: Three membership functions (\mu_{ij}^{ST}(t), \mu_{ij}^{MT}(t), \mu_{ij}^{LT}(t)) describing the three time phases of a causal relationship. The time unit in this model is one month.

Each iteration is considered to count as one time unit (in Figure 5.11, one month). Therefore, at each iteration t, the causal relation belongs to a certain degree to one or more time phases (given by \mu_{ij}^{n}(t)), and so modus ponens is applied to move from the firing premises (of the fuzzy rules) to the effect degree (the conclusions of the rules). Then, since a concept might accept effects from many other concepts, fuzzy logical operators (AND or OR) are used to aggregate the partial effects. Finally, the Centre of Gravity defuzzification method is used to calculate the total effect h_{ij}(t) which the effect concept Cj accepts from the cause concept Ci at that certain iteration t:

h_{ij}(t) = \frac{\sum_n \gamma_{ij}^{n} \mu_{ij}^{n}(t)}{\sum_n \gamma_{ij}^{n}}    Eq. 5.25

(At this point I would like to note that there could be a typo in Equation 5.25 as written in the paper. Since the Centre of Gravity method is used, the form of Equation 5.25 should be:

h_{ij}(t) = \frac{\sum_n \gamma_{ij}^{n} \mu_{ij}^{n}(t)}{\sum_n \mu_{ij}^{n}(t)} .)

5.2.8 - Modified FCM

There are also some proposed FCM models whose structure, function or goal diverges from the canonical FCMs. For example, there has lately been a trend of using FCMs as classifying machines. Two modified FCMs have been proposed under the umbrella of this goal. The first FCM classifying model, named FCMper, consists of two layers, an input and an output layer (Papakostas et al., 2008). The input concepts represent the set of features affecting the classified object, which is represented by the nodes of the output layer. The input concepts are not allowed to affect each other, while the output concepts (which represent the potential classes/labels of the dataset) are connected to each other with negative weights, allowing competition between the classes. Two hybrid FCMpers were also proposed, in cooperation with trained ANN classifiers (e.g. an MLP). The first hybrid FCMper comprises a trained ANN and a typical FCMper. The features of the problem are presented to a trained neural network. The state values of the neural network's output nodes are then inserted into the FCMper as input nodes. Each output value of the neural network can be regarded as a degree of belief for a specific class (of the problem). The second hybrid FCMper works just like the previous one, except that the features of the problem are also inserted into the FCMper (not only the degrees of belief for each class). All FCMper structures (typical and hybridized) are presented in Figure 5.12. The efficiency of all three models, along with a classic neural network (the specifications of the NN are not mentioned), was tested (Papakostas, Koulouriotis, 2010) on the Two Spiral problem.
The hybrid FCMper 3 achieved better results than all the other FCMpers and the neural network, but still not to a satisfying degree.

Figure 5.12: (a) Typical FCMper, (b) Hybrid FCMper 1, (c) Hybrid FCMper 2

The second model totally adopts the structure of the feed-forward network, as described in Chapter 2, and employs the backpropagation learning algorithm as well; it is called the Fuzzy Rules Incorporated FCM (FRI-FCM) (Song et al., 2010; 2011). Actually, this model diverges to a large degree from the basic structure and inference process of FCM models; it can be thought of more as a fuzzified MLP than a modified FCM. This model assumes that there are certain fuzzy rules which describe the specific classification problem, in the form:

If feature_1 = A and feature_2 = B then the causality with y_class = C

where A, B and C are linguistic values which must be defined for each feature. The network comprises five layers. The input layer nodes are the features of the problem. The second layer is called the antecedent layer. Its nodes represent the linguistic values describing the input features; therefore each node symbolizes a fuzzy set. They accept the crisp values of the input layer nodes and output a fuzzified value of them through a Gaussian function, which is the membership function characterizing the corresponding fuzzy set. The next layer is the Rule Layer, where each node represents a rule which accepts the truth degree of each antecedent member and returns the degree to which the whole antecedent part of the rule is true (using a Gaussian function). The Consequent layer nodes are the linguistic values of the output variables (classes), represented again by Gaussian functions. The output of each Consequent layer concept is the defuzzified crisp value of the fuzzy set as formed by the firing of the rules. Finally, the output nodes (of the last layer) aggregate the weighted defuzzified crisp values which describe their states.
The network is trained over several examples describing the problem being modelled, using the backpropagation method. The model was tested on the IRIS data and its success rate was a bit lower than that of the FCMpers (hybrid version 3).

Figure 5.13: Basic structure of the FRI-FCM

The next model incorporates the experts' hesitations about their own weight estimations into the FCM inference process. Besides the influence weight of each causal relationship, the expert must also define a hesitation degree, defining in that way how "unsure" he feels about the specific estimation (Iakovidis et al., 2011). Therefore the FCM model remains the same except for the activation function, which is altered as:

A_i = f\left( A_i + \sum_{j \neq i} \left( w_{ji}^{influence} A_j - w_{ji}^{hesitation} A_j \right) \right)    Eq. 5.26

5.3 - Issues about FCMs

Fuzzy Cognitive Maps were proposed as a model for incorporating, representing and adapting human knowledge and experience, in order to uncover hidden causal relationships and influence paths between the concepts of a system and therefore be able to examine the system's dynamics under certain circumstances. An FCM represents a dynamic system that evolves over time. Time in an FCM is modelled by discrete iterations. In every iteration, the state value of every concept is updated according to the aggregated effects it accepts from other concepts. The concepts are connected with directed links, and so virtual causal paths are created, transferring weighted effects from one concept to another even if they are not directly connected, and allowing feedback to occur. After a number of iterations the system will converge, meaning the states of the concepts will stabilize to certain values which remain unchanged through time. The biggest advantage of FCMs is the simplicity of their operation, combined with the great benefit of representing, utilizing and inferring from human knowledge and experience in systems whose relations are complex and complicated.
Furthermore, they offer the potential to model the evolution of a real system's behaviour. Given an FCM system, the user is able to observe, examine and test different scenarios by changing and "locking" specific concept values and allowing the network to "run". The steady states of the concepts can then be analyzed in order to reach certain decisions related to the system and the initial conditions presented to it. Moreover, the weights between the concepts carry essential meaning: they can be linguistically defined and explained, so the causality flow can be more easily understood and used to strengthen the final conclusions. These advantages make the FCM an appropriate tool for analyzing systems characterized by cause-effect relationships. Such systems consist of a large number of concepts and relations. Experts might know both the concepts and the relationships among them, but due to the size of the system they are unable to use their own knowledge and experience to "predict" or infer the potential resulting system behaviour after one or more changes in the system's parameters. Instead, an FCM can be used to exploit the experts' cumulative knowledge and experience by providing a dynamic representation of the system, which in turn can be used to investigate, examine and test potential future behaviours of the system states by "playing" with, or tuning, the initial conditions. Consequently, the FCM can be utilized as a decision-support tool and a consultant for system designers. Such a tool is very useful in domains where a change in one system parameter might cause an unwanted and harmful effect on other parameters or on the overall system behaviour. Changing a system variable often involves the risk of a negative impact on the system: a change to a system means stepping from known conditions into the unknown.
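The "lock and run" style of scenario testing described above can be sketched as follows. This is an illustrative fragment, not an implementation from the literature: the additive-sigmoid update rule, the function names and the convergence tolerance are all assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def run_scenario(states, w, locked, max_iter=100, tol=1e-5):
    """Iterate an FCM until the concept states stabilize.
    Concepts listed in `locked` are clamped to their initial values,
    mimicking a user fixing a parameter while the rest of the
    system is allowed to 'run'."""
    states = list(states)
    for _ in range(max_iter):
        new = []
        for i in range(len(states)):
            if i in locked:
                new.append(states[i])  # locked concept keeps its value
                continue
            total = states[i] + sum(w[j][i] * states[j]
                                    for j in range(len(states)) if j != i)
            new.append(sigmoid(total))
        if max(abs(a - b) for a, b in zip(new, states)) < tol:
            return new
        states = new
    return states
```

The returned steady states are exactly what the user inspects when comparing scenarios with different locked values.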
Therefore, in order to take the decision to apply a change, the possible resulting "unknown" conditions must be evaluated by weighing the potential positive or negative effects of the change on the system. Unfortunately, neither an "undo" button for real-life actions has been invented (yet!) nor is travelling back in time to fix the initial "move" which caused the unwanted result possible (yet!). The FCM constitutes a software modelling tool which can be used in that direction, helping to avoid driving the parameters of the system to undesirable, irreversible states. In other words, FCMs provide a "safety pillow" for decision making, since the users may simulate the system according to their scenario and examine how it responds to changes in the parameters through dynamic analysis of the FCM system. However, as mentioned in a previous section of this chapter, FCMs in their canonical form are deficient in modelling real time. In the dynamic analysis of an FCM model there is no indication of the time period over which the whole process occurs, how much time an effect will take to appear, or how much time the total system will need to stabilize after one or more changes to its parameters. In real systems (social, political, economic, etc.), when the state of a parameter is changed, all the parameters affected by it react in different ways and at different times, and each effect takes a different period of time to appear in full. Introducing the time dimension into FCMs is therefore of vital importance. The time needed for the total effect to be expressed in a concept (system parameter) after a change to another concept can be analyzed into two elements (time parameters):

1. Dead Time (DT)
2. Reaction Time (RT)

DT is defined as the time period needed by a concept of the system to begin changing its state after the change in the state of another parameter.
This time period is initiated by the change of the cause concept's state and is terminated as soon as the effect concept's state is altered. For example, consider the concept of oil exploration by a country (C1), the concept describing the standard of living of the people in that country (C2), and the causal relation C1 → C2. Suppose C1 is zero, meaning the country is not active in oil exploration, and one day the country begins exploring its oil reserves; C1 then changes to one. Obviously the standard of living of the inhabitants will also change, but not immediately after the change in the cause concept: for a certain period after the country's first attempt to explore its oil reserves, the standard of living will remain the same. Additionally, the change in the standard of living will not appear in full at once, but rather gradually through time. This is captured by the reaction time, which describes the time needed for a parameter to exhibit the total effect derived from a change to another concept-parameter; one concept may change its value more slowly or quickly than another. Reaction time follows dead time: it begins when the concept changes its value for the first time and ends when the concept stabilizes again at a certain value. Both DT and RT must therefore be incorporated into the FCM model to represent more realistically the evolution of a system through time. As described in a previous section of this chapter, there have been some attempts at modelling time in FCMs. The first three proposed methods assume that the partial effect has equal strength in every iteration (which counts as a time unit) and act in a discrete time space; a discrete time space is also used by (Zhong et al, 2008) to model the strength of an effect in different predefined time periods, ignoring dead time. The latter is a more realistic approach, but it is still inflexible in adequately representing the change of a concept's state through continuous time for different kinds of systems, and it fails to incorporate dead time.
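One very simple way to make these two time parameters concrete is sketched below. The linear build-up over the reaction time is purely an illustrative assumption (a real concept may respond along any curve), and the function name is hypothetical:

```python
def delayed_effect(weight, t_since_change, dead_time, reaction_time):
    """Fraction of a causal effect expressed t time units after the
    cause concept changed: nothing happens during the dead time (DT),
    then the effect builds up over the reaction time (RT).
    (Illustrative model -- the linear ramp is an assumption.)"""
    if t_since_change <= dead_time:
        return 0.0
    elapsed = t_since_change - dead_time
    fraction = min(elapsed / reaction_time, 1.0)
    return weight * fraction

# Oil-exploration example: effect of C1 on C2 with DT = 2, RT = 5
# (in whatever time unit the modeller chooses).
```

With such a per-edge model, the effect transferred along C1 → C2 is zero while t ≤ DT, grows during RT, and only afterwards equals the full causal weight.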
Additionally, none of these methods has been tested on real systems; instead, each is applied to a single virtual system (a different one in each case) with few concepts, and only one scenario is tested. Another aspect of FCMs is weight adaptation, in terms of applying learning or optimization. In most FCM applications, experts are used to identify and describe the concepts constituting a system and their interconnections. However, there is a view that causal relationships should not be defined by humans because, although they might have expertise in the domain, they are still subjective and can make mistakes (under tiredness, pressure, etc.), and that automatic methods should therefore be used to develop FCM models (Stach, Kurgan, Pedrycz, 2005). Yet the FCM model was first proposed, and has been widely used ever since, precisely because it can represent and manipulate human knowledge and experience. Although there has been fast and impressive development of machine learning algorithms introducing some features of learning intelligence to machines, humans can still learn, remember and accumulate knowledge, and, based on that, they can also infer much better. There are cases where some humans are considered very important, one might dare say almost indispensable, due to their extended devotion to researching or working in a certain area; their knowledge contribution to a system's domain is very strong and vital. The FCM model makes it possible to represent information coming from experts in a systemic way, creating a knowledge and experience "backup" which can be used either by the experts themselves, to reveal seemingly unknown causal paths and relations among the concepts of the system, or by other people who would like to take advantage of external expert knowledge in inferring about the particular system. The need to extract information about the system from the experts appropriately is therefore crucial.
The experts must define the set of concepts (out of many) which describe the system properly, together with all their causal interconnections. Some methodologies have been suggested, as described in a previous section, but further work must be done on using aspects of psychology (in building questionnaires or properly formulating the questions during an interview). Learning and optimization algorithms might also be used to enhance FCM modelling as methodologies supplementary to expert-based knowledge. Furthermore, there are cases where no experts are available; in such cases the FCM can be built by adapting its weights with a data-driven machine learning algorithm, as described above. However, only two of the machine learning algorithms proposed for training FCMs have so far been applied to real systems: the NHL, applied in (Papageorgiou et al, 2008; Kannappan et al, 2011), and the RCGA, applied in (Stach et al, 2005), each of which optimizes the weight matrix of an FCM in a different manner. NHL applies a variant of the Hebbian rule, while RCGA is a genetic algorithm for FCMs whose fitness function is defined over historical data describing the expected state of every concept in every iteration. General Hebbian rule-based learning methods have been compared to RCGA in (Stach et al, 2010), where the experimental results show that RCGA achieves better modelling of the FCM weight matrix than the others. However, RCGA assumes the existence of historical data to work properly; moreover, for each example, the activation value of each concept at each iteration must be known, and creating this kind of dataset for complex cause-effect systems is, most of the time, not feasible. Further research must therefore be done in this area in order to provide efficient and practical learning algorithms for FCMs. Further study must also be done in the FCM research field on identifying and handling different kinds of synergies between the concepts.
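To give a flavour of Hebbian-style weight adaptation, the fragment below sketches a simplified co-activation update for the non-zero weights of an FCM. It captures only the general idea: the exact NHL rule of Papageorgiou et al. contains additional terms, and the decay factor and learning rate here are arbitrary assumptions.

```python
def hebbian_like_update(w, states, eta=0.05, gamma=0.98):
    """One simplified Hebbian-style weight update for an FCM sketch:
    each existing causal weight w[j][i] is slightly decayed and then
    reinforced in proportion to the co-activation of the connected
    concepts. (A sketch of the NHL idea, not the published rule.)"""
    n = len(states)
    for j in range(n):
        for i in range(n):
            if i != j and w[j][i] != 0.0:  # only adapt existing links
                w[j][i] = (gamma * w[j][i]
                           + eta * states[j] * (states[i] - w[j][i] * states[j]))
    return w
```

Because zero entries are never touched, the expert-defined structure of the map is preserved and only the strengths of the declared causal links drift with the observed activations.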
The strength of an effect on a specific concept C1 might change if, at the same time, another concept C2 receives a certain type of effect (so the effect could be enhanced or weakened). FCMs would therefore be able to offer more realistic modelling of various systems given methods for defining and manipulating synergies among concepts during the dynamic analysis of an FCM. It would be even more interesting to investigate how these synergies vary in the context of an FCM with real-time causal relations. Finally, more attention must be paid to defining methods for detecting, analyzing and solving structural and functional faults of an FCM model which can lead the modelled system to unstable behaviours through inference (limit cycles and chaotic attractors).

Chapter 6 - Discussion

In this study the main focus areas comprising Computational Intelligence (Artificial Neural Networks, Fuzzy Logic, and Evolutionary Computation) have been presented, along with some of their features, classic paradigms and algorithms. It has been described how CI models encompass knowledge extraction, learning, adaptation, optimization and generalization based on knowledge. All these attributes are present in nature, in individuals as well as in populations of individuals; that is why the development of CI methods is mainly driven by natural processes and behaviours. The success of a novel technology is measured by the range and the number of the applications which employ it, and CI techniques find application in many areas: medical, biomedical and bioinformatics problems, decision support, optimization, stock market and other time-series prediction, generic data clustering and classification, engineering, economics and finance, and project scheduling, to name only some. Another great advantage of CI methods is their ability to be easily hybridized with other methods, drawn either from the CI area or from more conventional areas of mathematics.
Therefore many applications employ hybrid CI models combining different CI areas (e.g. ANNs with GAs, or fuzzy logic with ANNs). Overall, CI techniques have turned out to be very popular, since their stochastic and emergent properties allow them to find approximate or exact solutions to problems whose solution landscape is not well understood, or is very complex and difficult to deal with using classical mathematical methods. As life becomes more complex and demanding at both the private and the professional level, it becomes more crucial to solve the computational issues which already exist or arise in humans' everyday lives. CI research studies heuristic approaches to solving such problems, which cannot easily be solved by other algorithms. Hence, the CI area evolves along with the evolution of human problems and questions, creating room for further research on providing more efficient, stable and theoretically established algorithms in the field. Two of the topics presented in this study attract particular interest in their applications: GAs are one of the most popular Evolutionary Computation techniques, while Fuzzy Cognitive Maps gain growing attention due to their practicality and broad domain applicability. Some of the application areas of GAs are chemistry, telecommunications, economics, engineering design, manufacturing systems, medicine, physics, scheduling, bioinformatics and other related disciplines. GAs have become very popular for approximating solutions in problem domains whose fitness landscapes are complex, discontinuous and noisy, with many local optima, under specific constraints. The fact that they evolve potential solutions of a problem in parallel, saving time and avoiding any bother with derivatives, makes them practical and easy to develop and adjust, with very small effort, to different problems.
GAs can explore a problem's solution space in more than one direction at the same time, through a population of chromosomes which behave as artificial agents searching all over the space for the optimum combination of the problem's parameters to give the "best" fitted solution. Additionally, each chromosome in the context of GAs can be regarded as a sample from a larger set of subspaces. For example, the chromosome 100010 can be considered a sample from the space 1**0** (where * is a wild card character, either one or zero), the space **001*, the space 1*0*1*, and so on. Therefore, through the generations, a GA manages to evaluate the average fitness of a larger set of solution spaces than the actual number of chromosomes. This advantage was first stated by Holland, who introduced GAs, and is called the "Schema Theorem" (Holland, 1992; Mitchell, 1996). Combining the Schema Theorem with their ability to function in parallel, GAs are capable of effectively and efficiently searching for solutions in large, vast problem spaces in a relatively reasonable amount of time. Genetics is a rapidly evolving area, creating great potential for evolving GAs by incorporating new biologically inspired extensions (e.g. new genetic operators, employing different types of sexes within chromosomes, etc.). Any improvement in the GA area towards better optimization is very important, given that GAs have already been employed by so many applications in a large set of domains. Furthermore, policies for selecting the fitness function, population size, and mutation and crossover probabilities must be established, so as to balance exploration against exploitation and avoid premature convergence of the GA. Fuzzy Cognitive Maps are also positioned in the CI area, since they combine attributes of ANN structure and information processing with Fuzzy Logic knowledge representation.
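The schema-sampling argument above is easy to verify mechanically. The helper below is an illustrative sketch that checks whether a bit-string chromosome is an instance of a schema:

```python
def matches_schema(chromosome, schema):
    """True if the chromosome is an instance of the schema,
    where '*' is a wild card matching either bit."""
    return all(s == '*' or s == c for c, s in zip(chromosome, schema))

# The single chromosome 100010 simultaneously samples several schemata.
samples = [s for s in ("1**0**", "**001*", "1*0*1*")
           if matches_schema("100010", s)]
```

Evaluating the fitness of 100010 thus contributes one fitness sample to every one of these schemata at once, which is the implicit parallelism the Schema Theorem formalizes.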
By using an FCM model, one may perform a qualitative simulation of the system and experiment with the model by running different scenarios, changing one or more parameters of the system. Successful analysis of complex systems characterized by causality in the relationships between the system's parameters depends highly on correct knowledge representation. FCMs allow the users to reveal influence paths hiding behind the already known causal relationships of the concepts and to take advantage of the causal dynamics of the system in predicting potential effects under specific initial circumstances. The fact that the concepts and their interconnections can be represented and explained in linguistic terms helps the users tune, analyze and better understand the causal paths, as well as the results given by an FCM inference simulation. That is why FCMs comprise an attractive system modelling methodology: they can appropriately utilize human knowledge, expressed in human language, through fuzzified concepts and weights. The modelling and simulation power of FCMs is reflected in the plethora of application fields, such as social and political systems, distributed group decision support, control systems, medical diagnostic systems, ecology, agriculture, engineering, etc. Essentially, FCMs are able to represent many different domain variables (causal events, objectives, trends, behaviours, situations, etc.) in one model, unwrap their cause-and-effect relations, and show how each of them is affected by all the others. Further research should improve FCMs in helping humans make wiser and more accurate decisions, by presenting them with the evolution of the modelled system after a set of changes and the final future consequences of their possible choices (on the system's parameters). The research directions in this area were described in Chapter 5.
Some of the open issues concerning FCMs are the introduction of time dependencies into the causal relationships of the model, the development of efficient learning algorithms whenever data are available, FCM structure and stability analysis, and the establishment of expert knowledge extraction methods, or methods of extracting causal weights from an existing fuzzy rule base. In conclusion, CI is an area which gives the researcher the opportunity for a deeper understanding of natural and human developmental, cognitive and evolutionary processes, as well as the excitement of utilizing these features as puzzle pieces in building a new world of technology based on observation, natural language, human reasoning and knowledge extraction. Essentially, CI is an area of incorporating partial features of human and natural intelligence into machines for solving difficult real-life problems.

References

AGUILAR, J., 2005. A survey about fuzzy cognitive maps papers (Invited paper). International Journal of Computational Cognition, 3(2), pp. 27-33.
AGUILAR, J., 2002. Adaptive Random Fuzzy Cognitive Maps, Proceedings of the 8th IberoAmerican Conference on AI: Advances in Artificial Intelligence 2002, Springer-Verlag, pp. 402-410.
AGUILAR, J., 2001. A Fuzzy Cognitive Map Based on the Random Neural Model. 2070, pp. 333-338.
AIZERMAN, A., BRAVERMAN, E.M. and ROZONER, L.I., 1964. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25, pp. 821-837.
ALPAYDIN, E., 2004. Introduction to Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
AXELROD, R.M., 1976. Structure of decision: the cognitive maps of political elites. Princeton University Press.
BÄCK, T. and SCHWEFEL, H., 1993. An overview of evolutionary algorithms for parameter optimization. Evol.Comput., 1(1), pp. 1-23.
BEN WANG, XINGSHE ZHOU, GANG YANG and YALEI YANG, 2010.
Web Services Trustworthiness Evaluation Based on Fuzzy Cognitive Maps, Intelligence Information Processing and Trusted Computing (IPTC), 2010 International Symposium on, 2010, pp. 230-233.
BENTLEY, P.J., 2007. Digital Biology: How Nature Is Transforming Our Technology and Our Lives. Simon & Schuster.
BERTOLINI, M. and BEVILACQUA, M., 2010. Fuzzy Cognitive Maps for Human Reliability Analysis in Production Systems. 252, pp. 381-415.
BEYER, H. and SCHWEFEL, H., 2002. Evolution strategies - A comprehensive introduction. Natural Computing, 1(1), pp. 3-52.
BISHOP, C.M., 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc.
BOUTALIS, Y., KOTTAS, T.L. and CHRISTODOULOU, M., 2009. Adaptive Estimation of Fuzzy Cognitive Maps With Proven Stability and Parameter Convergence. Fuzzy Systems, IEEE Transactions on, 17(4), pp. 874-889.
BUENO, S. and SALMERON, J.L., 2009. Benchmarking main activation functions in fuzzy cognitive maps. Expert Systems with Applications, 36(3, Part 1), pp. 5221-5229.
BURGES, C.J.C., 1998. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, pp. 121-167.
CARVALHO, J.P., 2010. On the semantics and the use of Fuzzy Cognitive Maps in social sciences, Fuzzy Systems (FUZZ), 2010 IEEE International Conference on, 2010, pp. 1-6.
CARVALHO, J.P. and TOME, J.A.B., 2004. Qualitative modelling of an economic system using rule-based fuzzy cognitive maps, Fuzzy Systems, 2004. Proceedings. 2004 IEEE International Conference on, 2004, pp. 659-664 vol.2.
CARVALHO, J.P. and TOMÉ, J.A.B., 2001. Rule based fuzzy cognitive maps - Expressing time in qualitative system dynamics, 2001, pp. 280-283.
CARVALHO, J.P. and TOME, J.A.B., 1999. Rule based fuzzy cognitive maps and fuzzy cognitive maps - a comparative study. Annual Conference of the North American Fuzzy Information Processing Society - NAFIPS, pp. 115-119.
CHERKASSKY, V.
and MULIER, F., 1998. Learning from Data: Concepts, Theory, and Methods (Adaptive and Learning Systems for Signal Processing, Communications and Control Series). Wiley-Interscience.
CHO, S., 1999. Pattern recognition with neural networks combined by genetic algorithm. Fuzzy Sets and Systems, 103(2), pp. 339-347.
CHUNMEI LIN and YUE HE, 2010. The study of inference and inference chain of fuzzy cognitive map, Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on, 2010, pp. 358-362.
CLARK, T., LARSON, J., MORDESON, J., POTTER, J. and WIERMAN, M., 2008. Fuzzy Set Theory. 225, pp. 29-63.
CORDÓN, O., GOMIDE, F., HERRERA, F., HOFFMANN, F. and MAGDALENA, L., 2004. Ten years of genetic fuzzy systems: current framework and new trends. Fuzzy Sets and Systems, 141(1), pp. 5-31.
CORTES, C. and VAPNIK, V., 1995. Support-Vector Networks, Machine Learning 1995, pp. 273-297.
DAVIS, L.D. and MITCHELL, M., 1991. Handbook of Genetic Algorithms. Van Nostrand Reinhold.
DAWKINS, R., 2006. The selfish gene. Oxford University Press.
DE JONG, K., 2002. Evolutionary Computation: A Unified Approach. The MIT Press.
DICKERSON, J.A. and KOSKO, B., 1994. Virtual Worlds as Fuzzy Cognitive Maps. Presence, 3(2), pp. 73-89.
DUBOIS, D. and PRADE, H.M., 1980. Fuzzy sets and systems: theory and applications. Academic Press.
DUDA, R., HART, P. and STORK, D., 2001. Pattern Classification (2nd Edition). Wiley-Interscience.
EIBEN, A.E. and SMITH, J.E., 2003. Introduction to Evolutionary Computing. Springer-Verlag.
FLOREANO, D. and MATTIUSSI, C., 2008. Bio-Inspired Artificial Intelligence: Theories, Methods, and Technologies. The MIT Press.
FOGEL, L.J., OWENS, A.J. and WALSH, M.J., 1966. Artificial Intelligence through Simulated Evolution. New York, USA: John Wiley.
FORBES, N., 2005. Imitation of Life: How Biology Is Inspiring Computing. The MIT Press.
GABRYS, B. and RUTA, D., 2006.
Genetic algorithms in classifier fusion. Applied Soft Computing, 6(4), pp. 337-347.
GEORGOPOULOS, V.C., MALANDRAKI, G.A. and STYLIOS, C.D., 2003. A fuzzy cognitive map approach to differential diagnosis of specific language impairment. Artificial Intelligence in Medicine, 29(3), pp. 261-278.
GEORGOPOULOS, V.C. and STYLIOS, C.D., 2009. Diagnosis support using fuzzy cognitive maps combined with genetic algorithms, 2009, pp. 6226-6229.
GLYKAS, M. and XIROGIANNIS, G., 2005. A soft knowledge modeling approach for geographically dispersed financial organizations. Soft Computing - A Fusion of Foundations, Methodologies and Applications, 9(8), pp. 579-593.
GOLDBERG, D.E., 1989. Genetic Algorithms in Search, Optimization and Machine Learning. 1st edn. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
HAGIWARA, M., 1992. Extended fuzzy cognitive maps, Fuzzy Systems, 1992, IEEE International Conference on, 1992, pp. 795-801.
HANAFIZADEH, P. and ALIEHYAEI, R., 2011. The Application of Fuzzy Cognitive Map in Soft System Methodology. Systemic Practice and Action Research, pp. 1-30.
HAOMING ZHONG, CHUNYAN MIAO, ZHIQI SHEN and YUHONG FENG, 2008. Temporal fuzzy cognitive maps, Fuzzy Systems, 2008. FUZZ-IEEE 2008. (IEEE World Congress on Computational Intelligence). IEEE International Conference on, 2008, pp. 1831-1840.
HAYKIN, S.S., 2009. Neural networks and learning machines. Prentice Hall.
HENGJIE, S., CHUNYAN, M. and ZHIQI, S., 2007. Fuzzy cognitive map learning based on multi-objective particle swarm optimization, Proceedings of the 9th annual conference on Genetic and evolutionary computation, 2007, ACM, pp. 339-339.
HOLLAND, J., 1975. Adaptation in natural and artificial systems. University of Michigan Press.
HOLLAND, J.H., 1992. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. Cambridge, MA, USA: MIT Press.
HUERGA, A.V., 2002.
A Balanced Differential Learning algorithm in Fuzzy Cognitive Maps, In: Proceedings of the 16th International Workshop on Qualitative Reasoning, 2002.
IAKOVIDIS, D.K. and PAPAGEORGIOU, E., 2011. Intuitionistic Fuzzy Cognitive Maps for Medical Decision Making. Information Technology in Biomedicine, IEEE Transactions on, 15(1), pp. 100-107.
JONES, S., 2000. Almost like a whale: the origin of species updated. Black Swan.
JONG, K.A.D., 2006. Evolutionary computation: a unified approach. MIT Press.
KANNAPPAN, A., TAMILARASI, A. and PAPAGEORGIOU, E.I., 2011. Analyzing the performance of fuzzy cognitive maps with non-linear hebbian learning algorithm in predicting autistic disorder. Expert Syst.Appl., 38(3), pp. 1282-1292.
KARRAY, F. and SILVA, C.D., 2004. Soft Computing and Tools of Intelligent Systems Design: Theory and Applications. Pearson Addison Wesley.
KEARNS, M.J. and VAZIRANI, U.V., 1994. An introduction to computational learning theory. Cambridge, MA, USA: MIT Press.
KECMAN, V., 2001. Learning and soft computing: support vector machines, neural networks, and fuzzy logic models. MIT Press.
KHAN, M.S. and QUADDUS, M., 2004. Group decision support using fuzzy cognitive maps for causal reasoning. Group Decision and Negotiation, 13(5), pp. 463-480.
KHAN, M.S. and CHONG, A., 2003. Fuzzy Cognitive Map Analysis with Genetic Algorithm, IICAI 2003, pp. 1196-1205.
KLIR, G.J. and YUAN, B., 1995. Fuzzy sets and fuzzy logic: theory and applications. Upper Saddle River, N.J.: Prentice Hall PTR.
KOHONEN, T., 1982. Self-organized formation of topologically correct feature maps. Biological cybernetics, 43(1), pp. 59-69.
KOK, K., 2009. The potential of Fuzzy Cognitive Maps for semi-quantitative scenario development, with an example from Brazil. Global Environmental Change, 19(1), pp. 122-133.
KOSKO, B., 1993/1995. Chapter 12: Adaptive Fuzzy Systems. Fuzzy Thinking.
KOSKO, B., 1992. Neural networks and fuzzy systems: a dynamical systems approach to machine intelligence.
Prentice Hall.
KOSKO, B., 1986. Fuzzy cognitive maps. International Journal of Man-Machine Studies, 24(1), pp. 65-75.
KOSKO, B., 1988. Hidden patterns in combined and adaptive knowledge networks. Int.J.Approx.Reasoning, 2(4), pp. 377-393.
KOTTAS, T.L., BOUTALIS, Y.S., DEVEDZIC, G. and MERTZIOS, B.G., 2004. A new method for reaching equilibrium points in Fuzzy Cognitive Maps, 2004, pp. 53-60.
KOULOURIOTIS, D.E., DIAKOULAKIS, I.E., EMRIS, D.M., ANTONIDAKINS, E.N. and KALIAKATSOS, I.A., 2003. Efficiently modeling and controlling complex dynamic systems using evolutionary fuzzy cognitive maps (invited paper). International Journal of Computational Cognition, 1(2), pp. 41-65.
KOULOURIOTIS, D.E., DIAKOULAKIS, I.E. and EMIRIS, D.M., 2001. A fuzzy cognitive map-based stock market model: synthesis, analysis and experimental results, Fuzzy Systems, 2001. The 10th IEEE International Conference on, 2001, pp. 465-468.
KOULOURIOTIS, D.E., DIAKOULAKIS, I.E. and EMIRIS, D.M., 2001. Learning fuzzy cognitive maps using evolution strategies: A novel schema for modeling and simulating high-level behavior, 2001, pp. 364-371.
KOZA, J.R., 1996. On the programming of computers by means of natural selection. MIT Press.
KOZA, J.R., 1992. Genetic programming: on the programming of computers by means of natural selection. Cambridge, MA, USA: MIT Press.
KUDO, M. and SKLANSKY, J., 2000. Comparison of algorithms that select features for pattern classifiers. Pattern Recognition, 33(1), pp. 25-41.
LAUREANO-CRUCES, A., RAMÍREZ-RODRÍGUEZ, J. and TERÁN-GILMORE, A., 2004. Evaluation of the Teaching-Learning Process with Fuzzy Cognitive Maps. 3315, pp. 922-931.
MAMDANI, E.H., 1976. Application of fuzzy logic to approximate reasoning using linguistic synthesis, Proceedings of the sixth international symposium on Multiple-valued logic, 1976, IEEE Computer Society Press, pp. 196-202.
MAMDANI, E.H. and GAINES, B.R., 1981. Fuzzy reasoning and its applications. Academic Press.
MARSLAND, S., 2009.
Machine Learning: An Algorithmic Perspective (Chapman & Hall/CRC Machine Learning & Pattern Recognition). Chapman and Hall/CRC.
MARTIN LARSEN, P., 1980. Industrial applications of fuzzy logic control. International Journal of Man-Machine Studies, 12(1), pp. 3-10.
MATEOU, N.H. and ANDREOU, A.S., 2008. A framework for developing intelligent decision support systems using evolutionary fuzzy cognitive maps. Journal of Intelligent and Fuzzy Systems, 19(2), pp. 151-170.
MATEOU, N.H., MOISEOS, M. and ANDREOU, A.S., 2005. Multi-objective evolutionary fuzzy cognitive maps for decision support, Evolutionary Computation, 2005. The 2005 IEEE Congress on, 2005, pp. 824-830 Vol.1.
MCDERMOTT, A., 1975. Cytogenetics of man and other animals / A. McDermott; consulting editor for this volume, H. Harris. Chapman and Hall; Wiley, London: New York.
MIAO, Y. and LIU, Z.-Q., 2000. On causal inference in fuzzy cognitive maps. IEEE Transactions on Fuzzy Systems, 8(1), pp. 107-119.
MICHALEWICZ, Z. and FOGEL, D.B., 2004. How to solve it: modern heuristics. Springer.
MITCHELL, T.M., 1997. Machine Learning. 1st edn. New York, NY, USA: McGraw-Hill, Inc.
NEGNEVITSKY, M., 2005. Artificial intelligence: a guide to intelligent systems. Addison-Wesley.
NEOCLEOUS, C. and SCHIZAS, C., 2004. Application of Fuzzy Cognitive Maps to the Political-Economic Problem of Cyprus, June 17-20 2004, pp. 340-349.
NOVÁK, V. and LEHMKE, S., 2006. Logical structure of fuzzy IF-THEN rules. Fuzzy Sets and Systems, 157(15), pp. 2003-2029.
PAPAGEORGIOU, E., STYLIOS, C. and GROUMPOS, P., 2003. Fuzzy Cognitive Map Learning Based on Nonlinear Hebbian Rule. Lecture Notes in Computer Science, (2903), pp. 256-268.
PAPAGEORGIOU, E.I., 2011. A new methodology for Decisions in Medical Informatics using fuzzy cognitive maps based on fuzzy rule-extraction techniques. Applied Soft Computing Journal, 11(1), pp. 500-513.
PAPAGEORGIOU, E.I. and GROUMPOS, P.P., 2005.
A new hybrid method using evolutionary algorithms to train Fuzzy Cognitive Maps. Applied Soft Computing Journal, 5(4), pp. 409-431.
PAPAGEORGIOU, E.I. and GROUMPOS, P.P., 2004. Two-stage learning algorithm for fuzzy cognitive maps, Intelligent Systems, 2004. Proceedings. 2004 2nd International IEEE Conference, 2004, pp. 82-87 Vol.1.
PAPAGEORGIOU, E.I., MARKINOS, A.T. and GEMTOS, T.A., 2011. Fuzzy cognitive map based approach for predicting yield in cotton crop production as a basis for decision support system in precision agriculture application. Applied Soft Computing, 11(4), pp. 3643.
PAPAGEORGIOU, E.I., PAPANDRIANOS, N.I., APOSTOLOPOULOS, D.J. and VASSILAKOS, P.J., 2008. Fuzzy Cognitive Map based decision support system for thyroid diagnosis management, Fuzzy Systems, 2008. FUZZ-IEEE 2008. (IEEE World Congress on Computational Intelligence). IEEE International Conference on, 2008, pp. 1204-1211.
PAPAGEORGIOU, E.I., PARSOPOULOS, K.E., STYLIOS, C.S., GROUMPOS, P.P. and VRAHATIS, M.N., 2005. Fuzzy cognitive maps learning using particle swarm optimization. Journal of Intelligent Information Systems, 25(1), pp. 95-121.
PAPAGEORGIOU, E.I., SPYRIDONOS, P.P., GLOTSOS, D.T., STYLIOS, C.D., RAVAZOULA, P., NIKIFORIDIS, G.N. and GROUMPOS, P.P., 2008. Brain tumor characterization using the soft computing technique of fuzzy cognitive maps. Appl.Soft Comput., 8(1), pp. 820-828.
PAPAGEORGIOU, E.I., STYLIOS, C. and GROUMPOS, P.P., 2006. Unsupervised learning techniques for fine-tuning fuzzy cognitive map causal links. International Journal of Human Computer Studies, 64(8), pp. 727-743.
PAPAGEORGIOU, E., GEORGOULAS, G., STYLIOS, C., NIKIFORIDIS, G. and GROUMPOS, P., 2006. Combining Fuzzy Cognitive Maps with Support Vector Machines for Bladder Tumor Grading. 4251, pp. 515-523.
PAPAGEORGIOU, I. and GROUMPOS, P., 2005. A weight adaptation method for fuzzy cognitive map learning. Soft Comput., 9(11), pp. 846-857.
PAPAKOSTAS, G.A. and KOULOURIOTIS, D.E., 2010.
Classifying patterns using Fuzzy Cognitive Maps. Studies in Fuzziness and Soft Computing, 247, pp. 291-306. PARK, K.S. and KIM, S.H., 1995. Fuzzy cognitive maps considering time relationships. Int.J.Hum.-Comput.Stud., 42(2), pp. 157-168. PARSOPOULOS, K.E., PAPAGEORGIOU, E.I., GROUMPOS, P.P. and VRAHATIS, M.N., 2003. A first study of fuzzy cognitive maps learning using particle swarm optimization, Evolutionary Computation, 2003. CEC '03. The 2003 Congress on 2003, pp. 1440-1447 Vol.2. PETALAS, Y.G., PARSOPOULOS, K.E. and VRAHATIS, M.N., 2008. Improving fuzzy cognitive maps learning through memetic particle swarm optimization. Soft Comput., 13(1), pp. 77-94. 134 POLI, R., LANGDON, W.B., CLERC, M. and STEPHENS, C.R., 2007. Continuous optimisation theory made easy? Finite-element models of evolutionary strategies, genetic algorithms and particle swarm optimizers, Proceedings of the 9th international conference on Foundations of genetic algorithms 2007, Springer-Verlag, pp. 165-193. RIDLEY, M., 2004. Evolution. Blackwell Pub. ROSS, T.J., 2004. Fuzzy logic with engineering applications. John Wiley. RUSSELL, S.J. and NORVIG, 2003. Artificial Intelligence: A Modern Approach (Second Edition). Prentice Hall. SALMERON, J.L., 2009. Augmented fuzzy cognitive maps for modelling LMS critical success factors. Know.-Based Syst., 22(4), pp. 275-278. SCHÖLKOPF, B., MIKA, S., BURGES, C.J.C., KNIRSCH, P., MÜLLER, K., RÄTSCH, G. and SMOLA, A.J., 1999. Input Space Versus Feature Space in Kernel-Based Methods. IEEE Transactions on Neural Networks, 10, pp. 1000-1017. SHAWE-TAYLOR, J. and CRISTIANINI, N., 2004. Kernel Methods for Pattern Analysis. New York, NY, USA: Cambridge University Press. SHENG-JUN LI and RUI-MIN SHEN, 2004. Fuzzy cognitive map learning based on improved nonlinear Hebbian rule, Machine Learning and Cybernetics, 2004. Proceedings of 2004 International Conference on 2004, pp. 2301-2306 vol.4. SIRAJ, A., BRIDGES, S.M. and VAUGHN, R.B., 2001. 
Fuzzy cognitive maps for decision support in an intelligent intrusion detection system, IFSA World Congress and 20th NAFIPS International Conference, 2001. Joint 9th 2001, pp. 2165-2170 vol.4. SIVANANDAM, S., N. and DEEPA, S., N., 2008. Introduction to Genetic Algorithms. New York: Springer-Verlag Berlin Heidelberg. SONG, H.J., MIAO, C.Y., WUYTS, R., SHEN, Z.Q., D'HONDT, M. and CATTHOOR, F., 2011. An Extension to Fuzzy Cognitive Maps for Classification and Prediction. Fuzzy Systems, IEEE Transactions on, 19(1), pp. 116-135. STACH, W., KURGAN, L. and PEDRYCZ, W., 2008. Data-driven Nonlinear Hebbian Learning method for Fuzzy Cognitive Maps, Fuzzy Systems, 2008. FUZZ-IEEE 2008. (IEEE World Congress on Computational Intelligence). IEEE International Conference on 2008, pp. 19751981. STACH, W., KURGAN, L. and PEDRYCZ, W., 2007. Parallel Learning of Large Fuzzy Cognitive Maps, Neural Networks, 2007. IJCNN 2007. International Joint Conference on 2007, pp. 15841589. STACH, W., KURGAN, L., PEDRYCZ, W. and REFORMAT, M., 2005. Evolutionary Development of Fuzzy Cognitive Maps, Fuzzy Systems, 2005. FUZZ '05. The 14th IEEE International Conference on 2005, pp. 619-624. STACH, W., KURGAN, L., PEDRYCZ, W. and REFORMAT, M., 2005. Genetic learning of fuzzy cognitive maps. Fuzzy Sets and Systems, 153(3), pp. 371-401. 135 STACH, W., KURGAN, L.A. and PEDRYCZ, W., 2008. Numerical and Linguistic Prediction of Time Series With the Use of Fuzzy Cognitive Maps. Fuzzy Systems, IEEE Transactions on, 16(1), pp. 61-72. STACH, W., KURGAN, L. and PEDRYCZ, W., 2010. Expert-Based and Computational Methods for Developing Fuzzy Cognitive Maps. Fuzzy Cognitive Maps Advances in Theory, Methodologies, Tools and Applications, 247, pp. 23-41. STYLIOS, C.D. and GEORGOPOULOS, V.C., 2008. Fuzzy cognitive maps structure for medical decision support systems. STYLIOS, C.D. and GROUMPOS, P.P., 2004. Modeling Complex Systems Using Fuzzy Cognitive Maps. 
IEEE Transactions on Systems, Man, and Cybernetics Part A:Systems and Humans., 34(1), pp. 155-162. STYLIOS, C.D. and GROUMPOS, P.P., 1999. Soft computing approach for modelling the supervisor of manufacturing systems. Journal of Intelligent and Robotic Systems: Theory and Applications, 26(3), pp. 389-403. STYRIOS, C.D. and GROUMPOS, P.P., 2000. Fuzzy Cognitive Map In Modeling Supervisory Control Systems. Journal of Intelligent & Fuzzy Systems, 8, pp. 83-98. SUGENO, M. and YASUKAWA, T., 1993. A fuzzy-logic-based approach to qualitative modeling. Fuzzy Systems, IEEE Transactions on, 1(1), pp. 7. TSADIRAS, A.K., 2008. Comparing the inference capabilities of binary, trivalent and sigmoid fuzzy cognitive maps. Information Sciences, 178(20), pp. 3880-3894. TSADIRAS, A.K. and MARGARITIS, K.G., 1999. A New Balance Degree for Fuzzy Cognitive Maps. Department of Applied Informatics, University of Macedonia. TSADIRAS, A. and KOUSKOUVELIS, I., 2005. Using Fuzzy Cognitive Maps as a Decision Support System for Political Decisions: The Case of Turkey’s Integration into the European Union. 3746, pp. 371-381. VAPNIK, V., GOLOWICH, S.E. and SMOLA, A., 1996. Support Vector Method for Function Approximation, Regression Estimation, and Signal Processing, Advances in Neural Information Processing Systems 9 1996, MIT Press, pp. 281-287. XIANG-FENG LUO and ER-LIN YAO, 2005. The Reasoning Mechanism of Fuzzy Cognitive Maps, Semantics, Knowledge and Grid, 2005. SKG '05. First International Conference on 2005, pp. 104-104. XIAODONG ZHU, ZHIQIU HUANG, SHUQUN YANG and GUOHUA SHEN, 2007. Fuzzy Implication Methods in Fuzzy Logic, Fuzzy Systems and Knowledge Discovery, 2007. FSKD 2007. Fourth International Conference on 2007, pp. 154-158. XIROGIANNIS, G., GLYKAS, M., STAIKOURAS,C., 2010. Fuzzy Cognitive Maps in banking business process performance measurement. Studies in Fuzziness and Soft Computing, 247, pp. 161-200. XIROGIANNIS, G. and GLYKAS, M., 2004. 
Fuzzy cognitive maps in business analysis and performance-driven change. Engineering Management, IEEE Transactions on, 51(3), pp. 334351. 136 YANLI ZHANG and HUIYAN LIU, 2010. Classification Systems Based on Fuzzy Cognitive Maps, Genetic and Evolutionary Computing (ICGEC), 2010 Fourth International Conference on 2010, pp. 538-541. YOU-WEI TENG and WANG, W.-., 2004. Constructing a user-friendly GA-based fuzzy system directly from numerical data. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 34(5), pp. 2060-2070. ZADEH, L.A., 1996. Fuzzy logic = computing with words. Fuzzy Systems, IEEE Transactions on, 4(2), pp. 103-111. ZADEH, L.A., 1983. The role of fuzzy logic in the management of uncertainty in expert systems. Fuzzy Sets Syst., 11(1-3), pp. 197-198. ZADEH, L.A., 1973. Outline of a New Approach to the Analysis of Complex Systems and Decision Processes. Systems, Man and Cybernetics, IEEE Transactions on, SMC-3(1), pp. 2844. ZADEH, L.A., 1965. Fuzzy Sets. Information and Computation/information and Control, 8, pp. 338-353. ZHEN PENG, BINGRU YANG and WEIWEI FANG, 2008. A Learning Algorithm of Fuzzy Cognitive Map in Document Classification, Fuzzy Systems and Knowledge Discovery, 2008. FSKD '08. Fifth International Conference on 2008, pp. 501-504. ZHI-QIANG LIU and YUAN MIAO, 1999. Fuzzy cognitive map and its causal inferences, Fuzzy Systems Conference Proceedings, 1999. FUZZ-IEEE '99. 1999 IEEE International 1999, pp. 1540-1545 vol.3. ZIMMERMANN, H.J., 2001. Fuzzy set theory--and its applications. Kluwer Academic Publishers. 137