Chapter 6 CLASSIFIERS

Classification is defined as “the act of forming into a class or classes; a distribution into groups, as classes, orders, families, etc., according to some common relations or affinities”. A classifier is therefore “a subject that creates classifications”. In recent times, the automatic categorisation of patterns has become of great interest in many research areas. Machine learning methods for classification learn from data that contain already classified instances, called the training set (e.g., a collection of attribute values, each assigned to a certain class), and attempt to build models that, given the attribute values of a new instance, predict its class.

In supervised learning we are given a sample of input-output pairs (the training sample), and the task is to find a deterministic function that maps any input to an output such that disagreement with future input-output observations is minimised.

A huge number of classification techniques exist in the literature, for instance neural networks, classification trees, variants of naive Bayes, k-nearest neighbours, classification through association rules, function decomposition, logistic regression, and support vector machines. The performance of a classification method depends to some extent on the target task. For this reason, no classifier can be said to be better than another in general, and several alternatives are usually tried when facing a given categorisation problem.

6.1 Classifiers used in emotional recognition

Several pattern recognition methods have been explored for automatic emotion recognition (see [Pet99, Bat00]). Dellaert [Del96], for instance, tried maximum likelihood Bayes classification, kernel regression, and k-nearest neighbour methods, whereas Roy and Pentland [Roy96] used the Fisher linear discriminant method.
Many more studies have applied different classifiers to emotion discrimination from the speech signal; this section gives an overview of the methods employed in a number of published studies.

[Lee01] reports on methods for automatic classification of spoken utterances based on the emotional state of the speaker. Linear discriminant classification with Gaussian class-conditional probability distributions and k-nearest neighbour methods are used to classify utterances into two basic emotion states, negative and non-negative. In addition, to improve classification performance, two feature selection methods are employed: promising first selection and forward feature selection. Principal component analysis is used to reduce the dimensionality of the features while maximising classification accuracy.

A study carried out by Amir [Ami01] also makes use of the k-nearest neighbours approach. The method estimates the local posterior probability of each class as the average class membership over the K nearest neighbours. The algorithm was run for K from 1 to 15, but the results were mostly poor compared with the performance of the neural network classifiers.

[Che98] employs supervised classification of six basic emotions¹ with leave-one-out (LOO) cross validation (CV). Two methods were applied to perform the classification:
- the nearest mean criterion;
- modelling each class with a Gaussian distribution, normalising by the mean and variance of the class, and then finding the most probable class to which the test sample belongs.

Although distance-based measures have a longer tradition, newer automatic classification tools, principally neural networks, have recently gained acceptance for this task.

¹ Happiness, sadness, fear, anger, surprise and disgust.
Noam Amir compares in [Ami01] the performance of two algorithms: a classification algorithm based on Euclidean distances and a classification algorithm based on neural networks. Both classify four emotions² using an identical feature set, on a database of emotional speech that was validated through subjective listening tests.

The distance measure method was previously discussed in detail in a study by the same author [Ami00], where it proved successful when the characterisation of each emotion was specific to each subject being studied. This method obtains representative values for each emotion by averaging the feature vectors over the whole set of utterances, and then applies the Mahalanobis distance measure to compute the distance of each vector to the centroid. A small distance from a certain centroid indicates that the measurement most likely belongs to that specific emotion. The drawback of distance-based methods is that they only model a standard way of expressing the emotion. For instance, if we attempt to classify an utterance whose level of anger is extremely intense compared with what the classifier is used to recognising, the distance to the centroid will be larger, even though this utterance could be considered “angrier” than many others.

For the neural network classification, [Ami01] uses four feed-forward neural networks, one for each emotion (OCONN). Each network has twelve input neurons and one output neuron in the range [0,1]. The internal architecture varies for each network, i.e. for each emotion. The transfer function is log-sigmoid and the training method is Levenberg-Marquardt backpropagation.

Neural networks were also used in [Pet99] in three different ways:
a) A two-layer backpropagation neural network architecture with an 8-, 10- or 14-element input vector, 10 or 20 nodes in the hidden sigmoid layer and five nodes in the output layer, to classify into five different emotions³.
b) Ensembles of neural network classifiers, i.e. an odd number of neural network classifiers trained on different subsets of the training set using bootstrap aggregation [Bri96] or cross-validated committees [Prm96]. The ensemble makes its decision based on the majority voting principle. They employed ensemble sizes from 7 to 5.
c) Set of experts. Instead of training a neural network to recognise all emotions, a set of specialists is built. Each of these “experts” can recognise only one emotion, and their results are then combined to classify a given sample. For the expert networks they used a two-layer backpropagation neural network architecture with an 8-element input vector, 10 or 20 nodes in the hidden sigmoid layer and one node in the output linear layer.

² Anger, sadness, happiness and neutral.
³ Normal state, happiness, anger, sadness and fear.

In [Hub98], multi-layer perceptrons (MLP) were trained to discriminate between angry and neutral patterns.

The PHYSTA project⁴ uses hybrid technology, i.e. a combination of classical artificial intelligence (AI) computing and neural networks. The classical component allows for the use of known procedures and logical operations, which are suited to language processing. The neural net component allows for learning at various levels, for instance the weights that should be attached to various inputs, adjacencies, and probabilities of particular events given certain information.

6.2 Classifiers tried in the present work

A neural network classifier has been the main tool employed in this work and, consequently, its operation is detailed in section 6.3. However, other classification methods, namely Gaussian mixture models (GMMs), linear regression and decision trees, were also tried.

6.2.1 Gaussian mixture models

If there is reason to believe that a data set comprises several distinct populations, a mixture model can be used.
Mixture models are a type of density model comprising a number of component functions, usually Gaussian. These component functions are combined to provide a multimodal density. A Gaussian mixture model represents each class of data as a linear combination of several Gaussian densities in the feature space. Generally, the main motivations for using Gaussian mixtures are:
• a linear combination of Gaussian basis functions is capable of forming smooth approximations of arbitrarily shaped densities.
• in speaker recognition, for instance, the individual component densities may model some underlying acoustic classes, such as vowels, nasals or fricatives.

⁴ Principled Hybrid Systems: Theory and Applications (PHYSTA) is a collaboration of Kings College London, University of Milan, Queen's University of Belfast and the National Technical University of Athens.

This method was employed in our experiments to discriminate between two classes on the basis of the voice quality features, using 1 and 32 Gaussian functions. However, none of the experiments carried out with this method yielded better results than the neural network classifier and, consequently, the method was discarded.

6.2.2 Linear discriminant analysis

Linear regression is the simplest form of regression and is usually used to predict a continuous class. Linear regression assumes that the class variable can be expressed as a linear function of one attribute:

$$y = a + bx \tag{6.1}$$

The linear discriminant analysis method consists of searching for linear combinations of selected variables that provide the best separation between the considered classes. These combinations are called discriminant functions [Mja01].

6.2.3 Decision trees

A decision tree is a graphical representation of a procedure for classifying or evaluating an item of interest.
It represents a function that maps each element of its domain to an element of its range, which is typically a class label or numerical value. A decision tree takes as input an object or situation described by a set of properties and outputs a yes/no decision; such trees therefore represent Boolean functions. Functions with a larger range of outputs can also be represented. At each leaf of a decision tree, one finds an element of the range. At each internal node of the tree, one finds a test that has a small number of possible outcomes. By branching according to the outcome of each test, one arrives at a leaf that contains the class label or numerical value that corresponds to the item in hand. Leaves are usually not of one class, so one typically chooses the most frequently occurring class label.

The decision tree method was tried during the introductory experiments using C5.0, a state-of-the-art system that constructs classifiers in the form of decision trees and rulesets. Since many disadvantages were found, owing to the simplicity of the classifier for a problem as complex as emotion discrimination, decision trees were discarded early. The fundamental problems with decision trees are at least fourfold:
• They look at very simple combinations of attributes within a table, and hence miss many patterns.
• By their nature, they need to break numeric fields into fixed ranges, hence missing even more patterns and providing less information.
• They are quite brittle on inexact data: a small change in a value can have a large impact on the outcome.
• Decision trees can at best work on small samples of data and cannot easily handle large data sets, resulting in a significant loss of information.
• Since they ignore some attributes, they may make less accurate predictions, and if some values are missing from the new data item, they make no predictions at all.
Furthermore, given the same data set, one can obtain several decision trees, each making a different prediction on new data items.

6.2.4 Neural Networks

Neural networks, described at length in section 6.3, are the method selected to build an emotion classifier in the framework of this thesis. The reasons that make neural networks more convenient for our purposes are described in this section; further details are found in section 6.3.

Neural networks, with their remarkable ability to derive meaning from complicated or imprecise data, can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. Emotions are a complex field of investigation, which includes many discrepancies even in its theoretical domain. A trained neural network can be thought of as an "expert" in the category of information it has been given to analyse. This expert can then be used to provide projections given new situations of interest and to answer "what if" questions. Other advantages include:
1. Adaptive learning: the ability to learn how to do tasks based on the data given for training or initial experience.
2. Self-organisation: an artificial neural network (ANN) can create its own organisation or representation of the information it receives during learning time.
3. Real-time operation: ANN computations may be carried out in parallel, and special hardware devices are being designed and manufactured which take advantage of this capability.
4. Fault tolerance via redundant information coding: partial destruction of a network leads to a corresponding degradation of performance. However, some network capabilities may be retained even with major network damage.

The multiple advantages of neural networks, in addition to the general acceptance and widespread use of this method in several earlier approaches to emotion recognition from the speech signal, lead us to employ this method as our main classification tool.
6.3 Neural Networks.

As established in section 6.1, neural networks are a frequently employed tool for emotion recognition. Such a complex classifier admits a huge number of possible configurations; the term neural network therefore denotes not a single classifier but a family of them. In the present work, diverse configurations have been tried, following some previous scientific approaches (see [Ami01, Hub98]). All the attempted methods and architectures are detailed in later sections, after a brief introduction to NNs in section 6.3.1. Since a great deal of information about neural networks can be found in the literature, section 6.3.1 provides the reader only with the basic concepts needed to understand the configuration details that follow.

The software employed is SNNS (Stuttgart Neural Network Simulator), a simulator for neural networks on Unix workstations developed at the Institute for Parallel and Distributed High Performance Systems (IPVR) at the University of Stuttgart. The software allows two modes of use: batch programming or operation through a graphical interface. For further information about the software see [Zel95]⁵.

6.3.1 Introduction to Neural Networks.

An Artificial Neural Network (ANN) is an information processing paradigm inspired by the way biological nervous systems, such as the brain, process information. The term Artificial is included to differentiate these networks from the biological neural systems on which they are based, but it is usually understood within the computational environment, and they can also be referred to simply as Neural Networks (NNs). The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurones) working in unison to solve specific problems.
An input is presented to some (or all) of its input units; this input vector is propagated through the whole network and, finally, some kind of output is produced. Essentially, then, neural networks are functions: the network takes an input as an argument and gives an output for that particular input. Because input and output can consist of many units or components, they are considered as vectors.

Figure 6.1. Artificial neuron model

However, the real power of an ANN lies in its ability to learn; that is, the function is not constant but can be changed dynamically. ANNs, like people, learn by example. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process. Learning in biological systems involves adjustments to the synaptic connections that exist between the neurones. This also happens in NN learning.

⁵ http://www-lehre.informatik.uni-osnabrueck.de/~nn/html_info/UserManual/UserManual.html

Accordingly, neural networks are a form of multiprocessor computer system, with the following elements:
• simple processing elements (neurons or nodes),
• a high degree of interconnection (links between nodes),
• simple scalar messages,
• adaptive interaction between elements.

Figure 6.2. Artificial neuron activation process.

The simple processing element, the artificial neuron or node (figure 6.1), is a device based on the biological neuron model, with many inputs and one output. Each input comes via a connection that has a strength (or weight); these weights correspond to synaptic efficacy in a biological neuron. Each neuron also has a single threshold value. The weighted sum of the inputs is formed, and the threshold subtracted, to compose the activation of the neuron. Then, the activation signal is passed through an activation function (also known as a transfer function) to produce the output of the neuron. Figure 6.2 shows this activation process.
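The activation process just described can be illustrated with a minimal Python sketch. It is not taken from SNNS: the function names `log_sigmoid` and `neuron_output` are hypothetical, and the logistic function is only one possible choice of transfer function, assumed here for illustration.

```python
import math

def log_sigmoid(x):
    """Logistic (log-sigmoid) transfer function with output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, threshold, transfer=log_sigmoid):
    """Weighted sum of the inputs minus the threshold, passed through the transfer function."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Activation: weighted sum of the inputs with the threshold subtracted.
    activation = sum(w * x for w, x in zip(weights, inputs)) - threshold
    # Output: the activation passed through the transfer function.
    return transfer(activation)

# Two inputs whose weighted sum (0.5 + 0.25) equals the threshold:
# the activation is 0 and the logistic function returns its midpoint.
example = neuron_output([1.0, 1.0], [0.5, 0.25], threshold=0.75)
```

Passing a different callable as `transfer` (for instance a step or ramp function) changes the behaviour of the unit without touching the weighted-sum part.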
The activation function is not unique; it can be changed, and even self-programmed, to obtain better performance in a specific task.

The artificial neuron has two modes of operation: the training mode and the using (testing) mode. In the training mode, the neuron can be trained to fire (or not) for particular input patterns. In the using mode, when a taught input pattern is detected at the input, its associated output becomes the current output. If the input pattern does not belong to the taught list of input patterns, the firing rule is used to determine whether to fire or not.

Depending on their function in the net, one can distinguish three types of units, depicted in figure 6.3. The units whose activations are the problem input for the net are called input units; the units whose outputs represent the output of the net are called output units. The remaining units are called hidden units, because they are not visible from the outside. A neural network must have both input and output units, but it may have no hidden units (single-layer) or one or many layers of hidden units (multi-layer).

Figure 6.3. Different types of units within the structure of an artificial neural network

By combining these simple units and using links between them, many different network configurations can be formed. A neural network is characterised by its particular:
• Architecture: its pattern of connections between the neurones.
• Learning algorithm: its method of determining the weights on the connections. Algorithms used during this thesis are detailed in section 6.3.3.
• Activation function: which determines its output. The most common activation functions are the step, ramp, sigmoid and Gaussian functions. Activation functions used during this thesis are detailed in section 6.3.2.

With regard to the architecture, regardless of the number of layers (single-layer or multi-layer), there are two main kinds of ANN:
1.
Feed-forward networks allow signals to travel one way only: from input to output. There is no feedback (loops), i.e. the output of any layer does not affect that same layer. Feed-forward ANNs tend to be straightforward networks that associate inputs with outputs. They are extensively used in pattern recognition. This type of organisation is also referred to as bottom-up or top-down.
2. Feedback networks can have signals travelling in both directions by introducing loops in the network. Feedback networks are very powerful and can get extremely complicated. Feedback networks are dynamic; their 'state' changes continuously until they reach an equilibrium point. They remain at the equilibrium point until the input changes and a new equilibrium needs to be found. Feedback architectures are also referred to as interactive or recurrent, although the latter term is often used to denote feedback connections in single-layer organisations.

Figure 6.4. Multilayer perceptron employing feed-forward, fully connected topology

In the framework of this thesis, only the feed-forward architecture is employed, owing to its more general use. However, with regard to the learning algorithm and the activation and analysis functions, diverse options are tried. These particular neural network characteristics are described in more detail in the next sections.

6.3.2 Initialisation of adaptive parameters in neural networks.

Before a neural network is trained, its weights must be initialised so that the iterative optimisation has a starting point. The initialisation of adaptive parameters in neural networks, far from being trivial, is identified by several studies (see [Duc97, Fer01]) as a key factor in creating robust neural networks. There is no definitive initialisation. Setting all the weights to zero will halt all gradient-dependent optimisation techniques.
In [Duc97] it is concluded that neural network initialisation, most frequently done by randomising the weights, can also be accomplished with prototypes based on an initial clusterisation, giving much better results and enabling solutions to complex, real-life problems. The introduction of such methods of parameter initialisation should allow the creation of neural systems that require little optimisation in further training stages. However, complex initialisation techniques still require deeper investigation and further assessment.

Usually it is good design to set the weights so that the summation in the receiving unit (hidden or output unit) is in the range [-1, 1], that is, to adjust the weights according to the standard deviation of the transmitting units and the number of transmitting units (the fan-in). Therefore, the initialisation function used in this work is the Randomise Weights function of the SNNS toolkit, in the mentioned range [-1, 1]. With random initialisation, different parts of the weight space can be searched, reducing the influence that a local minimum has for the particular training set.

6.3.3 Learning Algorithms.

One of the most important questions when using NNs is how to adjust the weights of the links to obtain the desired system behaviour. This modification is very often based on the Hebbian rule, which states that a link between two units is strengthened if both units are active at the same time. The Hebbian rule in its general form is:

$$\Delta w_{ij} = g\big(a_j(t),\, t_j\big)\; h\big(o_i(t),\, w_{ij}\big) \tag{6.2}$$

where
$w_{ij}$ = weight of the link from unit i to unit j,
$a_j(t)$ = activation of unit j in step t,
$t_j$ = teaching input of unit j,
$o_i$ = output of the preceding unit i,
$g(\dots)$ = function depending on the activation of the unit and the teaching input,
$h(\dots)$ = function depending on the output of the preceding element and the current weight of the link.

Training a feed-forward neural network with supervised learning consists of the following two phases:
1.
An input pattern is presented to the network. The input is then propagated forward in the net until activation reaches the output layer. This constitutes the so-called forward propagation phase.
2. The output of the output layer is then compared with the teaching input. The error, i.e. the difference (delta) between the output and the teaching input of a target output unit j, is then used together with the output of the source unit i to compute the necessary changes of the link $w_{ij}$. To compute the deltas of inner units for which no teaching input is available (units of hidden layers), the deltas of the following layer, which are already computed, are used in a formula given below (6.3). In this way the errors (deltas) are propagated backward, so this phase is called backward propagation.

There are two kinds of training, according to when the weights are updated. In online learning, the weight changes are applied to the network after each training pattern, i.e. after each forward and backward pass. In offline learning or batch learning, the weight changes are accumulated for all patterns in the training file and the sum of all changes is applied after one full cycle (epoch) through the training pattern file.

The methods and algorithms tried during this Diploma Thesis are described in the following subsections.

6.3.3.1 Backpropagation learning algorithm.

The basic idea of the backpropagation learning algorithm is the repeated application of the chain rule to compute the influence of each weight in the network with respect to an arbitrary error function E:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_i}\,\frac{\partial a_i}{\partial net_i}\,\frac{\partial net_i}{\partial w_{ij}} \tag{6.3}$$

where
$w_{ij}$ = weight from neuron j to neuron i,
$a_i$ = activation value,
$net_i$ = weighted sum of the inputs of neuron i.

Once the partial derivative of each weight is known, the aim of minimising the error function is achieved by performing a simple gradient descent:

$$w_{ij}(t+1) = w_{ij}(t) - \eta\,\frac{\partial E}{\partial w_{ij}}(t) \tag{6.4}$$

where η = learning rate.
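As an illustration of equations 6.3 and 6.4, the following Python sketch applies the chain rule to a single sigmoid neuron and performs one gradient descent step. It rests on assumptions that are not part of the thesis: the squared error $E = \frac{1}{2}(t - a)^2$, the logistic transfer function, and all function names are chosen only for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def error_gradient(weights, inputs, target):
    """Chain rule of equation 6.3 for one sigmoid neuron with E = 0.5 * (target - a)**2."""
    net = sum(w * x for w, x in zip(weights, inputs))  # net_i: weighted sum of the inputs
    a = sigmoid(net)                                   # a_i: activation value
    dE_da = a - target                                 # dE/da_i for the assumed squared error
    da_dnet = a * (1.0 - a)                            # da_i/dnet_i for the logistic function
    # dnet_i/dw_ij is the input carried by the corresponding link.
    return [dE_da * da_dnet * x for x in inputs]

def gradient_descent_step(weights, gradient, eta):
    """Equation 6.4: w(t+1) = w(t) - eta * dE/dw(t)."""
    return [w - eta * g for w, g in zip(weights, gradient)]

def squared_error(weights, inputs, target):
    a = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
    return 0.5 * (target - a) ** 2

weights = [0.1, -0.2]
inputs, target = [1.0, 0.5], 1.0
before = squared_error(weights, inputs, target)
weights = gradient_descent_step(weights, error_gradient(weights, inputs, target), eta=0.5)
after = squared_error(weights, inputs, target)  # smaller than `before` for this step
```

In a multi-layer network the same three factors appear for every weight; only the first factor, the derivative of the error with respect to the activation, has to be obtained from the following layer.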
The learning rate is selected by the user and, as can be deduced from equation 6.4, it plays an important role in the convergence of the network, in terms of both success and speed. For our experiments, the most commonly used parameters are selected. The inspection of advanced possibilities related to neural network learning procedures constitutes a broad field of investigation and could therefore be a subject of further experimentation.

In the backpropagation learning algorithm, online training is usually significantly faster than batch training, especially in the case of large training sets with many similar training examples. On the other hand, the results of training with backpropagation and an update after every pattern presentation depend heavily on a proper choice of the parameter η [Sci94].

The backpropagation weight update rule, also called the generalised delta rule, reads as follows for the SNNS software:

$$\Delta w_{ij} = \eta\,\delta_j\,o_i \tag{6.5}$$

$$\delta_j = \begin{cases} f'(net_j)\,(t_j - o_j) & \text{if unit } j \text{ is an output unit} \\ f'(net_j)\,\sum_k \delta_k\, w_{jk} & \text{if unit } j \text{ is a hidden unit} \end{cases} \tag{6.6}$$

where
η = learning factor (a constant),
$\delta_j$ = error (difference between the real output and the teaching input) of unit j,
$o_i$ = output of the preceding unit i,
$t_j$ = teaching input of unit j,
$f'$ = derivative of the activation function,
i = index of a predecessor to the current unit j, with link $w_{ij}$ from i to j,
j = index of the current unit,
k = index of a successor to the current unit j, with link $w_{jk}$ from j to k.

Several backpropagation algorithms are supplied with SNNS. In our research we made use of two of them:

Vanilla backpropagation / Standard backpropagation. Vanilla backpropagation corresponds to the standard backpropagation learning algorithm introduced by [Rum86] and described above. It is the most common learning algorithm. Its definition is given by equations 6.5 and 6.6. In SNNS, one may either set the number of training cycles in advance or train the network until it has reached a predefined error on the training set.
In order to execute this algorithm, the following learning parameters are required by the learning function built into SNNS:
- η: the learning rate specifies the step width of the gradient descent. Typical values of η are 0.1 … 1. Some small examples actually train even faster with values above 1, such as 2.0.
- $d_{max}$: the maximum difference $d_j = t_j - o_j$ between a teaching value $t_j$ and an output $o_j$ of an output unit which is tolerated, i.e. which is propagated back as $d_j = 0$. If values above 0.9 should be regarded as 1 and values below 0.1 as 0, then $d_{max}$ should be set to 0.1. This prevents overtraining of the network. Typical values of $d_{max}$ are 0, 0.1 or 0.2.

Backpropagation with chunkwise update. There is a form of backpropagation that lies between the online and batch versions of the algorithm with regard to updating the weights. The online version is the one described before (vanilla backpropagation). The batch version has a formula similar to that of vanilla backpropagation but, whereas in vanilla backpropagation an update step is performed after each single pattern, in batch backpropagation all weight changes are summed over a full presentation of all training patterns (one epoch). Only then is an update with the accumulated weight changes performed. In the chunkwise version, a chunk is defined as the number of patterns to be presented to the network before making any alterations to the weights. This version is very useful for training cases with very large training sets, where batch update would take too long to converge and online update would be too unstable.

Besides the parameters required by vanilla backpropagation, this algorithm needs the chunk size N to be fixed, defined as the number of patterns to be presented during training before an update of the weights with the accumulated error takes place.
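The difference between the update schedules can be sketched in Python as follows. This is only an illustration, not the SNNS implementation: it assumes a single linear unit trained with the delta rule, the name `train_epoch_chunkwise` is hypothetical, and flushing an incomplete final chunk at the end of the epoch is an assumption made here.

```python
def train_epoch_chunkwise(weights, patterns, eta, chunk_size):
    """One pass over `patterns`, applying the summed weight changes every `chunk_size` patterns.

    Each pattern is a pair (inputs, teaching_value). The unit is linear, so the
    delta rule gives the weight change eta * (t - o) * x_i for each link.
    Returns the new weights and the number of weight updates performed.
    """
    weights = list(weights)
    pending = [0.0] * len(weights)  # weight changes accumulated within the current chunk
    seen = 0
    updates = 0
    for inputs, target in patterns:
        output = sum(w * x for w, x in zip(weights, inputs))
        delta = target - output
        for i, x in enumerate(inputs):
            pending[i] += eta * delta * x
        seen += 1
        if seen == chunk_size:
            weights = [w + dw for w, dw in zip(weights, pending)]
            pending = [0.0] * len(weights)
            seen = 0
            updates += 1
    if seen:  # assumption: an incomplete final chunk is still applied
        weights = [w + dw for w, dw in zip(weights, pending)]
        updates += 1
    return weights, updates

patterns = [([1.0], 2.0), ([1.0], 4.0), ([1.0], 6.0), ([1.0], 8.0)]
w_online, n_online = train_epoch_chunkwise([0.0], patterns, eta=0.1, chunk_size=1)
w_chunk, n_chunk = train_epoch_chunkwise([0.0], patterns, eta=0.1, chunk_size=2)
w_batch, n_batch = train_epoch_chunkwise([0.0], patterns, eta=0.1, chunk_size=len(patterns))
```

With four patterns, a chunk size of 1 performs four updates per epoch, a chunk size of 2 performs two, and a chunk size equal to the number of patterns performs a single update from the summed changes.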
Based on this definition, backpropagation with chunkwise update can also be seen as a mixture between standard backpropagation (N = 1) and batch backpropagation (N = number of patterns in the file). For the experiments carried out in this thesis that make use of this learning algorithm, the chunk size is set to 50 patterns.

6.3.3.2 RPROP learning algorithm.

Rprop stands for “Resilient backpropagation” and is a local adaptive learning scheme, performing supervised batch learning in multi-layer perceptrons.

The choice of the learning rate η for the backpropagation algorithm in equation 6.4, which scales the derivative, has an important effect on the time needed until convergence is reached. If it is set too small, too many steps are needed to reach an acceptable solution; conversely, a large learning rate may lead to oscillation, preventing the error from falling below a certain value. Figure 6.5 shows both phenomena. In case (a), long convergence times are required, and in case (b), an oscillation can be seen in the proximity of local minima.

Figure 6.5. Error functions for the case of (a) a small learning rate and (b) a large learning rate.

The basic principle of Rprop is to eliminate the harmful influence of the size of the partial derivative on the weight step. This algorithm considers the local topology of the error function to change its behaviour. As a consequence, only the sign of the derivative is considered to indicate the direction of the weight update. The size of the weight change is exclusively determined by a weight-specific, so-called 'update-value' $\Delta_{ij}^{(t)}$:

$$\Delta w_{ij}^{(t)} = \begin{cases} -\Delta_{ij}^{(t)} & \text{if } \dfrac{\partial E}{\partial w_{ij}}^{(t)} > 0 \\ +\Delta_{ij}^{(t)} & \text{if } \dfrac{\partial E}{\partial w_{ij}}^{(t)} < 0 \\ 0 & \text{else} \end{cases} \tag{6.7}$$

where $\frac{\partial E}{\partial w_{ij}}^{(t)}$ = summed gradient information over all patterns of the pattern set.

The basic idea behind the improvement realised by the Rprop algorithm was to obtain more information about the topology of the error function so that the weight update can be done more appropriately. Each update-value evolves during the learning process according to its local view of the error function E. Therefore, the second step of Rprop learning is to determine the new update-values. This is based on a sign-dependent adaptation process:

$$\Delta_{ij}^{(t)} = \begin{cases} \eta^{+}\,\Delta_{ij}^{(t-1)} & \text{if } \dfrac{\partial E}{\partial w_{ij}}^{(t-1)} \cdot \dfrac{\partial E}{\partial w_{ij}}^{(t)} > 0 \\ \eta^{-}\,\Delta_{ij}^{(t-1)} & \text{if } \dfrac{\partial E}{\partial w_{ij}}^{(t-1)} \cdot \dfrac{\partial E}{\partial w_{ij}}^{(t)} < 0 \\ \Delta_{ij}^{(t-1)} & \text{else} \end{cases} \tag{6.8}$$

with $0 < \eta^{-} < 1 < \eta^{+}$.

Note that the update-value is not influenced by the magnitude of the derivatives, but only by the behaviour of the sign of two succeeding derivatives. Every time the partial derivative of the corresponding weight changes its sign, which indicates that the last update was too big and the algorithm has jumped over a local minimum (figure 6.5b), the update-value $\Delta_{ij}^{(t)}$ is decreased by the factor $\eta^{-}$. If the derivative retains its sign, the update-value is slightly increased in order to accelerate convergence in shallow regions.

Rprop also avoids the problem encountered in the well-known SuperSAB⁶ algorithm [Toll90]. There, the weight update is still strongly dependent on the magnitude of the partial derivative, and the effects of this influence spread over the entire network. Rprop prevents this influence by changing the value of the weight update directly, depending only on the sign of the partial derivative without reference to its magnitude.

Since Rprop tries to adapt its learning process to the topology of the error function, it follows the principle of 'batch learning' or 'learning by epoch'. That means that weight update and adaptation are performed after the gradient information of the whole pattern set has been computed.
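Equations 6.7 and 6.8 can be combined into a single update step, sketched below in Python. This is a simplified illustration rather than the SNNS code: the function and parameter names are hypothetical, the default factors 0.5 and 1.2 and the step-size bounds `delta_min` and `delta_max` are assumed values, and neither weight decay nor weight backtracking is included.

```python
def sign(x):
    return (x > 0) - (x < 0)

def rprop_step(weights, grads, prev_grads, step_sizes,
               eta_plus=1.2, eta_minus=0.5, delta_min=1e-6, delta_max=50.0):
    """One batch Rprop update; returns (new_weights, new_step_sizes).

    `grads` and `prev_grads` hold dE/dw summed over the whole pattern set
    for the current and the previous epoch.
    """
    new_weights, new_steps = [], []
    for w, g, g_prev, step in zip(weights, grads, prev_grads, step_sizes):
        product = g * g_prev
        if product > 0:
            # Same sign twice: enlarge the update-value (eq. 6.8, first case).
            step = min(step * eta_plus, delta_max)
        elif product < 0:
            # Sign change: a minimum was jumped over, so shrink the update-value.
            step = max(step * eta_minus, delta_min)
        # Eq. 6.7: move against the sign of the gradient by the update-value.
        new_weights.append(w - sign(g) * step)
        new_steps.append(step)
    return new_weights, new_steps

# Toy problem (an assumption for illustration): minimise E(w) = (w - 3)^2.
w, step, g_prev = [0.0], [0.1], [0.0]
for _ in range(100):
    g = [2.0 * (w[0] - 3.0)]
    w, step = rprop_step(w, g, g_prev, step)
    g_prev = g
```

On the toy problem the update-value grows while the gradient keeps its sign and is halved each time the minimum at w = 3 is overshot, so the weight settles close to 3 regardless of how large the gradient itself is.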
In order to reduce the number of freely adjustable parameters, which often leads to a tedious search in parameter space, the increase and decrease factors in SNNS are set to fixed values ($\eta^{-} = 0.5$, $\eta^{+} = 1.2$). Thus, the Rprop algorithm takes only three parameters:

- $\Delta_0$ = initial update-value.
- $\Delta_{max}$ = limit for the maximum step size.
- $\alpha$ = weight decay exponent.

When learning starts, all update-values are set to an initial value $\Delta_0$. Since $\Delta_0$ directly determines the size of the first weight step, it should be chosen according to the initial values of the weights themselves, for example $\Delta_0 = 0.1$ (default setting). The choice of this value is rather uncritical, because it is adapted as learning proceeds.

In order to prevent the weights from becoming too large, the maximum weight step, determined by the size of the update-value, is limited. The upper bound is set by the second parameter of Rprop, $\Delta_{max}$. The default upper bound is set somewhat arbitrarily to $\Delta_{max} = 50$. Usually, convergence is rather insensitive to this parameter as well. Nevertheless, for some problems it can be advantageous to allow only very cautious (namely small) steps, in order to prevent the algorithm from getting stuck too quickly in suboptimal local minima. The minimum step size is fixed to $\Delta_{min} = 10^{-6}$.

The remaining parameter $\alpha$ (weight decay exponent) determines the relationship between the output error and the reduction in the size of the weights. The third parameter is set to 4, which corresponds to a ratio of weight decay term to output error of 1:10000 ($1:10^{4}$).

⁶ Super self-adjusting back-propagation algorithm.

6.3.3.3 Pruning algorithms.

Pruning algorithms try to make neural networks smaller by pruning unnecessary links or units, for different reasons:

- It is possible to find a fitting architecture this way.
- The cost of a net can be reduced (think of runtime, memory and cost for hardware implementation).
- The generalisation can (but need not) be improved.
- Unnecessary input units can be pruned in order to give evidence of the relevance of input values (a kind of feature selection, see chapter 5).

Pruning algorithms can be classified according to two criteria:

- What will be pruned? We distinguish weight pruning and node pruning. Special types of node pruning are input pruning and hidden unit pruning.
- How will it be pruned? The most common possibilities are penalty term algorithms (like Backpropagation with Weight Decay) and sensitivity algorithms.

Sensitivity algorithms, which are used in this thesis, perform training and pruning of a neural net alternately, according to the following algorithm:

1. Choose a reasonable network architecture.
2. Train the net with backpropagation or any similar learning function into a minimum of the network.
3. Compute the saliency (relevance for the performance of the network) of each element (link or unit respectively).
4. Prune the element with the smallest saliency.
5. Retrain the net (into a minimum again).
6. If the net error is not too big, repeat the procedure from step 3 on.
7. Recreate the last pruned element in order to achieve a small net error again.

Figure 6.6. General pruning algorithm.

For the experiments carried out during this study, the Magnitude Based Pruning algorithm is employed. This is the simplest weight-pruning algorithm. After each training, the link with the smallest weight is removed. Thus the saliency of a link is just the absolute size of its weight. Though this method is very simple, it rarely yields worse results than the more sophisticated algorithms. The (subordinated) learning method employed in step 2 (figure 6.6) is set, for our purposes, to the Standard Backpropagation algorithm.

There are two criteria for stopping the pruning, both based on the error after retraining.
This error must not exceed:

- the error before the first pruning by more than a certain percentage, set by the user in the SNNS field “Maximum error increase in %:” (default setting = 10), and
- the absolute SSE value given in the field “Maximum accepted SSE” (default setting = 5).

SNNS also allows the number of epochs of the subordinated learning function to be selected separately for the first training and for each retraining (default settings = 1000 and 100 respectively). The training, however, stops when the absolute error falls below the “Minimum error to stop” (default setting = 1). This prevents the net from overtraining.

For the experiments made during this thesis, all the parameters mentioned above are set to their default values. However, since the subordinated function also has its own parameters (see section 6.3.2.1), variations of them are tried in different experiments.

6.3.3.4 Multiple step vs. One step procedure.

The multiple step method is not exactly a learning algorithm but a training procedure. When a neural network is trained with one of the previously explained algorithms, the user must fix a learning rate η. The selection of this parameter strongly influences the convergence of the network: small learning rates lead to long convergence times, while large learning rates can cause oscillation in the proximity of a local minimum (figure 6.5). As said in section 6.3.3.2, Rprop tries to solve this problem. However, another way of avoiding it is attempted in this work, based on a script written by Dr. Vicky Lam that allows a choice between two types of training: one step and multiple step.

In the one step case, the network is trained with a fixed learning rate; this can be considered “the ordinary case”. Training stops when the number of training epochs reaches 200 or when the mean square error on the evaluation set in the previous epoch is lower than in the current epoch.
That is, the network has reached a local minimum and training should stop before its results worsen. The script automatically tries learning rates from 0.1 to 1 in steps of 0.02 (0.1, 0.12, … 0.98, 1).

In contrast, instead of carrying out the training in a single stage (one step) with a fixed learning rate, the multiple step procedure uses four different stages. In every step, the network is trained until the number of training epochs reaches 50 or until the mean square error on the evaluation set in the previous epoch is lower than in the current epoch. Once the training has stopped, the resulting network is retrained with the learning rate of the next step. Sometimes the results are best after the last step, and sometimes a better performance is achieved after one of the previous training steps.

In the original script by Dr. Vicky Lam, only two base learning algorithms were implemented: Vanilla Backpropagation and Backpropagation with Chunkwise Update. In this thesis, a third learning algorithm, Rprop, was added to the script.

The learning rates corresponding to each stage of the algorithm are as follows:

1. First step: η = 1
2. Second step: η = 0.5
3. Third step: η = 0.1
4. Fourth step: η = 0.05

The remaining parameters are fixed, for both one step and multiple step, within each base learning algorithm:

- Standard Backpropagation: dmax = 0.1
- Chunkwise Backpropagation: dmax = 0.1 and N = 50
- Rprop: $\Delta_{max} = 50$ and $\alpha = 4$

6.3.4 Activation functions

Activation functions for the hidden units are needed to introduce non-linearity into the network. Without non-linearity, hidden units would not make nets more powerful than plain perceptrons (which do not have any hidden units, just input and output units). The reason is that a linear function of linear functions is again a linear function.
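As a brief worked illustration (the weight matrices $W_1$, $W_2$ and bias vectors $b_1$, $b_2$ are notation introduced here for the example, not taken from the text), consider a network with input $x$ and one hidden layer whose units are purely linear:

$$
y = W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\, x + (W_2 b_1 + b_2) = W' x + b'
$$

The hidden layer therefore collapses into a single linear layer with weights $W' = W_2 W_1$ and bias $b' = W_2 b_1 + b_2$, and adds no representational power.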
However, it is the non-linearity (i.e., the capability to represent non-linear functions) that makes multilayer networks so powerful.

There are two main classes of activation functions: sigmoid and threshold.

Figure 6.7. Threshold (a) and sigmoid (b) activation functions.

The threshold or step function corresponds to figure 6.7 (a). There is a linear summation of the inputs and nothing happens until the threshold θ is reached, at which point the neuron becomes active (i.e., shows a certain level of activation). Such units are often called linear threshold units.

The sigmoid function is so called because it is shaped like one form of the Greek letter sigma, as illustrated in figure 6.7 (b). It is, in essence, a smooth version of a step function. It is zero for low input. At some point it starts rising rapidly and then, at even higher levels of input, it saturates. This saturation property can be observed in nature, where the firing rates of neurons are limited by biological factors. The slope β (also called gain) of the sigmoid function can be changed: the larger β, the steeper the slope and the more closely the function approximates the threshold function. Its purpose within an artificial neurone is to generate a degree of non-linearity between the neurone's input and output. Sigmoidal functions such as the logistic and tanh (hyperbolic tangent) functions, together with the Gaussian function, are the most common choices.

For hidden units, sigmoid activation functions are usually preferable to threshold activation functions. Networks with threshold units are difficult to train because the error function is stepwise constant; hence the gradient either does not exist or is zero, making it impossible to use backprop or more efficient gradient-based training methods. With sigmoid units, a very small change in the weights will usually produce a change in the outputs, which makes it possible to tell whether that change in the weights is good or bad.
With threshold units, a small change in the weights will often produce no change in the outputs.

In addition, DasGupta and Schnitger conducted a comparison study [Das93] of different activation functions in terms of efficiency and quality of approximation. They conclude that the standard sigmoid is actually more powerful than the binary threshold, even when computing boolean functions.

Despite the agreement among experts that sigmoidal activation functions are optimal for neural network training, the selection of an adequate activation function remains a wide field of research (see [Duc01, Jan01]).

By means of the activation function, a new activation is computed from the output of preceding units, usually multiplied by the weights connecting these predecessor units with the current unit, the old activation of the unit and its bias. The general formula is:

$$
a_j(t+1) = f_{act}\big(net_j(t),\, a_j(t),\, \theta_j\big)
\tag{6.9}
$$

where

- $a_j(t)$ = activation of unit j in step t.
- $net_j(t)$ = net input in unit j in step t.
- $\theta_j$ = threshold (bias) of unit j.

A considerable number of different activation functions can be found. In this Diploma Thesis we mainly employed the logistic activation function, but a small number of experiments also tried the tanh function. A description of both functions is given in this section.

6.3.4.1 Logistic activation function.

This function computes the network input simply by summing over all weighted activations and then squashing the result with the logistic function $f_{act}(x) = 1/(1 + e^{-x})$. The new activation at time (t+1) lies in the range [0,1]. The variable $\theta_j$ is the threshold of unit j.

The net input is computed with:

$$
net_j(t) = \sum_i w_{ij}\, o_i(t)
\tag{6.10}
$$

This yields the well-known logistic activation function:

$$
a_j(t+1) = \frac{1}{1 + e^{-\left(\sum_i w_{ij}\, o_i(t) - \theta_j\right)}}
\tag{6.11}
$$

where

- $a_j(t)$ = activation of unit j in step t.
- $net_j(t)$ = net input in unit j in step t.
- $o_i(t)$ = output of unit i in step t.
- j = index for some unit in the net.
- i = index of a predecessor of the unit j.
- $w_{ij}$ = weight of the link from unit i to unit j.
- $\theta_j$ = threshold (bias) of unit j.

6.3.4.2 Hyperbolic tangent activation function.

This function has a sigmoid shape similar to that of the logistic function, but its values are spread through the interval [-1, 1] rather than [0, 1]. Its formula reads as follows:

$$
a_j(t+1) = \frac{e^{net_j(t)} - e^{-net_j(t)}}{e^{net_j(t)} + e^{-net_j(t)}}
\tag{6.12}
$$

where

- $a_j(t)$ = activation of unit j in step t.
- $net_j(t)$ = net input in unit j in step t.
- j = index for some unit in the net.

6.3.5 Analysing Functions.

Once the network has produced some outcomes, the way they are interpreted also has a big influence on the global results. Analysis functions are not related to the neural network training itself; they take the output of a fixed, trained network and make decisions. The output of each node in a neural network is a real value in the range [0,1], and the aim of the analysing functions is to decide the meaning of the output vector.

SNNS has three different analysis criteria: 402040, WTA and Band. Each rule has two adjustable parameters, h and l, whose meaning is specific to the given method. The analysis rule will make a correct, wrong or unknown inference. Note that an unclassified output does not allow any conclusion about the input, and therefore no information can be extracted. For some applications, as we found during the preliminary experiments (section 8.1), the categorisation of a pattern into the class “unknown” provides no valuable information. Nevertheless, this class can easily be avoided by modifying the thresholds h and l. The decision rules for these methods are detailed in the following subsections.

6.3.5.1 402040 decision rule.

A pattern is classified correctly if:

- the output of exactly one output unit is ≥ h.
- the teaching output of this unit is the maximum teaching output (> 0) of the pattern.
- the output of all other output units is ≤ l.
A pattern is classified incorrectly if:

- the output of exactly one output unit is ≥ h.
- the teaching output of this unit is NOT the maximum teaching output of the pattern, or there is no teaching output > 0.
- the output of all other units is ≤ l.

A pattern is unclassified in all other cases.

6.3.5.2 WTA (Winner Takes All)

A pattern is classified correctly if:

- there is an output unit with a value greater than the output value of all other output units (this output value is denoted a).
- a > h.
- the teaching output of this unit is the maximum teaching output of the pattern (> 0).
- the output of all other units is < a - l.

A pattern is classified incorrectly if:

- there is an output unit with a value greater than the output value of all other output units (this output value is denoted a).
- a > h.
- the teaching output of this unit is NOT the maximum teaching output of the pattern (> 0).
- the output of all other output units is < a - l.

A pattern is unclassified in all other cases.

6.3.5.3 Band decision rule.

A pattern is classified correctly if for all output units:

- the output is ≥ the teaching output - l.
- the output is ≤ the teaching output + h.

A pattern is classified incorrectly if for all output units:

- the output is < the teaching output - l, or
- the output is > the teaching output + h.

This rule is especially useful when the network has one single output node and the decision, instead of resolving which node is the winner, has to be based on a division of the output range into bands of values, each band being assigned to a different class.

6.3.5.4 Post-analysis method based on thresholds.

This analysis procedure is applied to the neural network outputs in order to place some restrictions on the winner selected by the WTA rule. The analysis is performed using a C program created specifically for this work: confusion_th.
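The WTA rule of section 6.3.5.2, on which this post-analysis builds, can be sketched in Python as follows. This is a minimal illustration rather than the SNNS or confusion_th implementation: the function name `wta_decision`, the returned strings and the default values of h and l are assumptions made for the example.

```python
from typing import Sequence


def wta_decision(outputs: Sequence[float], targets: Sequence[float],
                 h: float = 0.0, l: float = 0.0) -> str:
    """Classify one pattern as 'correct', 'wrong' or 'unknown' (WTA rule sketch).

    outputs: network output values, one per output unit.
    targets: teaching outputs for the same units.
    h, l:    rule parameters; the defaults are placeholders, not SNNS settings.
    """
    winner = max(range(len(outputs)), key=lambda k: outputs[k])
    a = outputs[winner]
    others = [o for k, o in enumerate(outputs) if k != winner]

    # A winner exists only if it is strictly greater than every other output,
    # exceeds h, and all other outputs stay below a - l.
    has_winner = (all(o < a for o in others)
                  and a > h
                  and all(o < a - l for o in others))
    if not has_winner:
        return "unknown"

    # Correct if the winning unit carries the maximum (positive) teaching output.
    best_target = max(targets)
    if best_target > 0 and targets[winner] == best_target:
        return "correct"
    return "wrong"


# Usage with three output units and illustrative parameter values.
decision = wta_decision([0.1, 0.8, 0.2], [0, 1, 0], h=0.5, l=0.3)
```

Raising h or l makes the rule more conservative: more patterns fall into the "unknown" case instead of being counted as correct or wrong.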
After choosing the winner candidate among the outputs through the WTA rule, a decision based on two different thresholds determines whether this value can actually be considered the winner. These thresholds are defined as follows:

- Threshold 1: Minimum value of the output to be considered as the winner. When the winner candidate does not exceed this value, the pattern is classified as neutral. The conceptual idea is that the pattern is not emotive enough to be classified into the winner class.
- Threshold 2: Maximum value of the opposite emotion or emotional groups. When an utterance is classified into one emotion, e.g. angry, the output values for the emotions situated on the opposite side of the axis, e.g. bored and sad for the arousal dimension, must not exceed this value. Otherwise, the winner candidate is classified as neutral. This is based on the experimental observation (see Chapter 8) that the mean output values of opposite emotions are well differentiated, for both the five-output and the three-output case.

6.4 Leave-one-out cross validation

6.4.1 Leave-one-sentence out.

When training a classifier, the amount of data used for training influences the quality of the learned model. Intuitively, if more examples of a class are given, the classifier will tend to construct better generalisations. In order to increase the reliability of the results obtained in the speaker dependent experiments, for which the recorded database was not particularly large, the leave-one-sentence-out procedure is applied for the evaluation.

Suppose we have N patterns to train and test the model. If we divide the set into two subsets, i.e. a training set and a testing set, the results depend on the division and, in addition, the amount of data used for each task is reduced. With the leave-one-out method these problems are to some extent solved. The method takes N-1 patterns to train the classifier and then tests it with the remaining pattern.
This procedure is repeated for all the available patterns from 1 to N. This way, the classifier is always trained with almost all of the data (N-1 patterns) and, after the whole iteration, has been tested on the complete set.

6.4.2 Leave-one-speaker out

In order to evaluate the speaker independence of the classifier, it should be tested on a completely unknown subject. Thus, from all the available speakers, some should be used for training while the remaining ones are used for testing. Problems similar to those found in 6.4.1 arise. In order to make the most of the available data, the leave-one-speaker-out procedure is employed for the speaker independent experiments. Suppose we have S speakers; then S-1 are used during the training step and the resulting classifier is tested on the remaining speaker. This is repeated for all the speakers, and statistics are computed over the results of the whole set.
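Both procedures of section 6.4 follow the same scheme and differ only in what is held out: a single pattern or all patterns of one speaker. A minimal Python sketch of that scheme is given below. The function `leave_one_group_out` and the callables `train_fn` and `predict_fn` are hypothetical placeholders for whatever classifier is being evaluated; they are not part of SNNS or of the scripts used in this work.

```python
from typing import Any, Callable, Hashable, List, Sequence


def leave_one_group_out(
    patterns: Sequence[Any],
    labels: Sequence[Any],
    groups: Sequence[Hashable],
    train_fn: Callable[[List[Any], List[Any]], Any],  # hypothetical: returns a model
    predict_fn: Callable[[Any, Any], Any],            # hypothetical: returns a label
) -> float:
    """Return the accuracy accumulated over all held-out groups."""
    correct = 0
    # dict.fromkeys keeps each distinct group once, in order of appearance.
    for held_out in dict.fromkeys(groups):
        train_idx = [k for k, g in enumerate(groups) if g != held_out]
        test_idx = [k for k, g in enumerate(groups) if g == held_out]
        # Train on every group except the held-out one.
        model = train_fn([patterns[k] for k in train_idx],
                         [labels[k] for k in train_idx])
        # Test on the held-out group only.
        correct += sum(predict_fn(model, patterns[k]) == labels[k]
                       for k in test_idx)
    # After the whole iteration every pattern has been tested exactly once.
    return correct / len(patterns)
```

Passing one speaker identifier per pattern as `groups` gives the leave-one-speaker-out procedure; passing `list(range(N))`, so that every pattern forms its own group, gives the leave-one-out procedure of section 6.4.1.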