Appendix: Mathematical description of the Machine Learning methods

Discriminant classifier

- The Linear Discriminant divides the feature space by a hyperplane decision surface that maximizes the ratio of between-class variance to within-class variance. Given a data vector x, a discriminant function that is a linear combination of the components of x can be written as

    g(\mathbf{x}) = \mathbf{w}^{t}\mathbf{x} + w_0    (1)

where w is the weight vector and w_0 the bias or threshold weight. A two-category linear classifier implements the following decision rule: decide \omega_1 if g(x) > 0 and \omega_2 if g(x) < 0. Thus, x is assigned to \omega_1 if the inner product \mathbf{w}^{t}\mathbf{x} exceeds the threshold -w_0, and to \omega_2 otherwise. If g(x) = 0, the assignment is undefined.

- The Quadratic Discriminant divides the feature space by a hyperspherical, hyperellipsoidal or hyperhyperboloidal decision surface. The linear discriminant function g(x) can be written as

    g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i    (2)

where the coefficients w_i are the components of the weight vector w. By adding additional terms involving the products of pairs of components of x, we obtain the quadratic discriminant function

    g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} x_i x_j    (3)

Since x_i x_j = x_j x_i, we can assume that w_{ij} = w_{ji} with no loss of generality. Thus, the quadratic discriminant function has an additional d(d+1)/2 coefficients at its disposal with which to produce more complicated separating surfaces. The separating surface defined by g(x) = 0 is a second-degree or hyperquadric surface.

- The Mahalanobis Discriminant employs the Mahalanobis distance

    D^2 = (\mathbf{x} - \boldsymbol{\mu})' \, \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})    (4)

where x is the data vector, \mu is the centroid of a certain class and \Sigma is the covariance matrix of the data distribution, and assigns each datum x to the class whose centroid \mu minimizes D^2.
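The paper gives no reference implementation for these discriminants; the following NumPy sketch shows how Eqs. (1), (3) and (4) could be evaluated once the weights, class centroids and covariance matrix have been estimated from training data. All function and variable names are hypothetical.

import numpy as np

def linear_discriminant(x, w, w0):
    # g(x) = w^t x + w0 (Eq. 1): decide omega_1 if g(x) > 0, omega_2 if g(x) < 0
    return w @ x + w0

def quadratic_discriminant(x, w0, w, W):
    # g(x) = w0 + sum_i w_i x_i + sum_ij w_ij x_i x_j (Eq. 3); W is symmetric
    return w0 + w @ x + x @ W @ x

def mahalanobis_classify(x, centroids, Sigma):
    # Assign x to the class whose centroid minimizes D^2 (Eq. 4)
    Sigma_inv = np.linalg.inv(Sigma)
    d2 = [(x - mu) @ Sigma_inv @ (x - mu) for mu in centroids]
    return int(np.argmin(d2))

# Illustrative call with two hypothetical class centroids in 2-D
print(mahalanobis_classify(np.array([1.0, 0.5]),
                           [np.array([0.0, 0.0]), np.array([2.0, 1.0])],
                           np.eye(2)))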
Support Vector Machine

The support vector machine (SVM) [31] is a very effective method for general-purpose pattern recognition. We are given training data \{x_1, \ldots, x_n\} that are vectors in some space X \subseteq R^d, together with their labels \{y_1, \ldots, y_n\}, where y_i \in \{-1, 1\}. In their simplest form, SVMs are hyperplanes that separate the training data by a maximal margin. The equation of the hyperplane is of the form

    \mathbf{w} \cdot \mathbf{x} + b = 0    (5)

where the vector w is perpendicular to the hyperplane, "\cdot" denotes an inner product and b is an additional parameter. The training instances that lie closest to the hyperplane are called support vectors, and the distance between those instances and the separating hyperplane is called the geometric margin of the classifier. More generally, SVMs rely on preprocessing the data to represent patterns in a dimension typically much higher than that of the original feature space. The original training data can be projected from the space X to a higher-dimensional feature space F via a Mercer kernel operator K. In other words, we consider the set of classifiers of the form

    f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i K(\mathbf{x}_i, \mathbf{x})    (6)

When K satisfies Mercer's condition [2] we can write K(\mathbf{u}, \mathbf{v}) = \Phi(\mathbf{u}) \cdot \Phi(\mathbf{v}), where \Phi : X \rightarrow F. We can then rewrite f as

    f(\mathbf{x}) = \mathbf{w} \cdot \Phi(\mathbf{x}), \quad \text{where} \quad \mathbf{w} = \sum_{i=1}^{n} \alpha_i \Phi(\mathbf{x}_i)    (7)

After projecting the data, the SVM computes the \alpha_i that correspond to the maximal-margin hyperplane in F. By choosing different kernel functions we can implicitly project the training data from X into spaces F for which hyperplanes in F correspond to more complex decision boundaries in the original space X.

Two commonly used kernels are the polynomial kernel, K(\mathbf{u}, \mathbf{v}) = (\mathbf{u} \cdot \mathbf{v} + 1)^o, which induces polynomial boundaries of degree o in the original space X, and the radial basis function or Gaussian kernel, K(\mathbf{u}, \mathbf{v}) = e^{-\sigma (\mathbf{u} - \mathbf{v}) \cdot (\mathbf{u} - \mathbf{v})}, which induces boundaries by placing weighted Gaussians upon key training instances [30].

If the training set is not linearly separable, the standard approach is to allow the decision margin to make a few mistakes (soft-margin SVM). We then pay a cost for each misclassified example, which depends on how far it is from meeting the margin requirement. To implement this, we introduce slack variables \xi_i: a non-zero value of \xi_i allows \mathbf{x}_i not to meet the margin requirement at a cost proportional to the value of \xi_i. The formulation of the SVM optimization problem with slack variables is: find w, b and \xi_i \geq 0 such that

    \frac{1}{2} \mathbf{w}^{T} \mathbf{w} + C \sum_{i} \xi_i \ \text{is minimized, and for all } (\mathbf{x}_i, y_i): \ y_i (\mathbf{w}^{T} \mathbf{x}_i + b) \geq 1 - \xi_i    (8)

The optimization thus trades off how wide the margin can be made against how many points have to be moved around to allow this margin. The margin can be less than 1 for a point \mathbf{x}_i by setting \xi_i > 0, but one then pays a penalty of C\xi_i in the minimization for having done so. The sum of the \xi_i gives an upper bound on the number of training errors, so soft-margin SVMs minimize training error traded off against margin. If the error penalty factor C is close to 0, we do not pay much for points violating the margin constraint; the cost function can then be minimized by making w a small vector, which is equivalent to creating a very wide safety margin around the decision boundary (but letting many points violate it). If C is close to infinity, a high price is paid for points that violate the margin constraint, and this case approaches the hard-margin formulation described above; the drawback is then a high sensitivity to outlier points in the training data [8].
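The paper does not name an implementation; as an illustration of the kernel and penalty choices just described, the sketch below uses scikit-learn's SVC (our choice), with placeholder data and parameter values.

import numpy as np
from sklearn.svm import SVC

# Hypothetical two-class training set: X is (n, d), y in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

# Polynomial kernel K(u, v) = (u . v + 1)^o with degree o = 3 (coef0 = 1 gives the "+1")
poly_svm = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0).fit(X, y)

# Gaussian kernel; scikit-learn's gamma plays the role of sigma in K(u, v) = exp(-sigma ||u - v||^2).
# Small C: wide soft margin, many violations tolerated. Large C: close to the hard-margin case.
rbf_soft = SVC(kernel="rbf", gamma=0.5, C=0.1).fit(X, y)
rbf_hard = SVC(kernel="rbf", gamma=0.5, C=1e6).fit(X, y)

# The softer margin typically keeps more support vectors than the near-hard margin
print(len(rbf_soft.support_), len(rbf_hard.support_), poly_svm.predict(X[:5]))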
AdaBoost

The goal of boosting is to improve the accuracy of any given learning algorithm. In boosting we first create a classifier and then add new component classifiers to form an ensemble whose joint decision rule has arbitrarily high accuracy on the training set [4]. Each classifier needs only to be a weak learner, that is, to have an accuracy only slightly better than chance as a minimum requirement. There are a number of variations on basic boosting. The most popular, AdaBoost (from "Adaptive Boosting"), allows the designer to continue adding weak learners until some desired low training error has been achieved. It initially chooses the learner that classifies the most data correctly. In the next step, the data set is reweighted to increase the "importance" of misclassified samples. This process continues, and at each step the weight of each weak learner among the other learners is determined. Thus, in AdaBoost each training pattern receives a weight that determines its probability of being selected for the training set of an individual component classifier. If a training pattern is accurately classified, its chance of being used again in a subsequent component classifier is reduced; on the contrary, if the pattern is not accurately classified, its chance of being used again is raised. In this way, AdaBoost "focuses in" on the informative or "difficult" patterns. Specifically, we initialize the weights across the training set to be uniform. On each iteration k, we draw a training set at random according to these weights and train the component classifier C_k on the selected patterns. Next we increase the weights of the training patterns misclassified by C_k and decrease the weights of those correctly classified by C_k. Patterns chosen according to this new distribution are used to train the next classifier, C_{k+1}, and the process is iterated.

We let the patterns and their labels in the full training set D be denoted x_i and y_i, respectively, and let W_k(i) be the k-th (discrete) distribution over all these training samples. The AdaBoost procedure is then:

    I)     begin initialize D = \{(x_1, y_1), \ldots, (x_n, y_n)\}, k_{max}, W_1(i) = 1/n, i = 1, \ldots, n    (9)
    II)    k = 0
    III)   do k = k + 1
    IV)        train weak learner C_k using D sampled according to W_k(i)
    V)         E_k = training error of C_k measured on D using W_k(i)
    VI)        \alpha_k = \frac{1}{2} \ln \left[ (1 - E_k) / E_k \right]
    VII)       W_{k+1}(i) = \frac{W_k(i)}{Z_k} \times \begin{cases} e^{-\alpha_k} & \text{if } h_k(x_i) = y_i \text{ (correctly classified)} \\ e^{\alpha_k} & \text{if } h_k(x_i) \neq y_i \text{ (incorrectly classified)} \end{cases}
    VIII)  until k = k_{max}
    IX)    return C_k and \alpha_k for k = 1 to k_{max} (ensemble of classifiers with weights)
    X)     end

Note that in line V the error for classifier C_k is determined with respect to the distribution W_k(i) over D on which it was trained. In line VII, Z_k is simply a normalizing constant computed to ensure that W_{k+1}(i) represents a true distribution, and h_k(x_i) is the category label (+1 or -1) given to pattern x_i by component classifier C_k. Naturally, the loop termination of line VIII could instead use the criterion of a sufficiently low training error of the ensemble classifier. The final classification decision for a test point x is based on a discriminant function that is merely the weighted sum of the outputs of the component classifiers:

    g(\mathbf{x}) = \sum_{k=1}^{k_{max}} \alpha_k h_k(\mathbf{x})    (10)

The classification decision for this two-category case is then simply \text{Sign}(g(\mathbf{x})). Except in pathological cases, as long as each component classifier is a weak learner, the total training error of the ensemble can be made arbitrarily low by setting the number of component classifiers, k_{max}, sufficiently high.
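The procedure above translates almost line for line into Python. The sketch below is one possible transcription, assuming decision stumps from scikit-learn as the weak learners and a weighted fit in place of the weighted resampling of line IV; all names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, k_max):
    # Train an AdaBoost ensemble on labels y in {-1, +1} (sketch of the procedure in Eq. 9)
    n = len(y)
    W = np.full(n, 1.0 / n)                        # line I: W_1(i) = 1/n
    learners, alphas = [], []
    for k in range(k_max):                         # lines III-VIII
        C_k = DecisionTreeClassifier(max_depth=1)  # weak learner (our choice: a stump)
        C_k.fit(X, y, sample_weight=W)             # line IV (weighted fit instead of sampling)
        h = C_k.predict(X)
        E_k = np.sum(W[h != y])                    # line V: weighted training error
        if E_k <= 0 or E_k >= 0.5:                 # degenerate weak learner; stop early
            break
        alpha_k = 0.5 * np.log((1 - E_k) / E_k)    # line VI
        W *= np.exp(-alpha_k * y * h)              # line VII: e^{-a} if correct, e^{+a} if not
        W /= W.sum()                               # Z_k normalization
        learners.append(C_k)
        alphas.append(alpha_k)
    return learners, alphas                        # line IX

def adaboost_predict(X, learners, alphas):
    # g(x) = sum_k alpha_k h_k(x) (Eq. 10); the decision is Sign(g(x))
    g = sum(a * C.predict(X) for a, C in zip(alphas, learners))
    return np.sign(g)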
Supervised Neural Network

An Artificial Neural Network is an adaptive, most often nonlinear system that learns to perform a function (an input/output map) from a data set (inductive learning). Adaptive means that the system parameters are changed during operation (the training phase). After the training phase, the Artificial Neural Network parameters are fixed and the system is deployed to solve the problem at hand (the testing phase). The Artificial Neural Network is built with a systematic step-by-step procedure to optimize a performance criterion or to follow some implicit internal constraint, which is commonly referred to as the learning rule. The nonlinear nature of the neural network processing elements (PEs) provides the system with great flexibility to achieve practically any desired input/output map. An input is presented to the neural network and a corresponding desired or target response is set at the output (when this is the case the training is called supervised). An error is calculated as the difference between the desired response and the system output. This error information is fed back to the system, which adjusts its parameters in a systematic fashion (the learning rule). The process is repeated until the performance becomes acceptable.

The structural unit of Neural Networks is a functional model of the biological neuron, called the Perceptron. The synapses of the neuron are modeled as weights: the strength of the connection between an input and a neuron is characterized by the value of the weight. Negative weight values reflect inhibitory connections, while positive values designate excitatory connections. An adder sums up all the inputs modified by their respective weights, and an activation function controls the amplitude of the output of the neuron. An acceptable range of output is usually between 0 and 1, or -1 and 1. From this model the internal activity of the neuron can be written as

    v_k = \sum_{j=1}^{p} w_{kj} x_j    (11)

The output of the neuron, y_k, is therefore the outcome of some activation function applied to the value of v_k. As mentioned previously, the activation function acts as a squashing function, such that the output of a neuron in a neural network lies between certain values (usually 0 and 1, or -1 and 1). The most common activation functions, denoted by \varphi(\cdot), are the Threshold Function, the Piecewise-Linear Function, and the Log-sigmoid Function.

Having analyzed the properties of the basic processing unit of an artificial neural network, we now focus on the pattern of connections between the units and the propagation of data. Regarding this pattern of connections, the main distinction is between feed-forward neural networks, where the data flow from input to output units is strictly feed-forward, and recurrent neural networks, which do contain feedback connections [22]. A neural network has to be configured such that the application of a set of inputs produces the desired set of outputs. Various methods exist to set the strengths of the connections. One way is to set the weights explicitly, using a priori knowledge. Another way is to 'train' the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule. We can categorize the learning situations into three distinct sorts. We speak of supervised learning when the network is trained by providing it with input and matching output patterns; these input-output pairs can be provided by an external teacher or by the system that contains the neural network (self-supervised). We have unsupervised learning when an (output) unit is trained to respond to clusters of patterns within the input; in this paradigm the system is supposed to discover statistically salient features of the input population. Finally, reinforcement learning can be performed, which may be considered an intermediate form of the above two types of learning [23].
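To make Eq. (11) and the squashing step concrete, the following minimal NumPy sketch evaluates a single neuron's forward pass under the three activation functions named above. The particular piecewise-linear form and all numerical values are illustrative assumptions.

import numpy as np

def threshold(v):
    # Threshold function: output 1 if v >= 0, else 0
    return np.where(v >= 0, 1.0, 0.0)

def piecewise_linear(v):
    # Piecewise-linear function (one common convention): linear in [-0.5, 0.5], saturating at 0 and 1
    return np.clip(v + 0.5, 0.0, 1.0)

def log_sigmoid(v):
    # Log-sigmoid function: smooth squashing into (0, 1)
    return 1.0 / (1.0 + np.exp(-v))

def neuron_output(x, w, phi):
    # v_k = sum_j w_kj x_j (Eq. 11), then y_k = phi(v_k)
    v = w @ x
    return phi(v)

# Illustrative values: p = 3 inputs feeding one neuron
x = np.array([0.2, -0.4, 0.9])
w = np.array([0.5, -1.0, 0.3])   # a negative weight models an inhibitory connection
for phi in (threshold, piecewise_linear, log_sigmoid):
    print(phi.__name__, neuron_output(x, w, phi))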