Appendix: Mathematical description of the Machine Learning methods

Discriminant classifier
- The Linear Discriminant divides the feature space by a hyperplane decision surface that maximizes the ratio of between-class variance to within-class variance. Given a data vector x, a discriminant function that is a linear combination of the components of x can be written as

$g(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + w_0$    (1)

where w is the weight vector and $w_0$ the bias or threshold weight. A two-category linear classifier implements the following decision rule: decide $\omega_1$ if g(x) > 0 and $\omega_2$ if g(x) < 0. Thus, x is assigned to $\omega_1$ if the inner product $\mathbf{w}^{T}\mathbf{x}$ exceeds the threshold $-w_0$, and to $\omega_2$ otherwise. If g(x) = 0, the assignment is undefined.
- The Quadratic Discriminant divides the feature space by a hyperquadric decision surface such as a hypersphere, hyperellipsoid, or hyperhyperboloid. The linear discriminant function g(x) can be written as

$g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i$    (2)

where the coefficients $w_i$ are the components of the weight vector w. By adding additional terms involving the products of pairs of components of x, we obtain the quadratic discriminant function

$g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij} x_i x_j$    (3)
Since $x_i x_j = x_j x_i$, we can assume that $w_{ij} = w_{ji}$ with no loss of generality. Thus, the quadratic discriminant function has an additional $\frac{d(d+1)}{2}$ coefficients at its disposal with which to produce more complicated separating surfaces. The separating surface defined by g(x) = 0 is a second-degree or hyperquadric surface.
- The Mahalanobis Discriminant employs the Mahalanobis distance

$D^2 = (\mathbf{x} - \boldsymbol{\mu})^{T}\, \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$    (4)

where x is the data vector, µ is the centroid of a given class, and Σ is the covariance matrix of the data distribution; each datum x is assigned to the class whose centroid µ minimizes $D^2$. A minimal sketch of all three decision rules follows this list.
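To make the three decision rules concrete, here is a minimal NumPy sketch covering Eqs. (1), (3), and (4). All numerical values (class means, weights, bias) are assumptions chosen for illustration, not parameters from the study.

```python
import numpy as np

# Illustrative two-class setup (all numbers are assumptions for this sketch).
rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 1.0, size=(50, 2))   # class omega_1 samples
X2 = rng.normal([3, 3], 1.0, size=(50, 2))   # class omega_2 samples

# Linear discriminant, Eq. (1): g(x) = w^T x + w0; decide omega_1 if g(x) > 0.
w = np.array([1.0, 1.0])      # weight vector (assumed for illustration)
w0 = -3.0                     # bias / threshold weight
def g_linear(x):
    return w @ x + w0

# Quadratic discriminant, Eq. (3): adds pairwise product terms w_ij x_i x_j.
W = np.array([[0.5, 0.1], [0.1, 0.5]])  # symmetric, so w_ij = w_ji
def g_quadratic(x):
    return w0 + w @ x + x @ W @ x

# Mahalanobis discriminant, Eq. (4): assign x to the class centroid mu
# that minimizes D^2 = (x - mu)^T Sigma^{-1} (x - mu).
mus = [X1.mean(axis=0), X2.mean(axis=0)]
Sigma_inv = np.linalg.inv(np.cov(np.vstack([X1, X2]).T))
def mahalanobis_class(x):
    d2 = [(x - mu) @ Sigma_inv @ (x - mu) for mu in mus]
    return int(np.argmin(d2))  # 0 -> omega_1, 1 -> omega_2

x_test = np.array([1.0, 1.5])
print(g_linear(x_test) > 0, g_quadratic(x_test) > 0, mahalanobis_class(x_test))
```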
Support Vector Machine
The support vector machine (SVM) [31] is a very effective method for general-purpose pattern recognition. We are given training data $\{x_1, \ldots, x_n\}$ that are vectors in some space $X \subseteq \mathbb{R}^d$, together with their labels $\{y_1, \ldots, y_n\}$, where $y_i \in \{-1, 1\}$. In their simplest form, SVMs are hyperplanes that separate the training data by a maximal margin. The equation of the hyperplane is of the form

$\mathbf{w} \cdot \mathbf{x} + b = 0$    (5)

where the vector w is perpendicular to the hyperplane, "·" denotes an inner product, and b is an additional parameter. The training instances that lie closest to the hyperplane are called support vectors, and the distance between those instances and the separating hyperplane is called the geometric margin of the classifier.
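As a small illustration of the geometric margin (all values below are assumptions, not study data), one can compute $y_i(\mathbf{w}\cdot\mathbf{x}_i + b)/\|\mathbf{w}\|$ for each training point and take the minimum over the set:

```python
import numpy as np

# Given a separating hyperplane w . x + b = 0 (values assumed for illustration),
# the geometric margin of a labeled point (x_i, y_i) is y_i (w . x_i + b) / ||w||;
# the classifier's margin is the minimum over the training set.
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]])
y = np.array([1, -1, 1])

margins = y * (X @ w + b) / np.linalg.norm(w)
print("geometric margin of classifier:", margins.min())
```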
More generally, SVMs rely on preprocessing the data to represent patterns in a space of dimension typically much higher than that of the original feature space. The original training data can be projected from space X to a higher dimensional feature space F via
a Mercer kernel operator K. In other words, we consider the set of classifiers of the form:

$f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i K(\mathbf{x}_i, \mathbf{x})$    (6)
When K satisfies Mercer's condition [2] we can write $K(\mathbf{u}, \mathbf{v}) = \Phi(\mathbf{u}) \cdot \Phi(\mathbf{v})$, where $\Phi : X \rightarrow F$. We can then rewrite f as:

$f(\mathbf{x}) = \mathbf{w} \cdot \Phi(\mathbf{x}), \quad \text{where } \mathbf{w} = \sum_{i=1}^{n} \alpha_i \Phi(\mathbf{x}_i)$    (7)
After projecting the data, the SVM computes the $\alpha_i$ values that correspond to the maximal margin hyperplane in F.
By choosing different kernel functions we can implicitly project the training data from X into spaces F for
which hyperplanes in F correspond to more complex decision boundaries in the original space X.
Two commonly used kernels are the polynomial kernel, $K(\mathbf{u}, \mathbf{v}) = (\mathbf{u} \cdot \mathbf{v} + 1)^{o}$, which induces polynomial boundaries of degree o in the original space X, and the radial basis function or Gaussian kernel, $K(\mathbf{u}, \mathbf{v}) = e^{-\sigma (\mathbf{u}-\mathbf{v}) \cdot (\mathbf{u}-\mathbf{v})}$, which induces boundaries by placing weighted Gaussians upon key training instances [30].
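A minimal sketch of the two kernels as plain functions; the parameter values (degree o, width σ) are assumptions chosen only for illustration:

```python
import numpy as np

def polynomial_kernel(u, v, o=3):
    """Polynomial kernel K(u, v) = (u . v + 1)^o."""
    return (np.dot(u, v) + 1.0) ** o

def gaussian_kernel(u, v, sigma=0.5):
    """Gaussian (RBF) kernel K(u, v) = exp(-sigma * (u - v) . (u - v))."""
    diff = np.asarray(u) - np.asarray(v)
    return np.exp(-sigma * np.dot(diff, diff))

u, v = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(polynomial_kernel(u, v), gaussian_kernel(u, v))
```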
If the training set is not linearly separable, the standard approach is to allow the decision margin to
make a few mistakes (Soft margin SVM). We then pay a cost for each misclassified example, which
depends on how far it is from meeting the margin requirement. To implement this, we introduce slack
variables ξi. A non-zero value for ξi allows xi not to meet the margin requirement at a cost proportional
to the value of ξi.
The formulation of the SVM optimization problem with slack variables is: find $\mathbf{w}$, $b$, $\xi_i \geq 0$ such that

$\frac{1}{2}\mathbf{w}^{T}\mathbf{w} + C\sum_i \xi_i$ is minimized, subject to $y_i(\mathbf{w}^{T}\mathbf{x}_i + b) \geq 1 - \xi_i$ for all $(\mathbf{x}_i, y_i)$    (8)
The optimization problem is then trading off how wide it can make the margin versus how many points have to be moved around to allow this margin. The margin can be less than 1 for a point $\mathbf{x}_i$ by setting $\xi_i > 0$, but then one pays a penalty of $C\xi_i$ in the minimization for having done so. The sum of the $\xi_i$ gives an upper bound on the number of training errors. Soft-margin SVMs thus minimize training error traded off against margin. If the error penalty factor C is close to 0, then we do not pay much for points violating the margin constraint, and the cost function can be minimized by setting w to be a small vector; this is equivalent to creating a very wide safety margin around the decision boundary (but having many points violate this safety margin). If C is close to infinity, then a lot is paid for points that violate the margin constraint, and this case approaches the previously described hard-margin formulation; the drawback here is the high sensitivity to outlier points in the training data [8].
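A minimal sketch of the effect of C using scikit-learn's SVC; the toy data and parameter values are assumptions for illustration and are not the classifier configuration used in the study:

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data with one outlier (all values assumed for the sketch).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)
X[0] = [2.5, 2.5]  # a mislabeled outlier placed inside the positive cluster

# Small C: wide margin, tolerant of violations; large C: near hard-margin,
# sensitive to the outlier.
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy={clf.score(X, y):.2f}")
```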
AdaBoost
The goal of boosting is to improve the accuracy of any given learning algorithm. In boosting we
first create a classifier, and then add new component classifiers to form an ensemble whose joint
decision rule has arbitrarily high accuracy on the training set [4]. Each classifier needs only to be a
weak learner – that is, have accuracy only slightly better than chance as a minimum requirement.
There are a number of variations on basic boosting. The most popular, AdaBoost – from “Adaptive
Boosting” – allows the designer to continue adding weak learners until some desired low training
error has been achieved.
It initially chooses the learner that classifies the most data correctly. In the next step, the data set is reweighted to increase the "importance" of misclassified samples. This process continues, and at each step the weight of each weak learner among the other learners is determined.
Thus, in AdaBoost each training pattern receives a weight that determines its probability of being
selected for a training set for an individual component classifier. If a training pattern is accurately
classified, then its chance of being used again in a subsequent component classifier is reduced; on
the contrary, if the pattern is not accurately classified, then its chance of being used again is raised. In
this way, AdaBoost “focuses in” on the informative or “difficult” patterns. Specifically, we initialize the
weights across the training set to be uniform. On each iteration k, we draw a training set at random
according to these weights, and then we train component classifier Ck on the selected patterns. Next
we increase weights of training patterns misclassified by Ck and decrease weights of the patterns
correctly classified by Ck. Patterns chosen according to this new distribution are used to train the next
classifier, Ck+1, and the process is iterated.
We let the patterns and their labels in the full training set D be denoted $x_i$ and $y_i$, respectively, and let $W_k(i)$ be the k-th (discrete) distribution over all these training samples. Thus the AdaBoost procedure is:
I)    begin initialize $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, $k_{max}$, $W_1(i) = 1/n$, $i = 1, \ldots, n$
II)   $k = 0$
III)  do $k = k + 1$
IV)   train weak learner $C_k$ using $D$ sampled according to $W_k(i)$
V)    $E_k$ = training error of $C_k$ measured on $D$ using $W_k(i)$
VI)   $\alpha_k = \frac{1}{2}\ln\left(\frac{1 - E_k}{E_k}\right)$
VII)  $W_{k+1}(i) = \frac{W_k(i)}{Z_k} \times \begin{cases} e^{-\alpha_k} & \text{if } h_k(x_i) = y_i \text{ (correctly classified)} \\ e^{\alpha_k} & \text{if } h_k(x_i) \neq y_i \text{ (incorrectly classified)} \end{cases}$
VIII) until $k = k_{max}$
IX)   return $C_k$ and $\alpha_k$ for $k = 1$ to $k_{max}$ (ensemble of classifiers with weights)
X)    end
    (9)
Note that in line V the error for classifier Ck is determined with respect to the distribution Wk (i ) over
D on which it was trained. In line VII, Zk is simply a normalizing constant computed to ensure that Wk (i )
represents a true distribution, and hk(xi) is the category label (+1 or -1) given to pattern xi by
component classifier Ck. Naturally, the loop termination of line VIII could instead use the criterion of
sufficiently low training error of the ensemble classifier.
The final classification decision for a test point x is based on a discriminant function that is simply the weighted sum of the outputs of the component classifiers:

$g(\mathbf{x}) = \sum_{k=1}^{k_{max}} \alpha_k h_k(\mathbf{x})$    (10)
The classification decision for this two-category case is then simply $\mathrm{sign}(g(\mathbf{x}))$.
Except in pathological cases, as long as each component classifier is a weak learner, the total training
error of the ensemble can be made arbitrarily low by setting the number of component classifiers, kmax,
sufficiently high.
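A minimal NumPy sketch of the loop above, using decision stumps (single-feature thresholds) as weak learners. Note one assumption: instead of literally resampling D according to $W_k(i)$ (step IV), this sketch uses the common equivalent variant of minimizing the weighted training error directly; the toy data are also assumptions for illustration.

```python
import numpy as np

# Toy data (assumed): labels follow the sign of x0 + x1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

def train_stump(X, y, W):
    """Pick the (feature, threshold, polarity) stump with lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
                err = W[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

n, k_max = len(X), 10
W = np.full(n, 1.0 / n)                        # step I: uniform initial weights
stumps, alphas = [], []
for k in range(k_max):                         # steps II-III and VIII
    E_k, j, thr, pol = train_stump(X, y, W)    # steps IV-V
    E_k = max(E_k, 1e-10)                      # guard against log(0)
    alpha_k = 0.5 * np.log((1 - E_k) / E_k)    # step VI
    h_k = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
    W = W * np.exp(-alpha_k * y * h_k)         # step VII: reweight patterns
    W /= W.sum()                               # normalize by Z_k
    stumps.append((j, thr, pol)); alphas.append(alpha_k)

# Eq. (10): weighted vote of the component classifiers; decision = sign(g(x)).
g = sum(a * np.where(p * (X[:, j] - t) > 0, 1, -1)
        for a, (j, t, p) in zip(alphas, stumps))
print("ensemble training accuracy:", (np.sign(g) == y).mean())
```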
Supervised Neural Network
An Artificial Neural Network is an adaptive, most often nonlinear, system that learns to perform a function (an input/output map) from a data set (inductive learning). Adaptive means that the system parameters are changed during operation (the training phase). After the training phase, the Artificial Neural Network parameters are fixed and the system is deployed to solve the problem at hand (the testing phase). The Artificial Neural Network is built with a systematic step-by-step procedure to optimize a performance criterion or to follow some implicit internal constraint, which is commonly referred to as the learning rule. The nonlinear nature of the neural network processing elements (PEs) provides the system with great flexibility to achieve practically any desired input/output map.
An input is presented to the neural network and a corresponding desired or target response is set at the output (when this is the case, the training is called supervised). An error is calculated as the difference between the desired response and the system output. This error information is fed back to the system and used to adjust the system parameters in a systematic fashion (the learning rule). The process is repeated until the performance becomes acceptable.
The structural unit of Neural Networks is a functional model of the biological neuron, called the Perceptron. The synapses of the neuron are modeled as weights: the strength of the connection between an input and a neuron is characterized by the value of the weight. Negative weight values reflect inhibitory connections, while positive values designate excitatory connections. An adder sums up all the inputs modified by their respective weights, and an activation function controls the amplitude of the output of the neuron. An acceptable range of output is usually between 0 and 1, or -1 and 1.
From this model, the internal activity of the neuron can be shown to be

$v_k = \sum_{j=1}^{p} w_{kj} x_j$    (11)
The output of the neuron, yk, would therefore be the outcome of some activation function on the value
of vk.
As mentioned previously, the activation function acts as a squashing function, such that the output of a
neuron in a neural network is between certain values (usually 0 and 1, or -1 and 1). The most common
activation functions, denoted by φ(·), are the Threshold Function, the Piecewise-Linear Function, and the
Log-sigmoid Function.
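A minimal sketch of a single perceptron-style unit implementing Eq. (11) followed by a log-sigmoid activation; the weights and inputs are illustrative assumptions:

```python
import numpy as np

def log_sigmoid(v):
    """Log-sigmoid squashing function: maps any real v into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def neuron_output(x, w_k, b_k=0.0):
    """Eq. (11): v_k = sum_j w_kj * x_j (plus an optional bias), then y_k = phi(v_k)."""
    v_k = np.dot(w_k, x) + b_k
    return log_sigmoid(v_k)

x = np.array([0.5, -1.0, 2.0])     # inputs (assumed values)
w_k = np.array([0.8, 0.2, -0.5])   # synaptic weights (assumed values)
print(neuron_output(x, w_k))
```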
After analyzing the properties of the basic processing unit in an artificial neural network, we will
now focus on the pattern of connections between the units and the propagation of data. As for this
pattern of connections, the main distinction we can make is between feed-forward neural networks,
where the data flow from input to output units is strictly feed-forward, and recurrent neural
networks, which do contain feedback connections [22].
A neural network has to be configured such that the application of a set of inputs produces the desired set of outputs. Various methods to set the strengths of the connections exist. One way is to set the weights explicitly, using a priori knowledge. Another way is to 'train' the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule. We can categorize the learning situations into three distinct sorts. We speak of supervised learning when the network is trained by providing it with input and matching output patterns; these input-output pairs can be provided by an external teacher, or by the system which contains the neural network (self-supervised). We have unsupervised learning when an (output) unit is trained to respond to clusters of patterns within the input; in this paradigm the system is supposed to discover statistically salient features of the input population. Finally, reinforcement learning can be performed, which may be considered an intermediate form of the above two types of learning [23].
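To illustrate the supervised learning loop described above (error fed back to adjust the weights until performance is acceptable), here is a minimal gradient-descent sketch for a single sigmoid neuron; the data, learning rate, and epoch count are assumptions chosen only for illustration:

```python
import numpy as np

# Supervised training of one sigmoid neuron: present an input, compare the
# output with the target, feed the error back to adjust the weights
# (the learning rule), and repeat.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = (X[:, 0] - X[:, 1] > 0).astype(float)   # target outputs (toy rule)

w = np.zeros(2)
b = 0.0
lr = 0.5                                    # learning rate (assumed)
for epoch in range(200):                    # repeat until acceptable
    y = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # forward pass: Eq. (11) + sigmoid
    err = t - y                             # desired response minus output
    w += lr * X.T @ (err * y * (1 - y)) / len(X)  # delta-rule weight update
    b += lr * np.mean(err * y * (1 - y))

y = 1.0 / (1.0 + np.exp(-(X @ w + b)))      # final forward pass
acc = ((y > 0.5).astype(float) == t).mean()
print(f"training accuracy after fitting: {acc:.2f}")
```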