Good afternoon, everybody. Welcome to my candidature presentation for my MPhil degree. Today I will talk about adapting kernel-based methods: from multi-class SVM to semi-supervised spectral analysis. I am Wu Zhili, my principal supervisor is Dr. Li, and my co-supervisor is Dr. Tang.
Now let's take a look at the outline of my presentation. At the beginning, I will explain what machine learning is, and the three categories of learning: purely supervised, unsupervised, and the newly emerging semi-supervised learning. After that, I will explain the support vector machine, which is a successful kernel-based supervised learning method. Because the SVM is only a binary classifier, the output coding approach is usually used to handle multi-class problems, and this approach will be introduced in the presentation. After the SVM section, I will present the effective unsupervised spectral clustering algorithm and a semi-supervised extension to the spectral method. Finally, I will give the conclusion and outlook.
So, what is machine learning? As a branch of artificial intelligence, machine learning has been around for a long time but did not receive enough attention until the last decade of the past century. In 1997, Tom Mitchell defined machine learning as the study of computer algorithms that improve automatically through experience. Over the past several years, there have been many successful applications of machine learning approaches, such as data mining on large databases, bioinformatics, and information extraction. The following sentence is taken from Science magazine in 2001, written by Dennis DeCoste: "machine learning will lead to appropriate, partial automation of every element of scientific method, from hypothesis generation to model construction to decisive experimentation."
Now let's see the three categories of learning. The first one is supervised learning: with training samples available, the task is to predict the class of unknown patterns. There are two types of supervised learning, regression and classification. Regression maps the data onto an infinite set of labels, while classification, usually known as pattern recognition, maps the data onto a finite set of labels. The second type of learning is unsupervised, because no labeled data are present during the learning process. A common task of unsupervised learning is clustering, which groups the data into several clusters. The third category of learning is relatively new but of great importance. It is called semi-supervised learning. Although there is no clear definition for it, it usually refers to situations where labeled data are available only in limited quantity but unlabeled data are abundant. For example, there are plenty of data for webpage categorization and gene expression analysis, but the labeling process is too tedious or too expensive.
After the general introduction to machine learning, I will illustrate the definition of the kernel matrix, which will play a major role in my presentation. We are given a dataset X with n data points, where each xi is a d-dimensional feature vector and yi is the label, which can be one, minus one, or unknown if the data point is not labeled. Now we construct a symmetric n-by-n matrix K, where each entry of K is a measure of the pairwise relation between data points xi and xj. For example, the following exponential inverse distance function is a similarity measure between any data points xi and xj. With some rigorous mathematical proof, we can define what a kernel matrix is: if the above K is positive semi-definite, it is a kernel matrix. The special and interesting point of a kernel matrix K is that any entry Kij can be regarded as the inner product of two vectors phi(xi) and phi(xj), where phi(xi) is a nonlinear mapping of xi into a space of much higher dimension. That is to say, the kernel function maps each data point from the lower-dimensional input space into a higher-dimensional feature space.
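In symbols, a minimal sketch of the similarity function referred to above, assuming it is the usual Gaussian-type kernel with a width parameter sigma (the exact formula lives on the slide, so this is my rendering rather than a transcription):

K_{ij} = k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{\sigma^2}\right) = \langle \phi(x_i), \phi(x_j) \rangle .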
To make the idea of the kernel matrix clearer, I present a toy example. Here are five patterns extracted from an IQ quiz problem: which is the odd one out? The labels y are totally unknown because it is an unsupervised learning problem. To answer the question, the strategy of machine learning is to do clustering, but for now I only want to show how to build the matrix. To distinguish those patterns, we first define the features of each pattern. For example, we might score the patterns by their symmetry, curvature, and aesthetic appeal. Here is the dataset X; of course, you can have your own scoring scheme.
Now we adopt the kernel function mentioned before, the exponential inverse distance square function, and we obtain a 5-by-5 kernel matrix.
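To make this concrete in code, here is a minimal sketch of that construction. The feature scores below are hypothetical placeholders standing in for the numbers on the slide, and sigma is an assumed width parameter.

import numpy as np

# Hypothetical scores for the five patterns:
# columns = (symmetry, curvature, aesthetic appeal); placeholders only.
X = np.array([
    [0.9, 0.1, 0.8],
    [0.8, 0.2, 0.7],
    [0.2, 0.9, 0.3],
    [0.85, 0.15, 0.75],
    [0.88, 0.12, 0.78],
])

sigma = 1.0  # assumed kernel width

# Pairwise squared Euclidean distances, then the exponential
# inverse-distance-square (Gaussian-type) kernel.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / sigma**2)

print(K.shape)  # (5, 5): symmetric and positive semi-definite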
After talking about the kernel matrix, we can go to the SVM section. The SVM has been an effective classifier on the basis of three characteristics: the first is the large margin, the second is quadratic programming, and the third is the kernel trick. Look at the diagram: we are given two labeled datasets, one set of positive samples marked in blue and another set of negative samples marked in red. We wish to construct a separating plane xw + b = 0 such that there exist two parallel planes, xw + b = 1, called the positive plane, and xw + b = -1, called the negative plane, so that all positive samples lie on one side of the positive plane and all negative samples lie on the other side of the negative plane. It can be easily calculated that the margin between the two planes is 2 divided by the Euclidean norm of w, that is, 2 over the square root of w transpose w. Our objective is to maximize this margin.
To maximize the margin, we need some mathematical transformations so that the problem can be solved easily. First we change the objective from a maximization to a minimization with the same constraints, and then we relax the separability constraints to permit some violation by the data points, adding a penalty term to the objective function. For example, we introduce the penalty term C times the summation of the violation of each data point.
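For reference, a sketch of the resulting soft-margin primal problem in standard notation, with slack variables xi_i denoting the violations (this is the usual textbook formulation rather than a transcription of the slide):

\min_{w,\,b,\,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i (x_i w + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \; i = 1, \dots, n .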
With the help of well-established optimization theory and techniques, we can obtain the following quadratic form by using the Lagrangian method. We now maximize a function of alpha with respect to a linear constraint, namely that the summation of alpha_i times y_i is zero, where each alpha_i is a nonnegative value bounded by C, and C is the penalty constant. It is easy to see that if we strictly require the positive and negative samples to be separable, C goes to positive infinity. The above QP problem can be solved by many standard QP routines, or by faster algorithms that exploit the special linear constraint, such as working set selection, shrinking, and SMO. Although there is no tight bound on the complexity, some algorithms run faster than N squared.
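For concreteness, the dual quadratic program being described is, in its standard form (again my own rendering of the usual formula rather than the slide itself):

\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
\quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C .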
Now we come to the third characteristic of the SVM, the kernel trick. Observe that there is an inner product between xi and xj in the quadratic program. By simply substituting this inner product with the corresponding entry of the kernel matrix, we can claim that we are constructing the SVM separating plane in a higher-dimensional space rather than in the input space. So what is the benefit of such a substitution? Informally speaking, in a higher-dimensional space the data points are believed to be more sparsely distributed and thus more separable, so we can rely less on the penalty constant C. For those familiar with quadratic programming, the following notation might make this clearer.
Finally, we can take a look at the solution alpha. All data points fall into three types according to their alpha values. Data points correctly separated by the two parallel planes are associated with a zero alpha; the second type of data points, which lie exactly on the two parallel planes, have corresponding alpha values between zero and C; and points violating the separability constraints are associated with alpha equal to C. The last two types of data points are called support vectors because they support the final decision function. The decision function xw + b can be expressed as a combination of the alphas, the labels y, and the inner products of data points, while the constant b can be determined once all the alphas are obtained. By simply taking the sign, we get the label of any new data point.
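A minimal sketch of this pipeline with scikit-learn, just to make the role of the support vectors and of b concrete; the toy data, the RBF kernel, and the parameters are placeholders, not the experiments reported in the thesis.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two small Gaussian blobs as positive / negative samples (toy data).
X_pos = rng.randn(20, 2) + [2, 2]
X_neg = rng.randn(20, 2) - [2, 2]
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

# Only the support vectors (alpha > 0) enter the decision function:
# f(x) = sum_i alpha_i * y_i * k(x_i, x) + b, prediction = sign(f(x)).
print(clf.support_vectors_.shape)   # the support vectors
print(clf.dual_coef_)               # alpha_i * y_i for the support vectors
print(clf.intercept_)               # the constant b
print(np.sign(clf.decision_function([[0.5, 0.5]])))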
The graph on the right illustrates the non-separable case, where the data points circled with a dotted line are the support vectors. The graph on the left is an example classified by an SVM with the Gaussian kernel; it is a challenging problem.
Though effective, the SVM is only a binary classifier, while many real problems are multi-class. Some researchers have proposed multi-class SVMs based on the output coding approach. First let's take a look at output codes. Those who work on communication theory or computer graphics might know a lot about output codes. For example, the following output code can be found when doing line clipping in computer graphics. Back to our multi-class SVM problem, suppose the four lines are four SVM decision boundaries, two vertical and two horizontal. The 2D plane is thus divided into nine regions, and each region is represented by a four-bit code, where each bit indicates on which side of an SVM line the region lies. For example, the leftmost three regions have their first bit set to 1 because they are all on the left side of the left vertical line. The other bits are specified in the same way. Now let me explain what ECOC (error-correcting output coding) is in my own words. Suppose that for an m-class task we train n SVMs; the n SVM hyperplanes divide the space into many subspaces, possibly 2 to the power n. Each subspace is encoded by an n-bit code, and because the number of classes may be less than the number of subspaces, each class is a union of several subspaces. So if you change several bits of a code, you might still get the same class.
Now I present the ECOC SVM algorithm. It might be too conceptual here, but later I will illustrate it graphically. Given a 3-class classification problem, we represent each class by a 3-bit code, thus forming a 3-by-3 code matrix M. We then train 3 SVMs, each using a different data split specified by a column of the code matrix. For example, if we adopt the one-per-class (OPC) code, we train the first SVM according to the first column of the OPC matrix, which is the transpose of (1, 0, 0). It means that we take the first class of data among the three classes as the positive samples for the SVM training, and the remaining two classes as the negative samples. If we use the pairwise coupling (PWC) code, the SVM tries to separate the first class from the second class, while the third class is not used. Finally, for each new data point, we use the three SVMs together to do prediction by assigning the data point a 3-bit code and classifying it to the class with the closest code.
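Here is a compact sketch of this train-and-decode loop, assuming a code matrix with entries in {+1, -1} (such as one-per-class) and simple Hamming-style decoding; the helper names ecoc_train, ecoc_predict, and M_opc are my own, and the handling of "class not used" columns for PWC is omitted for brevity.

import numpy as np
from sklearn.svm import SVC

def ecoc_train(X, y, M, **svm_params):
    """Train one SVM per column of the code matrix M (m classes x n bits).
    M[c, s] = +1 means class c is positive for SVM s, -1 means negative."""
    svms = []
    for s in range(M.shape[1]):
        y_binary = M[y, s]            # relabel each sample by its class's code bit
        svms.append(SVC(**svm_params).fit(X, y_binary))
    return svms

def ecoc_predict(X, svms, M):
    """Assign each sample the code sign(decision values), then pick the
    class whose row of M is closest in Hamming distance."""
    bits = np.sign(np.column_stack([s.decision_function(X) for s in svms]))
    dists = np.array([[np.sum(b != row) for row in M] for b in bits])
    return dists.argmin(axis=1)

# One-per-class code for 3 classes: row c is +1 in position c, -1 elsewhere.
M_opc = 2 * np.eye(3, dtype=int) - 1
# Usage sketch: svms = ecoc_train(X, y, M_opc, kernel="rbf", C=1.0)
#               y_hat = ecoc_predict(X_new, svms, M_opc)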
For example, when using the one-per-class code, we train the first SVM by following the first column of the code matrix. The red points are taken as the positive samples and the other two types of colored points are the negative samples, and the SVM tries to separate them. After we obtain three SVMs, we can use them to do a 3-class classification. But OPC obviously has weak error-correcting ability, because there are many ambiguous regions, such as the regions 110, 000, 101, and 011, which all have the same code distance from the three classes 100, 010, and 001.
The pairwise coupling code is different because each SVM only tries to separate two classes of data, so in total m times (m minus 1) over 2 support vector machines are needed for an m-class task. But it has stronger error-correcting ability; the only ambiguous region is the one coded 000.
Under the framework of ECOC, we propose the Hadamard code for multi-class SVM. Here are two code examples, of size 4 and 8 respectively. The first column is excluded because all its entries are the same, so we train m minus 1 SVMs for an m-class task. The Hadamard code has been proved to have strong error-correcting ability because it has large row separation and all its columns are mutually orthogonal, although this is not easy to illustrate graphically here. Instead, we present some experimental results on the UCI repository. From the table, we can conclude that the Hadamard code has its superiority. The comparison with the pairwise coupling code is not listed here; what I can tell is that the PWC code also achieves good results, but it uses many more SVMs.
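For reference, the kind of code matrix being described can be generated as in the sketch below; this is only one standard construction, and the thesis may build or normalize the matrix differently.

import numpy as np
from scipy.linalg import hadamard

def hadamard_code(m):
    """Return an ECOC code matrix for m classes based on a Hadamard matrix.
    The constant first column is dropped, leaving m rows of m-1 bits each;
    the rows are mutually far apart and the remaining columns are balanced."""
    H = hadamard(m)          # m must be a power of 2 (2, 4, 8, ...)
    return H[:, 1:]          # drop the all-ones first column

print(hadamard_code(4))
# [[ 1  1  1]
#  [-1  1 -1]
#  [ 1 -1 -1]
#  [-1 -1  1]]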
Finally, I give a summary of the ECOC approach. The code selection is important but not easy. For example, the Hadamard code is good for the economical number of SVMs used and for its accuracy, but each SVM is hard to train because the code is not aware of the natural boundaries between the classes of data. The pairwise code trains too many SVMs, and the OPC code uses unbalanced splits but is popular because it is easy to understand. The decoding method is also important. A weighted decoding scheme may help the OPC code: for example, instead of assigning a data point to the class of the closest code, we can use a probability-like measure related to the decision function.
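One common way to write such a weighted decoding rule, in the spirit of loss-based decoding, is shown below; this generic form is my addition and is not necessarily the exact scheme used in the thesis. Here f_s(x) is the decision value of the s-th SVM, M is the code matrix, and l is a loss function:

\hat{c}(x) = \arg\min_{c} \sum_{s=1}^{n} \ell\big(M_{cs}\, f_s(x)\big), \qquad \text{e.g. } \ell(z) = \max(0, 1-z) \ \text{or} \ \ell(z) = e^{-z} .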
Now I will talk about another kernel-based method: spectral analysis. It is named after the operation of eigen-decomposition. So again we present our matrix. In spectral clustering, the matrix is nearly the same as the kernel matrix presented before: it is also n-by-n and symmetric, but the pairwise relation is restricted to a similarity measure. The exponential inverse distance square function satisfies this requirement. The similarity matrix can come from kernel functions, or be constructed manually, or be defined in feature space by using the kernel trick.
Geometrically, the similarity matrix can be transformed into an undirected weighted graph, with its vertices representing the data points and its edges weighted by the entries of the similarity matrix K. Now the clustering problem can be reformulated as a graph minimum cut. We partition the graph into two disjoint node sets A and B; if a node is in set A, we label it as positive, otherwise negative. The graph cut is the summation of the weights of the edges connecting set A and set B. Mathematically, it can be expressed in the following form: if yi equals yj, the term contributes zero to the summation, but if yi does not equal yj, the weight of the edge connecting xi and xj is added. Our objective is to minimize the graph cut.
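Written out, the cut objective being described is (a standard rendering, with labels y_i in {+1, -1}):

\mathrm{cut}(A, B) \;=\; \sum_{i \in A,\, j \in B} K_{ij} \;=\; \frac{1}{4} \sum_{i,j} K_{ij}\,(1 - y_i y_j) .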
It is easy to see that the vector of all ones, or all minus ones, makes the objective function attain the trivial minimum. To prevent this trivial minimum, we simply add a constraint, that the summation of yi equals zero, to enforce a balanced cut. But such a constraint makes the optimization problem NP-hard, so it is only possible to seek approximate solutions. Now we do some transformations first. Construct a matrix D of the same size as K, where most entries of D are zero and the diagonal of D is the row-wise summation of K. Then construct another matrix L which equals D minus K. L is a positive semi-definite matrix, and the original objective function can be expressed, up to a constant factor, as y transpose times L times y. But such a transformation does not reduce the complexity of the problem, so we approximate it by spectral decomposition. By relaxing y from binary integers to real numbers and restricting the sum of squares of y to equal n, the objective function value is no smaller than lambda 2, where lambda 2 is the first nonzero eigenvalue of the matrix L and v2 is the corresponding eigenvector. v2 is also the y we want to approximate, and lambda 2 gives the approximate minimum. Because zero is the smallest eigenvalue of the matrix L, and the corresponding eigenvector is the vector of all ones, we know that the summation of the entries of v2 is zero, since the eigenvectors are mutually orthogonal.
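In formulas, the relaxation being described is (standard notation; the scaling convention by n may differ slightly from the slides):

\min_{y \in \mathbb{R}^n} \; y^\top L y \quad \text{subject to} \quad y^\top y = n, \;\; y^\top \mathbf{1} = 0, \qquad L = D - K, \;\; D = \mathrm{diag}(K\mathbf{1}),

whose minimizer is y^* = \sqrt{n}\, v_2 with value n\lambda_2, where (\lambda_2, v_2) is the second smallest eigenpair of L.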
Once we obtain v2, we can simply partition it at the median to get a balanced cut. It has also been proved that any p-th percentile cut works well if we want a split with ratio p. Beyond v2, the third eigenvector, the fourth, and so on can be used simultaneously as the basis of our clustering. The algorithm has been used in many areas such as molecular physics, computer vision, high-performance computing, and data clustering. There are many variants of spectral graph cuts, such as the normalized cut, ratio cut, and recursive cut, and connections have been drawn to other algorithms such as PCA, locally linear embedding, multidimensional scaling, and random walks.
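A minimal sketch of this bipartition procedure, using the unnormalized Laplacian and a median split; the Gaussian similarity and the toy data are placeholders.

import numpy as np

def spectral_bipartition(K):
    """Split the data into two balanced clusters using the second smallest
    eigenvector (Fiedler vector) of the graph Laplacian L = D - K."""
    D = np.diag(K.sum(axis=1))                # degree matrix
    L = D - K                                 # unnormalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    v2 = eigvecs[:, 1]                        # second smallest eigenvector
    return np.where(v2 > np.median(v2), 1, -1)

# Example: two well-separated blobs with a Gaussian similarity matrix.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(10, 2), rng.randn(10, 2) + 5])
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
print(spectral_bipartition(np.exp(-sq)))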
Here are two simple clustering examples produced by the spectral algorithm.
After talking about spectral clustering, we now take a look at how the spectral method can be applied to semi-supervised learning. There have been many studies in the field of semi-supervised learning. For example, Blum proposed the co-training algorithm in 1998. The basic idea of co-training is to split the data by feature and to find the latent labels of the unlabeled data. In the same year, Bennett proposed a semi-supervised SVM that uses a variant of the SVM and performs linear optimization over the whole dataset. In 1999, Joachims invented the transductive SVM, a complicated algorithm that iteratively adds new data points into the SVM. Recently, Cristianini tried to modify the metric of the kernel matrix by using its spectrum.
There is also some research on semi-supervised learning using graph cuts. Here is the problem formulation. In the graph formed from K, there are two sets of labeled data, one for the positive class, s, and the other for the negative class, t. We want to get a graph minimum cut with this label information provided. We simply add a supersource capital S and a supersink capital T into the graph, connected to the labeled nodes with capacity w equal to positive infinity. Now we obtain a classic S-T minimum cut problem.
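A sketch of this construction with networkx follows; the similarity graph and the label index sets are toy placeholders, and a large finite capacity stands in for infinity.

import networkx as nx

def st_mincut_labels(K, pos_idx, neg_idx, inf=1e9):
    """Binary labeling via an S-T minimum cut on the similarity graph K,
    with labeled points tied to a supersource S and a supersink T."""
    n = len(K)
    G = nx.DiGraph()
    for i in range(n):
        for j in range(n):
            if i != j and K[i][j] > 0:
                G.add_edge(i, j, capacity=float(K[i][j]))
    for i in pos_idx:
        G.add_edge("S", i, capacity=inf)     # supersource -> known positives
    for i in neg_idx:
        G.add_edge(i, "T", capacity=inf)     # known negatives -> supersink
    cut_value, (side_S, side_T) = nx.minimum_cut(G, "S", "T")
    return [1 if i in side_S else -1 for i in range(n)]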
It is well known that there are many deterministic S-T cut algorithms based on maximum flow. This kind of maximum-flow approach was studied in a paper by Blum in 2001. Although many algorithms are available, and some of them are very efficient, they all suffer from outliers in the graph and have difficulty producing a balanced cut. Also, the output of those maximum-flow algorithms is only a discrete binary label, not a probability measure.
We conducted some studies on semi-supervised learning using spectral methods. The spectral methods decompose a matrix systematically, a balanced cut can always be obtained, and the result is insensitive to outliers. The method can be very fast, and the eigenvector is a continuous output, which can be used as a probability measure. Moreover, the capacities for the supersource and supersink can be further adapted to the different confidences of the labeled data.
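As a rough sketch only: one plausible way to realize this idea, following the supersource/supersink construction above, is to append two extra nodes to the similarity matrix with label-confidence weights and then reuse the spectral bipartition sketched earlier. The details here are my assumption and not necessarily the exact method evaluated in the thesis.

import numpy as np

def augment_with_labels(K, pos_idx, neg_idx, w=10.0):
    """Append a supersource and a supersink node to the similarity matrix K,
    connected to the labeled points with weight w (larger w = more confidence).
    NOTE: an illustrative assumption, not the thesis's exact construction."""
    n = K.shape[0]
    K_aug = np.zeros((n + 2, n + 2))
    K_aug[:n, :n] = K
    S, T = n, n + 1                      # indices of the two extra nodes
    for i in pos_idx:
        K_aug[S, i] = K_aug[i, S] = w    # supersource <-> known positives
    for i in neg_idx:
        K_aug[T, i] = K_aug[i, T] = w    # supersink  <-> known negatives
    return K_aug

# Usage sketch: labels = spectral_bipartition(augment_with_labels(K, pos, neg))[:-2]
# (the last two entries correspond to the extra nodes and are discarded)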
Here are the comparative results. The left graph is the comparison among the transductive SVM, the ordinary SVM, and the extended spectral method. The transductive SVM and the spectral method are comparable, and both of them are better than the ordinary SVM. The right graph shows that the spectral method outperforms the maximum-flow approach.
Finally, the presentation is about to finish. In my presentation, I have presented the kernel-based methods, including the SVM and spectral clustering. The ECOC SVM was explained, and the Hadamard code was suggested for use in multi-class SVM. The presentation also explained semi-supervised learning and gave a simple extension to the spectral method, which has very good performance in semi-supervised learning.
My further MPhil study will still concentrate on kernel-based methods. There are many questions to be further investigated, such as the relation between the SVM and spectral methods, better ECOC codes, and new approaches to semi-supervised learning.