Generalized Hidden Space Support Vector
Machines
Ioannis N. Dimou, Michalis E. Zervakis
Technical University of Crete, Greece
jdimou@gmail.com, michalis@gmail.com
Abstract This paper extends the concept of Hidden Space
Support Vector Machines into the set of composite kernels
and provides proof of the extensions’ closure properties and
practical feasibility. The limitation of a linear second-stage
kernel is surpassed within the context of a more general
formulation. This also allows indefinite Gram matrices to be
used as minor kernels followed by an RBF stage. This
broadens the choice of possible mapping functions and allows
the incorporation of useful prior knowledge, such as invariances,
into the model, thus improving the chances of higher
generalization capability for the trained classifiers.
Index Terms: hidden space support vector machines, kernel
methods, non-positive semi-definite kernels
I. INTRODUCTION
Since their introduction a decade ago, Support Vector
Machines (SVMs) have gained ground as a mainstream
pattern analysis tool. Their unique mapping capabilities
balance generalization error with learning error using a
two-stage approach that decouples the choice of a mapping
function (kernel) from the solution algorithm (quadratic or
linear optimization, sequential minimization).
A kernel function is a similarity function satisfying the
properties of being symmetric and positive-definite. The
choice of an appropriate kernel function has been an area of
intense theoretical and experimental research since the
required mathematical properties limit the pool of
admissible kernels. Such properties are defined by the
Mercer conditions (Vapnik 1995), which in most cases are
equivalent to positive semi-definiteness of the kernel's
integral operator, as described in Section II.
In practice many applications require the use of non-PSD
kernels (either generic or custom designed). Such examples
include kernels to quantify similarity between sets
(Eichhorn and Chapelle 2004), sigmoid kernels (Lin and
Lin 2003), and sinusoidal kernels used in image classification
(REF).
In the above cases a number of reasons can contribute,
independently or in combination, so that:
• The kernel is inherently not PSD due to its analytic
mathematical form.
• The kernel is not PSD in the full features' range.
• The kernel is only PSD for a limited parameter range.
• The kernel is only PSD for specific statistical dataset
characteristics.
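Whether a candidate Gram matrix falls into one of the above cases can be checked empirically. The following minimal sketch (Python/NumPy, not part of the original formulation; the sigmoid parameters are illustrative assumptions) tests positive semi-definiteness by inspecting the eigenvalue spectrum:

import numpy as np

def is_psd(K, tol=1e-9):
    """Return True if the symmetrized Gram matrix has no eigenvalue below -tol."""
    K_sym = 0.5 * (K + K.T)                  # enforce symmetry before the eigen-decomposition
    return np.linalg.eigvalsh(K_sym).min() >= -tol

# Illustrative check: the sigmoid (tanh) kernel is indefinite for many parameter choices.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                 # 50 samples, 8 features
K_sigmoid = np.tanh(0.5 * (X @ X.T) - 1.0)
print(is_psd(K_sigmoid))                     # typically False for such settings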
Workarounds that have been proposed for this problem
range from approximate solution methods to the definition
of alternative pseudo-Euclidean spaces (Ong, Canu et al.
2004). These approaches, however, induce different
conditions and usually apply only to specific types of
kernels.
In (Cristianini, Kandola et al. 2002) the authors advocate
the principled selection of kernels via the “kernel target
alignment” metric which characterizes the applicability of a
chosen kernel to a specific classification target. They
conclude that designers should try to use the most suitable
kernel within the constraints imposed by the SVMs’ theory.
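As a concrete illustration of this selection criterion, the sketch below (an illustrative Python/NumPy rendering, not code from the cited work) computes the empirical alignment between a Gram matrix K and the ideal target matrix yyᵀ as a normalized Frobenius inner product:

import numpy as np

def kernel_target_alignment(K, y):
    """Empirical alignment A(K, yy^T) = <K, yy^T>_F / (||K||_F ||yy^T||_F)."""
    y = y.reshape(-1, 1).astype(float)       # class labels in {-1, +1}
    Y = y @ y.T                              # ideal target Gram matrix
    return np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y))

# Toy usage with a linear kernel on random data and random labels (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = rng.choice([-1.0, 1.0], size=30)
print(kernel_target_alignment(X @ X.T, y))

Higher alignment values indicate a kernel better matched to the classification target.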
Hidden Space Support Vector Machines (HS-SVMs)
provide a way to circumvent such design shortcomings and
extend the choice of kernels to more general functions. In
the original HS-SVMs paper (Li, Weida et al. 2004)
parallels are drawn between Neural Networks and SVMs in
order to highlight the benefits of using HS-SVMs.
In our approach the same formulation is analyzed through
the prism of composite kernels. In related literature
(Cristianini and Shawe-Taylor 2000; Schölkopf and Smola
2002) this term refers to kernel functionals that are derived
by subjecting known simpler (minor) kernels to specific
transformations within the reproducing kernel Hilbert space
(RKHS).
II. SVMS BACKGROUND
SVMs have been proposed in part to overcome the
problem of classifier generalization performance. They are
a flexible method that automatically incorporates
nonlinearity in the classification model. Let
$X = \{x_1, x_2, \dots, x_{N_{trn}}\}$ denote the set of independently and
identically distributed training patterns, each of dimension $d$,
and $y_i,\ i = 1, \dots, N_{trn}$ the associated (training set) crisp
class labels. The $d$-dimensional variable space is mapped to
a high-dimensional feature space using a mapping
$\phi: \mathbb{R}^d \to \mathbb{R}^m$:

$$ x \mapsto z = \left[\phi_1(x), \phi_2(x), \dots, \phi_m(x)\right]^T \quad (1) $$
In this feature space, a linear (in the parameters)
separation function $f(x)$ between the two classes is
constructed. The classifier takes the following (primal)
form:

$$ f(x) = \operatorname{sign}\left(w^T \phi(x) + b\right) \quad (2) $$

where $x \in X \subset \mathbb{R}^d$, $w \in \mathbb{R}^m$ is the weight vector, $b$
the bias term, and $f(x)$ the crisp prediction of the model
after applying the sign() function to the soft output. Under the
standard SVMs formulation (Cristianini and Shawe-Taylor
2000; Schölkopf and Smola 2002) in the primal space the
objective function associated with the problem's solution is
defined as

$$ \min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{N_{trn}} \xi_i \quad (3) $$
subject to the constraints

$$ y_i \left(w^T \phi(x_i) + b\right) \geq 1 - \xi_i, \qquad \xi_i \geq 0, \qquad i = 1, \dots, N_{trn} \quad (4) $$

where $C$ is the regularization parameter and $\xi_i$ are the
slack variables that allow for misclassification of some
samples. This model formulation attempts to find a balance
between maximization of the margin separating the two
classes, which corresponds to regularization (first term), and
minimization of misclassifications (second term). By solving the
corresponding Lagrangian, the problem can also be
formulated in the dual space as the equivalent classifier
$$ f(x) = \operatorname{sign}\left(\sum_{i=1}^{N_{trn}} \alpha_i y_i \, \phi(x)^T \phi(x_i) + b\right) \quad (5) $$

where $\alpha_i$ are weight parameters called support values of the
training cases and $y_i$ the corresponding crisp labels.

Since the mappings $\phi(x)$ can be complex or unspecified
by design, it is more convenient to work implicitly in the
feature space by defining a positive definite kernel function

$$ k(x, x_i) = \phi(x)^T \phi(x_i) \quad (6) $$

that allows us to formulate appropriate hidden spaces which
ensure better separability. The corresponding matrix
$K_{ij} = k(x_i, x_j)$, usually referred to as the Gram matrix, can be
calculated and stored once for the training set and once for the
test set and used repetitively in (5).

Apart from the trivial linear kernel

$$ k(x, x_i) = x^T x_i \quad (7) $$

other commonly used kernel functions are the polynomial kernel

$$ k(x, x_i) = \left(x^T x_i + 1\right)^p, \quad p \in \mathbb{N} \quad (8) $$

and the Gaussian radial basis function kernel

$$ k(x, x_i) = \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right) \quad (9) $$

where the parameter $\sigma$ denotes the kernel width.

The choice of kernel affects how the linear separation
hyperplane in the feature space relates to the nonlinear
separation hyperplane in the original variables' space.
Regardless of the kernel function used, the solution
approach leads to a quadratic optimization problem.

In order for a Gram matrix to be an admissible SVM
kernel it has to satisfy two distinct properties:
• Symmetry
• Positive (semi)definiteness
The latter in the discrete case amounts to

$$ \sum_{x_i \in X} \sum_{x_j \in X} f(x_i) f(x_j) K(x_i, x_j) \geq 0 \quad (10) $$

and in the continuous case to

$$ \int_X \int_X f(x_i) f(x_j) K(x_i, x_j) \, dx_i \, dx_j \geq 0 \quad (11) $$

These conditions pose a problem and greatly limit the
choice of kernel functions.
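To make the role of the Gram matrix concrete, the following minimal sketch (assuming scikit-learn is available; the data and kernel width are illustrative) precomputes training and test Gram matrices for the RBF kernel of eq. (9) and feeds them to an SVM solver, mirroring the compute-once, reuse-repeatedly usage described above:

import numpy as np
from sklearn.svm import SVC

def rbf_gram(A, B, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2)), cf. eq. (9)."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(2)
X_trn, y_trn = rng.normal(size=(80, 4)), rng.choice([-1, 1], size=80)
X_tst = rng.normal(size=(20, 4))

K_trn = rbf_gram(X_trn, X_trn)               # computed and stored once for the training set
K_tst = rbf_gram(X_tst, X_trn)               # rows: test points, columns: training points

clf = SVC(kernel="precomputed", C=1.0).fit(K_trn, y_trn)
predictions = clf.predict(K_tst)             # reuses the stored Gram matrices, cf. eq. (5)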
III. INDEFINITE KERNELS
The fact that the Mercer conditions and the PSD requirement
significantly limit the pool of available functions was
recognized early on in the development of SVM
algorithms.
This has resulted in the widespread use of very few
admissible kernels, the most prominent of which is the RBF
kernel. Despite its optimality and theoretically infinite
mapping capability (VC confidence), this generic kernel
suffers from its own design. The bandwidth parameter is
not easy to optimize and can result in suboptimal
performance. This has been referred to as the “no free
lunch” theorem attributed to (Cristianini and Shawe-Taylor
2000).
The analogy between statistical density estimation
kernels and SVMs has triggered the idea of sharing kernel
functions between the two domains (REF). In practice this
leads to analytical forms that are not guaranteed to be PSD.
Additionally many authors have proposed numerous
kernels that attempt to incorporate domain specific
invariances in the classification model. Typical examples of
such kernels found in literature include:
The sigmoid (hyperbolic tangent) kernel (Lin and Lin
2003) has been evaluated in the past due to its
correspondence to the neural networks' sigmoid activation
function:

$$ k(x_i, x_j) = \tanh\left(a\, x_i^T x_j + b\right) \quad (12) $$

It is in general non-PSD, and recent experimental work
has shown the sigmoid kernel to be asymptotically identical
to the RBF kernel (REF).
The Epanechnikov kernel (Li and Racine 2007):

$$ k(x_i, x_j) = \frac{3}{4}\left(1 - \|x_i - x_j\|^2\right) \quad (13) $$

This kernel's analytical form does not lend itself to a
proof of positive definiteness. One metric that can be used
to measure the performance of density estimation kernels is the MISE (mean
integrated squared error) or the AMISE (asymptotic MISE).
The Epanechnikov kernel minimizes the AMISE and is in that
sense optimal; the efficiencies of other kernels are measured in
comparison to it.
The negative distance kernel (REF) has been proposed in
an effort to incorporate translation invariance into the SVM
model:

$$ k_{ND}(x_i, x_j) = -\|x_i - x_j\| \quad (14) $$

Mahalanobis distance kernels have been used in
problems where matching probability distributions is
required, as they characterize the shape of the data:

$$ k_{Mah}(x_i, x_j) = -\left(x_i - \bar{x}_i\right)^T C_{ij}^{-1} \left(x_j - \bar{x}_j\right) \quad (15) $$

where $C_{ij}$ is the covariance matrix of the two vectors indexed by $i$
and $j$.
Additionally, the need for indefinite kernels occurs in
many distance-based metrics and when the data structure
corresponds to non-Euclidean spaces (e.g. kernels on sets,
kernels on trees). In fact, most dissimilarity-based kernels of
the form
$$ k(x_i, x_j) = f\left(\|x_i - x_j\|\right) \quad (16) $$

are not positive definite.
IV. WORKAROUNDS TO USING INDEFINITE KERNELS
To overcome the problem of using indefinite kernel
formulations, researchers have proposed approximations of
indefinite kernels (REF), the use of pseudo-Euclidean (pE)
spaces, and limiting the kernel function to a parameter region
that is guaranteed to be PSD.
In the general case it has been proven that the
exponentiation operation on arbitrary feature matrices
results in admissible kernel matrices (Kondor and Lafferty
2002).
Other solutions include the empirical kernel mapping
(EKM), which takes $S^T S$ as a new kernel matrix (Schölkopf,
Weston et al. 2002), and the Saigo kernel, which is
constructed by subtracting the smallest negative
eigenvalue from the original non-PSD kernel's diagonal
(Saigo, Vert et al. 2004).
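A minimal sketch of the diagonal-shift idea (our own NumPy rendering under stated assumptions, not the cited authors' code): when the smallest eigenvalue of a symmetric but indefinite Gram matrix is negative, adding its magnitude to the diagonal yields a PSD matrix.

import numpy as np

def shift_to_psd(K):
    """Shift the diagonal of a symmetric, possibly indefinite Gram matrix by the
    magnitude of its smallest (most negative) eigenvalue, making it PSD."""
    K_sym = 0.5 * (K + K.T)
    lam_min = np.linalg.eigvalsh(K_sym).min()
    if lam_min < 0:
        K_sym = K_sym - lam_min * np.eye(K_sym.shape[0])   # eigenvalues become lam_i - lam_min >= 0
    return K_sym

# Quick check on an indefinite sigmoid Gram matrix (parameters illustrative).
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 6))
K = np.tanh(0.3 * (X @ X.T) - 0.5)
print(np.linalg.eigvalsh(shift_to_psd(K)).min() >= -1e-9)   # True: shifted matrix is PSD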
Still, non-PSD kernels can be used directly, with the
acknowledged limitation that there is a high risk of
converging to a local minimum of the error function surface.
This, however, negates one of the core advantages of SVMs
over neural networks. According to Haasdonk (REF), such
kernels result in SVMs which cannot be seen as margin
maximizers.
V. EXTENDING HS-SVMS
Having outlined the baseline of nonlinear soft-margin
SVMs and the indefinite kernel problem, we can now
establish the HS-SVMs formulation and the proposed
extensions. As introduced by Li in (Li, Weida et al. 2004),
HS-SVMs are derived by utilizing a special kind of
mapping to the hidden space, a symmetric function
$\phi'(x_i) = k_1(x_i, x) = k_1(x, x_i)$:

$$ x \mapsto z = \left[k_1(x_1, x), k_1(x_2, x), \dots, k_1(x_N, x)\right]^T \quad (17) $$

Using this function the corresponding $m = N$ dimensional
hidden space can be expressed as

$$ Z = \left\{ z \;\middle|\; z = \left[k_1(x_1, x), k_1(x_2, x), \dots, k_1(x_N, x)\right]^T, \; x \in X \right\} \quad (18) $$

We can proceed in a way analogous to the standard SVMs'
formulation and define the decision function in the primal
space

$$ y(x) = \operatorname{sign}\left(w^T z(x) + b\right) \quad (19) $$

and the corresponding function in the dual space

$$ y(x) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i y_i \, z(x)^T z(x_i) + b\right) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i y_i \, k_2(x_i, x) + b\right) \quad (20) $$

where the kernel function used by HS-SVMs,

$$ k_2(x_i, x_j) = \sum_{n=1}^{N_{trn}} k_1(x_n, x_i) \, k_1(x_n, x_j) = k_1(x, x_i)^T k_1(x, x_j) \quad (21) $$

is an inner product of the minor kernels and hence
positive semi-definite, therefore qualifying as a valid
RKHS kernel.
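A short sketch of this construction (NumPy, assuming the minor kernel is evaluated on the training set itself): the hidden-space image of each sample is the vector of its minor-kernel values against all training samples, eq. (17), and the HS-SVM kernel of eq. (21) is the plain inner product of such vectors, so the resulting Gram matrix is PSD even when the minor kernel is not.

import numpy as np

def hidden_space_gram(K1):
    """Given the minor-kernel Gram matrix K1 with K1[n, i] = k1(x_n, x_i), return the
    HS-SVM Gram matrix of eq. (21): K2[i, j] = sum_n k1(x_n, x_i) * k1(x_n, x_j)."""
    return K1.T @ K1                          # a matrix of inner products, hence PSD

# Indefinite minor kernel (sigmoid with illustrative parameters) on training data.
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 5))
K1 = np.tanh(0.4 * (X @ X.T) - 0.7)

K2 = hidden_space_gram(K1)
print(np.linalg.eigvalsh(0.5 * (K1 + K1.T)).min())   # may be negative: indefinite minor kernel
print(np.linalg.eigvalsh(0.5 * (K2 + K2.T)).min())   # >= 0 up to round-off: K2 is PSD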
In this work we further extend the above formulation and
redefine the HS-SVMs' kernel as a more general functional
of the form

$$ k_2(x_i, x_j) = k_2\left(k_1(x, x_i), k_1(x, x_j)\right) \quad (22) $$

so that two consecutive kernel mappings are applied, as shown in
Figure 1.
Depending on the choice of k1 and k2, different
mapping spaces can be introduced to handle specific
classification problems. For demonstration purposes we
will analyze some basic combinations of known SVM
kernels and justify their properties.
A. k2 linear composite kernels (initial HS-SVMs)
If we opt to use a linear second-stage kernel $k_2$, then the
resulting HS-SVM kernel becomes:

$$ k(x_i, x_j) = k_2\left(k_1(x, x_i), k_1(x, x_j)\right) = k_1(x, x_i)^T k_1(x, x_j) = \phi'(x_i)^T \phi'(x_j) \quad (23) $$

This model reduces to the original HS-SVMs proposed by
Li. In fact, this assertion holds irrespective of the choice
of $k_1$. Some earlier implementations of this family of
models include:

k1 linear - k2 linear:

$$ k(x_i, x_j) = k_2\left(k_1(x, x_i), k_1(x, x_j)\right) = k_1(x, x_i)^T k_1(x, x_j) = \langle x, x_i \rangle^T \langle x, x_j \rangle \quad (24) $$

k1 polynomial - k2 linear:

$$ k(x_i, x_j) = k_2\left(k_1(x, x_i), k_1(x, x_j)\right) = k_1(x, x_i)^T k_1(x, x_j) = \left(x^T x_i + 1\right)^p \left(x^T x_j + 1\right)^p, \quad p \in \mathbb{N} \quad (25) $$

Figure 1. Hidden space mappings for SVMs (top), HS-SVMs (middle) and generalized HS-SVMs (bottom).
k1 RBF - k2 linear:

$$ k(x_i, x_j) = k_2\left(k_1(x, x_i), k_1(x, x_j)\right) = k_1(x, x_i)^T k_1(x, x_j) = \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right)^T \exp\left(-\frac{\|x - x_j\|^2}{2\sigma^2}\right) $$

It has been shown that this class of composite kernels
achieves a sparser representation of the problem space,
in the sense that training results in a lower number of
support vectors.

B. k2 polynomial kernels:

Even with traditional SVMs, linear class separability is
seldom achievable. An alternative that balances
computational requirements with mapping power is in the
form of polynomial k2 kernels:

$$ k(x_i, x_j) = k_2\left(k_1(x, x_i), k_1(x, x_j)\right) = \left(k_1(x, x_i)^T k_1(x, x_j) + 1\right)^p, \quad p \in \mathbb{N} \quad (26) $$

Comparative results on the performance of this similarity
mapper are given in Section VI.
C. k2 RBF kernels:

In order to obtain maximum flexibility of the decision
boundaries, the prominent model is a parametrized RBF
(Gaussian) kernel. This scenario leads to the following
formulation:

$$ k(x_i, x_j) = k_2\left(k_1(x, x_i), k_1(x, x_j)\right) = \exp\left(-\frac{\left\|k_1(x, x_i) - k_1(x, x_j)\right\|^2}{2\sigma^2}\right) \quad (27) $$

where $\sigma$ is the function's spread/bandwidth parameter.

An additional benefit of using generalized HS-SVMs
lies in the ability to wrap and utilize indefinite minor
kernels. To this end we provide the derivations for the
sigmoid, Epanechnikov, negative distance and Mahalanobis
kernel functions. We implemented only the RBF as a second-stage
kernel since in practice it is the most commonly used
functional.
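The sketch below (our own illustrative NumPy code with assumed parameter values) applies the RBF second stage of eq. (27) to the hidden-space columns produced by an indefinite sigmoid minor kernel, which is the construction made explicit in the derivations that follow:

import numpy as np

def rbf_second_stage(K1, sigma=1.0):
    """Generalized HS-SVM kernel of eq. (27): K2[i, j] = exp(-||k1(., x_i) - k1(., x_j)||^2 / (2 sigma^2)),
    where column i of K1 holds the hidden-space coordinates k1(x_n, x_i), n = 1..N_trn."""
    sq_norms = np.sum(K1**2, axis=0)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * K1.T @ K1
    return np.exp(-np.maximum(sq_dists, 0.0) / (2 * sigma**2))

# Wrap an indefinite sigmoid minor kernel (illustrative parameters) with the RBF stage.
rng = np.random.default_rng(5)
X = rng.normal(size=(60, 5))
K1 = np.tanh(0.4 * (X @ X.T) - 0.7)

K2 = rbf_second_stage(K1, sigma=2.0)
print(np.linalg.eigvalsh(0.5 * (K2 + K2.T)).min())   # >= 0 up to round-off: admissible Gram matrix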
D. Sigmoid-RBF composite kernel:

$$ k_{sig\text{-}rbf}(x_i, x_j) = k_2\left(k_1(x, x_i), k_1(x, x_j)\right) = \exp\left(-\frac{\left\|\tanh\left(a\, x^T x_i + b\right) - \tanh\left(a\, x^T x_j + b\right)\right\|^2}{2\sigma^2}\right) $$

E. Epanechnikov-RBF composite kernel:

$$ k_{Ep\text{-}rbf}(x_i, x_j) = k_2\left(k_1(x, x_i), k_1(x, x_j)\right) = \exp\left(-\frac{\left\|\tfrac{3}{4}\left(1 - \|x_i - x\|^2\right) - \tfrac{3}{4}\left(1 - \|x - x_j\|^2\right)\right\|^2}{2\sigma^2}\right) $$

F. Negative Distance-RBF composite kernel:

$$ k_{ND\text{-}rbf}(x_i, x_j) = k_2\left(k_1(x, x_i), k_1(x, x_j)\right) = \exp\left(-\frac{\left\|\,\|x - x_j\| - \|x_i - x\|\,\right\|^2}{2\sigma^2}\right) $$

G. Mahalanobis-RBF composite kernel:

$$ k_{Mah\text{-}rbf}(x_i, x_j) = k_2\left(k_{Mah}(x, x_i), k_{Mah}(x, x_j)\right) = \exp\left(-\frac{\left\|k_{Mah}(x, x_i) - k_{Mah}(x, x_j)\right\|^2}{2\sigma^2}\right) $$

with $k_{Mah}$ as defined in (15).

In order to be able to utilize such complex kernels we
first have to prove their admissibility as RKHS mappings.
In general, the sufficient conditions for a function
$k(x_i, x_j)$ to correspond to a dot product in the feature
space $F$ are defined by the Mercer theorem:

$$ \int_X \int_X k(x_i, x_j) f(x_i) f(x_j) \, dx_i \, dx_j \geq 0 \quad (28) $$

where $f(x) \in L_2(X)$. However, direct evaluation of
the above expression is often infeasible.
In the context of this work, proofs of the admissibility of
the above functionals as SVM kernels are given in the
Appendix. The proofs are largely straightforward and rely
on the kernel properties defined in (Cristianini and Shawe-Taylor 2000).

From a broader perspective, certain SVM kernel
functions have intuitive meaning. RBF kernels are metrics
of the smooth weighted distances of a given sample $x_i$ to
all other samples in the same class. A standard HS-SVM
kernel function with $k_1$: RBF can then be regarded as a
second-order statistic that measures the similarity of the
distances of a given sample $x_i$ to the corresponding
distances of all other samples in the class. This can be used
as an indication of the cluster's overall compactness.
VI. EXPERIMENTAL RESULTS
In order to evaluate the performance of generalized HSSVM kernels we used known non-PSD kernels as minor
kernels and applied them to a series of benchmark datasets.
The first test scenario included indefinite sigmoid
kernels (Luss and d’Aspremont 2008) used on 3 datasets
(diabetes, german and a1a) from the UCI repository
(Newman, Hettich et al. 1998). Additional results on the
same datasets have also been reported in (Lin and Lin
2003).
The breast cancer diagnosis dataset (Wisconsin) contains
683 complete cytological tests described by 9 integer
attributes with values between 1 and 10. The outcome is a
binary variable indicating the benign or malignant nature of
the tumor.
The diabetes diagnosis dataset was first collected to
investigate whether patients show signs of diabetes
according to the World Health Organization criteria. The
population consists of female Pima Indians, aged 21 and
older, living near Phoenix, Arizona. It contains 768
instances, each described by 8 continuous variables and a
binary outcome variable.
The German credit dataset contains data used to evaluate
credit applications in Germany. It has 1000 cases. In the
version that we used, each case is described by 24
continuous attributes.
The second test scenario included …
We will outline two exemplary applications which can
benefit from the use of non-standard, non-positive-definite,
structure-based k1 kernels.
The classification of microarray data has become a
standard tool in many biological studies. Gene expressions
of distinct biological groups are compared and classified
according to their gene expression characteristics, for example
in tumor diagnosis (Golub TR, Slonim DK et al. 1999). Kernel
methods play an important role in such disease analyses;
SVMs and other kernel methods classify such data based on
the feature or marker genes that are correlated with the
characteristics of the groups. In most of those studies, only
standard kernels such as linear, polynomial, and RBF, which
take vectorial data as input, have been used, and they have
generally been successful.
Other than the above vectorial data kernel family, there
is another family called structured data kernel family that
has been studied in many other fields including
bioinformatics and machine learning. The structured data
kernel family conveys structural or topological information
with or without numerical data as input to describe data.
For example, the string kernel for text classification
(Lodhi H, Saunders C et al. 2002), the marginalized count
kernel (Tsuda K, Kin T et al. 2002) for biological
sequences, the diffusion kernel (Kondor and Lafferty 2002) and
the maximum entropy (ME) kernel (Tsuda K and WS
2004) for graph structures are well known in the biological
field.
In microarray analysis, one of the main issues that
hamper accurate and realistic predictions is the lack of
repeat experiments, often due to financial problems or
rarity of specimens such as minor diseases. Utilization of
public or old data together with one's current data could
solve this problem; many studies combining several
microarray datasets have been performed (Warnat P, Eils R
et al. 2005; Nilsson B, Andersson A et al. 2006). However,
due to the low gene overlaps and consistencies between
different datasets, the vectorial data kernels are often
unsuccessful in classifying data from various datasets if
naïvely integrated (Warnat P, Eils R et al. 2005). Part of the
solution may lie in devising new kernels that can handle
this type of problem (Fujibuchi and Kato 2007).
When using different available genes from each
contributing dataset, statistical measures are employed to
provide an invariant kernel representation. For example, we
can incorporate a kNND RBF gene distance metric as a
k1 kernel along with an ME k2 in an HS-SVM. Our aim is
to develop kernels that are robust to heterogeneous and
noisy gene expression data.
The classification of text based on term presence and
proximity is also an evolving research field especially in
relation to the improvement of web search results and
document categorization [REF]. Standard methods such as
the bag-of-words model [REF] actually create a histogram of word
frequencies that can be used by established vectorial
kernels (linear, polynomial, RBF). These methods however
disregard the relative location of the terms, thus providing a
suboptimal classification solution.
In order to incorporate structural information of this kind
we have to resort to metrics that might not comply with
the Mercer conditions and the equivalent positive
semi-definiteness of the kernel matrix. Once again, using them inside the
HS-SVM context we can, without a significant increase in
complexity, benefit from a more suitable invariance mapper.
Both datasets 2 and 3 had normalized predictors in the
[0, 1] range and were available in 100x stratified
randomizations. The dimensionality and features of the
three used datasets are shown in Table I.
TABLE I
COMPARISON OF SVM CLASSIFIERS' ACCURACY
(Rows: composite kernel combinations k1/k2, with k1 in {lin, poly, rbf, sigmoid, Epanechnikov, Negative Dist, Mahalanobis} and k2 in {lin, poly, rbf}; columns: microarray and web-search datasets.)
TABLE II
COMPARISON OF SVM CLASSIFIERS' #SVS
(Rows: composite kernel combinations k1/k2 as in Table I; columns: microarray and web-search datasets.)
TABLE III
COMPARISON OF SVM CLASSIFIERS' CONCORDANCE INDEX
(Rows: composite kernel combinations k1/k2 as in Table I; columns: microarray and web-search datasets.)
VII. CONCLUSIONS AND SUMMARY
This paper showed how the existing methodology of HSSVMs can be extended to a more general class of kernel
functions and applied to real world problems such as DNA
sequencing.
The methodology presented here is not strictly restricted
to SVMs. Due to the nature of these algorithms the derived
functional can be used as modules in other kernel methods
including K-PCA.
APPENDIX
Admissibility proofs of generalized HS-SVMs kernels.
In order to derive the proofs that the functionals proposed in equations (23), (26), and (27) are admissible kernels, we make
use of some basic properties of RKHS kernels, which are analyzed in (Cristianini and Shawe-Taylor 2000) (section 3.3.2).
For simplicity we list them here as well.
We assume that $k_1$ and $k_2$ are valid kernels over the set $X \times X$, $X \subseteq \mathbb{R}^d$, that $a \in \mathbb{R}^+$, and that $f(\cdot)$ is a real-valued
function on $X$. Then the following functions are also valid kernels:

1. $k(x_i, x_j) = k_1(x_i, x_j) + k_2(x_i, x_j)$
2. $k(x_i, x_j) = a\, k_1(x_i, x_j)$
3. $k(x_i, x_j) = k_1(x_i, x_j)\, k_2(x_i, x_j)$
4. $k(x_i, x_j) = f(x_i)\, f(x_j)$
5. $k(x_i, x_j) = k_3\left(\phi(x_i), \phi(x_j)\right)$
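A quick numerical sanity check of these closure properties (a sketch that uses eigenvalue non-negativity, up to round-off, as the PSD test): starting from two PSD Gram matrices, their sum, positive scaling, elementwise product, and elementwise exponential all remain PSD, which is what the derivations below rely on.

import numpy as np

def min_eig(K):
    """Smallest eigenvalue of the symmetrized matrix (PSD iff >= 0 up to round-off)."""
    return np.linalg.eigvalsh(0.5 * (K + K.T)).min()

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 6))
K1 = X @ X.T                                                              # linear kernel: PSD
K2 = np.exp(-np.sum((X[:, None, :] - X[None, :, :])**2, axis=2) / 2.0)    # RBF kernel: PSD

for name, K in [("sum (prop. 1)", K1 + K2),
                ("scaling (prop. 2)", 3.0 * K1),
                ("elementwise product (prop. 3)", K1 * K2),
                ("elementwise exp (cf. eq. 29)", np.exp(K2))]:
    print(name, min_eig(K) >= -1e-8)                                      # expected True in every case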
Regarding the kernels introduced here, the following derivations hold:

i. Assuming $k_1(x_i, x_j)$ is a valid kernel, for equation (26) we have:

$$ k(x_i, x_j) = k_2\left(k_1(x, x_i), k_1(x, x_j)\right) = \left(k_1(x, x_i)^T k_1(x, x_j) + 1\right)^p = \left(k_3(x_i, x_j) + 1\right)^p = k_4(x_i, x_j)^p $$

where $k_3$ is a kernel by property (3) and $k_4$ by property (1). Finally, $k$ is a valid kernel as a power of $k_4$ by property (3).

ii. Similarly, for equation (27) we have:

$$ k(x_i, x_j) = k_2\left(k_1(x, x_i), k_1(x, x_j)\right) = \exp\left(-\frac{\left\|k_1(x, x_i) - k_1(x, x_j)\right\|^2}{2\sigma^2}\right) = \exp\left(-\frac{k_3(x_i, x_j)^2}{2\sigma^2}\right) = \exp\left(-\frac{1}{2\sigma^2}\, k_4(x_i, x_j)\right) = \exp\left(k_5(x_i, x_j)\right) $$

where $k_3$ is a valid kernel by property (1) and $k_4$ is also one by property (3). Then $k_5$ is also a valid kernel by property
(2). Using the exponent's infinite series expansion

$$ e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots + \frac{x^n}{n!}, \quad x \in \mathbb{R}, \; n \to \infty $$

we get

$$ \exp\left(k_5(x_i, x_j)\right) = 1 + k_5(x_i, x_j) + \frac{k_5(x_i, x_j)^2}{2!} + \dots + \frac{k_5(x_i, x_j)^n}{n!}, \quad n \to \infty \quad (29) $$

which is an admissible RKHS kernel by properties (3) and (1).
REFERENCES
Cristianini, N., J. Kandola, et al. (2002). "On kernel target
alignment." Journal of Machine Learning
Research 1.
Cristianini, N. and J. Shawe-Taylor (2000). An introduction
to Support Vector Machines and other kernel-based
learning methods. Cambridge; New York,
Cambridge University Press.
Eichhorn, J. and O. Chapelle (2004). Object categorization
with SVM: Kernels for Local Features, Max
Planck Institute for Biological Cybernetics. 137.
Fujibuchi, W. and T. Kato (2007). "Classification of
heterogeneous microarray data by maximum
entropy kernel." BMC Bioinformatics 8.
Golub TR, Slonim DK, et al. (1999). "Molecular
Classification of Cancer: Class Discovery and
Class Prediction by Gene Expression Monitoring."
Science 286: 531-537.
Kondor, R. and J. Lafferty (2002). Diffusion kernels on graphs and
other discrete structures. 19th Intl Conf on
Machine Learning (ICML), San Francisco, CA,
Morgan Kaufmann.
Li, Q. and J. S. Racine (2007). Nonparametric
econometrics : theory and practice. Princeton,
N.J., Princeton University Press.
Li, Z., Z. Weida, et al. (2004). "Hidden space support
vector machines." Neural Networks, IEEE
Transactions on 15(6): 1424-1434.
Lin, H.-t. and C.-J. Lin (2003). A Study on Sigmoid
Kernels for SVM and the Training of non-PSD
Kernels by SMO-type Methods, National Taiwan
University.
Lodhi H, Saunders C, et al. (2002). "Text classification
using string kernels." The Journal of Machine
Learning Research 2: 419-444.
Luss, R. and A. d’Aspremont (2008). "Support Vector
Machine Classification with Indefinite Kernels."
Newman, D. J., S. Hettich, et al. (1998). "UCI Repository
of machine learning databases." Irvine, CA, from
www.ics.uci.edu/~mlearn/.
Nilsson B, Andersson A, et al. (2006). "Cross-platform
classification in microarray-based leukemia
diagnostics." Haematologica 6(91): 821-824.
Ong, C. S., S. Canu, et al. (2004). Learning with Non-Positive
Kernels. In Proc. of the 21st International
Conference on Machine Learning (ICML).
Saigo, H., J. Vert, et al. (2004). "Protein homology
detection using string alignment kernels."
Bioinformatics 20(11): 1682-9.
Schölkopf, B. and A. J. Smola (2002). Learning with
kernels: support vector machines, regularization,
optimization, and beyond. Cambridge, Mass., MIT
Press.
Schölkopf, B., J. Weston, et al. (2002). A Kernel Approach
for Learning From Almost Orthogonal Patterns.
13th European Conference on Machine Learning,
Helsinki.
Tsuda, K., T. Kin, et al. (2002). "Marginalized kernels for
biological sequences." Bioinformatics 18: 268-275.
Tsuda, K. and W. S. Noble (2004). "Learning kernels from
biological networks by maximizing entropy."
Bioinformatics 20: 326-333.
Vapnik, V. (1995). The Nature of Statistical Learning
Theory. N.Y., Springer.
Warnat P, Eils R, et al. (2005). "Cross-platform analysis of
cancer microarray data improves gene expression
based classification of phenotypes." BMC
Bioinformatics 6.