Lecture 8

CIS526: Machine Learning
Lecture 8 (Oct 21, 2003)
Preparation help: Bo Han
Support Vector Machines
Last week, we discussed the construction of support vector machines on linearly separable
training data. This is a very strong assumption that is unrealistic in most real-life applications.
Question: What should we do if the training data set is not linearly separable?
[Figure: a two-dimensional training set that is not linearly separable]
Solution: Introduce slack variables $\xi_i$, $i = 1, 2, \dots, N$, to relax the constraint
$y_i (w^T x_i + b) \ge 1$ to $y_i (w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$. Ideally, one would prefer all slack
variables to be zero, which would correspond to the linearly separable case. Therefore, the
optimization problem for construction of an SVM on linearly nonseparable data is defined as:

find w and b that minimize

    $\frac{1}{2} \|w\|^2 + C \sum_i \xi_i^2$

subject to

    $y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \forall i$

where $C > 0$ is an appropriately selected parameter. The additional term $C \sum_i \xi_i^2$ forces all
slack variables to be as close to zero as possible.
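At the optimum each slack variable takes the value $\xi_i = \max(0,\ 1 - y_i(w^T x_i + b))$, so the primal
above can also be minimized directly as an unconstrained problem. Below is a minimal Python sketch of
that view (not part of the original notes): plain gradient descent on the resulting squared-hinge
objective. The toy data, step size, and iteration count are illustrative assumptions.

# Minimal sketch: gradient descent on 0.5*||w||^2 + C * sum_i max(0, 1 - y_i(w^T x_i + b))^2,
# which is the primal above with the slack variables eliminated.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 3.0], [2.5, 1.5],    # class +1 (toy data)
              [4.0, 1.0], [5.0, 2.0], [3.0, 0.5]])   # class -1 (toy data)
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
C, lr, n_iter = 1.0, 0.005, 10000                    # illustrative settings

w, b = np.zeros(X.shape[1]), 0.0
for _ in range(n_iter):
    margins = y * (X @ w + b)                  # m_i = y_i (w^T x_i + b)
    slack = np.maximum(0.0, 1.0 - margins)     # xi_i = max(0, 1 - m_i)
    grad_w = w - 2.0 * C * (slack * y) @ X     # gradient w.r.t. w
    grad_b = -2.0 * C * np.sum(slack * y)      # gradient w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

print("w =", np.round(w, 3), " b =", round(b, 3), " slacks =", np.round(slack, 3))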
Dual problem: As in the linearly separable case, this optimization problem can be converted to
its dual:

find $\alpha$ that maximizes

    $\sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i^T x_j$

subject to

    $\sum_{i=1}^{N} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad \forall i$
NOTE: The consequence of introducing the parameter C is that it constrains the range of
acceptable values of the Lagrange multipliers $\alpha_i$. The most appropriate choice of C will depend
on the specific data set available.
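For concreteness, here is a minimal Python sketch (not part of the original notes) of the
box-constrained dual stated above, solved with scipy's general-purpose SLSQP optimizer. The toy
data and the choice C = 1 are illustrative assumptions.

# Minimal sketch: solve  max  sum_i alpha_i - 0.5 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
# subject to sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C, with a generic constrained optimizer.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],    # toy training data
              [3.0, 1.0], [4.0, 3.0], [3.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)     # labels in {-1, +1}
N, C = len(y), 1.0

G = (y[:, None] * X) @ (y[:, None] * X).T            # G_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):
    # Negative dual objective (minimizing it maximizes the dual)
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0.0, C)] * N                                  # 0 <= alpha_i <= C
res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x

w = ((alpha * y)[:, None] * X).sum(axis=0)               # w = sum_i alpha_i y_i x_i
on_margin = (alpha > 1e-6) & (alpha < C - 1e-6)          # support vectors with 0 < alpha_i < C
b = np.mean(y[on_margin] - X[on_margin] @ w) if on_margin.any() else 0.0
print("alpha =", np.round(alpha, 3), " w =", np.round(w, 3), " b =", round(b, 3))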
SVM: Generalized solution
Problem: Support vector machines represented with a linear function f(x) (i.e. a separating
hyperplane) have very limited representational power. As such, they would not be very useful in
many practical classification problems.
Good News: With a slight modification, SVM could solve highly nonlinear classification problems!!

[Figure: data in the (X1, X2) plane whose two classes cannot be separated by a single line]
Justification: Cover's Theorem
• Suppose that data set D is nonlinearly separable in the original attribute space. The attribute
space can be transformed into a new attribute space where D is linearly separable!
• Caveat: Cover's Theorem only proves the existence of a transformed attribute space that could
solve the nonlinear problem. It does not provide guidelines for the construction of the attribute
transformation!
Example 1. XOR problem
[Figure: XOR-type data in the (X1, X2) plane; the two classes cannot be separated by a line in
the original attributes]

By constructing a new attribute

    X1' = X1 · X2

the XOR problem becomes linearly separable by the new attribute X1'.
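A small numerical check of this construction, assuming the two attributes are coded as +1/-1 (an
illustrative coding that the notes do not specify):

# XOR with +/-1 coding: the product attribute X1' = X1 * X2 takes the value +1 on one class
# and -1 on the other, so a threshold at 0 on X1' separates the two classes.
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
y = np.array([-1, 1, 1, -1])                 # XOR labels: +1 iff the signs differ
x1_prime = X[:, 0] * X[:, 1]                 # new attribute X1' = X1 * X2
print(x1_prime)                              # [ 1 -1 -1  1]
print(np.all(np.sign(-x1_prime) == y))       # True: a linear rule on X1' reproduces the labels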
Example 2. Taylor expansion.
The value of a multidimensional function F(x) at a point x can be approximated as

    $F(x) \approx F(x_0) + \nabla F(x_0)^T (x - x_0) + \tfrac{1}{2} (x - x_0)^T \nabla^2 F(x_0) (x - x_0) + O(\|x - x_0\|^3)$

Therefore, F(x) can be considered as a linear combination of complex attributes derived from
the original ones,

    $F(x) = F(x_0) + \sum_{i=1}^{m} a_i x_i + \sum_{i,j=1}^{m} a_{ij} x_i x_j + \sum_{i,j,k=1}^{m} a_{ijk} x_i x_j x_k + O(\|x - x_0\|^3)$

where the product terms $x_i x_j$ and $x_i x_j x_k$ play the role of new attributes.
Example 3. Second order monomials derived from the original two-dimensional
attribute space
    $(x_1, x_2) \mapsto (z_1, z_2, z_3) = (x_1^2, \ \sqrt{2}\, x_1 x_2, \ x_2^2)$
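A quick numerical check, with arbitrary example vectors, that this mapping (denote it by phi) turns
a scalar product in the new space into a simple function of the scalar product in the original
space, $\phi(x)^T \phi(z) = (x^T z)^2$; this is exactly the property exploited by the kernel trick below.

# Verify phi(x)^T phi(z) = (x^T z)^2 for the second-order monomial mapping of Example 3.
import numpy as np

def phi(v):
    # (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.5, -0.5])
z = np.array([2.0, 3.0])
print(phi(x) @ phi(z))        # 2.25, computed in the transformed space
print((x @ z) ** 2)           # 2.25, computed directly in the original space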
Example 4. Fifth order monomials derived from the original 256-dimensional attribute
space
There are $\binom{5 + 256 - 1}{5} \approx 10^{10}$ such monomials, which is an extremely high-dimensional
attribute space!!
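The count can be reproduced directly; the binomial coefficient counts the multisets of 5 attribute
indices chosen from the 256 attributes (a short check, not part of the original notes):

# Number of fifth-order monomials over 256 attributes: C(5 + 256 - 1, 5)
import math
print(math.comb(5 + 256 - 1, 5))    # 9525431552, i.e. roughly 10^10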
SVM and curse-of-dimensionality:
• If the original attribute space is transformed into a very high-dimensional space, the
likelihood of being able to solve the nonlinear classification problem increases. However, one is
likely to quickly encounter the curse-of-dimensionality problem.
• The strength of SVM lies in the theoretical justification that margin maximization is an
effective mechanism for alleviating the curse-of-dimensionality problem (i.e. the SVM is the
simplest classifier that solves the given classification problem). Therefore, SVMs are able to
successfully solve classification problems with extremely high attribute dimensionality!!
SVM solution of classification:
Denote by $\Phi : \mathbb{R}^M \to F$ a mapping from the original M-dimensional attribute space to the
high-dimensional attribute space F.
By solving the following dual problem

find $\alpha$ that maximizes

    $\sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, \Phi(x_i)^T \Phi(x_j)$

subject to

    $\sum_{i=1}^{N} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad \forall i$

the resulting SVM is of the form

    $f(x) = w^T \Phi(x) + b = \sum_{i=1}^{N} \alpha_i y_i \, \Phi(x_i)^T \Phi(x) + b$
Practical Problem: Although SVMs are successful in dealing with high-dimensional attribute
spaces, the fact that SVM training scales linearly with the number of attributes, together with
the limited memory space, can largely limit the choice of mapping $\Phi$.
Solution: Kernel Trick
It allows computing the scalar products (e.g. $\Phi(x_i)^T \Phi(x)$) in the original attribute space. It
follows from Mercer's Theorem:
There is a class of mappings $\Phi$ that has the following property:

    $\Phi(x)^T \Phi(y) = K(x, y)$,

where K is a corresponding kernel function.
Examples of kernel functions:
• Gaussian Kernel: $K(x, y) = e^{-\|x - y\|^2 / A}$, where A is a constant
• Polynomial Kernel: $K(x, y) = (x^T y + 1)^B$, where B is a constant
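Minimal Python sketches of the two kernels above; the constants A and B and the example vectors
are arbitrary illustrative choices.

import numpy as np

def gaussian_kernel(x, y, A=1.0):
    # K(x, y) = exp(-||x - y||^2 / A)
    return np.exp(-np.sum((x - y) ** 2) / A)

def polynomial_kernel(x, y, B=2):
    # K(x, y) = (x^T y + 1)^B
    return (x @ y + 1.0) ** B

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(gaussian_kernel(x, y, A=4.0), polynomial_kernel(x, y, B=3))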
By introducing the kernel trick, the dual problem becomes:

find $\alpha$ that maximizes

    $\sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, K(x_i, x_j)$

subject to

    $\sum_{i=1}^{N} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad \forall i$

The resulting SVM is:

    $f(x) = w^T \Phi(x) + b = \sum_{i=1}^{N} \alpha_i y_i \, K(x_i, x) + b$
Some practical issues with SVM
• Modeling choices: When using some of the available SVM software packages or toolboxes, a
user should choose (1) the kernel function (e.g. Gaussian kernel) and its parameter(s) (e.g. the
constant A), and (2) the constant C related to the slack variables. Several choices should be
examined using a validation set in order to find the best SVM (see the model-selection sketch
after this list).
• SVM training does not scale well with the size of the training data (i.e. it scales as O(N^3)).
There are several solutions that offer a speed-up of the original SVM algorithm:
o chunking: start with a subset of D, build an SVM, apply it to all data, add "problematic"
data points to the training subset, remove "nice" points, and repeat.
o decomposition: similar to chunking, but the size of the subset is kept constant.
o sequential minimal optimization (SMO): an extreme version of chunking in which only 2
data points are used in each iteration.
• SVM-based solutions exist for problems outside binary classification:
o multi-class classification problems
o SVM for regression
o Kernel PCA
o Kernel Fisher discriminant
o Clustering
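As a concrete illustration of the model-selection point in the first bullet above, here is a
hedged sketch using scikit-learn (an external package, assumed available; the toy data and the
parameter grid are arbitrary choices): a Gaussian-kernel SVM whose constants C and gamma (playing
the role of 1/A in the Gaussian kernel above) are chosen by cross-validated grid search.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # a nonlinear toy problem

param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.1, 1.0, 10.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))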
Clustering
One of the most popular unsupervised learning problems.
Broad Definition: Grouping a collection of objects into subsets (or clusters) such that the
objects assigned to the same cluster are more closely related to one another than to objects
assigned to different clusters.
Examples of clustering applications
• Marketing: discover distinctive customer groups and develop targeted marketing programs
• Image segmentation: identification of areas with similar land use from satellite data
• Portfolio optimization: discover groups of stock time series with different behaviors
• Gene expression data analysis: discover genes with similar expression profiles
• Dimensionality reduction: represent each object based on its cluster membership (instead of
on the original attributes)
[Figure: scatter plot of two-dimensional data grouped into Cluster 1 and Cluster 2]
Types of data suitable for clustering
• Standard data matrix representation D = {x_i, i = 1, …, N}, where x_i is an M-dimensional
attribute vector and N is the number of data points;
• Proximity (or similarity) matrix D = {d_ij, i, j = 1, …, N}, where d_ij is the "similarity" between
data points i and j. Most often, the proximity matrix is symmetric, d_ij = d_ji, and the triangle
inequality is satisfied, d_ij ≤ d_ik + d_kj. Sometimes the triangle inequality is not satisfied
(e.g. when human judgment is used to compare the similarity between objects).
Types of clustering algorithms
• Partitioning algorithms (e.g. K-means clustering; see the sketch after this list)
• Hierarchical algorithms (create a hierarchical decomposition of the objects)
• Probabilistic algorithms (assume that clusters satisfy some probability distribution)
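As a concrete example of the partitioning family, here is a minimal K-means sketch (not part of
the original notes), assuming the standard data matrix representation and Euclidean distance; the
toy data, K = 2, and the fixed iteration count are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-2.0, 0.0], size=(50, 2)),    # toy cluster 1
               rng.normal(loc=[3.0, 2.0], size=(50, 2))])    # toy cluster 2
K = 2

centers = X[rng.choice(len(X), size=K, replace=False)]       # random initial centers
for _ in range(20):
    # Assign each point to the nearest center (Euclidean distance)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each center as the mean of the points assigned to it
    centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])

print("cluster centers:\n", np.round(centers, 2))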
Similarity measures
The choice of the similarity measure is critical for the success of the clustering. It should be
selected to conform to the user's perception of the similarity between the objects.
Let us assume the data is available in the standard data matrix format. The most popular choices
for the similarity d_ij between objects x_i and x_j are:

• Euclidean distance:

    $d(x_i, x_j) = \sqrt{\sum_{k=1}^{M} (x_{ik} - x_{jk})^2}$

• Weighted Euclidean distance:

    $d(x_i, x_j) = \sqrt{\sum_{k=1}^{M} w_k (x_{ik} - x_{jk})^2}$

• Minkowski distance:

    $d(x_i, x_j) = \left( \sum_{k=1}^{M} |x_{ik} - x_{jk}|^R \right)^{1/R}$

• Correlation coefficient:

    $d(x_i, x_j) = \dfrac{\sum_{k=1}^{M} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{M} (x_{ik} - \bar{x}_i)^2} \, \sqrt{\sum_{k=1}^{M} (x_{jk} - \bar{x}_j)^2}}$, where $\bar{x}_i = \frac{1}{M} \sum_{k=1}^{M} x_{ik}$