Generalized Optimal Kernel-Based Ensemble Learning for Hyperspectral Classification Problems
Prudhvi Gurram, Heesung Kwon
Image Processing Branch
U.S. Army Research Laboratory
Outline
 Current Issues
 Sparse Kernel-Based Ensemble Learning (SKEL)
 Generalized Kernel-Based Ensemble Learning (GKEL)
 Simulation Results
 Conclusions
Current Issues
[Figure: sample hyperspectral data (visible + near IR, 210 bands), with example pixels labeled "Grass" and "Military vehicle"]
 High dimensionality of hyperspectral data vs. a small set of training samples (small targets): the curse of dimensionality
 The decision function of a classifier is overfitted to the small number of training samples
 Idea is to find the underlying discriminant structure, NOT the noisy nature of the data
 Goal is to regularize the learning so that the decision surface is robust to noisy samples and outliers
 Use ensemble learning
Kernel-Based Ensemble Learning (Suboptimal Technique)
 Idea is that not all the subsets are useful for the given task
 So select a small number of subsets useful for the task
 Sub-classifiers used: support vector machines (SVMs)
[Diagram: training data → random subsets of spectral bands → SVM 1, ..., SVM N with decision surfaces f1, ..., fN and decisions d1, ..., dN → majority voting over d1, d2, ..., dN → ensemble decision]
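A minimal sketch of this random-subspace ensemble, assuming scikit-learn and synthetic stand-in data (the band count matches the slides, but the sample size, subset size, and labels are illustrative only):

```python
# Sketch: N SVMs trained on random subsets of spectral bands,
# combined by majority voting over their decisions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, n_bands, n_svms, subset_size = 100, 210, 11, 30

# Synthetic stand-in for hyperspectral training data with labels in {-1, +1}.
X = rng.normal(size=(n_samples, n_bands))
y = np.where(X[:, :30].sum(axis=1) > 0, 1, -1)   # make the labels learnable

# Each sub-classifier (SVM) sees its own random subset of spectral bands.
subsets = [rng.choice(n_bands, size=subset_size, replace=False)
           for _ in range(n_svms)]
svms = [SVC(kernel="rbf", gamma="scale").fit(X[:, s], y) for s in subsets]

def ensemble_predict(X_new):
    """Majority voting over the sub-classifier decisions d_1, ..., d_N."""
    votes = np.stack([clf.predict(X_new[:, s]) for clf, s in zip(svms, subsets)])
    return np.sign(votes.sum(axis=0))            # N is odd, so no ties

print(ensemble_predict(X[:5]))
```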
Sparse Kernel-Based Ensemble Learning (SKEL)
 To find the useful subsets, SKEL was developed, built on the idea of multiple kernel learning (MKL)
 Jointly optimizes the SVM-based sub-classifiers in conjunction with their weights
 In the joint optimization, an L1 constraint is imposed on the weights to make them sparse
[Diagram: training data → random subsets of features (random bands) → SVM 1, ..., SVM N with decision functions f1, ..., fN feeding a combined kernel matrix; MKL assigns sparse weights, e.g. d1 = 0, d2 = 0.2, d3 = 0, ..., dN = 0.1, leaving only the optimal subsets useful for the given task]

Gaussian kernel: k(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)

MKL (sparsity): \sum_m d_m = 1, \; d_m \ge 0 (L1-norm constraint)
Optimization Problem
 Optimization problem (multiple kernel learning, Rakotomamonjy et al.):

\min_{\{f_m\}, b, d} \; \frac{1}{2} \sum_m \frac{1}{d_m} \|f_m\|_{H}^2
s.t. \; y_i \Big( \sum_m f_m(x_i) + b \Big) \ge 1 \;\; \forall i
\sum_m d_m = 1, \; d_m \ge 0 \;\; \forall m

f_m: kernel-based decision function
d_m: weighting coefficient
L1 norm → sparsity
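To make the combined-kernel building block concrete, here is a minimal numpy sketch: per-subset Gaussian kernel matrices combined with weights on the simplex. The weights are fixed by hand purely for illustration; SKEL learns them jointly with the SVMs, which this sketch does not attempt.

```python
# Sketch: combined kernel K = sum_m d_m K_m with L1-constrained weights.
import numpy as np

def rbf_kernel(X, bands, sigma=1.0):
    """Gaussian kernel k(x, x') = exp(-||x - x'||^2 / 2 sigma^2),
    restricted to one subset of spectral bands."""
    Xs = X[:, bands]
    sq = ((Xs[:, None, :] - Xs[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 210))                 # toy stand-in for 210-band data
subsets = [rng.choice(210, size=30, replace=False) for _ in range(4)]
kernels = [rbf_kernel(X, s) for s in subsets]

d = np.array([0.0, 0.2, 0.0, 0.8])             # sparse weights on the simplex
assert np.isclose(d.sum(), 1.0) and (d >= 0).all()   # L1-norm constraint
K_combined = sum(dm * Km for dm, Km in zip(d, kernels))
print(K_combined.shape)                        # combined kernel matrix (50, 50)
```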
Generalized Kernel-Based Ensemble Learning (GKEL)
 SKEL
 SKEL is a useful classifier with improved performance
 However, there are some constraints in using SKEL:
 SKEL has to use a large number of initial SVMs to maximize the ensemble performance, which can cause a memory error due to the limited memory size
 The number of features selected has to be the same for all the SVMs, which also causes sub-optimality in choosing feature subspaces
 GKEL
 Relaxes the constraints of SKEL
 Uses a bottom-up approach: starting from a single classifier, sub-classifiers are added one by one until the ensemble converges, while a subset of features is optimized for each sub-classifier
Sparse SVM Problem
 GKEL is built on the sparse SVM problem* that finds optimal sparse features maximizing the margin of the hyperplane
 Primal optimization problem:

\min_{d \in D} \min_{w, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_i \xi_i
subject to \; y_i (\langle w, \tilde{x}_i \rangle + b) \ge 1 - \xi_i \; for all i

where \tilde{x} = x \odot d, \; \tilde{w} = w \odot d, \; \odot: elementwise product
D = \{ d \mid d_j \in \{0, 1\}, \; j = 1, \ldots, m \}, i.e. d is a binary vector, e.g. \{1, 0, 0, 1, \ldots, 0\}

 Goal is to find an optimal d resulting in an optimal \tilde{w} that maximizes the margin of the hyperplane

* Tan et al., "Learning sparse SVM for feature selection on very high dimensional datasets," ICML 2010
Dual Problem of Sparse SVM
 Using Lagrange multipliers and the KKT conditions, the primal problem can be converted to the dual problem
 The resulting mixed integer programming problem is NP-hard:

\max_{\alpha \in R^n} \min_{d \in D} \; \alpha^T e - \frac{1}{2} \alpha^T Y K(d) Y \alpha
subject to \; \alpha^T y = 0, \; 0 \le \alpha \le C

where \alpha: a vector of Lagrange multipliers
e: a vector of all ones
Y = diag(y_i)
K(d): kernel matrix based on the sparse feature vectors \tilde{x}_i = x_i \odot d

 Since there are a large number of different combinations of sparse features, the number of possible kernel matrices K(d) is huge
 Combinatorial problem!
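A small numpy sketch of these quantities on toy data (all sizes, sigma, and the values of alpha are arbitrary): it builds K(d) from masked feature vectors and evaluates the dual objective S(alpha, d) for one binary mask out of the combinatorially many.

```python
# Sketch: masked kernel K(d) and dual objective S(alpha, d).
import numpy as np

def K_of_d(X, d, sigma=1.0):
    """Kernel matrix K(d) on sparse feature vectors x_i * d (d binary)."""
    Xd = X * d
    sq = ((Xd[:, None, :] - Xd[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def S(alpha, d, X, y, sigma=1.0):
    """S(alpha, d) = alpha^T e - 0.5 alpha^T Y K(d) Y alpha, with Y = diag(y)."""
    Ya = y * alpha
    return alpha.sum() - 0.5 * Ya @ K_of_d(X, d, sigma) @ Ya

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))
y = rng.integers(0, 2, size=20) * 2.0 - 1
alpha = rng.uniform(0, 1, size=20)
d = np.array([1, 0, 0, 1, 1, 0, 1, 0], dtype=float)  # one of 2^8 - 1 masks
print(S(alpha, d, X, y))
```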
Relaxation into QCLP
 To make the mixed integer problem tractable, relax it into Quadratically Constrained Linear Programming (QCLP)
 The objective function S(\alpha, d) is converted into inequality constraints lower-bounded by a real value t:

\max_{\alpha \in R^n, t \in R} \; t
subject to \; \alpha^T y = 0, \; 0 \le \alpha \le C
t \le S(\alpha, d^l) \;\; \forall d^l \in D
where S(\alpha, d) = \alpha^T e - \frac{1}{2} \alpha^T Y K(d) Y \alpha

 Since the number of possible K(d) is huge, so is the number of constraints; therefore the QCLP problem is still hard to solve
 But among the many constraints, most are not actively used in solving the optimization problem
 Goal is to find the small number of constraints that are actively used
Illustrative Example
 Suppose an optimization problem with a large number of inequality constraints (e.g. an SVM)
 Among the many constraints, most are not used to find the feasible region and an optimal solution
 Only a small number of active constraints are used to find the feasible region
[Figure: feasible region bounded by half-planes aX + b ≥ 0 and aX + b < 0]
 Use a technique called the restricted master problem that finds the active constraints by identifying the most violated constraints one by one iteratively:
 Find the first most violated constraint
 Based on the previously found constraints, find the next most violated constraint
 Continue the iterative search until no violated constraints are found
(Yisong Yue, "Diversified Retrieval as Structured Prediction," ICML 2008)
Flow Chart
 Flow chart of the QCLP problem based on the restricted master problem:

1. Initialize (t_0, \alpha_0) with \alpha_0 = 1/N and I = \emptyset
2. Given (t_{i-1}, \alpha_{i-1}), find \hat{d}_i: the subset of features that maximally violates t \le S(\alpha, \hat{d}), i.e. \min_d S(\alpha, d), equivalently \max_d M(d) = \frac{1}{2} \alpha^T Y K(d) Y \alpha
3. If S(\alpha_{i-1}, \hat{d}_i) \ge t_{i-1}, no constraint is violated: terminate
4. Otherwise set I = I \cup \{\hat{d}_i\} and update (t_i, \alpha_i) by solving

\max_{\alpha \in R^n, t \in R} \; t
subject to \; \alpha^T y = 0, \; 0 \le \alpha \le C, \; t \le S(\alpha, \hat{d}^l) \;\; \forall \hat{d}^l \in I
where S(\alpha, d) = \alpha^T e - \frac{1}{2} \alpha^T Y K(d) Y \alpha
(I: restricted set of sparse features)

5. Repeat from step 2
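A hedged sketch of this loop on toy data, assuming cvxpy is available to solve the restricted master problem. For illustration the feature-mask set D is kept tiny so the most violated constraint can be found by brute force; the actual method searches for d̂ greedily, as described on the next slide.

```python
# Sketch: restricted-master-problem (cutting-plane) loop for the QCLP relaxation.
import itertools
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m, C = 20, 4, 10.0
X = rng.normal(size=(n, m))
y = rng.integers(0, 2, size=n) * 2.0 - 1          # labels in {-1, +1}

def K_of_d(d, sigma=1.0):
    """Gaussian kernel on masked feature vectors x_i * d."""
    Xd = X * d
    sq = ((Xd[:, None, :] - Xd[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def S(alpha, d):
    """S(alpha, d) = alpha^T e - 0.5 alpha^T Y K(d) Y alpha."""
    Ya = y * alpha
    return alpha.sum() - 0.5 * Ya @ K_of_d(d) @ Ya

# All nonzero binary masks; tiny here, so argmin_d can be brute-forced.
D = [np.array(b, dtype=float) for b in itertools.product([0, 1], repeat=m)][1:]

alpha_v, t_v, I = np.full(n, 1.0 / n), np.inf, []
for _ in range(len(D)):
    d_hat = min(D, key=lambda d: S(alpha_v, d))   # most violated constraint
    if S(alpha_v, d_hat) >= t_v:                  # nothing violated: terminate
        break
    I.append(d_hat)                               # I = I ∪ {d_hat}
    alpha, t = cp.Variable(n), cp.Variable()      # restricted master problem
    cons = [y @ alpha == 0, alpha >= 0, alpha <= C]
    for d in I:
        Q = cp.psd_wrap(np.outer(y, y) * K_of_d(d))   # Y K(d) Y
        cons.append(t <= cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q))
    cp.Problem(cp.Maximize(t), cons).solve()
    alpha_v, t_v = alpha.value, t.value

print(len(I), "active constraints; t =", t_v)
```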
Most Violated Features
 \min_d S(\alpha, d) \;\Leftrightarrow\; \max_d M(d) = \frac{1}{2} \alpha^T Y K(d) Y \alpha \quad (the d attaining the maximum most violates t \le S(\alpha, d))
 Linear kernel:
- Calculate M(d_i) for each feature f_i separately and select the features with the top values
- Does not work for non-linear kernels
 Non-linear kernel (e.g. Gaussian RBF): individual feature ranking no longer works because the kernel exploits non-linear correlations among all the features
- Calculate M(d_{\setminus i}) for each i, where d_{\setminus i} selects all the features except the i-th feature
- Eliminate the least contributing feature
- Repeat the elimination until a threshold condition is met (e.g. stop the iteration if the change in M(d) exceeds 30%)
- This yields variable-length feature subsets for different SVMs
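A sketch of this backward-elimination search on toy data. The slide leaves the exact stopping rule open; here the 30% change is assumed to be measured relative to M(d) with all features selected, and alpha is fixed at random values purely for illustration.

```python
# Sketch: greedy backward elimination of the least contributing features
# under a non-linear (Gaussian RBF) kernel.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 12))                 # toy data: 30 samples, 12 features
y = rng.integers(0, 2, size=30) * 2.0 - 1
alpha = rng.uniform(0, 1, size=30)            # fixed Lagrange multipliers

def M(d, sigma=1.0):
    """M(d) = 0.5 alpha^T Y K(d) Y alpha with a Gaussian RBF kernel."""
    Xd = X * d
    sq = ((Xd[:, None, :] - Xd[None, :, :]) ** 2).sum(-1)
    Ya = y * alpha
    return 0.5 * Ya @ np.exp(-sq / (2 * sigma ** 2)) @ Ya

d = np.ones(X.shape[1])                       # start with all features selected
M_full = M(d)
while d.sum() > 1:
    active = np.flatnonzero(d)
    scores = []
    for i in active:                          # M with feature i removed
        d_minus = d.copy()
        d_minus[i] = 0.0
        scores.append(M(d_minus))
    best = int(np.argmax(scores))             # least contributing feature
    if abs(M_full - scores[best]) / abs(M_full) > 0.30:
        break                                 # change in M(d) exceeds 30%: stop
    d[active[best]] = 0.0                     # eliminate it and repeat
print("selected feature mask:", d.astype(int))
```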
How GKEL Works
[Diagram: SVM 1, SVM 2, ..., SVM N added one by one, SVM m trained with (\alpha_i, w_{mi}) on its own feature subset]
 Start with \hat{d}_0 and I = \{\hat{d}_0\}, then grow the set one subset at a time: \{\hat{d}_0, \hat{d}_1\}, \{\hat{d}_0, \hat{d}_1, \hat{d}_2\}, \ldots, \{\hat{d}_0, \ldots, \hat{d}_{N-1}\}
 \hat{d}_i \ne \hat{d}_j: each sub-classifier uses a different feature subset
 I = \{\hat{d}_0, \hat{d}_1, \ldots, \hat{d}_N\}: selected features (variable lengths)
 W = \{w_1, w_2, \ldots, w_N\}: weights
 A bottom-up approach is used
Images for Performance Evaluation
Hyperspectral images (HYDICE) (210 bands, 0.4 – 2.5 microns)
[Images: Forest Radiance I and Desert Radiance II scenes, with training samples marked]
Performance Comparison (FR I)
[Detection maps: single SVM (Gaussian kernel) vs. SKEL (10 initial SVMs optimized to 2, Gaussian kernel) vs. GKEL (3 SVMs, Gaussian kernel)]
ROC Curves (FR I)
 Since each SKEL run uses different random subsets of spectral bands, 10 SKEL runs were
used to generate 10 ROC curves
Performance Comparison (DR II)
[Detection maps: single SVM (Gaussian kernel) vs. SKEL (10 initial SVMs optimized to 2, Gaussian kernel) vs. GKEL (3 SVMs, Gaussian kernel)]
Performance Comparison (DR II)
10 ROC curves from 10 SKEL runs, each run with different random subsets of spectral bands
Performance Comparison (Spambase Data)
 Data downloaded from the UCI machine learning repository: the Spambase data set, used to predict whether an email is spam or not
SKEL: 25 initial SVMs; 12 after optimization
GKEL: 14 SVMs with nonzero weights
Conclusions
 SKEL and a generalized version of SKEL (GKEL) have been introduced
 SKEL starts from a large number of initial SVMs, which is then optimized down to a small number of SVMs useful for the given task
 GKEL starts from a single SVM, and individual classifiers are added one by one optimally to the ensemble until the ensemble converges
 GKEL and SKEL generally perform better than a regular SVM
 GKEL performs as well as SKEL while using fewer resources (memory) than SKEL
Q&A
Optimally Tuning Kernel Parameters
 Prior to the L1 optimization, the kernel parameters of each SVM are optimally tuned
 A Gaussian kernel with a single bandwidth has been used, treating all the bands equally: suboptimal

k(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2) : Gaussian kernel (sphere kernel)

k(x, x') = \exp(-(x - x')^T \Sigma^{-1} (x - x')), \quad \Sigma = diag(\sigma_1^2, \sigma_2^2, \ldots, \sigma_L^2) : full-band diagonal Gaussian kernel

 Estimate the upper bound on the leave-one-out (LOO) error (the radius-margin bound):

f_{RM} = \frac{1}{l} \frac{R^2}{\gamma^2}, \quad R: the radius of the minimum enclosing hypersphere, \quad \gamma: the margin of the hyperplane

 Goal is to minimize the RM bound using the gradient descent technique, with \partial f_{RM} / \partial \sigma_p the gradient of f_{RM}
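A minimal sketch contrasting the two kernel forms above (the per-band bandwidths here are arbitrary; the slides tune them by gradient descent on the RM bound, which this sketch does not attempt):

```python
# Sketch: single-bandwidth "sphere" kernel vs. full-band diagonal kernel.
import numpy as np

def sphere_kernel(x, xp, sigma=1.0):
    """Single-bandwidth Gaussian kernel: all bands share one sigma."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def diag_kernel(x, xp, sigmas):
    """Diagonal Gaussian kernel: k(x, x') = exp(-(x - x')^T Sigma^{-1} (x - x'))
    with Sigma = diag(sigma_1^2, ..., sigma_L^2), one bandwidth per band."""
    return np.exp(-np.sum((x - xp) ** 2 / sigmas ** 2))

x  = np.array([0.1, 0.5, 0.9])
xp = np.array([0.2, 0.4, 1.1])
print(sphere_kernel(x, xp))
print(diag_kernel(x, xp, sigmas=np.array([0.5, 1.0, 2.0])))
```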
Ensemble Learning
[Diagram: sub-classifiers 1, 2, ..., N each output a decision in {-1, +1}; the decisions are summed (Σ) to form the ensemble decision]
 The performance of each classifier is better than a random guess, and the classifiers are independent of each other
 By increasing the number of classifiers, the performance is improved
 Result: a regularized decision function (robust to noise and outliers)
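A tiny numeric illustration of this claim, not drawn from the authors' experiments: for independent sub-classifiers that are each correct with probability p > 0.5, the majority vote's accuracy grows with the ensemble size N (a straightforward binomial computation).

```python
# Sketch: majority-vote accuracy of N independent classifiers of accuracy p.
from math import comb

def majority_accuracy(p, N):
    """P(majority of N independent classifiers is correct), N odd."""
    return sum(comb(N, k) * p**k * (1 - p)**(N - k)
               for k in range(N // 2 + 1, N + 1))

# Accuracy of the majority vote rises toward 1 as N grows (p = 0.6 here).
for N in (1, 5, 15, 51):
    print(N, round(majority_accuracy(0.6, N), 3))
```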
SKEL: Comparison (Top-Down Approach)
[Diagram: training data → random subsets of features (random bands) → SVM 1, ..., SVM N with decision functions f1, ..., fN → combination of decision results; MKL assigns sparse weights, e.g. d1 = 0, d2 = 0.2, d3 = 0, ..., dN = 0.1]

Gaussian kernel: k(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)

MKL (sparsity): \sum_m d_m = 1, \; d_m \ge 0 (L1-norm constraint)
Iterative Approach to Solve QCLP
 Due to the very large number of quadratic constraints, the QCLP problem is hard to solve
 So, take an iterative approach
 Iteratively update (t, \alpha) based on a limited number of active constraints
Each Iteration of QCLP
 The intermediate solution pair (t, \alpha) is therefore obtained from

\max_{\alpha \in R^n, t} \; t
subject to \; \alpha^T y = 0, \; 0 \le \alpha \le C, \; t \le S(\alpha, d^l) \;\; \forall d^l \in I
where S(\alpha, d) = \alpha^T e - \frac{1}{2} \alpha^T Y K(d) Y \alpha

Lagrangian:
L(t, \alpha, u) = t + \sum_l u_l (S(\alpha, d^l) - t), \quad u_l \ge 0

From the KKT condition \partial L / \partial t = 0:
\sum_l u_l = 1
Iterative QCLP vs. MKL
Lagrangian:
L(t, \alpha, u) = t + \sum_l u_l (S(\alpha, d^l) - t), \quad u_l \ge 0

From the KKT condition \partial L / \partial t = 0: \sum_l u_l = 1

\max_\alpha \min_u \; \sum_l u_l S(\alpha, d^l)
= \max_\alpha \min_u \; \alpha^T e - \frac{1}{2} \alpha^T Y \Big( \sum_l u_l K(d^l) \Big) Y \alpha
subject to \; \sum_l u_l = 1, \; u_l \ge 0

 i.e., each iteration of the QCLP has the same form as the MKL problem with the combined kernel \sum_l u_l K(d^l)
Variable Length Features

M(d) = \frac{1}{2} \alpha^T Y K(d) Y \alpha = \frac{1}{2} \|w\|^2 \quad for \; d = [1, 1, 1, \ldots, 1]

 Applying a threshold to M(d) (e.g. 30%) leads to variable-length features
 Stop the iterations when the portion of the 2-norm of w from the least contributing features exceeds the predefined threshold
GKEL Preliminary Performance (Chemical Plume Data)
SKEL: 50 initial SVMs; 8 after optimization
GKEL: SVMs with nonzero weights: 7 (22)
Relaxation into QCLP

\max_{\alpha \in R^n} \min_{d \in D} \; S(\alpha, d)
subject to \; \alpha^T y = 0, \; 0 \le \alpha \le C
S(\alpha, d) = \alpha^T e - \frac{1}{2} \alpha^T Y K(d) Y \alpha

[Figure: S(\alpha, d) curves over \alpha for different d, with the saddle point S(\alpha^*, d^*)]
1. Fix \alpha and optimize d^*; then S(\alpha, d) \ge S(\alpha, d^*)
2. Increase t up to t^* = S(\alpha, d^*)
3. For a fixed d, increase t to find the maximizing \alpha^*

QCLP:
\max_{\alpha \in R^n, t \in R} \; t
subject to \; \alpha^T y = 0, \; 0 \le \alpha \le C, \; t \le S(\alpha, d^l) \;\; \forall d^l \in D
where S(\alpha, d) = \alpha^T e - \frac{1}{2} \alpha^T Y K(d) Y \alpha

D: prohibitively large → nearly impossible to solve directly
L1 and Sparsity
[Figure: L2 optimization vs. L1 optimization under linear inequality constraints; the L1 geometry yields sparse solutions]