Generalized Optimal Kernel-based Ensemble Learning for HS Classification Problems
Prudhvi Gurram, Heesung Kwon
Image Processing Branch, U.S. Army Research Laboratory

Outline
- Current Issues
- Sparse Kernel-based Ensemble Learning (SKEL)
- Generalized Kernel-based Ensemble Learning (GKEL)
- Simulation Results
- Conclusions

Current Issues
- Sample hyperspectral data (visible + near IR, 210 bands); example classes: grass, military vehicle
- High dimensionality of hyperspectral data vs. the curse of dimensionality
- Small set of training samples (small targets)
- The decision function of a classifier is overfitted to the small number of training samples
- The idea is to find the underlying discriminant structure, NOT the noisy nature of the data
- The goal is to regularize the learning so that the decision surface is robust to noisy samples and outliers
- Approach: use ensemble learning

Kernel-based Ensemble Learning (Suboptimal Technique)
- Not all subsets of spectral bands are useful for the given task, so select a small number of subsets that are useful
- Random subsets of spectral bands are drawn from the training data
- Sub-classifiers used: Support Vector Machines (SVMs); SVM 1, ..., SVM N learn decision surfaces f_1, ..., f_N and produce decisions d_1, ..., d_N
- The ensemble decision is obtained by majority voting over d_1, ..., d_N

Sparse Kernel-based Ensemble Learning (SKEL)
- To find useful subsets, SKEL was developed, built on the idea of multiple kernel learning (MKL)
- SKEL jointly optimizes the SVM-based sub-classifiers in conjunction with the weights
- In the joint optimization, an L1 constraint is imposed on the weights to make them sparse, so that only the subsets useful for the given task receive nonzero weights (e.g., d_1 = 0, d_2 = 0.2, d_3 = 0, ..., d_N = 0.1)
- Gaussian kernel:
  k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)
- The sub-kernels are combined into a single kernel matrix weighted by d_m, with \sum_m d_m = 1, \; d_m \ge 0 \; \forall m (L1 norm constraint)

Optimization Problem
- Multiple kernel learning formulation (Rakotomamonjy et al.):
  \min_{\{f_m\}, b, d} \; \frac{1}{2} \sum_m \frac{1}{d_m} \|f_m\|_H^2
  \text{s.t.} \;\; y_i \Big( \sum_m f_m(x_i) + b \Big) \ge 1 \;\; \forall i, \quad \sum_m d_m = 1, \; d_m \ge 0 \;\; \forall m
- f_m: kernel-based decision function
- d_m: weighting coefficient
- L1 norm constraint leads to sparsity

Generalized Sparse Kernel-based Ensemble (GKEL)
- SKEL is a useful classifier with improved performance; however, there are some constraints in using SKEL:
  - SKEL has to use a large number of initial SVMs to maximize the ensemble performance, which can cause a memory error due to the limited memory size
  - The number of features selected for each SVM has to be the same, which also causes sub-optimality in choosing feature subspaces
- GKEL relaxes the constraints of SKEL
- GKEL uses a bottom-up approach: starting from a single classifier, sub-classifiers are added one by one until the ensemble converges, while a subset of features is optimized for each sub-classifier

Sparse SVM Problem
- GKEL is built on the sparse SVM problem* that finds optimal sparse features maximizing the margin of the hyperplane
- Primal optimization problem:
  \min_{d \in D} \; \min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_i \xi_i
  \text{s.t.} \;\; y_i \big( \langle w, \tilde{x}_i \rangle + b \big) \ge 1 - \xi_i \;\; \text{for all } i
  \text{where } \tilde{x} = x \odot d, \;\; \odot: \text{elementwise product}, \;\; D = \{ d \mid d_j \in \{0, 1\}, \; j = 1, \ldots, m \}
- d is a binary vector, e.g., d = [1, 0, 0, 1, ..., 0]
- The goal is to find an optimal d, resulting in an optimal w̃ that maximizes the margin of the hyperplane
* Tan et al., "Learning sparse SVM for feature selection on very high dimensional datasets," ICML 2010
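To make the masked-feature idea concrete, here is a minimal Python sketch (not from the original work): a binary vector d zeroes out bands before an SVM is trained, and the resulting margin 1/||w|| depends on which bands d keeps. The synthetic data, the two example masks, and the use of a linear SVC are illustrative assumptions only.

```python
# Minimal sketch: the binary vector d masks the features (x_tilde = x * d),
# and different masks give different SVM margins 1/||w||.
# Data, masks, and the linear kernel are illustrative, not the authors' setup.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
w_true = np.array([2.0, -2.0] + [0.0] * 8)        # only the first 2 bands matter
y = np.sign(X @ w_true + 0.1 * rng.normal(size=80))

def margin_for_mask(d):
    """Train a linear SVM on the masked features x * d and return 1/||w||."""
    clf = SVC(kernel="linear", C=10.0).fit(X * d, y)
    return 1.0 / np.linalg.norm(clf.coef_)

d_good = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])  # keeps the informative bands
d_bad = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0, 0])   # keeps only noise bands
print("margin with informative mask:", margin_for_mask(d_good))
print("margin with noise-only mask: ", margin_for_mask(d_bad))
```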
Dual Problem of Sparse SVM
- Using Lagrange multipliers and the KKT conditions, the primal problem can be converted to the dual problem
- The resulting mixed integer programming problem is NP-hard:
  \max_{\alpha \in \mathbb{R}^n} \; \min_{d \in D} \; e^\top \alpha - \frac{1}{2} \alpha^\top Y K(d) Y \alpha
  \text{s.t.} \;\; \alpha^\top y = 0, \; 0 \le \alpha \le C
  where α is the vector of Lagrange multipliers, e is a vector of all ones, Y = diag(y_i), and K(d) is the kernel matrix built on the sparse feature vectors x̃_i = x_i ⊙ d
- Since there are a large number of different combinations of sparse features, the number of possible kernel matrices K(d) is huge: a combinatorial problem

Relaxation into QCLP
- To make the mixed integer problem tractable, relax it into a Quadratically Constrained Linear Program (QCLP)
- The objective function S(α, d) is converted into inequality constraints lower bounded by a real value t:
  \max_{\alpha \in \mathbb{R}^n, \, t \in \mathbb{R}} \; t
  \text{s.t.} \;\; \alpha^\top y = 0, \; 0 \le \alpha \le C, \; t \le S(\alpha, d^l) \;\; \forall d^l \in D
  \text{where } S(\alpha, d) = e^\top \alpha - \frac{1}{2} \alpha^\top Y K(d) Y \alpha
- Since the number of possible K(d) is huge, so is the number of constraints; therefore the QCLP problem is still hard to solve
- But among the many constraints, most are not actively used to solve the optimization problem
- The goal is to find the small number of constraints that are actively used

Illustrative Example
- Consider an optimization problem with a large number of inequality constraints (such as an SVM)
- Most of the constraints are not used to define the feasible region or the optimal solution; only a small number of active constraints determine the feasible region (Yisong Yue, "Diversified Retrieval as Structured Prediction," ICML 2008)
- Use a technique called the restricted master problem, which finds the active constraints by identifying the most violated constraints one by one, iteratively:
  - Find the first most violated constraint
  - Based on the previously found constraints, find the next most violated constraint
  - Continue the iterative search until no violated constraints are found

Flow Chart
- Flow chart of the QCLP solution based on the restricted master problem:
  1. Initialize (t_0, α_0) and the restricted set I
  2. At iteration i, find d̂_i given (t_{i-1}, α_{i-1})
  3. If S(α_{i-1}, d̂_i) ≥ t_{i-1}, no constraint is violated: terminate
  4. Otherwise, set I = I ∪ {d̂_i} and update (t_i, α_i) by solving the restricted QCLP
     \max_{\alpha \in \mathbb{R}^n, \, t \in \mathbb{R}} \; t
     \text{s.t.} \;\; \alpha^\top y = 0, \; 0 \le \alpha \le C, \; t \le S(\alpha, \hat{d}^l) \;\; \forall \hat{d}^l \in I
     where I is the restricted set of sparse feature vectors
- d̂ is the subset of features that maximally violates t ≤ S(α, d̂), i.e., the minimizer of min_d S(α, d), found via max_d M(d) = (1/2) α^T Y K(d) Y α

Most Violated Features
- min_d S(α, d) is equivalent to max_d M(d), where
  M(d) = \frac{1}{2} \alpha^\top Y K(d) Y \alpha
- Linear kernel (features f_1, f_2, ..., f_n):
  - Calculate M(d_i) for each feature separately and select the features with the top values
  - This ranking does not work for non-linear kernels
- Non-linear kernel (features f_1, f_2, ..., f_n; e.g., Gaussian RBF):
  - Individual feature ranking no longer works, because the kernel exploits non-linear correlations among all the features
  - Calculate M(d_{-i}) for each i, where d_{-i} selects all the features except the i-th feature
  - Eliminate the least contributing feature
  - Repeat the elimination until a threshold condition is met (e.g., stop the iterations if the change in M(d) exceeds 30%)
  - This yields variable-length feature subsets for different SVMs
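A minimal sketch of the backward-elimination search described above for a Gaussian RBF kernel, assuming scikit-learn and NumPy. The synthetic data, the stand-in dual variables alpha, the kernel bandwidth gamma, and the 30% stopping threshold are illustrative choices, not the authors' settings.

```python
# Backward elimination driven by M(d) = 1/2 * alpha^T Y K(d) Y alpha
# for a Gaussian RBF kernel.  Illustrative sketch only.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def M(X, y, alpha, mask, gamma=0.1):
    """M(d) = 1/2 alpha^T Y K(d) Y alpha, with K(d) built on the masked bands."""
    K = rbf_kernel(X[:, mask], X[:, mask], gamma=gamma)
    ya = y * alpha                              # Y alpha, with Y = diag(y)
    return 0.5 * ya @ K @ ya

def most_violated_subset(X, y, alpha, rel_change_stop=0.30, gamma=0.1):
    """Drop the least contributing band until removing another band
    would change M(d) by more than the (assumed) relative threshold."""
    mask = np.ones(X.shape[1], dtype=bool)      # start from all bands
    m_cur = M(X, y, alpha, mask, gamma)
    while mask.sum() > 1:
        scores = []                             # M(d_{-i}) for every remaining band i
        for i in np.flatnonzero(mask):
            trial = mask.copy()
            trial[i] = False
            scores.append((M(X, y, alpha, trial, gamma), i))
        m_best, i_drop = max(scores)            # least contributing band
        if abs(m_cur - m_best) / max(m_cur, 1e-12) > rel_change_stop:
            break                               # dropping it changes M too much
        mask[i_drop] = False
        m_cur = m_best
    return mask

# Tiny illustrative run with stand-in data and dual variables
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 12))
y = np.where(rng.random(30) > 0.5, 1.0, -1.0)
alpha = rng.random(30)
print(np.flatnonzero(most_violated_subset(X, y, alpha)))
```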
How GKEL Works
- Starting from an initial feature subset d̂_0, sub-classifiers are added one at a time: the restricted set grows as {d̂_0}, {d̂_0, d̂_1}, {d̂_0, d̂_1, d̂_2}, ..., with each new subset defining a new SVM (SVM 1, SVM 2, ..., SVM N)
- I = {d̂_0, d̂_1, ..., d̂_N}: selected feature subsets (variable lengths), with d̂_i ≠ d̂_j for i ≠ j
- W = {w_1, w_2, ..., w_N}: sub-classifier weights
- A bottom-up approach is used

Images for Performance Evaluation
- Hyperspectral images (HYDICE, 210 bands, 0.4 – 2.5 microns)
- Forest Radiance I and Desert Radiance II scenes, with marked training samples

Performance Comparison (FR I)
- Single SVM (Gaussian kernel) vs. SKEL (10 initial SVMs optimized down to 2, Gaussian kernel) vs. GKEL (3 SVMs, Gaussian kernel)

ROC Curves (FR I)
- Since each SKEL run uses different random subsets of spectral bands, 10 SKEL runs were used to generate 10 ROC curves

Performance Comparison (DR II)
- Single SVM (Gaussian kernel) vs. SKEL (10 initial SVMs optimized down to 2, Gaussian kernel) vs. GKEL (3 SVMs, Gaussian kernel)
- 10 ROC curves from 10 SKEL runs, each run with different random subsets of spectral bands

Performance Comparison (Spambase Data)
- Data downloaded from the UCI machine learning repository (Spambase), used to predict whether an email is spam or not
- SKEL: 25 initial SVMs, 12 after optimization
- GKEL: 14 SVMs with nonzero weights

Conclusions
- SKEL and a generalized version of SKEL (GKEL) have been introduced
- SKEL starts from a large number of initial SVMs, which are then optimized down to a small number of SVMs useful for the given task
- GKEL starts from a single SVM, and individual classifiers are added one by one, optimally, to the ensemble until the ensemble converges
- GKEL and SKEL generally perform better than a regular SVM
- GKEL performs as well as SKEL while using fewer resources (memory) than SKEL

Q&A

Optimally Tuning Kernel Parameters
- Prior to the L1 optimization, the kernel parameters of each SVM are optimally tuned
- A Gaussian kernel with a single bandwidth treats all the bands equally, which is suboptimal:
  k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right) : Gaussian kernel (sphere kernel)
  k(x, x') = \exp\left( -\frac{1}{2} (x - x')^\top \Sigma^{-1} (x - x') \right), \quad \Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_L^2) : full-band diagonal Gaussian kernel
- Estimate the upper bound on the leave-one-out (LOO) error (the radius-margin bound):
  f_{RM} = \frac{1}{l} \cdot \frac{R^2}{\gamma^2}
  R: the radius of the minimum enclosing hypersphere; γ: the margin of the hyperplane
- The goal is to minimize the RM bound using gradient descent; ∇f_RM is the gradient of f_RM

Ensemble Learning
- Sub-classifier 1, Sub-classifier 2, ..., Sub-classifier N each output a decision in {+1, -1}
- If the performance of each classifier is better than a random guess and the classifiers are independent of each other, performance improves as the number of classifiers increases
- The ensemble decision yields a regularized decision function (robust to noise and outliers)

SKEL: Comparison (Top-Down Approach)
- Training data → random subsets of features (random bands) → SVM 1, ..., SVM N with decision functions f_1, ..., f_N
- Gaussian kernel: k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)
- The decision results are combined with MKL weights (e.g., d_1 = 0, d_2 = 0.2, d_3 = 0, ..., d_N = 0.1), subject to \sum_m d_m = 1, \; d_m \ge 0 \; \forall m (L1 norm constraint, sparsity)

Iterative Approach to Solve QCLP
- Due to the very large number of quadratic constraints, the full QCLP problem is hard to solve
- So, take an iterative approach: iteratively update (t, α) based on a limited number of active constraints
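The sketch below illustrates the shape of this iterative (cutting-plane style) loop under heavy simplifications that are not the authors' algorithm: a linear kernel so the most violated subset can be found by ranking single bands by M(d), a fixed subset size B, an arbitrary initial subset, and an ordinary SVM on the average of the kernels in I in place of the full restricted QCLP/MKL solve. Function names such as gkel_like_loop and most_violated are hypothetical.

```python
# Simplified cutting-plane / restricted-master loop: solve a restricted
# problem, find the most violated feature subset, add it, repeat.
import numpy as np
from sklearn.svm import SVC

def S(alpha, y, K):
    """S(alpha, d) = e^T alpha - 1/2 alpha^T Y K(d) Y alpha."""
    ya = y * alpha
    return alpha.sum() - 0.5 * ya @ K @ ya

def kernel(X, mask):
    """Linear kernel on the bands selected by the binary mask d."""
    Xd = X[:, mask]
    return Xd @ Xd.T

def most_violated(X, y, alpha, B):
    """Top-B bands by M on single bands (valid for the linear kernel)."""
    per_band = 0.5 * (X.T @ (y * alpha)) ** 2   # M(d) for each single band
    mask = np.zeros(X.shape[1], dtype=bool)
    mask[np.argsort(per_band)[-B:]] = True
    return mask

def gkel_like_loop(X, y, B=5, C=1.0, max_iters=10, tol=1e-6):
    n, p = X.shape
    d0 = np.zeros(p, dtype=bool)
    d0[:B] = True                               # arbitrary initial subset
    I = [d0]
    for _ in range(max_iters):
        # Simplified restricted solve: plain SVM on the average of kernels in I
        K_avg = sum(kernel(X, d) for d in I) / len(I)
        clf = SVC(kernel="precomputed", C=C).fit(K_avg, y)
        alpha = np.zeros(n)
        alpha[clf.support_] = np.abs(clf.dual_coef_.ravel())
        t = min(S(alpha, y, kernel(X, d)) for d in I)
        d_new = most_violated(X, y, alpha, B)
        if S(alpha, y, kernel(X, d_new)) >= t - tol:
            break                               # no violated constraint left
        I.append(d_new)
    return I, alpha

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 20))
y = np.where(rng.random(40) > 0.5, 1.0, -1.0)
subsets, alpha = gkel_like_loop(X, y)
print(len(subsets), [int(d.sum()) for d in subsets])
```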
Each Iteration of QCLP
- The intermediate solution pair (t, α) is obtained from the restricted problem
  \max_{\alpha \in \mathbb{R}^n, \, t \in \mathbb{R}} \; t
  \text{s.t.} \;\; \alpha^\top y = 0, \; 0 \le \alpha \le C, \; t \le S(\alpha, d^l) \;\; \forall d^l \in I
  \text{where } S(\alpha, d) = e^\top \alpha - \frac{1}{2} \alpha^\top Y K(d) Y \alpha
- Lagrangian:
  L(t, \alpha, u) = t + \sum_l u_l \big( S(\alpha, d^l) - t \big), \quad u_l \ge 0
- From the KKT condition \partial L / \partial t = 0, it follows that \sum_l u_l = 1

Iterative QCLP vs. MKL
- Using the same Lagrangian and the KKT condition \sum_l u_l = 1, the restricted problem becomes
  \max_{\alpha} \min_{u} \; \sum_l u_l S(\alpha, d^l)
  = \max_{\alpha} \min_{u} \; e^\top \alpha - \frac{1}{2} \alpha^\top Y \Big( \sum_l u_l K_l \Big) Y \alpha
  \text{s.t.} \;\; \sum_l u_l = 1, \; u_l \ge 0
- That is, each iteration of the restricted QCLP is equivalent to an MKL problem over the kernels K_l = K(d^l)

Variable Length Features
- M(d) = \frac{1}{2} \alpha^\top Y K(d) Y \alpha = \frac{1}{2} \|w\|^2 when d = [1, 1, 1, ..., 1]
- Applying a threshold to M(d) (e.g., 30%) leads to variable-length feature subsets
- Stop the iterations when the portion of the 2-norm of w contributed by the least contributing features exceeds the predefined threshold

GKEL Preliminary Performance (Chemical Plume Data)
- SKEL: 50 initial SVMs, 8 after optimization
- GKEL: SVMs with nonzero weights: 7 (22)

Relaxation into QCLP
- Start from the max-min dual:
  \max_{\alpha \in \mathbb{R}^n} \; \min_{d \in D} \; S(\alpha, d)
  \text{s.t.} \;\; \alpha^\top y = 0, \; 0 \le \alpha \le C, \quad S(\alpha, d) = e^\top \alpha - \frac{1}{2} \alpha^\top Y K(d) Y \alpha
- Steps of the relaxation:
  1. Fix α and optimize d*: d* = \arg\min_d S(\alpha, d), so that S(\alpha, d) \ge S(\alpha, d^*) for all d ∈ D
  2. Increase t up to t* = S(\alpha, d^*)
  3. For a fixed d*, increase t by finding the α that maximizes it
- This yields the QCLP
  \max_{\alpha \in \mathbb{R}^n, \, t \in \mathbb{R}} \; t
  \text{s.t.} \;\; \alpha^\top y = 0, \; 0 \le \alpha \le C, \; t \le S(\alpha, d^l) \;\; \forall d^l \in D
- D is prohibitively large, so the full QCLP is nearly impossible to solve directly

L1 and Sparsity
- Illustration contrasting L2 optimization and L1 optimization under linear inequality constraints
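As a small illustration of the L1-vs-L2 contrast on this slide (and of why the L1 constraint on the SKEL/MKL weights induces sparsity), the sketch below fits an L1-penalized (Lasso) and an L2-penalized (Ridge) regression to synthetic data and counts nonzero weights. Lasso and Ridge are stand-ins for the general point only; they are not part of SKEL or GKEL.

```python
# L1 vs L2 penalties: the L1 penalty drives many weights to exactly zero,
# while the L2 penalty only shrinks them.  Illustrative data and settings.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
w_true = np.zeros(50)
w_true[:5] = 2.0                                 # only 5 informative features
y = X @ w_true + 0.1 * rng.normal(size=100)

l1 = Lasso(alpha=0.1).fit(X, y)
l2 = Ridge(alpha=0.1).fit(X, y)
print("nonzero weights  L1:", np.sum(np.abs(l1.coef_) > 1e-8),
      " L2:", np.sum(np.abs(l2.coef_) > 1e-8))
```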