Large Scale Manifold Transduction
Michael Karlen
Ayse Erkan
Jason Weston
Ronan Collobert
ICML 2008
Index
• Introduction
• Problem Statement
• Existing Approaches
– Transduction: TSVM
– Manifold Regularization: LapSVM
• Proposed Work
• Experiments
Introduction
• Objective: discriminative classification
using unlabeled data.
• Popular methods
– Maximizing the margin on unlabeled data, as in
TSVM, so that the decision rule lies in a
low-density region.
– Learning cluster or manifold structure from the
unlabeled data, as in cluster kernels, label
propagation, and Laplacian SVMs.
Problem Statement
• Existing techniques fail to scale to very large
datasets and cannot handle online (streaming) data.
Existing Techniques
• TSVM
– Problem formulation:
$$\min_{w,b}\; \gamma \|w\|^2 + \sum_{i=1}^{L} \ell(f(x_i), y_i) + \lambda \sum_{i=1}^{U} \ell^*(f(x_i^*))$$
where
$$\ell(f(x), y) = \max(0,\, 1 - y f(x)), \qquad \ell^*(f(x^*)) = \max(0,\, 1 - |f(x^*)|)$$
– Non-convex problem.
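For concreteness, a minimal NumPy sketch of the two losses above (function names are mine, not from the slides):

```python
import numpy as np

def hinge(f, y):
    """Labeled loss l(f(x), y) = max(0, 1 - y f(x))."""
    return np.maximum(0.0, 1.0 - y * f)

def symmetric_hinge(f):
    """Unlabeled loss l*(f(x*)) = max(0, 1 - |f(x*)|),
    which pushes unlabeled predictions away from the margin."""
    return np.maximum(0.0, 1.0 - np.abs(f))
```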
• Problems with TSVM:
– When the dimension >> L (the number of labeled
examples), all unlabeled points may be classified
into a single class while still classifying the
labeled data correctly, yielding a lower
objective value.
• Solution:
– Introduce a balancing constraint into the objective
function.
Implementations to Solve TSVM
• S3VM:
– Mixed integer programming. Intractable for large
datasets.
• SVMLight-TSVM:
– Initially fixes the labels of unlabeled examples, then
iteratively switches those labels to improve the TSVM
objective, solving a convex problem at each step.
– Introduces a balancing constraint.
– Handles a few thousand examples.
• VS3VM:
– A concave-convex minimization approach that solves a
sequence of convex problems.
– Linear case only, with no balancing constraint.
• Delta-TSVM:
– Optimizes the TSVM objective by gradient descent in
the primal.
– Needs the entire kernel matrix (in the non-linear case)
in memory, hence inefficient for large datasets.
– Introduces a balancing constraint:
$$\frac{1}{U} \sum_{i=1}^{U} f(x_i^*) = \frac{1}{L} \sum_{i=1}^{L} y_i$$
• CCCP-TSVM:
– Concave-convex procedure.
– Non-linear extension of VS3VMs.
– Same balancing constraint as Delta-TSVM.
– 100 times faster than SVMLight-TSVM and 50 times
faster than Delta-TSVM.
– Takes 40 hours on 60,000 unlabeled examples in the
non-linear case. Still not scalable enough.
• Large Scale Linear TSVMs:
– Same label-switching technique as SVMLight-TSVM, but
switches multiple labels at once.
– Solved in the primal formulation.
– Not suited to the non-linear case.
Manifold Regularization
• Two-stage approach:
– Learn an embedding
• E.g., Laplacian Eigenmaps, Isomap, or spectral
clustering.
– Train a classifier in this new space.
• Laplacian SVM:
$$\min_{w,b}\; \sum_{i=1}^{L} \ell(f(x_i), y_i) + \gamma \|w\|^2 + \lambda \sum_{i,j=1}^{U} W_{ij} \left( f(x_i^*) - f(x_j^*) \right)^2$$
[Figure: Laplacian Eigenmap embedding]
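For intuition, a small sketch of the manifold regularization term above (W is assumed to be a precomputed affinity matrix over the unlabeled points):

```python
import numpy as np

def manifold_penalty(f_star, W):
    """Sum over i, j of W_ij * (f(x_i*) - f(x_j*))^2:
    penalizes predictions that differ across graph neighbors."""
    diffs = f_star[:, None] - f_star[None, :]  # pairwise prediction differences
    return np.sum(W * diffs ** 2)
```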
Using Both Approaches
• LDS (Low Density Separation)
– First, the Isomap-like embedding method of
"graph"-SVM is applied, clustering the data.
– Delta-TSVM is then applied in the new
embedding space.
• Problems
– The two-stage approach seems ad hoc.
– The method is slow.
Proposed Approach
• Objective function:
$$\frac{1}{L} \sum_{i=1}^{L} \ell(f(x_i), y_i) + \frac{\lambda}{U^2} \sum_{i,j=1}^{U} W_{ij}\, \ell(f(x_i^*), y^*(\{i, j\}))$$
where
$$y^*(N) = \operatorname{sign}\left( \sum_{k \in N} f(x_k^*) \right)$$
• Non-convex.
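A minimal sketch of one pairwise term of this objective for the binary case (function names are mine):

```python
import numpy as np

def pair_loss(f_i, f_j):
    """Hinge loss on prediction f_i against the pair's inferred label
    y* = sign(f(x_i*) + f(x_j*)): neighboring points are pushed to agree."""
    y_star = np.sign(f_i + f_j)
    return max(0.0, 1.0 - y_star * f_i)
```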
Details
• The primal problem is solved by gradient
descent, so online semi-supervised learning is
possible.
• For the non-linear case, a multi-layer architecture
is used, making training and testing faster than
computing the kernel. (The HardTanh activation
function is used.)
• A recommendation for an online balancing
constraint is also given.
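As an illustration of the non-linear architecture, a sketch of a one-hidden-layer network with HardTanh (shapes and names are my assumptions, not from the slides):

```python
import numpy as np

def hard_tanh(x):
    """HardTanh activation: identity on [-1, 1], clipped to +/-1 outside."""
    return np.clip(x, -1.0, 1.0)

def mlp_forward(x, W1, b1, w2, b2):
    """f(x) = w2 . HardTanh(W1 x + b1) + b2, trainable by gradient descent."""
    h = hard_tanh(W1 @ x + b1)
    return float(w2 @ h + b2)
```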
Balancing Constraint
• A cache of the last 25c predictions f(x_i*), where c is
the number of classes, is kept.
• The next balanced prediction is made assuming a
fixed estimate p_est(y) of the probability of each
class, which can be estimated from the
distribution of the labeled data:
$$p_{trn}(y = i) = \frac{|\{j : y_j = i\}|}{L}$$
• One of two decisions is made:
– Delta-bal: add the Delta-TSVM balancing
function, multiplied by a scaling factor, to the
objective. Disadvantage: the optimal scaling
factor must be identified.
– Ignore-bal: based on the distribution of
example-label pairs in the cache, if the predicted
class of the next unlabeled example already has
too many examples assigned to it, do not make a
gradient step.
• Further, a smoother version of p_trn can be
obtained by labeling the unlabeled data via the
k nearest neighbors of each labeled example.
• The resulting estimate p_knn can then be used to
implement the balancing constraint.
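A minimal sketch of the ignore-bal rule, assuming a deque cache of the last 25c predicted labels and a prior estimate p_est (helper names are mine):

```python
from collections import deque
import numpy as np

def should_step(cache, y_pred, p_est):
    """ignore-bal: skip the gradient step if class y_pred is already
    over-represented in the cache relative to the prior p_est."""
    counts = np.bincount(list(cache), minlength=len(p_est))
    if len(cache) and counts[y_pred] / len(cache) > p_est[y_pred]:
        return False
    cache.append(y_pred)  # deque(maxlen=25 * c) evicts the oldest entry
    return True

# Usage: c = 2 classes, so cache the last 25 * 2 = 50 predictions.
cache = deque(maxlen=50)
```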
Online Manifold Transduction
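Combining the pieces above, one plausible sketch of the online loop (the sampling scheme, learning rate, and the model's predict / sgd_step methods are my assumptions), reusing should_step from the previous sketch:

```python
import numpy as np

def online_step(model, labeled, unlabeled, neighbors, cache, p_est, lr=0.01):
    """One online iteration: a supervised step on a random labeled example,
    then a pairwise step on a random unlabeled point and one of its neighbors."""
    # Supervised step on a random labeled pair (x, y).
    x, y = labeled[np.random.randint(len(labeled))]
    model.sgd_step(x, y, lr)                      # hypothetical update method
    # Transductive step: infer y* from the pair's joint prediction.
    i = np.random.randint(len(unlabeled))
    j = np.random.choice(neighbors[i])            # graph neighbor of x_i*
    f_sum = model.predict(unlabeled[i]) + model.predict(unlabeled[j])
    y_star = 1 if f_sum >= 0 else 0               # class index for the cache
    if should_step(cache, y_star, p_est):         # ignore-bal balancing rule
        model.sgd_step(unlabeled[i], 2 * y_star - 1, lr)  # label in {-1, +1}
```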
Experiments
• Data Sets Used
Test Error for Various Methods
Large Scale Datasets