Large Scale Manifold Transduction
Michael Karlen, Ayse Erkan, Jason Weston, Ronan Collobert
ICML 2008

Index
• Introduction
• Problem Statement
• Existing Approaches
  – Transduction: TSVM
  – Manifold Regularization: LapSVM
• Proposed Work
• Experiments

Introduction
• Objective: discriminative classification using unlabeled data.
• Popular methods:
  – Maximizing the margin on unlabeled data, as in TSVM, so that the decision rule lies in a low-density region.
  – Learning the cluster or manifold structure from unlabeled data, as in cluster kernels, label propagation and Laplacian SVMs.

Problem Statement
• Existing techniques cannot scale to very large datasets, nor handle online data.

Existing Techniques
• TSVM problem formulation:

  \min_{w,b} \|w\|^2 + \sum_{i=1}^{L} \ell(f(x_i), y_i) + \lambda \sum_{i=1}^{U} \ell^*(f(x_i^*))

  where \ell(f(x), y) = \max(0, 1 - y f(x)) and \ell^*(f(x^*)) = \max(0, 1 - |f(x^*)|).
  – Non-convex problem.
• Problems with TSVM:
  – When the dimension >> L (the number of labeled examples), all unlabeled points may be classified into a single class while still classifying the labeled data correctly, giving a lower objective value.
• Solution:
  – Introduce a balancing constraint in the objective function.

Implementations to Solve TSVM
• S3VM:
  – Mixed integer programming. Intractable for large datasets.
• SVMLight-TSVM:
  – Initially fixes the labels of the unlabeled examples, then iteratively switches those labels to improve the TSVM objective, solving a convex problem at each step.
  – Introduces a balancing constraint.
  – Handles a few thousand examples.
• VS3VM:
  – A concave-convex minimization approach that solves successive convex problems.
  – Linear case only, with no balancing constraint.
• Delta-TSVM:
  – Optimizes the TSVM objective by gradient descent in the primal.
  – Needs the entire kernel matrix (in the non-linear case) to be in memory, hence inefficient for large datasets.
  – Introduces the balancing constraint

    \frac{1}{U} \sum_{i=1}^{U} f(x_i^*) = \frac{1}{L} \sum_{i=1}^{L} y_i.

• CCCP-TSVM:
  – Concave-convex procedure.
  – Non-linear extension of VS3VM.
  – Same balancing constraint as Delta-TSVM.
  – 100 times faster than SVMLight-TSVM and 50 times faster than Delta-TSVM.
  – Still takes 40 hours on 60,000 unlabeled examples in the non-linear case; not scalable enough.
• Large Scale Linear TSVMs:
  – Same label-switching technique as SVMLight-TSVM, but switching multiple labels at once.
  – Solved in the primal formulation.
  – Not suited to the non-linear case.

Manifold Regularization
• Two-stage problem:
  – Learn an embedding, e.g. with Laplacian Eigenmaps, Isomap or spectral clustering.
  – Train a classifier in this new space.
• Laplacian SVM:

  \min_{w,b} \sum_{i=1}^{L} \ell(f(x_i), y_i) + \gamma \|w\|^2 + \lambda \sum_{i,j=1}^{U} W_{ij} \|f(x_i^*) - f(x_j^*)\|^2

  where the last term is the Laplacian Eigenmap regularizer.

Using Both Approaches
• LDS (Low Density Separation):
  – First, the Isomap-like embedding method of "graph"-SVM is used, whereby the data is clustered.
  – Delta-TSVM is then applied in the new embedding space.
• Problems:
  – The two-stage approach seems ad hoc.
  – The method is slow.

Proposed Approach
• Objective function:

  \frac{1}{L} \sum_{i=1}^{L} \ell(f(x_i), y_i) + \frac{\lambda}{U^2} \sum_{i,j=1}^{U} W_{ij}\, \ell(f(x_i^*), y^*(\{i,j\}))

  where y^*(N) = \operatorname{sign}\bigl(\sum_{k \in N} f(x_k^*)\bigr).

• Non-convex.

Details
• The primal problem is solved by gradient descent, so online semi-supervised learning is possible (see the sketch after this slide).
• For the non-linear case, a multi-layer architecture is implemented, which makes training and testing faster than computing the kernel (the HardTanh activation function is used).
• A recommendation for an online balancing constraint is also given.
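As a concrete illustration of the online primal updates just described, the following is a minimal sketch for the binary, linear case, assuming hinge loss, W_ij = 1 for k-nearest-neighbor pairs and 0 otherwise, and no balancing step. The names (online_manifold_transduction, neighbors, lam, lr) are illustrative; this is not the authors' implementation, which uses a multi-layer architecture for the non-linear case.

```python
import numpy as np

def hinge_grad(w, b, x, y):
    # Subgradient of l(f(x), y) = max(0, 1 - y f(x)) with f(x) = <w, x> + b.
    if y * (np.dot(w, x) + b) < 1.0:
        return -y * x, -y
    return np.zeros_like(w), 0.0

def online_manifold_transduction(X_lab, y_lab, X_unl, neighbors,
                                 lr=0.01, lam=1.0, n_iter=100_000, seed=0):
    # SGD on (1/L) sum_i l(f(x_i), y_i)
    #      + (lam/U^2) sum_ij W_ij l(f(x_i*), y*({i,j})),
    # with W_ij = 1 for k-NN pairs and a linear model f(x) = <w, x> + b.
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X_lab.shape[1]), 0.0
    for _ in range(n_iter):
        # (1) Gradient step on a randomly picked labeled example.
        i = rng.integers(len(X_lab))
        gw, gb = hinge_grad(w, b, X_lab[i], y_lab[i])
        w -= lr * gw
        b -= lr * gb
        # (2) Pick an unlabeled example and one of its neighbors, and
        #     pseudo-label the pair by the sign of their summed outputs.
        i = rng.integers(len(X_unl))
        j = rng.choice(neighbors[i])
        y_star = np.sign(np.dot(w, X_unl[i] + X_unl[j]) + 2.0 * b)
        if y_star == 0.0:
            continue  # undecided pair: skip the unlabeled step
        # (3) Gradient step pushing f(x_i*) toward the pair's pseudo-label.
        gw, gb = hinge_grad(w, b, X_unl[i], y_star)
        w -= lr * lam * gw
        b -= lr * lam * gb
    return w, b
```

Each iteration touches only one labeled example and one neighbor pair, which is what makes the approach online and scalable to large datasets.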
Balancing Constraint
• A cache of the last 25c predictions f(x_i^*), where c is the number of classes, is kept.
• The next balanced prediction is made by assuming a fixed estimate p_est(y) of the probability of each class, which can be estimated from the distribution of the labeled data:

  p_{trn}(y = i) = |\{ j : y_j = i \}| / L.

• One of two decisions is then made:
  – Delta-bal: add the Delta-TSVM balancing function, multiplied by a scaling factor, to the objective. Disadvantage: the optimal scaling factor has to be identified.
  – Ignore-bal: based on the distribution of example-label pairs in the cache, if the class predicted for the next unlabeled example already has too many examples assigned to it, do not make a gradient step (a code sketch of this rule follows the experiment slides below).
• A smoother version of p_trn can be obtained by labeling the unlabeled data with the k nearest neighbors of each labeled example.
• This yields p_knn, which can be used to implement the balancing constraint.

Online Manifold Transduction

Experiments
• Data Sets Used

Test Error for Various Methods

Large Scale Datasets
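To make the ignore-bal rule above concrete, here is a minimal sketch, assuming classes are indexed 0..c-1 and that p_est is taken from p_trn or p_knn; the class name IgnoreBal and its interface are hypothetical, not from the paper.

```python
from collections import deque
import numpy as np

class IgnoreBal:
    """Hypothetical helper for the ignore-bal heuristic: keep a cache of
    the last 25*c predictions and skip the unlabeled gradient step when
    the predicted class is already over-represented relative to p_est."""

    def __init__(self, n_classes, p_est):
        self.cache = deque(maxlen=25 * n_classes)  # last 25c predictions
        self.p_est = np.asarray(p_est)             # estimated class prior

    def allow_step(self, y_pred):
        # y_pred: class index predicted for the next unlabeled example.
        self.cache.append(y_pred)
        counts = np.bincount(list(self.cache), minlength=len(self.p_est))
        frac = counts[y_pred] / len(self.cache)
        # Take the gradient step only while the predicted class does not
        # exceed its estimated probability in the cache.
        return frac <= self.p_est[y_pred]
```

In the earlier SGD sketch, one would map y* in {-1, +1} to a class index and call allow_step before the unlabeled gradient step, skipping that step whenever the call returns False.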