Large Scale Manifold Transduction
Michael Karlen
Ayse Erkan
Jason Weston
Ronan Collobert
ICML 2008
Index
• Introduction
• Problem Statement
• Existing Approaches
– Transduction: TSVM
– Manifold Regularization: LapSVM
• Proposed Work
• Experiments
Introduction
• Objective: discriminative classification
using unlabeled data.
• Popular methods
– Maximizing the margin on unlabeled data, as in
TSVM, so that the decision rule lies in a
low-density region.
– Learning cluster or manifold structure from the
unlabeled data, as in cluster kernels, label
propagation, and Laplacian SVMs.
Problem Statement
• Existing techniques fail to scale to very large
datasets and cannot handle online (streaming) data.
Existing Techniques
• TSVM
– Problem formulation:
$$\min_{w,b}\; \gamma \|w\|^2 + \sum_{i=1}^{L} \ell(f(x_i), y_i) + \lambda \sum_{i=1}^{U} \ell^*(f(x_i^*))$$
where
$$\ell(f(x), y) = \max(0,\, 1 - y f(x)), \qquad \ell^*(f(x^*)) = \max(0,\, 1 - |f(x^*)|)$$
– Non-convex problem.
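For concreteness, a minimal NumPy sketch of the two losses above (function names are mine, not from the slides):

```python
import numpy as np

def hinge(f, y):
    """Labeled loss l(f(x), y) = max(0, 1 - y f(x))."""
    return np.maximum(0.0, 1.0 - y * f)

def symmetric_hinge(f):
    """Unlabeled loss l*(f(x*)) = max(0, 1 - |f(x*)|),
    which pushes unlabeled predictions away from the margin."""
    return np.maximum(0.0, 1.0 - np.abs(f))
```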
• Problems with TSVM:
– When the dimension >> L (the number of labeled
examples), all unlabeled points may be classified
into a single class while still classifying the
labeled data correctly, yielding a lower
objective value.
• Solution:
– Introduce a balancing constraint into the objective
function.
Implementations to Solve TSVM
• S3VM:
– Mixed integer programming. Intractable for large
datasets.
• SVMLight-TSVM:
– Initially fixes the labels of unlabeled examples, then
iteratively switches those labels to improve the TSVM
objective, solving a convex problem at each step.
– Introduces a balancing constraint.
– Handles a few thousand examples.
• VS3VM:
– A concave-convex minimization approach that solves a
sequence of convex problems.
– Linear case only, with no balancing constraint.
• Delta-TSVM:
– Optimizes the TSVM objective by gradient descent in
the primal.
– Needs the entire kernel matrix (in the non-linear case)
in memory, hence inefficient for large datasets.
– Introduces a balancing constraint:
$$\frac{1}{U} \sum_{i=1}^{U} f(x_i^*) = \frac{1}{L} \sum_{i=1}^{L} y_i$$
• CCCP-TSVM:
– Concave-convex procedure.
– Non-linear extension of VS3VMs.
– Same balancing constraint as Delta-TSVM.
– 100 times faster than SVMLight-TSVM and 50 times
faster than Delta-TSVM.
– Takes 40 hours on 60,000 unlabeled examples in the
non-linear case. Still not scalable enough.
• Large Scale Linear TSVMs:
– Same label-switching technique as SVMLight-TSVM, but
switches multiple labels at once.
– Solved in the primal formulation.
– Not suited to the non-linear case.
Manifold Regularization
• Two-stage approach:
– Learn an embedding
• E.g., Laplacian Eigenmaps, Isomap, or spectral
clustering.
– Train a classifier in this new space.
• Laplacian SVM:
$$\min_{w,b}\; \sum_{i=1}^{L} \ell(f(x_i), y_i) + \gamma \|w\|^2 + \lambda \sum_{i,j=1}^{U} W_{ij} \left( f(x_i^*) - f(x_j^*) \right)^2$$
[Figure: Laplacian Eigenmap embedding]
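For intuition, a small sketch of the manifold regularization term above (W is assumed to be a precomputed affinity matrix over the unlabeled points):

```python
import numpy as np

def manifold_penalty(f_star, W):
    """Sum over i, j of W_ij * (f(x_i*) - f(x_j*))^2:
    penalizes predictions that differ across graph neighbors."""
    diffs = f_star[:, None] - f_star[None, :]  # pairwise prediction differences
    return np.sum(W * diffs ** 2)
```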
Using Both Approaches
• LDS (Low Density Separation)
– First, the Isomap-like embedding method of
"graph"-SVM is applied, clustering the data.
– Delta-TSVM is then applied in the new
embedding space.
• Problems
– The two-stage approach seems ad hoc.
– The method is slow.
Proposed Approach
• Objective function:
$$\frac{1}{L} \sum_{i=1}^{L} \ell(f(x_i), y_i) + \frac{\lambda}{U^2} \sum_{i,j=1}^{U} W_{ij}\, \ell(f(x_i^*), y^*(\{i, j\}))$$
where
$$y^*(N) = \operatorname{sign}\left( \sum_{k \in N} f(x_k^*) \right)$$
• Non-convex.
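A minimal sketch of one pairwise term of this objective for the binary case (function names are mine):

```python
import numpy as np

def pair_loss(f_i, f_j):
    """Hinge loss on prediction f_i against the pair's inferred label
    y* = sign(f(x_i*) + f(x_j*)): neighboring points are pushed to agree."""
    y_star = np.sign(f_i + f_j)
    return max(0.0, 1.0 - y_star * f_i)
```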
Details
• The primal problem is solved by gradient
descent, so online semi-supervised learning is
possible.
• For the non-linear case, a multi-layer architecture
is used, making training and testing faster than
computing the kernel. (The HardTanh activation
function is used.)
• A recommendation for an online balancing
constraint is also given.
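As an illustration of the non-linear architecture, a sketch of a one-hidden-layer network with HardTanh (shapes and names are my assumptions, not from the slides):

```python
import numpy as np

def hard_tanh(x):
    """HardTanh activation: identity on [-1, 1], clipped to +/-1 outside."""
    return np.clip(x, -1.0, 1.0)

def mlp_forward(x, W1, b1, w2, b2):
    """f(x) = w2 . HardTanh(W1 x + b1) + b2, trainable by gradient descent."""
    h = hard_tanh(W1 @ x + b1)
    return float(w2 @ h + b2)
```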
Balancing Constraint
• A cache of the last 25c predictions f(x_i*), where c is
the number of classes, is kept.
• The next balanced prediction is made assuming a
fixed estimate p_est(y) of the probability of each
class, which can be estimated from the
distribution of the labeled data:
$$p_{trn}(y = i) = \frac{|\{j : y_j = i\}|}{L}$$
• One of two decisions is made:
– Delta-bal: add the Delta-TSVM balancing
function, multiplied by a scaling factor, to the
objective. Disadvantage: the optimal scaling
factor must be identified.
– Ignore-bal: based on the distribution of
example-label pairs in the cache, if the predicted
class of the next unlabeled example already has
too many examples assigned to it, do not make a
gradient step.
• Further, a smoother version of p_trn can be
obtained by labeling the unlabeled data via the
k nearest neighbors of each labeled example.
• The resulting estimate p_knn can then be used to
implement the balancing constraint.
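A minimal sketch of the ignore-bal rule, assuming a deque cache of the last 25c predicted labels and a prior estimate p_est (helper names are mine):

```python
from collections import deque
import numpy as np

def should_step(cache, y_pred, p_est):
    """ignore-bal: skip the gradient step if class y_pred is already
    over-represented in the cache relative to the prior p_est."""
    counts = np.bincount(list(cache), minlength=len(p_est))
    if len(cache) and counts[y_pred] / len(cache) > p_est[y_pred]:
        return False
    cache.append(y_pred)  # deque(maxlen=25 * c) evicts the oldest entry
    return True

# Usage: c = 2 classes, so cache the last 25 * 2 = 50 predictions.
cache = deque(maxlen=50)
```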
Online Manifold Transduction
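Combining the pieces above, one plausible sketch of the online loop (the sampling scheme, learning rate, and the model's predict / sgd_step methods are my assumptions), reusing should_step from the previous sketch:

```python
import numpy as np

def online_step(model, labeled, unlabeled, neighbors, cache, p_est, lr=0.01):
    """One online iteration: a supervised step on a random labeled example,
    then a pairwise step on a random unlabeled point and one of its neighbors."""
    # Supervised step on a random labeled pair (x, y).
    x, y = labeled[np.random.randint(len(labeled))]
    model.sgd_step(x, y, lr)                      # hypothetical update method
    # Transductive step: infer y* from the pair's joint prediction.
    i = np.random.randint(len(unlabeled))
    j = np.random.choice(neighbors[i])            # graph neighbor of x_i*
    f_sum = model.predict(unlabeled[i]) + model.predict(unlabeled[j])
    y_star = 1 if f_sum >= 0 else 0               # class index for the cache
    if should_step(cache, y_star, p_est):         # ignore-bal balancing rule
        model.sgd_step(unlabeled[i], 2 * y_star - 1, lr)  # label in {-1, +1}
```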
Experiments
• Data Sets Used
Test Error for Various Methods
Large Scale Datasets