Torralba and Quattoni

Transfer Learning for
Image Classification
Transfer Learning Approaches
Leverage data from related tasks to improve performance:
 Improve generalization.
 Reduce the run-time of evaluating a set of classifiers.
Two Main Approaches:
 Learning Shared Hidden Representations.
 Sharing Features.
Sharing Features: efficient boosting procedures for multiclass object detection
Antonio Torralba
Kevin Murphy
William Freeman
Snapshot of the idea
Goal:
 Reduce the computational cost of multiclass object recognition
 Improve generalization performance
Approach:
 Make boosted classifiers share weak learners
 Some Notation:
$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
$x_i \in \mathbb{R}^d$, $y_i = [y_i^1, y_i^2, \ldots, y_i^m]$, $y_i^k \in \{-1, +1\}$
Training a single boosted classifier
 Consider training a single boosted classifier:
 Candidate weak learners: weighted stumps.
 Fit an additive model $H(x) = \sum_t h_t(x)$.
Training a single boosted classifier
 Minimize the exponential loss $J = \sum_i e^{-y_i H(x_i)}$.
 Greedy approach: at each round add the weak learner that most reduces $J$.
 Gentle Boosting: fit each weak learner by weighted least squares (a sketch follows).
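For concreteness, here is a minimal numpy sketch of GentleBoost with weighted regression stumps. This is not the authors' code: the function names, the exhaustive stump search, and the number of rounds are illustrative.

```python
import numpy as np

def fit_stump(X, z, w):
    """Weighted least-squares regression stump h(x) = a*[x_f > theta] + b."""
    best = None
    for f in range(X.shape[1]):
        for theta in np.unique(X[:, f]):
            mask = X[:, f] > theta
            # closed-form weighted means on each side of the split
            b = np.average(z[~mask], weights=w[~mask]) if (~mask).any() else 0.0
            ab = np.average(z[mask], weights=w[mask]) if mask.any() else 0.0
            a = ab - b
            pred = np.where(mask, a + b, b)
            err = np.sum(w * (z - pred) ** 2)
            if best is None or err < best[0]:
                best = (err, f, theta, a, b)
    return best[1:]

def gentle_boost(X, y, rounds=50):
    """Gentle boosting with regression stumps; labels y in {-1, +1}."""
    H = np.zeros(len(y))
    stumps = []
    for _ in range(rounds):
        w = np.exp(-y * H)                    # weights from the exponential loss
        w /= w.sum()
        f, theta, a, b = fit_stump(X, y, w)   # weighted least-squares fit
        H += np.where(X[:, f] > theta, a + b, b)
        stumps.append((f, theta, a, b))
    return stumps
```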
Standard Multiclass Case: No Sharing
 An additive model $H^k(x)$ for each class $k$.
 Minimize the sum of exponential losses $\sum_{k=1}^{m} \sum_{i=1}^{n} e^{-y_i^k H^k(x_i)}$.
 Each class has its own set of weak learners.
Multiclass Case: Sharing Features
 For each subset of classes $S$ there is a corresponding additive classifier $G^S(x)$; the classifier for the $k$-th class sums the classifiers of all subsets containing $k$.
 At iteration $t$, add one weak learner to one of the additive models $G^S(x)$.
 Minimize the sum of exponential losses over all classes.
 Naive approach: evaluate every subset of classes ($2^m$ candidates per iteration).
 Greedy Heuristic: grow the shared subset by forward selection, one class at a time (see the sketch below).
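A sketch of the greedy forward-selection heuristic (an illustration, not the paper's implementation): `joint_loss_if_shared` is a hypothetical callback that fits the best shared weak learner for a candidate subset of classes and returns the resulting summed exponential loss. The greedy search costs on the order of $m^2$ such fits instead of $2^m$.

```python
import numpy as np

def best_shared_subset(classes, joint_loss_if_shared):
    """Greedy forward selection of the subset of classes that will share
    the next weak learner, instead of scoring all 2^m subsets."""
    remaining, subset = set(classes), []
    best_subset, best_loss = None, np.inf
    while remaining:
        # try adding each remaining class; keep the single best addition
        cand, cand_loss = min(
            ((c, joint_loss_if_shared(subset + [c])) for c in remaining),
            key=lambda t: t[1],
        )
        subset.append(cand)
        remaining.remove(cand)
        if cand_loss < best_loss:
            best_loss, best_subset = cand_loss, list(subset)
    return best_subset, best_loss
```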
Sharing features for multiclass object detection
Torralba, Murphy, Freeman. CVPR 2004
Learning efficiency
Sharing features shows sub-linear scaling of the number of features with the number of objects (for area under ROC = 0.9). Red: shared features; blue: independent features.
How the features are shared across objects
Basic features: Filter responses computed at different locations of the image
Uncovering Shared Structures
in Multiclass Classification
Yonatan Amit
Michael Fink
Nathan Srebro
Shimon Ullman
Structure Learning Framework
 Linear classifiers $W$ (the class parameters) applied on top of a linear transformation $\Theta$ (the structural parameters).
 Find optimal parameters:
$$\arg\min_{W,\Theta}\ \sum_{i=1}^{n} \mathrm{Loss}(W \Theta x_i,\, y_i) \;+\; c_1\, \Omega(W) \;+\; c_2\, \Omega(\Theta)$$
Multiclass Loss Function
 Hinge Loss:
$\max(0,\ 1 - y\, w^\top x)$
 Maximal Hinge Loss:
$\mathrm{Loss}(W, \Theta, x, y) = \max_{j,p\,:\,y^j = 1,\ y^p = -1} \left(1 + w_p^\top \Theta x - w_j^\top \Theta x\right)$
Snapshot of the idea
 Main Idea: Enforce Sharing by finding low rank parameter matrix W
$\mathrm{Loss}(W, \Theta, x, y) = \max_{j,p\,:\,y^j = 1,\ y^p = -1} \left(1 + w_p^\top \Theta x - w_j^\top \Theta x\right)$
$\Theta$ can be viewed either as a transformation on $x$ or as a transformation on $w$.
 Consider the $m \times d$ parameter matrix $W' = W\Theta$.
 It can be factored as $W' = W\Theta$, where the rows of $\Theta$ form a basis and $W$ contains the per-class coefficients in that basis.
Low Rank Regularization Penalty
 The rank of a $d$-by-$m$ matrix $W'$ is the smallest $z$ such that $W'$ can be written as the product of a $d \times z$ matrix and a $z \times m$ matrix.
 A regularization penalty designed to minimize the rank of $W'$ would tend to produce solutions where a few basis vectors are shared by all classes.
 Minimizing the rank would lead to a hard combinatorial problem
 Instead use a trace norm penalty: $\|W'\|_{\Sigma} = \sum_i \sigma_i(W')$, the sum of the singular values of $W'$.
Putting it all together
$$\arg\min_{W'}\ \sum_{i=1}^{n} \mathrm{Loss}(W', x_i, y_i) \;+\; c\, \|W'\|_{\Sigma}$$
$\Theta$ no longer appears in the objective.
 For optimization they use a gradient-based method that minimizes a smooth approximation of the objective.
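As an illustration only (the authors use a gradient method on a smooth approximation of the objective), here is a proximal-gradient sketch in numpy that alternates a subgradient step on a per-class binary hinge loss (standing in for the maximal hinge loss above) with soft-thresholding of the singular values of $W'$. All names, losses, and step sizes are assumptions.

```python
import numpy as np

def prox_trace_norm(W, tau):
    """Proximal step for the trace norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def train_low_rank(X, Y, c=0.1, lr=0.01, iters=500):
    """Per-class hinge loss + trace-norm penalty via proximal gradient.
    X: n x d features; Y: n x m matrix with entries in {-1, +1}."""
    n, d = X.shape
    m = Y.shape[1]
    W = np.zeros((m, d))
    for _ in range(iters):
        scores = X @ W.T                        # n x m class scores
        margin = 1 - Y * scores
        G = -(Y * (margin > 0)).T @ X / n       # subgradient of the hinge loss
        W = prox_trace_norm(W - lr * G, lr * c) # gradient step, then trace-norm prox
    return W
```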
Mammals Dataset
Results
Transfer Learning for
Image Classification via
Sparse Joint Regularization
Ariadna Quattoni
Michael Collins
Trevor Darrell
Training visual classifiers when a few examples are available
Problem:
 Image classification from a few examples can be hard.
 A good representation of images is crucial.
Solution:
 We learn a good image representation using unlabeled data + labeled data from related problems.
Snapshot of the idea:
 Use an unlabeled dataset + a kernel function to compute a new representation:
 Complex features in a high dimensional space.
 Some of them will be very discriminative (hopefully); most will be irrelevant.
 If we knew the relevant features we could learn from fewer examples.
 Related problems may share relevant features.
 We can use data from related problems to discover them!
Semi-supervised learning
Step 1: Learn a visual representation $F : I \rightarrow \mathbb{R}^h$ from a large dataset of unlabeled data via unsupervised learning.
Step 2: Compute $F : I \rightarrow \mathbb{R}^h$ on a small training set of labeled images to obtain an $h$-dimensional training set, then train the classifier.
Semi-supervised learning:
 Raina et al. [ICML 2007] proposed an approach that learns a sparse set
of high level features (i.e. linear combinations of the original features)
from unlabeled data using a sparse coding technique.
 Balcan et al. [ML 2004] proposed a representation based on computing kernel
distances to unlabeled data points.
Learning visual representations using unlabeled data
only
 Unsupervised learning in data space
 Good thing:
 Lower dimensional representation preserves relevant statistics of the
data sample.
 Bad thing:
 The representation might still contain irrelevant features, i.e. features
that are useless for classification.
Learning visual representations from unlabeled data + labeled
data from related categories
Step 1: Learn a representation
Inputs: a large dataset of unlabeled images, labeled images from related categories, and a kernel function $k : X \times X \rightarrow \mathbb{R}$.
Create a new representation from the unlabeled data, then select the discriminative features of that new representation using the labeled related data, yielding a discriminative representation $F : I \rightarrow \mathbb{R}^h$.
Our contribution
 Main differences with previous approaches:
 Our choice of joint regularization norm allows us to express the
joint loss minimization as a linear program (i.e. no need for greedy
approximations.)
 While previous approaches build joint sparse classifiers on the feature space, our method discovers discriminative features in a space derived using the unlabeled data and uses these discriminative features to solve future problems.
Overview of the method
 Step I: Use the unlabeled data to compute a new representation space
[Kernel SVD].
 Step II: Use the labeled data from related problems to discover discriminative
features in the new space [Joint Sparse Regularization].
 Step III: Compute the new discriminative representation for the samples of
the target problem.
 Step IV: Train the target classifier using the representation of step III.
Step I: Compute a representation using the unlabeled data
 Perform Kernel SVD on the Unlabeled Data.
 A) Compute the kernel matrix $K$ of the unlabeled images $U$: $K_{ij} := k(x_i, x_j)$.
 B) Compute a projection matrix $A$ by taking all the eigenvectors of $K$.
Step I: Compute a representation using the unlabeled data
 C) Project the labeled data from the related problems $D$ into the new space.
 Notational shorthand: $x' = z(x)$.
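A minimal numpy sketch of this step, assuming a generic `kernel` function; the eigenvector scaling used in practice (e.g. dividing by the square roots of the eigenvalues, as in kernel PCA) is omitted, and all names are illustrative.

```python
import numpy as np

def learn_kernel_representation(X_unlabeled, kernel):
    """Kernel SVD on the unlabeled data; returns the projection z(x)."""
    U = list(X_unlabeled)
    K = np.array([[kernel(a, b) for b in U] for a in U])  # kernel matrix of unlabeled images
    _, A = np.linalg.eigh(K)                              # columns of A = eigenvectors of K
    def z(x):
        kx = np.array([kernel(x, u) for u in U])          # kernel values to the unlabeled points
        return A.T @ kx                                    # coordinates in the new space
    return z

# project the labeled data from the related problems:
# z = learn_kernel_representation(X_unlabeled, rbf_kernel)   # rbf_kernel: hypothetical
# X_prime = np.stack([z(x) for x in X_labeled])
```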
Sidetrack
 Another possible method for learning a representation from the unlabeled data would be to create a projection matrix Q by taking the h eigenvectors of A corresponding to the h largest eigenvalues.
 We will call this approach the Low Rank Baseline.
 Our method differs significantly from the low rank approach in that
we use training data from related problems to select discriminative features in
the new space.
Step II: Discover relevant features by joint sparse
approximation
 A classifier is a function:
$F : X \rightarrow Y$
where $X$ is the input space, i.e. the representation learnt from the unlabeled data, and $Y = \{-1, +1\}$ is a binary label, which in our application will be $+1$ if an image belongs to a particular topic and $-1$ otherwise.
 A loss function:
$l : (f(x), y) \mapsto \mathbb{R}$
Step II: Discover relevant features by joint sparse
approximation
 Consider learning a single sparse linear classifier (on the space learnt from
the unlabeled data) of the form:
$f(x') = w^{t} x'$
 A sparse model will have only a few features with non-zero coefficients.
 A natural choice for parameter estimation would be:
$$\min_{w}\ \sum_{(x,y)\in D} l(f(x'), y) \;+\; \lambda \sum_{j=1}^{p} |w_j|$$
The first term is the classification error; the L1 penalty discourages non-sparse solutions.
 Donoho [2004] has proven (in a regression setting) that the solution with
smallest L1 norm is also the sparsest solution.
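As a quick illustration (not the joint method of this talk), a single L1-penalized linear classifier on the projected features could be trained with scikit-learn. `X_prime` and `y` are assumed to come from Step I, and LinearSVC uses a squared hinge loss rather than the hinge loss above.

```python
import numpy as np
from sklearn.svm import LinearSVC

# L1-penalized linear classifier on the projected features x';
# a smaller C gives a sparser weight vector.
clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1)
clf.fit(X_prime, y)                           # X_prime: n x p projected data, y in {-1, +1}
n_used = int(np.count_nonzero(clf.coef_))     # number of features with non-zero coefficients
```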
Step II: Discover relevant features by joint sparse
approximation
 Goal: Find a subset of features R such that each problem in:
$D = \{D_1, D_2, \ldots, D_m\}$
can be well approximated by a sparse classifier whose non-zero coefficients
correspond to features in R.
 Solution : Regularized Joint loss minimization:
$f_k(x') = w_k^{t} x'$
$$\min_{W = [w_1, w_2, \ldots, w_m]}\ \sum_{k=1}^{m} \sum_{(x,y)\in D_k} l(f_k(x'), y) \;+\; \lambda\, \Phi(W)$$
The first term is the classification error on training set $k$; the penalty $\Phi(W)$ penalizes solutions that utilize too many features.
Step II: Discover relevant features by joint sparse
approximation
 How do we penalize solutions that use too many features?
$$W = \begin{bmatrix} w_1^1 & w_1^2 & \cdots & w_1^m \\ w_2^1 & w_2^2 & \cdots & w_2^m \\ \vdots & \vdots & & \vdots \\ w_p^1 & w_p^2 & \cdots & w_p^m \end{bmatrix}$$
where $w_i^k$ is the coefficient of feature $i$ in the classifier for problem $k$: column $k$ holds the coefficients of classifier $k$, and row $i$ holds the coefficients of feature $i$ across all classifiers.
$\Phi(W) = \#\{\text{non-zero rows of } W\}$
 Problem: this is not a proper norm and would lead to a hard combinatorial problem.
Step II: Discover relevant features by joint sparse
approximation
 Instead of using the #non-zero-rows pseudo-norm we will use a convex
relaxation [Tropp 2006]
$$\Phi(W) = \sum_{i=1}^{p} \max_{k}\, |w_i^k|$$
 This norm combines:
 An L1 norm on the maximum absolute values of the coefficients, which promotes sparsity of the row maxima (use few features).
 An L∞ norm on each row, which promotes non-sparsity within a row (share features across problems).
 The combination of the two norms results in a solution where only a few features are used, but the features that are used contribute to solving many classification problems.
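A tiny numpy illustration of this penalty (the matrix values are made up): a feature reused by several problems is only "paid for" once, at its largest coefficient.

```python
import numpy as np

def l1_inf_norm(W):
    """L1-Linf norm: sum over features (rows) of the max absolute
    coefficient across problems (columns)."""
    return np.abs(W).max(axis=1).sum()

W = np.array([[0.9, 0.8, 0.7],    # shared feature   -> contributes 0.9
              [0.0, 0.0, 0.0],    # unused feature   -> contributes 0.0
              [0.0, 0.5, 0.0]])   # problem-specific -> contributes 0.5
print(l1_inf_norm(W))             # 1.4
```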
Step II: Discover relevant features by joint sparse
approximation
 Using the L1- L∞ norm we can rewrite our objective function as:
$$\min_{W = [w_1, w_2, \ldots, w_m]}\ \sum_{k=1}^{m} \sum_{(x,y)\in D_k} l(f_k(x'), y) \;+\; \lambda \sum_{i=1}^{p} \max_{k}\, |w_i^k|$$
 For any convex loss this is a convex function, in particular when considering
the hinge loss:
$l(f(x)) = \max(0,\ 1 - y f(x))$
the optimization problem can be expressed as a linear program.
Step II: Discover relevant features by joint sparse
approximation
 Linear program formulation (hinge loss):
$$\min_{W, \varepsilon, t}\ \sum_{k=1}^{m} \sum_{j=1}^{|D_k|} \varepsilon_j^k \;+\; \lambda \sum_{i=1}^{p} t_i$$
 Max value constraints: for $k = 1, \ldots, m$ and $i = 1, \ldots, p$:
$$-t_i \le w_i^k \le t_i$$
 Slack variable constraints: for $k = 1, \ldots, m$ and $j = 1, \ldots, |D_k|$:
$$y_j^k f_k(x_j'^k) \ge 1 - \varepsilon_j^k, \qquad \varepsilon_j^k \ge 0$$
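For concreteness, here is a compact sketch of the same joint objective written with cvxpy rather than the explicit LP variables above; `Xs`, `ys`, and `lam` are illustrative names, and the piecewise-linear objective is equivalent to the LP formulation.

```python
import cvxpy as cp

def joint_sparse_hinge(Xs, ys, lam=0.1):
    """Joint hinge-loss minimization with the L1-Linf penalty.
    Xs: list of (n_k x p) arrays, ys: list of {-1, +1} label vectors."""
    p, m = Xs[0].shape[1], len(Xs)
    W = cp.Variable((p, m))                        # column k = classifier for problem k
    hinge = sum(cp.sum(cp.pos(1 - cp.multiply(ys[k], Xs[k] @ W[:, k])))
                for k in range(m))
    penalty = cp.sum(cp.max(cp.abs(W), axis=1))    # sum over features of the max |w| across problems
    prob = cp.Problem(cp.Minimize(hinge + lam * penalty))
    prob.solve()
    return W.value
```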
Step III: Compute the discriminative features representation
 Define the set of relevant features to be:
$R = \{\, r : \max_k(|w_r^k|) > \epsilon \,\}$
 Create a new representation by taking all the features in x’ corresponding to the
indexes in R
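A minimal sketch of this selection step, assuming `W` is the solution of the joint minimization above and `eps` is the (hypothetical) threshold $\epsilon$.

```python
import numpy as np

def select_relevant_features(W, eps=1e-6):
    """Indices of features whose max absolute coefficient across problems exceeds eps."""
    return np.where(np.abs(W).max(axis=1) > eps)[0]

# new representation for the target problem: keep only the selected coordinates of x'
# R = select_relevant_features(W)
# X_target_new = X_target_prime[:, R]
```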
Experiments: Dataset
Reuters Dataset
10,382 images, 108 topics.
Predict the 10 most frequent topics (binary prediction).
Data Partitions:
 3000 unlabeled images.
 2382 images as testing data.
 5000 images as the source of supervised training data.
 $L_n$: labeled training sets with n positive examples and 2n negative examples, for n = 1, 5, 10, 15, …, 50.
Dataset
Topics (number of images): SuperBowl [341], Sharon [321], Golden Globes [167], Grammys [170], Danish Cartoons [178], Australian Open [209], Figure Skating [146], Academy Awards [135], Trapped Coal Miners [196], Iraq [125].
Baseline Representation
 ‘Bag of words’ representation that combines: color, texture and raw local
image information
 Sampling:
 Sample image patches on a fixed grid
 For each image patch compute:
Color features based on HSV color histograms
Texture features based on mean responses of Gabor filters at different scales
and orientations
Raw features, normalized pixel values
 Create visual dictionary: for each feature type we do vector quantization and
create a dictionary V of 2000 visual words.
Baseline representation
 Compute baseline representation:
 Sample image patches over a fixed grid
 For every feature type map each patch to its closest visual word
in the corresponding dictionary.
 The final representation is given by:
$$g(I) = [\,c_1, c_2, \ldots, c_{|V_c|};\ t_1, t_2, \ldots, t_{|V_t|};\ r_1, r_2, \ldots, r_{|V_r|}\,]$$
where $c_i$ is the number of times that an image patch was mapped to the $i$-th color word.
Setting
 Step 1: Learn a representation using the unlabeled dataset and labeled datasets from 9 topics.
 Step 2: Train a classifier for the 10th held-out topic using the learnt representation.
 As evaluation we use the equal error rate averaged over the 10 topics.
Experiments
 Three models, all linear SVMs
 Baseline model (RFB):
Uses raw representation.
 Low Rank Baseline (LRB):
 Uses as a representation $v(x) = Q^{t} z(x)$, where Q consists of the h eigenvectors of the matrix A with the largest eigenvalues, computed in the first step of the algorithm.
 Sparse Transfer Model (SPT):
 Uses the representation computed by our algorithm.
 For both LRB and SPT we used an RBF kernel when computing the representation from the unlabeled data.
Results:
[Figure: Equal error rate (0.25 to 0.5) vs. number of positive training examples (1 to 50) for the RFB, SPT, and LRB models.]
Results:
Average equal error rate for models trained with 5 examples
Mean equal error rate per topic for classifiers trained with five positive examples, for the RFB model and the SPT model. SB: SuperBowl; GG: Golden Globes; DC: Danish Cartoons; Gr: Grammys; AO: Australian Open; Sh: Sharon; TC: Trapped Coal Miners; FS: Figure Skating; AA: Academy Awards; Ir: Iraq.
Results
Conclusion
 Summary:
 We described a method for learning discriminative sparse image
representations from unlabeled images + images from related tasks.
 The method is based on learning a representation from the unlabeled
data and performing joint sparse approximation on the data from
related tasks to find a subset of discriminative features.
 The induced representation improves performance when learning with
very few examples.
Future work
 Combine different representations
 Develop an efficient algorithm for solving the joint optimization that
will scale to very large datasets.
Joint Sparse Approximation
Discovers image representations which improve learning with
few examples
The LP formulation is feasible for small problems but
becomes intractable for larger data-sets.
Outline
 A Joint Sparse Approximation Model for Multi-task Learning
 An Efficient Algorithm
 Experiments
Joint Sparse Approximation as a Constrained Convex
Optimization Problem
$$\min_{W}\ \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{(x,y)\in D_k} l(f_k(x), y) \qquad \text{(a convex function)}$$
$$\text{s.t.}\quad \sum_{i=1}^{d} \max_{k}\,(|W_{ik}|) \;\le\; Q \qquad \text{(convex constraints)}$$
 We will use a Projected SubGradient method.
Main advantages: simple, scalable, guaranteed convergence rates.
 Projected SubGradient methods have been recently proposed:
 L2 regularization, i.e. SVM [Shalev-Shwartz et al. 2007]
 L1 regularization [Duchi et al. 2008]
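A minimal numpy sketch of such a projected subgradient loop for the constrained problem above; the step-size schedule and names are assumptions, and `project` stands for the L1-∞ projection developed next.

```python
import numpy as np

def projected_subgradient(Xs, ys, Q, project, steps=1000, lr=0.1):
    """Projected subgradient for the constrained joint hinge-loss problem.
    Xs: list of (n_k x d) arrays, ys: list of {-1, +1} labels,
    project(W, Q): Euclidean projection onto the L1-inf ball of radius Q."""
    d, m = Xs[0].shape[1], len(Xs)
    W = np.zeros((d, m))
    for t in range(1, steps + 1):
        G = np.zeros_like(W)
        for k in range(m):
            margin = ys[k] * (Xs[k] @ W[:, k])
            viol = margin < 1                          # examples violating the margin
            G[:, k] = -(Xs[k][viol] * ys[k][viol][:, None]).sum(axis=0) / len(ys[k])
        W = project(W - (lr / np.sqrt(t)) * G, Q)      # subgradient step, then project
    return W
```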
Euclidean Projection into the L1-∞ ball
 Projection:
$$\min_{B,\mu}\ \frac{1}{2}\sum_{i,j} (B_{i,j} - A_{i,j})^2 \quad \text{s.t.}\quad B_{i,j} \le \mu_i,\quad \sum_{i=1}^{d} \mu_i \le Q,\quad B_{i,j} \ge 0,\quad \mu_i \ge 0$$
 Inspecting the Lagrangian shows that it can be reduced to:
find $\mu$ such that $\sum_{i=1}^{d} \mu_i = Q$ and, for every $i$ with $\mu_i > 0$, $\sum_{j:\,A_{i,j} \ge \mu_i} (A_{i,j} - \mu_i) = \theta$ for some common $\theta \ge 0$;
then set $B_{i,j} = A_{i,j}$ if $A_{i,j} \le \mu_i$ and $B_{i,j} = \mu_i$ if $A_{i,j} \ge \mu_i$.
 We reduced the projection to finding new maximums for each
feature across tasks and using them to truncate A.
 The total mass removed from a feature across tasks should be the
same for all features whose coefficients don’t vanish.
Euclidean Projection into the L1-∞ ball
[Worked example (figure): 3 features, 6 problems, $Q = 14$, so $\sum_{i=1}^{3}\mu_i = 14$. Bar plots show each feature's coefficients across the 6 problems; each feature is truncated at its new maximum $\mu_i$, chosen so that the mass removed from every non-vanishing feature equals the same constant $\theta$.]
An Efficient Algorithm For Finding μ
 Recall that we need to find a vector of maximums µ that sums to
Q and a constant θ such that the mass loss for every feature is
the same.
 Consider the mass-loss function of each feature and its inverse:
$R_i : \mu_i \mapsto \theta_i$ and $R_i^{-1} : \theta \mapsto \mu_i$; both are piecewise linear functions.
 Consider the function that takes a mass loss $\theta$ and computes the corresponding sum of $\mu$ (the norm of the new matrix):
$$N(\theta) = \sum_{i=1}^{d} R_i^{-1}(\theta)$$
This is also piecewise linear.
 We can construct this function and pick the $\theta$ for which $N(\theta) = Q$.
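A sketch of the projection in numpy; for simplicity it finds $\theta$ (and each row's $\mu_i$) by bisection rather than by the exact piecewise-linear construction described above, so it is slower but follows the same characterization. Names and tolerances are illustrative.

```python
import numpy as np

def project_l1_inf(A, Q, tol=1e-8):
    """Euclidean projection of A onto {B : sum_i max_j |B_ij| <= Q}."""
    signs = np.sign(A)
    A = np.abs(A)
    if A.max(axis=1).sum() <= Q:
        return signs * A                       # already inside the ball

    def row_mu(row, theta):
        # truncation level mu_i whose removed mass sum_j (row_j - mu_i)_+ equals theta
        if row.sum() <= theta:
            return 0.0                         # the whole row is removed
        lo, hi = 0.0, row.max()
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if np.clip(row - mid, 0, None).sum() > theta:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    def mu_of_theta(theta):
        return np.array([row_mu(row, theta) for row in A])

    lo, hi = 0.0, A.sum()
    while hi - lo > tol:                       # bisection on theta: want N(theta) = Q
        theta = 0.5 * (lo + hi)
        if mu_of_theta(theta).sum() > Q:
            lo = theta
        else:
            hi = theta
    mu = mu_of_theta(0.5 * (lo + hi))
    return signs * np.minimum(A, mu[:, None])  # truncate each row at its new maximum
```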
Complexity
 The total cost of the algorithm is dominated by the initial sort of each row of A, $O(dm \log m)$, and the merge of these rows to build the piecewise linear function, $O(dm \log d)$.
 The total cost is on the order of $O(dm \log(dm))$.
 Notice that we only need to consider the non-zero rows of A, so d is really the number of non-zero rows rather than the actual number of features.
Outline
 A Joint Sparse Approximation Model for Multi-task Learning
 An Efficient Algorithm
 Experiments
Synthetic Experiments
 All tasks consisted of predicting functions of the form $\mathrm{sign}(w \cdot x)$.
 To generate jointly sparse vectors we randomly selected 10% of
the features to be the relevant feature set V . Then for each task we
randomly selected a subset v and zeroed all parameters outside v.
 We use the same projected subgradient method, and compared three
different types of projection steps:
 L1−∞ projection
 L2 projection
 L1 projection
Synthetic Experiments Results: 60 problems, 200 features, 10% relevant.
[Figure: Two panels, x-axis = number of training examples per task (up to 640). One panel shows test error for the L2, L1, and L1-∞ projections; the other shows feature-selection performance (precision and recall of the relevant features) for the L1 and L1-∞ projections.]
Dataset: News story prediction
Topics: SuperBowl, Golden Globes, Sharon, Grammys, Danish Cartoons, Australian Open, Figure Skating, Academy Awards, Trapped Coal Miners, Iraq.
 40 tasks
 Raw image representation: bag of visual words
 Linear kernel
 3000 dimensions
Image Classification Experiments
Reuters Dataset Results
[Figure: Mean equal error rate (about 0.32 to 0.46) vs. number of training examples per task (15 to 240) for the L2, L1, and L1-∞ regularized models.]
[Figure: Absolute weights (3000 features x 40 tasks) of the L1 and L1-∞ solutions.]
Conclusion and Future Work
 Presented a simple and effective algorithm for training joint
models with L1−∞ constraints.
 The algorithm scales linearly with the number of examples and
O(dm log(dm) ) with the number of problems and dimensions.
 The experiments on a real image dataset show that this
algorithm can find solutions that are jointly sparse, resulting in
lower test error.
 We believe our approach can be extended to work on an
online multi-task learning setting.
Future Work: Online Multitask Classification [Cavallanti et al.
2008]
 There are m binary classification tasks indexed by 1, …, m.
 At each time step t = 1, 2, …, T the learner receives a task index k and the corresponding instance vector.
 Based on this information the learner outputs a binary prediction and then observes the correct label $y_k$.
 We are interested in comparing the learner’s mistake count to that
of the optimal predictors:
$$\inf_{w_1, \ldots, w_m \in \mathbb{R}^d}\ \sum_{t=1}^{T} l\big(w_{i_t},\, y_{i_t},\, x_{i_t}\big)$$
Thanks!