Maya Gupta

Similarity-based Classifiers: Problems and Solutions
Classifying based on similarities: Van Gogh or Monet?
[Figure: example paintings labeled Van Gogh and Monet.]
The Similarity-based Classification Problem

Training samples: \{(x_i, y_i)\}_{i=1}^{n}, with x_i \in \Omega (paintings), y_i \in \mathcal{G} (painter), i = 1, \dots, n.

Underlying similarity function: \psi : \Omega \times \Omega \to \mathbb{R}.

Training similarities: S = [\psi(x_i, x_j)]_{n \times n}, \quad y = [y_1 \, \dots \, y_n]^T.

Test similarities: s = [\psi(x, x_1) \, \dots \, \psi(x, x_n)]^T, and \psi(x, x).

Problem: Estimate the class label \hat{y} for the test sample x given S, y, s, and \psi(x, x).
Examples of Similarity Functions

Computational Biology
– Smith-Waterman algorithm (Smith & Waterman, 1981)
– FASTA algorithm (Lipman & Pearson, 1985)
– BLAST algorithm (Altschul et al., 1990)

Computer Vision
– Tangent distance (Duda et al., 2001)
– Earth mover's distance (Rubner et al., 2000)
– Shape matching distance (Belongie et al., 2002)
– Pyramid match kernel (Grauman & Darrell, 2007)

Information Retrieval
– Levenshtein distance (Levenshtein, 1966)
– Cosine similarity between tf-idf vectors (Manning & Schütze, 1999)
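As a small illustration of the last item, here is a minimal numpy sketch of cosine similarity between tf-idf vectors (the toy term counts and helper names are mine, not from the talk):

```python
import numpy as np

def tfidf(counts):
    """counts: (n_docs, n_terms) raw term counts; returns a simple tf-idf weighting."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    idf = np.log(counts.shape[0] / np.maximum((counts > 0).sum(axis=0), 1))
    return tf * idf

def cosine_sim(a, b):
    """psi(a, b) = a.b / (||a|| ||b||), a symmetric similarity in [-1, 1]."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

docs = tfidf(np.array([[3., 0., 1.], [2., 1., 0.], [0., 4., 2.]]))
print(cosine_sim(docs[0], docs[1]))
```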
Approaches to Similarity-based Classification

Classify x given S, y, s, and \psi(x, x).
Can we treat similarities as kernels?

Kernels are inner products in some Hilbert space.

Example inner product: \langle x, z \rangle = x^T z.

Properties of an inner product \langle x, z \rangle:
– conjugate symmetric, real
– linear: \langle a x, z \rangle = a \langle x, z \rangle
– positive definite: \langle x, x \rangle > 0 unless x = 0
An inner product implies a norm: \|x\| = \sqrt{\langle x, x \rangle}.

Inner products are similarities. Are our notions of similarity always inner products? No!
Example: Amazon similarity

\Omega = the space of all books,
\psi(A, B) = \% of customers who buy book A after viewing book B on Amazon.

[Figure: the 96 x 96 similarity matrix S for 96 books.]

Is it inner-product-like? No:
– Asymmetric: \psi(\text{HTF}, \text{Bishop}) = 3, but \psi(\text{Bishop}, \text{HTF}) = 8.
– Not PSD: the spectrum of S contains negative eigenvalues.
[Figure: eigenvalues of S versus eigenvalue rank, showing negative eigenvalues.]
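A small numpy sketch of the two checks being made here, symmetry and positive semidefiniteness, on a toy matrix standing in for the 96 x 96 Amazon matrix (the toy values are mine):

```python
import numpy as np

def kernel_check(S, tol=1e-8):
    """Is a similarity matrix S usable directly as a kernel matrix?
    Checks symmetry and the spectrum of the symmetrized matrix."""
    symmetric = np.allclose(S, S.T, atol=tol)
    eigvals = np.linalg.eigvalsh(0.5 * (S + S.T))
    return {"symmetric": symmetric,
            "min eigenvalue": float(eigvals.min()),
            "PSD": bool(eigvals.min() >= -tol)}

# Toy 3 x 3 stand-in: asymmetric, and indefinite even after symmetrizing.
S_toy = np.array([[10.,  3., 12.],
                  [ 8., 10.,  0.],
                  [12.,  1., 10.]])
print(kernel_check(S_toy))
```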
Well, let's just make S be a kernel matrix

First, symmetrize: S \leftarrow \frac{1}{2}(S + S^T), so that S = U \Lambda U^T with \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n).

Clip: S_{\mathrm{clip}} = U \, \mathrm{diag}(\max(\lambda_1, 0), \dots, \max(\lambda_n, 0)) \, U^T
S_{\mathrm{clip}} is the PSD matrix closest to S in the Frobenius norm (the projection of S onto the PSD cone).

Flip: S_{\mathrm{flip}} = U \, \mathrm{diag}(|\lambda_1|, \dots, |\lambda_n|) \, U^T
(similar effect: S_{\mathrm{new}} = S^T S)

Shift: S_{\mathrm{shift}} = U (\Lambda + |\min(\lambda_{\min}(S), 0)| I) U^T

Flip, clip or shift? Best bet is clip.
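A minimal numpy sketch of the three spectrum modifications (the function names are mine, not from the talk):

```python
import numpy as np

def symmetrize(S):
    return 0.5 * (S + S.T)

def clip_spectrum(S):
    """Project the symmetrized S onto the PSD cone (nearest PSD matrix in Frobenius norm)."""
    lam, U = np.linalg.eigh(symmetrize(S))
    return U @ np.diag(np.maximum(lam, 0.0)) @ U.T

def flip_spectrum(S):
    """Replace each eigenvalue of the symmetrized S by its absolute value."""
    lam, U = np.linalg.eigh(symmetrize(S))
    return U @ np.diag(np.abs(lam)) @ U.T

def shift_spectrum(S):
    """Add |min(lambda_min, 0)| to every eigenvalue of the symmetrized S."""
    S_sym = symmetrize(S)
    lam_min = np.linalg.eigvalsh(S_sym).min()
    return S_sym + abs(min(lam_min, 0.0)) * np.eye(S.shape[0])
```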
Or, learn the best kernel matrix for the SVM (Luss & d'Aspremont, NIPS 2007; Chen et al., ICML 2009):

\min_{K \succeq 0} \; \min_{f \in \mathcal{H}_K} \; \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F
Approaches to Similarity-based Classification

Classify x given S, y, s, and \psi(x, x).
Let the similarities to the training samples be features

Let [\psi(x, x_1) \, \dots \, \psi(x, x_n)]^T \in \mathbb{R}^n be the feature vector for x.
– SVM (Graepel et al., 1998; Liao & Noble, 2003)
– Linear programming (LP) machine (Graepel et al., 1999)
– Linear discriminant analysis (LDA) (Pekalska et al., 2001)
– Quadratic discriminant analysis (QDA) (Pekalska & Duin, 2002)
– Potential support vector machine (P-SVM) (Hochreiter & Obermayer, 2006; Knebel et al., 2008):
\min_{\alpha} \; \frac{1}{2} \|y - S\alpha\|_2^2 + \epsilon \|\alpha\|_1 + \gamma \|\alpha\|_\infty
Does this work asymptotically? Our results suggest you need to choose a subset of the similarities whose size grows slowly with n.
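A minimal sketch of the similarities-as-features route, assuming scikit-learn is available (the toy data, the choice of LinearSVC, and C = 1 are illustrative choices of mine, not the talk's experimental setup):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))                 # toy points standing in for Omega
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy labels
S = X @ X.T                                  # toy similarity: inner products

clf = LinearSVC(C=1.0, max_iter=10000)       # C would normally be chosen by cross-validation
clf.fit(S, y)                                # row i of S is the feature vector for training sample i

x_test = rng.normal(size=(5, 2))
s_test = x_test @ X.T                        # similarities of each test sample to the n training samples
print(clf.predict(s_test))
```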
Results (% test error):

                                      Amazon47  Aural Sonar  Caltech101  Face Rec  Mirex  Voting (VDM)
# classes                             47        2            101         139       10     2
# samples                             204       100          8677        945       3090   435

SVM-kNN (clip) (Zhang et al. 2006)    17.56     13.75        36.82       4.23      61.25  5.23
SVM sim-as-kernel (clip)              81.24     13.00        33.49       4.18      57.83  4.89
SVM sim-as-feature (linear)           76.10     14.25        38.18       4.29      55.54  5.40
SVM sim-as-feature (RBF)              75.98     14.25        38.16       3.92      55.72  5.52
P-SVM                                 70.12     14.25        34.23       4.05      63.81  5.34
Approaches to Similarity-based Classification

Classify x given S, y, s, and \psi(x, x).
Weighted Nearest-Neighbors

Take a weighted vote of the k nearest neighbors:

\hat{y} = \arg\max_{g \in \mathcal{G}} \sum_{i=1}^{k} w_i I\{y_i = g\}

Algorithmic parallel of the exemplar model of human learning.

For w_i \geq 0 and \sum_i w_i = 1, we get a class posterior estimate:

\hat{P}(Y = g \mid X = x) = \sum_{i=1}^{k} w_i I\{y_i = g\}

Good for asymmetric costs, good for interpretation, good for system integration.
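A minimal sketch of the weighted vote and the posterior estimate above, assuming numpy (the default weight function is an illustrative affinity-style placeholder, not one of the talk's weight designs):

```python
import numpy as np

def weighted_knn(s, y, k, weight_fn=lambda sk: np.maximum(sk, 1e-12)):
    """Weighted k-NN vote from similarities.
    s: (n,) similarities of the test sample to the training samples; y: (n,) labels.
    weight_fn maps the k nearest similarities to nonnegative weights."""
    nn = np.argsort(-s)[:k]                  # k most similar training samples
    w = weight_fn(s[nn])
    w = w / w.sum()                          # normalize so the vote is a posterior estimate
    posterior = {g: float(w[y[nn] == g].sum()) for g in np.unique(y)}
    y_hat = max(posterior, key=posterior.get)
    return y_hat, posterior

# Toy usage:
s = np.array([0.9, 0.1, 0.8, 0.7, 0.2])
y = np.array([0, 1, 0, 1, 1])
print(weighted_knn(s, y, k=3))
```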
Design Goals for the Weights (Chen et al., JMLR 2009)

Design Goal 1 (Affinity): w_i should be an increasing function of \psi(x, x_i).

Design Goal 2 (Diversity): w_i should be a decreasing function of \psi(x_i, x_j).
Linear Interpolation Weights

Linear interpolation weights will meet these goals:

\sum_i w_i x_i = x, \quad \text{such that } w_i \geq 0, \; \sum_i w_i = 1

[Figure: when x lies inside the convex hull of x_1, \dots, x_4 the solution is non-unique; when x lies outside the hull there is no solution.]
LIME weights

Linear interpolation with maximum entropy (LIME) weights (Gupta et al., IEEE PAMI 2006):

\min_{w} \; \Big\| \sum_{i=1}^{k} w_i x_i - x \Big\|_2^2 + \lambda \sum_{i=1}^{k} w_i \log w_i
subject to \; \sum_{i=1}^{k} w_i = 1, \; w_i \geq 0, \; i = 1, \dots, k.

The maximum-entropy term pushes the weights to be equal.
Maximum entropy gives an exponential-form solution, is consistent (Friedlander & Gupta, IEEE IT 2005), and averages out noise.
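A sketch of solving the LIME objective numerically with a generic constrained optimizer, assuming scipy is available (this is purely for illustration and is not the solver from the cited paper):

```python
import numpy as np
from scipy.optimize import minimize

def lime_weights(X, x, lam=0.1):
    """X: (k, d) rows are the k nearest neighbors; x: (d,) test point.
    Minimizes ||sum_i w_i x_i - x||^2 + lam * sum_i w_i log w_i over the simplex."""
    k = X.shape[0]

    def objective(w):
        resid = X.T @ w - x
        entropy_term = np.sum(w * np.log(np.clip(w, 1e-12, None)))  # avoid log(0)
        return resid @ resid + lam * entropy_term

    res = minimize(objective,
                   x0=np.full(k, 1.0 / k),
                   method="SLSQP",
                   bounds=[(0.0, 1.0)] * k,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return res.x

X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
print(lime_weights(X, np.array([0.6, 0.3])))
```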
Kernelize Linear Interpolation (Chen et al., JMLR 2009)

Let X = [x_1, \dots, x_k], rewrite the LIME objective with matrices, and change to a ridge regularizer:

\min_{w} \; \frac{1}{2} w^T X^T X w - x^T X w + \frac{\lambda}{2} w^T w
subject to \; w \geq 0, \; 1^T w = 1.

The ridge term regularizes the variance of the weights.
The objective only needs inner products, so we can replace them with a kernel, or with similarities!
KRI Weights Satisfy Design Goals

Kernel ridge interpolation (KRI) weights:

\min_{w} \; \frac{1}{2} w^T S w - s^T w + \frac{\lambda}{2} w^T w
subject to \; w \geq 0, \; 1^T w = 1.

Affinity: s = [\psi(x, x_1) \, \dots \, \psi(x, x_n)]^T, so w_i is high if \psi(x, x_i) is high.

Diversity: \frac{1}{2} w^T S w = \frac{1}{2} \sum_{i,j} \psi(x_i, x_j) w_i w_j, which discourages concentrating weight on mutually similar neighbors.

Make S PSD and the problem is a QP with box constraints, which can be solved with SMO.
Remove the constraints on the weights:

\arg\min_{w} \; \frac{1}{2} w^T S w - s^T w + \frac{\lambda}{2} w^T w \;=\; (S + \lambda I)^{-1} s

One can show this is equivalent to local ridge regression: the KRR weights.
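A numpy sketch of the unconstrained (KRR) weights; the constrained KRI weights would additionally require a QP solver:

```python
import numpy as np

def krr_weights(S, s, lam):
    """Unconstrained kernel ridge interpolation weights: w = (S + lam I)^{-1} s.
    S: (k, k) similarities among the k neighbors; s: (k,) test-to-neighbor similarities."""
    return np.linalg.solve(S + lam * np.eye(S.shape[0]), s)
```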
Weighted k-NN: Example 1

S = \begin{bmatrix} 5 & 0 & 0 & 0 \\ 0 & 5 & 0 & 0 \\ 0 & 0 & 5 & 0 \\ 0 & 0 & 0 & 5 \end{bmatrix}, \quad s = \begin{bmatrix} 4 \\ 3 \\ 2 \\ 1 \end{bmatrix}

KRI weights: w_{KRI} = \arg\min_{w \geq 0,\, 1^T w = 1} \; \frac{1}{2} w^T S w - s^T w + \frac{\lambda}{2} w^T w
KRR weights: w_{KRR} = (S + \lambda I)^{-1} s

[Figure: KRI and KRR weights w_1, \dots, w_4 versus \lambda from 10^{-2} to 10^{2}; the weights are ordered by s, and the KRI weights approach the uniform 1/4 as \lambda grows.]
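A quick numerical companion to this example, computing the KRR weights over a range of \lambda (a sketch; the plotted KRI curves would also need the simplex constraints):

```python
import numpy as np

S = 5.0 * np.eye(4)                  # Example 1: equally self-similar, mutually dissimilar neighbors
s = np.array([4.0, 3.0, 2.0, 1.0])

for lam in [1e-2, 1e0, 1e2]:
    w = np.linalg.solve(S + lam * np.eye(4), s)   # KRR weights
    print(f"lambda = {lam:g}: w = {np.round(w, 3)}")
```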
Weighted k-NN: Example 2

S = \begin{bmatrix} 5 & 1 & 1 & 1 \\ 1 & 5 & 4 & 2 \\ 1 & 4 & 5 & 2 \\ 1 & 2 & 2 & 5 \end{bmatrix}, \quad s = \begin{bmatrix} 3 \\ 3 \\ 3 \\ 3 \end{bmatrix}

KRI weights: w_{KRI} = \arg\min_{w \geq 0,\, 1^T w = 1} \; \frac{1}{2} w^T S w - s^T w + \frac{\lambda}{2} w^T w
KRR weights: w_{KRR} = (S + \lambda I)^{-1} s

[Figure: weights versus \lambda; the mutually similar samples x_2 and x_3 share weight (w_2 = w_3, the smallest), while the more diverse x_1 and x_4 receive larger weights.]
Weighted k-NN: Example 3

S = \begin{bmatrix} 5 & 1 & 1 & 1 \\ 1 & 5 & 4 & 2 \\ 1 & 4 & 5 & 2 \\ 1 & 2 & 2 & 5 \end{bmatrix}, \quad s = \begin{bmatrix} 2 \\ 4 \\ 3 \\ 3 \end{bmatrix}

KRI weights: w_{KRI} = \arg\min_{w \geq 0,\, 1^T w = 1} \; \frac{1}{2} w^T S w - s^T w + \frac{\lambda}{2} w^T w
KRR weights: w_{KRR} = (S + \lambda I)^{-1} s

[Figure: weights versus \lambda; w_2 dominates, and in the unconstrained KRR solution w_3 can go negative for small \lambda.]
Results (% test error):

                                      Amazon47  Aural Sonar  Caltech101  Face Rec  Mirex  Voting
# samples                             204       100          8677        945       3090   435
# classes                             47        2            101         139       10     2

LOCAL
k-NN                                  16.95     17.00        41.55       4.23      61.21  5.80
affinity k-NN                         15.00     15.00        39.20       4.23      61.15  5.86
KRI k-NN (clip)                       17.68     14.00        30.13       4.15      61.20  5.29
KRR k-NN (pinv)                       16.10     15.25        29.90       4.31      61.18  5.52
SVM-KNN (clip)                        17.56     13.75        36.82       4.23      61.25  5.23

GLOBAL
SVM sim-as-kernel (clip)              81.24     13.00        33.49       4.18      57.83  4.89
SVM sim-as-feature (linear)           76.10     14.25        38.18       4.29      55.54  5.40
SVM sim-as-feature (RBF)              75.98     14.25        38.16       3.92      55.72  5.52
P-SVM                                 70.12     14.25        34.23       4.05      63.81  5.34
Approaches to Similarity-based Classification

Classify x given S, y, s, and \psi(x, x).
Generative Classifiers

Model the probability of what you see given each class:
– Linear discriminant analysis
– Quadratic discriminant analysis
– Gaussian mixture models, ...
Pro: produces class probabilities.

Our goal: model P(T(s) | g), where g is the class and T(s) is a vector of descriptive statistics of s.
We use T(s) = [\psi(x, \mu_1), \psi(x, \mu_2), \dots, \psi(x, \mu_G)], where \mu_h is a centroid for class h.
Similarity Discriminant Analysis (Cazzanti and Gupta, ICML 2007, 2008, 2009)

Model P(T(s) | g):
– Assume the G similarities are class-conditionally independent.
– Estimate each P(\psi(x, \mu_h) | g) as the maximum-entropy distribution given its empirical mean; the result is exponential.
– Reduce model bias by applying the model locally (local SDA).
– Reduce estimation variance by regularizing over localities.

Regularized local SDA performance: competitive.
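A rough sketch of the SDA idea under my own simplifying assumptions: nonnegative similarities, each class centroid taken to be the member with the largest average within-class similarity, and an exponential maximum-entropy model matched to the empirical mean. The cited papers treat these choices more carefully; this is only meant to make the pipeline concrete.

```python
import numpy as np

def sda_fit(S, y):
    """S[i, j] = psi(x_i, x_j) for training samples; y: integer class labels."""
    classes = np.unique(y)
    centroids, rates, priors = {}, {}, {}
    for h in classes:
        members = np.where(y == h)[0]
        # centroid mu_h: member with the largest average within-class similarity (an assumption)
        centroids[h] = members[np.argmax(S[np.ix_(members, members)].mean(axis=0))]
    for g in classes:
        members = np.where(y == g)[0]
        priors[g] = len(members) / len(y)
        # exponential model for psi(x, mu_h) given class g, with rate = 1 / empirical mean
        rates[g] = {h: 1.0 / max(S[members, centroids[h]].mean(), 1e-12) for h in classes}
    return classes, centroids, rates, priors

def sda_predict(s, classes, centroids, rates, priors):
    """s[j] = psi(x, x_j) for a test sample x; assumes class-conditional independence."""
    def log_posterior(g):
        ll = sum(np.log(rates[g][h]) - rates[g][h] * s[centroids[h]] for h in classes)
        return ll + np.log(priors[g])
    return max(classes, key=log_posterior)
```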
Some Conclusions

Performance depends heavily on the oddities of each dataset.
Weighted k-NN with affinity-diversity weights works well.
Preliminary: regularized local SDA works well.
Probabilities are useful.
Local models are useful:
- less approximating
- hard to model the entire space; is there an underlying manifold?
- always feasible
Lots of Open Questions

Making S PSD
Fast k-NN search for similarities
Similarity-based regression
Relationship with learning on graphs
Try it out on real data
Fusion with Euclidean features (see our FUSION 2009 papers)
Open theoretical questions (Chen et al., JMLR 2009; Balcan et al., ML 2008)

Code/Data/Papers: idl.ee.washington.edu/similaritylearning
See "Similarity-based Classification" by Chen et al., JMLR 2009.
Training and Test Consistency

For a test sample x, given s = [\psi(x, x_1) \, \dots \, \psi(x, x_n)]^T, shall we classify x as \hat{y} = \mathrm{sgn}((c^\star)^T s + b^\star)?

No! If a training sample were re-used as a test sample, its predicted class could change!
Data Sets

[Figures: similarity matrices and eigenvalue spectra (eigenvalue versus eigenvalue rank) for the Amazon, Aural Sonar, Protein, Voting, Yeast-5-7, and Yeast-5-12 data sets.]
SVM Review

Empirical risk minimization (ERM) with regularization:

\min_{f \in \mathcal{H}_K} \; \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) + \eta \|f\|_K^2

Hinge loss: L(f(x), y) = \max(1 - y f(x), 0)
[Figure: hinge loss and 0-1 loss plotted against y f(x).]

SVM primal:

\min_{c, b, \xi} \; \frac{1}{n} 1^T \xi + \eta\, c^T K c
subject to \; \mathrm{diag}(y)(K c + b 1) \geq 1 - \xi, \; \xi \geq 0.
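For reference, a sketch of the global sim-as-kernel route (clip, then an SVM with a precomputed kernel), assuming scikit-learn; the handling of test similarities is simplified here, whereas a full treatment transforms them consistently with the clipping:

```python
import numpy as np
from sklearn.svm import SVC

def clip_psd(S):
    lam, U = np.linalg.eigh(0.5 * (S + S.T))
    return U @ np.diag(np.maximum(lam, 0.0)) @ U.T

def svm_sim_as_kernel(S_train, y_train, S_test, C=1.0):
    """S_train: (n, n) training similarities; S_test: (m, n) test-to-train similarities."""
    clf = SVC(kernel="precomputed", C=C)     # C would normally be cross-validated
    clf.fit(clip_psd(S_train), y_train)
    return clf.predict(S_test)               # simplified: raw test similarities used directly

# Toy usage with inner-product similarities:
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3)); y = (X[:, 0] > 0).astype(int)
Xt = rng.normal(size=(5, 3))
print(svm_sim_as_kernel(X @ X.T, y, Xt @ X.T))
```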
Learning the Kernel Matrix

Find the K that is best for classification, regularized toward S:

\min_{K \succeq 0} \; \min_{f \in \mathcal{H}_K} \; \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F

An SVM that learns the full kernel matrix:

\min_{c, b, \xi, K} \; \frac{1}{n} 1^T \xi + \eta\, c^T K c + \gamma \|K - S\|_F
subject to \; \mathrm{diag}(y)(K c + b 1) \geq 1 - \xi, \; \xi \geq 0, \; K \succeq 0.
Related Work

SVM dual:

\max_{\alpha} \; 1^T \alpha - \frac{1}{2} \alpha^T \mathrm{diag}(y) K \mathrm{diag}(y) \alpha
subject to \; y^T \alpha = 0, \; 0 \leq \alpha \leq C 1.

Robust SVM (Luss & d'Aspremont, 2007):

\max_{\alpha} \; \min_{K \succeq 0} \; 1^T \alpha - \frac{1}{2} \alpha^T \mathrm{diag}(y) K \mathrm{diag}(y) \alpha + \rho \|K - S\|_F^2
subject to \; y^T \alpha = 0, \; 0 \leq \alpha \leq C 1.

"This can be interpreted as a worst-case robust classification problem with bounded uncertainty on the kernel matrix K."
Related Work

Let A = \{ \alpha \in \mathbb{R}^n \mid y^T \alpha = 0, \; 0 \leq \alpha \leq C 1 \}.

Rewrite the robust SVM as

\max_{\alpha \in A} \; \min_{K \succeq 0} \; 1^T \alpha - \frac{1}{2} \alpha^T \mathrm{diag}(y) K \mathrm{diag}(y) \alpha + \rho \|K - S\|_F^2

Theorem (Sion, 1958). Let M and N be convex spaces, one of which is compact, and let f(\mu, \nu) be a function on M \times N that is quasi-concave in \mu, quasi-convex in \nu, upper semicontinuous in \mu for each \nu \in N, and lower semicontinuous in \nu for each \mu \in M. Then

\sup_{\mu \in M} \inf_{\nu \in N} f(\mu, \nu) = \inf_{\nu \in N} \sup_{\mu \in M} f(\mu, \nu).
Related Work

By Sion's minimax theorem there is zero duality gap, so the robust SVM is equivalent to:

\min_{K \succeq 0} \; \max_{\alpha \in A} \; 1^T \alpha - \frac{1}{2} \alpha^T \mathrm{diag}(y) K \mathrm{diag}(y) \alpha + \rho \|K - S\|_F^2

Here the inner problem is the SVM dual objective, playing the role of g(\lambda) = L(x^\star, \lambda). Compare with our formulation, whose inner problem is the primal objective, playing the role of f(x) = L(x, \lambda^\star):

\min_{K \succeq 0} \; \min_{f \in \mathcal{H}_K} \; \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F
Learning the Kernel Matrix

It is not trivial to solve directly:

\min_{c, b, \xi, K} \; \frac{1}{n} 1^T \xi + \eta\, c^T K c + \gamma \|K - S\|_F
subject to \; \mathrm{diag}(y)(K c + b 1) \geq 1 - \xi, \; \xi \geq 0, \; K \succeq 0.

Lemma (Generalized Schur Complement). Let K \in \mathbb{R}^{n \times n}, z \in \mathbb{R}^n and u \in \mathbb{R}. Then

\begin{bmatrix} K & z \\ z^T & u \end{bmatrix} \succeq 0

if and only if K \succeq 0, z is in the range of K, and u - z^T K^{\dagger} z \geq 0.

Let z = K c, and notice that c^T K c = z^T K^{\dagger} z since K K^{\dagger} K = K.
However, it can be expressed as a convex conic program:

\min_{z, b, \xi, K, u, v} \; \frac{1}{n} 1^T \xi + \eta u + \gamma v
subject to \; \mathrm{diag}(y)(z + b 1) \geq 1 - \xi, \; \xi \geq 0,
\begin{bmatrix} K & z \\ z^T & u \end{bmatrix} \succeq 0, \; \|K - S\|_F \leq v.

– We can recover the optimal c^\star as c^\star = (K^\star)^{\dagger} z^\star.
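A sketch of this conic program using cvxpy as an assumed modeling layer (an SDP-capable solver such as SCS is required; this is not the solver used in the original work). The bordered PSD matrix is modeled as a single PSD variable whose blocks play the roles of K, z, and u:

```python
import numpy as np
import cvxpy as cp

def learn_kernel_svm(S, y, eta=1.0, gamma=1.0):
    """S: (n, n) symmetrized similarities; y: (n,) labels in {-1, +1}."""
    n = S.shape[0]
    M = cp.Variable((n + 1, n + 1), PSD=True)   # bordered matrix [[K, z], [z^T, u]]
    K, z, u = M[:n, :n], M[:n, n], M[n, n]
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    v = cp.Variable()

    constraints = [cp.multiply(y, z + b) >= 1 - xi,   # diag(y)(z + b1) >= 1 - xi
                   cp.norm(K - S, "fro") <= v]
    objective = cp.Minimize(cp.sum(xi) / n + eta * u + gamma * v)
    cp.Problem(objective, constraints).solve()

    K_star = K.value
    c_star = np.linalg.pinv(K_star) @ z.value          # recover c* = (K*)^+ z*
    return K_star, c_star, b.value
```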
Learning the Spectrum Modification

Concerns about learning the full kernel matrix:
– Though the problem is convex, the number of variables is O(n^2).
– The flexibility of the model may lead to overfitting.