Slides (Lampert)

advertisement
Machine Learning II
Peter Gehler
TU Darmstadt
Feb 4, 2011
Acknowledgement
Slides from Christoph H. Lampert
I.S.T. Austria, Vienna
Slides and Additional Material
http://www.christoph-lampert.org
also look for
Christoph H. Lampert
Kernel Methods in Computer Vision
Foundations and Trends in Computer Graphics and Vision, Now Publishers, 2009
Selecting and Combining Kernels
Selecting From Multiple Kernels
Typically, one has many different kernels to choose from:
different functional forms: linear, polynomial, RBF, ...
different parameters: polynomial degree, Gaussian bandwidth, ...
Different image features give rise to different kernels
Color histograms,
SIFT bag-of-words,
HOG,
Pyramid match,
Spatial pyramids, . . .
How to choose?
Ideally, based on the kernels' performance on the task at hand:
estimate by cross-validation or validation set error
Classically part of “Model Selection”.
Cross-Validation
Classical case: Split dataset D into N disjoint sets Dj
Random sub-sampling: Split dataset randomly into
train/validation set and repeat (with repetition)
Leave-One-Out: N = |D|
Train fi on ∪_{j≠i} Dj, test on Di
CV Error = (1/N) Σ_i err(fi, Di)
Stratified CV: split such that the class distribution is the same as in the entire set
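As an illustration, here is a minimal sketch of this procedure in Python (NumPy and scikit-learn assumed; the model factory and the parameter grid in the usage comment are placeholders, not part of the slides):

```python
import numpy as np
from sklearn.svm import SVC

def cv_error(X, y, make_model, n_folds=5, seed=0):
    """Estimate the error of make_model() by N-fold cross-validation."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)                # N disjoint sets D_j
    errors = []
    for i in range(n_folds):
        val = folds[i]                                   # D_i
        train = np.hstack([folds[j] for j in range(n_folds) if j != i])
        f_i = make_model().fit(X[train], y[train])       # train on union of D_j, j != i
        errors.append(np.mean(f_i.predict(X[val]) != y[val]))   # err(f_i, D_i)
    return np.mean(errors)

# usage: pick the (C, gamma) pair with the lowest CV error
# best = min(((C, g) for C in [1, 10, 100] for g in [0.01, 0.1, 1]),
#            key=lambda p: cv_error(X, y, lambda: SVC(C=p[0], gamma=p[1])))
```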
Kernel Parameter Selection
Remark: Model Selection makes a difference!
Action Classification, KTH dataset
Method                                          Accuracy (on test data)
Dollár et al., VS-PETS 2005: "SVM classifier"   80.66
Nowozin et al., ICCV 2007: "baseline RBF"       85.19
identical features, same kernel function
difference: Nowozin used cross-validation for model selection
(bandwidth and C )
Message: never rely on default parameters (such as C)!
Kernel Parameter Selection
Rule of thumb for kernel parameters
For generalized Gaussian kernels

  k(x, x′) = exp( − d²(x, x′) / (2γ) )

with any distance d, set

  γ ≈ median_{i,j=1,...,n} d(xi, xj).
Many variants:
mean instead of median
only d(xi, xj) with yi ≠ yj, ...
In general, if there are several classes, then the kernel matrix
Kij = k(xi , xj )
should have a block structure w.r.t. the classes.
[Figure: the "two moons" toy dataset and the corresponding kernel matrices Kij for the linear kernel, the label "kernel", and Gaussian kernels with γ = 0.001, 0.01, 0.1, 1, 10, 100, 1000. The class-wise block structure is visible only for well-chosen bandwidths; the rule of thumb gives γ = 0.6, 5-fold CV selects γ = 1.6.]
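A minimal NumPy sketch of this rule of thumb, using the Euclidean distance as one (arbitrary) choice of d and the kernel form given above:

```python
import numpy as np

def median_heuristic_gamma(X):
    """gamma ≈ median of the pairwise distances d(x_i, x_j), i < j."""
    diffs = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diffs ** 2).sum(-1))                   # pairwise Euclidean distances
    return np.median(d[np.triu_indices_from(d, k=1)])

def gaussian_kernel_matrix(X, gamma):
    """K_ij = exp(-d^2(x_i, x_j) / (2 * gamma)), as in the formula above."""
    diffs = X[:, None, :] - X[None, :, :]
    d2 = (diffs ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * gamma))
```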
Kernel Selection ↔ Kernel Combination
Is one of the kernels really the best?
Kernels are typically designed to capture one aspect of the data:
texture, color, edges, ...
Choosing one kernel means selecting exactly one such aspect.
Combining aspects is often better than selecting.
Method        Accuracy
Colour        60.9 ± 2.1
Shape         70.2 ± 1.3
Texture       63.7 ± 2.7
HOG           58.5 ± 4.5
HSV           61.3 ± 0.7
siftint       70.6 ± 1.6
siftbdy       59.4 ± 3.3
combination   85.2 ± 1.5

Mean accuracy on the Oxford Flowers dataset [Gehler, Nowozin: ICCV 2009]
Combining Two Kernels
For two kernels k1 , k2 :
product k = k1 · k2 is again a kernel
  Problem: very small kernel values suppress large ones
average k = ½ (k1 + k2) is again a kernel
  Problem: k1, k2 on different scales. Re-scale first?
convex combination kβ = (1 − β)k1 + βk2 with β ∈ [0, 1]
  Model selection: cross-validate over β ∈ {0, 0.1, ..., 1}.
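A hedged sketch of that cross-validation over β, assuming K1 and K2 are precomputed Gram matrices on the training set (scikit-learn's cross-validation slices a precomputed Gram matrix on both axes when kernel="precomputed" is used):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def best_beta(K1, K2, y, betas=np.linspace(0.0, 1.0, 11), C=1.0):
    """Cross-validate the convex combination k_beta = (1-beta)*k1 + beta*k2."""
    scores = []
    for b in betas:
        K = (1.0 - b) * K1 + b * K2          # still a valid kernel matrix
        clf = SVC(C=C, kernel="precomputed")
        scores.append(cross_val_score(clf, K, y, cv=5).mean())
    return betas[int(np.argmax(scores))], scores
```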
Combining Many Kernels
Multiple kernels: k1 ,. . . ,kK
all convex combinations are kernels:

  k = Σ_{j=1}^K βj kj   with βj ≥ 0 and Σ_{j=1}^K βj = 1.

Kernels can be "deactivated" by setting βj = 0.
Combinatorial explosion forbids cross-validation over all combinations of βj (even testing only two values per βj gives 2^K combinations).
Proxy: instead of CV, optimize the SVM objective (i.e. maximize the margin).
Each combined kernel induces a feature space. In which combined feature space can we best
  explain the training data, and
  achieve a large margin between the classes?
Feature Space View of Kernel Combination
Each kernel kj induces a Hilbert space Hj and a feature map ϕj : X → Hj.

The weighted kernel kj^βj := βj kj induces the same Hilbert space Hj, but a rescaled feature map ϕj^βj(x) := √βj ϕj(x):

  kj^βj(x, x′) ≡ ⟨ϕj^βj(x), ϕj^βj(x′)⟩_Hj = ⟨√βj ϕj(x), √βj ϕj(x′)⟩_Hj
              = βj ⟨ϕj(x), ϕj(x′)⟩_Hj = βj kj(x, x′).

The linear combination k̂ := Σ_{j=1}^K βj kj induces
the product space Ĥ := ⊕_{j=1}^K Hj, and
the product map ϕ̂(x) := (ϕ1^β1(x), ..., ϕK^βK(x))ᵀ:

  k̂(x, x′) ≡ ⟨ϕ̂(x), ϕ̂(x′)⟩_Ĥ = Σ_{j=1}^K ⟨ϕj^βj(x), ϕj^βj(x′)⟩_Hj = Σ_{j=1}^K βj kj(x, x′)
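A small numeric sanity check of this identity (a sketch; the two explicit toy feature maps below are arbitrary choices, not anything from the slides):

```python
import numpy as np

# toy explicit feature maps phi_1, phi_2 for 2-d inputs (arbitrary choices)
phi1 = lambda x: np.array([x[0], x[1]])
phi2 = lambda x: np.array([x[0] * x[1], x[0] ** 2, x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
beta = np.array([0.3, 0.7])

# combined kernel value, computed two ways
k_hat = beta[0] * phi1(x) @ phi1(xp) + beta[1] * phi2(x) @ phi2(xp)
phi_hat = lambda z: np.concatenate([np.sqrt(beta[0]) * phi1(z),
                                    np.sqrt(beta[1]) * phi2(z)])
assert np.isclose(k_hat, phi_hat(x) @ phi_hat(xp))   # identical by construction
```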
Feature Space View of Kernel Combination
Implicit representation of a dataset using two kernels:
Kernel k1 , feature representation ϕ1 (x1 ), . . . , ϕ1 (xn ) ∈ H1
Kernel k2 , feature representation ϕ2 (x1 ), . . . , ϕ2 (xn ) ∈ H2
Kernel Selection would most likely pick k2 .
For k = (1 − β)k1 + βk2: β = 0 uses only k1, β = 1 uses only k2 (figure series below).
[Figure series, one slide per β: the training set plotted in the combined feature space H1 × H2 under the correspondingly rescaled feature maps, together with the maximum-margin hyperplane. The achieved margins are:

β:      0.00    0.01    0.02    0.03    0.10    0.20    0.30    0.40    0.50    0.60
margin: 0.0000  0.1686  0.2363  0.2870  0.4928  0.6363  0.7073  0.7365  0.7566  0.7751

β:      0.65    0.70    0.80    0.90    0.95    0.97    0.98    0.99    1.00
margin: 0.7770  0.7699  0.7194  0.5839  0.4515  0.3809  0.3278  0.2460  0.1000

The largest margin is achieved for an intermediate combination (β ≈ 0.65), not for either kernel alone.]
Multiple Kernel Learning
Determine the coefficients βj that realize the largest margin.
First, how does the margin depend on βj ?
Remember the standard SVM (here without slack variables):

  min_{w∈H}  ‖w‖²_H
  subject to  yi ⟨w, ϕ(xi)⟩_H ≥ 1   for i = 1, ..., n.

H and ϕ were induced by the kernel k.
New samples are classified by f(x) = ⟨w, ϕ(x)⟩_H.
Multiple Kernel Learning
Insert

  k(x, x′) = Σ_{j=1}^K βj kj(x, x′)                                   (1)

with
  Hilbert space H = ⊕_j Hj,
  feature map ϕ(x) = (√β1 ϕ1(x), ..., √βK ϕK(x))ᵀ,
  weight vector w = (w1, ..., wK)ᵀ,

such that

  ‖w‖²_H = Σ_j ‖wj‖²_Hj                                               (2)

  ⟨w, ϕ(xi)⟩_H = Σ_j √βj ⟨wj, ϕj(xi)⟩_Hj                               (3)
Multiple Kernel Learning
For fixed βj, the largest-margin hyperplane is given by

  min_{wj∈Hj}  Σ_j ‖wj‖²_Hj
  subject to   yi Σ_j √βj ⟨wj, ϕj(xi)⟩_Hj ≥ 1   for i = 1, ..., n.

Renaming vj = √βj wj (and defining 0/0 = 0):

  min_{vj∈Hj}  Σ_j (1/βj) ‖vj‖²_Hj
  subject to   yi Σ_j ⟨vj, ϕj(xi)⟩_Hj ≥ 1   for i = 1, ..., n.
Multiple Kernel Learning
Therefore, the best hyperplane for variable βj is given by:

  min over vj ∈ Hj, βj ≥ 0 with Σ_j βj = 1:

    Σ_j (1/βj) ‖vj‖²_Hj                                               (4)

  subject to

    yi Σ_j ⟨vj, ϕj(xi)⟩_Hj ≥ 1   for i = 1, ..., n.                    (5)
This optimization problem is jointly-convex in vj and βj .
There is a unique global minimum, and we can find it efficiently!
Multiple Kernel Learning
Same for soft-margin with slack-variables:
min over vj ∈ Hj, βj ≥ 0 with Σ_j βj = 1, and ξi ∈ R+:

  Σ_j (1/βj) ‖vj‖²_Hj + C Σ_i ξi                                      (6)

subject to

  yi Σ_j ⟨vj, ϕj(xi)⟩_Hj ≥ 1 − ξi   for i = 1, ..., n.                 (7)
This optimization problem is jointly-convex in vj and βj .
There is a unique global minimum, and we can find it efficiently!
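The slides do not commit to a particular solver here (SILP and SimpleMKL are referenced later). As an illustration only, here is a minimal alternating heuristic that is sometimes used for this kind of objective: train an SVM on the combined kernel for fixed β, then update β in closed form from the per-kernel weight norms. A sketch, not the algorithms named later:

```python
import numpy as np
from sklearn.svm import SVC

def mkl_alternating(kernels, y, C=1.0, n_iter=30):
    """Heuristic MKL: alternate SVM training and a closed-form beta update."""
    K_list = [np.asarray(K, dtype=float) for K in kernels]
    beta = np.full(len(K_list), 1.0 / len(K_list))       # start from uniform weights
    for _ in range(n_iter):
        K = sum(b * Kj for b, Kj in zip(beta, K_list))   # combined kernel, fixed beta
        svm = SVC(C=C, kernel="precomputed").fit(K, y)
        a = svm.dual_coef_.ravel()                       # alpha_i * y_i on support vectors
        sv = svm.support_
        # ||w_j|| = beta_j * sqrt(a^T K_j[sv, sv] a) for each sub-kernel j
        norms = np.array([b * np.sqrt(max(a @ Kj[np.ix_(sv, sv)] @ a, 0.0))
                          for b, Kj in zip(beta, K_list)])
        if norms.sum() < 1e-12:
            break
        beta = norms / norms.sum()                       # optimal beta for fixed v_j
    return beta

# usage (hypothetical precomputed Gram matrices K1, K2, K3 and labels y):
# beta = mkl_alternating([K1, K2, K3], y)
```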
Flower Classification: Dataset
17 types of flowers - 80 images per class
7 different precomputed kernels
Data from Nilsback & Zisserman, CVPR 2006
Combining Good Kernels
Observation: if all kernels are reasonable, simple combination
methods work as well as difficult ones (and are much faster):
Single features
Method    Accuracy     Time
Colour    60.9 ± 2.1   3 s
Shape     70.2 ± 1.3   4 s
Texture   63.7 ± 2.7   3 s
HOG       58.5 ± 4.5   4 s
HSV       61.3 ± 0.7   3 s
siftint   70.6 ± 1.6   4 s
siftbdy   59.4 ± 3.3   5 s

Combination methods
Method         Accuracy     Time
product        85.5 ± 1.2   2 s
averaging      84.9 ± 1.9   10 s
CG-Boost       84.8 ± 2.2   1225 s
MKL (SILP)     85.2 ± 1.5   97 s
MKL (Simple)   85.2 ± 1.5   152 s
LP-β           85.5 ± 3.0   80 s
LP-B           85.4 ± 2.4   98 s
Mean accuracy and total runtime (model selection, training, testing) on Oxford Flowers dataset
[Gehler, Nowozin: ICCV2009]
Message: Never use MKL without comparing to simpler baselines!
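A sketch of the two simplest baselines on precomputed Gram matrices (the per-kernel normalization is my own assumption, added because the slides note that kernels on different scales should be re-scaled first):

```python
import numpy as np

def normalize_kernel(K):
    """Rescale so that all diagonal entries are 1: K_ij / sqrt(K_ii * K_jj)."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def average_kernel(kernels):
    Ks = [normalize_kernel(np.asarray(K)) for K in kernels]
    return sum(Ks) / len(Ks)

def product_kernel(kernels):
    Ks = [normalize_kernel(np.asarray(K)) for K in kernels]
    out = np.ones_like(Ks[0])
    for K in Ks:
        out *= K                  # elementwise product is again a valid kernel
    return out
```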
Combining Good and Bad kernels
Observation: if some kernels are helpful, but others are not, smart
techniques are better.
[Figure: performance with added noise features. Mean accuracy (roughly 45–90%) vs. number of added noise features (up to 50) for product, average, CG-Boost, MKL (SILP or Simple), LP-β and LP-B.]
Mean accuracy and total runtime (model selection, training, testing) on Oxford Flowers dataset
[Gehler, Nowozin: ICCV2009]
MKL Toy Example 1
Support-vector regression to learn samples of f (t) = sin(ωt)
  kj(x, x′) = exp( −‖x − x′‖² / (2σj²) )   with 2σj² ∈ {0.005, 0.05, 0.5, 1, 10}.
Multiple-Kernel Learning correctly identifies the right bandwidth.
Software for Multiple Kernel Learning
Existing toolboxes allow Multiple-Kernel SVM training:
Shogun (C++ with bindings to Matlab, Python, etc.)
  http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun
SimpleMKL (Matlab)
  http://asi.insa-rouen.fr/enseignants/~arakotom/code/mklindex.html
SKMsmo (Matlab)
  http://www.di.ens.fr/~fbach/
  (older and slower than the others)
Typically, one only has to specify the set of kernels to select from and the regularization parameter C.
Summary
Kernel Selection and Combination
Model selection is important to achieve the highest accuracy
Combining several kernels is often superior to selecting one
  Simple techniques often work as well as difficult ones.
  Always try the single best kernel, averaging, and the product first.
Learning Structured Outputs
From Arbitrary Inputs to Arbitrary Outputs
With kernels, we can handle “arbitrary” input spaces:
we only need a pairwise similarity measure for objects:
  images, e.g. ...
  gene sequences, e.g. string kernels
  graphs, e.g. random walk kernels
We can learn mappings f : X → {−1, +1} or f : X → R.

What about arbitrary output spaces?
We know: kernels correspond to feature maps ϕ : X → H.
But: we cannot invert ϕ; there is no ϕ⁻¹ : H → X.
Kernels do not readily help to construct f : X → Y with Y ≠ R.
“True” Multiclass SVM
Multiclass Classification
When are we interested in f : X → Y? E.g. multi-class classification:
f : X → {ω1 , . . . , ωK }
Common solution: one-vs-rest training
  For each class, train a separate fc : X → R.
    Positive examples: {xi : yi = ωc}
    Negative examples: {xi : yi ≠ ωc} (i.e. the rest)
  Final decision: f(x) = argmax_{c∈C} fc(x)
Problem: the fc know nothing of each other.
E.g. the scales of their outputs could be vastly different:
  f1 : X → [−2, 2],  f2 : X → [−100, 100]
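As a reference point, a minimal one-vs-rest sketch with the argmax decision (scikit-learn assumed; nothing here calibrates the fc against each other, which is exactly the problem just described):

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, classes, C=1.0):
    """One real-valued scorer f_c per class, trained independently."""
    return {c: SVC(C=C, kernel="rbf").fit(X, (y == c).astype(int)) for c in classes}

def predict_one_vs_rest(models, X):
    # decision_function gives the signed distance to each hyperplane;
    # the scales of the different f_c are not calibrated against each other.
    scores = np.column_stack([models[c].decision_function(X) for c in models])
    classes = list(models)
    return np.array([classes[i] for i in scores.argmax(axis=1)])
```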
Task:
Learn a real multi-class SVM that knows about the argmax decision
Multiclass Classification
Express the K learning problems as one:
Assume kernel k : X × X → R with feature map ϕ : X → H.
Define class-dependent feature maps:
  xi ↦ (ϕ(xi), 0, 0, ..., 0) =: ϕ1(xi)   if yi = c1,
  xi ↦ (0, ϕ(xi), 0, ..., 0) =: ϕ2(xi)   if yi = c2,
  ...
  xi ↦ (0, 0, ..., 0, ϕ(xi)) =: ϕK(xi)   if yi = cK.

Combined weight vector w = (w1, ..., wK).
Combined Hilbert space: Hmc := ⊕_{j=1}^K H
Equivalent: per-class weight vector or per-class feature map:
  ⟨ϕ(xi), wj⟩_H = ⟨ϕj(xi), w⟩_Hmc
Multiclass Classification
We add up all one-vs-rest SVM problems:

  min_{wj∈H}  Σ_j ‖wj‖²_H
  subject to
    ⟨wj, ϕ(xi)⟩_H ≥ 1     for i = 1, ..., n with yi = cj,
    −⟨wj, ϕ(xi)⟩_H ≥ 1    for i = 1, ..., n with yi ≠ cj,
  for all j = 1, ..., K.

Uncoupled constraints: same solutions wj as for separate SVMs.
Same decision as before: classify new samples using
  f(x) = argmax_{j=1,...,K} ⟨wj, ϕ(x)⟩_H
Multiclass Classification
Rewrite ⟨ϕ(xi), wj⟩_H = ⟨ϕj(xi), w⟩_Hmc and Σ_j ‖wj‖²_H = ‖w‖²_Hmc:

  min_{w∈Hmc}  ‖w‖²_Hmc
  subject to
    ⟨w, ϕj(xi)⟩_Hmc ≥ 1     for i = 1, ..., n with yi = cj,
    −⟨w, ϕj(xi)⟩_Hmc ≥ 1    for i = 1, ..., n with yi ≠ cj.

Solution w = (w1, ..., wK) and classification rule
  f(x) = argmax_{j=1,...,K} ⟨w, ϕj(x)⟩_Hmc
are the same as before.
Multiclass Classification
Now, introduce coupling constraints for better decisions:

  min_{w∈Hmc}  ‖w‖²_Hmc
  subject to
    ⟨w, ϕj(xi)⟩_Hmc − ⟨w, ϕk(xi)⟩_Hmc ≥ 1   with yi = cj, for all k ≠ j,
  for i = 1, ..., n.

Before: the correct class had a margin of 1 compared to 0.
Now: the correct class has a margin of 1 compared to all other classes.
The classification rule stays the same:
  f(x) = argmax_{j=1,...,K} ⟨w, ϕj(x)⟩_Hmc
This is called the Crammer–Singer Multiclass SVM.
Joint Feature Map
We have defined one feature map ϕj per output class cj ∈ Y:
ϕj : X → Hmc
Instead, we can say we have defined one joint feature map Φ,
that depends on the sample x and on the class label y:
Φ : X × Y → Hmc
Φ( x, y ) := ϕj (x) for y = cj
Joint Feature Map Multiclass Classification
Multiclass SVM with joint feature map Φ : X × Y → Hmc :
  min_{w∈Hmc}  ‖w‖²_Hmc
  subject to
    ⟨w, Φ(xi, yi)⟩_Hmc − ⟨w, Φ(xi, y)⟩_Hmc ≥ 1,   for all y ≠ yi,
  for i = 1, ..., n.

Classify new samples using:
  f(x) = argmax_{y∈Y} ⟨w, Φ(x, y)⟩_Hmc

Φ(x, y) occurs only inside of scalar products: kernelize!
Joint Kernel Multiclass Classification
Joint Kernel Function:
kjoint : (X × Y) × (X × Y) → R
Similarity between two (sample,label) pairs.
Example: multiclass kernel of Φ:
  kmc((x, y), (x′, y′)) = k(x, x′) · δ_{y = y′}
Check: kmc((x, y), (x′, y′)) = ⟨Φ(x, y), Φ(x′, y′)⟩_Hmc

Y can have more structure than just being a set {1, ..., K}.
Example: multiclass kernel with class similarities
  kjoint((x, y), (x′, y′)) = k(x, x′) · kclass(y, y′)
where kclass(y, y′) measures similarity, e.g. in a label hierarchy.
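A small sketch of this joint kernel on precomputed inputs (the label-similarity matrix K_class is a hypothetical ingredient; the identity matrix recovers the δ_{y=y′} multiclass kernel):

```python
import numpy as np

def joint_kernel(K_x, y_a, y_b, K_class):
    """k_joint((x,y),(x',y')) = k(x,x') * k_class(y,y') for all pairs.

    K_x:      (n, m) kernel values between two sample sets
    y_a, y_b: integer class labels of the two sets
    K_class:  (K, K) label-similarity matrix (identity -> delta kernel)
    """
    return K_x * K_class[np.ix_(y_a, y_b)]

# example: 3 classes, delta label kernel
# K_joint = joint_kernel(K_x, y_train, y_train, np.eye(3))
```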
What would we like to predict?
Natural Language Processing:
  Automatic Translation (output: sentences)
  Sentence Parsing (output: parse trees)
Bioinformatics:
  Secondary Structure Prediction (output: bipartite graphs)
  Enzyme Function Prediction (output: path in a tree)
Robotics:
  Planning (output: sequence of actions)
Computer Vision:
  Image Segmentation (output: segmentation mask)
  Human Pose Estimation (output: positions of body parts)
  Image Retrieval (output: ranking of images in database)
Computer Vision Example: Semantic Image Segmentation
[Figure: input images ↦ output segmentation masks]

input space X = {images} ≅ [0, 255]^(3·M·N)
output space Y = {segmentation masks} ≅ {0, 1}^(M·N)
(structured output) prediction function: f : X → Y
  f(x) := argmin_{y∈Y} E(x, y)
energy function E(x, y) = Σ_i wi⊤ ϕu(xi, yi) + Σ_{i,j} wij⊤ ϕp(yi, yj)

Images: [M. Everingham et al.: "The PASCAL Visual Object Classes (VOC) challenge", IJCV 2010]
Computer Vision Example: Human Pose Estimation
[Figure: input image + body model ↦ output model fit]

input space X = {images}
output space Y = {positions/angles of K body parts} ≅ R^(4K)
prediction function: f : X → Y
  f(x) := argmin_{y∈Y} E(x, y)
energy E(x, y) = Σ_i wi⊤ ϕfit(xi, yi) + Σ_{i,j} wij⊤ ϕpose(yi, yj)
Images: [Ferrari, Marin-Jimenez, Zisserman: "Progressive Search Space Reduction for Human Pose Estimation", CVPR 2008.]
Computer Vision Example: Point Matching
input: image pairs
output: mapping y : xi ↔ y(xi)
prediction function: f : X → Y
  f(x) := argmax_{y∈Y} F(x, y)
scoring function F(x, y) = Σ_i wi⊤ ϕsim(xi, y(xi)) + Σ_{i,j} wij⊤ ϕdist(xi, xj, y(xi), y(xj)) + Σ_{i,j,k} wijk⊤ ϕangle(xi, xj, xk, y(xi), y(xj), y(xk))
[J. McAuley et al.: "Robust Near-Isometric Matching via Structured Learning of Graphical Models", NIPS, 2008]
Computer Vision Example: Object Localization
input: image ↦ output: object position (left, top, right, bottom)

input space X = {images}
output space Y = R⁴ (bounding box coordinates)
prediction function: f : X → Y
  f(x) := argmax_{y∈Y} F(x, y)
scoring function F(x, y) = w⊤ ϕ(x, y), where ϕ(x, y) = h(x|_y) is a feature vector for the image region y, e.g. bag-of-visual-words.
[M. Blaschko, C. Lampert: "Learning to Localize Objects with Structured Output Regression", ECCV, 2008]
Computer Vision Examples: Summary
Image Segmentation
  y = argmin_{y∈{0,1}^N} E(x, y),   E(x, y) = Σ_i wi⊤ ϕ(xi, yi) + Σ_{i,j} wij⊤ ϕ(yi, yj)

Pose Estimation
  y = argmin_{y∈R^(4K)} E(x, y),   E(x, y) = Σ_i wi⊤ ϕ(xi, yi) + Σ_{i,j} wij⊤ ϕ(yi, yj)

Point Matching
  y = argmax_{y∈Πn} F(x, y),   F(x, y) = Σ_i wi⊤ ϕ(xi, yi) + Σ_{i,j} wij⊤ ϕ(yi, yj) + Σ_{i,j,k} wijk⊤ ϕ(yi, yj, yk)

Object Localization
  y = argmax_{y∈R⁴} F(x, y),   F(x, y) = w⊤ ϕ(x, y)
Grand Unified View
Predict structured outputs by maximization
  y = argmax_{y∈Y} F(x, y)
of a compatibility function
  F(x, y) = ⟨w, ϕ(x, y)⟩
that is linear in a parameter vector w.

A generic structured prediction problem
X: arbitrary input domain
Y: structured output domain, decompose y = (y1, ..., yk)
Prediction function f : X → Y by
  f(x) = argmax_{y∈Y} F(x, y)
Compatibility function (or negative of an "energy")
  F(x, y) = ⟨w, ϕ(x, y)⟩
          = Σ_{i=1}^k wi⊤ ϕi(yi, x)             (unary terms)
          + Σ_{i,j=1}^k wij⊤ ϕij(yi, yj, x)      (binary terms)
          + ...                                  (higher-order terms, sometimes)
Example: Sequence Prediction – Handwriting Recognition
X = {5-letter word images}, x = (x1, ..., x5), xj ∈ {0, 1}^(300×80)
Y = {ASCII translations}, y = (y1, ..., y5), yj ∈ {A, ..., Z}.

Feature function with only unary terms:
  ϕ(x, y) = (ϕ1(x, y1), ..., ϕ5(x, y5))
  F(x, y) = ⟨w1, ϕ1(x, y1)⟩ + ... + ⟨w5, ϕ5(x, y5)⟩

[Figure: input word image; predicted output: Q V E S T]

Advantage: computing y* = argmax_y F(x, y) is easy.
We can find each yi* independently; check 5 · 26 = 130 values.
Problem: only local information, we can't correct errors.
Example: Sequence Prediction – Handwriting Recognition
X = {5-letter word images}, x = (x1, ..., x5), xj ∈ {0, 1}^(300×80)
Y = {ASCII translations}, y = (y1, ..., y5), yj ∈ {A, ..., Z}.

One global feature function:
  ϕ(x, y) = (0, ..., 0, Φ(x), 0, ..., 0)  (Φ(x) in the y-th position)   if y ∈ D (dictionary),
  ϕ(x, y) = (0, ..., 0, 0, 0, ..., 0)                                   otherwise.

[Figure: input word image; predicted output: QUEST]

Advantage: access to global information, e.g. from a dictionary D.
Problem: argmax_y ⟨w, ϕ(x, y)⟩ has to check 26⁵ = 11,881,376 values.
We need separate training data for each word.
Example: Sequence Prediction – Handwriting Recognition
X = {5-letter word images}, x = (x1, ..., x5), xj ∈ {0, 1}^(300×80)
Y = {ASCII translations}, y = (y1, ..., y5), yj ∈ {A, ..., Z}.

Feature function with unary and pairwise terms:
  ϕ(x, y) = (ϕ1(y1, x), ϕ2(y2, x), ..., ϕ5(y5, x), ϕ1,2(y1, y2), ..., ϕ4,5(y4, y5))

[Figure: input word image; predicted output: Q U E S T]

Compromise: computing y* is still efficient (Viterbi best path).
Compromise: neighbor information allows correction of local errors.
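A minimal Viterbi sketch for exactly this chain case, assuming the per-position and pairwise score arrays are already computed (e.g. unary[t, a] = ⟨w_t, ϕ_t(x, a)⟩):

```python
import numpy as np

def viterbi(unary, pairwise):
    """Maximize sum_t unary[t, y_t] + sum_t pairwise[y_t, y_{t+1}] over label sequences.

    unary:    (T, L) per-position letter scores
    pairwise: (L, L) scores for neighbouring letter pairs
    Cost is O(T * L^2) instead of the L^T of exhaustive search.
    """
    T, L = unary.shape
    score = unary[0].copy()                # best score of any prefix ending in each letter
    back = np.zeros((T, L), dtype=int)     # backpointers for reconstruction
    for t in range(1, T):
        cand = score[:, None] + pairwise   # cand[p, n]: prefix ending in p, then letter n
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + unary[t]
    y = [int(score.argmax())]
    for t in range(T - 1, 0, -1):          # follow backpointers from the end
        y.append(int(back[t, y[-1]]))
    return y[::-1]
```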
During the last lectures we learned how to evaluate argmax_y F(x, y) (for some of these models):
  chain, tree, other loop-free graphs: Viterbi decoding / dynamic programming
  grid, arbitrary graphs (loopy graphs): approximate inference (e.g. loopy BP)
Today: how to learn a good function F (x, y) from training data.
Parameter Learning in Structured Models
Given: parametric model (family): F (x, y) = hw, ϕ(x, y)i
Given: prediction method: f (x) = argmaxy∈Y F (x, y)
Not given: parameter vector w (high-dimensional)
Supervised Training:
Given: example pairs {(x1, y1), ..., (xn, yn)} ⊂ X × Y:
typical inputs with "the right" outputs for them.

[Figure: example input–output pairs]

Task: determine a "good" w
Structured Output SVM
Two criteria for the decision function f:
Correctness: ensure f(xi) = yi for i = 1, ..., n.
Robustness: f should also work if the xi are perturbed.

Translated to structured prediction f(x) = argmax_{y∈Y} ⟨w, ϕ(x, y)⟩:
Ensure, for i = 1, ..., n,
  argmax_{y∈Y} ⟨w, ϕ(xi, y)⟩ = yi
  ⇔ ⟨w, ϕ(xi, yi)⟩ > ⟨w, ϕ(xi, y)⟩   for all y ∈ Y \ {yi}.
Minimize ‖w‖².
Structured Output SVM
Optimization Problem:
  min_{w∈R^d, ξ∈R^n_+}   ½ ‖w‖² + (C/n) Σ_{i=1}^n ξi

subject to, for i = 1, ..., n,

  ⟨w, ϕ(xi, yi)⟩ ≥ ∆(yi, y) + ⟨w, ϕ(xi, y)⟩ − ξi,   for all y ∈ Y.

∆(yi, y) ≥ 0: loss function ("predict y, correct would be yi")

The optimization problem is very similar to a normal (soft-margin) SVM:
  quadratic in w, linear in ξ
  constraints linear in w and ξ
  convex
But there are n(|Y| − 1) constraints!
  numeric optimization needs some tricks
  computationally expensive
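One common way around the huge constraint set is to only ever look at the most violated constraint per example (loss-augmented inference). Below is a heavily simplified stochastic-subgradient sketch of that idea; the feature function phi, the loss delta, and the loss-augmented argmax are placeholders the user must supply, and this is an illustration of the objective above, not the exact training procedure from the lecture:

```python
import numpy as np

def train_structured_svm(data, phi, delta, loss_aug_argmax, dim,
                         C=1.0, epochs=20, lr=0.01):
    """data: list of (x, y) pairs; phi(x, y) -> length-dim vector;
    delta(y_true, y) -> loss; loss_aug_argmax(w, x, y_true) -> argmax over Y of
    delta(y_true, y) + <w, phi(x, y)>, i.e. the most violated output."""
    w = np.zeros(dim)
    n = len(data)
    for _ in range(epochs):
        for x, y in data:
            y_hat = loss_aug_argmax(w, x, y)            # loss-augmented inference
            grad = w / n                                # from the 1/2 ||w||^2 term
            violation = delta(y, y_hat) + w @ phi(x, y_hat) - w @ phi(x, y)
            if violation > 0:                           # margin constraint violated
                grad = grad + C * (phi(x, y_hat) - phi(x, y))
            w -= lr * grad                              # subgradient step
    return w
```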
Example: A "True" Multiclass SVM

Y = {1, 2, ..., K},   ∆(y, y′) = 1 for y ≠ y′, 0 otherwise.

ϕ(x, y) = (⟦y = 1⟧ Φ(x), ⟦y = 2⟧ Φ(x), ..., ⟦y = K⟧ Φ(x)) = Φ(x) ey⊤,   with ey the y-th unit vector.

Solve:

  min_{w, ξ}   ½ ‖w‖² + (C/n) Σ_{i=1}^n ξi

subject to, for i = 1, ..., n,

  ⟨w, ϕ(xi, yi)⟩ ≥ 1 + ⟨w, ϕ(xi, y)⟩ − ξi   for all y ∈ Y \ {yi}.

Classification (MAP): f(x) = argmax_{y∈Y} ⟨w, ϕ(x, y)⟩

This is the Crammer–Singer Multiclass SVM.
Example: Hierarchical Multiclass Classification
Loss function can reflect the hierarchy:

  ∆(y, y′) := ½ · (distance in the tree)
  ∆(cat, cat) = 0,  ∆(cat, dog) = 1,  ∆(cat, bus) = 2,  etc.

[Figure: label tree over the classes cat, dog, car, bus]

Solve:

  min_{w, ξ}   ½ ‖w‖² + (C/n) Σ_{i=1}^n ξi

subject to, for i = 1, ..., n,

  ⟨w, ϕ(xi, yi)⟩ ≥ ∆(yi, y) + ⟨w, ϕ(xi, y)⟩ − ξi   for all y ∈ Y \ {yi}.
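A tiny sketch of such a tree-distance loss; the two-level hierarchy below (cat/dog under "animal", car/bus under "vehicle") is an assumed example that merely reproduces the values on the slide:

```python
# parent map for a hypothetical label hierarchy:
# root -> {animal, vehicle}, animal -> {cat, dog}, vehicle -> {car, bus}
PARENT = {"cat": "animal", "dog": "animal", "car": "vehicle",
          "bus": "vehicle", "animal": "root", "vehicle": "root"}

def path_to_root(y):
    path = [y]
    while path[-1] != "root":
        path.append(PARENT[path[-1]])
    return path

def tree_loss(y, y_prime):
    """Delta(y, y') = 1/2 * (number of tree edges between the two leaves)."""
    a, b = path_to_root(y), path_to_root(y_prime)
    common = len(set(a) & set(b))            # shared ancestors, including the root
    return 0.5 * ((len(a) - common) + (len(b) - common))

# tree_loss("cat", "cat") == 0, tree_loss("cat", "dog") == 1, tree_loss("cat", "bus") == 2
```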
Example: Object Localization
input: image ↦ output: object position (left, top, right, bottom)

ϕ(x, y) = Φ(x|_y): feature vector of the image inside the box region y
∆(y, y′) := 1 − area(y ∩ y′) / area(y ∪ y′)   (box overlap)
F(x, y) = ⟨w, ϕ(x, y)⟩: quality score for region y in image x

  ⟨w, ϕ(xi, yi)⟩ ≥ ∆(yi, y) + ⟨w, ϕ(xi, y)⟩ − ξi

Interpretation:
  the correct location must have the largest score of all regions
  highly overlapping regions can have similar scores
  non-overlapping ones must have clearly lower scores
[M. Blaschko, C. Lampert: "Learning to Localize Objects with Structured Output Regression", ECCV, 2008]
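A short sketch of this overlap-based loss for axis-aligned boxes (left, top, right, bottom), using the "1 − overlap" reading above:

```python
def box_overlap_loss(y, y_prime):
    """Delta(y, y') = 1 - area(y ∩ y') / area(y ∪ y') for boxes (l, t, r, b)."""
    l = max(y[0], y_prime[0]); t = max(y[1], y_prime[1])
    r = min(y[2], y_prime[2]); b = min(y[3], y_prime[3])
    inter = max(0.0, r - l) * max(0.0, b - t)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(y) + area(y_prime) - inter
    return 1.0 - inter / union if union > 0 else 1.0
```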
Example: Object Localization Results
Experiments on PASCAL VOC 2006 dataset:
Compare S-SVM with conventional training for sliding windows
Identical setup: same features, same image-kernel, etc.
Precision–recall curves for VOC 2006 bicycle, bus and cat.
Structured prediction training improved precision and recall.
Summary
Structured-Output SVM
Task: predict f : X → Y instead of f : X → R.
Key idea:
  learn F : X × Y → R with F(x, y) = ⟨w, ϕ(x, y)⟩
  predict via f(x) := argmax_{y∈Y} F(x, y).
Convex optimization problem, similar to an SVM,
  but with very many constraints: computationally expensive.
Field of active research, many open questions:
  how to speed up training?
  how to handle complicated ϕ ("higher-order terms")?
  how to combine S-SVM with approximate inference?
  ...