Texture-Layout Filters

Jamie Shotton
Machine Intelligence Laboratory, University of Cambridge
John Winn, Carsten Rother, Antonio Criminisi
Microsoft Research Cambridge, UK
Presenter: Kuang-Jui Hsu
Date: 2011/07/04 (Mon.)
• Introduction
• A Conditional Random Field Model of Object Classes
• Boosted Learning of Texture, Layout, and Context
• Results and Comparisons
• Achieving automatic detection, recognition, and segmentation of object classes in photographs
• The concern is not only the accuracy of segmentation and recognition, but also the efficiency of the algorithm
• At a local level, the appearance of an image patch leads to ambiguities in its class label
• To overcome this, it is necessary to incorporate longer-range information
• To achieve this, the authors construct a discriminative model for labeling images which exploits all three types of information: texture appearance, layout, and context
• Overcomes problems associated with object recognition techniques that rely on sparse features
• The authors' technique, based on dense features, is capable of coping with both textured and untextured objects, and with multiple objects that inter- or self-occlude
• Three contributions:
  • A novel type of feature called the texture-layout filter
  • A new discriminative model that combines texture-layout filters with lower-level image features
  • A demonstration of how to train this model efficiently on a very large dataset by exploiting both boosting and piecewise training methods
• Use a conditional random field (CRF) model to learn the conditional distribution over the class labeling given an image
• The model incorporates texture, layout, color, location, and edge cues
• Definition:

$$\log P(\mathbf{c} \mid \mathbf{x}, \boldsymbol{\theta}) = \sum_i \Big[ \underbrace{\psi_i(c_i, \mathbf{x}; \boldsymbol{\theta}_\psi)}_{\text{texture-layout}} + \underbrace{\pi(c_i, x_i; \boldsymbol{\theta}_\pi)}_{\text{color}} + \underbrace{\lambda(c_i, i; \boldsymbol{\theta}_\lambda)}_{\text{location}} \Big] + \sum_{(i,j) \in \varepsilon} \underbrace{\phi(c_i, c_j, \mathbf{g}_{ij}(\mathbf{x}); \boldsymbol{\theta}_\phi)}_{\text{edge}} - \underbrace{\log Z(\boldsymbol{\theta}, \mathbf{x})}_{\text{partition function}}$$

$\mathbf{c}$: class labels
$\mathbf{x}$: an image
$i$: node $i$ in the graph (one node per pixel)
$\boldsymbol{\theta} = \{\boldsymbol{\theta}_\psi, \boldsymbol{\theta}_\pi, \boldsymbol{\theta}_\lambda, \boldsymbol{\theta}_\phi\}$: the parameters to learn
$\varepsilon$: the set of edges in a 4-connected grid structure
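To make the structure of this model concrete, here is a minimal sketch (not the authors' code) that scores a candidate labeling by summing precomputed unary potential maps and contrast-sensitive pairwise penalties over a 4-connected grid; the array names, shapes, and the collapsed scalar edge weight are assumptions for illustration.

```python
import numpy as np

def crf_score(labels, unary, pairwise_weight, g_horiz, g_vert):
    """Unnormalized log P(c | x): unary potentials minus contrast-sensitive
    penalties on 4-connected label discontinuities.

    labels:  (H, W) int array, the candidate labeling c
    unary:   (H, W, C) array, psi + pi + lambda evaluated per pixel and class
    g_horiz: (H, W-1) precomputed edge features for horizontal neighbor pairs
    g_vert:  (H-1, W) precomputed edge features for vertical neighbor pairs
    """
    H, W = labels.shape
    score = unary[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()
    # phi(c_i, c_j, g_ij) = -theta_phi^T g_ij [c_i != c_j]; here theta_phi^T g_ij
    # is collapsed into pairwise_weight * g for simplicity
    score -= (pairwise_weight * g_horiz * (labels[:, :-1] != labels[:, 1:])).sum()
    score -= (pairwise_weight * g_vert * (labels[:-1, :] != labels[1:, :])).sum()
    return score  # equals log P(c | x, theta) up to the constant -log Z
```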
• Definition: $\psi_i(c_i, \mathbf{x}; \boldsymbol{\theta}_\psi) = \log P(c_i \mid \mathbf{x}, i)$
  $P(c_i \mid \mathbf{x}, i)$: the normalized distribution given by a boosted classifier
• This classifier models the texture, layout, and textural context of the object classes by combining novel discriminative features called texture-layout filters
• Represented as Gaussian Mixture Models (GMMs) in CIELab color space, where the mixture coefficients depend on the class label
• Conditional probability of the color $x$ of a pixel:

$$P(x \mid c) = \sum_k P(x \mid k)\, P(k \mid c)$$

with color clusters (mixture components)

$$P(x \mid k) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

$k$: color cluster
$\mu_k, \Sigma_k$: the mean and covariance, respectively, of color cluster $k$
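As a quick illustration, here is a sketch of evaluating this class-conditional color likelihood with numpy; the cluster parameters and mixture coefficients are assumed to be already fitted (the deck later fits the clusters with K-means at test time).

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Density of a 3-D Gaussian N(x | mu, cov) at CIELab color x."""
    d = x - mu
    norm = np.sqrt((2 * np.pi) ** 3 * np.linalg.det(cov))
    return np.exp(-0.5 * d @ np.linalg.inv(cov) @ d) / norm

def color_likelihood(x, mus, covs, mix):
    """P(x | c) = sum_k N(x | mu_k, Sigma_k) P(k | c) for every class c.

    mus: (K, 3) cluster means; covs: (K, 3, 3) covariances;
    mix: (C, K) rows of class-conditional mixture weights P(k | c).
    """
    p_x_given_k = np.array([gaussian_pdf(x, m, S) for m, S in zip(mus, covs)])
    return mix @ p_x_given_k   # shape (C,): one likelihood per class
```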
𝑃 π‘₯|𝑐 =
𝑃 π‘₯ π‘˜ 𝑃(π‘˜|𝑐)
π‘˜
The color models use Gaussian Mixture Models where
the mixture coefficients 𝑃(π‘˜|𝑐) are conditioned on the
class label c
10
• But we want to predict the class label $c$ given the image color of a pixel $x$ and the color cluster $k$
• Use a simple inference method via Bayes' rule: for pixel $x_i$,

$$P(k \mid x_i) = \frac{P(k, x_i)}{P(x_i)} = \frac{P(x_i \mid k)\, P(k)}{P(x_i)}$$

so $P(k \mid x_i) \propto P(x_i \mid k)$ (treating the cluster prior $P(k)$ as uniform)
• Definition:

$$\pi(c_i, x_i; \boldsymbol{\theta}_\pi) = \log \sum_k \theta_\pi(c_i, k)\, P(k \mid x_i)$$
• Definition:

$$\pi(c_i, x_i; \boldsymbol{\theta}_\pi) = \log \sum_k \theta_\pi(c_i, k)\, P(k \mid x_i)$$

• The learned parameters $\theta_\pi(c_i, k)$ represent the distribution $P(c_i \mid k)$
• For discriminative inference, the arrows in the graphical model are reversed using Bayes' rule
• Definition:

$$\lambda(c_i, i; \boldsymbol{\theta}_\lambda) = \log \theta_\lambda(c_i, \hat{\imath})$$

$\hat{\imath}$: the normalized version of the pixel index $i$, where the normalization allows for images of different sizes
• The parameters $\boldsymbol{\theta}_\lambda$ are also learned
• Definition:

$$\phi(c_i, c_j, \mathbf{g}_{ij}(\mathbf{x}); \boldsymbol{\theta}_\phi) = -\boldsymbol{\theta}_\phi^T \mathbf{g}_{ij}(\mathbf{x})\, [c_i \neq c_j]$$

• $\mathbf{g}_{ij}$: the edge feature measuring the difference in color between neighboring pixels:

$$\mathbf{g}_{ij} = \begin{bmatrix} \exp(-\beta\, \lVert x_i - x_j \rVert^2) \\ 1 \end{bmatrix}$$

$x_i, x_j$: three-dimensional vectors representing the colors of pixels $i, j$

$$\beta = \left( 2 \left\langle \lVert x_i - x_j \rVert^2 \right\rangle \right)^{-1}$$

where $\langle \cdot \rangle$ denotes an average over the image
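A small sketch of how these contrast-sensitive edge features could be computed for the two neighbor directions of a 4-connected grid; this is an illustrative reading of the formula, not the authors' implementation.

```python
import numpy as np

def edge_features(image):
    """g_ij = [exp(-beta * ||x_i - x_j||^2), 1] for 4-connected neighbors,
    with beta = (2 * <||x_i - x_j||^2>)^-1, the average taken over the image.

    image: (H, W, 3) float array of pixel colors.
    """
    # squared color differences for horizontal and vertical neighbor pairs
    dh = ((image[:, 1:, :] - image[:, :-1, :]) ** 2).sum(axis=2)
    dv = ((image[1:, :, :] - image[:-1, :, :]) ** 2).sum(axis=2)
    beta = 1.0 / (2.0 * np.mean(np.concatenate([dh.ravel(), dv.ravel()])))
    g_h = np.stack([np.exp(-beta * dh), np.ones_like(dh)], axis=-1)
    g_v = np.stack([np.exp(-beta * dv), np.ones_like(dv)], axis=-1)
    return g_h, g_v   # each (..., 2): to be dotted with theta_phi per edge
```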
• Given the CRF model and its learned parameters, find the most probable labeling $\mathbf{c}^*$
• The optimal labeling is found by applying the alpha-expansion graph cut algorithm
• Given a current configuration (set of labels) $\mathbf{c}$ and a fixed label $\alpha \in \{1, \ldots, C\}$, where $C$ is the number of classes
• Each pixel $i$ makes a binary decision: it can either keep its old label or switch to label $\alpha$
• A binary vector $\mathbf{s} \in \{0,1\}^P$ defines the auxiliary configuration $\mathbf{c}[\mathbf{s}]$ as

$$c_i[\mathbf{s}] = \begin{cases} c_i, & \text{if } s_i = 0 \\ \alpha, & \text{if } s_i = 1 \end{cases}$$

• Start with an initial configuration $\mathbf{c}^0$, given by the mode of the texture-layout potentials
• Compute optimal alpha-expansion moves for labels $\alpha$ in some order, accepting the moves only if they increase the objective function
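A schematic of the expansion-move loop follows. It assumes a provided graph-cut subroutine `best_expansion_move` (a hypothetical name) that returns the optimal binary vector $\mathbf{s}$ for a given $\alpha$, e.g. via max-flow; the sketch only shows how moves are proposed and accepted.

```python
import numpy as np

def alpha_expansion(unary, num_classes, score, best_expansion_move):
    """Iteratively apply alpha-expansion moves until no move improves the score.

    unary: (H, W, C) potentials; the initial labeling c^0 is their mode.
    score(labels) -> float: the CRF objective to maximize.
    best_expansion_move(labels, alpha) -> (H, W) binary s, the optimal
        move for this alpha (assumed to come from a graph-cut solver).
    """
    labels = unary.argmax(axis=2)   # c^0: mode of the texture-layout potentials
    improved = True
    while improved:
        improved = False
        for alpha in range(num_classes):
            s = best_expansion_move(labels, alpha)
            proposal = np.where(s == 1, alpha, labels)   # c[s]
            if score(proposal) > score(labels):          # accept only if better
                labels, improved = proposal, True
    return labels
```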
• There are two methods to learn the parameters:
  • Maximum a posteriori (MAP) – poor results
  • Piecewise training
• Only $\boldsymbol{\theta}_\pi, \boldsymbol{\theta}_\lambda, \boldsymbol{\theta}_\phi$ are learned by these methods
• $\boldsymbol{\theta}_\psi$ is learned during boosted learning
• Maximize the conditional likelihood of the labels given the training data:

$$L(\boldsymbol{\theta}) = \sum_n \log P(\mathbf{c}^n \mid \mathbf{x}^n, \boldsymbol{\theta}) + \log P(\boldsymbol{\theta})$$

$(\mathbf{c}^n, \mathbf{x}^n)$: the $n$th training input-output pair
$\log P(\boldsymbol{\theta})$: a prior that prevents overfitting
• The maximization of $L(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$ can be achieved using a gradient ascent algorithm
• Conjugate gradient ascent did eventually converge to a solution, but evaluating the learned parameters against validation data gave poor results with almost no improvement
• The lack of alignment between object edges and label boundaries in the roughly labeled training set forced the learned parameters to tend toward zero
• Based on the piecewise training method of "Piecewise Training of Undirected Models" [C. Sutton et al., 2005]
• The terms are trained independently and then recombined
• The training method minimizes an upper bound on the log partition function: let $z(\boldsymbol{\theta}, \mathbf{x}) = \log Z(\boldsymbol{\theta}, \mathbf{x})$, and index the terms in the model by $r$:

$$z(\boldsymbol{\theta}, \mathbf{x}) \le \sum_r z_r(\boldsymbol{\theta}_r, \mathbf{x})$$

$\boldsymbol{\theta}_r$: the parameters of the $r$th term
$z_r(\boldsymbol{\theta}_r, \mathbf{x})$: the log partition function for a model containing only the $r$th term
$$z(\boldsymbol{\theta}, \mathbf{x}) \le \sum_r z_r(\boldsymbol{\theta}_r, \mathbf{x})$$

Proof: use Jensen's inequality,

$$\varphi\!\left(\frac{\sum_i a_i x_i}{\sum_j a_j}\right) \le \frac{\sum_i a_i\, \varphi(x_i)}{\sum_j a_j}, \quad \text{if } \varphi \text{ is convex}$$

$$\varphi\!\left(\frac{\sum_i a_i x_i}{\sum_j a_j}\right) \ge \frac{\sum_i a_i\, \varphi(x_i)}{\sum_j a_j}, \quad \text{if } \varphi \text{ is concave}$$

$a_i$: positive weights
$z(\boldsymbol{\theta}, \mathbf{x}) = \log Z(\boldsymbol{\theta}, \mathbf{x})$ is concave, so the second form of the inequality applies
• Replacing $z(\boldsymbol{\theta}, \mathbf{x})$ with $\sum_r z_r(\boldsymbol{\theta}_r, \mathbf{x})$ gives a lower bound on the log conditional likelihood
• The bound can be loose, especially if the terms in the model are correlated
• Performing piecewise parameter training leads to over-counting during inference in the combined model
• Because of over-counting, a duplicated term's parameters are effectively doubled: $\theta_\psi^{\text{new}} = 2\theta_\psi^{\text{old}}$
• To avoid this, weight the logarithm of each duplicate term by a factor of 0.5, or equivalently raise the term to the power of 0.5
• Four types of parameters have to be learned:
  • Texture-layout potential parameters
  • Color potential parameters
  • Location potential parameters
  • Edge potential parameters
• The first is learned during boosted learning; the others are learned by piecewise training
• The color potentials are learned at test time for each image independently
• First, the color clusters $P(x \mid k) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$ are learned in an unsupervised manner using K-means
• Then, an iterative algorithm reminiscent of EM alternates between inferring the class labeling $\mathbf{c}^*$ and updating the color potential parameters as

$$\theta_\pi(c_i, k) = \left[ \frac{\sum_i [c_i^* = c_i]\, P(k \mid x_i) + \alpha_\pi}{\sum_i P(k \mid x_i) + \alpha_\pi} \right]^{w_\pi}$$
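A sketch of this alternation, assuming the soft cluster responsibilities $P(k \mid x_i)$ are precomputed from the fitted clusters; `infer_labeling` is a stand-in for the CRF inference step, and the constants are placeholders, not the paper's values.

```python
import numpy as np

def update_color_potentials(resp, labels, num_classes, alpha_pi=0.1, w_pi=1.0):
    """One update of theta_pi given the current inferred labeling c*.

    resp:   (N, K) soft assignments P(k | x_i) for the image's N pixels
    labels: (N,) current labeling c*_i
    alpha_pi, w_pi: smoothing and over-counting compensation constants
    """
    N, K = resp.shape
    theta = np.empty((num_classes, K))
    denom = resp.sum(axis=0) + alpha_pi            # sum_i P(k|x_i) + alpha_pi
    for c in range(num_classes):
        numer = resp[labels == c].sum(axis=0) + alpha_pi
        theta[c] = (numer / denom) ** w_pi
    return theta

# The full test-time loop alternates:
#   c_star = infer_labeling(theta)                   # alpha-expansion on the CRF
#   theta  = update_color_potentials(resp, c_star, C)
```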
• These parameters are trained by maximizing the likelihood of the normalized model containing just this potential, raising the result to a fixed power $w_\lambda$ to compensate for over-counting:

$$\theta_\lambda(c, \hat{\imath}) = \left( \frac{N_{c,\hat{\imath}} + \alpha_\lambda}{N_{\hat{\imath}} + \alpha_\lambda} \right)^{w_\lambda}$$

$N_{c,\hat{\imath}}$: the number of pixels of class $c$ at normalized location $\hat{\imath}$ in the training set
$N_{\hat{\imath}}$: the total number of pixels at location $\hat{\imath}$
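A minimal sketch of this counting scheme, assuming the normalized location is a coarse grid cell; the grid size and smoothing values here are illustrative choices, not the paper's.

```python
import numpy as np

def learn_location_potentials(label_maps, num_classes, grid=20,
                              alpha_lam=1.0, w_lam=0.1):
    """Count class frequencies on a normalized location grid.

    label_maps: list of (H, W) ground-truth label images (sizes may differ).
    """
    counts = np.zeros((num_classes, grid, grid))     # N_{c, i_hat}
    for lab in label_maps:
        H, W = lab.shape
        ys = np.arange(H) * grid // H                # normalized pixel index
        xs = np.arange(W) * grid // W
        for c in range(num_classes):
            m = (lab == c)
            np.add.at(counts[c], (ys[:, None].repeat(W, 1)[m],
                                  xs[None, :].repeat(H, 0)[m]), 1)
    total = counts.sum(axis=0)                       # N_{i_hat}
    return ((counts + alpha_lam) / (total + alpha_lam)) ** w_lam
```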
• The values of the two contrast-related parameters $\boldsymbol{\theta}_\phi$ were manually selected to minimize the error on the validation set
• Based on a novel set of features called texture-layout filters
• These are capable of jointly capturing texture, spatial layout, and textural context
1. The training images are convolved with a 17-dimensional filter-bank at scale $\kappa$
2. The 17-D responses for all training pixels are whitened
3. An unsupervised clustering is performed
4. Each pixel in each image is assigned to the nearest cluster center, producing the texton map
   » Denote the texton map as $T$, where pixel $i$ has value $T_i \in \{1, \ldots, K\}$
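A condensed sketch of this textonization pipeline. Assumptions: K-means is used for the clustering step (the deck only says "unsupervised"), `filter_bank` is a placeholder list of 17 kernels since the deck does not specify the bank, and whitening is simplified to per-dimension standardization.

```python
import numpy as np
from scipy.ndimage import convolve
from sklearn.cluster import KMeans

def textonize(images, filter_bank, K):
    """Steps 1-4: filter responses -> whitening -> clustering -> texton maps.

    images: list of (H, W) single-channel arrays (the real pipeline uses CIELab).
    """
    # 1. convolve every image with every filter: (H, W, 17) per image
    responses = [np.stack([convolve(im, f) for f in filter_bank], -1)
                 for im in images]
    flat = np.concatenate([r.reshape(-1, len(filter_bank)) for r in responses])
    # 2. whiten (here: zero mean, unit variance per dimension)
    mean, std = flat.mean(0), flat.std(0) + 1e-8
    flat = (flat - mean) / std
    # 3. unsupervised clustering of the whitened responses
    km = KMeans(n_clusters=K).fit(flat)
    # 4. assign each pixel to its nearest cluster center -> texton maps
    return [km.predict((r.reshape(-1, len(filter_bank)) - mean) / std)
              .reshape(r.shape[:2]) for r in responses]
```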
• Each texture-layout filter is a pair $(r, t)$ of an image region $r$ and a texton $t$
• $r$: defined in coordinates relative to the pixel $i$ being classified
• For simplicity, a set $\mathcal{R}$ of candidate rectangles is chosen at random, such that their top-left and bottom-right corners lie within a fixed bounding box covering about half the image area
Feature response:

$$v_{[r,t]}(i) = \frac{1}{\operatorname{area}(r)} \sum_{j \in (r+i)} [\,T_j = t\,]$$

$i$: the pixel location; $(r + i)$: the region $r$ translated by $i$
The responses can be efficiently computed over a whole image with integral images [P. Viola et al., 2001].
Process:
1. The texton map is separated into $K$ channels (one for each texton)
2. For each channel, a separate integral image is calculated
3. The feature response is then computed from the four corners of the offset rectangle:

$$v_{[r,t]}(i) = \hat{T}^{(t)}(\bar{r}_{\mathrm{br}}) - \hat{T}^{(t)}(\bar{r}_{\mathrm{bl}}) - \hat{T}^{(t)}(\bar{r}_{\mathrm{tr}}) + \hat{T}^{(t)}(\bar{r}_{\mathrm{tl}})$$

$\hat{T}^{(t)}$: the integral image of $T$ for texton channel $t$
$\bar{r}_{\mathrm{br}}, \bar{r}_{\mathrm{bl}}, \bar{r}_{\mathrm{tr}}, \bar{r}_{\mathrm{tl}}$: the bottom-right, bottom-left, top-right, and top-left corners of rectangle $r$ translated by pixel $i$
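A sketch of both steps with numpy; the rectangle encoding `(top, left, bottom, right)` relative to the pixel is an assumption, and clipping of out-of-image offsets is omitted for brevity.

```python
import numpy as np

def texton_integral_images(T, K):
    """One integral image per texton channel (steps 1 and 2).

    T: (H, W) texton map with values in {0, ..., K-1}.
    """
    channels = (T[..., None] == np.arange(K)).astype(np.int64)   # (H, W, K)
    # pad with a leading zero row/column so corner lookups are uniform
    return np.pad(channels.cumsum(0).cumsum(1), ((1, 0), (1, 0), (0, 0)))

def feature_response(integrals, i, r, t):
    """v_[r,t](i): fraction of pixels in r+i with texton t (step 3)."""
    y, x = i
    top, left, bottom, right = r          # rectangle relative to pixel i
    y0, x0, y1, x1 = y + top, x + left, y + bottom, x + right
    I = integrals[..., t]
    count = I[y1, x1] - I[y1, x0] - I[y0, x1] + I[y0, x0]
    return count / float((bottom - top) * (right - left))
```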
• Some classes may have large within-class textural differences, but a repeatable layout of texture within a particular object instance
• The texton-dependent layout filter therefore uses the texton at the pixel $i$ being classified, $T_i$, rather than a particular learned texton
$$P(\mathbf{c} \mid \mathbf{x}, \boldsymbol{\theta}) \approx P(\mathbf{c}^x \mid \mathbf{x}, \boldsymbol{\theta}^x) \times P(\mathbf{c}^y \mid \mathbf{x}, \boldsymbol{\theta}^y)$$
• Employ an adapted version of the Joint Boost algorithm [A. Torralba et al., 2007]
• It iteratively selects discriminative texture-layout filters as "weak learners"
• These are combined into a powerful classifier $P(c \mid \mathbf{x}, i)$, used by the texture-layout potentials
• Joint Boost shares each weak learner between a set of classes $\mathcal{N}$
• Strong classifier: $H(c, i) = \sum_{m=1}^{M} h_i^m(c)$
• Use the multiclass logistic transformation $P(c \mid \mathbf{x}, i) \propto \exp H(c, i)$ [J. Friedman et al., 2000]
• Each weak learner is based on a feature response $v_{[r,t]}(i)$:

$$h_i(c) = \begin{cases} a\,[\,v_{[r,t]}(i) > \theta\,] + b, & \text{if } c \in \mathcal{N} \\ k^c, & \text{otherwise} \end{cases}$$

with parameters $(a, b, \{k^c\}_{c \notin \mathcal{N}}, \theta, \mathcal{N}, r, t)$
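A sketch of evaluating such a weak learner, and the logistic transform, for all training pixels and classes at once; the vectorized layout is an assumption for illustration.

```python
import numpy as np

def weak_learner(v, a, b, theta, k, in_share_set):
    """h_i(c) for N examples and C classes.

    v: (N,) feature responses v_[r,t](i)
    k: (C,) constants k^c (used only for classes outside the sharing set)
    in_share_set: (C,) boolean mask for c in the sharing set N
    """
    decision = a * (v[:, None] > theta) + b            # (N, 1): shared response
    return np.where(in_share_set[None, :], decision, k[None, :])   # (N, C)

def predict_proba(H):
    """Multiclass logistic transform: P(c | x, i) proportional to exp H(c, i)."""
    e = np.exp(H - H.max(axis=1, keepdims=True))       # stabilized softmax
    return e / e.sum(axis=1, keepdims=True)
```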
• Each training example $i$ (a pixel in a training image) is paired with a target value $z_i^c \in \{-1, +1\}$ and assigned a weight $w_i^c$ specifying its classification accuracy for class $c$ after $m-1$ rounds
• Round $m$ chooses a new weak learner by minimizing an error function $J_{\mathrm{wse}}$:

$$J_{\mathrm{wse}} = \sum_c \sum_i w_i^c \left( z_i^c - h_i^m(c) \right)^2$$

• The weights are then updated:

$$w_i^c := w_i^c\, e^{-z_i^c h_i^m(c)}$$
• Minimizing the error function $J_{\mathrm{wse}}$ requires an expensive brute-force search over the possible weak learners $h_i^m(c)$
• However, given the sharing set $\mathcal{N}$, the features $(r, t)$, and the threshold $\theta$, minimizing $J_{\mathrm{wse}}$ yields a closed form for $a$, $b$, and $\{k^c\}_{c \notin \mathcal{N}}$:

$$b = \frac{\sum_{c \in \mathcal{N}} \sum_i w_i^c z_i^c\, [\,v(i, r, t) \le \theta\,]}{\sum_{c \in \mathcal{N}} \sum_i w_i^c\, [\,v(i, r, t) \le \theta\,]}$$

$$a + b = \frac{\sum_{c \in \mathcal{N}} \sum_i w_i^c z_i^c\, [\,v(i, r, t) > \theta\,]}{\sum_{c \in \mathcal{N}} \sum_i w_i^c\, [\,v(i, r, t) > \theta\,]}$$

$$k^c = \frac{\sum_i w_i^c z_i^c}{\sum_i w_i^c}$$
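These closed forms translate directly into a few weighted sums; a sketch, with small epsilons added to avoid division by zero (an implementation detail not in the slides):

```python
import numpy as np

def fit_weak_learner_params(v, w, z, share, theta):
    """Closed-form a, b, and k^c given the sharing set, feature, and threshold.

    v: (N,) feature responses; w, z: (N, C) weights and targets;
    share: (C,) boolean mask of the sharing set N.
    """
    above = v > theta                                   # (N,) indicator
    ws, zs = w[:, share], z[:, share]                   # restrict to c in N
    b = (ws * zs)[~above].sum() / (ws[~above].sum() + 1e-12)
    a_plus_b = (ws * zs)[above].sum() / (ws[above].sum() + 1e-12)
    a = a_plus_b - b
    k = (w * z).sum(axis=0) / (w.sum(axis=0) + 1e-12)   # per-class constants
    return a, b, k                                      # k used only for c not in N
```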
• Employ the quadratic-cost greedy algorithm to speed up the search [A. Torralba et al., 2007]
• Optimization over $\theta \in \Theta$ can be made efficient by careful use of histograms of weighted feature responses:
  • Treating $\Theta$ as an ordered set, histograms of the values $v_{[r,t]}(i)$, weighted appropriately by $w_i^c z_i^c$ and $w_i^c$, are built over bins corresponding to the thresholds in $\Theta$
  • These histograms are accumulated to give the thresholded sums needed to calculate $a$ and $b$ for all values of $\theta \in \Theta$ at once
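One way to read this trick: bin each example once, then cumulative sums over the bins give the "below threshold" and "above threshold" totals for every candidate $\theta$ simultaneously. A sketch under the assumption that `thresholds` is sorted:

```python
import numpy as np

def sums_for_all_thresholds(v, wz_sum, w_sum, thresholds):
    """Numerators/denominators of b (and a + b) for every theta at once.

    v: (N,) responses; wz_sum, w_sum: (N,) per-example sums of w_i^c z_i^c
    and w_i^c over the sharing set; thresholds: sorted 1-D array Theta.
    """
    # bin each response by the smallest threshold it does not exceed
    bins = np.searchsorted(thresholds, v, side='left')
    n_bins = len(thresholds) + 1
    hist_wz = np.bincount(bins, weights=wz_sum, minlength=n_bins)
    hist_w = np.bincount(bins, weights=w_sum, minlength=n_bins)
    # cumulative sums give, for each theta, the totals over {v <= theta}
    below_wz = np.cumsum(hist_wz)[:-1]     # one entry per threshold
    below_w = np.cumsum(hist_w)[:-1]
    above_wz = hist_wz.sum() - below_wz    # totals over {v > theta}
    above_w = hist_w.sum() - below_w
    return below_wz, below_w, above_wz, above_w
```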
• Employ a random feature selection procedure to speed up the minimization over features [S. Baluja et al.]
• The algorithm examines only a randomly chosen fraction $\xi \ll 1$ of the possible features
Adding more texture-layout filters improves classification.
• The effect of different model potentials:
(a): the original input image
(b): using only the texture-layout potentials
(c): without color modeling
(d): the full CRF model
• Texton-dependent layout filter
• MSRC 21-class database results
• Accuracy of segmentation for the MSRC 21-class database
• Comparison with He et al.
• TV sequences