Computer Vision and Image Understanding 120 (2014) 81–90
Efficient semantic image segmentation with multi-class ranking prior ☆

Deli Pei a,b,c,d, Zhenguo Li e, Rongrong Ji f,*, Fuchun Sun b,c,d,*

a Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
b Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
c State Key Laboratory of Intelligent Technology and Systems, Beijing 100084, China
d Tsinghua National Laboratory for Information Science and Technology, Beijing 100084, China
e Huawei Noah's Ark Lab, Hong Kong, China
f Department of Cognitive Science, Xiamen University, Xiamen 361005, China
Article history: Received 3 October 2012; Accepted 6 October 2013; Available online 25 October 2013

Keywords: Computer vision; Machine learning; Semantic segmentation; Structural SVMs

Abstract
Semantic image segmentation, which aims to decompose an image into semantically consistent regions, is of fundamental importance to a wide variety of computer vision tasks, such as scene understanding, robot navigation and image retrieval. Most existing works address it as a structured prediction problem, combining contextual information with low-level cues based on conditional random fields (CRFs), which are often learned by heuristic search based on maximum likelihood estimation. In this paper, we use a maximum margin based structural support vector machine (S-SVM) model to combine multiple levels of cues to attenuate the ambiguity of appearance similarity, and propose a novel multi-class ranking based global constraint to confine the object classes to be considered when labeling regions within an image. Compared with existing global cues, our method strikes a better balance between expressive power for heterogeneous regions and the efficiency of searching the exponential space of possible label combinations. We then introduce inter-class co-occurrence statistics as pairwise constraints and combine them with the predictions from local and global cues in the S-SVM framework. This enables joint inference of the labeling within an image for better consistency. We evaluate our algorithm on two challenging datasets widely used for semantic segmentation evaluation, the MSRC-21 dataset and the Stanford Background Dataset. Experimental results show that we obtain highly competitive performance compared with state-of-the-art methods, even though our model is much simpler and more efficient.
© 2013 Elsevier Inc. All rights reserved.
1. Introduction
Semantic segmentation is a fundamental but challenging problem in computer vision that aims to assign each pixel in an image a pre-defined semantic label. It can be seen as an extension of traditional object detection, which aims at detecting prominent objects in the foreground of an image, and is closely related to other fundamental computer vision tasks such as image segmentation and image classification. Semantic segmentation has many applications in practice, including scene understanding, robot navigation, and image retrieval.
Early semantic image segmentation algorithms typically solve this problem from a pixel-wise labeling perspective [1,2]. Although using pixels as labeling units is simple and straight-
☆ This paper has been recommended for acceptance by Nicu Sebe.
* Corresponding authors. Addresses: Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China (F. Sun); Department of Cognitive Science, School of Information Science and Technology, Xiamen University, Xiamen 361005, China (R. Ji).
E-mail addresses: derrypei@gmail.com (D. Pei), li.zhenguo@huawei.com (Z. Li), rrji@xmu.edu.cn (R. Ji), fcsun@mail.tsinghua.edu.cn (F. Sun).
1077-3142/$ - see front matter © 2013 Elsevier Inc. All rights reserved.
http://dx.doi.org/10.1016/j.cviu.2013.10.005
forward, a pixel by itself carries limited and ambiguous information that is not always discriminative enough to determine the correct label. On the other hand, the proliferation of unsupervised image segmentation algorithms, such as mean shift [3], graph-based segmentation [4,38], quick shift [5], TurboPixel [6] and SLIC [7], enables higher-order feature representations of regions. Therefore, more recent semantic segmentation approaches based on region-wise labeling [8–13] have been well investigated; they make use of region-level features that are not only more informative but also robust to noise, clutter, illumination variance, etc. In such a setting, an initial unsupervised segmentation is commonly adopted as pre-processing. However, image segmentation is still far from perfect, despite extensive attempts over the last several decades. From this point of view, how to make the best use of these imperfect unsupervised image segmentation algorithms for semantic segmentation is of fundamental importance, yet is still unclear.
Although higher-order features extracted from regions are more expressive and informative than those from pixels, semantic ambiguity still exists because of appearance similarity. A general consensus is that contextual information within an image is a very useful cue for attenuating this ambiguity, and it can be used to
suppress or encourage the presence of object classes during labeling. Context refers to any information that is not extracted directly from local appearance, and can be summarized into two categories: pairwise constraints and global cues. Pairwise constraints, such as contrast-based smoothness [14,9], relative location [10,11] and co-occurrence [8,11,15], model the pairwise relationships between regions within an image. Global constraints are usually used to enforce higher-level consistency at the level of region sets or the whole image. Several approaches have been proposed to model these cues, such as using image classification results [13], the Potts potential [12], the P^N Potts potential [16] and its improved versions, the robust P^N potential [14] and P^N-based hierarchical CRFs [17], and the harmony potential [9]. These models are further discussed in Section 2.
In terms of methodology, most existing methods [16,12,14,10,11,15,9] use conditional random fields (CRFs) to combine constraints from different levels and jointly infer the labeling within an image, which is also known as structured prediction. In contrast to the many sophisticated algorithms for inference, these models [10,11,9,14,15] are usually learned by gradient descent or heuristic search on a validation set, based on maximum likelihood estimation. On the other hand, Zhu et al. [18] showed that max-margin based learning is more robust for structured prediction than maximum likelihood based learning in many machine learning applications.
In this paper, we use a maximum margin based structural support vector machine (S-SVM) model to combine multiple levels of cues to attenuate the ambiguity of appearance similarity, and we propose a multi-class ranking based global constraint to confine the object classes to be considered when labeling regions within an image.
For global cues, we first rank all the object classes for an image (a class more likely to be present in the image gets a larger score) using the multi-class ranking algorithm of [20], and transform the ranking scores into an image-level soft constraint that confines the possible classes present in the image. The advantages of this global cue can be seen from two aspects. On the one hand, compared with the robust P^N potential [14], which limits each parent node to take only one single label, our method ranks all the classes for an image and is thus more representative of heterogeneous regions. On the other hand, since we compute ranking scores for all the classes and transform them into a soft constraint, we do not need to make a hard decision for every class, and thus avoid searching the exponential space of possible label combinations as in the harmony potential [9]. The global cues are integrated with the prediction obtained from region features and logistic regression to encourage more likely classes while suppressing the others.
We then introduce inter-class co-occurrence statistics as pairwise constraints and combine them with the predictions from the previous stage in the S-SVM framework. This enables joint inference of the labeling within an image for better consistency. Moreover, our model can be learned efficiently with the cutting plane algorithm [19] instead of the heuristic search used in CRF learning. Experimental results show that we obtain highly competitive performance compared with state-of-the-art methods with a much simpler and more efficient model on two challenging datasets: MSRC-21 and the Stanford Background Dataset.
Probably the most related work is [21], which discussed the application of structural SVMs to semantic image segmentation and compared them with the alternative maximum likelihood method. However, our model differs from theirs in the design of the pairwise and global constraints, as well as in the loss function used for parameter learning. They used the standard contrast-dependent Potts model as the pairwise constraint, in contrast to our co-occurrence property. With regard to global constraints, they used the very simple and straightforward K image-level classification results; the advantage of multi-class ranking over one-vs-all classifiers is discussed in [20].
The remainder of the paper is organized as follows. In the next section we review related work. Our model is presented in Section 3, including the problem formulation and model details. Sections 4 and 5 describe the inference and learning methods. Implementation details and performance evaluation are given in Section 6, and conclusions are drawn in Section 7.
2. Related work
Despite the success of inferring pixel labels [1,2], more recent methods tend to infer labels over regions or superpixels, both to lower computational complexity and to incorporate higher-level semantic cues. For these approaches, traditional image segmentation algorithms such as Normalized Cuts [8], mean shift [14,17,13], graph-based image segmentation [10] and quick shift [9,22] are adopted to obtain initial segments. More recently, several over-segmentation algorithms [6,7] have been developed to bypass the problems of traditional segmentation algorithms, such as semantic ambiguity (regions spanning multiple object classes) and the difficulty of determining the optimal number of segments. These algorithms seek a trade-off between reducing image complexity through pixel grouping and avoiding under-segmentation [6]. Images are decomposed into regions much smaller than object size, e.g. 100–300 regions. Many traditional segmentation algorithms can also be adopted to generate superpixels by producing finer-level region segments. Qualitative results of different segmentation algorithms are given in Fig. 1, where each image is decomposed into approximately 150 superpixels. It can be seen that over-segmentation algorithms tend to segment an image into regions of approximately equal size, while the region sizes of traditional segmentation may vary greatly with the complexity of the content.
Although various powerful features have been proposed recently (e.g. color histograms, texture and SIFT), these features are still not informative enough to achieve high classification performance because of appearance similarity. To attenuate this ambiguity of the feature representation, pairwise constraints such as smoothness [14,9,23], relative location [10,11] and co-occurrence [8,11,15] have been introduced. (i) The assumption behind the pairwise smoothing term is that adjacent regions tend to have the same label, so spatially adjacent regions with different labels are penalized. To preserve boundaries, appearance contrast is incorporated into the smoothing term, whereby regions with larger appearance contrast are penalized less for inconsistent labels. However, the dilemma of this smoothing term is that regions with similar appearance already tend naturally toward the same label; what the smoothing term actually needs to enforce is that spatially adjacent regions with differing appearance take the same label. (ii) Co-occurrence statistics exploit the property that some class pairs (e.g. boat and water) are more likely to appear together in an image than others (e.g. car and water). The existence of one class can thus serve as evidence for the presence of some highly related classes and suppress the presence of other, unlikely classes. For instance, Rabinovich et al. [8,11] construct context matrices by counting the co-occurrence frequency among object labels in the training set to incorporate semantic contextual information. Ladicky et al. [15] claimed that the co-occurrence cost should depend only on the labels present in an image and should be invariant to the number and locations of the pixels an object occupies. (iii) Gould et al. [10] encoded the inter-class spatial relationship as a local feature in a two-stage classification process. However, because of the 2D projection, relative location in images is usually uninformative and hence degenerates to a co-occurrence constraint.
Fig. 1. Over-segmentation examples of different segmentation algorithms, with approximately 150 superpixels for each image.
Pairwise constraints can only capture local context between regions. A more recent trend is to build a hierarchical model by adding an extra global constraint to the pairwise framework, incorporating constraints at a higher level such as groups of segments or the whole image. Plath et al. [12] proposed a Potts potential to model the label consistency of regions in a hierarchical tree structure, which penalizes all nodes whose labels are inconsistent with their parent's label. Kohli et al. [14] adapted the P^N Potts potential proposed in [16] into a segment-quality-sensitive higher-order potential, the robust P^N potential: the cost of inconsistent labeling in a high-contrast region is lower than in a low-contrast region. However, a drawback of both higher-order potentials [12,14] is that they limit each parent node to take only one single label, which is often not the case and makes them unable to handle heterogeneous regions. Csurka and Perronnin [13] proposed to use image classification results to reduce the number of classes to be considered in an image. However, this hard-constraint scheme does not take the classification accuracy into account, so classification errors can propagate to the following stages and hurt overall performance. The work in [17] proposed a novel hierarchical CRF framework that allows features computed at different levels to be integrated, avoiding a single choice of quantization. Gonfaus et al. [9] proposed a more expressive constraint, the harmony potential, which first restricts the power set of all possible labels at the image level and then uses it as a higher-order constraint. However, the exponentially sized power set makes exact inference infeasible, and heuristic methods such as branch-and-bound sampling have to be applied to approximate the best assignment, so that only a small subset is actually taken into account.
Besides context cues extracted directly from the image, priors from various vision tasks have also been introduced to improve performance. Several approaches consider coupling object detection and multi-class image segmentation by feeding information from one task to the other [24–26,15,9]. Heitz et al. [24] developed Cascaded Classification Models (CCM) to combine the subtasks of scene categorization, object detection and multi-class image segmentation for holistic scene understanding. However, since these subtasks were only loosely coupled through their input/output variables, each of them is still optimized separately; information sharing is limited and may cause inconsistent representations. Gould et al. [25] proposed a hierarchical region-based approach that combines object detection with image segmentation to reason simultaneously about pixels, regions and objects in an image. Ladicky et al. [15] integrated the results of sliding-window detectors with low-level pixel-based unary and pairwise relations into a conditional random field (CRF) framework for joint reasoning about regions, objects and their attributes; a similar idea appears in [9].
3. Model
In this section we specify our model for structured prediction. First, we consider superpixels obtained by an unsupervised image segmentation, and use $x_i$, $i = 1, \ldots, N$, to denote the feature vector of superpixel $i$ and $y_i \in C = \{c_1, c_2, \ldots, c_K\}$ its corresponding label, where $N$ and $K$ are the numbers of superpixels and classes, respectively. The whole image can then be represented as the collection of superpixel feature vectors, $X = \{x_i \mid i = 1, \ldots, N\}$, and an assignment of labels to the set of superpixels is referred to as a labeling of the image, denoted by $Y = \{y_i \mid i = 1, \ldots, N\}$. Our objective is to learn a function $F(X, Y)$ that captures the compatibility of the prediction $Y$ with the observation $X$, such that the better the prediction $Y$ describes the image content $X$, the higher the value of $F(X, Y)$. Thus, given the observation $X$, the optimal prediction $\hat{Y}$ can be found by maximizing $F(X, Y)$ over all possible labelings:

$$\hat{Y} = \arg\max_{Y} F(X, Y). \qquad (1)$$
Following the structural SVM framework [27], we assume the compatibility function $F$ is linear in a combined feature representation of inputs and outputs $\phi(X, Y)$ (also known as the joint feature map):

$$F(X, Y) = \langle \omega, \phi(X, Y) \rangle. \qquad (2)$$
The joint feature map $\phi(X, Y)$ can be designed to capture multi-scale, multi-layer and contextual cues. Given the joint feature map, the learning task is to train an optimal model parameter $\omega$ on the training set. The local constraint, also called the unary potential, captures the local appearance evidence for labeling superpixels; the mid-level constraint usually exploits pairwise relationships between superpixels, such as smoothness, relative location and co-occurrence. In some approaches, a global constraint is also applied to infer the possible labeling at the image level rather than from superpixels. We specify how we define these constraints and combine them in the following sections.
3.1. The unary potential
First we detail our feature representation for superpixels. The
raw features of a superpixel consist of two ingredients, appearance-based descriptors and bag-of-word (BoW) representation.
Following [10], the appearance-based descriptors include color
and texture features which compute mean, standard deviation,
skewness, and kurtosis statistics of the superpixel’s color distribution and filter responses. In addition, we also extract the location
and geometry features of the superpixel. For more details we refer
the reader to [10]. The BoW representation has been shown useful
in many state-of-the-art vision systems. Therefore we also incorporate it for superpixel representation. Moreover, as shown in
[22,28], BoW features extracted not only inside superpixels, but
also in their neighborhood, describe superpixels more effectively. Thus, for each superpixel we extract BoW features from both the superpixel itself and its adjacent regions and concatenate them together. The final representation of the raw features becomes

$$s_i = (\theta_a s_i^a,\; \theta_b s_i^b)^{\top}, \qquad (3)$$

where $s_i^a$ is the appearance descriptor, $s_i^b$ is the concatenated BoW feature, and $\theta_a$, $\theta_b$ are weight parameters learned by cross validation.
Instead of using the above raw features directly, we compute a more compact intermediate representation via logistic regression. Given the raw feature representation $s_i$ of a superpixel, the probability of taking label $l \in C = \{c_1, c_2, \ldots, c_K\}$ is computed by the following logistic regression model:

$$P(l \mid s_i) = \begin{cases} \dfrac{\exp(\beta_{l0} + \beta_l^{\top} s_i)}{1 + \sum_{t=c_1}^{c_{K-1}} \exp(\beta_{t0} + \beta_t^{\top} s_i)} & \text{if } l = c_1, \ldots, c_{K-1}, \\[2ex] \dfrac{1}{1 + \sum_{t=c_1}^{c_{K-1}} \exp(\beta_{t0} + \beta_t^{\top} s_i)} & \text{if } l = c_K, \end{cases} \qquad (4)$$

where $\beta$ are the learned parameters of the logistic regression. We concatenate the class probabilities to form the $K$-dimensional intermediate representation:

$$x_i = \big(P(c_1 \mid s_i), P(c_2 \mid s_i), \ldots, P(c_K \mid s_i)\big)^{\top}. \qquad (5)$$

Moreover, we assign the most probable label to the superpixel as an initial label guess for further joint inference:

$$l_i = \arg\max_{l \in C} P(l \mid s_i). \qquad (6)$$
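For concreteness, the following minimal sketch (in Python, not the authors' released code) shows how such an intermediate representation could be produced: a multinomial logistic regression is fitted to raw superpixel features, and its class-posterior outputs serve as the $K$-dimensional vector of Eq. (5) and the initial guess of Eq. (6). All data and dimensions here are illustrative placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

K = 21                                   # number of classes (MSRC-21)
S_train = np.random.rand(500, 1038)      # dummy raw features: 238 appearance + 800 BoW
y_train = np.random.randint(0, K, 500)   # dummy ground-truth superpixel labels

# Multinomial logistic regression over raw features (cf. Eq. (4));
# C = 25 mirrors the cost parameter reported in Section 6.1.
clf = LogisticRegression(C=25, max_iter=1000)
clf.fit(S_train, y_train)

S_test = np.random.rand(10, 1038)
X_intermediate = clf.predict_proba(S_test)      # Eq. (5): rows are (P(c_1|s_i), ..., P(c_K|s_i))
initial_labels = X_intermediate.argmax(axis=1)  # Eq. (6): most probable label per superpixel
```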
As a baseline, the performances of raw features under various
over-segmentation algorithms are systematically evaluated in
Section 6 and compared with those obtained from structured
prediction using contextual information.
The unary potential can then be written as

$$F_{\text{unary}}(X, Y) = \sum_i \omega_{y_i}^{\top} x_i, \qquad (7)$$

where $\omega_{c_1}, \ldots, \omega_{c_K} \in \mathbb{R}^K$ are the model parameters of the unary potential.
3.2. The pairwise potential

The unary potential stage computes not only the intermediate representation of superpixels, but also an initial labeling of each superpixel based on local features. However, the performance of such a labeling may not be satisfactory due to the ambiguity of the low-level representation. To leverage the semantic context between superpixels and attenuate this ambiguity, we introduce a voting strategy that exploits the co-occurrence of objects within an image. Based on the initial labels obtained in Section 3.1, each superpixel casts a vote of support for all the other superpixels' class labels, weighted by its region size and the confidence of its initial guess:

$$\phi_{s_i,s_j}(y_i, y_j) = \frac{P(y_i \mid s_i)\, S_i + P(y_j \mid s_j)\, S_j}{\sum_i S_i}, \qquad (8)$$

where $P(y_i \mid s_i)$ is the probability of superpixel $s_i$ taking label $y_i$ as defined in (4), and $S_i$ is the size of superpixel $i$. Thus, each superpixel $i$ receives $N-1$ votes from all the other superpixels for its label assignment $y_i$:

$$V_{s_i}(y_i) = \sum_{j=1, j \neq i}^{N} \mu_{y_i, y_j}\, \phi_{s_i,s_j}(y_i, y_j). \qquad (9)$$

We define the pairwise potential by aggregating the votes of all superpixels for their label assignments $Y$:

$$F_{\text{pair}}(X, Y) = \sum_i V_{s_i}(y_i) = \sum_i \sum_{j \neq i} \mu_{y_i, y_j}\, \phi_{s_i,s_j}(y_i, y_j), \qquad (10)$$

where the $\mu_{c_i, c_j}$ are the $K^2$ model parameters of the pairwise potential, describing the preference for each co-occurring class pair in the data.
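The pairwise potential of Eqs. (8)–(10) can be evaluated directly from the quantities already defined. Below is a minimal numpy sketch, assuming the class probabilities, superpixel sizes and a learned co-occurrence matrix are given; it is an illustration of the formulas, not the authors' implementation.

```python
import numpy as np

def pairwise_potential(P, sizes, labels, mu):
    """Evaluate F_pair of Eq. (10) for one candidate labeling.

    P      : (N, K) class probabilities P(y|s_i) from Eq. (4)
    sizes  : (N,)   superpixel areas S_i
    labels : (N,)   candidate labels y_i
    mu     : (K, K) co-occurrence parameters mu_{c,c'}
    """
    N = len(labels)
    total = sizes.sum()
    F = 0.0
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            # Eq. (8): size- and confidence-weighted vote between s_i and s_j
            phi = (P[i, labels[i]] * sizes[i] + P[j, labels[j]] * sizes[j]) / total
            # Eq. (9)/(10): weight the vote by the co-occurrence preference
            F += mu[labels[i], labels[j]] * phi
    return F
```

The double loop is O(N^2), which remains affordable because N is only 100–300 superpixels per image (see Section 4).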
3.3. The global constraint
When most of the superpixels are correctly labeled, the co-occurrence property helps rectify the few superpixels that are mislabeled. However, as the proportion of mislabeled superpixels increases, errors are more likely to propagate to other superpixels through the voting scheme. To resolve this problem, a global constraint at the image level is further introduced to confine the possible classes present in an image. However, existing global consistency potentials are either too limited in expressive power, allowing regions only a single class label, such as the Potts [12] and robust P^N-based [14] potentials, or too complicated, having to search an exponential space of likely label combinations, such as the harmony potential [9].
We propose a new, efficient global constraint that achieves a better trade-off between expressive power for heterogeneous regions and the efficiency of searching the exponential space of possible label combinations. With the help of the multi-class ranking algorithm of [20], we first rank all the object classes at the image level and then transform the ranking into a soft constraint.
To obtain the multi-class ranking score, each image is represented by a kernel descriptor [29] and its corresponding binary label vector $l_i = (l_i^1, \ldots, l_i^K) \in \{-1, +1\}^K$, where $K$ is the total number of object classes, $l_i^j = +1$ denotes the presence of class $j$ in image $I_i$, and $l_i^j = -1$ denotes its absence. We aim to learn $K$ classification functions $f_t(I): \mathbb{R}^d \to \mathbb{R}$, $t \in C = \{c_1, c_2, \ldots, c_K\}$, one for each class, such that for any image $I$, $f_{c_i}(I)$ scores higher than $f_{c_j}(I)$ when $I$ is more likely to belong to class $c_i$ than to class $c_j$.¹
The ranking score indicates the confidence of assigning a specific label to a given image. Although informative, it is still very rough. Therefore, rather than setting a threshold to binarize this vector into a possible label set as in [13,30], we transform the ranking score into a soft constraint using a sigmoid function:
$$h_t(I) = \frac{1}{1 + a \exp(-b\, f_t(I))} + q, \qquad (11)$$
where a, b, q are parameters to be learned from a validation set.
Each image $I_j$ can then be represented by a $K$-dimensional vector $r_j = (h_{c_1}(I_j), \ldots, h_{c_K}(I_j))$. This soft constraint is integrated with the unary potential to impose an image-level label prior on the superpixels within image $I_j$. Thus the intermediate representation of a superpixel defined in (5) is revised as follows:

$$\tilde{x}_i = \big(h_{c_1}(I_j)\, P(c_1 \mid s_i),\; h_{c_2}(I_j)\, P(c_2 \mid s_i),\; \ldots,\; h_{c_K}(I_j)\, P(c_K \mid s_i)\big)^{\top}. \qquad (12)$$

¹ We use the code available at http://www.cse.msu.edu/bucakser/software.html.
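A small sketch of Eqs. (11) and (12), under the sign conventions reconstructed above; the parameter values a, b, q are those quoted in Section 6.2, and everything else is dummy data.

```python
import numpy as np

def soft_constraint(f, a=3.0, b=3.0, q=0.4):
    """Eq. (11): squash per-class ranking scores f_t(I) into soft weights h_t(I)."""
    return 1.0 / (1.0 + a * np.exp(-b * f)) + q

f_scores = np.random.randn(21)    # multi-class ranking scores, one per class
h = soft_constraint(f_scores)     # K-dimensional image-level prior r_j
P = np.random.rand(150, 21)       # per-superpixel class probabilities, Eq. (5)
X_tilde = P * h[None, :]          # Eq. (12): element-wise reweighting of every superpixel
```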
To illustrate the benefit of our proposed soft constraint strategy, we compare it with two alternative global constraint strategies: the Top n hard constraint and the Threshold-t constraint. The Top n constraint selects the n most probable labels for each image according to the ranking score vector f computed above; all other labels are simply discarded. In the Threshold-t strategy, instead of selecting a fixed number of labels per image, we filter out unlikely labels by thresholding the ranking score vector f. We compare against these strategies to show the effectiveness of our proposed approach in Section 6.2.
The advantages of transforming the multi-label ranking into global constraints can be seen from two aspects. On the one hand, the multi-label ranking score inferred at the image level is more representative of heterogeneous regions because it encourages multiple labels, in contrast to the robust P^N model [14], which limits each parent node to take only one single label. On the other hand, instead of inferring the possible label set of an image from the exponentially sized power set of labels as in [22], which is intractable and can only be solved by a sampling strategy, we directly compute the ranking score of every label for an image, and this score can be integrated directly with the prediction results obtained from local features and logistic regression.
3.4. Overall compatibility function
Combining all the above developments, we propose the following compatibility function:

$$F(X, Y) = F_{\text{unary}}(X, Y) + F_{\text{pair}}(X, Y) = \sum_i \omega_{y_i}^{\top} \tilde{x}_i + \sum_i \sum_{j \neq i} \mu_{y_i, y_j}\, \phi_{s_i,s_j}(y_i, y_j). \qquad (13)$$
The compatibility function combines local and global cues with contextual information in a unified framework and makes the labeling inference jointly, which efficiently attenuates the ambiguity of local appearance similarity and makes the labeling more consistent. We systematically evaluate our model on two challenging datasets for semantic segmentation and compare with state-of-the-art methods in Section 6.
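Reusing pairwise_potential from the sketch in Section 3.2, the whole compatibility function of Eq. (13) reduces to a few lines; omega here is assumed to be a (K, K) array holding one K-dimensional parameter vector per class, as in Eq. (7).

```python
import numpy as np

def compatibility(X_tilde, P, sizes, labels, omega, mu):
    """Eq. (13): F(X, Y) = F_unary(X, Y) + F_pair(X, Y) for one candidate labeling."""
    # Eq. (7), applied to the reweighted features of Eq. (12)
    unary = sum(omega[y] @ x for x, y in zip(X_tilde, labels))
    # Eq. (10), via the pairwise sketch of Section 3.2
    return unary + pairwise_potential(P, sizes, labels, mu)
```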
4. Inference
The inference process defined in (1) seeks the most compatible labeling Y for a given observation X. Maximizing this compatibility function can typically be formulated as an integer programming problem, which is NP-hard in general except in some special cases (e.g. K = 1), and consequently can only be solved approximately. In this paper, we adopt a greedy search algorithm in an iterative style because of its simplicity. First we rewrite the compatibility function as follows:
$$F(X, Y) = \sum_i \omega_{y_i}^{\top} \tilde{x}_i + \sum_i \sum_{j \neq i} \mu_{y_i, y_j}\, \phi_{s_i,s_j}(y_i, y_j) = \sum_i \Big\{ \omega_{y_i}^{\top} \tilde{x}_i + \sum_{j \neq i} \mu_{y_i, y_j}\, \phi_{s_i,s_j}(y_i, y_j) \Big\} = \sum_i g(y_i \mid x_i, y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_N), \qquad (14)$$

where $g(\cdot)$ is the potential of superpixel $x_i$ being labeled $y_i$ while the rest are fixed to $y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_N$.
The inference process then proceeds as follows: (1) in each iteration we randomly choose one superpixel and fix all the other superpixels' labels; (2) we compute the score function g for all K possible classes of this superpixel; (3) if the label with the largest score differs from the previous label, we update the label; (4) the iteration stops when no label changes or the maximum number of iterations is reached.
Like most greedy search algorithms, the initialization is crucial to the performance. In our case, we found that the local prediction obtained from logistic regression serves as a natural and good starting point. We therefore initialize our prediction with the logistic regression results of Eq. (12) instead of random values. The pseudocode of the above operations is given as follows:
Algorithm 1. Inference algorithm
1: Input: image feature $I_j$, superpixels $s_i$, $i = 1, \ldots, N$
2: Initialization:
3:   $\hat{y}_i = \arg\max_{y_i \in C} h_{y_i}(I_j)\, P(y_i \mid s_i)$
4:   $\tilde{x}_i = (h_{c_1}(I_j)\, P(c_1 \mid s_i), \ldots, h_{c_K}(I_j)\, P(c_K \mid s_i))$
5: repeat
6:   for all superpixels $s_i$ do
7:     $y_i^{*} = \arg\max_{y_i \in C} g(y_i \mid \tilde{x}_i, \hat{y}_1, \ldots, \hat{y}_{i-1}, \hat{y}_{i+1}, \ldots, \hat{y}_N)$
8:     if $y_i^{*} \neq \hat{y}_i$ then update $\hat{y}_i \leftarrow y_i^{*}$
9:   end for
10: until no label changes OR the maximum number of iterations is reached
11: return $Y = \{\hat{y}_i \mid i = 1, \ldots, N\}$
Because the inference is conducted directly on superpixels instead of pixels, the number of variables is significantly reduced, typically from tens of thousands (e.g. an image of 400 × 300 pixels) to several hundred (usually 100–300 superpixels per image). The inference algorithm therefore converges very fast, typically in fewer than 15 iterations.
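A runnable sketch of Algorithm 1 follows (again illustrative, not the released code): labels start from the logistic regression prediction and are updated greedily until they stabilize. The unary parameters omega and pairwise parameters mu are assumed to be already learned.

```python
import numpy as np

def greedy_inference(X_tilde, P, sizes, omega, mu, max_iters=15):
    """Algorithm 1: iterative greedy maximization of Eq. (14)."""
    N, K = X_tilde.shape
    total = sizes.sum()
    y = X_tilde.argmax(axis=1)              # initialize from Eq. (12) / Eq. (6)
    for _ in range(max_iters):
        changed = False
        for i in range(N):                  # step (1): visit each superpixel in turn
            scores = np.empty(K)
            for c in range(K):              # step (2): score all K candidate labels
                unary = omega[c] @ X_tilde[i]
                pair = sum(mu[c, y[j]] *
                           (P[i, c] * sizes[i] + P[j, y[j]] * sizes[j]) / total
                           for j in range(N) if j != i)   # Eq. (9) votes
                scores[c] = unary + pair
            best = int(scores.argmax())
            if best != y[i]:                # step (3): update if a better label is found
                y[i], changed = best, True
        if not changed:                     # step (4): stop when labels stabilize
            break
    return y
```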
5. Learning
In this section, we discuss how to learn the proposed model, i.e., the model parameters $\omega$. To find the optimal solution $\omega^{*}$, we follow the idea in [27] for structured output prediction, and consider the following maximum-margin optimization problem:
$$\min_{\omega, \xi} \; \frac{1}{2}\|\omega\|^2 + \frac{C}{n} \sum_i \xi_i \quad \text{s.t.} \quad \forall i:\; \xi_i \geq 0; \quad \forall \hat{Y} \in \mathcal{X} \setminus Y:\; \langle \omega, \delta\phi(X, Y, \hat{Y}) \rangle \geq \Delta(\hat{Y}, Y) - \xi_i, \qquad (15)$$

where $\delta\phi(X, Y, \hat{Y}) = \phi(X, Y) - \phi(X, \hat{Y})$, $\xi_i$ is a slack variable that becomes non-zero when the margin is violated, $Y$ is the ground-truth labeling of the given image, and $\mathcal{X}$ is the structured output space. $\Delta(\hat{Y}, Y)$ is the loss function that quantifies how incorrect the prediction $\hat{Y}$ is when $Y$ is the correct output.

One intuitive form of the loss function is the 0–1 loss on each superpixel:

$$\Delta(\hat{Y}, Y) = \sum_i^N \big(1 - \delta(\hat{y}_i, y_i)\big), \qquad (16)$$

where $\delta$ takes 1 when the two values are identical and 0 otherwise.
However, the loss function defined in (16) penalizes incorrect superpixel labelings equally, without taking region size into account: the loss of a large mislabeled superpixel equals that of a very small one. We therefore derive a more appropriate loss function:

$$\Delta(\hat{Y}, Y) = \frac{\sum_i^N \eta\, S_i \big(1 - \delta(\hat{y}_i, y_i)\big)}{\sum_i S_i}, \qquad (17)$$

where $S_i$ is the area of superpixel $i$ and $\eta$ is a weight factor learned by cross validation.
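The size-weighted loss of Eq. (17) is straightforward to compute; a brief sketch, with eta as a free parameter as in the paper:

```python
import numpy as np

def size_weighted_loss(y_pred, y_true, sizes, eta=1.0):
    """Eq. (17): 0-1 loss per superpixel, weighted by superpixel area."""
    wrong = (y_pred != y_true).astype(float)   # 1 - delta(y_hat_i, y_i)
    return eta * (sizes * wrong).sum() / sizes.sum()
```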
Because the structured output space $\mathcal{X}$ grows exponentially with the numbers of superpixels N and object classes K, the number of constraints in (15) is also exponentially large, which makes direct optimization impossible. Current state-of-the-art approaches typically use the cutting plane algorithm proposed by Joachims et al. [19] and their SVM-struct implementation.² For better efficiency, we follow a variant of the cutting plane algorithm presented in [31]. The learning algorithm aims at finding a small set of constraints that ensures a sufficiently accurate solution. It starts with an unconstrained optimization problem as a relaxation of the original problem and maintains a working set W_i. In each iteration through the training process, the "most violated" constraint is selected and added to the working set if a certain condition is satisfied. Once a constraint is added, the problem is re-optimized to obtain a new solution. The iteration stops when the working set no longer changes or the desired precision of the objective is reached.

² The code is available at http://svmlight.joachims.org/svm_struct.html.
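The paper trains with the cutting-plane solver of [19,31] via the SVM-struct package. As a self-contained illustration of the same max-margin objective, the sketch below deliberately swaps in a simpler stochastic subgradient update with a unary-only joint feature map; it is a simplification for exposition, not the method actually used.

```python
import numpy as np

def joint_feature(X, y, K):
    """Simplified joint feature map phi(X, Y): per-class sums of superpixel
    features (the pairwise term of Eq. (13) is omitted for brevity)."""
    phi = np.zeros((K, X.shape[1]))
    for xi, yi in zip(X, y):
        phi[yi] += xi
    return phi.ravel()

def loss_augmented_inference(w, X, y, K):
    """argmax_Y [ <w, phi(X, Y)> + Delta(Y, y) ]; with unary features and a
    per-superpixel 0-1 loss this decomposes over superpixels."""
    W = w.reshape(K, -1)
    scores = X @ W.T + (np.arange(K)[None, :] != y[:, None])  # unary + loss term
    return scores.argmax(axis=1)

def train(data, K, C=1.0, epochs=50):
    """Stochastic subgradient descent on the margin-rescaled objective (15).
    data: list of (X, y) pairs, X of shape (N_superpixels, d)."""
    d = data[0][0].shape[1]
    w = np.zeros(K * d)
    for t in range(1, epochs + 1):
        for X, y in data:
            y_hat = loss_augmented_inference(w, X, y, K)
            # subgradient of the regularized structured hinge loss
            g = w / C - (joint_feature(X, y, K) - joint_feature(X, y_hat, K))
            w -= (1.0 / t) * g              # decreasing step size
    return w
```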
6. Experimental results
In this section, we evaluate the proposed method on two benchmark datasets, the MSRC-21 dataset [32] and the Stanford Background Dataset (SBD) [33], which are widely used for semantic image segmentation evaluation. MSRC-21 consists of 591 images in 21 classes: building, grass, tree, cow, sheep, sky, airplane, water, face, car, bicycle, flower, sign, bird, book, chair, road, cat, dog, body, and boat, with ground truth provided at the pixel level. A void label is included to avoid the membership ambiguity of pixels on object boundaries; it is typically ignored in training and evaluation. Following [2], we divide MSRC-21 into 45% for training, 10% for validation, and 45% for testing. SBD is mostly used for background understanding: various foreground objects, including car, cow, book, boat, chair, person, etc., are merged into one foreground class. It contains 715 images chosen from the following public datasets: LabelMe [34], MSRC-21 [32], PASCAL VOC [35], and Geometric Context [36]. Eight category labels were obtained using Amazon's Mechanical Turk (AMT): sky, tree, road, grass, water, building, mountain, and foreground.

6.1. Influence of over-segmentation

Though over-segmentation is widely adopted as a key preprocessing step in semantic segmentation, its impact on subsequent learning is rarely evaluated. In this section, we test the influence of four popular over-segmentation techniques: Mean Shift (MS) [3], Felzenszwalb and Huttenlocher's efficient graph-based segmentation (FH) [4], SLIC [7], and TurboPixel (TP) [6]. Note that different methods segment images differently: as shown in Fig. 1, where each image is segmented into about 150 superpixels, MS and FH tend to generate larger superpixels in coherent regions and smaller superpixels in complex regions, while SLIC and TurboPixel produce grid-style, balanced superpixels. The question is how such differences in over-segmentation affect subsequent superpixel labeling. To this end, we consider the task of labeling the superpixels from different over-segmentation methods using Logistic Regression. In particular, to stress the role of over-segmentation, no pairwise or global contextual information is incorporated. We use MSRC-21 in this experiment. For the feature representation of superpixels, we combine appearance-based and bag-of-word descriptors (see Section 3.1).

The appearance-based descriptor has 238 features, consisting of (1) color features computing the mean, standard deviation, skewness, and kurtosis statistics of the RGB, Lab, and YCrCb color-space channels and the gray image (4 × 10 dimensions); (2) texture features computing the same statistics of 48 filter responses (4 × 48 dimensions), including first and second derivatives of Gaussians and Laplacians-of-Gaussian at various orientations and scales; (3) shape features (3 dimensions); and (4) location features (3 dimensions). To build the bag-of-word (BoW) representation, we divide an image into 16 × 16 pixel cells with 75% overlap. Each cell is captured by a 128-dimensional SIFT descriptor. The dictionary of 400 visual words is built with K-means clustering, and the descriptors are quantized by nearest neighbor. To represent a superpixel, we concatenate the BoW representations of the superpixel and the region around it, giving a BoW feature vector of length 2 × 400 = 800. The overall representation of each superpixel is thus of 238 + 800 = 1038 dimensions.
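These dimensions compose as in the following toy sketch (all inputs are zero placeholders; theta_a, theta_b are the cross-validated weights of Eq. (3)):

```python
import numpy as np

def superpixel_descriptor(color_tex_stats, shape_loc, bow_inner, bow_neighbor,
                          theta_a=1.0, theta_b=1.0):
    """Assemble the 1038-d raw superpixel feature of Section 6.1 / Eq. (3)."""
    appearance = np.concatenate([color_tex_stats, shape_loc])     # 232 + 6 = 238 dims
    bow = np.concatenate([bow_inner, bow_neighbor])               # 400 + 400 = 800 dims
    return np.concatenate([theta_a * appearance, theta_b * bow])  # 238 + 800 = 1038 dims

s_i = superpixel_descriptor(np.zeros(232), np.zeros(6), np.zeros(400), np.zeros(400))
assert s_i.shape == (1038,)
```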
A Logistic Regression classifier is trained on the training set obtained by the standard split of the MSRC-21 dataset, with the cost parameter set to C = 25. For the evaluation metric, we follow [17] and use the global accuracy, the proportion of correctly labeled pixels among all pixels considered (excluding pixels with the void label):

$$\text{accuracy} = \frac{\sum_i N_{ii}}{\sum_{i,j} N_{ij}}, \qquad (18)$$

where $N_{ij}$ is the number of pixels with ground-truth label $i$ that are labeled as $j$.
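Given a confusion matrix accumulated over all non-void pixels, both reported scores follow directly; a brief sketch:

```python
import numpy as np

def global_accuracy(conf):
    """Eq. (18): trace over total, with conf[i, j] = #pixels of ground-truth
    class i predicted as class j (void pixels excluded beforehand)."""
    return np.trace(conf) / conf.sum()

def average_accuracy(conf):
    """Per-class recall averaged over classes (the 'Average' column in Tables 1-3)."""
    return np.mean(np.diag(conf) / conf.sum(axis=1))
```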
The results are shown in Fig. 2(a), where different numbers of superpixels are tested. We can see that the results of FH, SLIC and TP are relatively robust once the number of superpixels exceeds 50, compared to those of MS. Overall, FH performs slightly better than the rest and is adopted later in our structured prediction model.

Moreover, we compute the performance of the different segmentations when the dominant ground-truth labels are assigned to superpixels, as shown in Fig. 2(b). The accuracies increase with the number of superpixels, as expected: the finer the granularity, the better the segmentation coincides with object boundaries. On the other hand, the performance of semantic segmentation (Fig. 2(a)) does not increase monotonically with the number of superpixels.

Fig. 2. Influence of initial segmentation on semantic segmentation performance: global accuracy (%) versus number of superpixels (50–300) for SLIC, TurboPixel, Mean Shift, and graph-based segmentation. (a) Semantic segmentation with unary features and a linear classifier. (b) Performance obtained by assigning the dominant label to each superpixel.
6.2. Influence of global constraints
In this section, we evaluate the impact of multi-label ranking as a global constraint on semantic segmentation, by integrating the global constraints into the experiment of the previous section. We compare three ways of using the label ranking: the Top n constraint, the Threshold constraint, and our proposed soft constraint (see Section 3.3).
To capture complementary image properties, we use different features for multi-label ranking than the local features for superpixels (see Section 3.1). We adopt kernel descriptors [29] for holistic image representation, constructing them from gradient, color, and local binary pattern match kernels using kernel principal component analysis (KPCA). Following the setting in [29], an image is divided into 16 × 16 pixel patches with 50% overlap to extract low-level features. We compute image-level features using efficient match kernels (EMK) on 1 × 1, 2 × 2, and 4 × 4 pyramid sub-regions, and perform
constrained kernel singular value decomposition (CKSVD) with 1000 visual words learned by K-means. Overall, each image is represented by an 84,000-dimensional feature vector.
We adopt the efficient multi-class ranking algorithm [20] to learn $K$ classification functions $f_t(I): \mathbb{R}^d \to \mathbb{R}$, $t \in C = \{c_1, c_2, \ldots, c_K\}$, one for each class, with the goal that for any image $I$, $f_{c_i}(I)$ scores higher than $f_{c_j}(I)$ when $I$ is more likely to belong to class $c_i$ than to class $c_j$. We compare the kernel descriptor with the widely used spatial pyramid matching (SPM) representation under similar settings. The results, measured by ROC curves, are shown in Fig. 3: the area under the curve (AUC) of SPM is 90.3%, while the AUC of the kernel descriptor is higher, at 94.3%.

Fig. 3. ROC curves of the two feature representations in multi-class ranking. The AUC of the kernel descriptor is 94.3%, while that of spatial pyramid matching is 90.3%.
Now we are ready to report results after integrating the multi-label ranking. Recall that the Top n constraint considers only the top n labels for each image according to the ranking scores, while the Threshold-t constraint retains those with scores greater than t. In contrast, our method converts the ranking scores to a soft constraint using the sigmoid function defined in Eq. (11) (here we set a = 3, b = 3, q = 0.4). Either hard or soft constraint is combined with Logistic Regression as in Eq. (12), and the labels of superpixels are inferred by Eq. (6). The results are shown in Table 1. We can see that the proposed soft constraint outperforms the two hard constraint alternatives across parameter settings.

Table 1. Comparison of different global constraint methods (%).

Method | Global | Average
Local Feature | 72.8 | 58.6
Local Feature + Top 3 labels | 72.6 | 59.8
Local Feature + Top 4 labels | 76.1 | 64.5
Local Feature + Top 5 labels | 77.5 | 65.0
Local Feature + Top 6 labels | 76.8 | 64.9
Local Feature + Top 7 labels | 76.3 | 63.1
Local Feature + Threshold = 0 | 75.9 | 63.0
Local Feature + Threshold = 0.2 | 76.7 | 64.5
Local Feature + Threshold = 0.4 | 77.8 | 66.0
Local Feature + Threshold = 0.6 | 78.0 | 66.7
Local Feature + Threshold = 0.8 | 74.3 | 60.8
Local Feature + Soft const. | 79.1 | 67.7
6.3. Results for MSRC-21
In this section, we report our structured prediction results on MSRC-21. We also report the results obtained by combining local unary features and Logistic Regression, with or without global constraints, where no pairwise co-occurrence information is incorporated. For comparison, we show the results of seven state-of-the-art methods, taken from [10,17,9,37,30,21,22]. The overall results are summarized in Table 2.

Table 2. Quantitative results on the MSRC-21 dataset (per-class recall, global and average accuracy, %). The computation of these scores follows the protocol defined in [17].

Method | Building | Grass | Tree | Cow | Sheep | Sky | Aeroplane | Water | Face | Car | Bicycle | Flower | Sign | Bird | Book | Chair | Road | Cat | Dog | Body | Boat | Global | Average
Gould et al. [10] | 72 | 95 | 81 | 66 | 71 | 93 | 74 | 70 | 70 | 69 | 72 | 68 | 55 | 23 | 83 | 40 | 77 | 60 | 50 | 50 | 14 | 77 | 64
Ladicky et al. [17] | 80 | 96 | 86 | 74 | 87 | 99 | 74 | 87 | 86 | 87 | 82 | 97 | 95 | 30 | 86 | 31 | 95 | 51 | 69 | 66 | 9 | 86 | 75
Gonfaus et al. [9] | 60 | 78 | 77 | 91 | 68 | 88 | 87 | 76 | 73 | 77 | 93 | 97 | 73 | 57 | 95 | 81 | 76 | 81 | 46 | 56 | 46 | 77 | 75
Munoz et al. [37] | 63 | 93 | 88 | 84 | 65 | 89 | 69 | 78 | 74 | 81 | 84 | 80 | 51 | 55 | 84 | 80 | 69 | 47 | 59 | 71 | 24 | 78 | 71
Csurka and Perronnin [30] | 75 | 93 | 78 | 70 | 79 | 88 | 66 | 63 | 75 | 76 | 81 | 74 | 44 | 25 | 75 | 24 | 79 | 54 | 55 | 43 | 18 | 77 | 64
Lucchi et al. [21] | 64 | 94 | 91 | 72 | 87 | 97 | 93 | 76 | 72 | 83 | 86 | 88 | 93 | 62 | 90 | 89 | 85 | 97 | 82 | 83 | 0 | 85 | 77
Boix et al. [22] | 66 | 87 | 84 | 81 | 83 | 90 | 81 | 82 | 78 | 86 | 94 | 96 | 87 | 48 | 90 | 81 | 82 | 0 | 75 | 70 | 52 | 83 | 80
LR w/o global | 66 | 94 | 83 | 50 | 52 | 93 | 68 | 70 | 64 | 51 | 82 | 73 | 54 | 25 | 69 | 40 | 82 | 39 | 18 | 42 | 19 | 73 | 59
LR w/ global | 76 | 98 | 87 | 68 | 66 | 90 | 77 | 70 | 73 | 60 | 84 | 77 | 57 | 32 | 79 | 58 | 86 | 65 | 41 | 57 | 20 | 79 | 68
Structural SVMs | 70 | 98 | 87 | 76 | 79 | 96 | 81 | 75 | 86 | 74 | 88 | 96 | 72 | 36 | 90 | 79 | 87 | 74 | 60 | 54 | 35 | 84 | 76
From Table 2, we can see that using local unary features and Logistic Regression yields a baseline of 73% pixel-wise global accuracy and 59% average per-class accuracy. Note that in this baseline the label of a region (i.e., a superpixel) is decided by its appearance alone. By integrating the multi-label ranking results, we improve the global accuracy by 6% and the average accuracy by 9%. This shows that global cues can effectively guide the labeling of local regions by substantially reducing the potential classes to be considered during labeling: region labels are strengthened if they are consistent with the global ranking and suppressed otherwise. By further refining the labeling with pairwise co-occurrence information in the structural SVM framework, we achieve 84% global accuracy and 76% average accuracy, which is highly competitive with the results reported by previous methods, although our model is much simpler and more efficient
in that we decouple the global constraint from the pairwise potential in joint inference and instead integrate it with the local prediction from Logistic Regression (Section 3.1).
Considering the per-class accuracy, we obtain very good performance on classes such as grass, sky and flower, which can be inferred easily from local appearance; their accuracies are above 95%. For some difficult classes, such as bird and boat, the accuracies are below 40%, due to similar appearance, varying sizes, and complex backgrounds.
Fig. 4 shows example results of our model. Consider the images shown in Fig. 4(a): the labeling results obtained by applying Logistic Regression to local appearance features are shown in Fig. 4(b), where the label of a region is decided by its appearance feature
alone. Taking the first image as an example, partial regions of the bird are mislabeled as dog, cat, sheep or even road because of the ambiguity of local appearance. The multi-label ranking results then give higher confidence to labels such as grass, bird and dog, and suppress the presence of road, sheep and cat, as in Fig. 4(c). Finally, by introducing the co-occurrence property, regions labeled as bird suppress the presence of dog in the same image, because these two classes rarely appear together. Fig. 4(d) shows the labeling results obtained by our final structured prediction, post-processed by grouping superpixels into larger groups; the final results are much cleaner and more consistent.

Fig. 4. Example results on the MSRC-21 dataset by our model. (a) Original images. (b) Logistic Regression prediction. (c) Logistic Regression prediction with global constraint. (d) Structured prediction results with multi-class labeling prior and contextual information. (e) Ground-truth labeling.
The proposed method is very efficient: training the structural SVM on a training set of 335 samples from MSRC-21 takes about 800 s, and labeling one test image takes about 1 s. These timings were obtained in MATLAB 7.10.0 (R2010a), 64-bit, on a laptop with a 2.67 GHz i5 CPU and 8 GB RAM.
6.4. Results on Stanford Background dataset
In this section, we report our results on SBD. We follow [33] and perform 5-fold cross-validation, with the dataset randomly divided into 572 training images and 143 test images for each fold. The results are shown in Table 3. We can see that our structured prediction model performs favorably compared to other state-of-the-art methods. We also observe that the incorporation of the global label ranking, although useful, does not improve the performance significantly. This can probably be explained from two aspects: first, in SBD the foreground class includes a wide range of object classes, such as person, car, cow, sheep and bicycle, and their
appearances vary drastically among classes, making the appearance very difficult to model; second, the number of classes in SBD is much smaller than in MSRC-21, so the multi-label ranking and co-occurrence statistics are less informative. Besides foreground, another challenging class is mountain, which has few instances in the dataset, making it very hard to label correctly.

Table 3. Quantitative results on the Stanford Background dataset (per-class recall, global and average accuracy, %).

Method | Sky | Tree | Road | Grass | Water | Building | Mountain | Foreground | Global | Average
Gould et al. [33] | 92.6 | 61.4 | 89.6 | 82.4 | 47.9 | 82.4 | 13.8 | 53.7 | 76.4 | 65.5
Munoz et al. [37] | 91.6 | 66.3 | 86.7 | 83.0 | 59.8 | 78.4 | 5.0 | 63.5 | 76.9 | 66.2
LR w/o global | 91.9 | 70.0 | 88.9 | 78.3 | 54.9 | 79.6 | 6.3 | 54.2 | 76.6 | 65.5
LR w/ global | 92.4 | 70.0 | 89.3 | 77.8 | 51.5 | 79.3 | 3.0 | 57.9 | 77.0 | 65.2
Structural SVMs | 94.9 | 69.7 | 90.0 | 81.4 | 60.9 | 79.9 | 13.5 | 54.1 | 77.8 | 68.0
Some example results of our model are shown in Fig. 5, where we can see that local labeling is not sufficient to resolve the appearance ambiguity (Fig. 5(b)). In the presence of pairwise and global cues, the labeling becomes more robust, as shown in Fig. 5(d).

Fig. 5. Example results on the Stanford Background dataset by our model. (a) Original images. (b) Logistic Regression prediction. (c) Logistic Regression prediction with global constraint. (d) Structured prediction results with multi-class labeling prior and contextual information. (e) Ground-truth labeling. Note that objects such as cars, people, and horses are merged into one foreground class.
7. Conclusion
We have presented a new structured prediction model for semantic segmentation. Traditional structured prediction frameworks using pairwise constraints alone suffer from degeneration when a notable number of regions within an image are mislabeled by the early-stage Logistic Regression prediction, because the wrong contextual information of mislabeled regions may propagate to correctly labeled ones. It is therefore necessary to confine the possible labels at the image level. We utilized the multi-label ranking score and converted it into a soft global constraint, which encourages the presence of likely labels while suppressing unlikely ones. Compared with other existing global constraint schemes, we decoupled the global constraint from the pairwise constraint and integrated it directly with the unary potential, making the model much simpler while remaining efficient. The proposed model was evaluated on two challenging datasets, and experiments showed that it obtains highly competitive performance compared with state-of-the-art results.

In future work, we plan to integrate multi-source cues, such as depth, into the structural SVM framework. So far we only consider extracting multi-scale cues from a single source, namely the optical image. Features from multiple sources could contain complementary information and be potentially useful for improving performance.
Acknowledgments
This work was supported by the National Key Project for Basic Research of China (2013CB329403), the Natural Science Foundation of China (No. 61373076), the Fundamental Research Funds for the Central Universities (No. 2013121026), and the 985 Project of Xiamen University.
References
[1] J. Shotton, J. Winn, C. Rother, A. Criminisi, Textonboost: joint appearance,
shape and context modeling for multi-class object recognition and
segmentation, in: Proceedings of European Conference on Computer Vision
(ECCV), 2006, pp. 1–15.
[2] J. Shotton, M. Johnson, R. Cipolla, Semantic texton forests for image
categorization and segmentation, in: Proceedings of IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
[3] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space
analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (5)
(2002) 603–619.
[4] P. Felzenszwalb, D. Huttenlocher, Efficient graph-based image segmentation,
International Journal of Computer Vision 59 (2) (2004) 167–181.
[5] A. Vedaldi, S. Soatto, Quick shift and kernel methods for mode seeking, in:
Proceedings of European Conference on Computer Vision (ECCV), 2008, pp.
705–718.
[6] A. Levinshtein, A. Stere, K. Kutulakos, D. Fleet, S. Dickinson, K. Siddiqi,
Turbopixels: fast superpixels using geometric flows, IEEE Transactions on
Pattern Analysis and Machine Intelligence 31 (12) (2009) 2290–2297.
[7] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. Süsstrunk, SLIC Superpixels, EPFL Technical Report 149300, June 2010.
[8] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, S. Belongie, Objects in
context, in: Proceedings of the IEEE International Conference on Computer
Vision (ICCV), 2007, pp. 1–8.
[9] J. Gonfaus, X. Boix, J. Van De Weijer, A. Bagdanov, J. Serrat, J. Gonzalez, Harmony potentials for joint classification and segmentation, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3280–3287.
[10] S. Gould, J. Rodgers, D. Cohen, G. Elidan, D. Koller, Multi-class segmentation
with relative location prior, International Journal of Computer Vision 80 (3)
(2008) 300–316.
[11] C. Galleguillos, A. Rabinovich, S. Belongie, Object categorization using cooccurrence, location and appearance, in: Proceedings of IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
[12] N. Plath, M. Toussaint, S. Nakajima, Multi-class image segmentation using
conditional random fields and global classification, in: Proceedings of
International Conference on Machine Learning (ICML), 2009, pp. 817–824.
[13] G. Csurka, F. Perronnin, A simple high performance approach to semantic
segmentation, in: Proceedings of British Machine Vision Conference (BMVC),
2008.
[14] P. Kohli, L. Ladický, P. Torr, Robust higher order potentials for enforcing label consistency, International Journal of Computer Vision 82 (3) (2009) 302–324.
[15] L. Ladicky, C. Russell, P. Kohli, P. Torr, Graph cut based inference with cooccurrence statistics, in: Proceedings of European Conference on Computer
Vision (ECCV), 2010, pp. 239–253.
[16] P. Kohli, M. Kumar, P. Torr, P³ & beyond: solving energies with higher order
cliques, in: Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2007, pp. 1–8.
[17] L. Ladicky, C. Russell, P. Kohli, P. Torr, Associative hierarchical CRFs for object
class image segmentation, in: Proceedings of the IEEE International Conference
on Computer Vision (ICCV), 2009, pp. 739–746.
[18] J. Zhu, E. Xing, B. Zhang, Laplace maximum margin Markov networks, in:
Proceedings of International Conference on Machine Learning (ICML), 2008,
pp. 1256–1263.
[19] T. Joachims, T. Finley, C. Yu, Cutting-plane training of structural SVMs,
Machine Learning 77 (1) (2009) 27–59.
[20] S. Bucak, P. Kumar Mallapragada, R. Jin, A. Jain, Efficient multi-label ranking for
multi-class learning: application to object recognition, in: Proceedings of the
IEEE International Conference on Computer Vision (ICCV), 2009, pp. 2098–
2105.
[21] A. Lucchi, Y. Li, X. Boix, K. Smith, P. Fua, Are spatial and global constraints really
necessary for segmentation? in: IEEE International Conference on Computer
Vision (ICCV), 2011, pp. 9–16.
[22] X. Boix, J. Gonfaus, J. van de Weijer, A. Bagdanov, J. Serrat, J. Gonzàlez, Harmony
potentials, International Journal of Computer Vision 96 (1) (2012) 83–102.
[23] S. Nowozin, P. Gehler, C. Lampert, On parameter learning in CRF-based
approaches to object class image segmentation, in: Proceedings of European
Conference on Computer Vision (ECCV), 2010, pp. 98–111.
[24] G. Heitz, S. Gould, A. Saxena, D. Koller, Cascaded classification models:
combining models for holistic scene understanding, in: Proceedings of Neural
Information Processing Systems (NIPS), 2008.
[25] S. Gould, T. Gao, D. Koller, Region-based segmentation and object detection, in:
Proceedings of Neural Information Processing Systems (NIPS), vol. 1, 2009.
[26] D. Hoiem, A. Efros, M. Hebert, Closing the loop in scene interpretation, in:
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2008, pp. 1–8.
[27] I. Tsochantaridis, T. Hofmann, T. Joachims, Y. Altun, Support vector machine
learning for interdependent and structured output spaces, in: Proceedings of
International Conference on Machine Learning (ICML), 2004, pp. 104–111.
[28] B. Fulkerson, A. Vedaldi, S. Soatto, Class segmentation and object localization
with superpixel neighborhoods, in: Proceedings of the IEEE International
Conference on Computer Vision (ICCV), 2009, pp. 670–677.
[29] L. Bo, X. Ren, D. Fox, Kernel descriptors for visual recognition, in: Proceedings
of Neural Information Processing Systems (NIPS), 2010.
[30] G. Csurka, F. Perronnin, An efficient approach to semantic segmentation,
International Journal of Computer Vision 95 (2) (2011) 198–212.
[31] C. Desai, D. Ramanan, C. Fowlkes, Discriminative models for multi-class object
layout, in: Proceedings of the IEEE International Conference on Computer
Vision (ICCV), 2009, pp. 229–236.
[32] A. Criminisi, Microsoft Research Cambridge Object Recognition Image Database.
<http://research.microsoft.com/en-us/projects/objectclassrecognition>.
[33] S. Gould, R. Fulton, D. Koller, Decomposing a scene into geometric and
semantically consistent regions, in: Proceedings of IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1–8.
[34] B. Russell, A. Torralba, K. Murphy, W. Freeman, LabelMe: a database and web-based tool for image annotation, International Journal of Computer Vision 77 (1–3) (2008) 157–173.
[35] M. Everingham, L. Van Gool, C. Williams, J. Winn, A. Zisserman, The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results, 2007.
[36] D. Hoiem, A. Efros, M. Hebert, Recovering surface layout from an image,
International Journal of Computer Vision 75 (1) (2007) 151–172.
[37] D. Munoz, J. Bagnell, M. Hebert, Stacked hierarchical labeling, in: Proceedings
of European Conference on Computer Vision (ECCV), 2010, pp. 57–70.
[38] Z. Li, X.-M. Wu, S.-F. Chang, Segmentation using superpixels: A bipartite graph
partitioning approach, in: Proceedings of IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2012, pp. 789–796.