
arXiv:1406.2375v2 [cs.CV] 11 Jun 2014
CBMM Memo No. 018
June 13, 2014
Parsing Semantic Parts of Cars Using Graphical
Models and Segment Appearance Consistency
by
Wenhao Lu¹, Xiaochen Lian², Alan Yuille²
¹Tsinghua University, ²University of California, Los Angeles
yourslewis@gmail.com  lianxiaochen@ucla.edu  yuille@stat.ucla.edu
Abstract: This paper addresses the problem of semantic part parsing (segmentation) of cars, i.e. assigning every
pixel within the car to one of the parts (e.g. body, window, lights, license plates and wheels). We formulate this as a
landmark identification problem, where a set of landmarks specifies the boundaries of the parts. A novel mixture of
graphical models is proposed, which dynamically couples the landmarks to a hierarchy of segments. When modeling
the pairwise relations between landmarks, this coupling enables our model to exploit the local image content in addition to
spatial deformation, an aspect that most existing graphical models ignore. In particular, our model enforces appearance
consistency between segments within the same part. Parsing the car, including finding the optimal coupling between
landmarks and segments in the hierarchy, is performed by dynamic programming. We evaluate our method on a subset
of PASCAL VOC 2010 car images and on the car subset of the 3D Object Category dataset (CAR3D). We show good results
and, in particular, quantify the effectiveness of using segment appearance consistency in terms of the accuracy of part
localization and segmentation.
This work was supported by the Center for Brains, Minds and Machines
(CBMM), funded by NSF STC award CCF - 1231216.
Figure 1: The goal of car parsing is to detect the locations of semantic parts and to perform object part segmentation. The inputs (left) are images of a car taken from different viewpoints. The outputs (right) are the locations of
the car parts – the wheels, lights, windows, license plates and bodies – so that each pixel within the car is assigned
to a part.
[Figure 2 panels: front/back; right front/left back I; right front/left back II; right side.]
Figure 2: The proposed mixture-of-trees model. Models of left-front, right-back and right views are not shown due
to the symmetry. Landmarks connected by solid lines of the same color belong to the same semantic part.
The black dashed lines show the links between different parts. Best viewed in color.
1 Introduction
This paper addresses the two goals of parsing an object into its semantic parts and performing object part segmentation, so that each pixel within the object is assigned to one of the parts (i.e. all pixels in the object are labeled).
More specifically, we attempt to parse cars into wheels, lights, windows, license plates and body, as illustrated
in Figure 1. This is a fine-scale task, which differs from the classic task of detecting an object by estimating a
bounding box.
We formulate the problem as landmark identification. We first select representative locations on the boundaries
of the parts to serve as landmarks. They are selected so that locating them yields the silhouette of the parts,
and hence enables us to do object part segmentation. We use a mixture of graphical models to deal with different
viewpoints so that we can take into account how the visibility and appearance of parts alter with viewpoint (see
Figure 2).
A novel aspect of our graphical model is that we couple the landmarks with the segmentation of the image
to exploit the image content when modeling the pairwise relations between neighboring landmarks. In the ideal
case, where the part boundaries of the car are all preserved by the segmentation, we can assume that the landmarks lie
near the boundaries between different segments. Each landmark is then associated with the appearance of its two
closest segments. This enables us to associate appearance information to the landmarks and to introduce pairwise
coupling terms which enforce that the appearance is similar within parts and different between parts. We call this
segmentation appearance consistency (SAC) between segments of neighboring landmarks. This is illustrated in
Figure 3, where both neighboring landmarks (the red and green squares) on the boundary between the
window and the body have two segments (belonging to the window and the body respectively) close to them. Segments
from the same part tend to have homogeneous color and texture appearance (e.g. a and c, b and d in the figure),
while segments from different parts usually do not (e.g. a and b, c and d in the figure). The four blue dashed lines
in the figure correspond to the SAC terms whose strengths will be learnt.
However, in practice it is rarely possible to capture all part boundaries with a single-level segmentation.
Instead, prior work uses a pool of segmentations [11, 3, 14] or segmentation trees [1, 15, 23]. Inspired by these,
we couple the landmarks to a hierarchical segmentation of the image. However, the differences in the sizes of the
parts (e.g. the license plate is much smaller than the body) and the variability of the images mean that the optimal
segmentation level for each part also varies. Therefore the level of the hierarchy used in this coupling must be
chosen dynamically during inference/parsing. This leads us to treat the level of the hierarchy for each part as a
hidden variable. By doing this, our model is able to automatically select the most suitable segmentation level for
each part while parsing the image.
Figure 3: Illustration of segmentation appearance consistency (SAC) and segment pairs. Red and green squares
represent two neighboring landmarks lying on the boundary between the window and the body. Each landmark has two
segments close to it (a and b for the red landmark, c and d for the green landmark). Our method models and learns
the SACs for every pair of neighboring landmarks (blue dashed lines) and uses them to enhance the reliability
of landmark localization. The blue landmark shares the segment pair of the red landmark, which is its closest
location on the boundary.
2 Related Work
There is an extensive literature dating back to Fischler and Elschlager [10] which represents objects using graphical
models. Nodes of the graphs typically represent distinctive regions or landmark points. These models are typically
used for detecting objects [9, 8] but they can also be used for parsing objects by using the positions of the nodes to
specify the locations of different parts of the object. For example, Zhu et al. [25, 26] use a compositional AND/OR
graph to parse baseball players and horses. More recently, in Zhu and Ramanan’s graphical model for faces [27]
there are nodes which correspond to the eyes and mouth of the face. But we note that these types of models
typically only output a parse of the object and are not designed to perform object part segmentation. They do not
exploit the SAC either.
Recently, a very similar graphical model for cars was proposed by Hejrati and Ramanan [13]; it cannot
do part segmentation since each part is represented by only one node. The more significant difference is that their
binary terms do not consider the local image content.
There are, however, some recent graphical models that can perform object part segmentation. Bo and Fowlkes [2]
use a compositional model to parse pedestrians, where the semantic parts of pedestrians are composed of segments
generated by the UCM algorithm [1] (they select high scoring segments to form semantic parts and use heuristic
rules for pruning the space of parses). Thomas et al. [21] use Implicit Shape Models to determine the semantic part
label of every pixel. Eslami and Williams [5] extend the Shape Boltzmann Machine to model semantic parts and
enable object part segmentation. [21] and [5] performed car part segmentation on the ETHZ car dataset [21], which contains
non-occluded cars from a single (semi-profile) viewpoint.
Image labeling is a related problem since it requires assigning labels to pixels, such as [20, 17, 16, 4, 7, 22]. But
these methods are applied to labeling all the pixels of an image, and are not intended to detect the position of
objects or perform object part segmentation.
3 The Method for Parsing Cars
We represent the car and its semantic parts by a mixture of tree-structured graphical models, one model for each
viewpoint. Each model is represented by a graph G = (V, E). The nodes V correspond to landmark points. They are divided
into subsets V = V_1 ∪ · · · ∪ V_N, where N is the number of parts and V_p consists of the landmarks lying at the boundaries of
semantic part p. The edge structure E is manually designed (see Figure 2).
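For concreteness, here is a minimal sketch of how one such mixture component could be stored as a data structure; the class name, field names and the toy right-side example are illustrative, not taken from the paper's implementation:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class TreeModel:
    """One mixture component G = (V, E) for a single viewpoint."""
    # landmark ids grouped by semantic part: V is the union of the V_p
    parts: Dict[str, List[int]] = field(default_factory=dict)
    # manually designed tree edges between landmark ids
    edges: List[Tuple[int, int]] = field(default_factory=list)

# toy example of a component with two parts (ids and names are illustrative)
right_side = TreeModel(
    parts={"wheel_front": [0, 1, 2, 3], "body": [4, 5, 6]},
    edges=[(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)],
)
```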
We define an energy function for each graphical model, which consists of unary terms at the landmarks and
binary terms at the edges. The binary terms not only model the spatial deformations as in [27, 8], but also utilize
the local image content, i.e. the segment appearance consistency (SAC) between neighboring landmarks.
To do that, we couple the landmarks to a hierarchical segmentation of the image which is obtained by the SWA
algorithm [19] (see Figure 4 for a typical SWA segmentation hierarchy). Then we associate with each image location
at every segmentation level a pair of nearby segments: If a location is on the segment boundary, the two segments
Figure 4: The segments output by SWA at six levels. Note how the segments covering the semantic parts change
from level 1 to level 6 (e.g. the left windows and the left wheels). This illustrates that different parts need different levels of
segmentation. For example, the best level for the left-back wheel is level 4 and the best level for the left windows
is level 5. Best viewed in color.
are on either side of the boundary; otherwise it shares the same segment pair as the nearest boundary location.
SAC terms are then used to model the four pairings of segments from neighboring landmarks (see the blue dashed lines
in Figure 3 for an example). The strengths of the SAC terms are learnt from data. In order to do the learning, the four
pairings need to be ordered, or equivalently, the two segments of each location need to be represented in the form of
an ordered tuple (s1, s2). In practice, choosing two segments for a location on a segment boundary and ordering them
is not straightforward (e.g. at a location on a T-junction, more than two segments are nearby). The
technical details about segment pairs are given in Section 3.3.
3.1 Score Function
In this section we describe the score function for each graphical model, which is the sum of unary potentials
defined at the graph nodes, representing the landmarks, and binary potentials defined over the edges connecting
neighboring landmarks.
We first define the variables of the graph. Each node i has a landmark pixel position l_i = (x_i, y_i). The set of
all positions is denoted by L = {l_i, i = 1, ..., |V|}. We denote by p_i the indicator specifying which part landmark i belongs
to, and by h(p) the segmentation level of part p. The segment pair of node i, s_i, can then be seen as a function
of h(p_i), which we denote by s_{i,h} for simplicity. Similarly to the definition of L, we have H = {h(p), p = 1, ..., N} and
S(H) = {s_{i,h}, i = 1, ..., |V|}. The score function of the model for viewpoint v is

    S(L, H, v | I) = φ(L, H, v | I) + ψ(L, H, v | I) + β_v        (1)

In the following we omit v for simplicity. The unary term φ(L, H | I) is expressed as:

    φ(L, H | I) = Σ_{i ∈ V} [ w_i^f · f(l_i | I) + w_i^e · e(h(p_i), l_i | I) ]        (2)
The first term in the bracket of Equation 2 measures the appearance evidence for landmark i at location li . We
write f (li | I) for the HOG feature vector (see Section 3.3 for detail) extracted from li in image I. In the second
term, the term e(h(pi ), li | I) is equal to one minus the distance between li and the closest segment boundary at
segmentation level h(pi ). This function penalizes landmarks being far from edges. The unary terms encourage
locations with distinctive local appearances and with segment boundaries nearby to be identified as landmarks.
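As a rough illustration of this unary term, the sketch below evaluates Equation 2 for a single landmark candidate. The function and argument names are hypothetical, and the distance to the nearest segment boundary is assumed to have been normalized so that e(·) stays in a sensible range:

```python
import numpy as np

def unary_score(w_f, w_e, hog_feat, dist_to_boundary):
    """Sketch of the unary term of Eq. (2) for one landmark candidate.

    hog_feat         -- HOG descriptor f(l_i | I) at candidate location l_i
    dist_to_boundary -- distance from l_i to the closest segment boundary at
                        level h(p_i), assumed normalized (an assumption here)
    """
    appearance = np.dot(w_f, hog_feat)           # w_i^f . f(l_i | I)
    edge_term = w_e * (1.0 - dist_to_boundary)   # w_i^e * e(h(p_i), l_i | I)
    return appearance + edge_term
```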
The binary term ψ(L, H | I) is:

    ψ(L, H | I) = Σ_{(i,j) ∈ E} w_{i,j}^d · d(l_i, l_j) + Σ_{(i,j) ∈ E, p_i = p_j} w_{i,j}^A · A(s_{i,h}, s_{j,h} | I)        (3)

Here d(l_i, l_j) = (−|x_i − x_j − x̄_{ij}|, −|y_i − y_j − ȳ_{ij}|) measures the deformation cost for connected pairs of landmarks, where
x̄_{ij} and ȳ_{ij} are the anchor (mean) displacements between landmarks i and j. We adopt the L1 norm to enhance our model's
robustness to deformation. In the second term of Equation 3, A(s_{i,h}, s_{j,h} | I) = (α(s^1_{i,h}, s^1_{j,h} | I), α(s^1_{i,h}, s^2_{j,h} | I),
α(s^2_{i,h}, s^1_{j,h} | I), α(s^2_{i,h}, s^2_{j,h} | I)) is a vector storing the pairwise similarities between the segments of nodes i and j.
This, together with the strength term w_{i,j}^A, models the SAC. The computation of α(·, · | I) is given in Section
3.3. Finally, β is a mixture-specific scalar bias.
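A corresponding sketch of the binary term of Equation 3 for a single edge; all names are hypothetical, and the SAC vector A_ij is assumed to have been precomputed from the segment-pair similarities described in Section 3.3:

```python
import numpy as np

def binary_score(w_d, w_A, li, lj, anchor_ij, A_ij, same_part):
    """Sketch of the binary term of Eq. (3) for one edge (i, j).

    li, lj    -- landmark positions (x, y)
    anchor_ij -- anchor (mean) displacement (x_bar, y_bar) between i and j
    A_ij      -- 4-vector of segment similarities alpha(., .), or None
    same_part -- the SAC term is applied only when p_i == p_j
    """
    # L1 deformation cost d(l_i, l_j)
    d = -np.abs(np.array(li) - np.array(lj) - np.array(anchor_ij))
    score = np.dot(w_d, d)
    if same_part and A_ij is not None:
        score += np.dot(w_A, A_ij)   # w_{i,j}^A . A(s_{i,h}, s_{j,h} | I)
    return score
```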
The parameters of the score function are W = {w_i^f} ∪ {w_i^e} ∪ {w_{i,j}^d} ∪ {w_{i,j}^A} ∪ {β}. Note that the score function
is linear in W; therefore, similarly to [8], we can express the model more simply as

    S(L, H | I) = w · Φ(L, H | I)        (4)

where w is formed by concatenating the parameters W into a vector.
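Because the score is linear in the parameters, scoring a candidate configuration reduces to a single dot product; a small illustrative sketch (the block names are arbitrary, the only assumption being that weight and feature blocks share the same keys):

```python
import numpy as np

def linear_score(weight_blocks, feature_blocks):
    """Sketch of Eq. (4): concatenate parameter blocks into w and feature
    blocks into Phi(L, H | I), then score with one dot product."""
    w = np.concatenate([weight_blocks[k] for k in sorted(weight_blocks)])
    phi = np.concatenate([feature_blocks[k] for k in sorted(feature_blocks)])
    return float(np.dot(w, phi))
```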
Figure 5: The landmark annotations for typical images. Yellow dots are the chosen landmark locations. Please
refer to Section 3.3 for landmark selection criteria.
Figure 6: Illustration of segment pair assignment. Right: the look-up table for segment pair assignment, which is
divided into two parts (separated by the dashed line); white represents 1 and black represents 0. Left: an example
of how the binary matrix m(p) is constructed for a location p and how its segment pair is determined. The hit of m(p)
in the look-up table is marked by the red rectangle. Best viewed in color.
3.2 Inference and Learning
Inference. The viewpoint v, the positions of the landmarks L and the segmentation levels H are unobserved. Our
model detects the landmarks and searches for the optimal viewpoint and segmentation levels of parts simultaneously,
as expressed by the following equation,
    S(I) = max_v [ max_{L,H} S(L, H, v | I) ]        (5)
The outer maximization is done by enumerating all mixtures. Within each mixture, we apply dynamic programming
to estimate the segmentation levels and landmark positions of the parts. The silhouette of each part can then be
inferred directly from its landmarks. In our experiments, inference took between half a minute and one minute on an
image of about 300 pixels in height.
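To make the dynamic program concrete, here is a generic max-sum (Viterbi) sketch over a tree, where a node's state would stand for a discretized landmark location combined with the segmentation level of its part; the function, the variable names and the toy chain example are illustrative only:

```python
import numpy as np

def max_sum_tree(order, children, parent, unary, pairwise):
    """Toy max-sum pass over a tree-structured model.

    order    -- node ids, root first, each parent before its children
    children -- dict: node id -> list of child ids
    parent   -- dict: child id -> parent id (root absent)
    unary    -- dict: node id -> score vector over candidate states
    pairwise -- dict: (parent, child) -> matrix [parent_state, child_state]
    Returns the maximal total score over all state assignments.
    """
    msg = {}
    for i in reversed(order):                        # children before parents
        belief = unary[i] + sum(msg[c] for c in children.get(i, []))
        if i in parent:                              # send message upwards
            msg[i] = np.max(pairwise[(parent[i], i)] + belief[None, :], axis=1)
        else:                                        # i is the root
            return float(np.max(belief))

# toy usage: a 3-node chain with 2 candidate states per node
order, children, parent = [0, 1, 2], {0: [1], 1: [2]}, {1: 0, 2: 1}
unary = {i: np.array([0.0, 1.0]) for i in order}
pairwise = {(0, 1): np.zeros((2, 2)), (1, 2): np.zeros((2, 2))}
print(max_sum_tree(order, children, parent, unary, pairwise))   # prints 3.0
```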
Learning. We learn the model parameters by training our method for car detection (this is simpler than training
it for part segmentation). We use a set of image windows as training data, where windows containing cars are
labeled as positive examples and windows not containing cars are negative examples. A loss function is specified
as:
    J(w) = (1/2) ‖w‖² + C Σ_i max(0, 1 − t_i · max_{L_i, H_i} w · Φ(L_i, H_i | I_i))        (6)
where t_i ∈ {1, −1} is the class label of the object in the training image and C is a constant. Let us take a closer
look at the inner maximization. The segmentation levels of the semantic parts H are hidden and need to be
estimated. The landmarks in the training images are not perfectly annotated (e.g. they are not exactly on segment
boundaries). To reduce the effect of such imprecision during learning, we allow landmark locations to change
within a small range (i.e. the locations of landmarks also become hidden variables), as long as the shifted HOG boxes cover
at least 60% of the true HOG boxes. The CCCP algorithm [24] is used to estimate the parameters by minimizing
the loss function through alternating inference and optimization.
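The sketch below shows the flavor of this alternation in a latent-SVM style; it is a simplification (see the comment about negative examples), and all function names and hyper-parameters are assumptions rather than the authors' implementation:

```python
import numpy as np

def train_latent(samples, labels, infer_latent, feature, dim,
                 C=1.0, n_outer=5, n_epochs=10, lr=1e-3):
    """Rough sketch of the alternation used by CCCP-style training of Eq. (6).

    samples      -- training windows (positive: car, negative: background)
    labels       -- class labels t_i in {+1, -1}
    infer_latent -- callable(w, sample) -> best hidden (L_i, H_i) under w
    feature      -- callable(sample, latent) -> feature vector Phi
    Note: a faithful CCCP only fixes the hidden variables of the positive
    examples; for negatives the max over (L_i, H_i) stays inside the loss.
    This sketch glosses over that distinction for brevity.
    """
    w = np.zeros(dim)
    for _ in range(n_outer):
        # (a) impute hidden variables with the current model
        latents = [infer_latent(w, s) for s in samples]
        # (b) reduce the convexified hinge loss by subgradient steps
        for _ in range(n_epochs):
            for s, t, z in zip(samples, labels, latents):
                phi = feature(s, z)
                margin = t * np.dot(w, phi)
                grad = w - (C * t * phi if margin < 1 else 0.0)
                w = w - lr * grad
    return w
```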
3.3 Implementation Details
Landmarks. The landmarks are specified manually for each viewpoint. They are required to lie on the boundaries
between the car and background (contour landmarks) or between parts (inner landmarks), so that the silhouettes
of parts and the car itself can be identified from landmarks. For front/back view, we use 69 landmarks; for left and
right side views, we use 74 landmarks; for the other views, we use 88 landmarks. The assignment of landmarks to
Figure 7: Cumulative localization error distribution for the parts (body, windows, lights, wheels and license plates)
on (a) PASCAL VOC 2010 and (b) CAR3D. The x-axis is the average localization error normalized by image width,
and the y-axis is the fraction of test images. The red solid lines are the performance using SAC and the blue dashed
lines are the performance of [27].
parts is determined by the following rule: contour landmarks are assigned to the parts they belong to (e.g. the landmarks
on the lower half of a wheel), and inner landmarks are assigned to the parts that they surround (e.g. the landmarks
around a license plate). See Figure 5 for some examples.
Appearance features at landmarks. The appearance features f at the landmarks are HOG features. More
specifically, we calculate the HOG descriptor of an image patch centered at the landmark. The patch size is
determined by the 80th percentile of the distances between neighboring landmarks in the training images.
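A small sketch of how such a descriptor could be extracted around a landmark, assuming scikit-image is available; the function name, the patch handling and the HOG cell settings are illustrative choices, not the paper's exact configuration:

```python
import numpy as np
from skimage.feature import hog  # scikit-image, assumed available

def landmark_hog(image_gray, landmark_xy, neighbor_dists):
    """Sketch: HOG descriptor of a square patch centered at a landmark, with
    the patch size set from the 80th percentile of the distances between
    neighboring landmarks (assumes the landmark is not too close to the
    image border)."""
    half = max(8, int(np.percentile(neighbor_dists, 80)) // 2)
    x, y = landmark_xy
    patch = image_gray[max(0, y - half):y + half, max(0, x - half):x + half]
    return hog(patch, orientations=9,
               pixels_per_cell=(8, 8), cells_per_block=(2, 2))
```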
Appearance similarity between segments. The similarity α(·, ·) is a two-dimensional vector whose components
are the χ² distances between two types of features of the segments: color histograms and grey-level co-occurrence
matrices (GLCM) [12]. The color histograms are computed in HSV space. They have 96 bins: 12 bins in the
hue plane and 8 bins in the saturation plane. The GLCM feature is computed as follows: we choose 8 measurements of
the co-occurrence matrix, including HOM, ASM, MAX and the means (variances and covariance) of x and y (please
refer to [12] for details); the GLCM feature is computed in the R, G and B channels in 4 directions (0, 45, 90 and 135
degrees); as a result, the final feature length is 96 (8 measurements × 3 channels × 4 directions).
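A minimal sketch of the color-histogram half of α(·, ·); the GLCM half would be computed analogously (e.g. with scikit-image's gray-level co-occurrence utilities). The function names and the use of scikit-image for the HSV conversion are assumptions, not the paper's implementation:

```python
import numpy as np
from skimage.color import rgb2hsv  # scikit-image, assumed available

def hsv_histogram(segment_pixels_rgb, h_bins=12, s_bins=8):
    """Sketch: the 12 x 8 = 96-bin hue/saturation histogram of a segment.
    segment_pixels_rgb -- (N, 3) array of the segment's RGB pixel values."""
    hsv = rgb2hsv(segment_pixels_rgb.reshape(-1, 1, 3))
    h, s = hsv[:, 0, 0], hsv[:, 0, 1]          # both lie in [0, 1]
    hist, _, _ = np.histogram2d(h, s, bins=[h_bins, s_bins],
                                range=[[0, 1], [0, 1]])
    hist = hist.ravel()
    return hist / max(hist.sum(), 1.0)

def chi2_distance(p, q, eps=1e-10):
    """Chi-squared distance between two normalized histograms."""
    return 0.5 * float(np.sum((p - q) ** 2 / (p + q + eps)))
```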
Segment pair assignment. To determine the segment pairs for locations on boundaries, we build a look-up table
which consists of 32 3-by-3 binary matrices, as shown on the right of Figure 6. At each boundary location p we
construct a 3-by-3 binary matrix m(p) according to the segmentation pattern of its 3-by-3 neighborhood: locations
covered by the same segment as p are given value 1 and other locations are given value 0. We denote the segment
which p belongs to by g_p, and the segment which most 0-valued locations in m(p) belong to by ḡ_p. See the left part
of Figure 6 for an example. If m(p) matches one of the upper 16 matrices, g_p becomes s1 and ḡ_p becomes s2 of p; if it
matches one of the lower 16 matrices, ḡ_p becomes s1 and g_p becomes s2 of p. See the appendix for more information.
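A sketch of this assignment step, assuming the image has already been reduced to a 2-D array of segment labels at the chosen level and that the two halves of the look-up table are available as sets of 8-bit strings; all names are illustrative:

```python
import numpy as np

# Clockwise order of the 8 neighbors, starting from the upper-left cell
# (matching the string construction described in the appendix).
CLOCKWISE = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
             (1, 1), (1, 0), (1, -1), (0, -1)]

def neighborhood_string(labels, p):
    """Binary string of the 3x3 neighborhood of boundary location p = (row, col):
    1 where the neighbor lies in the same segment as p, 0 otherwise.
    Assumes p is an interior pixel of the label map."""
    y, x = p
    return ''.join('1' if labels[y + dy, x + dx] == labels[y, x] else '0'
                   for dy, dx in CLOCKWISE)

def assign_segment_pair(labels, p, upper_table, lower_table):
    """Sketch of the look-up: returns (s1, s2) segment ids for p, or None.

    upper_table / lower_table -- the two halves of the 32-entry table,
    stored as sets of 8-character binary strings (assumed precomputed).
    """
    b = neighborhood_string(labels, p)
    y, x = p
    g_p = labels[y, x]
    # the segment that most 0-valued neighbors belong to
    others = [labels[y + dy, x + dx]
              for (dy, dx), bit in zip(CLOCKWISE, b) if bit == '0']
    g_bar = max(set(others), key=others.count) if others else None
    if b in upper_table:
        return g_p, g_bar
    if b in lower_table:
        return g_bar, g_p
    return None   # no SAC term applied at this location
```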
4 Experiments
4.1 Dataset
We validate our approach on two datasets, PASCAL VOC 2010 (VOC10) [6] and the 3D Object Category dataset (CAR3D) [18]. VOC10 is a
hard dataset because the variations of the cars (e.g., appearance and shape) are very large. From VOC10, we choose
car images whose sizes are greater than 80 × 80. This ensures that the semantic parts are big enough for inference
and learning. Currently our method cannot handle occlusion, so we remove images where cars are occluded by
other objects or truncated by the image border. We augment the image set by flipping the cars in the horizontal
direction. This yields a dataset containing 508 cars. We then divide the images into seven viewpoints spanning 180◦ at 30◦ intervals.
Figure 8: Cumulative segmentation error distribution for the parts (body, windows, lights, license plates and wheels)
on (a) PASCAL VOC 2010 and (b) CAR3D. The x-axis is the average segmentation error normalized by image width,
and the y-axis is the fraction of test images. The red solid lines are the performance using SAC and the blue dashed
lines are the performance of [27].
CAR3D provides 960 non-occluded cars. We also divide them into seven viewpoints (instead
of using the original eight viewpoints). We collect 300 negative images by randomly sampling windows of the sizes of the
training images from non-car images of PASCAL VOC 2010. These 300 negative images are used for both
datasets. In our experiments, for each dataset, we randomly select half of the images as training data and test the
trained model on the other half.
4.2 Baseline
We compare our method with the model proposed by Zhu and Ramanan [27] on landmark localization and semantic
part segmentation. We simply use their code to localize landmarks and assume that the regions surrounded by certain
landmarks are the semantic parts. Note that we use the same landmark and part definitions for both the baseline
and our method.
4.3 Evaluation
We first evaluate our method on landmark localization. We normalize the localization error as Zhu and Ramanan
did in [27]. In this and the following experiments, we treat parts of the same category as a single part (e.g. the two
lights of a front-view car are treated as one part). Figure 7 shows the cumulative error distribution curves on both
datasets. Using SAC gives a large improvement in landmark localization for all semantic parts on VOC10, and
better or comparable performance on CAR3D. Images in CAR3D are relatively easier than those in VOC10, so SAC
brings a smaller performance gain there.
We then evaluate our method on semantic part segmentation. The segmentation error of a part is computed
as (1 − IoU), where IoU is the intersection of the detected and ground-truth segments over their union.
Figure 8 shows the cumulative error distribution curves on both datasets. Again, using SAC our method
improves performance on almost all parts (the improvement on lights and license plates is significant). However,
we obtain slightly worse results on wheels. The errors occur when SWA produces segments that cross the
boundary between a wheel and the nearby background at all levels; due to illumination and shading,
it is difficult to separate wheels from the background by appearance.
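For reference, the per-part segmentation error described above amounts to the following simple computation on binary masks (the function name is illustrative):

```python
import numpy as np

def part_segmentation_error(pred_mask, gt_mask):
    """Segmentation error 1 - IoU between predicted and ground-truth
    binary masks of one semantic part (both boolean numpy arrays)."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    iou = inter / union if union > 0 else 1.0
    return 1.0 - iou
```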
Figure 9 shows the visualization comparison, from which we can see that our method works better on part
boundaries, especially for lights and license plates. Figure 10 shows more segmentation results on VOC10 and
CAR3D.
Figure 9: Visualized comparison of our method with [27] on car part segmentation. In each pair of results, the
lower one is produced by our method.
5 Conclusion
In this paper, we address the novel task of car parsing, which includes obtaining the positions and the silhouettes
of the semantic parts (e.g., windows, lights and license plates). We propose a novel graphical model which
integrates SAC coupling terms between neighboring landmarks, using hidden variables to specify the
segmentation level for each part. This allows us to exploit the appearance similarity of segments within the same
part of the car. The experimental results on two datasets demonstrate the advantage of using segment appearance
cues. Currently, the model cannot handle large occlusion or truncation, which we leave as future work.
6 Acknowledgement
This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by
NSF STC award CCF-1231216.
Appendices
A Segment Pairs
The look-up table is used to choose two segments from those near a boundary point and assign s1 and s2 to them.
The first design criterion is that the assignment should be consistent, in two senses: moving a contour point within
its vicinity should not change its segment pair assignment; and for two nodes of the graphical model whose landmarks
belong to the same part, their segment pairs should have the same order (e.g. both s1's are assigned to segments
inside the part and both s2's to segments outside the part) across different images. This criterion guarantees
that learning the parameters of the SAC terms (w_{i,j}^A in Equation 3) is statistically meaningful. The second criterion
is that the look-up table should be able to identify locations with jagged edges, since at such locations it is very hard
to guarantee consistency.
A.1 Design of Look-Up Table
We use 3-by-3 binary matrices to index the local segmentation patterns around contour locations. Figure 11(a)
shows 70 of all 256 possible matrices (the center is fixed to one); the rest are obtained by rotating
these 70 prototypes. Not all of the 256 matrices are suitable for indexing: some correspond to jagged edges
and some will not occur on contours in practice. We pick 8 matrices from the first row in Figure 11(a)
and rotate them to generate the set of 32 matrices that compose the look-up table, as shown in Figure 11(b).
Figure 10: More segmentation results of our method on VOC10 (upper) and CAR3D (lower).
More formally, for each binary matrix m, we convert it into an 8-bit binary string bm by concatenating its
components in clockwise order starting from the upper left component. Then we use the matrices whose binary
strings satisfy the following constraints
    Σ_{i=1}^{8} |b_m(i) − b_m(mod(i, 8) + 1)| = 2        (7)

    1 < Σ_{i=1}^{8} (1 − b_m(i)) < 6        (8)
where b_m(i) is the i-th component of the binary string b_m. The first constraint says there should be exactly 2
jumps (i.e. “1” to “0” or “0” to “1”) in b_m. The second one requires b_m to contain enough “1”s. See Figure 12 for
some examples.
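As a quick sanity check of constraints (7) and (8), the following sketch enumerates all 8-bit ring strings (clockwise neighborhood encodings, with the center cell fixed to one) that satisfy them; exactly 32 strings pass, matching the size of the look-up table:

```python
def valid_strings():
    """Enumerate 8-bit neighborhood strings satisfying constraints (7)-(8):
    exactly two 0/1 transitions around the ring, and 1 < (#zeros) < 6."""
    keep = []
    for code in range(256):
        b = format(code, '08b')
        jumps = sum(b[i] != b[(i + 1) % 8] for i in range(8))
        zeros = b.count('0')
        if jumps == 2 and 1 < zeros < 6:
            keep.append(b)
    return keep

print(len(valid_strings()))   # prints 32, the number of look-up table entries
```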
A.2 Segment Pair Assignment on Contour Locations
For convenience, we first repeat the rule of assigning s1 and s2 to two of the segments around a boundary location.
At each boundary location p we construct a 3-by-3 binary matrix m according to the segmentation pattern of its
3-by-3 neighborhood: locations covered by the same segment which covers p are given value 1 and other locations
are given value 0. We denote the segment which p belongs to by gp , and denote by ḡp the segment which most
0-valued locations in m belong to. Then we search in the look-up table for the same binary matrix as m. If there
is a hit from the upper 16 matrices in figure 11(b), gp will be s1 and ḡp will be s2 ; if there is a hit from the lower
16 matrices, gp will be s2 of p and ḡp will be s1 of p; otherwise, we will not apply SAC terms to p in the score
function.
The left of Figure 13 shows how to compute the binary matrices for contour locations. The green rectangle marks
the 3-by-3 neighborhood with p in the center. The bronze segment is g_p and the cyan segment is ḡ_p. According to
the above rule, the bronze region is given value 1 and the rest is given value 0; we then get a hit in the look-up
table, which assigns s2 to the bronze segment and s1 to the cyan segment.
On the right of Figure 13, we show two segmentation patterns from two nearby locations p and p′. Although they
differ from each other, both assign s1 to the bronze segment and s2 to the violet segment.
In fact, in this example, all points along the segment boundaries receive the same assignment (i.e. bronze segment to
s1 and violet segment to s2). This shows the consistency of the assignment algorithm.
Figure 11: (a) 70 of all 256 3-by-3 binary matrices (black indicates “0” and white indicates “1”), with the
center fixed to one. The matrices in the red rectangle are used to generate the 32 binary matrices of the look-up table.
The matrices in the blue dashed rectangle are considered unsuitable for indexing. (b) The 32 binary matrices of the
look-up table, separated by a dashed line.
[Figure 12 graphic: the 3-by-3 cell numbering (1 2 3 / 8 · 4 / 7 6 5, clockwise from the upper left) and three example strings 01111111, 01011010 and 10001011.]
Figure 12: Illustration of how to convert a binary matrix to a binary string. On the left, the numbers in the cells
indicate the order of concatenation. On the right are three examples.
References
[1] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical
image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 33(5):898–916, 2011.
[2] Yihang Bo and Charless C. Fowlkes. Shape-based pedestrian parsing. In CVPR, pages 2265–2272, 2011.
[3] Xi Chen, Arpit Jain, Abhinav Gupta, and Larry S Davis. Piecing together the segmentation jigsaw using
context. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 2001–2008.
IEEE, 2011.
[4] David Eigen and Rob Fergus. Nonparametric image parsing using adaptive neighbor sets. In CVPR, pages
2799–2806, 2012.
[5] S. M. Ali Eslami and Chris Williams. A generative model for parts-based object segmentation. In NIPS, pages
100–107, 2012.
[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes
Challenge 2010 (VOC2010) Results. http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html.
Figure 13: An example of how the segment pair assignment rule works (left) and an illustration of its consistency (right).
[7] Clément Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Scene parsing with multiscale feature
learning, purity trees, and optimal covers. ICML, 2012.
[8] Pedro F. Felzenszwalb, Ross B. Girshick, David A. McAllester, and Deva Ramanan. Object detection with
discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1627–1645, 2010.
[9] Pedro F. Felzenszwalb, David A. McAllester, and Deva Ramanan. A discriminatively trained, multiscale,
deformable part model. In CVPR, 2008.
[10] Martin A Fischler and Robert A Elschlager. The representation and matching of pictorial structures. IEEE
Transactions on Computers, 100(1):67–92, 1973.
[11] Stephen Gould, Richard Fulton, and Daphne Koller. Decomposing a scene into geometric and semantically
consistent regions. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1–8. IEEE, 2009.
[12] Robert M Haralick, Karthikeyan Shanmugam, and Its’ Hak Dinstein. Textural features for image classification.
IEEE Transactions on Systems, Man and Cybernetics, (6):610–621, 1973.
[13] Mohsen Hejrati and Deva Ramanan. Analyzing 3d objects in cluttered images. In NIPS, pages 602–610, 2012.
[14] M Pawan Kumar and Daphne Koller. Efficiently selecting regions for scene understanding. In Computer Vision
and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3217–3224. IEEE, 2010.
[15] Victor S Lempitsky, Andrea Vedaldi, and Andrew Zisserman. Pylon model for semantic segmentation. In
NIPS, volume 24, pages 1485–1493, 2011.
[16] Ce Liu, Jenny Yuen, and Antonio Torralba. Nonparametric scene parsing via label transfer. IEEE Trans.
Pattern Anal. Mach. Intell., 33(12):2368–2382, 2011.
[17] Daniel Munoz, J. Andrew Bagnell, and Martial Hebert. Stacked hierarchical labeling. In ECCV (6), pages
57–70, 2010.
[18] Silvio Savarese and Fei-Fei Li. 3d generic object categorization, localization and pose estimation. In ICCV,
pages 1–8, 2007.
[19] Eitan Sharon, Meirav Galun, Dahlia Sharon, Ronen Basri, and Achi Brandt. Hierarchy and adaptivity in
segmenting visual scenes. Nature, 442(7104):719–846, June 2006.
[20] Jamie Shotton, John Winn, Carsten Rother, and Antonio Criminisi. Textonboost for image understanding:
Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. International
Journal of Computer Vision, 81(1):2–23, 2009.
[21] Alexander Thomas, Vittorio Ferrari, Bastian Leibe, Tinne Tuytelaars, and Luc J. Van Gool. Using recognition
to guide a robot’s attention. In Robotics: Science and Systems, 2008.
[22] Joseph Tighe and Svetlana Lazebnik. Finding things: Image parsing with regions and per-exemplar detectors.
In CVPR, pages 3001–3008. IEEE, 2013.
[23] Olga Veksler. Image segmentation by nested cuts. In Computer Vision and Pattern Recognition, 2000.
Proceedings. IEEE Conference on, volume 1, pages 339–344. IEEE, 2000.
[24] Alan L Yuille and Anand Rangarajan. The concave-convex procedure. Neural Computation, 15(4):915–936,
2003.
[25] Long Zhu, Yuanhao Chen, Yifei Lu, Chenxi Lin, and Alan Yuille. Max margin and/or graph learning for
parsing the human body. In CVPR, pages 1–8. IEEE, 2008.
[26] Long Zhu, Yuanhao Chen, and Alan Yuille. Learning a hierarchical deformable template for rapid deformable
object parsing. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(6):1029–1043, 2010.
[27] Xiangxin Zhu and Deva Ramanan. Face detection, pose estimation, and landmark localization in the wild. In
CVPR, pages 2879–2886, 2012.